A history of computer character encodings and their differences

Source: Internet
Author: User

GB2312:

A byte value below 128 keeps its original ASCII meaning, but two consecutive bytes that are both greater than 127 together represent one Chinese character.
In this way, about 7,000 characters can be encoded, of which 6,763 are simplified Chinese characters.
The scheme also covers mathematical symbols, Roman and Greek letters, and Japanese kana.
Even the digits, punctuation marks, and letters already present in ASCII are encoded again as two-byte codes. These are the so-called "fullwidth" characters,
while the original ones below 128 are called "halfwidth" characters.
This is why one Chinese character is counted as two English characters!
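
As a small illustration of the halfwidth/fullwidth distinction, the following Python sketch encodes three sample characters with Python's built-in "gb2312" codec and prints their bytes. The sample characters and variable names are illustrative choices, not part of the original text.

```python
# Encode a halfwidth letter, its fullwidth twin, and a Chinese character
# with Python's built-in "gb2312" codec and inspect the resulting bytes.
samples = {
    "halfwidth A": "A",        # U+0041, plain ASCII
    "fullwidth A": "\uff21",   # U+FF21, the two-byte "fullwidth" form
    "Chinese han": "\u6c49",   # U+6C49
}

encoded = {name: ch.encode("gb2312") for name, ch in samples.items()}

for name, raw in encoded.items():
    hex_bytes = " ".join(f"{b:02X}" for b in raw)
    print(f"{name}: {len(raw)} byte(s): {hex_bytes}")
```

The halfwidth letter stays a single byte below 128, while the fullwidth letter and the Chinese character each take two bytes above 127.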

GBK:

GBK no longer requires the second (low) byte to be greater than 127. As long as the first byte is greater than 127, it marks the start of a Chinese character,
regardless of whether the byte that follows falls in the extended range.
The expanded encoding scheme is called the GBK standard.
GBK includes everything in GB2312 and adds nearly 20,000 new Chinese characters (including traditional characters) and symbols.
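
The relaxed rule for the second byte can be observed directly. This sketch assumes Python's built-in "gbk" and "gb2312" codecs and uses U+4E02 as an illustrative character that GBK adds on top of GB2312.

```python
# U+4E02 is one of the characters GBK adds beyond GB2312 (illustrative choice).
ch = "\u4e02"

gbk_bytes = ch.encode("gbk")
lead, trail = gbk_bytes          # a two-byte code unpacks into two integers

try:
    ch.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False

print(f"GBK bytes: {lead:02X} {trail:02X}")
print("lead byte > 127:", lead > 127)
print("trail byte > 127:", trail > 127)
print("encodable in GB2312:", in_gb2312)
```

Only the lead byte is above 127 here; the trail byte falls in the ASCII range, which GB2312 never allowed.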

GB18030:

With several thousand characters for the scripts of ethnic minorities added, GBK was expanded into GB18030.

Chinese programmers found these Chinese character encoding standards serviceable, and refer to them collectively as "DBCS" (Double Byte Character Set).
The defining feature of the DBCS family is that
two-byte Chinese characters and one-byte English characters coexist in the same encoding scheme.
A program that supports Chinese text must therefore examine the value of each byte in a string:
if the value is greater than 127, a double-byte character is assumed to start there.
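
The byte-by-byte check described above can be sketched as follows. The function name count_dbcs_chars is hypothetical, and the sketch assumes well-formed GB2312/GBK input made only of one-byte and two-byte codes (it does not handle the four-byte codes of GB18030).

```python
def count_dbcs_chars(data: bytes) -> int:
    """Count characters in a DBCS byte string (GB2312/GBK style).

    A byte greater than 127 is treated as the lead byte of a two-byte
    character; any other byte is a one-byte character.
    """
    count = 0
    i = 0
    while i < len(data):
        i += 2 if data[i] > 127 else 1
        count += 1
    return count


text = "GBK\u6c49\u5b57"          # three ASCII letters + two Chinese characters
raw = text.encode("gbk")

print(len(raw))                   # byte length, what a C-style strlen would report
print(count_dbcs_chars(raw))      # character count after DBCS-aware scanning
```

The two numbers differ because each Chinese character occupies two bytes but counts as one character.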

Every country developed its own set of codes, so a single scheme was needed for uniform use across the world.

The International Organization for Standardization (ISO) decided to solve the problem.
It named its scheme the "Universal Multiple-Octet Coded Character Set", "UCS" for short, and it is commonly referred to as "Unicode".

UNICODE:
In its original form, the scheme uses two bytes, that is, 16 bits, to represent every character uniformly.
The "halfwidth" ASCII characters keep their original code values, but their length is extended from 8 bits to 16 bits,
and the characters of all other cultures and languages are re-encoded.
Because the "halfwidth" English characters need only the low 8 bits, their high 8 bits are always 0,
so this generous scheme doubles the space needed to store English text.

The strlen function that DBCS-era programs relied on is no longer reliable: a Chinese character is no longer equivalent to two characters, but one!
From Unicode onward, whether it is a halfwidth English letter or a fullwidth Chinese character,
each is uniformly "one character", and in this 16-bit form each is also uniformly "two bytes".
A "byte" is an 8-bit physical storage unit, while a "character" is a cultural symbol.
In 16-bit Unicode (UCS-2), one character is two bytes.
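
To make the "one character, two bytes" rule concrete, here is a minimal Python sketch. It assumes that Python's "utf-16-be" codec stands in for the 16-bit form described here, which holds for the two sample characters because both have codes below 0x10000.

```python
text = "A\u6c49"                       # one halfwidth letter, one Chinese character

utf16 = text.encode("utf-16-be")       # 16 bits per character, no signature bytes
ascii_part = "A".encode("utf-16-be")

print(len(text))                       # number of characters
print(len(utf16))                      # number of bytes
print(ascii_part)                      # the high byte of an ASCII letter is zero
```

Both characters count as one each, and both occupy two bytes; for the letter A the extra byte is simply 00.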

Starting with Windows NT, Microsoft took the opportunity to rework its operating system once more:
all of the core code was changed to work in Unicode mode.
From that point on, Windows no longer needed a separate local-language system for each language in order to display the characters of all the world's cultures.

However, Unicode was not designed to be compatible with the existing national encodings,
so GBK and Unicode arrange Chinese characters in completely different code orders.
There is no simple arithmetic that converts text between Unicode and such an encoding;
the conversion must be done with a lookup table.
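
A short sketch of what table-driven conversion looks like in practice. It assumes Python's built-in "gbk" codec, which carries the lookup table internally; the helper gbk_minus_unicode is a hypothetical name used only to show that no constant offset links the two code spaces.

```python
gbk_bytes = b"\xba\xba"                    # one Chinese character as stored in a GBK file
text = gbk_bytes.decode("gbk")             # table lookup: GBK code -> Unicode code
utf16_bytes = text.encode("utf-16-be")     # the Unicode code written as 16 bits


def gbk_minus_unicode(cp: int) -> int:
    """Difference between a character's GBK code and its Unicode code."""
    return int.from_bytes(chr(cp).encode("gbk"), "big") - cp


# Three neighbouring Unicode codes: if simple arithmetic worked,
# all three differences would be identical.
offsets = [gbk_minus_unicode(cp) for cp in (0x6C49, 0x6C4A, 0x6C4B)]

print(f"{ord(text):04X}", utf16_bytes.hex())
print(offsets)
```

The differences are not constant, which is why converters ship a full mapping table rather than a formula.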

With two bytes per character, Unicode can represent 65,536 different values in total,
which may already cover all the cultural symbols in the world. If that is not enough, it does not matter: ISO has prepared the UCS-4 scheme,
which, simply put, uses four bytes per character. That allows about 2.1 billion different characters (the highest bit is reserved for other purposes),
which should last until the day the Galactic Federation is founded!

Unicode arrived together with the rise of computer networks, so how to transmit Unicode over a network also had to be considered.
As a result, several UTF (UCS Transfer Format) standards for transmission emerged.
As the names suggest, UTF-8 transmits data in units of 8 bits, while UTF-16 uses units of 16 bits.
For reliable transmission, the mapping from Unicode to UTF is not a direct correspondence, but follows certain algorithms and rules.

A conversion rule from Unicode to UTF-8, as circulated on the Internet:

Unicode (hex)    UTF-8 (binary)
0000-007F        0xxxxxxx
0080-07FF        110xxxxx 10xxxxxx
0800-FFFF        1110xxxx 10xxxxxx 10xxxxxx

For example, the Unicode code of the Chinese character "汉" is 6C49. 6C49 lies in the range 0800-FFFF,
so the 3-byte template applies: 1110xxxx 10xxxxxx 10xxxxxx.
Write 6C49 in binary: 0110 1100 0100 1001.
Split this bit stream according to the three-byte template into 0110 110001 001001,
then substitute the groups for the x positions in order to get: 11100110 10110001 10001001,
that is, E6 B1 89, which is the UTF-8 encoding of the character.
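
The template substitution can be written out in a few lines of Python. This is a sketch limited to the three ranges listed in the rule; the function name utf8_encode_bmp is hypothetical, and the sketch does not reject the surrogate range D800-DFFF as a full encoder would.

```python
def utf8_encode_bmp(code: int) -> bytes:
    """Encode one code in the range 0000-FFFF using the three templates."""
    if code < 0 or code > 0xFFFF:
        raise ValueError("only codes 0000-FFFF are handled in this sketch")
    if code <= 0x7F:
        # 0xxxxxxx
        return bytes([code])
    if code <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code >> 6),
                      0b10000000 | (code & 0b111111)])
    # 1110xxxx 10xxxxxx 10xxxxxx
    return bytes([0b11100000 | (code >> 12),
                  0b10000000 | ((code >> 6) & 0b111111),
                  0b10000000 | (code & 0b111111)])


manual = utf8_encode_bmp(0x6C49)
builtin = chr(0x6C49).encode("utf-8")   # Python's own encoder, for comparison

print(manual.hex(), builtin.hex())
```

For code 6C49 the hand-rolled result matches Python's built-in encoder byte for byte.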

When a program opens a text file, the first thing it must do is determine which character set and encoding the text was saved in. Software generally uses three methods to do so:
check the file header signature, prompt the user to choose, or guess according to certain rules.
The most standard way is to inspect the first few bytes of the file, as shown in the following table:

Leading bytes    Charset/encoding
EF BB BF         UTF-8
FE FF            UTF-16/UCS-2, big endian
FF FE            UTF-16/UCS-2, little endian
FF FE 00 00      UTF-32/UCS-4, little endian
00 00 FE FF      UTF-32/UCS-4, big endian
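
Checking the leading bytes might look like the sketch below. It uses the signature constants from Python's standard codecs module, in which FF FE marks little endian and FE FF marks big endian. The function name sniff_encoding and the table name _SIGNATURES are hypothetical. A file without a signature yields no answer, and the FF FE 00 00 case is inherently ambiguous with a UTF-16 little endian file whose first character is U+0000.

```python
import codecs

# Longer signatures come first: the UTF-32 little endian mark begins with
# the same two bytes (FF FE) as the UTF-16 little endian mark.
_SIGNATURES = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]


def sniff_encoding(head: bytes):
    """Return (encoding name, signature length), or (None, 0) if nothing matches."""
    for signature, name in _SIGNATURES:
        if head.startswith(signature):
            return name, len(signature)
    return None, 0


print(sniff_encoding(b"\xef\xbb\xbfhello"))
print(sniff_encoding(b"hello"))
```

When no signature is found, a program has to fall back on the other two methods: asking the user or guessing.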

ANSI character set: the ASCII character set together with the character sets derived from and compatible with it,
such as GB2312. The formal name for these is MBCS (Multi-Byte Character Set),
but they are commonly known as ANSI character sets.

Big endian and little endian
Big endian and little endian are the two different orders in which CPUs store multi-byte numbers.
For example, the Unicode code of the Chinese character "汉" is 6C49. When it is written to a file,
should 6C come first, or 49?
Writing 6C first is big endian;
writing 49 first is little endian.
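
The two byte orders for code 6C49 can be reproduced with Python's "utf-16-be" and "utf-16-le" codecs. This is a minimal sketch with illustrative variable names.

```python
ch = "\u6c49"                       # the character with code 6C49

big = ch.encode("utf-16-be")        # high byte (6C) written first
little = ch.encode("utf-16-le")     # low byte (49) written first

print(big.hex(), little.hex())

# Reading the bytes back with the matching order recovers the same number.
from_big = int.from_bytes(big, "big")
from_little = int.from_bytes(little, "little")
print(f"{from_big:04X} {from_little:04X}")
```

Reading the bytes with the wrong order would give 496C instead, which is a different character altogether.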

From ASCII through GB2312 and GBK to GB18030, these encodings are backward compatible:
a character keeps the same encoding in every scheme that contains it, and each later standard supports more characters.
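
This compatibility can be spot-checked with Python's built-in codecs. The sketch tests only three sample characters (an illustrative selection), so it demonstrates the claim rather than proving it for every character.

```python
encodings = ("gb2312", "gbk", "gb18030")

# A character that all three standards contain.
han = "\u6c49"
han_results = {name: han.encode(name) for name in encodings}

# A character that first appears in GBK and is carried into GB18030.
ext = "\u4e02"
ext_results = {name: ext.encode(name) for name in ("gbk", "gb18030")}

# ASCII letters are unchanged everywhere.
ascii_same = all("A".encode(name) == b"A" for name in ("ascii",) + encodings)

print(han_results)
print(ext_results)
print(ascii_same)
```

Each character comes out with identical bytes in every standard that includes it.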

UTF-8 uses 8 bits as its code unit.

UTF-16 uses 16 bits as its code unit.
For a UCS code below 0x10000, the UTF-16 encoding is simply the 16-bit unsigned integer equal to that code.
For UCS codes of 0x10000 and above, a conversion algorithm is defined.
However, because the codes in actual use, whether in UCS-2 or in the BMP of UCS-4, are all below 0x10000,
UTF-16 and UCS-2 can for now be regarded as basically the same.
But UCS-2 is only a coding scheme, whereas UTF-16 is used for actual transmission,
so byte order has to be taken into account.
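
The text does not spell out the algorithm for codes of 0x10000 and above. The sketch below shows the surrogate-pair scheme that UTF-16 defines for them, with a hypothetical function name utf16_code_units and U+1F600 as an illustrative code outside the BMP.

```python
def utf16_code_units(code):
    """Return the list of 16-bit code units that represent one UCS code in UTF-16."""
    if not 0 <= code <= 0x10FFFF or 0xD800 <= code <= 0xDFFF:
        raise ValueError("not an encodable code")
    if code < 0x10000:
        return [code]                      # same value as the UCS code
    offset = code - 0x10000                # 20 bits remain
    high = 0xD800 | (offset >> 10)         # top 10 bits
    low = 0xDC00 | (offset & 0x3FF)        # bottom 10 bits
    return [high, low]


bmp_units = utf16_code_units(0x6C49)
pair_units = utf16_code_units(0x1F600)

print([f"{u:04X}" for u in bmp_units])
print([f"{u:04X}" for u in pair_units])
```

A code inside the BMP stays a single unit, while a code above it becomes two units, so "one character equals two bytes" stops holding for such characters.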

I found that TXT files saved as Unicode, Unicode big endian, and UTF-8 begin with a few extra bytes:
FF FE (Unicode), FE FF (Unicode big endian), and EF BB BF (UTF-8).
But what standard are these markers based on? They are the byte order mark (BOM), the character U+FEFF written in the respective encoding.

That is, when reading a file:
if it is Unicode, skip the first two bytes before reading. Note: the marker is always exactly two bytes, that is, one 16-bit word.
If it is UTF-8, skip the first three bytes.
If it is ANSI, there are no extra bytes at the front, so reading can start directly at the beginning of the file.
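
In Python, the skipping can be left to the codecs. The sketch below builds three in-memory "files" (illustrative byte strings, not real files) and decodes them with the "utf-8-sig", "utf-16", and "gbk" codecs. Treating "ANSI" as GBK is an assumption that matches a Simplified Chinese Windows system.

```python
han = "\u6c49"

utf8_file = b"\xef\xbb\xbf" + han.encode("utf-8")      # 3 marker bytes + data
utf16_file = b"\xff\xfe" + han.encode("utf-16-le")     # 2 marker bytes + data
ansi_file = han.encode("gbk")                          # no marker bytes at all

text_utf8 = utf8_file.decode("utf-8-sig")   # "utf-8-sig" drops the 3-byte marker
text_utf16 = utf16_file.decode("utf-16")    # "utf-16" consumes the marker to pick the byte order
text_ansi = ansi_file.decode("gbk")         # read from the very first byte

# Plain "utf-8" does not skip the marker; it shows up as U+FEFF in the text.
text_unskipped = utf8_file.decode("utf-8")

print(text_utf8 == text_utf16 == text_ansi)
print(repr(text_unskipped))
```

All three decode to the same character, while the plain "utf-8" codec leaves a stray U+FEFF at the front.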
