The history of computer character encoding can be divided into three stages.
Stage 1: ASCII character set and ASCII encoding
At the beginning, computers supported only English (Latin letters); other languages could not be stored or displayed. ASCII represents a character with the low seven bits of a byte, with the highest bit set to 0. Later, ASCII was extended to cover more common European characters, producing EASCII. EASCII uses all 8 bits of a byte, so it can represent more than 128 characters and supports some Western European characters.
Stage 2: ANSI encoding (localization)
To let computers support more languages, two bytes in the range 0x80 to 0xFF are usually used to represent one character. For example, on a Chinese operating system the character 中 is stored as the two bytes [0xD6, 0xD0].
Different countries and regions developed their own standards, such as GB2312, Big5, and JIS. These encodings, which extend ASCII and use two bytes to represent a single character, are collectively called ANSI encodings. On a Simplified Chinese system, ANSI encoding means GB2312; on a Japanese operating system, it means JIS.
Different ANSI encodings are incompatible with each other. When information is exchanged internationally, text in two languages cannot be stored in the same ANSI-encoded file.
Stage 3: Unicode
To facilitate international information exchange, international organizations developed the Unicode character set, which assigns a uniform and unique number to every character in every language, to meet the needs of cross-language and cross-platform text conversion and processing. Unicode has three common encoding forms: UTF-8 (1-byte code units), UTF-16 (2-byte code units), and UTF-32 (4-byte code units).
A tree chart can be used to show the character sets and encodings that branched out from ASCII.
Detailed explanation:
1. ASCII code
All information in a computer is ultimately stored as binary values. Each binary bit has two states, 0 and 1, so eight bits can be combined into 256 states; this group of eight bits is called a byte. In other words, a single byte can represent 256 different states, and if each state corresponds to one symbol, that is 256 symbols.
In the 1960s, the United States developed a character code that defines the relationship between English characters and binary values. It is called ASCII and is still in use today.
ASCII defines a total of 128 characters. For example, the space character (SPACE) is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 non-printable control characters, codes 0 to 31) occupy only the low seven bits of a byte; the highest bit is always 0.
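As a quick check of the values above, here is a minimal Python sketch (Python is assumed only for illustration; the text names no language). The helper name `ascii_bits` is hypothetical.

```python
def ascii_bits(ch: str) -> str:
    """Return the 8-bit binary string for a single ASCII character."""
    code = ord(ch)                  # numeric code of the character
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "08b")      # zero-padded to a full byte

for ch in (" ", "A"):
    print(repr(ch), ord(ch), ascii_bits(ch))
```

The leading bit of every result is 0, which is the unused highest bit described above.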
128 symbols are enough to encode English, but not other languages. In French, for example, letters can carry diacritical marks, which ASCII cannot represent. So some European countries decided to use the idle highest bit of the byte to encode new symbols. For example, é in French is encoded as 130 (binary 10000010). In this way, the encoding systems used by these European countries can represent up to 256 symbols.
However, this creates a new problem. Different countries have different letters, so even though they all use 256-symbol encodings, the same code stands for different letters. For example, 130 represents é in the French encoding, but represents the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, however, 0 to 127 represent the same symbols; only the range 128 to 255 differs.
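The clash described above can be reproduced with Python's built-in codecs. This sketch assumes that code pages 437, 862, and 866 stand in for the Western European, Hebrew, and Russian encodings; the text itself does not name specific code pages.

```python
raw = bytes([130])  # the single byte 0x82

# The same byte decoded under three legacy single-byte code pages.
for codec in ("cp437", "cp862", "cp866"):
    print(codec, raw.decode(codec))

# Bytes 0 to 127 decode identically in all three code pages.
shared = bytes(range(128))
same_low_half = (
    shared.decode("cp437") == shared.decode("cp862") == shared.decode("cp866")
)
print("0-127 identical:", same_low_half)
```

Only the interpretation changes; the stored byte is the same in every case.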
As for Asian languages, even more symbols are used; there are as many as about 100,000 Chinese characters. A single byte can represent only 256 symbols, which is clearly not enough, so multiple bytes must be used for one symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes for one Chinese character, so in theory it can represent at most 256 x 256 = 65536 symbols.
Chinese encoding deserves an article of its own and is not covered in this note. The only point made here is that although the GB family of encodings also uses multiple bytes per symbol, it has nothing to do with the Unicode and UTF-8 discussed below.
3. Unicode
As mentioned in the previous section, there are many encodings in the world, and the same binary value can be interpreted as different symbols. Therefore, to open a text file you must know its encoding; reading it with the wrong encoding produces garbled characters. Why do emails often show up garbled? Because the sender and the receiver use different encodings.
Now imagine a single encoding that includes every symbol in the world, with each symbol given a unique code. The garbling problem would then disappear. This is Unicode: as its name suggests, an encoding for all symbols.
Unicode is, of course, a very large set; it currently has room for more than 1 million symbols. Each symbol has a different code. For example, U+0041 represents the uppercase English letter A, and U+4E25 represents the Chinese character 严 (Yan). Specific symbol tables can be looked up at unicode.org, or in a dedicated table of Chinese characters.
Note that Unicode is only a symbol set. It specifies the binary code of each symbol, but not how that code should be stored.
For example, the Unicode code of the Chinese character 严 (Yan) is hexadecimal 4E25, which is 15 bits long in binary (100111000100101). In other words, representing this symbol takes at least two bytes. Other, larger symbols may take 3 or 4 bytes.
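The bit count above can be checked with a short Python sketch. The helper name `code_point_info` is hypothetical, and `"\u4e25"` is the character the text calls Yan.

```python
def code_point_info(ch: str):
    """Return (U+XXXX label, binary digits, bit count) for one character."""
    cp = ord(ch)                                  # Unicode code as an integer
    return f"U+{cp:04X}", format(cp, "b"), cp.bit_length()

for ch in ("A", "\u4e25"):
    label, bits, width = code_point_info(ch)
    # Ceiling division: smallest whole number of bytes that holds the bits.
    print(label, bits, f"{width} bits", f"{(width + 7) // 8} byte(s) minimum")
```

This only counts bits in the code itself; it says nothing about how a particular storage format lays them out.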
Two serious problems arise here. The first is how to distinguish Unicode from ASCII: how does a computer know that three bytes represent one symbol rather than three separate symbols? The second is that one byte is already enough for an English letter. If Unicode uniformly required three or four bytes per symbol, every English letter would have to be preceded by two or three bytes of 0. This would be a huge waste of storage, making text files two or three times larger, which is unacceptable.
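To make the storage cost concrete, the sketch below uses Python's UTF-32 codec as a stand-in for the fixed-width scheme imagined above. That choice is an assumption: the text speaks of three or four bytes per symbol, and UTF-32 always uses four.

```python
text = "Hello, world"                  # plain English sample

ascii_bytes = text.encode("ascii")     # one byte per symbol
fixed = text.encode("utf-32-be")       # four bytes per symbol, no byte order mark

print(len(ascii_bytes), "bytes as ASCII")
print(len(fixed), "bytes at four bytes per symbol")
print(fixed[:8].hex(" "))              # each letter is preceded by three zero bytes
```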
The result was that: 1) multiple storage methods for Unicode appeared, that is, many different binary formats can be used to represent Unicode; and 2) Unicode could not gain wide adoption for a long time, until the Internet emerged.
5. UTF-8
With the spread of the Internet, a unified encoding became strongly needed. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (characters are represented by two or four bytes) and UTF-32 (characters are represented by four bytes), but they are rarely used on the Internet. To repeat, the relationship is that UTF-8 is one of the implementations of Unicode.
The most notable feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, and the byte length varies with the symbol.
The UTF-8 encoding rules are simple; there are only two:
1) For a single-byte symbol, the first bit of the byte is set to 0, and the remaining 7 bits are the Unicode code of the symbol. For English letters, therefore, the UTF-8 encoding is the same as the ASCII code.
2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n + 1 is set to 0, and the first two bits of every following byte are set to 10. All remaining bits not mentioned hold the Unicode code of the symbol.
The following table summarizes the encoding rules; the letter x marks the bits available for encoding.
Unicode symbol range (hexadecimal) | UTF-8 encoding (binary)
-----------------------------------+-------------------------------------
0000 0000-0000 007F                | 0xxxxxxx
0000 0080-0000 07FF                | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF                | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF                | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
Given this table, interpreting UTF-8 is straightforward. If the first bit of a byte is 0, the byte is a character on its own. If the first bit is 1, the number of consecutive 1s at the start tells how many bytes the current character occupies.
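The interpretation rule can be sketched in Python as follows. The function names `utf8_sequence_length` and `split_characters` are hypothetical, and the sketch only inspects leading bytes; it does not validate the continuation bytes.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Number of bytes in the UTF-8 sequence that starts with first_byte."""
    if first_byte & 0b1000_0000 == 0:
        return 1                          # 0xxxxxxx: a character on its own
    ones = 0
    mask = 0b1000_0000
    while first_byte & mask:              # count the consecutive leading 1 bits
        ones += 1
        mask >>= 1
    if ones == 1 or ones > 4:             # 10xxxxxx is a continuation byte
        raise ValueError("not a valid UTF-8 leading byte")
    return ones

def split_characters(data: bytes) -> list:
    """Cut a UTF-8 byte string into one bytes object per character."""
    pieces, i = [], 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        pieces.append(data[i:i + n])
        i += n
    return pieces

print(split_characters("A\u4e25".encode("utf-8")))
```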
Below, the Chinese character 严 (Yan) is again used as an example to show how UTF-8 encoding is carried out.
The Unicode code of 严 is 4E25 (100111000100101). According to the table above, 4E25 falls within the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 requires three bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx. Then, starting from the last binary bit of 严, the bits are filled into the x positions of the format from back to front, and any leftover x positions are padded with 0. The result is that the UTF-8 encoding of 严 is 11100100 10111000 10100101, which is E4B8A5 in hexadecimal.
6. Conversion between Unicode and UTF-8
The example in the previous section shows that the Unicode code of 严 is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. Conversion between them can be done by a program.
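One possible program for this conversion is sketched below in Python. The function `encode_utf8` is a hypothetical helper that fills the bit patterns by hand; as a simplification it does not reject the surrogate range D800 to DFFF. In practice the built-in `str.encode("utf-8")` does the same job, and the sketch uses it as a cross-check.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code by filling the UTF-8 bit patterns by hand."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point <= 0x7F:
        return bytes([code_point])                    # 0xxxxxxx
    if code_point <= 0x7FF:
        n, lead = 2, 0b1100_0000                      # 110xxxxx
    elif code_point <= 0xFFFF:
        n, lead = 3, 0b1110_0000                      # 1110xxxx
    else:
        n, lead = 4, 0b1111_0000                      # 11110xxx
    tail = []
    for _ in range(n - 1):
        # Take the lowest six bits, back to front, into a 10xxxxxx byte.
        tail.append(0b1000_0000 | (code_point & 0b11_1111))
        code_point >>= 6
    first = lead | code_point                         # leftover high bits
    return bytes([first] + tail[::-1])

encoded = encode_utf8(0x4E25)
print(encoded.hex().upper())
print(encoded == chr(0x4E25).encode("utf-8"))         # compare with the built-in
print(f"{ord(encoded.decode('utf-8')):04X}")          # and convert back again
```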
On Windows, the simplest way to convert is to use the built-in Notepad applet, notepad.exe. After opening a file, click the Save As command in the File menu. A dialog box appears with an encoding setting that offers four options: ANSI, Unicode, Unicode big endian, and UTF-8.
1) ANSI is the default encoding. For English files it is ASCII; for Simplified Chinese files it is GB2312 (only on the Simplified Chinese version of Windows; the Traditional Chinese version uses Big5).
2) Unicode here means the UCS-2 encoding used by notepad.exe, which stores the Unicode code of each character directly in two bytes. This option uses the little endian format.
3) Unicode big endian corresponds to the previous option. The next section explains what little endian and big endian mean.
4) UTF-8 is the encoding described above.
After choosing the encoding, click Save and the file is converted immediately.
7. Little endian and big endian
Unicode codes can be stored directly in the UCS-2 format (for code points no greater than 0xFFFF). Take the Chinese character 严 (Yan) as an example. Its Unicode code is 4E25, which needs two bytes of storage: one byte is 4E and the other is 25. When stored, 4E first and 25 second is the big endian method; 25 first and 4E second is the little endian method.
These two odd names come from Gulliver's Travels by the English writer Jonathan Swift. In the book, a civil war breaks out in the country of the little people over whether an egg should be cracked at the big end (Big-Endian) or the little end (Little-Endian). Six wars were fought over this; one emperor lost his life and another lost his throne.
Storing the first byte first is the big endian method; storing the second byte first is the little endian method.
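The two byte orders can be seen directly in Python. This is a minimal sketch; using `int.to_bytes` and the BOM-less `utf-16-be` and `utf-16-le` codecs is an illustrative choice, not something the text prescribes.

```python
code_point = 0x4E25   # the character the text calls Yan

big = code_point.to_bytes(2, byteorder="big")         # first byte first
little = code_point.to_bytes(2, byteorder="little")   # second byte first
print(big.hex(" ").upper(), "|", little.hex(" ").upper())

# For this character, Python's UTF-16 codecs without a byte order mark
# produce the same two layouts.
print(chr(code_point).encode("utf-16-be") == big)
print(chr(code_point).encode("utf-16-le") == little)
```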
Naturally, a question arises: how does a computer know which byte order a file uses?
The Unicode standard defines a character that can be placed at the very beginning of a file to indicate the byte order. The character is called "zero width no-break space" and is represented by FEFF. This is exactly two bytes.
If the first two bytes of a text file are FE FF, the file uses the big endian method; if the first two bytes are FF FE, the file uses the little endian method.
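The check described above can be sketched in Python with the constants from the standard `codecs` module. The function name `detect_byte_order` is hypothetical, and the sketch only looks at the two-byte marks discussed here; it does not try to tell them apart from longer marks used by other encodings.

```python
import codecs

def detect_byte_order(data: bytes) -> str:
    """Guess the byte order of two-byte Unicode text from its first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):      # FE FF
        return "big endian"
    if data.startswith(codecs.BOM_UTF16_LE):      # FF FE
        return "little endian"
    return "unknown"                              # no mark at the start

print(detect_byte_order(bytes.fromhex("FEFF4E25")))
print(detect_byte_order(bytes.fromhex("FFFE254E")))
print(detect_byte_order(b"plain text"))
```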
Here is an example.
Open Notepad (notepad.exe) and create a text file whose content is the single character 严 (Yan). Save it in turn with the ANSI, Unicode, Unicode big endian, and UTF-8 encodings.
Then use the hex view of the text editor UltraEdit to inspect how the file is encoded internally.
1) ANSI: the file content is two bytes, D1 CF, which is exactly the GB2312 encoding of 严 (Yan). This also implies that GB2312 is stored in big endian order.
2) Unicode: the content is four bytes, FF FE 25 4E, where FF FE indicates little endian storage and the actual encoding is 25 4E.
3) Unicode big endian: the content is four bytes, FE FF 4E 25, where FE FF indicates big endian storage.
4) UTF-8: the content is six bytes, EF BB BF E4 B8 A5. The first three bytes, EF BB BF, indicate that this is UTF-8; the last three, E4 B8 A5, are the actual encoding of 严, whose storage order is the same as its encoding order.
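The four byte sequences can be reproduced without Notepad. The Python sketch below is an approximation under stated assumptions: the `gb2312` codec stands in for ANSI on a Simplified Chinese system, and `utf-8-sig` stands in for Notepad's UTF-8 option that writes the EF BB BF prefix.

```python
import codecs

ch = "\u4e25"  # the character the text calls Yan

files = {
    "ANSI": ch.encode("gb2312"),
    "Unicode": codecs.BOM_UTF16_LE + ch.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + ch.encode("utf-16-be"),
    "UTF-8": ch.encode("utf-8-sig"),   # prefixes EF BB BF
}

for name, data in files.items():
    print(f"{name}: {data.hex(' ').upper()}")
```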