Characters, bytes, and encoding

Source: Internet
Author: User
Tags: coding standards

Directory

  • Historical Development:
    • Stage 1: ASCII character set and ASCII encoding.
    • Stage 2: ANSI encoding (localization)
    • Stage 3: Unicode (International)
  • Detailed explanation:
    • 1. ASCII code
    • 2. Non-ASCII Encoding
    • 3. Unicode
    • 4. Unicode Problems
    • 5. UTF-8
    • 6. Conversion between Unicode and UTF-8
    • 7. little endian and big endian
    • 8. Instances

Historical Development:

From the perspective of the development history of computer character encoding, there are three phases:

Stage 1: ASCII character set and ASCII encoding.

At first, computers supported only English (Latin letters); other languages could not be stored or displayed. ASCII represents each character with the low seven bits of a byte, and the highest bit is always 0. Later, ASCII was extended to cover more of the characters common in European languages, which gave rise to extended ASCII (EASCII). EASCII uses all eight bits of a byte per character, so it can represent more than 128 characters and supports some Western European characters.

Stage 2: ANSI encoding (localization)

To let computers support more languages, two bytes in the range 0x80 to 0xFF are usually used to represent one character. For example, on a Chinese operating system the character 中 is stored as the two bytes [0xD6, 0xD0].
Different countries and regions developed their own standards, such as GB2312, Big5, and JIS. These encoding methods, which extend ASCII by using two bytes to represent a single character, are collectively called ANSI encodings. On a Simplified Chinese system, ANSI encoding means GB2312; on a Japanese operating system, ANSI encoding means JIS.
Different ANSI encodings are incompatible with one another. When information is exchanged internationally, text in two different languages cannot be stored in the same ANSI-encoded file.
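
The incompatibility can be seen with a minimal Python sketch. It assumes Python's built-in `gb2312` and `shift_jis` codecs as stand-ins for the Chinese and Japanese ANSI encodings; the variable names are only illustrative.

```python
# The two bytes that a Simplified Chinese system stores for the character "中".
raw = bytes([0xD6, 0xD0])

# Decoded with the encoding they were written in, the bytes give back "中".
as_gb2312 = raw.decode("gb2312")

# Decoded with a Japanese encoding, the same bytes turn into different text.
# errors="replace" keeps the sketch from raising if a byte is undefined there.
as_shift_jis = raw.decode("shift_jis", errors="replace")

print(as_gb2312)
print(as_shift_jis)
```

The bytes themselves carry no record of which ANSI encoding produced them, so the reader has to know it in advance.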

Stage 3: Unicode (International)

To make international information exchange easier, international organizations developed the Unicode character set, which assigns a single unique number to every character in every language, so that text can be converted and processed across languages and platforms. Unicode has three common encoding methods: UTF-8 (one to four bytes per character), UTF-16 (two or four bytes per character), and UTF-32 (four bytes per character).

(Figure omitted: a tree chart of the branches of character sets and encodings that developed from ASCII.)

Detailed explanation:

1. ASCII code

All information in a computer is ultimately stored as binary values. Each binary bit has two states, 0 and 1, so eight bits can be combined into 256 states; this group of eight bits is called a byte. In other words, a single byte can represent 256 different states, each corresponding to one symbol, for 256 symbols in total, from 00000000 to 11111111.

In the 1960s, the United States developed a character code that defines the relationship between English characters and binary values. It is called ASCII and is still in use today.

ASCII defines 128 characters in total. For example, the space character SPACE is 32 (binary 00100000), and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 control symbols that cannot be printed) occupy only the low seven bits of a byte; the highest bit is always set to 0.
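
As a quick check of the values above, here is a small Python sketch; the sample string is an arbitrary choice for illustration.

```python
# Numeric ASCII codes of the two characters mentioned in the text.
space_code = ord(" ")
a_code = ord("A")

# The same codes written as eight binary digits (one byte).
space_bits = format(space_code, "08b")
a_bits = format(a_code, "08b")

# Every byte of an ASCII-encoded string is below 128, so its highest bit is 0.
sample = "Hello, ASCII!".encode("ascii")
high_bit_always_zero = all(byte < 0x80 for byte in sample)

print(space_code, space_bits)
print(a_code, a_bits)
print(high_bit_always_zero)
```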

2. Non-ASCII Encoding

128 symbols are enough to encode English, but not enough for other languages. In French, for example, letters can carry accent marks, and such letters cannot be represented in ASCII. As a result, some European countries decided to use the idle highest bit of the byte to encode new symbols. For example, é in French is encoded as 130 (binary 10000010). In this way, the encoding systems used in these European countries can represent up to 256 symbols.

However, this creates a new problem. Different countries use different letters, so even though they all use 256-symbol encodings, the same value stands for different letters. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, the symbols for 0 to 127 are the same; only the range 128 to 255 differs.
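
The effect can be reproduced with a short Python sketch. The text does not name the specific encodings, so this assumes the DOS code pages `cp437`, `cp862`, and `cp866` as representative Western European, Hebrew, and Russian encodings.

```python
# One and the same byte value, 130, read under three single-byte encodings.
byte_130 = bytes([130])

western = byte_130.decode("cp437")   # assumed Western European code page
hebrew = byte_130.decode("cp862")    # assumed Hebrew code page
russian = byte_130.decode("cp866")   # assumed Russian code page

# Values 0 to 127 decode identically in all three, because they are plain ASCII.
byte_65 = bytes([65])
shared = {byte_65.decode(name) for name in ("cp437", "cp862", "cp866")}

print(western, hebrew, russian)
print(shared)
```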

Asian languages use far more characters; Chinese alone has as many as 100,000 or so. A single byte can represent only 256 symbols, which is nowhere near enough, so multiple bytes must be used to represent one symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes to represent one Chinese character, so in theory it can represent at most 256 x 256 = 65536 symbols.

Chinese encoding is a subject that deserves its own article, and this note does not cover it. The only point made here is that, although the GB family of Chinese encodings also uses multiple bytes per symbol, it has nothing to do with the Unicode and UTF-8 discussed below.

3. Unicode

As mentioned in the previous section, there are many encoding methods in the world, and the same binary number can be interpreted as different symbols. Therefore, to open a text file you must know its encoding; if you read it with the wrong encoding, you get garbled characters. Why do emails often show up garbled? Because the sender and the receiver use different encodings.

Now imagine a single encoding that includes every symbol in the world and gives each one a unique code. The garbling problem would disappear. That is Unicode, and as its name suggests, it is an encoding for all symbols.

Unicode is, of course, a very large collection; it currently has room for more than one million symbols. Each symbol has its own code: for example, U+0639 is the Arabic letter Ain, U+0041 is the uppercase English letter A, and U+4E25 is the Chinese character 严 (Yan). You can look up specific symbols in the code charts at unicode.org or in a dedicated Chinese character table.
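
The three code points above can be looked up with Python's standard `unicodedata` module. This is only an illustrative sketch; the dictionary name is made up for the example.

```python
import unicodedata

# The three symbols mentioned in the text.
symbols = ["\u0639", "A", "严"]

# Map each symbol to its code point in the usual U+XXXX notation.
code_points = {ch: f"U+{ord(ch):04X}" for ch in symbols}

for ch in symbols:
    # unicodedata.name returns the official Unicode character name.
    print(code_points[ch], unicodedata.name(ch))
```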

4. Unicode Problems

Note that Unicode is only a collection of symbols. It specifies the binary code of each symbol, but it does not specify how that binary code should be stored.

For example, the Unicode code of the Chinese character 严 (Yan) is hexadecimal 4E25, which is a full 15 bits in binary (100111000100101). So representing this symbol takes at least two bytes. Symbols with larger codes may take three or four bytes, or even more.
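
The bit count can be verified with a few lines of Python; this is just the arithmetic from the paragraph above written out.

```python
code_point = 0x4E25  # Unicode code of the character 严

# Binary digits of the code point, without leading zeros.
bits = format(code_point, "b")

# Number of significant bits, and the whole bytes needed to hold them.
bit_count = code_point.bit_length()
min_bytes = (bit_count + 7) // 8

print(bits, bit_count, min_bytes)
```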

This raises two serious problems. First, how can Unicode be distinguished from ASCII? How does a computer know that three bytes represent one symbol rather than three separate symbols? Second, we already know that one byte is enough for an English letter. If Unicode required every symbol to be stored in three or four bytes, each English letter would have to be preceded by two or three bytes of zeros. That would be a huge waste of storage: text files would become two or three times larger, which is unacceptable.
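
To make the storage cost concrete, here is a small Python sketch. It uses UTF-32 as a stand-in for a hypothetical fixed four-bytes-per-symbol scheme; the sample text is arbitrary.

```python
text = "Hello"

# One byte per letter in ASCII.
ascii_bytes = text.encode("ascii")

# Four bytes per letter in a fixed-width scheme (UTF-32, big endian, no BOM).
fixed4_bytes = text.encode("utf-32-be")

print(len(ascii_bytes), len(fixed4_bytes))
# Each English letter is preceded by three zero bytes.
print(fixed4_bytes[:4])
```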

The results were: 1) multiple storage methods for Unicode appeared, that is, many different binary formats can be used to represent Unicode; 2) Unicode could not be widely adopted for a long time, until the Internet emerged.

5. UTF-8

As the Internet became popular, a unified encoding method was strongly needed. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (characters are represented by two or four bytes) and UTF-32 (characters are represented by four bytes), but they are rarely used on the Internet. To repeat the relationship: UTF-8 is one of the ways of implementing Unicode.

The most important feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, and the number of bytes varies from symbol to symbol.

The UTF-8 encoding rules are simple; there are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits are the Unicode code of the symbol. Therefore, for English letters, the UTF-8 encoding and the ASCII code are identical.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, the (n + 1)th bit is set to 0, and the first two bits of each following byte are set to 10. All the remaining bits not mentioned here hold the Unicode code of the symbol.

The following table summarizes the encoding rules. The letter x marks the bits available for encoding.

Unicode symbol range (hexadecimal) | UTF-8 encoding (binary)
-----------------------------------+------------------------------------
0000 0000 - 0000 007F              | 0xxxxxxx
0000 0080 - 0000 07FF              | 110xxxxx 10xxxxxx
0000 0800 - 0000 FFFF              | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000 - 0010 FFFF              | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

With this table, interpreting UTF-8 is straightforward. If the first bit of a byte is 0, the byte is a complete character by itself. If the first bit is 1, the number of consecutive 1s at the start of the byte gives the number of bytes the current character occupies.
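
The interpretation rule can be sketched in Python as follows. The function name `utf8_sequence_length` is made up for this illustration, and the sketch only inspects the leading byte; it does not validate the continuation bytes.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes the character starting with first_byte occupies."""
    if (first_byte & 0b10000000) == 0:
        return 1  # first bit is 0: a single-byte character
    # Count the consecutive 1 bits at the start of the byte.
    ones = 0
    mask = 0b10000000
    while first_byte & mask:
        ones += 1
        mask >>= 1
    # One leading 1 means a continuation byte (10xxxxxx), not a leading byte.
    if ones == 1 or ones > 4:
        raise ValueError("not a valid UTF-8 leading byte")
    return ones


# Walk through a UTF-8 byte string one character at a time.
data = "A严".encode("utf-8")
lengths = []
i = 0
while i < len(data):
    n = utf8_sequence_length(data[i])
    lengths.append(n)
    i += n

print(lengths)
```

Because the length is readable from the first byte alone, a decoder never has to guess whether three bytes are one symbol or three.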

Below, the Chinese character 严 is again used as an example to show how UTF-8 encoding works.

The Unicode code of 严 is 4E25 (100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800 - 0000 FFFF), so the UTF-8 encoding of 严 requires three bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx. Then, starting from the last binary bit of 严, fill in the x positions of the format from back to front, and pad any leftover positions with 0. The resulting UTF-8 encoding of 严 is 11100100 10111000 10100101, which is E4B8A5 in hexadecimal.
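
The procedure above can be written out as a minimal Python sketch. The function `utf8_encode` is a hypothetical helper for illustration, not a replacement for the built-in `str.encode`; in particular it does not reject surrogate code points.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by following the table row by row."""
    if code_point < 0:
        raise ValueError("code point must not be negative")
    if code_point < 0x80:        # row 1: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:       # row 2: 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point < 0x10000:     # row 3: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0x10FFFF:   # row 4: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11110000 | (code_point >> 18),
            0b10000000 | ((code_point >> 12) & 0b111111),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    raise ValueError("code point is outside the Unicode range")


encoded = utf8_encode(0x4E25)
print(encoded.hex().upper())
```

Comparing the result with `"严".encode("utf-8")` is an easy way to check the hand-written version against Python's own encoder.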

6. Conversion between Unicode and UTF-8

The example in the previous section shows that the Unicode code of 严 is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. Converting between them can be done by a program.

On Windows, the simplest way to convert is to use the built-in Notepad program, notepad.exe. After opening a file, choose the Save As command in the File menu. A dialog box appears with an Encoding drop-down list.

It offers four options: ANSI, Unicode, Unicode big endian, and UTF-8.

1) ANSI is the default encoding. For English files it is ASCII; for Simplified Chinese files it is GB2312 (this applies only to the Simplified Chinese version of Windows; the Traditional Chinese version uses Big5).

2) Unicode refers to the UCS-2 encoding used by notepad.exe, which stores the Unicode code of each character directly in two bytes. This option uses the little endian format.

3) Unicode big endian corresponds to the previous option, but with the opposite byte order. The next section explains what little endian and big endian mean.

4) UTF-8 is the encoding described in the previous section.

After choosing an encoding, click Save, and the file is converted to that encoding immediately.
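
The same conversion can also be done by a program, as the start of this section notes. A minimal Python sketch follows; `convert_encoding` is a made-up helper name, and the codec names are Python's, not Notepad's.

```python
def convert_encoding(data: bytes, source: str, target: str) -> bytes:
    """Re-encode bytes by decoding them to text and encoding the text again."""
    return data.decode(source).encode(target)


# The character 严 as stored in GB2312 (the Simplified Chinese ANSI encoding).
gb_bytes = bytes([0xD1, 0xCF])

utf8_bytes = convert_encoding(gb_bytes, "gb2312", "utf-8")
round_trip = convert_encoding(utf8_bytes, "utf-8", "gb2312")

print(utf8_bytes.hex().upper())
print(round_trip.hex().upper())
```

The conversion always goes through Unicode text in the middle, which is why the source encoding has to be known.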

7. little endian and big endian

Unicode codes can be stored directly in the UCS-2 format (as long as they do not exceed 0xFFFF). Take the Chinese character 严 as an example. Its Unicode code is 4E25, which takes two bytes to store: one byte is 4E and the other is 25. When stored, 4E first and 25 second is the big endian method; 25 first and 4E second is the little endian method.

These two odd names come from Gulliver's Travels by Jonathan Swift. In the book, a civil war breaks out in Lilliput over whether eggs should be cracked open at the big end (Big-Endian) or at the little end (Little-Endian). Six wars were fought over the matter; one emperor lost his life and another lost his throne.

Accordingly, storing the first byte first is the big endian method, and storing the second byte first is the little endian method.
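
The two byte orders can be shown with Python's `int.to_bytes`; this is a small sketch, with the `utf-16-be` and `utf-16-le` codecs used to produce the same two-byte layouts from the character itself.

```python
code_point = 0x4E25  # Unicode code of 严

# 4E first, 25 second.
big = code_point.to_bytes(2, "big")

# 25 first, 4E second.
little = code_point.to_bytes(2, "little")

print(big.hex().upper(), little.hex().upper())

# For this character, the two-byte codecs produce the same byte sequences.
same_big = "严".encode("utf-16-be") == big
same_little = "严".encode("utf-16-le") == little
print(same_big, same_little)
```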

Naturally, a question arises: how does a computer know which byte order a file uses?

The Unicode standard defines a character that can be placed at the very beginning of a file to indicate the byte order. This character is called ZERO WIDTH NO-BREAK SPACE and is represented by FEFF. It is exactly two bytes long, and FF is greater than FE by 1.

If the first two bytes of a text file are FE FF, the file uses the big endian method; if the first two bytes are FF FE, the file uses the little endian method.
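
A program can apply this check directly. The sketch below uses the byte order mark constants from Python's standard `codecs` module; `detect_byte_order` is a hypothetical helper name.

```python
import codecs


def detect_byte_order(data: bytes) -> str:
    """Guess the byte order of two-byte Unicode text from its first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little endian"
    return "unknown"


print(detect_byte_order(bytes([0xFE, 0xFF, 0x4E, 0x25])))
print(detect_byte_order(bytes([0xFF, 0xFE, 0x25, 0x4E])))
```

A file without the mark gives no answer, so the sketch returns "unknown" rather than guessing.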

8. Instances

Here is an example.

Open the Notepad program notepad.exe, create a text file whose content is the single character 严, and save it in turn with the ANSI, Unicode, Unicode big endian, and UTF-8 encodings.

Then use the hex mode of the text editor UltraEdit to inspect how the file is encoded internally.

1) ANSI: the file content is the two bytes D1 CF, which is exactly the GB2312 encoding of 严. This also implies that GB2312 is stored in big endian order.

2) Unicode: the content is the four bytes FF FE 25 4E, where FF FE indicates little endian storage and the actual code is 4E25.

3) Unicode big endian: the content is the four bytes FE FF 4E 25, where FE FF indicates big endian storage.

4) UTF-8: the content is the six bytes EF BB BF E4 B8 A5. The first three bytes, EF BB BF, indicate that the file is UTF-8 encoded; the last three, E4 B8 A5, are the actual encoding of 严, stored in the same order as the encoding itself.
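
The four byte sequences can be reproduced without Notepad. The sketch below assumes Python's `gb2312`, `utf-16-le`, `utf-16-be`, and `utf-8-sig` codecs correspond to the four Notepad options; `hex_dump` is a made-up helper for display.

```python
import codecs


def hex_dump(data: bytes) -> str:
    """Format bytes the way a hex editor shows them, e.g. 'D1 CF'."""
    return " ".join(f"{byte:02X}" for byte in data)


char = "严"

ansi = char.encode("gb2312")
unicode_le = codecs.BOM_UTF16_LE + char.encode("utf-16-le")
unicode_be = codecs.BOM_UTF16_BE + char.encode("utf-16-be")
utf8 = char.encode("utf-8-sig")  # utf-8-sig prepends the EF BB BF signature

for label, data in [
    ("ANSI", ansi),
    ("Unicode", unicode_le),
    ("Unicode big endian", unicode_be),
    ("UTF-8", utf8),
]:
    print(f"{label}: {hex_dump(data)}")
```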

