Character encoding notes: ASCII, Unicode, and UTF-8

Source: Internet
Author: User

1. ASCII code

We know that inside a computer, all information is ultimately represented as a string of binary digits. Each binary digit (bit) has two states, 0 and 1, so eight bits can be combined into 256 states; such a group of eight bits is called a byte. In other words, one byte can represent 256 different states, each corresponding to one symbol, for 256 symbols in total, from 00000000 to 11111111.

In the 1960s, the United States developed a character encoding that standardized the relationship between English characters and bit patterns. This is known as ASCII, and it is still in use today.

ASCII defines codes for 128 characters in total. For example, the space character is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 33 non-printable control characters, codes 0-31 and 127) occupy only the low 7 bits of a byte; the leading bit is always 0.
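The two codes quoted above can be checked with a short Python sketch. It uses only the built-ins `ord` and `format`; the variable names are chosen here for illustration.

```python
# ASCII maps each character to a number from 0 to 127.
space_code = ord(" ")   # 32
a_code = ord("A")       # 65

# Render each code as an 8-bit binary string; the leading bit is always 0.
space_bits = format(space_code, "08b")   # '00100000'
a_bits = format(a_code, "08b")           # '01000001'

print(space_code, space_bits)
print(a_code, a_bits)

# Every byte of an ASCII-encoded string is below 128 (0x80),
# so only the low 7 bits are ever used.
all_ascii_fit = all(b < 0x80 for b in "Hello, world!".encode("ascii"))
print(all_ascii_fit)
```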

2. Non-ASCII encodings

128 symbols are enough to encode English, but not enough to represent other languages. For example, French places diacritical marks above some letters, which ASCII cannot represent. As a result, some European countries decided to use the unused highest bit of the byte to encode new symbols. For example, é in French is encoded as 130 (binary 10000010). In this way, the encodings used in these European countries can represent up to 256 symbols.

But this created a new problem. Different countries have different letters, so even though they all use 256-symbol encodings, the same code represents different letters. For example, 130 represents é in the French encoding, but the letter Gimel (ג) in the Hebrew encoding and yet another symbol in the Russian encoding. In all of these encodings, however, 0-127 represent the same symbols; only 128-255 differ.
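The text does not name the specific encodings involved. As a rough illustration, the sketch below assumes the legacy DOS code pages cp437, cp862, and cp866 as stand-ins for the "French", "Hebrew", and "Russian" encodings, and decodes the same byte 130 under each of them.

```python
raw = bytes([130])  # the single byte 0x82 (binary 10000010)

# Assumption: cp437, cp862 and cp866 stand in for the three encodings
# mentioned in the text; the source does not say which ones it means.
as_cp437 = raw.decode("cp437")   # e with acute accent
as_cp862 = raw.decode("cp862")   # Hebrew letter Gimel
as_cp866 = raw.decode("cp866")   # a Cyrillic letter

# Print code points instead of glyphs so the output does not depend
# on the terminal's own encoding.
for name, ch in [("cp437", as_cp437), ("cp862", as_cp862), ("cp866", as_cp866)]:
    print(name, f"U+{ord(ch):04X}")

# Bytes 0-127 decode to the same characters in all three code pages.
same_low_half = all(
    bytes([b]).decode("cp437")
    == bytes([b]).decode("cp862")
    == bytes([b]).decode("cp866")
    for b in range(128)
)
print(same_low_half)
```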

Asian languages use even more symbols; Chinese alone has as many as 100,000 characters. One byte can represent only 256 symbols, which is certainly not enough, so more than one byte must be used per symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per Chinese character, so in theory it can represent up to 256 x 256 = 65,536 symbols.

Chinese encoding deserves a detailed discussion of its own and is not covered in this note. The only point made here is that, although the GB family of encodings also uses multiple bytes per symbol, it is unrelated to the Unicode and UTF-8 discussed later in this note.

3. Unicode

As the previous section showed, there are many encodings in the world, and the same binary number can be interpreted as different symbols. Therefore, to open a text file you must know how it is encoded; if it is interpreted with the wrong encoding, garbled text appears. Why does email so often appear garbled? Because the sender and the recipient use different encodings.

Imagine an encoding that includes every symbol in the world, with each symbol given a unique code; the garbling problem would then disappear. This is Unicode: as its name suggests, it is an encoding for all symbols.

Unicode is, of course, a very large set; it currently has room for more than one million symbols. Each symbol has a different code: for example, U+0639 represents the Arabic letter Ain, U+0041 the English capital letter A, and U+4E25 the Chinese character 严 ("Yan", meaning "strict"). For the full table of symbols, consult unicode.org or a dedicated Chinese character table.
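As a quick check of the three codes above, the following sketch looks up each code point with Python's standard `unicodedata` module. The dictionary name `names` is chosen here for illustration.

```python
import unicodedata

code_points = (0x0639, 0x0041, 0x4E25)

# Map each code point to its official Unicode character name.
names = {cp: unicodedata.name(chr(cp)) for cp in code_points}

for cp, name in names.items():
    print(f"U+{cp:04X}", name)

# chr() and ord() convert between a code point and its character.
round_trip_ok = all(ord(chr(cp)) == cp for cp in code_points)
```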

4. The problem with Unicode

Note that Unicode is only a symbol set: it specifies the binary code of each symbol, but not how that binary code should be stored.

For example, the Unicode code of the Chinese character 严 ("Yan") is hexadecimal 4E25, which in binary is a full 15 bits (100111000100101); in other words, representing this symbol takes at least 2 bytes. Symbols with larger codes may take 3 or 4 bytes, or even more.
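The bit count quoted above can be reproduced in a few lines. This is only a sketch of the arithmetic, with variable names chosen here for illustration.

```python
code_point = 0x4E25

bits = bin(code_point)[2:]            # binary digits without the '0b' prefix
bit_count = code_point.bit_length()   # number of significant bits
min_bytes = (bit_count + 7) // 8      # round up to whole bytes

print(bits)        # 100111000100101
print(bit_count)   # 15
print(min_bytes)   # 2
```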

There are two serious problems here. The first is how to distinguish Unicode from ASCII: how does a computer know that three bytes represent one symbol rather than three separate symbols? The second is that one byte is already enough for an English letter; if Unicode uniformly required three or four bytes per symbol, every English letter would have to be preceded by two or three bytes of zeros. That would be a great waste of storage, with text files growing to three or four times their size, which is unacceptable.
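To make the storage cost concrete, the sketch below uses Python's `utf-32-be` codec as a stand-in for a hypothetical fixed four-bytes-per-symbol scheme (an assumption made for illustration; the text does not name a specific format at this point) and compares it with plain ASCII.

```python
text = "hello"

ascii_bytes = text.encode("ascii")       # 1 byte per letter
fixed_width = text.encode("utf-32-be")   # 4 bytes per letter, no byte order mark

print(len(ascii_bytes), len(fixed_width))   # 5 20

# Each English letter is preceded by three zero bytes.
first_symbol = fixed_width[:4]
print(first_symbol.hex())                   # 00000068
```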

The result was: 1) several ways of storing Unicode emerged, that is, there are many different binary formats that can represent Unicode; 2) Unicode could not become widespread for a long time, until the advent of the Internet.

5. UTF-8

The spread of the Internet created a strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet. To repeat, the relationship is that UTF-8 is one of the ways Unicode is implemented.

The most important feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, with the byte length varying from symbol to symbol.
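The variable length is easy to observe with Python's built-in `utf-8` codec. The four sample characters below are chosen here for illustration, one for each possible length.

```python
# One sample character per UTF-8 length (written as escapes to keep
# the source ASCII-only).
samples = {
    "U+0041": "\u0041",        # Latin capital letter A
    "U+00E9": "\u00e9",        # Latin small letter e with acute
    "U+4E25": "\u4e25",        # a Chinese character
    "U+1F600": "\U0001F600",   # an emoji outside the 16-bit range
}

# Number of bytes each character occupies once encoded as UTF-8.
lengths = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}

for label, n in lengths.items():
    print(label, n)
```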

UTF-8's encoding rules are simple; there are only two:

1) For single-byte symbols, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code. So for English letters, the UTF-8 encoding and the ASCII code are identical.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of each following byte are set to 10. The remaining bits, those not mentioned so far, are filled with the symbol's Unicode code.

The following table summarizes the encoding rules; the letter x marks the bits available for the symbol's code.

Unicode symbol range | UTF-8 encoding
(hexadecimal)        | (binary)
--------------------+---------------------------------------------
0000 0000-0000 007F | 0xxxxxxx
0000 0080-0000 07FF | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
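Read from the decoder's side, the table also shows how a program can tell how many bytes belong to one symbol: the leading bits of the first byte say so. The following is a minimal sketch of that check; the function name `utf8_sequence_length` is hypothetical, and the sketch does not validate the continuation bytes.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judged by its first byte."""
    if (first_byte & 0b10000000) == 0b00000000:   # 0xxxxxxx
        return 1
    if (first_byte & 0b11100000) == 0b11000000:   # 110xxxxx
        return 2
    if (first_byte & 0b11110000) == 0b11100000:   # 1110xxxx
        return 3
    if (first_byte & 0b11111000) == 0b11110000:   # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte; anything else is not valid UTF-8.
    raise ValueError(f"0x{first_byte:02X} cannot start a UTF-8 sequence")


for b in (0x41, 0xC3, 0xE4, 0xF0):
    print(f"0x{b:02X}", utf8_sequence_length(b))
```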

Below, the Chinese character 严 ("Yan") is again used as an example to demonstrate UTF-8 encoding.

The Unicode code of 严 ("Yan") is known to be 4E25 (100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 takes three bytes; that is, the format is "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last bit of 严, fill in the x positions from back to front, and pad the leftover positions with 0. This gives the UTF-8 encoding of 严 as "11100100 10111000 10100101", which in hexadecimal is E4B8A5.
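The fill-in procedure can be written out with bit operations. The function below is a teaching sketch of the table's rules, not a replacement for a real codec: the name `encode_utf8` is hypothetical, and it does not reject surrogate code points as a strict encoder would.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one Unicode code point by following the table row by row."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point <= 0x7F:        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([
            0b11000000 | (code_point >> 6),
            0b10000000 | (code_point & 0b111111),
        ])
    if code_point <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0b11100000 | (code_point >> 12),
            0b10000000 | ((code_point >> 6) & 0b111111),
            0b10000000 | (code_point & 0b111111),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0b11110000 | (code_point >> 18),
        0b10000000 | ((code_point >> 12) & 0b111111),
        0b10000000 | ((code_point >> 6) & 0b111111),
        0b10000000 | (code_point & 0b111111),
    ])


encoded = encode_utf8(0x4E25)
encoded_bits = " ".join(format(b, "08b") for b in encoded)
print(encoded_bits)            # 11100100 10111000 10100101
print(encoded.hex().upper())   # E4B8A5
```

For code points outside the surrogate range, the result can be compared with Python's built-in codec, for example `encode_utf8(0x4E25) == chr(0x4E25).encode("utf-8")`.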

6. Conversion between Unicode and UTF-8

The example in the previous section shows that the Unicode code of 严 ("Yan") is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. Conversion between them can be done by a program.
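As one possible way to do the conversion in code, the sketch below uses Python's built-in `bytes.decode`, `ord`, `chr`, and `str.encode`; other languages offer similar facilities.

```python
utf8_bytes = bytes.fromhex("E4B8A5")

# UTF-8 bytes -> character -> Unicode code point
char = utf8_bytes.decode("utf-8")
code_point = ord(char)

# Unicode code point -> character -> UTF-8 bytes
round_trip = chr(code_point).encode("utf-8")

print(f"U+{code_point:04X}")        # U+4E25
print(round_trip.hex().upper())     # E4B8A5
```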

On Windows, one of the simplest ways to convert is to use the built-in Notepad program (Notepad.exe). After opening a file, click the Save As command on the File menu; a dialog box appears with an "Encoding" dropdown at the bottom.

There are four options: ANSI, Unicode, Unicode big endian, and UTF-8.

1) ANSI is the default encoding. For English files it is ASCII; for Simplified Chinese files it is GB2312 (only on the Simplified Chinese version of Windows; the Traditional Chinese version uses Big5).

2) Unicode refers to the UCS-2 encoding, in which the Unicode code of each character is stored directly in two bytes. This option uses the little endian format.

3) Unicode big endian corresponds to the previous option, but in big endian format. The next section explains the meaning of little endian and big endian.

4) UTF-8 is the encoding discussed in the previous section.

After selecting the encoding, click the Save button, and the file's encoding is converted immediately.

7. Little endian and big endian

As mentioned in the previous section, Unicode codes can be stored directly in UCS-2 format. Take the Chinese character 严 ("Yan") as an example: its Unicode code is 4E25, which needs two bytes of storage, one byte being 4E and the other 25. Storing 4E first and 25 second is the big endian way; storing 25 first and 4E second is the little endian way.
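The two byte orders can be seen directly with Python's `int.to_bytes`. For a character in this 16-bit range, the built-in `utf-16-be` and `utf-16-le` codecs produce the same bytes, which the sketch uses as a cross-check.

```python
code_point = 0x4E25

big = code_point.to_bytes(2, byteorder="big")        # 4E first, then 25
little = code_point.to_bytes(2, byteorder="little")  # 25 first, then 4E

print(big.hex().upper())      # 4E25
print(little.hex().upper())   # 254E

# Cross-check against the built-in two-byte codecs.
same_as_codecs = (
    chr(code_point).encode("utf-16-be") == big
    and chr(code_point).encode("utf-16-le") == little
)
print(same_as_codecs)
```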

These two odd names come from Jonathan Swift's "Gulliver's Travels." In the book, a civil war breaks out in Lilliput over whether eggs should be cracked open from the big end (big-endian) or from the little end (little-endian). Six rebellions broke out over this matter; one emperor lost his life and another lost his throne.

Therefore, storing the first byte first is "big endian", and storing the second byte first is "little endian".

So naturally, a problem arises: How does a computer know which way a file is encoded?

The Unicode specification defines a character that is placed at the very beginning of a file to indicate the byte order. This character is called "zero width no-break space" (ZERO WIDTH NO-BREAK SPACE) and is represented by FEFF. This is exactly two bytes, and FF is 1 larger than FE.

If the first two bytes of a text file are FE FF, the file is big endian; if the first two bytes are FF FE, the file is little endian.
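A program can apply this rule by inspecting the first two bytes. The sketch below uses the constants `codecs.BOM_UTF16_BE` and `codecs.BOM_UTF16_LE` from Python's standard library; the function name `detect_byte_order` is hypothetical, and files without either mark are simply reported as unknown.

```python
import codecs


def detect_byte_order(data: bytes) -> str:
    """Guess the byte order of two-byte Unicode text from its first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big endian"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little endian"
    return "unknown"


print(detect_byte_order(bytes.fromhex("FEFF4E25")))   # big endian
print(detect_byte_order(bytes.fromhex("FFFE254E")))   # little endian
print(detect_byte_order(bytes.fromhex("4E25")))       # unknown
```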

8. Examples

Here is an example.

Open Notepad (Notepad.exe), create a new text file whose content is the single character 严 ("Yan"), and save it in turn with the ANSI, Unicode, Unicode big endian, and UTF-8 encodings.

Then, use the hex mode of the text editor UltraEdit to observe how each file is encoded internally.

1) ANSI: the file content is the two bytes "D1 CF", which is the GB2312 encoding of 严 ("Yan"); this also implies that GB2312 is stored in big endian order.

2) Unicode: the encoding is four bytes, "FF FE 25 4E", where "FF FE" indicates little endian storage and the actual code is 4E25.

3) Unicode big endian: the encoding is four bytes, "FE FF 4E 25", where "FE FF" indicates big endian storage.

4) UTF-8: the encoding is six bytes, "EF BB BF E4 B8 A5". The first three bytes, "EF BB BF", indicate that this is UTF-8; the last three, "E4 B8 A5", are the actual encoding of 严, and its storage order is the same as its encoding order.
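The four byte sequences can also be reproduced without Notepad. The sketch below is an approximation of what the four Save As options write, built from Python's standard codecs (`gb2312`, `utf-16-le`, `utf-16-be`, `utf-8-sig`); the mapping of option names to codecs is an assumption based on the description above, and the ANSI case assumes a Simplified Chinese system.

```python
import codecs

yan = "\u4e25"  # the character used in the example, U+4E25

# Assumed mapping from Notepad's option names to Python codecs.
files = {
    "ANSI": yan.encode("gb2312"),
    "Unicode": codecs.BOM_UTF16_LE + yan.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + yan.encode("utf-16-be"),
    "UTF-8": yan.encode("utf-8-sig"),   # utf-8-sig prepends EF BB BF
}

# Show each file's bytes as space-separated uppercase hex.
hex_dumps = {
    label: " ".join(f"{b:02X}" for b in data) for label, data in files.items()
}
for label, dump in hex_dumps.items():
    print(f"{label}: {dump}")
```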

9. Extended Reading

* The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (the most basic knowledge about character sets)

* About Unicode encoding

* RFC 3629: UTF-8, a transformation format of ISO 10646 (the specification for implementing UTF-8)

Finish
