Understanding text encodings: ASCII, GB2312, GBK, GB18030, Unicode, UCS, UTF

Source: Internet
Author: User

As we all know, text follows a fixed path from input to display to storage: input code (depends on the input method) → internal code (depends on the language environment; different system languages use different encodings) → font code (depends on the font) → storage code (depends on the encoding chosen when saving). What are the similarities and differences between the various storage encodings?

First, ASCII series encoding

Let us start with ASCII (American Standard Code for Information Interchange). It is an old code, defined by the American National Standards Institute (ANSI), and it is still the most widely used character set and encoding in computing. ASCII comes in a 7-bit form and an 8-bit form and is used to encode English text. The 7-bit form defines 128 codes, 0x00 to 0x7F: 33 control characters and 95 printable characters.

In every 7-bit ASCII code the highest bit of the byte is 0, so the 128 byte values 0x80 to 0xFF go unused. Extended ASCII codes were created to make use of them.
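The point about the highest bit can be checked with a few lines of Python. This is only a small sketch; the sample string and the byte 0xE9 are arbitrary choices for illustration.

```python
# Every ASCII code fits in 7 bits, so the top bit of each byte is 0.
text = "Hello, ASCII!"
data = text.encode("ascii")

codes = list(data)                    # numeric code of each character
top_bits = [b >> 7 for b in data]     # highest bit of each byte

print(codes)
print(top_bits)

# A byte with the high bit set lies outside 7-bit ASCII and is rejected.
try:
    bytes([0xE9]).decode("ascii")
    high_byte_ok = True
except UnicodeDecodeError:
    high_byte_ok = False
print(high_byte_ok)
```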

Second, GB series encoding

Next, the GB series. GB stands for guobiao, the national standard. GB2312-80 is built on the concept of a location code (zone-position code): a standard lays the characters out in one large matrix, with rows and columns numbered from 1, and each character is represented by two bytes. However, some byte values have special meanings inside a computer, so the location code is not used directly to represent characters. An internal code is used instead, and the internal code differs from region to region. For computers in mainland China the internal code is GB2312-80 (more precisely, GB18030 today). For Japan it is Shift_JIS.

So what is the location code for? This brings in another concept, the font code. A character's font code is determined by its location code, and it is stored in the font file: each font file stores its glyph library indexed by location code.

How is the location code converted into the internal code? Each of the two bytes could in principle range over 0-255, but GB text may be stored mixed with ASCII. To avoid confusing the two, GB2312-80 stipulates that the highest bit of both internal-code bytes is 1. After also setting aside the 32 values that correspond to the ASCII control characters and leaving some space in reserve, the available range becomes 0xA0~0xFF. (Another explanation is that BCD codes, used for exact numeric calculation, occupy 0x00~0x99, which leaves 0x9A~0xFF, and that the range was then narrowed to 0xA0~0xFF to keep some space in reserve. Which explanation is correct remains to be verified.) Numbering starts from 1 by convention, and 0xFF is often treated as an end marker, especially in C, so both ends are dropped. The final range is 0xA1~0xFE for each byte, that is, 94x94. In other words, each byte of the internal code is the zone or position number plus 0xA0.
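The conversion between location code and internal code can be sketched as follows. The function names are made up for this illustration. The character used, "啊", sits at zone 16, position 01 in GB2312, and Python's built-in `gb2312` codec is used only to cross-check the result.

```python
def location_to_internal(zone, position):
    """Convert a GB2312 zone/position pair (each 1..94) to the 2-byte internal code."""
    if not (1 <= zone <= 94 and 1 <= position <= 94):
        raise ValueError("zone and position must be in 1..94")
    # Adding 0xA0 sets the high bit and moves 1..94 into 0xA1..0xFE.
    return bytes([zone + 0xA0, position + 0xA0])


def internal_to_location(code):
    """Inverse of location_to_internal: 2-byte internal code -> (zone, position)."""
    high, low = code
    return high - 0xA0, low - 0xA0


code = location_to_internal(16, 1)
print(code.hex())               # b0a1
print(code.decode("gb2312"))    # 啊
print(internal_to_location("啊".encode("gb2312")))  # (16, 1)
```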

The result is this: GB2312-80 contains 7,445 Chinese characters and graphic symbols in total, of which 6,763 are Chinese characters, divided into two levels: 3,755 level-1 characters and 3,008 level-2 characters. The characters and symbols are arranged into 94 "zones", each holding 94 characters, and each slot within a zone is called a "position". In the Chinese character area of the internal code the high byte runs from 0xB0 to 0xF7 and the low byte from 0xA1 to 0xFE, which gives 72*94=6768 code positions. Five of them, 0xD7FA-0xD7FE, are unassigned.
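The figures above can be checked with simple arithmetic. One assumption not stated in the text: the level-1 characters are taken to occupy high bytes 0xB0 to 0xD7 (zones 16 to 55) and the level-2 characters the rest, which is the usual description of the GB2312 layout.

```python
ZONE_SIZE = 94                                # positions per zone

hanzi_zones = 0xF7 - 0xB0 + 1                 # high bytes B0..F7
slots = hanzi_zones * ZONE_SIZE               # all code positions in the area
unassigned = 0xD7FE - 0xD7FA + 1              # empty positions D7FA..D7FE

# Assumed split: level 1 ends at high byte 0xD7, where the empty positions are.
level1 = (0xD7 - 0xB0 + 1) * ZONE_SIZE - unassigned
level2 = (0xF7 - 0xD8 + 1) * ZONE_SIZE

print(hanzi_zones, slots, slots - unassigned)  # 72 6768 6763
print(level1, level2, level1 + level2)         # 3755 3008 6763
```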

So how is text stored in memory? My guess is that text in memory is held in the current system's internal code: whatever encoding the file uses on disk, Unicode or otherwise, it is converted to the internal code in memory, because the font file is displayed according to the computer's internal code.

The above covers only GB2312-80. The standard has since evolved along the path GB2312-80 → GBK → GB18030. The latest, GB18030, includes 27,484 Chinese characters as well as Tibetan, Mongolian, Uyghur and other major minority scripts. It is a variable-length encoding that uses one, two or four bytes per character. GBK is a double-byte encoding. As for the 94 limit mentioned above, GBK requires the highest bit to be 1 only in the first byte and lifts that restriction on the second byte, which roughly doubles the code space. Existing GB2312 data is unaffected, and the space is used more efficiently. For details, see the main differences between GB2312, GBK and GB18030.
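A quick way to see the relationship between the three standards is to encode a few characters with Python's built-in `gb2312`, `gbk` and `gb18030` codecs. The helper name `try_encode` and the sample characters are choices made for this sketch: an ASCII letter, a GB2312 character, the traditional character 龍, and the Tibetan letter U+0F40.

```python
def try_encode(ch, encoding):
    """Return the encoded bytes, or None if the encoding has no code for ch."""
    try:
        return ch.encode(encoding)
    except UnicodeEncodeError:
        return None


samples = ["A", "啊", "龍", "\u0f40"]
for ch in samples:
    row = {enc: try_encode(ch, enc) for enc in ("gb2312", "gbk", "gb18030")}
    # Show each result as hex, or None where the character is not covered.
    print(f"U+{ord(ch):04X}", {k: (v.hex() if v else None) for k, v in row.items()})
```

Characters covered by an older standard keep the same bytes in the newer ones, while characters outside the two-byte tables take four bytes in GB18030.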

Third, Unicode universal code

Unicode is also known as the universal code or unified code. As the name suggests, it unifies the character codes of all the world's scripts, and it is maintained by the Unicode Consortium. With the rapid development of computer networks, it emerged to solve the problem of storing and displaying text from different countries and regions. In fact, Unicode only assigns a code to each character and does not specify how that code is stored, which is a bit like the location code described above. Many Unicode codes also contain byte values that systems reserve, such as 0xFF. Storing them directly would cause problems, because many text-processing systems treat 0xFF as an end marker. Unicode codes therefore cannot be stored as they are, and a series of implementation schemes exists for that purpose: the UTF series.

The UTF series includes UTF-8, UTF-16 and UTF-32. UTF-8 is a variable-length encoding that uses one to four bytes per character. The scheme is as follows: a single-byte character starts with a 0 bit. For a multi-byte character, the first n bits of the leading byte are 1, where n is the number of bytes the character occupies, and they are followed by a 0. Each of the remaining bytes starts with the two bits 10. All other bits carry the Unicode code. That is:

Unicode code (hexadecimal)    UTF-8 byte stream (binary)
000000-00007F                 0xxxxxxx
000080-0007FF                 110xxxxx 10xxxxxx
000800-00FFFF                 1110xxxx 10xxxxxx 10xxxxxx
010000-10FFFF                 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
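The table can be turned into a small hand-written encoder. This is a minimal sketch for illustration, not a replacement for the built-in codec: the function name is made up, and it does not reject the surrogate range D800-DFFF as a strict encoder would.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point by hand, following the bit layout table."""
    if code_point < 0x80:                       # 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0x10FFFF:                  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0xF0 | (code_point >> 18),
                      0x80 | ((code_point >> 12) & 0x3F),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    raise ValueError("code point out of Unicode range")


for ch in ["A", "é", "啊", "😀"]:
    encoded = utf8_encode(ord(ch))
    # Print the code point and the bit pattern of every encoded byte.
    print(f"U+{ord(ch):04X}", " ".join(f"{b:08b}" for b in encoded))
```

Each printed bit pattern should match the row of the table that covers the character's code.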

UTF-16 uses two-byte units (a character beyond U+FFFF takes two units, that is, four bytes), while UTF-32 always uses four bytes. Both come in big-endian and little-endian variants, which are distinguished by a BOM (byte order mark) as follows:

UTF encoding          Byte order mark (BOM)
UTF-8 without BOM     none
UTF-8 with BOM        EF BB BF
UTF-16LE              FF FE
UTF-16BE              FE FF
UTF-32LE              FF FE 00 00
UTF-32BE              00 00 FE FF
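BOM detection can be sketched with the constants from Python's standard `codecs` module, where the UTF-32 big-endian mark is 00 00 FE FF. The function names and the UTF-8 fallback are assumptions made for this example. Note that a UTF-16LE text whose first character is U+0000 would be mistaken for UTF-32LE, so this is a heuristic rather than a guarantee.

```python
import codecs

# Check the longest marks first: the UTF-32LE mark begins with the UTF-16LE mark.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),   # FF FE 00 00
    (codecs.BOM_UTF32_BE, "utf-32-be"),   # 00 00 FE FF
    (codecs.BOM_UTF8, "utf-8"),           # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),   # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),   # FE FF
]


def sniff_bom(data):
    """Return (encoding name, BOM length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data, fallback="utf-8"):
    """Strip a leading BOM if there is one and decode with the matching codec."""
    name, skip = sniff_bom(data)
    return data[skip:].decode(name or fallback)


print(sniff_bom(b"\xff\xfeh\x00i\x00"))         # ('utf-16-le', 2)
print(decode_with_bom(b"\xff\xfeh\x00i\x00"))   # hi
```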

For the details of each implementation scheme, see Baidu Baike.

One thing is a little strange, though: when we save text, there is often a "Unicode" option. Which of the encodings above is it? Strictly speaking, none of them: it generally refers to UCS-2. (That is not entirely accurate either. UCS-2 is a subset of UTF-16, so one can just as well say that this "Unicode" is UTF-16.) What is UCS-2? Read on.

Fourth, UCS series encoding

UCS is the Universal Multiple-Octet Coded Character Set, a unified character encoding that is similar to Unicode, or one might even say identical. Why are there two? Because UCS and Unicode were created by two different organizations: UCS was created by ISO (the International Organization for Standardization). After a while both sides realised that the world did not need two incompatible encodings with the same function, so they reached a consensus. UCS has two implementations, UCS-2 and UCS-4. UCS-2 uses two bytes per character, and when we save a file as "Unicode" it is usually saved as UCS-2. UCS-4 uses four bytes per character. Since the two standards have been aligned, UCS-2 can be seen as a subset of UTF-16, and UCS-4 is functionally identical to UTF-32.
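The subset relationship between UCS-2 and UTF-16 shows up once a character lies beyond U+FFFF: UCS-2 has no code for it, while UTF-16 represents it with a pair of two-byte units (a surrogate pair). The following is a minimal sketch with a made-up function name, checked against Python's built-in codec.

```python
def utf16_units(code_point):
    """Return the UTF-16 code units (as integers) for one code point."""
    if code_point < 0x10000:
        return [code_point]               # one unit, same value as in UCS-2
    offset = code_point - 0x10000         # 20 bits remain
    high = 0xD800 + (offset >> 10)        # top 10 bits -> high surrogate
    low = 0xDC00 + (offset & 0x3FF)       # bottom 10 bits -> low surrogate
    return [high, low]


print([f"{u:04X}" for u in utf16_units(ord("啊"))])   # ['554A']
print([f"{u:04X}" for u in utf16_units(ord("😀"))])   # ['D83D', 'DE00']
print("😀".encode("utf-16-be").hex())                 # d83dde00
```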

At present Unicode dominates and UTF-8 is the mainstream. The UCS encodings are basically equated with UTF-16 and UTF-32, so UCS has largely faded from view. (Windows NT uses UCS-2.)

Fifth, other encodings

Another name we often see is ANSI. It is not an encoding in itself but a code-page pointer that indicates which encoding the current system uses. On a Simplified Chinese system it points to GBK (code page 936). The encodings above are the ones we meet most often. There are also some less common ones, such as Big5 for Traditional Chinese and UTF-7, which is used mostly for mail transfer.
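Because "ANSI" only points at a code page, the same bytes read differently on different systems. The sketch below decodes one byte pair under two code pages chosen for illustration: `gbk` for a Chinese system and `cp1252` for a Western European one. The last lines print the current machine's preferred encoding, whose value depends on where the code runs.

```python
import locale

raw = b"\xb0\xa1"   # one character in GBK, two characters in cp1252

as_chinese = raw.decode("gbk")
as_western = raw.decode("cp1252")
print(as_chinese)   # 啊
print(as_western)   # °¡

# The encoding the current system would choose by default (system dependent).
preferred = locale.getpreferredencoding(False)
print(preferred)
```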

So what about compatibility among so many different character sets?

Almost all encodings are compatible with ASCII, and within the GB series each later standard is compatible with the earlier ones. The UTF encodings are not compatible with one another. GB is incompatible with UTF. Big5 is incompatible with the others. That is all I can think of for now.
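These compatibility claims can be checked byte by byte with Python's built-in codecs. The sample strings are arbitrary. UTF-16 and UTF-32 are included to show the exception behind "almost all": they do not store ASCII text as single ASCII bytes.

```python
text = "ASCII"
encodings = ["ascii", "gb2312", "gbk", "gb18030", "big5",
             "utf-8", "utf-16-le", "utf-32-le"]

# True where the encoded bytes are identical to plain ASCII.
ascii_compatible = {enc: text.encode(enc) == text.encode("ascii")
                    for enc in encodings}
print(ascii_compatible)

# One Chinese character: same bytes across the GB family, different in UTF.
han = "啊"
han_bytes = {enc: han.encode(enc).hex()
             for enc in ["gb2312", "gbk", "gb18030", "utf-8", "utf-16-be"]}
print(han_bytes)
```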

Sixth, supplementary knowledge

About URL encoding: content submitted by a form is converted according to the character set (charset) given in the Content-Type. English letters and digits are sent as they are, and any other byte is written as % followed by its two-digit hexadecimal value. Chinese or other text is normally encoded in that character set and each resulting byte is percent-encoded. If a character cannot be represented in that character set, it is first converted to its decimal character code in the form &#XXXXX; and that string is then URL-encoded into the % form and sent to the server. This is how content is encoded when it is submitted in a URL with the GET method.

For example, %26%2321834%3b%26%2321834%3b%26%2321834%3b%26%2321834%3b becomes &#21834;&#21834;&#21834;&#21834; after URL decoding. Here 21834 is the decimal value of the character's Unicode code (U+554A), not a UTF-8 byte value.
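The two decoding steps can be reproduced with Python's standard library: `urllib.parse.unquote` undoes the percent-encoding and `html.unescape` resolves the numeric character references (21834 is the decimal form of U+554A). The last two lines show, for comparison, how the same character looks when its bytes are percent-encoded directly under GBK and UTF-8. The variable names are arbitrary.

```python
import html
from urllib.parse import quote, unquote

encoded = "%26%2321834%3b%26%2321834%3b%26%2321834%3b%26%2321834%3b"

step1 = unquote(encoded)        # percent-decoding
step2 = html.unescape(step1)    # numeric character references -> characters
print(step1)                    # &#21834;&#21834;&#21834;&#21834;
print(step2)                    # 啊啊啊啊

# Direct percent-encoding of the character's bytes in two character sets.
print(quote("啊", encoding="gbk"))     # %B0%A1
print(quote("啊", encoding="utf-8"))   # %E5%95%8A
```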

Seventh, more content

For more on the implementation and history of these encodings, see the corresponding wiki and encyclopedia entries. Here are a few blog posts that helped me a lot:

The difference between ANSI,UTF8,UNICODE,ASCII encoding

  Character-coded notes: Ascii,unicode and UTF-8

Parsing of text encoding Ascii,gb2312,gbk,gb18030,unicode,ucs,utf
