"Turn" about character encoding, all you need to know (ascii,unicode,utf-8,gb2312 ... )


Reprint address: http://www.imkevinyang.com/2010/06/%E5%85%B3%E4%BA%8E%E5%AD%97%E7%AC%A6%E7%BC%96%E7%A0%81%EF%BC%8C%E4%BD%A0%E6%89%80%E9%9C%80%E8%A6%81%E7%9F%A5%E9%81%93%E7%9A%84.html

The problem of character encoding seems small and is often overlooked by technical staff, but it can easily lead to some puzzling problems. Here is a summary of popular knowledge about character encoding; I hope it is helpful to everyone.

As always, we have to start with ASCII.

Speaking of character encoding, we have to begin with a brief history of ASCII. Computers were originally invented to solve numerical computing problems; later, people found that computers could do more, such as text processing. But since a computer only understands numbers, people had to tell it which number represents which character, for example 65 for the letter 'A', 66 for the letter 'B', and so on. The character-to-number correspondence must be consistent across computers, otherwise the same numbers would display as different text on different machines. The American National Standards Institute (ANSI) therefore issued a standard specifying the set of commonly used characters and the number corresponding to each: the ASCII character set (character set), also known as the ASCII code.

At that time, computers generally used 8 bits as a byte, the smallest unit of storage and processing. Very few characters were in use then: 26 uppercase letters, 26 lowercase letters, and the digits, plus other common symbols, came to fewer than 100 in total. So ASCII uses 7 bits to store and process characters efficiently, leaving the remaining highest bit available as a parity bit in some communication systems.

Note that a byte is the smallest unit a system can handle, and is not necessarily 8 bits; it simply happens that modern computers have standardized on 8-bit bytes. In many technical specifications, to avoid ambiguity, the term octet is preferred over byte to emphasize an 8-bit binary unit. For ease of understanding, this article sticks with the familiar notion of a "byte".

The ASCII character set consists of 95 printable characters (0x20-0x7E) and 33 control characters (0x00-0x1F, plus 0x7F). Printable characters are displayed on an output device such as a screen or paper, while control characters send special instructions to the computer: for example, 0x07 makes the computer beep, 0x00 is typically used to mark the end of a string, 0x0D (carriage return) instructs the printer's print head to return to the beginning of the line, and 0x0A (line feed) moves it to the next line.

Character encoding and decoding at that time were very simple: a plain table lookup. For example, to encode a character sequence into a binary stream for a storage device, you simply look up each character's corresponding byte in the ASCII character set and write that byte directly to the device. Decoding a binary stream is the reverse process.
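
To make the lookup concrete, here is a minimal Python sketch of table-based encoding and decoding (my addition; the helper names are made up, and Python's ord()/chr() stand in for a physical code table):

    def ascii_encode(text: str) -> bytes:
        # Look up each character's number in the ASCII character set (0-127)
        # and write that byte directly to the output stream (no range check in this sketch).
        return bytes(ord(ch) for ch in text)

    def ascii_decode(stream: bytes) -> str:
        # The reverse lookup: each byte maps back to exactly one character.
        return "".join(chr(b) for b in stream)

    print(ascii_encode("AB"))         # b'AB'  (bytes 0x41 0x42)
    print(ascii_decode(b"\x41\x42"))  # 'AB'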

The emergence of OEM character sets

As computers spread, people gradually found that the meager 128 characters of the ASCII character set no longer met their needs. A byte can represent 256 numbers, and ASCII uses only 0x00~0x7F, the first 128 of them; the remaining 128 numbers were going to waste, so many people had designs on the numbers 0x80-0xFF at the back. The problem is that many people had this idea at the same time, and everyone had their own opinion about which characters those 128 numbers should represent. The result was a large variety of OEM character sets on machines sold around the world.

The following table shows one of the OEM character sets shipped with the IBM PC. The first 128 characters are basically consistent with the ASCII character set ("basically" because the first 32 control characters are in some cases interpreted by the IBM PC as printable characters). The remaining 128 code points include accented characters used in some European countries, as well as characters used for drawing lines.

In fact, most OEM character sets are compatible with the ASCII character set; that is, they interpret 0x00~0x7F in essentially the same way, but their interpretations of the second half, 0x80~0xFF, are not necessarily the same. Sometimes even the same character corresponds to different bytes in different OEM character sets.

Different OEM character sets prevented people from exchanging documents across machines. For example, employee A sends a résumé containing the word "résumés" to employee B, and what B sees is "rsums", because on A's machine the character "é" corresponds to byte 0x82 in the local OEM character set, while B's machine uses a different OEM character set in which the byte 0x82 decodes to a different character entirely.

Multibyte character sets (MBCS) and Chinese character sets

The character sets mentioned above are all single-byte encodings: one byte translates to one character. This may not be a problem for countries using the Latin alphabet, since by using the 8th bit they can reach 256 characters, which is enough. But for Asian countries, 256 characters are nowhere near enough. So, in order to use computers while remaining compatible with the ASCII character set, people in these countries invented multibyte encodings, and the corresponding character sets are called multibyte character sets. For example, China uses double-byte character set encoding (DBCS, Double-Byte Character Set).

For a single-byte character set, the code page needs only one code table, which records the characters represented by the 256 numbers; the program completes encoding and decoding with a simple table lookup.

A code page is the concrete implementation of a character set's encoding. You can think of it as a "character-byte" mapping table through which character-to-byte translation is carried out; a more detailed description follows below.

For a multibyte character set, the code page often contains many code tables. So how does the program know which code table to use to decode a byte stream? The answer: it selects a code table based on the first byte.

Take the most commonly used Chinese character set, GB2312, which covers all simplified Chinese characters and some others; GBK (K for extension) adds further non-simplified characters on top of GB2312, such as traditional Chinese characters (the GB18030 character set will be mentioned when we discuss Unicode). Characters in both character sets are represented using 1-2 bytes. Windows uses code page 936 to encode and decode the GBK character set. When parsing a byte stream, if the highest bit of a byte is 0, it is decoded using the first code table in code page 936, which is consistent with how single-byte character sets are encoded and decoded.

When the highest bit of the byte is 1, or more precisely, when the first byte lies between 0x81 and 0xFE, the corresponding code table in the code page is selected according to that first byte. For example, when the first byte is 0x81, the following code table in code page 936 applies:

(See MSDN: http://msdn.microsoft.com/en-us/library/cc194913%28v=msdn.10%29.aspx for the complete code table information of code page 936.)

According to the code tables of code page 936, when a program encounters the consecutive byte stream 0x81 0x40, it decodes it to the character "丂".
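
Here is a small Python sketch of that lead-byte dispatch (my addition; Python's built-in gbk codec stands in for code page 936, and the classify name is made up):

    def classify(stream: bytes) -> None:
        i = 0
        while i < len(stream):
            b = stream[i]
            if b <= 0x7F:             # highest bit 0: the single-byte, ASCII-compatible table
                print(f"{b:#04x}   -> {chr(b)!r}")
                i += 1
            elif 0x81 <= b <= 0xFE:   # lead byte: select the code table for this value,
                pair = stream[i:i+2]  # then consume the trailing byte as well
                print(f"{pair.hex()} -> {pair.decode('gbk')!r}")
                i += 2
            else:
                raise ValueError(f"invalid lead byte {b:#04x}")

    classify(b"\x81\x40A")  # 0x81 0x40 decodes to '丂', then 0x41 to 'A'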

ANSI standards, national standards, ISO standards

The emergence of all these ASCII-derived character sets made document exchange very difficult, so standards organizations stepped in one after another. The American ANSI organization developed standard character encodings (note that what we now usually call "ANSI encoding" actually refers to the platform's default encoding, for example ISO-8859-1 on English operating systems and GBK on Chinese systems), ISO developed its standard character encodings, and many countries developed national standard character sets, such as GBK, GB2312, and GB18030 in China.

When an operating system is released, these standard character sets are usually preinstalled on the machine along with platform-specific character sets, so as long as your documents are written using a standard character set, their portability is high. For example, a document written with the GB2312 character set can be displayed correctly on any machine in mainland China. At the same time, we can read documents in the languages of multiple countries on a single machine, provided the character sets used by those documents are installed on it.

The appearance of Unicode

Although by installing different character sets we can view documents in different languages on one machine, we still cannot solve one problem: displaying all characters within a single document. To solve it, we need a huge character set on which all of humanity agrees, and that is the Unicode character set.

Overview of the Unicode character set

The Unicode character set covers all characters currently used by humanity, and each character is uniformly numbered, being assigned a unique character code (code point). The Unicode character set divides all characters by usage into 17 planes, with 2^16 = 65,536 code points of space in each plane.

Plane 0, the Basic Multilingual Plane (BMP), covers almost all characters used in the world today. The other planes either represent ancient scripts or are reserved for extension. The Unicode characters we normally use are located in the BMP. A large amount of code point space in the Unicode character set is still unused.

Changes in the coding system

Before the advent of Unicode, every character set was bound to a specific encoding scheme: characters were directly hard-wired to the final byte stream. For example, the ASCII encoding system specified the use of 7 bits to encode the ASCII character set; GB2312 and the GBK character set limited every character to at most 2 bytes and specified the byte order. Such encoding systems typically use a simple lookup table, that is, a code page, to map characters directly to the byte stream on the storage device.

The disadvantage of this approach is that the character set and the byte stream are coupled too tightly, which limits the character set's ability to expand. Suppose Martians came to settle on Earth: adding Martian script to the existing character sets would be difficult or impossible, and would easily break the existing encoding rules.

Unicode was therefore designed with this in mind: it separates the character set from the character encoding scheme.

That is, although each character has a uniquely determined number in the Unicode character set (its character code, also called its Unicode code point), the final byte stream is determined by the specific character encoding. For example, for the Unicode character "A", UTF-8 encoding produces the byte stream 0x41, while UTF-16 (big-endian mode) produces 0x00 0x41.

Common Unicode encoding

UCS-2 / UTF-16

How would we design an encoding scheme for the BMP characters of the Unicode character set? Since the BMP contains 2^16 = 65,536 code points, two bytes are enough to represent every one of its characters.

For example, the Unicode character code for "Medium" is 0x4e2d (01001110 00101101), then we can encode 01001110 00101101 (big endian) or 00101101 01001110 (small end).

For characters in the BMP, UCS-2 and UTF-16 both use 2 bytes, and the encoding results are exactly the same. The difference is that UCS-2 was originally designed only with BMP characters in mind, so it uses a fixed 2-byte length and cannot represent Unicode characters in the other planes, while UTF-16, in order to remove this restriction and support encoding and decoding of the full Unicode character set, uses a variable-length encoding of at least 2 bytes. To encode characters outside the BMP, UTF-16 needs a 4-byte surrogate pair, which will not be discussed further here; the interested reader can consult Wikipedia: UTF-16/UCS-2.
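
To illustrate the difference (my addition; U+1F600 is an emoji outside the BMP):

    print("\u4e2d".encode("utf-16-be").hex(" "))      # 4e 2d       (BMP character: 2 bytes)
    print("\U0001F600".encode("utf-16-be").hex(" "))  # d8 3d de 00 (surrogate pair: 4 bytes)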

Windows has used UTF-16 encoding since the NT era, and many popular programming platforms, such as .NET, Java, Qt, and Cocoa on the Mac, use UTF-16 as their basic character encoding. For example, a string in your code corresponds to a UTF-16-encoded byte stream in memory.

UTF-8

UTF-8 is probably the most widely used Unicode encoding scheme. Because UCS-2/UTF-16 uses two bytes even for ASCII characters, storage and processing are relatively inefficient, and since the high byte of every UTF-16-encoded ASCII character is always 0x00, many C language functions treat that byte as the end of the string and parse the text incorrectly. As a result, UTF-16 was initially resisted by many Western countries, which considerably hindered the adoption of Unicode. Later, clever people invented the UTF-8 encoding to solve these problems.

The UTF-8 encoding scheme uses 1-4 bytes to encode a character, and the method is very simple (the x's stand for the bits of the Unicode code point):

    0x0000  - 0x007F:   0xxxxxxx
    0x0080  - 0x07FF:   110xxxxx 10xxxxxx
    0x0800  - 0xFFFF:   1110xxxx 10xxxxxx 10xxxxxx
    0x10000 - 0x10FFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

ASCII characters are encoded with a single byte, identical to ASCII encoding, so every document originally encoded as ASCII can be treated directly as UTF-8. Other characters use 2-4 bytes, where the number of leading 1 bits in the first byte indicates how many bytes the character needs in total, and the high 2 bits of every remaining byte are always 10. For example, if the first byte is 1110yyyy, the three leading 1s mean that correct parsing requires 3 bytes in total, and the next 2 bytes must each begin with 10 for the character to be parsed correctly.
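
As a worked example (my addition), here is "中" (U+4E2D) encoded by hand following those rules and checked against Python's utf-8 codec:

    # U+4E2D = 0100111000101101 (16 bits) falls in the 3-byte range,
    # so its bits are split 4/6/6 into 1110xxxx 10xxxxxx 10xxxxxx.
    cp = 0x4E2D
    b1 = 0b11100000 | (cp >> 12)          # 11100100 = 0xE4
    b2 = 0b10000000 | ((cp >> 6) & 0x3F)  # 10111000 = 0xB8
    b3 = 0b10000000 | (cp & 0x3F)         # 10101101 = 0xAD
    print(bytes([b1, b2, b3]))            # b'\xe4\xb8\xad'
    print("\u4e2d".encode("utf-8"))       # b'\xe4\xb8\xad' (matches)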

For more information about UTF-8, refer to Wikipedia: UTF-8.

GB18030

Any encoding capable of mapping Unicode characters to a byte stream counts as a Unicode encoding. China's GB18030 encoding covers all Unicode characters, so it is also considered a Unicode encoding. But its method is not like UTF-8 or UTF-16, where the Unicode code point is converted to bytes by a fixed rule; GB18030 can only encode and decode by table lookup.
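
A quick round-trip demonstration (my addition, using Python's built-in gb18030 codec):

    # GB18030 covers all of Unicode, so any string survives a round trip;
    # the byte values come from lookup tables, not from a bit-pattern rule.
    for ch in ("中", "é", "\U0001F600"):
        raw = ch.encode("gb18030")
        print(ch, raw.hex(" "), raw.decode("gb18030") == ch)  # always True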

For more information about GB18030, refer to: GB18030.

Common issues related to Unicode

Is Unicode two bytes?

Unicode itself only defines a huge, universal character set and assigns a unique number to every character; how many bytes a character occupies depends on the specific character encoding scheme used to store it as a byte stream. The recommended Unicode encodings are UTF-16 and UTF-8.

What does "UTF-8 with signature" mean?

"With signature" means the byte stream begins with a BOM (byte order mark). Much software will "intelligently" detect the character encoding used by a byte stream, and for efficiency this detection usually examines only the leading bytes of the stream to see whether they match the encoding rules of common character encodings. Since UTF-8 and ASCII produce identical bytes for plain English text, they cannot be told apart; by adding a BOM marker at the start of the byte stream, you tell the software that a Unicode encoding is in use, and detection becomes very reliable. Note, however, that not all software and programs handle the BOM correctly; PHP, for example, does not detect the BOM and parses it as part of the ordinary byte stream. So if your PHP files are saved as UTF-8 with a BOM, problems may arise.
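
The UTF-8 signature is the 3-byte sequence EF BB BF. A short Python demonstration (my addition; utf-8-sig is Python's name for "UTF-8 with signature"):

    data = "中文".encode("utf-8-sig")        # the utf-8-sig codec prepends the BOM
    print(data.hex(" "))                     # ef bb bf e4 b8 ad e6 96 87
    print(data.startswith(b"\xef\xbb\xbf"))  # True: this is how software detects it
    print(data.decode("utf-8-sig"))          # '中文' (BOM stripped)
    print(data.decode("utf-8"))              # '\ufeff中文' (BOM kept, as PHP would see it)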

What is the difference between Unicode encoding and previous character set encoding?

Early concepts such as character encoding, character set, and code page all expressed essentially the same thing. For example, the GB2312 character set, GB2312 encoding, and code page 936 actually refer to the same thing. For Unicode, however, the Unicode character set only defines the set of characters and their unique numbers; "Unicode encoding" is a collective term for concrete encoding schemes such as UTF-8 and UCS-2/UTF-16, not a specific scheme itself. So when you need to state a character encoding, you may write gb2312, codepage936, utf-8, or utf-16, but please do not write "unicode" (I have seen people write charset=unicode in the meta tag of a web page, which is meaningless).

The garbled text problem

Garbled text (mojibake) means that what a program displays cannot be interpreted in any language; it usually contains lots of ? or � characters. The garbled text problem is one that every computer user runs into sooner or later. The cause of garbled text is decoding a byte stream with the wrong character encoding. So whenever you think about any text display problem, stay clear-headed about one thing: what character encoding is currently in use? Only then can you correctly analyze and handle garbled text.
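
A classic example in Python (my addition): bytes produced by one encoding, decoded with another:

    raw = "中文".encode("gbk")    # b'\xd6\xd0\xce\xc4'
    print(raw.decode("latin-1"))  # 'ÖÐÎÄ': plausible-looking characters, wrong text
    print(raw.decode("gbk"))      # '中文': the right decoder recovers the text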

Take the most common case, a garbled web page. If you are a website developer and run into this, you need to check the following possible causes:

    • Whether the Content-Type response header returned by the server declares a character encoding
    • Whether the web page declares a character encoding with a META http-equiv tag
    • Whether the character encoding used when the web page file itself was saved matches the character encoding declared in the page

Note that if the wrong character encoding is used while parsing a web page, scripts or style sheets can also fail. For details, see the articles I wrote earlier on script errors caused by the document character set and on ASP encoding.

Recently I saw feedback on some technical forums that retrieving HTML content from the clipboard with the WinForm Clipboard class's GetData method produces garbled text; my guess is that this too is because WinForm does not use the correct character encoding when retrieving HTML text. The Windows clipboard supports only UTF-8 encoding for this, meaning any text you put in will be UTF-8 encoded. So as long as both programs use the Windows clipboard API, the copy and paste process will not produce garbled text, unless one side decodes the clipboard data with the wrong character encoding (I did a simple WinForm clipboard experiment and found that GetData indeed uses the system default encoding rather than UTF-8).

What can be done about garbled text? Here it must be mentioned that when a program parses a byte stream with a specific character encoding and meets bytes it cannot decode, it substitutes a placeholder such as ? or �. So once the text you end up with contains such characters and the original byte stream is no longer available, the correct information is completely lost; trying other character encodings cannot recover it from such text.
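
This lossiness is easy to demonstrate (my addition):

    raw = "中".encode("gbk")                        # b'\xd6\xd0'
    broken = raw.decode("utf-8", errors="replace")  # '��': two replacement characters
    print(broken)
    print(broken.encode("gbk", errors="replace"))   # not b'\xd6\xd0': the original bytes are gone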

Explanation of the necessary terminology

A character set (character set) is, literally, a collection of characters: the ASCII character set, for example, defines 128 characters, while GB2312 defines 7,445. Strictly speaking, a character set in a computer system refers to an ordered set of numbered characters (the numbers are not necessarily consecutive).

A character code (code point) is the numeric identifier of a character within a character set. For example, the ASCII character set uses the 128 consecutive numbers 0-127 to represent its 128 characters. The GBK character set uses a location code for each character: a 94x94 matrix is defined, whose rows are called "zones" and columns "positions", and all the national characters are placed into this matrix, so every character can be identified by a unique zone-position code. For example, the character "中" sits at position 48 of zone 54, so its character code is 5448. Unicode divides its character set by category into 17 planes numbered 0 to 16, each containing 2^16 = 65,536 character codes, so the total Unicode character space is 17 x 65,536 = 1,114,112 character codes.

Encoding is the process of converting characters into a byte stream.

Decoding is the process of parsing a byte stream back into characters.

Character encoding (character encoding) is a concrete scheme for mapping the character codes of a character set to byte streams. For example, ASCII character encoding specifies that every character is encoded with the low 7 bits of a single byte: the number of 'A' is 65, represented in a single byte as 0x41, so what is written to the storage device is 01000001. GBK encoding adds an offset of 0xA0 (160) to both the zone code and the position code of the location code (the GBK character code); the offset exists mainly for compatibility with the ASCII code. Take the character "中" just mentioned: its location code is 5448, which is 0x36 0x30 in hex, and adding the 0xA0 offset to the zone code and the position code respectively gives 0xD6D0, the GBK encoding result of "中".
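
Checking that arithmetic in Python (my addition):

    zone, pos = 54, 48                # '中' is at zone 54, position 48
    gbk = bytes([zone + 0xA0, pos + 0xA0])
    print(gbk.hex())                  # 'd6d0'
    print(gbk == "中".encode("gbk"))  # True: matches Python's gbk codec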

A code page is a concrete form of character encoding. Early character sets were small, so a table was commonly used to map characters directly to byte streams, with encoding and decoding done by table lookup. Modern operating systems continue this approach: for example, Windows uses code page 936, and the Mac uses the EUC-CN code page, to implement the GBK character set encoding; the names differ, but the encoding of the same character must be identical.

The story of big- and little-endian comes from Gulliver's Travels. We know that an egg is usually smaller at one end, and the people of Lilliput disagreed about which end of the shell to start peeling from. Likewise, the computing world has two conventions for transmitting multi-byte words (data types represented jointly by several bytes): most significant byte first (big-endian) or least significant byte first (little-endian); this is the origin of computer endianness. Whether writing a file or transmitting over a network, we are really writing to a stream device, and the write proceeds from the stream's low address to its high address (which matches human habit). For a multi-byte word, writing the high-order byte first is called big-endian mode, and the opposite is little-endian mode. Network protocols generally transmit in big-endian mode, while the Windows operating system uses UTF-16 in little-endian mode.
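
In Python, the struct module shows both byte orders for the 16-bit value 0x4E2D, the code point of "中" (my addition):

    import struct

    print(struct.pack(">H", 0x4E2D).hex(" "))  # '4e 2d': big-endian, network byte order
    print(struct.pack("<H", 0x4E2D).hex(" "))  # '2d 4e': little-endian, as Windows UTF-16 stores it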

--Kevin Yang

Reference Links:

    • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
    • http://developers.sun.com/dev/gadc/technicalpublications/articles/gb18030.html
    • http://en.wikipedia.org/wiki/Universal_Character_Set
    • http://en.wikipedia.org/wiki/Code_page

"Turn" about character encoding, all you need to know (ascii,unicode,utf-8,gb2312 ... )

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.