Document directory
- Unicode Character Set Overview
- Encoding System Changes
- Common Unicode encodings
- Unicode-Related Frequently Asked Questions
Original article: http://www.cnblogs.com/KevinYang/archive/2010/06/18/1760597.html
Character encoding seems like a minor topic that technical staff often ignore, but it can easily lead to inexplicable problems. Here I will summarize some common knowledge about character encoding and hope it helps you.
It all still has to start with ASCII.
Speaking of character encoding, we have to mention the brief history of ASCII. Computers were first invented to solve numerical computation problems. Later, people found that computers could do much more, such as text processing. But since a computer only recognizes numbers, people had to tell the computer which number represents which character: for example, 65 represents the letter 'A', 66 represents the letter 'B', and so on. However, the correspondence between characters and numbers must be consistent across computers; otherwise, the same number may be displayed as different characters on different machines. Therefore, the American National Standards Institute (ANSI) defined a standard specifying a set of common characters and the number corresponding to each character. This is the ASCII character set, also known as the ASCII code.
At that time, computers generally used an 8-bit byte as the smallest unit of storage and processing. In addition, very few characters were needed: the 26 uppercase and lowercase English letters, the digits, and other commonly used symbols added up to fewer than 100 characters, so ASCII could be stored and processed efficiently in 7 bits, and the highest bit was left over for the parity check used by some communication systems.
Note: a byte is the smallest unit the system can process, and it is not necessarily 8 bits; using 8 bits per byte is merely the de facto standard of modern computers. In many technical specifications, to avoid ambiguity, people prefer the term octet over byte to emphasize an 8-bit binary unit. For ease of understanding, I will keep using the everyday notion of byte.
The ASCII character set consists of 95 printable characters (0x20-0x7E) and 33 control characters (0x00-0x1F and 0x7F). Printable characters are displayed on an output device such as a screen or printer paper; control characters send special commands to the computer. For example, 0x07 makes the computer beep, 0x00 is usually used to mark the end of a string, and 0x0D and 0x0A tell the printer to return the print head to the beginning of the line (carriage return) and to move to the next line (line feed).
The character encoding/decoding system at that time was very simple: it was just a table lookup. For example, to encode a character sequence into a binary stream and write it to a storage device, you only need to look up the byte corresponding to each character in the ASCII character set and write that byte directly to the storage device. Decoding a binary stream is the reverse lookup.
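A minimal sketch (my own, not from the original article) of this table-lookup style of encoding and decoding, using Python's built-in ASCII codec as the "table":

```python
# A toy illustration of encode/decode as table lookup, using ASCII.
text = "Hello"

# Encoding: each character is looked up and written out as one byte.
encoded = text.encode("ascii")        # b'Hello'
print(list(encoded))                  # [72, 101, 108, 108, 111]

# Decoding: each byte is looked up and turned back into a character.
decoded = encoded.decode("ascii")
assert decoded == text
```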
Derivative OEM Character Sets
As computers spread, people gradually found that the meager 128 characters of the ASCII character set could no longer meet their needs. They reasoned that one byte can represent 256 numbers, while ASCII only uses 0x00-0x7F, i.e. the first 128 values, leaving the remaining 128 values idle. So many people started to make use of the values 0x80-0xFF. The problem was that many people had this idea at the same time, but each had their own opinion about which characters the values 0x80-0xFF should correspond to. This led to a large variety of OEM character sets on machines sold around the world.
The following table shows one of the OEM character sets shipped with the IBM PC. The first 128 characters are basically the same as the ASCII character set (only basically, because the first 32 control characters are interpreted by the IBM PC as printable characters in some cases); the next 128 characters add accented characters used in some European countries, plus characters used for drawing lines.
In fact, most OEM character sets are compatible with the ASCII character set; that is, their interpretation of the range 0x00-0x7F is basically the same, while their interpretations of the second half, 0x80-0xFF, are not necessarily the same. Sometimes the same character even corresponds to different bytes in different OEM character sets.
Different OEM character sets made it impossible for people to exchange documents across machines. For example, employee A sends a résumé titled "Résumés" to employee B, but employee B sees a garbled name instead, because the character 'é' corresponds to byte 0x82 in employee A's OEM character set, while in employee B's OEM character set byte 0x82 decodes to a completely different character.
Multi-byte character set (MBCS) and Chinese Character Set
The character sets mentioned above are all based on single-byte encoding, i.e. one byte translates to one character. This is not a problem for Latin-script countries, because by using the 8th bit they can extend to 256 characters, which is enough. For Asian countries, however, 256 characters are far from enough. Therefore, in order to use computers while remaining compatible with the ASCII character set, people in these countries invented multi-byte encodings, and the corresponding character sets are called multi-byte character sets. For example, China uses a double-byte character set encoding (DBCS, Double Byte Character Set).
For a single-byte character set, the code page needs only one code table, which records the character represented by each of the 256 values; encoding and decoding are completed by a simple table lookup.
A code page is the concrete implementation of a character set encoding. You can think of it as a "character-to-byte" mapping table that is used to translate between characters and bytes. A more detailed description is given below.
For a multi-byte character set, the code page usually contains many code tables. So how does a program know which code table to use when decoding a binary stream? The answer is: it selects a code table based on the first byte.
For example, the most commonly used Chinese character set today, GB2312, covers all simplified Chinese characters and some other characters; GBK (K stands for extension) adds other non-simplified characters, such as traditional Chinese characters, on top of GB2312 (GB18030 is not a double-byte character set; we will mention it when talking about Unicode). Characters in both character sets are represented with 1-2 bytes. Windows uses code page 936 to encode and decode the GBK character set. When parsing a byte stream, if the high bit of a byte is 0, the first code table of code page 936 is used for decoding, which is exactly the same as decoding a single-byte character set.
When the high bit of a byte is 1, or more precisely, when the first byte falls in the range 0x81-0xFE, the code table to use is selected in the code page according to that first byte. For example, if the first byte is 0x81, the following code table of code page 936 applies:
(For the complete code table information of code page 936, see MSDN: http://msdn.microsoft.com/en-us/library/cc194913%28v=MSDN.10%29.aspx .)
According to the code table of code page 936, when the program encounters the consecutive bytes 0x81 0x40, it decodes them as the character "丂".
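As an illustrative sketch (my own, not part of the original article), Python's built-in gbk codec shows this lead-byte behaviour; the specific byte values below are taken from the GBK tables:

```python
# Bytes with the high bit 0 decode exactly like ASCII.
assert b"A".decode("gbk") == "A"

# A lead byte in the range 0x81-0xFE starts a two-byte sequence.
assert b"\xd6\xd0".decode("gbk") == "中"     # GBK encodes 中 as 0xD6 0xD0
assert b"\x81\x40".decode("gbk") == "丂"     # the first entry of the 0x81 code table

# Single-byte and double-byte characters can be mixed in one stream.
print(b"A\xd6\xd0B".decode("gbk"))           # -> A中B
```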
ANSI, national, and ISO standards
The emergence of different ASCII-derived character sets made document exchange very difficult, so various organizations stepped in to standardize things. For example, the American standards body ANSI defined standard character encodings (note: what we usually call "ANSI encoding" actually refers to the platform's default encoding, e.g. ISO-8859-1 on an English operating system and GBK on a Chinese system), the ISO organization defined various ISO standard character encodings, and some countries also defined national standard character sets, such as China's GBK, GB2312 and GB18030.
Operating systems usually ship with these standard character sets as well as platform-specific character sets pre-installed, so as long as your documents are written with a standard character set, they are reasonably portable. For example, a document written with the GB2312 character set can be displayed correctly on any machine in mainland China. At the same time, we can read documents in multiple languages from different countries on one machine, provided that the character sets used by those documents are installed on the local machine.
Unicode
Although we can view documents in different languages on one machine by installing different character sets, we still cannot solve one problem: displaying all characters in a single document. To do that, we need a huge character set that all of humanity agrees on, and that is the Unicode character set.
Unicode Character Set Overview
The Unicode character set covers all characters currently used by humans and gives each character a unique number, called a code point. Unicode divides all characters into 17 planes according to their frequency of use, and each plane has 2^16 = 65536 code points.
Plane 0, the Basic Multilingual Plane (BMP), basically covers all characters used in the world today. The other planes are either used for ancient scripts or reserved for future extension. The Unicode characters we normally use all live in the BMP. A large part of the Unicode code space is currently still unused.
Encoding System Changes
Before Unicode appeared, every character set was bound to a specific encoding scheme and thus directly bound to the final byte stream. For example, ASCII requires that its character set be encoded with 7 bits, while GB2312 and GBK use at most 2 bytes to encode all their characters and also specify the byte order. Such encoding systems usually use a simple table lookup: through the code page, characters are mapped directly to the byte stream on the storage device.
The disadvantage of this approach is that characters and byte streams are too tightly coupled, which limits the character set's extensibility. Suppose Martians came to Earth one day: it would be difficult or even impossible to add Martian script to the existing character sets, and doing so would easily break the existing encoding rules.
Therefore, Unicode was designed to separate the character set from the character encoding scheme.
In other words, although every character has a unique number (its Unicode code point) in the Unicode character set, the final byte stream is determined by the specific character encoding. For example, when the Unicode character 'A' is encoded, UTF-8 produces the byte stream 0x41, while UTF-16 (big-endian) produces 0x00 0x41.
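A quick sketch of this separation (my own illustration, using Python's built-in codecs):

```python
# The same code point, different byte streams depending on the encoding.
ch = "A"
print(hex(ord(ch)))                    # 0x41 -- the Unicode code point
print(ch.encode("utf-8"))              # b'A'      -> 0x41
print(ch.encode("utf-16-be"))          # b'\x00A'  -> 0x00 0x41
```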
Common Unicode encodings
UCS-2/UTF-16
How do we devise an encoding scheme for the BMP characters of the Unicode character set? Since the BMP has 2^16 = 65536 code points, two bytes are enough to represent every one of its characters.
For example, the Unicode code point of the character "中" is 0x4E2D (01001110 00101101), so we can encode it as 01001110 00101101 (big-endian) or 00101101 01001110 (little-endian).
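A small sketch (my own, using Python's UTF-16 codecs) showing both byte orders:

```python
ch = "中"
print(hex(ord(ch)))                    # 0x4e2d
print(ch.encode("utf-16-be").hex())    # 4e2d -- big-endian
print(ch.encode("utf-16-le").hex())    # 2d4e -- little-endian
```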
UCS-2 and UTF-16 both represent BMP characters with two bytes, and for those characters the encoded results are exactly the same. The difference is that UCS-2 was designed with only BMP characters in mind and uses a fixed length of 2 bytes, so it cannot represent characters from the other Unicode planes, whereas UTF-16 removes this limitation and supports the full Unicode character set. UTF-16 is a variable-length encoding that uses at least two bytes; to encode a character outside the BMP it needs a pair of code units, four bytes in total. We will not go further here; interested readers can refer to Wikipedia: UTF-16/UCS-2.
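As an illustration (the emoji example is my own, not the author's), a character outside the BMP takes four bytes in UTF-16, while a BMP character still takes two:

```python
ch = "\U0001F600"                       # U+1F600, a character outside the BMP
utf16 = ch.encode("utf-16-be")
print(utf16.hex())                      # d83dde00 -> surrogate pair 0xD83D 0xDE00
print(len(utf16))                       # 4 bytes

print(len("中".encode("utf-16-be")))    # 2 bytes for a BMP character
```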
Windows has used UTF-16 internally since the NT era, and many popular programming platforms, such as .NET, Java, Qt, and Cocoa on the Mac, also use UTF-16 as the basis of their string types. For example, the strings in your code correspond to UTF-16-encoded byte streams in memory.
UTF-8
UTF-8 is probably the most widely used Unicode encoding scheme. Because UCS-2/UTF-16 uses two bytes even for ASCII characters, it is relatively inefficient to store and process; worse, the high byte of a UTF-16-encoded ASCII character is always 0x00, and many C library functions treat that byte as the end of the string, so text could not be parsed correctly. As a result, UTF-16 initially met a lot of resistance in Western countries, which greatly hindered the adoption of Unicode. Later, clever people invented UTF-8 to solve this problem.
The UTF-8 encoding scheme uses 1-4 bytes per character, and the rule is very simple:
- 0x0000 - 0x007F: 0xxxxxxx
- 0x0080 - 0x07FF: 110yyyxx 10xxxxxx
- 0x0800 - 0xFFFF: 1110yyyy 10yyyyxx 10xxxxxx
(x represents the low 8 bits of the Unicode code point, y represents the high 8 bits; code points above 0xFFFF use a 4-byte form.)
ASCII characters are encoded with a single byte, in exactly the same way as ASCII itself, so all existing documents encoded and decoded with ASCII can be used directly as UTF-8. Other characters are represented with 2-4 bytes: the number of leading 1 bits in the first byte tells the decoder how many bytes the character occupies in total, and every following byte starts with 10. For example, if the first byte starts with 1110, the character occupies 3 bytes in total, and it must be combined with the two following bytes starting with 10 to be parsed correctly.
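Below is a small hand-rolled sketch (my own, covering only BMP code points) of the rules above, cross-checked against Python's built-in UTF-8 codec:

```python
def utf8_encode_bmp(cp: int) -> bytes:
    """Encode a BMP code point (cp < 0x10000) to UTF-8 by hand, following the table above."""
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    elif cp < 0x800:     # 2 bytes: 110yyyxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    else:                # 3 bytes: 1110yyyy 10yyyyxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])

assert utf8_encode_bmp(ord("A")) == "A".encode("utf-8")   # b'\x41' -- same as ASCII
assert utf8_encode_bmp(0x4E2D) == "中".encode("utf-8")     # b'\xe4\xb8\xad'
```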
For more information about UTF-8, see Wikipedia: UTF-8.
GB18030
Any encoding that maps Unicode characters to byte streams is a Unicode encoding. China's GB18030 encoding covers all Unicode characters, so it is also a Unicode encoding. However, its encoding method is not like UTF-8 or UTF-16, which convert the Unicode code point by a fixed rule; instead it maps characters to bytes purely by table lookup.
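A brief sketch (my own, relying on Python's built-in gb18030 codec) showing that GB18030 stays byte-compatible with GBK while still covering characters outside the BMP:

```python
# GB18030 keeps the same bytes as GBK for common Chinese characters...
assert "中".encode("gb18030") == "中".encode("gbk") == b"\xd6\xd0"

# ...but it can also encode characters outside the BMP, using 4-byte sequences.
emoji = "\U0001F600"
print(len(emoji.encode("gb18030")))    # 4
```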
For more information about GB18030, refer to: GB18030.
Unicode-Related Frequently Asked Questions
Is Unicode two bytes?
Unicode only defines a huge, universal character set and assigns a unique number to each character; how a character is stored as a byte stream depends on the character encoding scheme. The recommended Unicode encodings are UTF-16 and UTF-8.
What does "UTF-8 with signature" mean?
"With signature" means the byte stream starts with a BOM marker. Many software programs try to "intelligently" detect the character encoding of a byte stream. For efficiency, this detection usually examines only the first few bytes of the stream to see whether they match the patterns of some common encodings. Since UTF-8 and ASCII produce identical bytes for pure English text and cannot be told apart, adding a BOM marker at the beginning of the byte stream tells the software that Unicode encoding is being used, and detection then becomes very reliable. Note, however, that not all software and programs handle the BOM marker correctly. For example, PHP does not detect the BOM and parses it as ordinary bytes, so if your PHP file is saved as UTF-8 with a BOM, problems may occur.
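For illustration (my own example, using Python's utf-8-sig codec), the UTF-8 BOM is the three bytes 0xEF 0xBB 0xBF:

```python
text = "hello"

with_bom = text.encode("utf-8-sig")       # "UTF-8 with signature"
without_bom = text.encode("utf-8")

print(with_bom)                            # b'\xef\xbb\xbfhello'
print(without_bom)                         # b'hello'

# Decoding with 'utf-8-sig' strips the BOM if present.
assert with_bom.decode("utf-8-sig") == text
```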
What is the difference between Unicode encoding and the previous character set encoding?
In the early days, character encoding, character set and code page all meant more or less the same thing. For example, the GB2312 character set, GB2312 encoding and code page 936 are effectively the same thing. Unicode is different: the Unicode character set only defines the set of characters and their unique numbers, while "Unicode encoding" is a collective name for concrete encoding schemes such as UTF-8 and UCS-2/UTF-16; it is not itself a specific encoding scheme. So where a character encoding is required you can write gb2312, codepage936, utf-8 or utf-16, but please do not write "unicode" (I have seen charset=unicode in the meta tag of some web pages; it is meaningless).
Garbled Problem
Garbled text means that the program displays characters that cannot be interpreted in any language; it usually contains lots of ? or �. Almost every computer user has run into garbled text at some point. The cause of garbled text is that the byte stream was decoded with the wrong character encoding. Therefore, whenever you think about any text-display problem, stay alert to one question: which character encoding is being used here? Only then can you correctly analyze and handle garbled text.
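A quick sketch (my own) of how the wrong decoder turns valid bytes into mojibake; the exact garbled output shown in the comment is an assumption based on the cp1252 table:

```python
data = "中文".encode("utf-8")          # b'\xe4\xb8\xad\xe6\x96\x87'

# Decoding the UTF-8 bytes with the wrong encoding produces garbled text.
print(data.decode("cp1252"))           # something like 'ä¸­æ–‡'

# Decoding with the right encoding recovers the original characters.
print(data.decode("utf-8"))            # 中文
```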
Take the most common case, garbled web pages. If you are a website developer and run into this, check the following:
- Whether the Content-Type header returned by the server specifies the character encoding
- Whether the character encoding is specified using the meta HTTP-EQUIV tag in the webpage
- Whether the character encoding used when the webpage file is stored is consistent with the character encoding declared on the webpage
Note: if the wrong character encoding is used when a web page is parsed, it may also cause script or stylesheet errors. For details, refer to my previous articles on script errors caused by document character sets and on encoding problems in ASP.NET pages.
Not long ago I saw feedback on a technical forum: when a WinForm program uses the GetData method of the Clipboard class to read HTML content from the clipboard, garbled text appears. I guess this is also because WinForm does not use the correct character encoding when reading the HTML text. The HTML format of the Windows clipboard only supports UTF-8 encoding, i.e. the text you put in is encoded and decoded as UTF-8. As long as both programs use the Windows clipboard API, there should be no garbled text during copy and paste, unless one side decodes the clipboard data with the wrong character encoding (I did a simple WinForm clipboard experiment: GetData uses the system default encoding rather than UTF-8).
One more thing about the ? and � in garbled text: when a program uses a particular character encoding to parse a byte stream and encounters a byte sequence it cannot parse, it replaces it with ? or �. Therefore, once the final parsed text contains such characters and you no longer have the original byte stream, the correct information is lost for good; no re-encoding of that text can recover it.
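A minimal sketch (my own) of this replacement behaviour and the resulting information loss:

```python
gbk_bytes = "中".encode("gbk")                 # b'\xd6\xd0'

# Decoding GBK bytes as UTF-8 fails; with errors='replace' each bad sequence becomes U+FFFD.
text = gbk_bytes.decode("utf-8", errors="replace")
print(text)                                     # '��'

# The replacement characters no longer carry the original bytes, so the information is gone.
print(text.encode("utf-8"))                     # b'\xef\xbf\xbd\xef\xbf\xbd' -- not b'\xd6\xd0'
```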
Necessary Glossary
Character set: literally, a set of characters. For example, the ASCII character set defines 128 characters, and GB2312 defines 7445 characters. In computer systems, however, a character set more precisely refers to an ordered set of numbered characters (the numbers are not necessarily consecutive).
Code point: the number of a character within a character set. For example, the ASCII character set uses the 128 consecutive numbers 0-127 to represent its characters. The GBK character set numbers each character with a location code: it first defines a 94x94 matrix, where a row is called an "area" and a column a "position", and then places all the Chinese characters of the national standard into the matrix, so that every Chinese character can be identified by a unique "area-position" code. For example, the character "中" sits in area 54, position 48, so its location code is 5448. Unicode divides its character set into 17 planes numbered 0 to 16, and each plane has 2^16 = 65536 code points, so Unicode has 17 x 65536 = 1114112 code points in total.
Encoding: the process of converting characters into a byte stream.
Decoding: the process of parsing a byte stream back into characters.
Character encoding: a concrete scheme that maps the code points of a character set to byte streams. For example, ASCII character encoding specifies that all characters are encoded with the low 7 bits of a single byte: the number of 'A' is 65, represented in a single byte as 0x41, so 01000001 is what gets written to the storage device. GBK encoding adds an offset of 0xA0 (160) to both the area code and the position code of the location code (the offset exists for compatibility with ASCII). For example, the character "中" mentioned above has location code 5448, which is 0x3630 in hexadecimal; adding 0xA0 to both the area byte and the position byte gives 0xD6D0, which is the GBK encoding of "中".
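A small sketch (my own) reproducing this location-code arithmetic and checking it against Python's built-in GBK codec:

```python
# Location code of 中: area 54, position 48.
area, pos = 54, 48

# GBK adds an offset of 0xA0 to both the area byte and the position byte.
gbk_bytes = bytes([area + 0xA0, pos + 0xA0])
print(gbk_bytes.hex())                 # d6d0

# The result matches Python's built-in GBK codec.
assert gbk_bytes == "中".encode("gbk")
```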
Code page: a concrete realization of a character encoding. In the early days there were relatively few characters, so characters were usually mapped directly to byte streams in a table, and encoding and decoding were done by table lookup. Modern operating systems continue this approach. For example, Windows uses code page 936 and the Mac uses the EUC-CN code page to implement the GBK character set encoding; the names differ, but the encoding of any given Chinese character is certainly the same.
Endianness (big-endian and little-endian): the terms come from Gulliver's Travels. An egg has a big end and a small end, and the people of Lilliput argued about which end an egg should be cracked from. Similarly, in computing, when a multi-byte value (a data type represented by several bytes) is stored or transmitted, there are two opinions about whether the high-order byte (big end) or the low-order byte (little end) should come first, and this is the origin of the big-endian and little-endian modes. Both file writing and network transmission are essentially writes to a stream device, and such writes proceed from the low address of the stream to the high address (which fits human intuition). For a multi-byte value, writing the high-order byte first is called big-endian mode, and the opposite is little-endian mode. In other words, in big-endian mode the byte order (from least significant to most significant) runs opposite to the address order of the stream device, while in little-endian mode they are the same. Network protocols are generally transmitted in big-endian mode.
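A small sketch (my own) of the two byte orders, reusing the code point of "中" (0x4E2D) from earlier:

```python
value = 0x4E2D

print(value.to_bytes(2, "big").hex())     # 4e2d -- high byte first (big-endian)
print(value.to_bytes(2, "little").hex())  # 2d4e -- low byte first (little-endian)

# UTF-16's BE/LE variants are exactly this choice applied to each code unit.
assert "中".encode("utf-16-be") == value.to_bytes(2, "big")
assert "中".encode("utf-16-le") == value.to_bytes(2, "little")
```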
--Kevin Yang
Reference links:
- The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
- http://developers.sun.com/dev/gadc/technicalpublications/articles/gb18030.html
- http://en.wikipedia.org/wiki/Universal_Character_Set
- http://en.wikipedia.org/wiki/Code_page