Character Sets and Character Encodings in 10 Minutes
This document briefly introduces the concepts of character sets and character encodings, along with some common diagnostic techniques for garbled text.
Background: character sets and encodings are a perennial headache for beginners and even seasoned engineers. When faced with a tangle of character sets, unreadable "Martian" glyphs, and garbled output, it is often very hard to pin down where the problem lies.
This article gives a brief introduction to character sets and encodings from first principles, and then presents some common troubleshooting methods for garbled text, so that readers can locate such problems more easily in the future.
One small disclaimer before the formal introduction: if you want a precise definition of every term, refer to Wikipedia. This article is the author's digested, plain-language retelling of those concepts.
What Is a Character Set?
Before introducing character sets, we should first understand why they are needed. What we see on a computer screen is rendered text, while what is stored on the computer's storage media is a binary bit stream. The conversion rules between the two require a unified standard; otherwise, we would plug our USB drive into the boss's computer and the documents would be garbled, or a file sent by a friend over QQ would turn into mojibake when opened locally. To standardize this conversion, various character set standards emerged. In short, a character set specifies the two-way conversion between a piece of text and the binary value that represents it: text to binary is encoding, and binary back to text is decoding.
So why are there so many character set standards? This question is actually easy to answer: ask yourself why a Chinese power plug cannot be used in a UK socket, or why displays have so many connectors, such as DVI, VGA, HDMI, and DP. Many norms and standards did not anticipate at the outset that they would one day need to be universally applicable, or the organizations behind them deliberately wanted to distinguish themselves from existing standards. The result is a pile of standards that serve the same purpose yet are incompatible with each other.
With that said, let's look at a practical example. Below are the hexadecimal and binary encodings of the character 屌 under various encodings. Feeling the awkwardness yet?
| Character set | Hexadecimal encoding | Corresponding binary |
| ------------- | -------------------- | -------------------- |
| UTF-8 | 0xE5B18C | 1110 0101 1011 0001 1000 1100 |
| UTF-16 | 0x5C4C | 0101 1100 0100 1100 |
| GBK | 0x8CC5 | 1000 1100 1100 0101 |
What is character encoding?
A character set is just the name of a set of rules; by analogy with real life, a character set is like the name of a language, for example English, Chinese, or Japanese. For a given character set, correctly encoding and decoding a character requires three key elements: the character repertoire, the coded character set, and the character encoding form. The character repertoire is a database of all readable or printable characters; it determines the full range of characters the character set can represent. The coded character set assigns each character a code point indicating its position in the repertoire. The character encoding form maps code points to the values actually stored. Often the code point is stored directly as the encoded value. For example, in ASCII, A sits at position 65 in the table, so the encoded value of A is 0100 0001, the binary form of decimal 65.
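As a quick illustration (a minimal Python sketch; the variable names are ours, not from the original article), the code point of A really is 65, and ASCII stores that value directly:

```python
# 'A' sits at code point 65 in the coded character set;
# ASCII stores that number directly as the encoded byte value.
ch = "A"
print(ord(ch))                 # code point: 65
print(format(ord(ch), "08b"))  # binary form: 01000001
print(ch.encode("ascii"))      # stored byte: b'A' (0x41)
```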
Many readers may have the same question: the character repertoire and the coded character set seem sufficient on their own. Since every character in the repertoire already has its own number, why not simply store that number? Why introduce a character encoding form that converts the number into yet another storage format?
The reason is also easy to understand: the purpose of a unified repertoire is to cover all the characters in the world, yet in actual use the characters any one program needs are a tiny fraction of the whole table. For example, programs in Chinese-speaking regions almost never need Japanese characters, while in some English-speaking countries even the small ASCII table meets basic needs. If every character were stored as its raw number in the repertoire, each character would require three bytes (taking the Unicode repertoire as an example), which triples the storage cost for English-speaking countries whose text fits in one byte per character under ASCII. The same hard disk that holds 1,500 articles in ASCII would hold only 500 stored as 3-byte Unicode numbers. Hence variable-length encodings such as UTF-8: in UTF-8, an ASCII character occupies only one byte, while complex characters such as Chinese and Japanese take two to three bytes to store.
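The storage-cost argument is easy to verify in practice; here is a small Python sketch (using Python's built-in codecs) comparing byte counts:

```python
# Byte cost per character under different encodings.
print(len("A".encode("utf-8")))      # 1 byte for an ASCII character
print(len("中".encode("utf-8")))     # 3 bytes for a CJK character
print(len("中".encode("utf-16-be"))) # 2 bytes in fixed-width UTF-16 (BMP)
```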
Relationship between UTF-8 and Unicode
With the above two concepts in hand, the relationship between UTF-8 and Unicode is easy to state: Unicode is the coded character set mentioned above, and UTF-8 is a character encoding, i.e., one way of implementing the Unicode rule set. As the Internet developed, the need for a single unified character set grew ever more urgent, and the Unicode standard naturally emerged. It covers nearly all the symbols and scripts that may appear in any language and assigns each of them a number. For details, see Unicode on Wikipedia.
Unicode code points run from 0000 to 10FFFF and are divided into 17 planes of 65,536 code points each. Standard UTF-8 can encode all of them using one to four bytes per character, but many widespread implementations, notably MySQL's utf8 character set, only support sequences of up to three bytes and therefore cover only the first plane (the Basic Multilingual Plane). So although UTF-8 is the most widely adopted character encoding, such implementations do not cover the entire Unicode repertoire, which makes special characters hard to handle in some scenarios (as discussed below).
Introduction to UTF-8 Encoding
To better understand the practical examples later on, let's briefly look at how UTF-8 encoding is implemented, i.e., how UTF-8's physical storage relates to the Unicode code point.
UTF-8 is a variable-length encoding whose minimum unit is one byte. The leading bits of the first byte describe how many bytes the character occupies; the remaining bits carry the actual code point.
- If a byte starts with 0, the current character is a single-byte character occupying one byte. The 7 bits after the 0 carry the Unicode code point.
- If a byte starts with 110, the current character is a double-byte character occupying 2 bytes. The 5 bits after the 110, together with the 6 payload bits of the following byte, carry the code point. The second byte starts with 10.
- If a byte starts with 1110, the current character is a three-byte character occupying 3 bytes. The 4 bits after the 1110, together with the payload bits of the following two bytes, carry the code point. The second and third bytes start with 10.
- If a byte starts with 10, it is a continuation byte of a multi-byte character. The 6 bits after the 10 carry part of the code point.
The exact layout of each byte is shown in the table below: x marks the payload bits, and concatenating all the x bits of a character's bytes yields its code point in the Unicode repertoire.
| Byte 1 | Byte 2 | Byte 3 |
| --------- | --------- | --------- |
| 0xxx xxxx | | |
| 110x xxxx | 10xx xxxx | |
| 1110 xxxx | 10xx xxxx | 10xx xxxx |
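The bit patterns in the table can be turned into a tiny encoder. The following Python sketch (our own illustration, limited to code points below U+10000) builds the bytes exactly as described:

```python
def utf8_encode(cp: int) -> bytes:
    """Encode a Unicode code point (BMP only) using the bit patterns above."""
    if cp < 0x80:                               # 0xxx xxxx
        return bytes([cp])
    if cp < 0x800:                              # 110x xxxx  10xx xxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    return bytes([0xE0 | (cp >> 12),            # 1110 xxxx  10xx xxxx  10xx xxxx
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

print(utf8_encode(0x5C4C).hex())  # e5b18c, matching the earlier table
```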
Let's look at three examples of UTF-8 encoding, ranging from one byte to three bytes:
| Actual character | Unicode code point (hex) | Code point (binary) | UTF-8 encoding (binary) | UTF-8 encoding (hex) |
| ---------------- | ------------------------ | ------------------- | ----------------------- | -------------------- |
| $ | 0024 | 010 0100 | 0010 0100 | 24 |
| ¢ | 00A2 | 000 1010 0010 | 1100 0010 1010 0010 | C2 A2 |
| € | 20AC | 0010 0000 1010 1100 | 1110 0010 1000 0010 1010 1100 | E2 82 AC |
A careful reader can draw the following rules from the brief introduction above:
- The hexadecimal encoding of a three-byte UTF-8 character always starts with E.
- The hexadecimal encoding of a two-byte UTF-8 character always starts with C or D.
- The hexadecimal encoding of a single-byte UTF-8 character always starts with a hex digit smaller than 8.
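These rules can be checked directly in Python (a small sketch using the three characters from the earlier table):

```python
# The leading hex digit of the first UTF-8 byte reveals the sequence length.
for ch in ("$", "¢", "€"):
    first_byte = ch.encode("utf-8")[0]
    print(ch, hex(first_byte))
# $ starts with 0x2 (below 8): one byte
# ¢ starts with 0xC: two bytes
# € starts with 0xE: three bytes
```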
Why Does Garbled Text Appear?
Mojibake, a loanword from Japanese, is the standard English term for garbled text.
Put simply, garbled text appears when different or incompatible character sets are used for encoding and decoding. A real-life analogy: an Englishman writes bless on a piece of paper to convey a blessing (the encoding process). A French reader picks up the paper and, because bless evokes the French blesser, "to injure", concludes the writer means injury (the decoding process). That is a real-life mojibake scenario. The same happens in computing when, say, a string encoded with UTF-8 is decoded with GBK. Because the two character sets have different repertoires, and the same Chinese character sits at different positions in each, the result is garbled text.
Let's look at an example. Suppose we use UTF-8 to store the two characters 很屌; the following conversion takes place:
| Character | UTF-8 encoding (hex) |
| --------- | -------------------- |
| 很 | E5BE88 |
| 屌 | E5B18C |
So we end up with the byte sequence E5BE88E5B18C. If we now decode it with GBK for display, a table lookup gives:
| Two-byte hex value | Character after GBK decoding |
| ------------------ | ---------------------------- |
| E5BE | 寰 |
| 88E5 | 堝 |
| B18C | 眮 |
After decoding, we get a nonsensical result, and worse still, even the number of characters has changed.
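The whole mishap can be reproduced in a few lines of Python (a sketch; the character-count check mirrors the tables above):

```python
# Encode two characters with UTF-8, then (wrongly) decode the bytes as GBK.
raw = "很屌".encode("utf-8")
print(raw.hex().upper())      # E5BE88E5B18C
garbled = raw.decode("gbk")   # six bytes are regrouped into three GBK pairs
print(len(garbled))           # 3 characters instead of the original 2
```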
How to Identify Garbled Text
Reversing the original text out of garbled characters takes a solid understanding of each character set's encoding rules, but the principle itself is simple. Here we use the most common case, UTF-8 content mistakenly displayed as GBK, to illustrate the concrete recovery and identification process.
Step 1: Encoding
Suppose we see garbled text like 寰堝眮 on a page, and we know the browser is currently using GBK. The first step is to encode the garbled characters back into bytes with GBK. Looking each one up in a table is tedious; we can instead use the MySQL client to do the encoding with the following SQL:
```
mysql [localhost] {msandbox} > select hex(convert('寰堝眮' using gbk));
+-------------------------------------+
| hex(convert('寰堝眮' using gbk))    |
+-------------------------------------+
| E5BE88E5B18C                        |
+-------------------------------------+
1 row in set (0.01 sec)
```
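If no MySQL client is at hand, the same step can be done in Python (a sketch; we rebuild the garbled string from its raw bytes so the example is self-contained):

```python
# Recover the raw bytes behind garbled text by re-encoding it with GBK.
raw = bytes.fromhex("E5BE88E5B18C")
garbled = raw.decode("gbk")                 # what a GBK-configured page displays
print(garbled.encode("gbk").hex().upper())  # E5BE88E5B18C
```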
Step 2: Recognition
Now we have the raw byte string E5BE88E5B18C. Let's split it byte by byte:
| Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
| ------ | ------ | ------ | ------ | ------ | ------ |
| E5 | BE | 88 | E5 | B1 | 8C |
Applying the rules summarized in the UTF-8 section above, it is not hard to see that these 6 bytes match the UTF-8 three-byte pattern (an E-led byte followed by two continuation bytes, twice). If the entire data stream fits the rules, we can reasonably assume that the character set used before the text was garbled is UTF-8.
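In practice this recognition step can be automated: a strict UTF-8 decode fails on any byte stream that violates the rules. A minimal Python sketch (the function name is ours):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Return True if the byte stream obeys the UTF-8 rules above."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8(bytes.fromhex("E5BE88E5B18C")))  # True
print(looks_like_utf8(bytes.fromhex("E5BE")))          # False: truncated sequence
```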
Step 3: Decoding
We can now decode E5BE88E5B18C as UTF-8 to see the text from before the garbling. Again, instead of a table lookup we can get the result directly via SQL:
```
mysql [localhost] {msandbox} ((none)) > select convert(0xE5BE88E5B18C using utf8);
+------------------------------------+
| convert(0xE5BE88E5B18C using utf8) |
+------------------------------------+
| 很屌                               |
+------------------------------------+
1 row in set (0.00 sec)
```
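The decoding step has a one-line Python equivalent (sketch):

```python
# Interpret the recovered bytes as UTF-8 to restore the original text.
print(bytes.fromhex("E5BE88E5B18C").decode("utf-8"))  # 很屌
```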
Troubleshooting Common Emoji Problems
Emoji are characters in the U+1F601–U+1F64F range of Unicode, which clearly falls outside the U+0000–U+FFFF range that the commonly used three-byte utf8 (as in MySQL) can store. With the popularity of iOS and its built-in support, Emoji are becoming more and more common.
So how do Emoji affect our daily development and operations? The most common problem is storing them in a MySQL database. A MySQL database's default character set is usually configured as utf8 (three bytes); utf8mb4 has only been supported since 5.5, and few DBAs proactively change the system default to utf8mb4. Then the problem comes: when we try to store a character whose UTF-8 encoding requires 4 bytes, we hit an error like ERROR 1366: Incorrect string value: '\xF0\x9D\x8C\x86' for column.
If you read the explanation above carefully, this error is not hard to understand. We tried to insert the character 𝌆 (U+1D306) into a column; its first byte \xF0 marks the start of a four-byte UTF-8 sequence. But with the MySQL table and column character sets configured as three-byte utf8, such a character cannot be stored, so an error is raised.
How can we solve this problem? The first method is to upgrade MySQL to 5.6 or later and switch the table's character set to utf8mb4. The second method is to filter the content before it reaches the database, replacing each Emoji character with a special textual marker (for example, replacing a four-byte Emoji with -*-1F601-*-), and converting the markers back to Emoji when reading from the database or rendering on the front end. For a Python implementation of the second method, see the answer on Stack Overflow.
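A minimal Python sketch of the second workaround (the -*-XXXX-*- marker format and the function names here are illustrative, not the Stack Overflow answer itself):

```python
import re

def escape_supplementary(s: str) -> str:
    """Replace any character outside the BMP with a -*-<hex code point>-*- marker."""
    return "".join(f"-*-{ord(c):X}-*-" if ord(c) > 0xFFFF else c for c in s)

def unescape_supplementary(s: str) -> str:
    """Turn the markers back into the original characters."""
    return re.sub(r"-\*-([0-9A-F]+)-\*-",
                  lambda m: chr(int(m.group(1), 16)), s)

text = "hello \U0001F601"
stored = escape_supplementary(text)            # safe for a 3-byte utf8 column
print(stored)                                  # hello -*-1F601-*-
print(unescape_supplementary(stored) == text)  # True
```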