Character sets and character encoding in 10 minutes

Last Update:2015-03-26 Source: Internet

Author: User


What is a character set?

Before introducing character sets, we should first understand why they are needed. What we see on the screen is rendered text, while what is stored on the storage medium is a stream of binary bits. Converting between the two requires an agreed standard. Otherwise, the document on your USB flash drive turns into garbage when you open it on your boss's computer, and the file someone sent you over QQ is unreadable when you open it locally. Character set standards emerged to define this conversion. In short, a character set specifies how a piece of text maps to a binary number (encoding) and how a string of binary numbers maps back to text (decoding).

So why are there so many character set standards? The question is easy to answer. Ask yourself why our plugs do not fit sockets in the UK, or why displays have so many interfaces, such as DVI, VGA, HDMI, and DP. Many standards were drafted without anyone expecting them to be used universally, and some organizations deliberately made theirs different from existing ones to serve their own interests. The result is a pile of standards that do the same job but are incompatible with each other.

With that said, let's look at a practical example. The following table shows the hexadecimal and binary results of encoding the same character, 屌 (U+5C4C), in several encodings. Notice that the three results have almost nothing in common.

Character Set	Hexadecimal Encoding	Corresponding binary data
UTF-8	0xE5B18C	1110 0101 1011 0001 1000 1100
UTF-16	0x5C4C	0101 1100 0100 1100
GBK	0x8CC5	1000 1100 1100 0101
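
The table can be reproduced with a few lines of Python. This is only a sketch: it assumes the character in question is U+5C4C, and it uses the big-endian UTF-16 codec so that no byte order mark is added. The helper name `show` is made up for this example.

```python
# Encode one character under three encodings and print hex and binary forms.
ch = "\u5c4c"  # assumed to be the character from the table (U+5C4C)

def show(encoding):
    """Return (hex string, binary string) for `ch` in the given encoding."""
    raw = ch.encode(encoding)
    hex_form = "0x" + raw.hex().upper()
    # Print each byte as two groups of four bits, as in the table.
    bin_form = " ".join(f"{b >> 4:04b} {b & 0x0F:04b}" for b in raw)
    return hex_form, bin_form

for name in ("utf-8", "utf-16-be", "gbk"):
    print(name, *show(name))
```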

What is character encoding?

A character set is just the name of a set of rules. In real-life terms, a character set is like the name of a language, such as English, Chinese, or Japanese. Correct encoding and decoding with a character set involves three key elements: the character repertoire, the coded character set, and the character encoding form. The character repertoire is a database of all readable or printable characters, and it determines the range of characters the whole character set can represent. The coded character set assigns each character a code point, which indicates the position of that character in the repertoire. The character encoding form converts code points into the values that are actually stored. In simple cases the code point is stored directly as the encoded value. For example, in ASCII, A is at position 65 in the table, and the encoded value of A is 0100 0001, which is decimal 65 written in binary.
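
The ASCII example can be checked directly in Python. In this small sketch, `ord` gives the code point and `encode` gives the stored bytes; the variable names are only illustrative.

```python
code_point = ord("A")          # position of A in the coded character set
stored = "A".encode("ascii")   # the bytes that would actually be written to disk

print(code_point)              # 65
print(f"{stored[0]:08b}")      # 01000001
```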

Many readers may have the same question at this point: the character repertoire and the coded character set are clearly essential, and every character in the repertoire already has its own number. Why not simply store that number? Why convert it into yet another storage format through a character encoding?

The reason is easy to understand. A unified character repertoire aims to cover every character in the world, but in practice any one user needs only a small fraction of it. Programs in Chinese-speaking regions rarely need Japanese characters, and in some English-speaking countries the plain ASCII table covers the basic needs. If every character were stored as its number in the repertoire, each character would need three bytes (taking the Unicode repertoire as an example). For English-speaking users, who previously needed only one byte per character with ASCII, that triples the storage cost: a disk that holds 1,500 articles in ASCII would hold only 500 articles stored as 3-byte Unicode numbers. This is why variable-length encodings such as UTF-8 exist. In UTF-8, an ASCII character still occupies a single byte, while other characters take two or three bytes, with most Chinese and Japanese characters taking three.
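
The storage trade-off can be measured with a short sketch. Python has no fixed 3-byte codec, so UTF-32 (four bytes per character) stands in here for the "store the number directly" approach; that substitution, the function name `size_report`, and the sample strings are assumptions of this example.

```python
def size_report(text):
    """Return byte counts for a fixed-width form (UTF-32) and for UTF-8."""
    return {
        "characters": len(text),
        "utf-32": len(text.encode("utf-32-be")),  # fixed width, 4 bytes each
        "utf-8": len(text.encode("utf-8")),       # variable width
    }

english = "character encoding"
chinese = "字符编码"

print(size_report(english))  # ASCII text: one byte per character in UTF-8
print(size_report(chinese))  # Chinese text: three bytes per character in UTF-8
```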

Relationship between UTF-8 and Unicode

With the two concepts above in place, the relationship between UTF-8 and Unicode is easy to explain. Unicode is the coded character set described above, and UTF-8 is a character encoding, that is, one way of implementing the Unicode rules. As the Internet developed, the need for a single shared character repertoire became more and more urgent, and the Unicode standard emerged in response. It covers nearly all the symbols and scripts that appear in the world's languages and assigns a number to each. For details, see Unicode on Wikipedia.

Unicode code points run from 0000 to 10FFFF and are divided into 17 planes, each containing 65,536 code points. UTF-8 as specified can encode every plane, using up to four bytes per character. However, some implementations that call themselves UTF-8 support only sequences of up to three bytes, which covers just the first plane (the Basic Multilingual Plane). MySQL's utf8 character set, discussed later in this article, is one such case, and this limitation makes some special characters hard to handle in certain scenarios.
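
The plane arithmetic is simple enough to sketch: because each plane holds 0x10000 code points, the plane number is the code point shifted right by 16 bits. The helper name `plane` is made up for this example, and the byte counts come from Python's standard UTF-8 codec.

```python
def plane(ch):
    """Return the Unicode plane number of a single character."""
    return ord(ch) >> 16  # each plane holds 0x10000 code points

for ch in ("A", "\u5c4c", "\U0001F601"):
    size = len(ch.encode("utf-8"))
    print(f"U+{ord(ch):04X} plane={plane(ch)} utf8_bytes={size}")
```

Characters in plane 0 need at most three bytes in UTF-8, while anything in a higher plane needs four.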

Introduction to UTF-8 encoding

To better understand the practical problems discussed later, we briefly introduce how UTF-8 is implemented, that is, how the bytes physically stored as UTF-8 relate to Unicode code points.

UTF-8 is a variable-length encoding whose smallest unit (code unit) is one byte. The first one to five bits of a byte are a descriptive prefix, and the remaining bits carry the actual code point.

If a byte starts with 0, the character is a single-byte character occupying one byte. The 7 bits after the 0 give its number in Unicode.

If a byte starts with 110, the character is a two-byte character occupying two bytes. The 5 bits after the 110 are the high bits of its number in Unicode. The second byte starts with 10.

If a byte starts with 1110, the character is a three-byte character occupying three bytes. The 4 bits after the 1110 are the high bits of its number in Unicode. The second and third bytes start with 10. Characters outside the first plane use a four-byte form whose first byte starts with 11110.

If a byte starts with 10, it is a continuation byte, that is, the second or a later byte of a multi-byte character. The 6 bits after the 10 are part of the number in Unicode.

The layout of each byte is shown in the table below. x marks a bit of the code point; concatenating the x bits of all bytes gives the number in the Unicode repertoire.

Byte 1	Byte 2	Byte 3
0xxx xxxx
110x xxxx	10xx xxxx
1110 xxxx	10xx xxxx	10xx xxxx
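
The bit layouts in the table translate almost line for line into code. The following is a hand-rolled sketch for illustration only (the function name `utf8_encode` is made up, and real code should simply call `str.encode("utf-8")`). It does not reject surrogates or out-of-range values, and it includes the four-byte form that the table omits.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point as UTF-8 bytes (illustrative only)."""
    if code_point < 0x80:        # 0xxx xxxx
        return bytes([code_point])
    if code_point < 0x800:       # 110x xxxx  10xx xxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:     # 1110 xxxx  10xx xxxx  10xx xxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # Four-byte form: 1111 0xxx followed by three 10xx xxxx bytes.
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

print(utf8_encode(0x5C4C).hex().upper())  # E5B18C
```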

Let's look at three examples of UTF-8 encoding, from one byte to three bytes:

Character	Unicode code point (hexadecimal)	Code point in binary	UTF-8 encoding in binary	UTF-8 encoding in hexadecimal
$	0024	010 0100	0010 0100	24
¢	00A2	000 1010 0010	1100 0010 1010 0010	C2 A2
€	20AC	0010 0000 1010 1100	1110 0010 1000 0010 1010 1100	E2 82 AC

Careful readers can easily derive the following rules from the brief introduction above:

The hexadecimal encoding of a three-byte UTF-8 character must start with E.

The hexadecimal encoding of a two-byte UTF-8 character must start with C or D.

The hexadecimal encoding of a one-byte UTF-8 character must start with a digit smaller than 8.
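
These rules amount to looking at the first hexadecimal digit of the lead byte. Here is a minimal sketch with a made-up helper name, `sequence_length`. It is deliberately simplified: it adds the four-byte case (first digit F) and does not exclude the few lead bytes that well-formed UTF-8 never uses, such as C0, C1, and F5 to FF.

```python
def sequence_length(lead_byte):
    """Guess the UTF-8 sequence length from the first hex digit of a byte.

    Returns 0 for a continuation byte (first digit 8 to B).
    """
    high = lead_byte >> 4        # first hexadecimal digit
    if high < 0x8:
        return 1                 # one-byte character
    if high in (0xC, 0xD):
        return 2                 # two-byte character
    if high == 0xE:
        return 3                 # three-byte character
    if high == 0xF:
        return 4                 # four-byte character (not covered by the rules above)
    return 0                     # continuation byte, not a lead byte

print([sequence_length(b) for b in bytes.fromhex("24C2A2E282AC")])
```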

Why does text become garbled?

A bit of trivia first: the English-language term for garbled text is mojibake.

Put simply, garbled text appears because different or incompatible character sets are used for encoding and decoding. A real-life analogy: an Englishman writes the word bless on a piece of paper to express his good wishes (the encoding step). A Frenchman picks up the paper and, because blessé in French means injured, concludes that the writer was talking about an injury (the decoding step). That is mojibake in real life. The same thing happens in computing when a character encoded with UTF-8 is decoded with GBK. The two character sets have different tables, and the same Chinese character sits at a different position in each, so the result is garbled.

Let's look at an example. Suppose we store the two characters 很屌 using UTF-8 encoding. The conversion is as follows:

Character	Hexadecimal after UTF-8 encoding
很	E5BE88
屌	E5B18C

So we get the value E5BE88E5B18C. Now we decode it for display using GBK. Looking up the GBK table gives the following:

Two-byte hexadecimal value	Character after GBK decoding
E5BE	寰
88E5	堝
B18C	睂

After decoding we get the wrong result 寰堝睂. Worse still, even the number of characters has changed.
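
The mistake is easy to reproduce in Python. This sketch assumes the two original characters are U+5F88 and U+5C4C, which is what the UTF-8 bytes E5BE88 and E5B18C decode to.

```python
original = "\u5f88\u5c4c"       # the two characters stored as UTF-8
raw = original.encode("utf-8")  # what is written to disk
garbled = raw.decode("gbk")     # what a reader sees when it assumes GBK

print(raw.hex().upper())             # E5BE88E5B18C
print(len(original), len(garbled))   # 2 3
print(garbled)
```

Six bytes are read as two three-byte characters by UTF-8 but as three two-byte characters by GBK, which is why the character count changes.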

How to identify the text behind garbled characters

To work backward from garbled text to the original, you need a solid understanding of the encoding rules of each character set. The principle itself is simple, though. Here we take the most common case, UTF-8 text wrongly displayed as GBK, to illustrate the process of identifying and reversing the damage.

Step 1: Encoding

Suppose we see garbled text such as 寰堝睂 on a page, and we know that the browser is currently using GBK. The first step is to encode the garbled text back into bytes using GBK. Looking each character up in a table by hand is very inefficient, so we can instead run the following SQL statement in the MySQL client:

  mysql [localhost] {msandbox} > select hex(convert('寰堝睂' using gbk));
  +----------------------------------+
  | hex(convert('寰堝睂' using gbk)) |
  +----------------------------------+
  | E5BE88E5B18C                     |
  +----------------------------------+
  1 row in set (0.01 sec)

Step 2: Recognition

Now we have the encoded byte string E5BE88E5B18C. We split it into bytes:

Byte 1	Byte 2	Byte 3	Byte 4	Byte 5	Byte 6
E5	BE	88	E5	B1	8C

Applying the rules summarized in the UTF-8 section above, it is easy to see that these 6 bytes conform to the UTF-8 encoding rules: each E5 starts a three-byte character and is followed by two continuation bytes. If the whole data stream fits the rules, we can boldly assume that the text was encoded as UTF-8 before it was garbled.
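
The recognition step can be automated by walking the bytes and checking that every lead byte is followed by the right number of continuation bytes. The function below is a simplified sketch with a made-up name, `looks_like_utf8`. It checks only the prefix structure described in this article and does not reject overlong forms or surrogates; a strict `bytes.decode("utf-8")` is the more thorough check.

```python
def looks_like_utf8(data):
    """Return True if `data` has the lead/continuation structure of UTF-8."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            extra = 0            # 0xxx xxxx
        elif 0xC0 <= b < 0xE0:
            extra = 1            # 110x xxxx
        elif 0xE0 <= b < 0xF0:
            extra = 2            # 1110 xxxx
        elif 0xF0 <= b < 0xF8:
            extra = 3            # 1111 0xxx
        else:
            return False         # stray continuation byte or invalid lead byte
        tail = data[i + 1:i + 1 + extra]
        # Every following byte must exist and start with the bits 10.
        if len(tail) != extra or any((t & 0xC0) != 0x80 for t in tail):
            return False
        i += 1 + extra
    return True

print(looks_like_utf8(bytes.fromhex("E5BE88E5B18C")))  # True
```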

Step 3: Decoding

Then we decode E5BE88E5B18C as UTF-8 to see the text as it was before it was garbled. Again, we can get the result directly through SQL instead of looking it up in a table:

  mysql [localhost] {msandbox} (none) > select convert(0xE5BE88E5B18C using utf8);
  +------------------------------------+
  | convert(0xE5BE88E5B18C using utf8) |
  +------------------------------------+
  | 很屌                               |
  +------------------------------------+
  1 row in set (0.00 sec)
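
The three steps can also be chained in Python without a database. This is a sketch under the same assumption as the walkthrough, namely that the text was UTF-8 and was wrongly decoded as GBK. The function name `recover` and its parameters are hypothetical.

```python
def recover(garbled, wrong="gbk", right="utf-8"):
    """Undo one round of mis-decoding.

    Re-encodes the garbled text with the codec that was wrongly used, then
    decodes the bytes with the codec that should have been used. Returns
    None if either step fails, meaning the guess about the codecs was wrong.
    """
    try:
        return garbled.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None

# Simulate the damage, then reverse it.
garbled = "\u5f88\u5c4c".encode("utf-8").decode("gbk")
print(recover(garbled))
```

Recovery only works when the wrong decoding step lost no bytes. If the viewer replaced undecodable bytes with a substitute character, the original cannot be rebuilt this way.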

Troubleshooting common problems with Emoji

Emoji are characters such as those in the U+1F601 to U+1F64F segment of Unicode. This is clearly beyond the U+0000 to U+FFFF range covered by the commonly used three-byte UTF-8 character sets. With the popularity of iOS and its support for them, Emoji are becoming more and more common.

So how do Emoji affect our daily development and operations work? The most common problem arises when storing them in a MySQL database. The default character set of a MySQL database is usually configured as utf8 (three bytes at most). utf8mb4 has been supported since MySQL 5.5, but few DBAs take the initiative to change the system default character set to utf8mb4. So when we try to store a character that needs a four-byte UTF-8 encoding, the database reports an error: ERROR 1366: Incorrect string value: '\xF0\x9D\x8C\x86' for column.

If you have read the explanation above carefully, this error is easy to understand. We tried to insert a string of bytes into a column, and the first byte, \xF0, marks the start of a four-byte UTF-8 sequence. When the character set of the MySQL table and column is configured as utf8, such characters cannot be stored, so an error is reported.
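
Before writing to such a column, an application can check whether a string contains any character that needs four bytes. The sketch below uses a made-up helper name, `needs_four_bytes`, and also decodes the bytes from the error message to show that they form a single character outside the first plane.

```python
def needs_four_bytes(text):
    """Return the characters in `text` whose UTF-8 form is four bytes long."""
    return [ch for ch in text if ord(ch) > 0xFFFF]

# The bytes from the error message are one character above U+FFFF.
offending = b"\xF0\x9D\x8C\x86".decode("utf-8")
print(len(offending), f"U+{ord(offending):X}")   # 1 U+1D306

print(needs_four_bytes("ok \U0001F601"))
```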

How can we solve this problem? The first method is to upgrade MySQL to 5.6 or later and switch the table character set to utf8mb4. The second method is to filter the content before it is stored: replace each Emoji character with a special text encoding, store that in the database, and convert it back to the Emoji when the content is read from the database or displayed on the front end. For example, the four-byte Emoji U+1F601 can be replaced with -*-1F601-*-. For a Python implementation, see the answer on Stack Overflow.
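
The second method might look like the following in Python. This is a sketch, not the Stack Overflow answer referred to above. The exact placeholder format -*-HEX-*- is an assumption based on the example in the text, and the function names `escape_astral` and `unescape_astral` are made up.

```python
import re

# Any character above U+FFFF, which is any character needing four UTF-8 bytes.
_ASTRAL = re.compile("[\U00010000-\U0010FFFF]")
# Placeholder carrying a code point from 10000 to 10FFFF in hexadecimal.
_PLACEHOLDER = re.compile(r"-\*-(10[0-9A-F]{4}|[1-9A-F][0-9A-F]{4})-\*-")

def escape_astral(text):
    """Replace each four-byte character with a text placeholder."""
    return _ASTRAL.sub(lambda m: f"-*-{ord(m.group()):X}-*-", text)

def unescape_astral(text):
    """Turn placeholders back into the original characters."""
    return _PLACEHOLDER.sub(lambda m: chr(int(m.group(1), 16)), text)

stored = escape_astral("hi \U0001F601")
print(stored)                                      # hi -*-1F601-*-
print(unescape_astral(stored) == "hi \U0001F601")  # True
```

One limitation of this design is that user text which already happens to contain a string in the placeholder format would be converted on the way out, so a real implementation would need an escaping rule for that case.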

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
