10 minutes to understand character set and character encoding

Original (Chinese): http://cenalulu.github.io/linux/character-encoding/ by Lu Junyi

This article briefly explains the concepts of character set and character encoding, along with some common techniques for diagnosing garbled text.

Background: Character sets and encodings are a headache for IT novices and veterans alike. When a complicated character set is involved and the screen fills with gibberish and garbled text, the root cause is often hard to pin down.

This article gives a short, popular-science introduction to how character sets and encodings work, and then presents some common methods for locating garbled-text faults, so that readers can handle such problems more calmly in the future.

Before the formal introduction, a small disclaimer: if you want very precise definitions of every term, refer to Wikipedia. This article is the author's own digested understanding, rewritten in a form that is easy to follow.

What is a character set

Before introducing character sets, let's first understand why we need them. What we see on a computer screen is readable text, but what is actually stored on the computer's storage media is a binary bit stream. Converting between the two requires a common standard; otherwise the document becomes garbled the moment we plug our USB stick into the boss's computer, or a file a friend sends over QQ turns into gibberish when we open it locally. To provide that common conversion standard, the various character set standards appeared. Simply put, a character set specifies which binary number a given piece of text corresponds to (encoding) and which text a given string of binary values represents (decoding).
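
To make the encode/decode direction concrete, here is a minimal Python sketch of that round trip; the byte values in the comments assume UTF-8.

    # Encoding: text -> binary byte stream stored on disk or sent over the network
    text = "hello, 世界"
    stored = text.encode("utf-8")
    print(stored)                  # b'hello, \xe4\xb8\x96\xe7\x95\x8c'

    # Decoding: binary byte stream -> text shown on screen
    print(stored.decode("utf-8"))  # hello, 世界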

So why are there so many character set standards? The question is actually easy to answer: ask yourself why our power plugs don't fit the sockets in the UK, or why monitors have DVI, VGA, HDMI and DP interfaces all at the same time. Many norms and standards were drawn up before anyone realized they would later need to be universal, or the organization behind them deliberately wanted to set itself apart from existing standards. The result is a pile of standards that serve the same purpose yet are incompatible with each other.

That's enough abstract talk; let's look at a concrete example. Here is the Chinese character 屌, with its hexadecimal and binary encodings under several character sets:

Character set   Hexadecimal encoding   Corresponding binary data
UTF-8           0xE5B18C               1110 0101 1011 0001 1000 1100
UTF-16          0x5C4C                 0101 1100 0100 1100
GBK             0x8CC5                 1000 1100 1100 0101
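
The table can be reproduced with Python's built-in codecs; a small sketch (UTF-16 is encoded big-endian here so that no byte-order mark is prepended):

    ch = "屌"
    print(ch.encode("utf-8").hex())      # e5b18c
    print(ch.encode("utf-16-be").hex())  # 5c4c
    print(ch.encode("gbk").hex())        # 8cc5
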
What is character encoding

A character set is only the name of a collection of rules; a rough real-life analogue is the name of a language, such as English, Chinese or Japanese. For a character set to correctly encode and decode a character, three key elements are needed: the character repertoire, the coded character set, and the character encoding form. The character repertoire is essentially a database of all readable or displayable characters; it determines the range of characters the character set can represent. The coded character set assigns each character a numeric value, its code point, which marks the character's position in the repertoire. The character encoding form defines how those code points are converted into the values that are actually stored. In many cases the code point is simply stored directly as the encoded value. For example, 'A' is the 65th entry in the ASCII table, and the encoded value of 'A' is 0100 0001, which is just decimal 65 written in binary.
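
The ASCII case from the previous paragraph can be checked in a few lines of Python:

    print(ord("A"))                 # 65, the code point of 'A'
    print(format(ord("A"), "08b"))  # 01000001, the stored byte
    print("A".encode("ascii"))      # b'A', i.e. the single byte 0x41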

Reading this, many readers probably have the same question I once had: the character repertoire and the coded character set both look necessary, and since every character in the repertoire already has its own number, why not just store that number directly? Why go to the extra trouble of converting the number into another storage format through a character encoding?

The reason is actually fairly easy to understand. The point of a unified repertoire is to cover every character in the world, but in practice any given text uses only a small fraction of it. A Chinese-language program rarely needs Japanese characters, and for some English-speaking countries a simple ASCII table already meets basic needs. If every character were stored as its ordinal number in the full repertoire, each character would need 3 bytes (given the size of the Unicode repertoire), which is pure overhead for a country that previously used single-byte ASCII: storage volume triples. Put plainly, the same hard disk that could hold 1,500 articles in ASCII could hold only 500 if every character were stored as a 3-byte Unicode ordinal. Hence variable-length encodings such as UTF-8: in UTF-8, an ASCII character still takes only one byte, while more complex characters such as Chinese and Japanese take 2 to 3 bytes.
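
The space saving of a variable-length encoding is easy to observe; a short sketch, assuming UTF-8:

    # ASCII characters still take 1 byte; common CJK characters take 3.
    for ch in ("A", "中", "あ"):
        print(ch, len(ch.encode("utf-8")), "byte(s)")
    # A 1 byte(s)
    # 中 3 byte(s)
    # あ 3 byte(s)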

The relationship between UTF-8 and Unicode

With the two concepts above explained, the relationship between UTF-8 and Unicode is easy to state: Unicode is the coded character set mentioned above, and UTF-8 is a character encoding, i.e. one way of implementing the Unicode character set. As the Internet grew, the need for a single shared repertoire became more and more pressing, and the Unicode standard appeared naturally; it covers almost every symbol and character that can appear in any language and assigns each one a number. See: Unicode on Wikipedia.

Unicode code points run from 0000 to 10FFFF and are divided into 17 planes, each holding 65,536 code points. The three-byte form of UTF-8 described below only covers the first plane (the Basic Multilingual Plane). So although UTF-8 is one of the most widely used character encodings today, an implementation limited to three bytes (such as MySQL's utf8) does not cover the whole Unicode repertoire, which makes special characters hard to handle in some scenarios, as discussed later.
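
A quick way to see the limitation is to look at a character outside the first plane; a small sketch using an emoji:

    ch = "\U0001F601"               # 😁, which lives outside the BMP
    print(hex(ord(ch)))             # 0x1f601 — larger than 0xFFFF
    print(len(ch.encode("utf-8")))  # 4 — needs the 4-byte UTF-8 form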

Introduction to UTF-8 Coding

To make the practical sections that follow easier to understand, here is a brief introduction to how UTF-8 is implemented, i.e. how the bytes physically stored by UTF-8 map to Unicode code points.

UTF-8 is a variable-length encoding whose minimal code unit is a single byte. The leading bits of each byte (0, 10, 110 or 1110) are a descriptive marker, and the remaining bits carry the actual code-point data.

    • If a byte starts with 0, the current character is a single-byte character occupying one byte. The 7 bits after the 0 carry the code point in Unicode.
    • If a byte starts with 110, the current character is a double-byte character occupying 2 bytes. The 5 bits after the 110, together with the payload bits of the following byte, carry the code point in Unicode; the second byte starts with 10.
    • If a byte starts with 1110, the current character is a three-byte character occupying 3 bytes. The 4 bits after the 1110, together with the payload bits of the following two bytes, carry the code point in Unicode; the second and third bytes start with 10.
    • If a byte starts with 10, it is a continuation byte of a multi-byte character. The 6 bits after the 10 contribute to the code point in Unicode.

The pattern of each byte can be seen in the following table, where x marks the code-point bits; concatenating all the x bits of the bytes yields the character's code point in the Unicode repertoire (a small encoder sketch follows the table).

Byte 1      Byte 2      Byte 3
0xxx xxxx
110x xxxx   10xx xxxx
1110 xxxx   10xx xxxx   10xx xxxx
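
The rules above can be written down directly as bit operations. The following is a minimal sketch covering only the 1- to 3-byte forms described here (real UTF-8 also has a 4-byte form for code points above U+FFFF):

    def utf8_encode_bmp(ch: str) -> bytes:
        """Encode a single character with code point <= U+FFFF as UTF-8."""
        cp = ord(ch)
        if cp < 0x80:                        # 0xxxxxxx
            return bytes([cp])
        if cp < 0x800:                       # 110xxxxx 10xxxxxx
            return bytes([0xC0 | (cp >> 6),
                          0x80 | (cp & 0x3F)])
        if cp <= 0xFFFF:                     # 1110xxxx 10xxxxxx 10xxxxxx
            return bytes([0xE0 | (cp >> 12),
                          0x80 | ((cp >> 6) & 0x3F),
                          0x80 | (cp & 0x3F)])
        raise ValueError("4-byte characters are outside this sketch")

    print(utf8_encode_bmp("屌").hex())                 # e5b18c
    assert utf8_encode_bmp("屌") == "屌".encode("utf-8")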

Let's look at three worked UTF-8 encoding examples, using one, two and three bytes respectively:

Character   Code point (hex)   Code point (binary)     UTF-8 (binary)                  UTF-8 (hex)
$           0024               010 0100                0010 0100                       24
¢           00A2               000 1010 0010           1100 0010 1010 0010             C2 A2
€           20AC               0010 0000 1010 1100     1110 0010 1000 0010 1010 1100   E2 82 AC
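
The same three rows can be cross-checked against Python's built-in UTF-8 codec:

    for ch in ("$", "\u00a2", "\u20ac"):     # $, ¢, €
        print(ch, format(ord(ch), "04X"), ch.encode("utf-8").hex().upper())
    # $ 0024 24
    # ¢ 00A2 C2A2
    # € 20AC E282AC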

From this brief introduction, a careful reader can easily derive the following rules:

    • A 3-byte character's UTF-8 hexadecimal encoding always starts with E.
    • A 2-byte character's UTF-8 hexadecimal encoding always starts with C or D.
    • A 1-byte character's UTF-8 hexadecimal encoding always starts with a digit smaller than 8.

Why do garbled characters occur?

First, a bit of trivia: the term used in English for garbled text is mojibake (a loanword from Japanese).

Simply put, garbled text appears when the character sets used for encoding and decoding are different or incompatible. A real-life analogue: an Englishman writes "bless" on a piece of paper to express a blessing (the encoding step). A Frenchman picks up the paper, and because "blesser" in French means to injure, he reads it as meaning injury (the decoding step). That is real-life mojibake. In computing, the classic case is a character encoded with UTF-8 being decoded with GBK. Because the two character sets have different repertoires, and the same Chinese character sits at different positions in the two tables, the decoded result comes out garbled.
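
This exact mismatch can be reproduced in two lines of Python; the output shown in the comment is the garbled string used in the example that follows:

    raw = "很屌".encode("utf-8")   # b'\xe5\xbe\x88\xe5\xb1\x8c'
    print(raw.decode("gbk"))       # 寰堝睂 — the wrong text a GBK viewer would show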

Let's look at an example. Suppose we store the two characters 很屌 using UTF-8; the conversion looks like this:

Character   UTF-8 encoding (hex)
很          E5BE88
屌          E5B18C

So what ends up being stored is the byte string E5BE88E5B18C. If, at display time, it is decoded with GBK instead, a table lookup gives the following:

Two-byte hex value   Character after GBK decoding
E5BE                 寰
88E5                 堝
B18C                 睂

After decoding we get 寰堝睂, which is plainly wrong, and what makes it worse is that even the number of characters has changed: two characters have become three.

How to recover the text that garbled characters were meant to express

To reverse the original text out of garbled characters, you need a solid grasp of each character set's encoding rules, but the principle itself is simple. Using the most common case, UTF-8 content mistakenly displayed as GBK, here is the concrete recovery and identification procedure.

Step 1: Encoding

Suppose we see the garbled text 寰堝睂 on a page, and we know the browser is currently using GBK. The first step is to encode the garbled text back into bytes using GBK. Doing this by manual table lookup is very inefficient, so we can also let a MySQL client do the encoding with the following SQL statement:

    mysql [localhost] {msandbox} > select hex(convert('寰堝睂' using gbk));
    +----------------------------------+
    | hex(convert('寰堝睂' using gbk)) |
    +----------------------------------+
    | E5BE88E5B18C                     |
    +----------------------------------+
    1 row in set (0.01 sec)
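
The same step can be done in Python rather than MySQL, assuming the displayed garbled text really is the GBK rendering shown above:

    garbled = "寰堝睂"
    print(garbled.encode("gbk").hex().upper())   # E5BE88E5B18C
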
Step 2: Identification

We now have the hexadecimal byte string E5BE88E5B18C. Let's split it apart byte by byte.

Byte 1   Byte 2   Byte 3   Byte 4   Byte 5   Byte 6
E5       BE       88       E5       B1       8C

Applying the rules summarized in the UTF-8 coding section above, it is not hard to see that these 6 bytes conform to the UTF-8 encoding rules. If the entire data stream follows this pattern, we can fairly safely assume that the character set used before the text got garbled was UTF-8.
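
A quick programmatic version of this check: if the whole byte stream decodes cleanly as UTF-8, the original character set was very likely UTF-8.

    data = bytes.fromhex("E5BE88E5B18C")
    try:
        data.decode("utf-8")
        print("conforms to the UTF-8 rules")
    except UnicodeDecodeError:
        print("not valid UTF-8 — try another character set")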

Step 3: Decoding

Now we can decode E5BE88E5B18C as UTF-8 and see the text as it was before it got garbled. Again, instead of a manual table lookup, we can get the result directly with SQL:

    mysql [localhost] {msandbox} ((none)) > select convert(0xE5BE88E5B18C using utf8);
    +------------------------------------+
    | convert(0xE5BE88E5B18C using utf8) |
    +------------------------------------+
    | 很屌                               |
    +------------------------------------+
    1 row in set (0.00 sec)
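
Or, equivalently, in one line of Python:

    print(bytes.fromhex("E5BE88E5B18C").decode("utf-8"))   # 很屌
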
Handling a common problem: emoji

Emoji are characters that live in the \u1F601-\u1F64F range of Unicode, which is clearly beyond the \u0000-\uFFFF range that the widely deployed three-byte UTF-8 can represent. With the popularity of iOS and growing support elsewhere, emoji are becoming more and more common.

So what effect do emoji have on everyday development and operations? The most common problem shows up when storing them in a MySQL database. The default character set of a MySQL database is usually configured as utf8 (the three-byte variant); utf8mb4 has been supported since 5.5, but few DBAs proactively change the system default to utf8mb4. So when we try to store a character whose UTF-8 encoding needs 4 bytes, we get an error such as: ERROR 1366: Incorrect string value: '\xF0\x9D\x8C\x86' for column.

If you have read the explanation above carefully, this error is not hard to understand. We tried to insert a string of bytes into a column, and the first byte of that string is \xF0, which means it is the start of a four-byte UTF-8 sequence. When the MySQL table and column character set are configured as utf8 (three bytes at most), such a character cannot be stored, so the statement fails with the error above.

So how do we deal with this? There are two approaches. The first is to upgrade MySQL to 5.6 or later and switch the table character set to utf8mb4. The second is to filter the content before it goes into the database, replacing each emoji with a special text token, and to convert that token back into an emoji when reading from the database or rendering on the front end. Suppose, for the second approach, we replace a 4-byte emoji with a token such as -*-1f601-*-; a concrete Python implementation can be found in the corresponding StackOverflow answer.
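
A rough sketch of the second approach (this is not the StackOverflow code the article refers to, just an illustration of the idea; the helper names and the -*-xxxxx-*- token format are made up here):

    import re

    def encode_emoji(text: str) -> str:
        # Replace every character outside the BMP with a -*-<hex code point>-*- token.
        return re.sub(r"[\U00010000-\U0010FFFF]",
                      lambda m: "-*-%x-*-" % ord(m.group(0)), text)

    def decode_emoji(text: str) -> str:
        # Turn the tokens back into real characters for display.
        return re.sub(r"-\*-([0-9a-f]{5,6})-\*-",
                      lambda m: chr(int(m.group(1), 16)), text)

    s = "nice \U0001F601"
    stored = encode_emoji(s)       # 'nice -*-1f601-*-' — safe for a 3-byte utf8 column
    print(stored)
    print(decode_emoji(stored))    # 'nice 😁'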
