Character Sets and Character Encoding

This article briefly describes the concepts of character sets and character encoding, along with some common diagnostic techniques for dealing with garbled characters.

Background: Character sets and encodings give headaches to IT novices and seasoned experts alike. When complex character sets, strange "Martian text", and garbled characters appear, the problem is often hard to pin down. This article gives a simple, popular-science introduction to the principles behind character sets and encodings, and then presents some common methods for locating garbled-text faults, so that readers can handle related problems more calmly later on. One small note before we start: if you want very precise definitions of all the terms, refer to Wikipedia. This article is the author's own digested, easy-to-understand explanation.

What is a character set

Before introducing character sets, let's first understand why we need them. What we see on a computer screen is actual text, but what is stored on the computer's storage media is a binary bit stream. Converting between the two requires a unified standard; otherwise, we plug a USB stick into the boss's computer and the document turns into gibberish, or a friend sends us a file over QQ and it is garbled when we open it locally. So, to establish conversion standards, the various character set standards appeared. Simply put, a character set specifies which binary number a given piece of text corresponds to (encoding), and which text a given string of binary values represents (decoding).
So why are there so many character set standards? The question is actually easy to answer: ask yourself why the power plug you use at home cannot be used directly in the UK, or why a monitor has DVI, VGA, HDMI and DP interfaces all at the same time. Many norms and standards were created without any awareness that they would later need to be universal, or the organization behind them deliberately diverged from existing standards for its own interests. The result is a pile of standards that serve the same purpose but are incompatible with each other.
Having said all that, let's look at a practical example. The table below shows the hexadecimal and binary encodings of the character 屌 under several character sets. Gives you a rather 屌 (cool) feeling, doesn't it?

| Character set | Hex encoding | Binary data |
|-|-|-|
| UTF-8 | 0xE5B18C | 1110 0101 1011 0001 1000 1100 |
| UTF-16 | 0x5C4C | 0101 1100 0100 1100 |
| GBK | 0x8CC5 | 1000 1100 1100 0101 |

What is character encoding

A character set is simply the name for a collection of rules; the real-life analogue is that a character set is like the name of a language, such as English, Chinese, or Japanese. For a character set to correctly encode and decode a character, three key elements are needed: the character repertoire (字库表), the coded character set (编码字符集), and the character encoding form (字符编码). The character repertoire is equivalent to a database of all readable or displayable characters; it determines the range of characters the whole character set can represent. The coded character set assigns each character in the repertoire a numeric value, called a code point, that represents its position in the repertoire. The character encoding form defines the conversion between the coded character set and the values actually stored. In general, the code point is stored directly as the encoded value. For example, in ASCII the letter A is number 65 in the table, and the encoded value of A is 0100 0001, which is simply the binary form of the decimal number 65.
Seeing this, many readers may have the same question I originally had: the character repertoire and the coded character set seem to go hand in hand anyway, and since every character in the repertoire already has its own number, why not store that number directly? Why add a seemingly superfluous character encoding step that converts the number into yet another storage format? The reason is actually fairly easy to understand. The goal of a unified repertoire is to cover all of the world's characters, but in practice the proportion of the repertoire that any one text actually uses is very low. For example, programs handling Chinese rarely need Japanese characters, and some English-speaking countries can meet their basic needs with the simple ASCII repertoire alone. If every character were stored as its position in the repertoire, every character would need 3 bytes (taking the Unicode repertoire as an example), which is clearly an extra cost for English-speaking countries that previously needed only one byte per character with ASCII: the storage volume becomes three times the original. Put concretely, the same hard disk that could hold 1,500 articles in ASCII could hold only 500 if every character were stored as a 3-byte Unicode ordinal. Hence variable-length encodings such as UTF-8 appeared: in UTF-8, characters that needed only one byte in ASCII still take only one byte, while more complex characters such as Chinese and Japanese need 2 to 3 bytes.
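
As a quick illustration of the difference between a code point (a position in the coded character set) and the bytes actually stored, and of how UTF-8's variable length saves space, here is a minimal Python sketch (added for illustration; it is not part of the original article):

{% highlight python %}
# Code point (position in the coded character set) vs. stored bytes.
print(ord('A'))                    # 65 -> the code point of 'A'
print(bin(ord('A')))               # 0b1000001 -> the same value as the ASCII byte 0100 0001

# UTF-8 is variable length: ASCII characters still take one byte,
# while CJK characters such as '屌' take three bytes.
print(len('A'.encode('utf-8')))    # 1
print(len('屌'.encode('utf-8')))   # 3
print(hex(ord('屌')))              # 0x5c4c -> the Unicode code point
print('屌'.encode('utf-8').hex())  # e5b18c -> the bytes actually stored
{% endhighlight %}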

The relationship between UTF-8 and Unicode

After the two conceptual explanations above, the relationship between UTF-8 and Unicode is easier to explain. Unicode is a coded character set as described above, while UTF-8 is a character encoding form, i.e. one way of implementing the Unicode character set. With the development of the Internet, the need for a single unified repertoire became more and more urgent, and the Unicode standard appeared naturally. It covers almost every symbol and character that may appear in any language and assigns each of them a number; see Unicode on Wikipedia. Unicode code points run from 0000 to 10FFFF, divided into 17 planes of 65,536 code points each. The three-byte UTF-8 described below covers only the first plane (the Basic Multilingual Plane); the full UTF-8 standard uses up to four bytes per character and covers all of Unicode. UTF-8 is one of the most widely used character encodings today, but its three-byte subset does not cover the whole Unicode repertoire, which makes special characters hard to handle in some scenarios (more on this below).

Introduction to UTF-8 Coding

To better understand the practical applications that follow, let's briefly introduce how UTF-8 encoding is implemented, i.e. the conversion between the physical bytes stored by UTF-8 and the Unicode code point.
UTF-8 is a variable-length encoding whose minimum encoding unit (code unit) is a single byte. The leading bits of each byte form a marker describing the byte's role, and the remaining bits carry the actual code point.

    • If the first bit of a byte is 0, the current character is a single-byte character occupying one byte. The 7 bits after the 0 represent the character's number in Unicode.
    • If a byte starts with 110, the current character is a double-byte character occupying 2 bytes. The 5 bits after the 110, plus the 6 bits after the 10 of the next byte, together represent the character's number in Unicode. The second byte starts with 10.
    • If a byte starts with 1110, the current character is a three-byte character occupying 3 bytes. The 4 bits after the 1110, plus the 6 bits after the 10 of each of the next two bytes (12 bits in total), together represent the character's number in Unicode. The second and third bytes start with 10.
    • If a byte starts with 10, it is a continuation byte of a multi-byte character; the 6 bits after the 10 are combined with the bits from the preceding bytes to form the character's number in Unicode.

The characteristics of each byte are shown in the table below, where x marks the code-point bits; the x bits of all the bytes, concatenated together, form the character's number in the Unicode repertoire.

| Byte 1 | Byte 2 | Byte 3 |
|-|-|-|
| 0xxx xxxx | | |
| 110x xxxx | 10xx xxxx | |
| 1110 xxxx | 10xx xxxx | 10xx xxxx |

Let's look at three UTF-8 encoding examples, using one, two, and three bytes respectively (a short sketch after the table shows how to reproduce these results by hand):

| Actual character | Unicode code point (hex) | Unicode code point (binary) | UTF-8 encoded binary | UTF-8 encoded hex |
|-|-|-|-|-|
| $ | 0024 | 010 0100 | 0010 0100 | 24 |
| ¢ | 00A2 | 000 1010 0010 | 1100 0010 1010 0010 | C2 A2 |
| € | 20AC | 0010 0000 1010 1100 | 1110 0010 1000 0010 1010 1100 | E2 82 AC |
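
To make the bit layout above concrete, here is a minimal Python sketch (illustrative, not from the original article) that encodes a code point below U+10000 into UTF-8 by hand, following exactly the three patterns in the table, and checks the result against Python's built-in encoder:

{% highlight python %}
def utf8_encode_bmp(code_point: int) -> bytes:
    """Encode a code point below U+10000 into UTF-8 using the bit patterns above."""
    if code_point < 0x80:                        # 0xxx xxxx
        return bytes([code_point])
    if code_point < 0x800:                       # 110x xxxx  10xx xxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xE0 | (code_point >> 12),     # 1110 xxxx  10xx xxxx  10xx xxxx
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

for ch in '$¢€':
    manual = utf8_encode_bmp(ord(ch))
    print(ch, hex(ord(ch)), manual.hex(), manual == ch.encode('utf-8'))
# $ 0x24 24 True
# ¢ 0xa2 c2a2 True
# € 0x20ac e282ac True
{% endhighlight %}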

From the simple introduction above, careful readers can easily derive the following rules:

    • The hex encoding of a 3-byte UTF-8 character always starts with E (the first byte is 1110 xxxx, i.e. 0xE0-0xEF).
    • The hex encoding of a 2-byte UTF-8 character always starts with C or D (the first byte is 110x xxxx, i.e. 0xC0-0xDF).
    • The hex encoding of a 1-byte UTF-8 character always starts with a digit smaller than 8 (the first byte is 0xxx xxxx, i.e. below 0x80).

Why do garbled characters occur?

乱码 ("garbled text") is what English often calls mojibake (a transliteration of the Japanese 文字化け).
Simply put, garbled text appears because encoding and decoding use different or incompatible character sets. A real-life analogy: an Englishman writes bless on a piece of paper to express a blessing (the encoding process). A Frenchman picks up the paper; since blesser in French means "to injure", he concludes the writer meant to say he was hurt (the decoding process). That is garbled text in real life. In computing, the typical case is text encoded with UTF-8 being decoded with GBK. Because the two character sets have different repertoires, and the same Chinese character sits at different positions in the two tables, the decoded result is garbage.
Let's look at an example: suppose we store the two characters 很屌 using UTF-8 encoding. The conversion is as follows:

| Character | UTF-8 encoded hex |
|-|-|
| 很 | E5 BE 88 |
| 屌 | E5 B1 8C |

So we end up with the byte string E5BE88E5B18C. If these bytes are then decoded and displayed using GBK, looking up the GBK table gives the following:

| Hex value of two bytes | Character after GBK decoding |
|-|-|
| E5 BE | 寰 |
| 88 E5 | 堝 |
| B1 8C | 睂 |

After decoding we get the wrong result 寰堝睂, and what is worse, even the number of characters has changed (from two to three).
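
The whole mistake can be reproduced in a couple of lines. The following Python sketch (illustrative, not from the original article) encodes the two characters with UTF-8 and then decodes the resulting bytes with GBK:

{% highlight python %}
raw = '很屌'.encode('utf-8')   # b'\xe5\xbe\x88\xe5\xb1\x8c'
print(raw.hex())               # e5be88e5b18c
print(raw.decode('gbk'))       # 寰堝睂 -> three wrong characters instead of two
{% endhighlight %}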

How to identify the text that garbled characters were meant to express

To recover the original, correct text from garbled characters, we need a solid grasp of each character set's encoding rules. The principle itself is simple, though. Here we use the most common case, UTF-8 text mistakenly displayed as GBK, to illustrate the concrete process of working backwards and identifying the original encoding.

Step 1: Encoding

Suppose we see the garbled text 寰堝睂 on a web page and know that our browser is currently using GBK decoding. Step 1 is to encode the garbled text back into its byte representation using GBK. Doing this by looking up tables manually is of course very inefficient; we can let a MySQL client do the encoding work with the following SQL statement:
{% highlight mysql %}
{% raw %}
mysql [localhost] {msandbox} > select hex(convert('寰堝睂' using gbk));
+----------------------------------+
| hex(convert('寰堝睂' using gbk)) |
+----------------------------------+
| e5be88e5b18c                     |
+----------------------------------+
1 row in set (0.01 sec)
{% endraw %}
{% endhighlight %}
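
If no MySQL client is at hand, the same encoding step can be reproduced in Python (an illustrative equivalent, not part of the original article):

{% highlight python %}
# Encode the garbled text back into bytes using GBK, then print the hex.
print('寰堝睂'.encode('gbk').hex())   # e5be88e5b18c
{% endhighlight %}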

Step 2: Identification

Now we have the hexadecimal byte string E5BE88E5B18C. Next, split it apart byte by byte.

| Byte 1 | Byte 2 | Byte 3 | Byte 4 | Byte 5 | Byte 6 |
|-|-|-|-|-|-|
| E5 | BE | 88 | E5 | B1 | 8C |

Now apply the rules summarized in the UTF-8 encoding section above: it is easy to see that these 6 bytes follow the UTF-8 pattern (E5 starts a three-byte sequence, BE and 88 are continuation bytes starting with 10, and likewise for the second group). If the entire data stream conforms to this pattern, we can fairly confidently guess that the character set used before the text became garbled was UTF-8.
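
The same pattern check can be automated. Here is a minimal Python sketch (illustrative, not from the original article) that walks a byte string and applies the lead-byte and continuation-byte rules from the UTF-8 section above; note that it only checks the bit pattern, not full UTF-8 validity:

{% highlight python %}
def looks_like_utf8(data: bytes) -> bool:
    """Check whether a byte stream follows the UTF-8 lead/continuation-byte pattern."""
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:              # 0xxx xxxx -> single-byte character
            n = 0
        elif 0xC0 <= b < 0xE0:    # 110x xxxx -> expects 1 continuation byte
            n = 1
        elif 0xE0 <= b < 0xF0:    # 1110 xxxx -> expects 2 continuation bytes
            n = 2
        elif 0xF0 <= b < 0xF8:    # 1111 0xxx -> expects 3 continuation bytes
            n = 3
        else:                     # a stray 10xx xxxx continuation byte here is invalid
            return False
        if i + n >= len(data):    # sequence truncated at the end of the stream
            return False
        if any(not (0x80 <= c < 0xC0) for c in data[i + 1:i + 1 + n]):
            return False          # every continuation byte must start with 10
        i += n + 1
    return True

print(looks_like_utf8(bytes.fromhex('e5be88e5b18c')))  # True
print(looks_like_utf8(bytes.fromhex('e5be')))          # False (truncated)
{% endhighlight %}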

Step 3: Decoding

Finally, we decode E5BE88E5B18C as UTF-8 to see the text from before it was garbled. Again, we can get the result directly with SQL instead of looking up tables by hand:
{% highlight mysql %}
{% raw %}
mysql [localhost] {msandbox} ((none)) > select convert(0xe5be88e5b18c using utf8);
+------------------------------------+
| convert(0xe5be88e5b18c using utf8) |
+------------------------------------+
| 很屌                               |
+------------------------------------+
1 row in set (0.00 sec)
{% endraw %}
{% endhighlight %}
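
As with step 1, the decoding can also be done in Python instead of MySQL (an illustrative equivalent):

{% highlight python %}
# Decode the recovered byte string as UTF-8 to get the original text.
print(bytes.fromhex('e5be88e5b18c').decode('utf-8'))   # 很屌
{% endhighlight %}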

A common problem: handling emoji

The so-called emoji are characters in the Unicode range U+1F601 to U+1F64F. This clearly exceeds the range U+0000 - U+FFFF that the commonly used three-byte UTF-8 encoding can represent. With the popularity and support of iOS, emoji have become more and more common.

So what effect do emoji have on our day-to-day development and operations? The most common problem occurs when you insert them into a MySQL database. In general, the default character set of a MySQL database is configured as UTF-8 (the three-byte utf8); utf8mb4 is only supported from 5.5 on, and few DBAs actively change the system default character set to utf8mb4. The problem then is that when we insert a character represented by a 4-byte UTF-8 encoding into the database, an error is raised: ERROR 1366: Incorrect string value: '\xF0\x9D\x8C\x86' for column. If you have read the explanations above carefully, this error is not hard to understand: we tried to insert a string of bytes into a column, and the first byte of that string, \xF0, means it begins a four-byte UTF-8 sequence. But when the MySQL table and column character set are configured as utf8 (three bytes), such characters cannot be stored, so the error is reported.
So how do we solve this? There are two ways. The first is to upgrade MySQL to 5.6 or later and switch the table character set to utf8mb4. The second is to filter the content before storing it: replace each emoji character with a special text marker, and convert the marker back into the emoji when reading it from the database or rendering it on the front end. Assuming we use -*-1F601-*- to stand in for the 4-byte emoji, a Python implementation of this second approach can be found in the StackOverflow answer.
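
The following is a minimal Python sketch of that second approach: replace any character outside the Basic Multilingual Plane (i.e. anything that needs 4 bytes in UTF-8) with a textual marker before storing it in a three-byte utf8 column, and restore it when reading back. The -*-1F601-*- marker format follows the article; the function names and the code itself are illustrative and are not the StackOverflow answer referenced above.

{% highlight python %}
import re

def encode_emoji(text: str) -> str:
    """Replace supplementary-plane characters with -*-HEX-*- markers."""
    return re.sub(
        r'[\U00010000-\U0010FFFF]',
        lambda m: '-*-%X-*-' % ord(m.group(0)),
        text,
    )

def decode_emoji(text: str) -> str:
    """Turn -*-HEX-*- markers back into the original characters."""
    return re.sub(
        r'-\*-([0-9A-Fa-f]+)-\*-',
        lambda m: chr(int(m.group(1), 16)),
        text,
    )

if __name__ == '__main__':
    s = 'nice day \U0001F601'
    stored = encode_emoji(s)          # 'nice day -*-1F601-*-' is safe for a utf8 column
    print(stored)
    print(decode_emoji(stored) == s)  # True
{% endhighlight %}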

Reference

[How to configure the Python default character set]({{Site.url}}/python/python-encoding)
Character encoding notes: ASCII, Unicode and UTF-8
Unicode Chinese encoding table
Emoji Unicode table
What every developer should know about encoding
