Character Set---Go

Source: Internet
Author: User

Character set: Put simply, a character set specifies which binary number a given character is stored as (encoding) and which character a given string of binary values represents (decoding). A character set is just the name of a collection of rules, comparable to a language such as English or Chinese. To encode and decode a character correctly, a character set needs three key elements: the font table, the coded character set, and the character encoding.

The same character, 屌 (U+5C4C), under three character sets:

Character set   Hexadecimal encoding   Corresponding binary data
UTF-8           0xe5b18c               1110 0101 1011 0001 1000 1100
UTF-16          0x5c4c                 0101 1100 0100 1100
GBK             0x8cc5                 1000 1100 1100 0101
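The rows of this table can be reproduced with a short Python sketch. It assumes the character in question is U+5C4C, which is what the UTF-16 value 0x5c4c implies, and it relies on Python's built-in codecs; the helper name `show` is made up for illustration.

```python
# Encode one character under three character sets and print hex and binary.
# Assumption: the character is U+5C4C, as the UTF-16 value in the table implies.
ch = chr(0x5C4C)

def show(name, codec):
    """Print one table row and return the encoded bytes."""
    data = ch.encode(codec)
    hex_part = "0x" + data.hex()
    # Split every byte into two 4-bit groups, as the table does.
    bin_part = " ".join(f"{(b >> 4):04b} {(b & 0x0F):04b}" for b in data)
    print(f"{name:<7} {hex_part:<10} {bin_part}")
    return data

utf8_bytes = show("UTF-8", "utf-8")
utf16_bytes = show("UTF-16", "utf-16-be")  # big-endian, no byte order mark
gbk_bytes = show("GBK", "gbk")
```

The same code point produces three different byte sequences, which is why bytes written under one character set turn into garbage when read under another.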

Font table (character repertoire): Effectively a database of all readable or displayable characters. It determines the range of characters that the whole character set can represent.

Coded character set: Assigns each character a coded value, the code point, which indicates the position of that character in the font table.

Character encoding: The conversion relationship between the coded character set and the values actually stored. In the simplest case the code point value is stored directly as the encoded value; an encoding such as UTF-8 instead transforms the code point into a different byte sequence.

With these terms in place, the relationship between UTF-8 and Unicode is straightforward. Unicode is the coded character set described above, and UTF-8 is a character encoding, that is, one way of implementing the Unicode rules. As the Internet developed, the need for a single shared font table became more and more urgent, and the Unicode standard emerged naturally. It covers almost all the symbols and characters that may appear in the languages of every country and assigns each of them a number.
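To make the split between code point (Unicode) and stored bytes (UTF-8) concrete, here is a minimal sketch of the UTF-8 bit layout written by hand. The function name `utf8_encode` is an illustration only; it skips validation (surrogates, values above U+10FFFF), so real code should use the built-in `str.encode("utf-8")`.

```python
def utf8_encode(code_point):
    """Encode one Unicode code point into UTF-8 bytes (no validation)."""
    if code_point < 0x80:            # 1 byte:  0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:           # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point < 0x10000:         # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

encoded = utf8_encode(0x5C4C)
print(encoded.hex())
```

The code point 0x5C4C is the Unicode number; the three bytes that come out are only one possible storage form of that number.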

How to recover the intended text from garbled characters

To work back from garbled characters to the original, correct text, you need a solid grasp of the encoding rules of each character set. The principle, however, is very simple. Here we take the most common case, UTF-8 text mistakenly displayed as GBK, as an example to illustrate the recovery and identification process.

Step 1: Encoding

Suppose we see the garbled text 寰堝睂 on a page and know that our browser is currently using GBK encoding. In the first step we encode the garbled text back into its binary representation using GBK. Looking the values up in a code table by hand is very inefficient, so we can instead run the following SQL statement in the MySQL client to do the encoding:

mysql [localhost] {msandbox} > select hex(convert('寰堝睂' using gbk));
+-------------------------------------+
| hex(convert('寰堝睂' using gbk))    |
+-------------------------------------+
| E5BE88E5B18C                        |
+-------------------------------------+
1 row in set (0.01 sec)

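The same encoding step can be sketched in Python without a database. This assumes Python's built-in `gbk` codec maps these characters the same way MySQL's `gbk` character set does; the variable names are illustrative.

```python
# The text exactly as the GBK-configured browser displays it.
garbled = "寰堝睂"

# Encoding it with GBK recovers the raw bytes the page actually contained.
raw = garbled.encode("gbk")
hex_string = raw.hex().upper()
print(hex_string)
```

If the codecs agree, the printed value matches the hexadecimal string returned by the MySQL statement.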
Step 2: Identification

Now we have the encoded hexadecimal string E5BE88E5B18C. Next we split it into bytes.

Byte 1   Byte 2   Byte 3   Byte 4   Byte 5   Byte 6
E5       BE       88       E5       B1       8C

Applying the UTF-8 encoding rules, it is easy to see that these 6 bytes conform to UTF-8: E5 (1110 0101) announces a three-byte sequence and is followed by two continuation bytes of the form 10xxxxxx, and the same pattern repeats for the next three bytes. If the entire data stream conforms to these rules, we can boldly assume that the character encoding before the text was garbled was UTF-8.
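The identification step can be sketched as a structural check on the lead and continuation bytes. The function name `looks_like_utf8` is hypothetical, and the check is deliberately loose: it does not reject overlong forms or surrogates, so in practice a strict `data.decode("utf-8")` inside a try/except is the more reliable test.

```python
def looks_like_utf8(data):
    """Return True if every byte fits the UTF-8 lead/continuation pattern."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:
            extra = 0            # 0xxxxxxx: single byte
        elif 0xC0 <= lead < 0xE0:
            extra = 1            # 110xxxxx: one continuation byte follows
        elif 0xE0 <= lead < 0xF0:
            extra = 2            # 1110xxxx: two continuation bytes follow
        elif 0xF0 <= lead < 0xF8:
            extra = 3            # 11110xxx: three continuation bytes follow
        else:
            return False         # stray continuation byte or invalid lead
        tail = data[i + 1:i + 1 + extra]
        # Every continuation byte must look like 10xxxxxx.
        if len(tail) != extra or any((b & 0xC0) != 0x80 for b in tail):
            return False
        i += 1 + extra
    return True

sample = bytes.fromhex("E5BE88E5B18C")
print(looks_like_utf8(sample))
```

A two-byte GBK value such as 8C C5 fails this check, because 8C cannot start a UTF-8 sequence.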

Step 3: Decoding

Then we can decode E5BE88E5B18C as UTF-8 to see the text as it was before it was garbled. Again, we can get the result directly from SQL without looking anything up in a table:

mysql [localhost] {msandbox} ((none)) > select convert(0xE5BE88E5B18C using utf8);
+------------------------------------+
| convert(0xE5BE88E5B18C using utf8) |
+------------------------------------+
| 很屌                               |
+------------------------------------+
1 row in set (0.00 sec)

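As a sketch of the decoding step outside MySQL, the bytes can be decoded with Python's built-in UTF-8 codec. The second half reproduces the garbling and undoes it, which is the whole three-step procedure in two lines; the variable names are illustrative.

```python
# Decode the recovered bytes as UTF-8 to get the original text.
raw = bytes.fromhex("E5BE88E5B18C")
original = raw.decode("utf-8")
print(original)

# Reproduce the accident: UTF-8 bytes read as GBK become garbled text.
garbled = original.encode("utf-8").decode("gbk")
# Undo it: re-encode as GBK (step 1), then decode as UTF-8 (step 3).
recovered = garbled.encode("gbk").decode("utf-8")
print(garbled, recovered)
```

This only works when the wrong decoding was lossless. If the display layer replaced undecodable bytes with a placeholder character, the original bytes are gone and cannot be recovered this way.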
Handling a common problem: emoji

Emoji are characters in the Unicode range U+1F601 to U+1F64F. This clearly lies beyond the range U+0000 to U+FFFF, which is all that a UTF-8 sequence of at most three bytes can encode. With the popularity of iOS and its support for them, emoji are becoming more and more common.

So what effect do emoji have on everyday development and operations? The most common problem appears when you store them in a MySQL database. In general, the default character set of a MySQL database is configured as utf8 (at most three bytes per character). utf8mb4 has been supported since 5.5, but few DBAs actively change the system default character set to utf8mb4. The problem then shows up when we store a character that needs a 4-byte UTF-8 encoding, and the database reports: ERROR 1366: Incorrect string value: '\xF0\x9D\x8C\x86' for column . If you have read the explanation above carefully, this error is not hard to understand. We tried to insert a string of bytes into a column, and the first byte, \xF0, marks the start of a four-byte UTF-8 sequence. When the MySQL table and column character set are configured as utf8, such characters cannot be stored, so the error is reported.
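A small sketch shows why the bytes in that error message are rejected: they decode to a code point above U+FFFF, which takes four bytes in UTF-8. The helper name `needs_utf8mb4` is an assumption for illustration and not part of any MySQL client library.

```python
def needs_utf8mb4(text):
    """Return True if any character needs four bytes in UTF-8."""
    return any(ord(ch) > 0xFFFF for ch in text)

# The bytes quoted in the error message.
rejected = b"\xF0\x9D\x8C\x86".decode("utf-8")
print(hex(ord(rejected)), len(rejected.encode("utf-8")))
print(needs_utf8mb4(rejected), needs_utf8mb4("plain text"))
```

A check like this can run before an insert to decide whether a value is safe for a three-byte utf8 column.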

So how do we deal with this? There are two approaches. The first is to upgrade MySQL to 5.6 or later and switch the table character set to utf8mb4. The second is to filter the content before storing it in the database, replacing each emoji character with a special text encoding and storing that instead; when the data is read back from the database or displayed on the front end, the special text encoding is converted back to the emoji. For the second approach, suppose we use -*-1F601-*- to replace the 4-byte emoji; a Python implementation can be found in an answer on StackOverflow.
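The second approach might look like the following minimal sketch. It is not the StackOverflow answer mentioned above; the function names `escape_emoji` and `unescape_emoji` and the regular expressions are assumptions made for illustration, using the -*-HEX-*- placeholder form from the text.

```python
import re

# Any character outside the Basic Multilingual Plane (four bytes in UTF-8).
_FOUR_BYTE = re.compile("[\U00010000-\U0010FFFF]")
# The placeholder form -*-HEX-*- with 5 or 6 uppercase hex digits.
_PLACEHOLDER = re.compile(r"-\*-([0-9A-F]{5,6})-\*-")

def escape_emoji(text):
    """Replace every four-byte character with a -*-HEX-*- placeholder."""
    return _FOUR_BYTE.sub(lambda m: "-*-%X-*-" % ord(m.group()), text)

def _restore(match):
    value = int(match.group(1), 16)
    # Leave the text untouched if the number is not a valid code point.
    return chr(value) if value <= 0x10FFFF else match.group(0)

def unescape_emoji(text):
    """Turn -*-HEX-*- placeholders back into the original characters."""
    return _PLACEHOLDER.sub(_restore, text)

message = "hi " + chr(0x1F601)
stored = escape_emoji(message)      # safe for a three-byte utf8 column
restored = unescape_emoji(stored)   # what the front end displays
print(stored)
```

One design caveat: user text that already contains something shaped like -*-1F601-*- would also be converted on the way out, so a real implementation needs an escaping rule for the placeholder itself.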
