The ultimate success story of MySQL Chinese garbled characters

Source: Internet
Author: User
Tags character set error code mysql client

Causes of MySQL garbled characters

To understand why garbled characters occur, we must first understand the process of initiating a request from the client, storing data in MySQL, and then retrieving the client from the table next time, which links are encoded/decoded. To better explain this process, the blogger has created two flowcharts, which correspond to the two stages of storage and retrieval.

Encoding and conversion process stored in MySQL


The preceding figure shows three encoding/decoding processes (red arrows ). The three red arrows correspond to the Client encoding, MySQL Server decoding, and Client encoding to table encoding respectively. Terminal can be a Bash, a web page, or an APP. In this article, we assume that Bash is our Terminal, that is, the user input and display interface. The behavior of each box in the figure is as follows:

Use input method in terminal

Terminal is converted to a binary stream based on character encoding.

Binary data is transmitted to the MySQL Server through the MySQL client.

Server decoding through character-set-client

Determine whether character-set-client is consistent with the charset of the target table

If they are inconsistent, a character encoding conversion from client-charset to table-charset is performed.

Store the converted character encoding binary stream to a file.

The encoding and conversion process for extracting data from the MySQL table




The preceding figure shows three encoding/decoding processes (red arrows ). The three red arrows in the preceding figure correspond to client decoding. MySQL Server converts table encoding to character-set-client encoding based on character-set-client encoding.

Read binary data streams from files

Decoding using table character set encoding

Convert data to character-set-client encoding

Use character-set-client to encode the binary stream

The Server transmits data to the remote client over the network.

The client uses the character encoding configured in bash to display the query results.

Causes of MySQL garbled characters

1. The encoding of the corresponding link during storage and retrieval is inconsistent


This will cause garbled characters to be obvious. The character sets used for the three encoding/decoding in the stored phase are C1, C2, and C3 (Figure 1 shows from left to right). The three character sets are numbered C1' and C2' in sequence ', c3' (left to right ). So when the bash C1 is stored in the UTF-8 encoding, take out, C1 'we use the windows Terminal (default is GBK encoding), then the result is almost certainly garbled. Or set names utf8 (C2) when saving MySQL, but set names gbk (C2 ') is used during the extraction, so the result must be garbled.

2. The three-step encoding in a single process is inconsistent

That is, in the three steps in the same direction in any of the above images, the encoding/decoding error may occur if the two or more steps are inconsistent. If the two different character sets cannot undergo lossless encoding conversion (detailed below), garbled characters will occur. For example, if our shell is UTF8 encoded, MySQL's character-set-client is configured as GBK, and the table structure is charset = utf8, there will be no doubt there will be garbled characters.

Here we will give a simple demonstration of this situation:



Master [localhost] {msandbox} (test)> create table charset_test_utf8 (id int primary key auto_increment, char_col varchar (50) charset = utf8;
Query OK, 0 rows affected (0.04 sec)

Master [localhost] {msandbox} (test)> set names gbk;
Query OK, 0 rows affected (0.00 sec)

Master [localhost] {msandbox} (test)> insert into charset_test_utf8 (char_col) values ('China ');
Query OK, 1 row affected, 1 warning (0.01 sec)

Master [localhost] {msandbox} (test)> show warnings;
+ --------- + ------ + ----------------------------------------------------------------------------- +
| Level | Code | Message |
+ --------- + ------ + ----------------------------------------------------------------------------- +
| Warning | 1366 | Incorrect string value: '\ xAD \ xE6 \ x96 \ x87' for column 'char _ col' at row 1 |
+ --------- + ------ + ----------------------------------------------------------------------------- +
1 row in set (0.00 sec)

Master [localhost] {msandbox} (test)> select id, hex (char_col), char_col from charset_test_utf8;
+ ---- + ---------------- + ---------- +
| Id | hex (char_col) | char_col |
+ ---- + ---------------- + ---------- +
| 1 | E6B6933FE69E83 | ???? |
+ ---- + ---------------- + ---------- +
1 row in set (0.01 sec)

MySQL encoding/decoding

Since the binary stream is transmitted between systems, it is good to directly store the binary stream into the table file. Why do I need to perform two codec operations before storage?

The reason for the Client-to-Server encoding and decoding is that MySQL needs to perform syntax and lexical parsing on the sent binary stream. Without encoding parsing and verification, we cannot even know whether the binary stream is insert or update.

The codec of File to Engine is used to know the word segmentation in the binary stream. For example, we want to extract the first two characters of a field from the table and execute a statement in the form of select left (col, 2) from table, the value of the column read by the storage engine from the file is E4B8ADE69687. At this time, if we divide the value into E4B8, ADE6, and 9687 words by GBK, then the returned client value should be E4B8ADE6; if we divide the value into E4B8AD by UTF8, e69687, the E4B8ADE69687 character should be returned. It can be seen that character-level operations cannot be performed within the storage engine without encoding/decoding after data is read from a data file.

Error in/out

The most common cause of gibberish in MySQL is to make the mistake a myth. The so-called error code is that the character encoding format of the client (web or shell) is different from that of the final table, however, as long as the two character sets are consistent, the output without garbled characters can still be obtained. However, the error-in and error-out are not valid for the combination of any two character set encodings. Assume that the client encoding is C and the MySQL table character set encoding is S. In order to be able to return errors, the following two conditions must be met:

When MySQL receives a request, the binary stream encoded from C can be lossless when it is decoded by S.

MySQL returns data that is lossless when the binary stream encoded from S is decoded by C.

Lossless encoding conversion

So what is lossy conversion and lossless conversion? Suppose we want to convert the character X represented by encoding A into the expression of encoding B, while the font set of encoding B does not contain the character X, in this case, the conversion is lossy. So why is there a difference between two encodings that can represent character set combination? If you have read the previous article about character set and character encoding, or have a basic understanding of character encoding, you should know that each character set supports a limited number of characters, in addition, there are differences between the texts covered by each character set. UTF8 and GBK can represent the following number of characters:

The value range of a single GBK character encoding is: 8140-FEFE, excluding ** 7E. The total number of characters is about 27000.

After UTF-8 is encoded as a single character, the values vary according to the number of bytes, as shown in the following table:




Because UTF-8 encoding can represent more than GBK characters. Then we can easily find a lossy conversion from UTF8 to GBK. We use the character (see the figure below) to find a character that is obviously not in the GBK encoding table and try to store it in the GBK encoding table. And view the lossy conversion behavior again.

The specific character information is :? Gurmukhi letter a Unicode: U + 0A05, UTF-8: E0 A8 85


The storage in MySQL is as follows:

Master [localhost] {msandbox} (test)> create table charset_test_gbk (id int primary key auto_increment, char_col varchar (50) charset = gbk;
Query OK, 0 rows affected (0.00 sec)

Master [localhost] {msandbox} (test)> set names utf8;
Query OK, 0 rows affected (0.00 sec)

Master [localhost] {msandbox} (test)> insert into charset_test_gbk (char_col) values ('? ');
Query OK, 1 row affected, 1 warning (0.01 sec)

Master [localhost] {msandbox} (test)> show warnings;
+ --------- + ------ + ------------------------------------------------------------------------- +
| Level | Code | Message |
+ --------- + ------ + ------------------------------------------------------------------------- +
| Warning | 1366 | Incorrect string value: '\ xE0 \ xA8 \ x85' for column 'char _ col' at row 1 |
+ --------- + ------ + ------------------------------------------------------------------------- +
1 row in set (0.00 sec)

Master [localhost] {msandbox} (test)> select id, hex (char_col), char_col, char_length (char_col) from charset_test_gbk;
+ ---- + --------------- + ---------- + ----------------------- +
| Id | hex (char_col) | char_col | char_length (char_col) |
+ ---- + --------------- + ---------- + ----------------------- +
| 1 | 3F |? | 1 |
+ ---- + --------------- + ---------- + ----------------------- +
1 row in set (0.00 sec)

The error occurs in step 1 of codec. See the following figure for details.


It can be seen that if MySQL cannot find a GBK character corresponding to the UTF8 character, it will be converted into an error mark (here is the question mark ). During program implementation, each character set defines the behavior and conversion rules when such a situation occurs. For example, if the corresponding character cannot be found in UTF8, replace it? (U + FFFD)

Is the conversion between any two character set encodings lossy? Otherwise, whether the conversion is lossy depends on the following points:

Whether the converted characters are in both character sets

Whether the target character set can retain its original expression form for unsupported characters

The first point has been explained through experiments. Here we will explain the second cause of lossy conversions. In the preceding example, we can see that GBK uses an error identifier instead of 0x3F to process characters that cannot be expressed by itself. Some character sets (such as latin1) retain the encoding data of the original character set when encountering characters that cannot be expressed by themselves, and skip this character to process subsequent data. If the target character set has such a feature, it will be able to achieve the error-in and error-out effect mentioned at the beginning of this section.

Let's take a look at the following example:



Master [localhost] {msandbox} (test)> create table charset_test (id int primary key auto_increment, char_col varchar (50) charset = latin1;
Query OK, 0 rows affected (0.03 sec)

Master [localhost] {msandbox} (test)> set names latin1;
Query OK, 0 rows affected (0.00 sec)

Master [localhost] {msandbox} (test)> insert into charset_test (char_col) values ('Chinese ');
Query OK, 1 row affected (0.01 sec)

Master [localhost] {msandbox} (test)> select id, hex (char_col), char_col from charset_test;
+ ---- + --------------- + ---------- +
| Id | hex (char_col) | char_col |
+ ---- + --------------- + ---------- +
| 2 | E4B8ADE69687 | Chinese |
+ ---- + --------------- + ---------- +
2 rows in set (0.00 sec)

The flowchart is as follows. It can be seen that the encoding inconsistency has actually occurred after it is received by the MySQL Server. However, because the Latin1 character set does not process any characters outside its own expression range, it retains the original value. This behavior also makes the error possible.


How to avoid garbled characters

After understanding the above content, it is easy to avoid garbled characters. As long as the three character sets are "three in one", namely, the client, MySQL character-set-client, and table charset, are completely consistent, there will be no garbled characters. However, it is difficult to fix data that has been garbled or has suffered lossy transcoding. In the next section, we will detail the specific methods.

How to fix corrupted data

Before introducing the correct methods, let's take a look at the serious consequences of the so-called "correct methods" circulating on the Internet.

Error Method 1

In terms of syntax or literal meaning, alter table... CHARSET = xxx is undoubtedly the best medicine for garbled text! In fact, it does not help you with corrupted data, and even the default character set of the columns already created in the table cannot be changed. Let's look at the example below.

Master [localhost] {msandbox} (test)> show create table charset_test;
+ -------------- + -------------------------------- +
| Table | Create Table |
+ -------------- + -------------------------------- +
| Charset_test | create table 'charset _ test '(
'Id' int (11) not null AUTO_INCREMENT,
'Char _ col' varchar (50) default null,
Primary key ('id ')
) ENGINE = InnoDB AUTO_INCREMENT = 3 default charset = latin1 |
+ -------------- + -------------------------------- +
1 row in set (0.00 sec)

Master [localhost] {msandbox} (test)> alter table charset_test charset = gbk;
Query OK, 0 rows affected (0.03 sec)
Records: 0 Duplicates: 0 Warnings: 0

Master [localhost] {msandbox} (test)> show create table charset_test;
+ -------------- + -------------------------------- +
| Table | Create Table |
+ -------------- + -------------------------------- +
| Charset_test | create table 'charset _ test '(
'Id' int (11) not null AUTO_INCREMENT,
'Char _ col' varchar (50) character set latin1 default null,
Primary key ('id ')
) ENGINE = InnoDB AUTO_INCREMENT = 3 default charset = gbk |
+ -------------- + -------------------------------- +
1 row in set (0.00 sec)

It can be seen that this syntax closely modifies the default character set of the table, that is, it only affects the default character set of the columns created later, but does not change the existing columns and data.

Error Method 2

Alter table... Convert to character set... Compared with Method 1, this method is more lethal, because its function is to encode and convert the data of a table. Below is a short excerpt from the document:

To change the table default character set and all character columns (CHAR, VARCHAR, TEXT) to a new character set, use a statement like this:

Alter table tbl_name
Convert to character set charset_name [COLLATE collation_name];

In fact, this syntax is only applicable to tables that are currently not garbled and not saved using the wrong method .. For a table that has produced an encoding error due to an error in/out, it will lead to worse results.

Let's use an actual example to explain what this SQL statement actually does and what it will cause. Assume that we have a table encoded as latin1, and the data stored in the UTF-8 was mistakenly written, but still displayed normally through terminal. This is the case where the preceding error occurs in the wrong section. After using it for a while, we found this error and planned to change the character set encoding of the table to UTF-8 without affecting the normal display of the original data. In this case, using alter table convert to character set has the following consequences:

Master [localhost] {msandbox} (test)> create table charset_test_latin1 (id int primary key auto_increment, char_col varchar (50) charset = latin1;
Query OK, 0 rows affected (0.01 sec)

Master [localhost] {msandbox} (test)> set names latin1;
Query OK, 0 rows affected (0.00 sec)

Master [localhost] {msandbox} (test)> insert into charset_test_latin1 (char_col) values ('this is Chinese ');
Query OK, 1 row affected (0.01 sec)

Master [localhost] {msandbox} (test)> select id, hex (char_col), char_col, char_length (char_col) from charset_test_latin1;
+ ---- + -------------------------- + -------------- + ----------------------- +
| Id | hex (char_col) | char_col | char_length (char_col) |
+ ---- + -------------------------- + -------------- + ----------------------- +
| 1 | E8BF99E698AFE4B8ADE69687 | Chinese | 12 |
+ ---- + -------------------------- + -------------- + ----------------------- +
1 row in set (0.01 sec)

Master [localhost] {msandbox} (test)> alter table charset_test_latin1 convert to character set utf8;
Query OK, 1 row affected (0.04 sec)
Records: 1 Duplicates: 0 Warnings: 0

Master [localhost] {msandbox} (test)> set names utf8;
Query OK, 0 rows affected (0.00 sec)

Master [localhost] {msandbox} (test)> select id, hex (char_col), char_col, char_length (char_col) from charset_test_latin1;
+ ---- + -------------------------------------------------------- + ------------------------------- + --------------------- +
| Id | hex (char_col) | char_col | char_length (char_col) |
+ ---- + -------------------------------------------------------- + ------------------------------- + --------------------- +
| 1 | C3A8C2BFE284A2C3A6CB9CC2AFC3A4C2B8C2ADC3A6E28093E280A1 | ???? Why ?? -?

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.