The Ultimate Solution to Garbled Characters (Mojibake) in MySQL


Why garbled characters appear in MySQL

To understand why garbled characters appear, we first have to understand the whole path the data travels: from the client request, to MySQL storing the data, to the data being read back out of the table and returned to the client, and which links in that path perform encoding/decoding. To explain this process better, I have drawn two flow charts, one for the store phase and one for the retrieve phase.
The encoding conversions data goes through when stored into MySQL

There are three encoding/decoding steps (the red arrows) in the figure above. The three red arrows correspond, respectively, to: client encoding, MySQL server decoding, and conversion from the client encoding to the table encoding. The terminal here can be bash, a web page, or an app; in this article we assume bash is our terminal, i.e. the user-facing input and display interface. Each box in the diagram corresponds to the following behavior (a small sketch of how to inspect these settings follows the list):

    • Characters are typed in the terminal using an input method
    • The terminal converts them to a binary stream according to its character encoding
    • The binary stream is sent to the MySQL server via the MySQL client
    • The server decodes the stream according to character_set_client
    • The server checks whether character_set_client matches the charset of the target table
    • If they differ, each character is converted from the client charset to the table charset
    • The converted binary stream is written to the table file
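
To see which encodings are actually in play at each of these steps, MySQL's session character set variables can be inspected directly. Below is a minimal sketch (it assumes the charset_test_utf8 table used in the demonstrations later in this article):

  -- Show the character set settings that govern the steps above:
  -- character_set_client is what the server uses to decode what the client sends,
  -- and the table/column charset decides how the data is stored in the file.
  SHOW VARIABLES LIKE 'character_set%';

  -- SET NAMES changes character_set_client, character_set_connection and
  -- character_set_results for the current session in one statement.
  SET NAMES utf8;

  -- The table charset is declared (or defaulted) at CREATE TABLE time.
  SHOW CREATE TABLE charset_test_utf8;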

The encoding conversions data goes through when read out of a MySQL table

The figure above also has three encoding/decoding steps (red arrows). They correspond, respectively, to: conversion from the table encoding to the character_set_client encoding, the MySQL server encoding the result according to character_set_client, and the client decoding for display. The steps are as follows (see the sketch after the list):

    • The binary data stream is read from the table file
    • It is decoded using the table's character set
    • The data is converted to the character_set_client encoding
    • The result is encoded as a binary stream using character_set_client
    • The server sends the stream over the network to the remote client
    • The client decodes and displays the result using the character encoding configured for bash
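
The server-side setting that governs the outbound conversion is character_set_results. A hedged sketch of how it changes what the client receives (again assuming the charset_test_utf8 table used later):

  -- character_set_results controls which encoding the server converts result
  -- sets into before sending them back over the network (steps 3-4 above).
  SET character_set_results = gbk;   -- results now arrive at the client as GBK bytes
  SELECT char_col FROM charset_test_utf8;

  SET character_set_results = utf8;  -- back to UTF-8 results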

Causes of garbled characters in MySQL
1. The encodings of corresponding steps differ between storing and retrieving

It is obvious that this causes garbled output. Number the character sets of the three encode/decode steps in the store phase C1, C2, C3 (the first figure, left to right), and the three character sets in the retrieve phase C1', C2', C3' (also left to right). If bash used UTF-8 (C1) when the data was stored, but the data is later read from a Windows terminal (C1'), whose default encoding is GBK, the result is almost certainly garbled. Likewise, if SET NAMES utf8 was in effect when storing (C2) but SET NAMES gbk when retrieving (C2'), the result is necessarily garbled. A small sketch of the second case follows.
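
Here is a hedged sketch of the C2 versus C2' case. It assumes a bash terminal that stays in UTF-8 the whole time and the charset_test_utf8 table created in the demonstration below:

  -- Store with the client charset set to UTF-8 (C2 = utf8): lossless
  SET NAMES utf8;
  INSERT INTO charset_test_utf8 (char_col) VALUES ('中文');

  -- Retrieve with the client charset set to GBK (C2' = gbk)
  SET NAMES gbk;
  SELECT char_col FROM charset_test_utf8;
  -- The server converts the result to GBK bytes, but the UTF-8 terminal
  -- decodes them as UTF-8, so the displayed value is garbled.
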
2. The three encodings within a single phase are inconsistent

That is, within the three steps of either diagram (read in one direction), if two or more of the encodings differ, an encode/decode error can occur. If, in addition, there is no lossless conversion between the differing character sets (explained in detail below), garbled characters will appear. For example: if our shell uses UTF-8, MySQL's character_set_client is configured as GBK, and the table is CHARSET=utf8, garbled data is unavoidable.

Here is a brief demonstration of this situation:

  master [localhost] {msandbox} (test) > CREATE TABLE charset_test_utf8 (id int primary key auto_increment, char_col varchar(50)) charset = utf8;
  Query OK, 0 rows affected (0.04 sec)

  master [localhost] {msandbox} (test) > SET NAMES gbk;
  Query OK, 0 rows affected (0.00 sec)

  master [localhost] {msandbox} (test) > INSERT INTO charset_test_utf8 (char_col) VALUES ('中文');
  Query OK, 1 row affected, 1 warning (0.01 sec)

  master [localhost] {msandbox} (test) > SHOW WARNINGS;
  +---------+------+---------------------------------------------------------------------------+
  | Level   | Code | Message                                                                   |
  +---------+------+---------------------------------------------------------------------------+
  | Warning | 1366 | Incorrect string value: '\xad\xe6\x96\x87' for column 'char_col' at row 1 |
  +---------+------+---------------------------------------------------------------------------+
  1 row in set (0.00 sec)

  master [localhost] {msandbox} (test) > SELECT id, hex(char_col), char_col FROM charset_test_utf8;
  +----+----------------+----------+
  | id | hex(char_col)  | char_col |
  +----+----------------+----------+
  |  1 | e6b6933fe69e83 | ????     |
  +----+----------------+----------+
  1 row in set (0.01 sec)

About MySQL's encoding/decoding

Since the data is transmitted as a binary stream anyway, why not write that stream straight into the table file? Why does MySQL perform two encode/decode operations before storing it?

    • The client-to-server decode is needed because MySQL must perform lexical and syntactic parsing of the incoming binary stream. Without decoding and validating it, the server cannot even tell whether a given binary stream is an INSERT or an UPDATE.
    • The file-to-engine decode is needed so that the storage engine knows where the character boundaries in the binary stream are. A simple example: suppose we want the first two characters of a column, e.g. SELECT LEFT(col, 2) FROM tbl, and the value the storage engine reads from the file for that column is the bytes e4b8ade69687. If the engine splits these bytes as GBK it gets three characters, e4b8 / ade6 / 9687, and should return e4b8ade6 to the client; if it splits them as UTF-8 it gets two characters, e4b8ad / e69687, and should return all of e4b8ade69687. Clearly, if the data read from the file were not decoded, the storage engine could not perform character-level operations. A sketch of this follows the list.
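
A hedged sketch of that LEFT() example (split_demo is a hypothetical table; a UTF-8 client and the test database used throughout this article are assumed):

  SET NAMES utf8;
  CREATE TABLE split_demo (col varchar(10)) CHARSET = utf8;  -- hypothetical demo table
  INSERT INTO split_demo VALUES ('中文');                    -- stored as the bytes e4b8ade69687

  -- Because the column is utf8, the engine splits the bytes as e4b8ad / e69687,
  -- so LEFT(col, 2) returns both characters, i.e. the whole value.
  SELECT LEFT(col, 2), hex(LEFT(col, 2)) FROM split_demo;
  -- Expected: 中文, E4B8ADE69687
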
About "wrong in, wrong out"

The most common myth about garbled data in MySQL is "wrong in, wrong out". "Wrong in, wrong out" refers to the phenomenon that, even though the client's (web or shell) character encoding differs from the final table encoding, you can still get non-garbled output as long as the character set used when storing and the character set used when retrieving are the same. However, "wrong in, wrong out" does not work for arbitrary combinations of two character sets. Assume the client's encoding is C and the MySQL table's character set is S. For "wrong in, wrong out" to be possible, the following two conditions must both hold:

    1. When MySQL receives the request, the binary stream encoded in C must be decodable by S without loss
    2. When MySQL returns the data, the binary stream encoded in S must be decodable by C without loss

Lossless Encoding Conversion

So what is a lossy conversion, and what is a lossless one? If we want to convert a character X represented in encoding A into its representation in encoding B, and the character repertoire of B does not contain X, we call the conversion lossy. Why do two encodings differ in the characters they can represent? If you have read my earlier post, "Understand character sets and character encodings in 10 minutes", or have a basic understanding of character encodings, you know that each character set supports only a finite number of characters, and that the characters covered differ from one character set to another. The numbers of characters UTF-8 and GBK can represent are roughly as follows:

    • GBK: a single character is encoded in the range 8140-FEFE (excluding **7F), roughly 27,000 characters in total
    • UTF-8: a single character is encoded in 1 to 4 bytes; roughly, 1-byte characters fall in 00-7F, 2-byte characters in C280-DFBF, 3-byte characters in E0A080-EFBFBF, and 4-byte characters in F0908080-F48FBFBF

The number of characters UTF-8 can represent far exceeds that of GBK, so it is easy to find a conversion from UTF-8 to GBK that is lossy. Using a character map tool (see below), we pick a character that is clearly not in the GBK table, try to store it into a GBK-encoded table, read it back, and observe the lossy conversion.
The character used is ਅ (GURMUKHI LETTER A), Unicode U+0A05, UTF-8 bytes E0 A8 85.

What actually happens when it is stored in MySQL is as follows:

  master [localhost] {msandbox} (test) > CREATE TABLE charset_test_gbk (id int primary key auto_increment, char_col varchar(50)) charset = gbk;
  Query OK, 0 rows affected (0.00 sec)

  master [localhost] {msandbox} (test) > SET NAMES utf8;
  Query OK, 0 rows affected (0.00 sec)

  master [localhost] {msandbox} (test) > INSERT INTO charset_test_gbk (char_col) VALUES ('ਅ');
  Query OK, 1 row affected, 1 warning (0.01 sec)

  master [localhost] {msandbox} (test) > SHOW WARNINGS;
  +---------+------+-----------------------------------------------------------------------+
  | Level   | Code | Message                                                               |
  +---------+------+-----------------------------------------------------------------------+
  | Warning | 1366 | Incorrect string value: '\xe0\xa8\x85' for column 'char_col' at row 1 |
  +---------+------+-----------------------------------------------------------------------+
  1 row in set (0.00 sec)

  master [localhost] {msandbox} (test) > SELECT id, hex(char_col), char_col, char_length(char_col) FROM charset_test_gbk;
  +----+---------------+----------+-----------------------+
  | id | hex(char_col) | char_col | char_length(char_col) |
  +----+---------------+----------+-----------------------+
  |  1 | 3F            | ?        |                     1 |
  +----+---------------+----------+-----------------------+
  1 row in set (0.00 sec)

The loss happens at the third encode/decode step of the store phase, when the server converts from character_set_client to the table charset. See the figure below for details.

As you can see, when MySQL cannot find a GBK character corresponding to the UTF-8 character, it converts it to an error marker (here, a question mark). Each character set implementation defines its own behavior and conversion rules for this situation. For example, UTF-8 replaces characters it cannot map with the replacement character U+FFFD.
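
The same lossy behavior can be observed directly with the CONVERT() function. A minimal sketch, assuming a UTF-8 client session:

  -- GBK has no mapping for U+0A05, so the character is replaced with '?' (0x3F)
  SELECT hex(CONVERT('ਅ' USING gbk));
  -- Expected result: 3F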

So is the conversion between any two character sets lossy? No. Whether it is lossy depends on the following points:

    • Whether the converted character exists in both character sets
    • Whether the target character set can preserve the original representation of characters it does not support

The first point was demonstrated by the experiment above. Now for the second factor that determines whether a conversion is lossy. As the example just showed, GBK handles characters it cannot represent by replacing them with the error marker 0x3F ('?'). Some character sets, such as latin1, instead keep the original bytes of the source character set when they encounter characters they cannot represent, effectively skipping the character and continuing with the rest of the data. If the target character set has this property, then the "wrong in, wrong out" described earlier becomes possible.

Let's look at the following example:

  master [localhost] {msandbox} (test) > CREATE TABLE charset_test (id int primary key auto_increment, char_col varchar(50)) charset = latin1;
  Query OK, 0 rows affected (0.03 sec)

  master [localhost] {msandbox} (test) > SET NAMES latin1;
  Query OK, 0 rows affected (0.00 sec)

  master [localhost] {msandbox} (test) > INSERT INTO charset_test (char_col) VALUES ('中文');
  Query OK, 1 row affected (0.01 sec)

  master [localhost] {msandbox} (test) > SELECT id, hex(char_col), char_col FROM charset_test;
  +----+---------------+----------+
  | id | hex(char_col) | char_col |
  +----+---------------+----------+
  |  2 | e4b8ade69687  | 中文     |
  +----+---------------+----------+
  2 rows in set (0.00 sec)

The detailed flow chart is shown below. You can see that the encoding mismatch has in fact already happened by the time the data reaches the MySQL server. However, because the latin1 character set does not touch any bytes that fall outside what it can represent, the original values are preserved. This behavior is what makes "wrong in, wrong out" possible.

How to avoid garbled characters

With the above understood, avoiding garbled data looks easy: keep the "trinity" consistent, that is, make the client encoding, MySQL's character_set_client, and the table charset exactly the same, and no garbled characters will appear (a minimal sketch follows). For data that is already garbled, or has already been through a lossy conversion, repair is harder; the next sections describe the specific methods in detail.
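
A minimal sketch of the "trinity" setup, assuming a bash terminal configured for UTF-8 (for example LANG=en_US.UTF-8) and a hypothetical table named trinity_demo:

  -- 1. The client/connection charset matches the terminal encoding
  SET NAMES utf8;

  -- 2. The table (and column) charset matches as well
  CREATE TABLE trinity_demo (
    id int primary key auto_increment,
    char_col varchar(50)
  ) CHARSET = utf8;

  INSERT INTO trinity_demo (char_col) VALUES ('中文');
  SELECT char_col FROM trinity_demo;   -- displays correctly on a UTF-8 terminal
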
How to fix data that has already been corrupted by wrong encoding

Before introducing the correct methods, let's look at the serious consequences that the so-called "correct methods" circulating on the Internet can cause.
Wrong Method One

Syntactically and literally, ALTER TABLE ... CHARSET=xxx looks like the perfect cure for garbled data. In fact, it does nothing for data that is already corrupted, and it cannot even change the character set of columns that already exist in the table. Let's look at the following example:

  master [localhost] {msandbox} (test) > SHOW CREATE TABLE charset_test;
  +--------------+--------------------------------+
  | Table        | Create Table                   |
  +--------------+--------------------------------+
  | charset_test | CREATE TABLE `charset_test` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `char_col` varchar(50) DEFAULT NULL,
    PRIMARY KEY (`id`)
  ) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=latin1 |
  +--------------+--------------------------------+
  1 row in set (0.00 sec)

  master [localhost] {msandbox} (test) > ALTER TABLE charset_test CHARSET=gbk;
  Query OK, 0 rows affected (0.03 sec)
  Records: 0  Duplicates: 0  Warnings: 0

  master [localhost] {msandbox} (test) > SHOW CREATE TABLE charset_test;
  +--------------+--------------------------------+
  | Table        | Create Table                   |
  +--------------+--------------------------------+
  | charset_test | CREATE TABLE `charset_test` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `char_col` varchar(50) CHARACTER SET latin1 DEFAULT NULL,
    PRIMARY KEY (`id`)
  ) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=gbk |
  +--------------+--------------------------------+
  1 row in set (0.00 sec)

As you can see, this syntax merely modifies the table's default character set, which affects only columns created afterwards; it does not change existing columns or existing data.
Wrong Method Two

ALTER TABLE ... CONVERT TO CHARACTER SET ... is even more destructive than method one, because its documented purpose is to convert a table's data to a new encoding. A short excerpt from the documentation:

  To change the table default character set and all character columns (CHAR, VARCHAR, TEXT) to a new character set, use a statement like this:

  ALTER TABLE tbl_name
  CONVERT TO CHARACTER SET charset_name [COLLATE collation_name];

In practice, this syntax only works on tables that are not garbled and were not saved through "wrong in, wrong out". For a table that already holds wrongly encoded data because of "wrong in, wrong out", it produces an even worse result.

Let's use a concrete example to explain what this SQL actually does and what it will do to such a table. Suppose we have a table whose encoding is latin1, into which UTF-8 data was previously stored by "wrong in, wrong out", but which still displays correctly through the terminal; this is exactly the situation of the example in the "wrong in, wrong out" section above. After using it for a while we discover the mistake and decide to change the table's character set to UTF-8 without affecting the display of the existing data. In this situation, ALTER TABLE ... CONVERT TO CHARACTER SET has the following consequences:

  master [localhost] {msandbox} (test) > CREATE TABLE charset_test_latin1 (id int primary key auto_increment, char_col varchar(50)) charset = latin1;
  Query OK, 0 rows affected (0.01 sec)

  master [localhost] {msandbox} (test) > SET NAMES latin1;
  Query OK, 0 rows affected (0.00 sec)

  master [localhost] {msandbox} (test) > INSERT INTO charset_test_latin1 (char_col) VALUES ('这是中文');
  Query OK, 1 row affected (0.01 sec)

  master [localhost] {msandbox} (test) > SELECT id, hex(char_col), char_col, char_length(char_col) FROM charset_test_latin1;
  +----+--------------------------+--------------+-----------------------+
  | id | hex(char_col)            | char_col     | char_length(char_col) |
  +----+--------------------------+--------------+-----------------------+
  |  1 | e8bf99e698afe4b8ade69687 | 这是中文     |                    12 |
  +----+--------------------------+--------------+-----------------------+
  1 row in set (0.01 sec)

  master [localhost] {msandbox} (test) > ALTER TABLE charset_test_latin1 CONVERT TO CHARACTER SET utf8;
  Query OK, 1 row affected (0.04 sec)
  Records: 1  Duplicates: 0  Warnings: 0

  master [localhost] {msandbox} (test) > SET NAMES utf8;
  Query OK, 0 rows affected (0.00 sec)

  master [localhost] {msandbox} (test) > SELECT id, hex(char_col), char_col, char_length(char_col) FROM charset_test_latin1;
  +----+--------------------------------------------------------+--------------------+-----------------------+
  | id | hex(char_col)                                          | char_col           | char_length(char_col) |
  +----+--------------------------------------------------------+--------------------+-----------------------+
  |  1 | c3a8c2bfe284a2c3a6cb9cc2afc3a4c2b8c2adc3a6e28093e280a1 | è¿™æ˜¯ä¸­æ–‡       |                    12 |
  +----+--------------------------------------------------------+--------------------+-----------------------+
  1 row in set (0.00 sec)

From this example we can see that for a table that already contains wrongly stored data, this command not only fails to "set things right", it ruins the data completely: even the binary representation of the data is changed.
Correct Method One: Dump & Reload

This method is rather crude, but it is easy to carry out and to understand. Briefly, it consists of three steps:

    1. Export the data to a file through the same "wrong" path it was written with
    2. Recreate the table with the correct character set
    3. Load the previously exported file back into the new table

Continuing with the example above, where UTF-8 data was stored into the latin1-encoded table by "wrong in, wrong out": to change the table encoding to UTF-8, the following commands can be used:

  shell> mysqldump -u root -p -d --skip-set-charset --default-character-set=utf8 test charset_test_latin1 > data.sql
  # Make sure the exported file shows no garbled text when viewed in a text editor under UTF-8 encoding
  shell> mysql -uroot -p -e 'CREATE TABLE charset_test_latin1 (id int primary key auto_increment, char_col varchar(50)) charset = utf8' test
  shell> mysql -uroot -p --default-character-set=utf8 test < data.sql

Correct Method Two: Convert to Binary & Convert Back

This method is trickier: it uses binary data as an intermediate step. Because MySQL only performs real data conversion on streams that carry an encoding, converting an encoded column to binary (which has no encoding semantics) does not transform the data; converting binary data back to an encoded type, on the other hand, performs an encoding check against the target character set. These two properties are used to simulate a "wrong in" inside MySQL itself and thereby "set right" the garbled data.

Continuing with the example above, where UTF-8 data was stored into the latin1-encoded table by "wrong in, wrong out": to change the table encoding to UTF-8, the following commands can be used:

  mysql> ALTER TABLE charset_test_latin1 MODIFY COLUMN char_col VARBINARY(50);
  mysql> ALTER TABLE charset_test_latin1 MODIFY COLUMN char_col varchar(50) CHARACTER SET utf8;
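
As a hedged follow-up check (assuming the table still held the original "wrong in" bytes, i.e. the damaging CONVERT TO CHARACTER SET from the previous section was never run): the bytes are unchanged, but the column is now utf8, so a UTF-8 client sees the original text again.

  mysql> SET NAMES utf8;
  mysql> SELECT id, hex(char_col), char_col FROM charset_test_latin1;
  -- Expected: hex(char_col) is still e8bf99e698afe4b8ade69687 and char_col displays 这是中文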
