Understand and solve MySQL garbled problem

Source: Internet
Author: User
Tags mysql client

Why MySQL is garbled


To understand why garbled, we have to understand: from the client initiated the request, to the MySQL storage data, and the next time from the table to retrieve the client's process, which links will have encoding/decoding behavior. To better explain the process, the blogger produced two flowcharts, corresponding to two stages of deposit and withdrawal.

Encoding conversion process stored in MySQL experience



There are 3 encoding/decoding processes (Red arrows) in the Three Red Arrows correspond: client-side encoding, MySQL server decoding, client-coded conversion to table encoding. Where terminal can be a bash, a Web page or an app. In this article we assume that bash is our terminal, the user-side input and presentation interface. Each box in the diagram corresponds to the following behavior:

Using IME input in terminal
Terminal conversion from character encoding to binary stream
Binary stream is transferred to MySQL Server via MySQL client
Server decoding via Character-set-client
Determine if the charset of the character-set-client and the target table are consistent
One-time character encoding conversion from Client-charset to Table-charset if inconsistent
Depositing the converted character encoding binary stream into a file

The encoding conversion process for extracting data from a MySQL table


There are 3 encoding/decoding processes (Red arrows). The three Red Arrows correspond to the following: Client decoding shows, MySQL server based on character-set-client encoding, table encoding to character-set-client encoded conversion.

Read binary data streams from a file
Decoding with the table character set encoding
converting data to character-set-client encoding
Using character-set-client encoding as a binary stream
Server is transmitted over the network to the remote client
Client displays query results with bash-configured character encoding

Causes of MySQL garbled

1. Inconsistent encoding of the corresponding link during deposit and removal


It is obvious that this will cause garbled characters. We put the three-time codec used in the deposit phase to the character set number c1,c2,c3 (figure one from left to right), and three character sets are numbered C1 ', C2 ', C3 ' (left to right) when taken out. So when you deposit bash C1 with the UTF-8 code, when taken out, C1 ' We use the Windows terminal (the default is GBK encoding), then the result is almost certainly garbled. Or when it is stored in MySQL set names UTF8 (C2), and when it is removed using the set names GBK (C2 '), then the result must be garbled

2. Three-step coding inconsistency in a single process


That is, in any of the above image in the same direction of the three steps, as long as two or more than two of the code is inconsistent, there is a possibility of codec error. If a lossless encoding conversion between the two character sets of the difference is not possible (described in more detail below), garbled characters are bound to occur. For example: Our shell is UTF8 code, MySQL character-set-client configured to GBK, and the table structure is Charset=utf8, then there will undoubtedly be garbled.
So here's a quick demonstration of the situation.

master [localhost] {msandbox} (Test) > CREATE TABLE Charset_test_utf8 (ID int primary key auto_increment, Char_col Varc Har (()) charset = UTF8;
Query OK, 0 rows affected (0.04 sec)

master [localhost] {msandbox} (Test) > Set names GBK;
Query OK, 0 rows Affected (0.00 sec)

master [localhost] {msandbox} (Test) > INSERT into Charset_test_utf8 (Char_col) VALUES (' Chinese ');
Query OK, 1 row affected, 1 warning (0.01 sec)

master [localhost] {msandbox} (Test) > Show warnings;
+---------+------+---------------------------------------------------------------------------+
| Level | Code | Message |
+---------+------+---------------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: ' \xad\xe6\x96\x87 ' for column ' Char_col ' at row 1 |
+---------+------+---------------------------------------------------------------------------+
1 row in Set (0.00 sec)

master [localhost] {msandbox} (Test) > select Id,hex (char_col), char_col from Charset_test_utf8;
+----+----------------+----------+
| ID | Hex (Char_col) | Char_col |
+----+----------------+----------+
| 1 |        e6b6933fe69e83 |???? |
+----+----------------+----------+
1 row in Set (0.01 sec)

About MySQL compilation/decoding

Since the system is transmitted in accordance with the binary stream, it is good to directly deposit this binary stream directly into the table file. Why do I have to do two codecs before storing them?

The reason for the codec of Client to server is that MySQL needs syntax and lexical parsing of the incoming binary stream. Without encoding parsing and checking, we don't even know if a string of binary streams is insert or update.


The codec of File to engine is to know the word segmentation in binary stream. A simple example: we want to take the first two characters of a field from the table, execute a statement like Select Left (col,2) from table, and the storage engine reads the value of the column from the file e4b8ade69687. So this time if we follow GBK to divide this value into e4b8,ade6,9687 three words, and then return the value of the client should be e4b8ade6, if the UTF8 divided into e4b8ad,e69687, then you should return e4b8ade69687 two words. It can be seen that character-level operations cannot be performed inside the storage engine if the data is not encoded after it is read from the data file.

About the wrong in wrong

The most common garbled problem in MySQL is the mistake of mistaken myth. The so-called error is that the client (web or shell) character encoding and the final table character encoding format is different, but as long as the memory and take two times the character set encoding consistent can still get no garbled output of this phenomenon. However, error-in errors are not valid for any combination of two character set encodings. We assume that the encoding of the client is c,mysql the character set encoding of the table is S. The following two conditions are required in order to be able to err in error

When MySQL receives a request, the binary stream from the C encoding is decoded by S to lossless
The MySQL return data is that the binary stream from S encoding can be lossless when decoded by C

Coded lossless conversion

So what is lossy conversion and what is lossless conversion? Suppose we want to convert the character x represented by the encoded a into the representation of the encoded B, and the glyph set of B is not the x character, then we call this conversion lossy. So why would there be two encodings that could represent the difference in the character set? If you've seen the 10 minutes before the blogger about the character set and character encoding, or if you have a basic understanding of character encoding, you should know that the number of characters supported by each character set is limited, and there are differences between the text that each character set covers. The range of characters that UTF8 and GBK can represent are as follows

GBK the value range of a single character encoding is: 8140-fefe which does not include **7e, the total number of characters is around 27000

UTF8 A single character is encoded, the range of values is the following table, depending on the number of bytes:

Because the number of characters UTF-8 encoding can represent far more than GBK. Then we can easily find a lossy encoding conversion from UTF8 to GBK. We use the character mapper (see) to find a character that is clearly not in the GBK encoding table and try to deposit it into the GBK encoded table. and remove again to view the behavior of the lossy transformation
Character information is specifically:? Gurmukhi Letter A unicode:u+0a05, UTF-8: E0 A8 85

The specific conditions stored in MySQL are as follows:

master [localhost] {msandbox} (Test) > CREATE TABLE CHARSET_TEST_GBK (ID int primary key auto_increment, Char_col Varch AR (()) charset = GBK;
Query OK, 0 rows Affected (0.00 sec)

master [localhost] {msandbox} (Test) > Set names UTF8;
Query OK, 0 rows Affected (0.00 sec)

master [localhost] {msandbox} (Test) > INSERT into CHARSET_TEST_GBK (Char_col) VALUES ('? ');
Query OK, 1 row affected, 1 warning (0.01 sec)

master [localhost] {msandbox} (Test) > Show warnings;
+---------+------+-----------------------------------------------------------------------+
| Level | Code | Message |
+---------+------+-----------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: ' \xe0\xa8\x85 ' for column ' Char_col ' at row 1 |
+---------+------+-----------------------------------------------------------------------+
1 row in Set (0.00 sec)

master [localhost] {msandbox} (Test) > select Id,hex (Char_col), Char_col,char_length (Char_col) from CHARSET_TEST_GBK ;
+----+---------------+----------+-----------------------+
| ID | Hex (Char_col) | Char_col | Char_length (Char_col) |
+----+---------------+----------+-----------------------+
| 1 |        3F |?                     | 1 |
+----+---------------+----------+-----------------------+
1 row in Set (0.00 sec)

The error section occurs when the 3rd step of the decoding is being taken. See details




Visible inside MySQL If you cannot find the GBK character for a UTF8 character, it will be converted to an error mark (here is the question mark). Each character set, when implemented by the program, internally contracts the behavior and transformation rules that occur when this happens. For example, if the corresponding character cannot be found in the UTF8, then the character will be replaced if the error is not thrown. (U+FFFD)

Is it not true that the conversion between any two character set encodings is lossy? Not so, the conversion is lossy depending on the following points:

Whether the converted characters are in two character sets at the same time
Whether the target character set can retain its original expression for unsupported characters

As for the 1th, I have already explained it through experiments. Here to explain the second factor that caused the lossy conversion. As we can see from the previous example, the behavior of GBK when dealing with characters that it cannot represent is: Replace with an error identifier, or 0x3f. Some character sets, such as Latin1, retain the encoded data of the original character set when they encounter characters they cannot represent, and skip ignoring the character to process the subsequent data. If the target character set has such an attribute, then it is possible to implement the error-in-the-wrong effect that was mentioned at the beginning of this section.


Let's take a look at the following example

master [localhost] {msandbox} (Test) > CREATE TABLE charset_test (ID int primary KEY auto_increment, Char_col varchar (5 0)) charset = Latin1;
Query OK, 0 rows affected (0.03 sec)

master [localhost] {msandbox} (Test) > Set names latin1;
Query OK, 0 rows Affected (0.00 sec)

master [localhost] {msandbox} (Test) > INSERT into Charset_test (Char_col) VALUES (' Chinese ');
Query OK, 1 row affected (0.01 sec)

master [localhost] {msandbox} (Test) > select Id,hex (char_col), char_col from Charset_test;
+----+---------------+----------+
| ID | Hex (Char_col) | Char_col |
+----+---------------+----------+
| 2 | e4b8ade69687 | English |
+----+---------------+----------+
2 rows in Set (0.00 sec)

The specific flowchart is as follows. It can be seen that code inconsistencies have actually occurred since the MySQL server was received. But because the Latin1 character set does not do any processing for characters outside of its stated range, it retains the original value. This behavior makes it possible to make mistakes in the wrong way.

How to avoid garbled characters


Understand the above content, to avoid garbled, it seems very easy. As long as the "Trinity", that is, the client, MySQL character-set-client,table CharSet three character sets exactly the same can guarantee that there will be no garbled appearance. For data that has been garbled, or has suffered lossy transcoding, how to fix it is relatively difficult. In the next section, we describe the specific methods in detail.

How to fix corrupted data that has been encoded

Before we introduce the right method, let's start by popularizing some of the serious consequences of the so-called "right methods" that are circulating on the Internet.

Error method One

In terms of syntax or literal meaning: ALTER TABLE ... Charset=xxx is undoubtedly the most like the best medicine for the treatment of garbage! In fact, he doesn't have any help with the data you've corrupted, and even the default character set that has already created columns for that table cannot be changed. Let's look at the following example

master [localhost] {msandbox} (Test) > Show create Table charset_test;
+--------------+--------------------------------+
| Table | Create Table |
+--------------+--------------------------------+
| Charset_test | CREATE TABLE ' Charset_test ' (
' id ' int (one) not NULL auto_increment,
' char_col ' varchar DEFAULT NULL,
PRIMARY KEY (' id ')
) engine=innodb auto_increment=3 DEFAULT charset=latin1 |
+--------------+--------------------------------+
1 row in Set (0.00 sec)

master [localhost] {msandbox} (TEST) > ALTER TABLE Charset_test CHARSET=GBK;
Query OK, 0 rows affected (0.03 sec)
records:0 duplicates:0 warnings:0

master [localhost] {msandbox} (Test) > Show create Table charset_test;
+--------------+--------------------------------+
| Table | Create Table |
+--------------+--------------------------------+
| Charset_test | CREATE TABLE ' Charset_test ' (
' id ' int (one) not NULL auto_increment,
' Char_col ' varchar (CHARACTER) SET latin1 DEFAULT NULL,
PRIMARY KEY (' id ')
) engine=innodb auto_increment=3 DEFAULT CHARSET=GBK |
+--------------+--------------------------------+
1 row in Set (0.00 sec)

It is visible that the syntax tightly modifies the default character set of the table, which affects only the default character set of columns that are created later, and does not change the columns and data that already exist.


Error Method Two

ALTER TABLE ... CONVERT to CHARACTER SET ... is more lethal than method one, since the official documentation explains his role in encoding and converting data from one table to another. Here is a short excerpt from the document:

The table default character set and all character columns (CHAR, VARCHAR, TEXT) to a new character set, use a St Atement like this:
ALTER TABLE Tbl_name
CONVERT to CHARACTER SET charset_name [COLLATE collation_name];

In fact, this syntax only applies to tables that are not garbled at the moment and are not saved by the wrong way. and for a table that has been coded incorrectly because of a wrong error, it can produce worse results. We use a practical example to explain what this SQL actually did and what he would have done. Let's say we have a table with a code that is latin1, and the data that was previously stored in the UTF-8 by mistake, but is still displayed normally through terminal. That is, the above-mentioned error in the wrong section of the case. After a period of use we found this error, and we intend to change the table character set encoding to UTF-8 and does not affect the normal display of the original data. In this case, the use of ALTER TABLE convert to character set would have such consequences:

master [localhost] {msandbox} (Test) > CREATE TABLE charset_test_latin1 (ID int primary key auto_increment, Char_col VA Rchar (()) charset = Latin1;
Query OK, 0 rows affected (0.01 sec)

master [localhost] {msandbox} (Test) > Set names latin1;
Query OK, 0 rows Affected (0.00 sec)

master [localhost] {msandbox} (Test) > INSERT into Charset_test_latin1 (Char_col) VALUES (' This is Chinese ');
Query OK, 1 row affected (0.01 sec)

master [localhost] {msandbox} (Test) > select Id,hex (Char_col), Char_col,char_length (Char_col) from Charset_test_ Latin1;
+----+--------------------------+--------------+-----------------------+
| ID | Hex (Char_col) | Char_col | Char_length (Char_col) |
+----+--------------------------+--------------+-----------------------+
| 1 | e8bf99e698afe4b8ade69687 |                    This is Chinese | 12 |
+----+--------------------------+--------------+-----------------------+
1 row in Set (0.01 sec)

master [localhost] {msandbox} (TEST) > ALTER TABLE charset_test_latin1 convert to character set UTF8;
Query OK, 1 row affected (0.04 sec)
Records:1 duplicates:0 warnings:0

master [localhost] {msandbox} (Test) > Set names UTF8;
Query OK, 0 rows Affected (0.00 sec)

master [localhost] {msandbox} (Test) > select Id,hex (Char_col), Char_col,char_length (Char_col) from Charset_test_ Latin1;
+----+--------------------------------------------------------+-----------------------------+------------------ -----+
| ID | Hex (Char_col) | Char_col | Char_length (Char_col) |
+----+--------------------------------------------------------+-----------------------------+------------------ -----+
| 1 | c3a8c2bfe284a2c3a6cb9cc2afc3a4c2b8c2adc3a6e28093e280a1 |                    è¿™æ˜¯ä¸æ–‡| 12 |
+----+--------------------------------------------------------+-----------------------------+------------------ -----+
1 row in Set (0.00 sec)

From this example, we can see that the wrong data table, this command not only does not play "wrong" effect, but also completely spoil the data, even the binary code of the data changed.

The right way a Dump & Reload

This method is more stupid, but also better operation and understanding. The simple word is divided into the following three steps:

Exporting to a file by wrong-in-error method
Modify a new table with the correct character set
To import the previously exported files back into the new table

Or with the example above, we use UTF-8 to "mistake" the data into a latin1 encoded table. Now you need to modify the table encoding to UTF-8 you can use the following command

Shell> mysqldump-u root-p-D--skip-set-charset--default-character-set=utf8 Test charset_test_latin1 > Data.sql
#确保导出的文件用文本编辑器在UTF-8 Code View no garbled
Shell> mysql-uroot-p-e ' CREATE TABLE charset_test_latin1 (ID int primary KEY auto_increment, Char_col varchar ()) C Harset = UTF8 ' test
shell> mysql-uroot-p--default-character-set=utf8 Test < Data.sql

The right way two convert to Binary & convert back

This method compares trickery, and uses binary data as the intermediate data to implement. Since MySQL will then have coded data streams, conversion to binary data without coding meaning does not make the actual data conversion. When the binary data is changed to the encoded data, the encoding conversion check is done by the target encoding. Through these two characteristics is equivalent to the MySQL internal simulation of a "wrong out", will garbled "error".

Or with the example above, we use UTF-8 to "mistake" the data into a latin1 encoded table. Now you need to modify the table encoding to UTF-8 you can use the following command

mysql> ALTER TABLE charset_test_latin1 MODIFY COLUMN char_col VARBINARY (50);
mysql> ALTER TABLE charset_test_latin1 MODIFY COLUMN char_col varchar (character set UTF8);

Reference

http://www.psce.com/blog/2015/03/03/mysql-character-encoding-part-2/
http://www.qqxiuzi.cn/zh/hanzi-gbk-bianma.php
http://zh.wikipedia.org/wiki/%E6%B1%89%E5%AD%97%E5%86%85%E7%A0%81%E6%89%A9%E5%B1%95%E8%A7%84%E8%8C%83
http://www.qqxiuzi.cn/zh/hanzi-gbk-bianma.php
http://blog.csdn.net/ws84643557/article/details/6905167

Understand and solve MySQL garbled problem

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.