This article details the causes of garbled characters (mojibake) in MySQL and the specific solutions.
Why MySQL data becomes garbled
To understand why garbling happens, we have to understand the whole journey: from the client issuing a request, to MySQL storing the data, to the data being read back from the table and returned to the client, and which links in that chain perform encoding or decoding. To explain the process better, the blogger produced two flowcharts, corresponding to the two phases: storing and retrieving.
The encoding conversions experienced when storing data into MySQL
There are three encoding/decoding steps (the red arrows in the flowchart). The three red arrows correspond to: client-side encoding, MySQL server decoding, and conversion from the client encoding to the table encoding. The terminal can be bash, a web page, or an app; in this article we assume bash, the user-facing input and display interface, is our terminal. Each box in the diagram corresponds to one of the following behaviors:
- The user types text in the terminal through an input method editor (IME)
- The terminal converts the characters to a binary stream according to its character encoding
- The binary stream is sent to the MySQL server via the MySQL client
- The server decodes the stream using character_set_client (a sketch for inspecting these variables follows this list)
- The server checks whether character_set_client matches the charset of the target table
- If they differ, a one-time conversion from the client charset to the table charset is performed
- The binary stream in the converted encoding is written to the data file
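A quick way to see which encodings are in play at these steps is to inspect the session character-set variables. A minimal sketch (the variable names are standard MySQL; utf8 is just an example value):

mysql> SHOW VARIABLES LIKE 'character_set%';
-- lists character_set_client, character_set_connection, character_set_results, and the server/database defaults
mysql> SET NAMES utf8;
-- sets character_set_client, character_set_connection, and character_set_results in one statement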
The encoding conversions experienced when retrieving data from a MySQL table
There are likewise three encoding/decoding steps (red arrows). The three red arrows correspond to: decoding the stored data with the table encoding, converting it to the character_set_client encoding, and the client decoding the result for display.
- The binary data stream is read from the data file
- It is decoded using the table's character set
- The data is converted to the character_set_client encoding (see the sketch after this list)
- It is encoded as a binary stream using character_set_client
- The server transmits the stream over the network to the remote client
- The client displays the query result using the character encoding configured for bash
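The conversion in step 3 of this flow is governed by character_set_results, which SET NAMES also sets. A small sketch (gbk is just an example value):

mysql> SET character_set_results = gbk;
-- result sets are now converted to GBK before being sent to the client
mysql> SET character_set_results = NULL;
-- NULL tells the server to send result data back without any conversion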
Causes of garbled characters in MySQL
1. The encodings of corresponding steps differ between storing and retrieving
Obviously this causes garbled output. Number the three character sets used in the store phase C1, C2, C3 (figure one, from left to right), and the three used when retrieving C1', C2', C3' (from left to right). If bash uses UTF-8 encoding when storing (C1), but we retrieve through a Windows terminal whose default encoding is GBK (C1'), the result is almost certainly garbled. Likewise, if the data is stored with set names utf8 (C2) but retrieved with set names gbk (C2'), the result is guaranteed to be garbled.
2. The three steps within a single phase use inconsistent encodings
That is, within either flow above, if two or more of the three steps in the same direction use different encodings, decoding errors become possible. If a lossless conversion between the two differing character sets is impossible (described in more detail below), garbled characters are guaranteed. For example: our shell uses UTF-8, MySQL's character_set_client is configured as GBK, and the table is defined with charset=utf8. Garbling is then inevitable.
So here's a quick demonstration of the situation:
master [localhost] {msandbox} (test) > CREATE TABLE charset_test_utf8 (id int primary key auto_increment, char_col varchar(50)) charset = utf8;
Query OK, 0 rows affected (0.04 sec)

master [localhost] {msandbox} (test) > SET NAMES gbk;
Query OK, 0 rows affected (0.00 sec)

master [localhost] {msandbox} (test) > INSERT INTO charset_test_utf8 (char_col) VALUES ('中文');
Query OK, 1 row affected, 1 warning (0.01 sec)

master [localhost] {msandbox} (test) > SHOW WARNINGS;
+---------+------+---------------------------------------------------------------------------+
| Level   | Code | Message                                                                   |
+---------+------+---------------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xAD\xE6\x96\x87' for column 'char_col' at row 1 |
+---------+------+---------------------------------------------------------------------------+
1 row in set (0.00 sec)

master [localhost] {msandbox} (test) > SELECT id, hex(char_col), char_col FROM charset_test_utf8;
+----+----------------+----------+
| id | hex(char_col)  | char_col |
+----+----------------+----------+
|  1 | E6B6933FE69E83 | ????     |
+----+----------------+----------+
1 row in set (0.01 sec)
Why MySQL encodes and decodes at all
Since everything in the system is transmitted as a binary stream, why not write that binary stream straight into the table file? Why go through two encode/decode steps before storing?
- The client-to-server decoding exists because MySQL needs to perform lexical and syntactic parsing on the incoming binary stream. Without decoding and validating it, the server could not even tell whether an incoming binary stream is an insert or an update.
- The file-to-engine decoding exists so that character boundaries in the binary stream can be identified. A simple example: we want the first two characters of a column, so we execute a statement like select left(col,2) from table. The storage engine reads the column value E4B8ADE69687 from the file. If this value is segmented by GBK into three characters (E4B8, ADE6, 9687), then the value returned to the client should be E4B8ADE6; if it is segmented by UTF-8 into E4B8AD and E69687, then E4B8ADE69687 (two characters) should be returned. Clearly, character-level operations cannot be performed inside the storage engine if the data is not decoded after being read from the data file (a runnable sketch of this follows).
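This segmentation difference can be reproduced directly in MySQL. A sketch using the standard UNHEX and CONVERT functions and the byte string from the example above:

mysql> SELECT LEFT(CONVERT(UNHEX('E4B8ADE69687') USING utf8), 2);
-- returns both UTF-8 characters: 中文
mysql> SELECT HEX(LEFT(CONVERT(UNHEX('E4B8ADE69687') USING gbk), 2));
-- returns E4B8ADE6, the first two characters under GBK segmentation

The same six bytes produce different LEFT() results purely because of the character set used to segment them.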
About "wrong-in, wrong-out" (错进错出)
The most widespread piece of garbling lore in MySQL is 错进错出 ("wrong-in, wrong-out"). It refers to the phenomenon where the client's (web or shell) character encoding differs from the final table's character encoding, yet as long as the character sets used when storing and when retrieving are consistent with each other, the output still comes back without garbling. However, wrong-in-wrong-out does not hold for arbitrary pairs of character set encodings. Assume the client's encoding is C and the MySQL table's character set is S. For wrong-in-wrong-out to be possible, both of the following conditions must hold:
- When MySQL receives the request, the binary stream produced under encoding C is losslessly decodable under S
- When MySQL returns the data, the binary stream produced under encoding S is losslessly decodable under C
Lossless encoding conversion
So what counts as a lossy conversion, and what counts as lossless? Suppose we want to convert a character X, represented in encoding A, into its representation in encoding B, and B's character repertoire does not contain X: we call that conversion lossy. Why would two encodings differ in the characters they can represent? If you have read the blogger's earlier article on character sets and character encodings, or have a basic understanding of character encoding, you know that each character set supports a limited number of characters and that the scripts they cover differ. The ranges that UTF-8 and GBK can represent are as follows:
- GBK: a single character is encoded in two bytes in the range 8140 - FEFE, excluding values of the form **7F; the total is around 27,000 characters
- UTF-8: a single character is encoded in one to three bytes (for characters in the Basic Multilingual Plane), with the range depending on the byte count: one byte 00 - 7F, two bytes C280 - DFBF, three bytes E0A080 - EFBFBF
Because UTF-8 can represent far more characters than GBK, it is easy to find a conversion from UTF-8 to GBK that is lossy. Using a character mapper, we pick a character that is clearly not in the GBK code table, try to store it into a GBK-encoded table, and then read it back to observe the behavior of a lossy conversion.
The character is: ਅ GURMUKHI LETTER A, Unicode: U+0A05, UTF-8: E0 A8 85
The result of storing it in MySQL is as follows:
master [localhost] {msandbox} (test) > CREATE TABLE charset_test_gbk (id int primary key auto_increment, char_col varchar(50)) charset = gbk;
Query OK, 0 rows affected (0.00 sec)

master [localhost] {msandbox} (test) > SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)

master [localhost] {msandbox} (test) > INSERT INTO charset_test_gbk (char_col) VALUES ('ਅ');
Query OK, 1 row affected, 1 warning (0.01 sec)

master [localhost] {msandbox} (test) > SHOW WARNINGS;
+---------+------+-----------------------------------------------------------------------+
| Level   | Code | Message                                                               |
+---------+------+-----------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xE0\xA8\x85' for column 'char_col' at row 1 |
+---------+------+-----------------------------------------------------------------------+
1 row in set (0.00 sec)

master [localhost] {msandbox} (test) > SELECT id, hex(char_col), char_col, char_length(char_col) FROM charset_test_gbk;
+----+---------------+----------+-----------------------+
| id | hex(char_col) | char_col | char_length(char_col) |
+----+---------------+----------+-----------------------+
|  1 | 3F            | ?        |                     1 |
+----+---------------+----------+-----------------------+
1 row in set (0.00 sec)
The lossy step happens at the third conversion of the store flow (client charset to table charset); see the flowchart above for details.
We can see that when MySQL cannot find the GBK counterpart of a UTF-8 character, it converts it to an error marker (here, the question mark). Each character set implementation internally defines the behavior and substitution rule for this situation. For example, when a conversion to UTF-8 cannot find a corresponding character, the character is replaced with � (U+FFFD, the replacement character) if no error is thrown.
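This substitution behavior can be observed without creating any table, using MySQL's CONVERT function. A sketch (the expected hex values follow from the GBK code table):

mysql> SELECT HEX(CONVERT(_utf8'ਅ' USING gbk));
-- returns 3F: GBK has no such character, so '?' is substituted
mysql> SELECT HEX(CONVERT(_utf8'中文' USING gbk));
-- returns D6D0CEC4: lossless, since both characters exist in GBK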
Does that mean every conversion between two character set encodings is lossy? No. Whether a conversion is lossy depends on the following:
- Whether the character being converted exists in both character sets
- Whether the target character set preserves the original bytes of characters it cannot represent
The first point was demonstrated by the experiment above. Here we explain the second factor behind lossy conversion. From the example just now we can see GBK's behavior when it meets a character it cannot represent: it substitutes an error marker, namely 0x3F. Some character sets, such as latin1, behave differently: when they meet characters they cannot represent, they keep the encoded bytes of the original character set, skip the character, and continue processing the rest of the data. If the target character set has this property, the 错进错出 (wrong-in, wrong-out) effect described earlier becomes achievable.
Let's take a look at the following example:
master [localhost] {msandbox} (test) > CREATE TABLE charset_test (id int primary key auto_increment, char_col varchar(50)) charset = latin1;
Query OK, 0 rows affected (0.03 sec)

master [localhost] {msandbox} (test) > SET NAMES latin1;
Query OK, 0 rows affected (0.00 sec)

master [localhost] {msandbox} (test) > INSERT INTO charset_test (char_col) VALUES ('中文');
Query OK, 1 row affected (0.01 sec)

master [localhost] {msandbox} (test) > SELECT id, hex(char_col), char_col FROM charset_test;
+----+---------------+----------+
| id | hex(char_col) | char_col |
+----+---------------+----------+
|  2 | E4B8ADE69687  | 中文     |
+----+---------------+----------+
2 rows in set (0.00 sec)
The specific flow is shown in the flowchart. It can be seen that an encoding mismatch actually occurs as soon as the MySQL server receives the data. But because the latin1 character set does no processing of characters outside its declared range and keeps their original values, wrong-in-wrong-out becomes possible.
How to avoid garbled characters
With the above understood, avoiding garbled characters becomes easy: keep the "trinity" aligned, that is, the client encoding, MySQL's character_set_client, and the table charset must be exactly the same. Then no garbling can appear. For data that is already garbled or has suffered a lossy conversion, repair is comparatively hard; the next section describes the specific methods in detail.
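A minimal sketch of the "trinity" in practice, assuming utf8 is the chosen encoding and the terminal itself is configured for UTF-8 (trinity_demo is a hypothetical example table):

mysql> SET NAMES utf8;
-- aligns character_set_client (and connection/results) with the terminal encoding
mysql> CREATE TABLE trinity_demo (id int primary key auto_increment, char_col varchar(50)) charset = utf8;
-- the table charset matches as well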
How to fix data whose encoding is already corrupted
Before introducing the correct methods, let's first look at the serious consequences of some so-called "correct methods" that circulate on the Internet.
Wrong method one
Whether judged by its syntax or by its literal meaning, ALTER TABLE ... CHARSET=xxx looks like the perfect remedy for garbled data. In fact, it does nothing at all for data that is already corrupted, and it cannot even change the character set of the table's existing columns. Let's look at the following example:
master [localhost] {msandbox} (test) > SHOW CREATE TABLE charset_test;
+--------------+--------------------------------+
| Table        | Create Table                   |
+--------------+--------------------------------+
| charset_test | CREATE TABLE `charset_test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `char_col` varchar(50) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=latin1 |
+--------------+--------------------------------+
1 row in set (0.00 sec)

master [localhost] {msandbox} (test) > ALTER TABLE charset_test CHARSET=gbk;
Query OK, 0 rows affected (0.03 sec)
Records: 0  Duplicates: 0  Warnings: 0

master [localhost] {msandbox} (test) > SHOW CREATE TABLE charset_test;
+--------------+--------------------------------+
| Table        | Create Table                   |
+--------------+--------------------------------+
| charset_test | CREATE TABLE `charset_test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `char_col` varchar(50) CHARACTER SET latin1 DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB AUTO_INCREMENT=3 DEFAULT CHARSET=gbk |
+--------------+--------------------------------+
1 row in set (0.00 sec)
As you can see, this syntax merely modifies the table's default character set. That affects only columns created later; it does not change existing columns or data.
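To confirm that existing columns kept their original character set, the per-column collation can be checked. A sketch using a standard statement (the collation shown is the latin1 default):

mysql> SHOW FULL COLUMNS FROM charset_test;
-- after the ALTER, char_col's Collation still reads latin1_swedish_ci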
Wrong method two
ALTER TABLE ... CONVERT TO CHARACTER SET ... is even more destructive than method one, because the official documentation describes its role as converting a table's data encoding from one character set to another. A short excerpt from the documentation:
To change the table default character set and all character columns (CHAR, VARCHAR, TEXT) to a new character set, use a statement like this:

ALTER TABLE tbl_name CONVERT TO CHARACTER SET charset_name [COLLATE collation_name];
In fact, this syntax only applies to tables that are not currently garbled and were not populated through the wrong-in-wrong-out (错进错出) approach. For tables whose stored encoding is already wrong because of wrong-in-wrong-out, it produces even worse results. Let's use a real example to explain what this SQL actually does and what consequences it brings. Suppose we have a table whose charset is latin1 but into which UTF-8 data was previously "mis-stored", yet the data still displays normally through the terminal: exactly the wrong-in-wrong-out situation described above. After some time we discover the mistake and plan to change the table's character set to UTF-8 without affecting the normal display of the existing data. In this case, using alter table convert to character set has the following consequences:
master [localhost] {msandbox} (test) > CREATE TABLE charset_test_latin1 (id int primary key auto_increment, char_col varchar(50)) charset = latin1;
Query OK, 0 rows affected (0.01 sec)

master [localhost] {msandbox} (test) > SET NAMES latin1;
Query OK, 0 rows affected (0.00 sec)

master [localhost] {msandbox} (test) > INSERT INTO charset_test_latin1 (char_col) VALUES ('这是中文');
Query OK, 1 row affected (0.01 sec)

master [localhost] {msandbox} (test) > SELECT id, hex(char_col), char_col, char_length(char_col) FROM charset_test_latin1;
+----+--------------------------+--------------+-----------------------+
| id | hex(char_col)            | char_col     | char_length(char_col) |
+----+--------------------------+--------------+-----------------------+
|  1 | E8BF99E698AFE4B8ADE69687 | 这是中文     |                    12 |
+----+--------------------------+--------------+-----------------------+
1 row in set (0.01 sec)

master [localhost] {msandbox} (test) > ALTER TABLE charset_test_latin1 CONVERT TO CHARACTER SET utf8;
Query OK, 1 row affected (0.04 sec)
Records: 1  Duplicates: 0  Warnings: 0

master [localhost] {msandbox} (test) > SET NAMES utf8;
Query OK, 0 rows affected (0.00 sec)

master [localhost] {msandbox} (test) > SELECT id, hex(char_col), char_col, char_length(char_col) FROM charset_test_latin1;
+----+--------------------------------------------------------+--------------+-----------------------+
| id | hex(char_col)                                          | char_col     | char_length(char_col) |
+----+--------------------------------------------------------+--------------+-----------------------+
|  1 | C3A8C2BFE284A2C3A6CB9CC2AFC3A4C2B8C2ADC3A6E28093E280A1 | è¿™æ˜¯ä¸­æ–‡ |                    12 |
+----+--------------------------------------------------------+--------------+-----------------------+
1 row in set (0.00 sec)
From this example we can see that, for a table containing wrong-in data, this command not only fails to "set the wrong right" but thoroughly ruins the data: even the binary encoding of the data is changed.
Correct method one: Dump & Reload
This method is clumsy, but it is easy to carry out and easy to understand. In short, it takes three steps:
- Export the data to a file through the same wrong-in-wrong-out channel
- Re-create the table with the correct character set
- Import the previously exported file back into the new table
Continuing with the example above: UTF-8 data was "mis-stored" into a latin1-encoded table, and we now need to change the table encoding to UTF-8. The following commands do it:
shell> mysqldump -u root -p -t --skip-set-charset --default-character-set=latin1 test charset_test_latin1 > data.sql
# -t (--no-create-info) dumps only the rows, so the reload will not re-create the table with the old charset;
# dumping with latin1 replays the "wrong-out" path, so the file contains the original UTF-8 bytes.
# Open data.sql in a text editor and make sure it displays as UTF-8 without garbled characters.
# (Drop or rename the old table first so the CREATE TABLE below succeeds.)
shell> mysql -uroot -p -e 'CREATE TABLE charset_test_latin1 (id int primary key auto_increment, char_col varchar(50)) charset = utf8' test
shell> mysql -uroot -p --default-character-set=utf8 test < data.sql
Correct method two: convert to binary & convert back
This method is more of a trick: it uses binary data as the intermediate form. Since binary data carries no encoding semantics, converting encoded data to binary performs no actual data conversion; converting binary data back to an encoded type makes the target encoding validate the bytes. These two properties together amount to simulating a "wrong-out" inside MySQL, turning the wrongly stored data back into correctly labeled data.
Again with the example above: UTF-8 data was "mis-stored" into a latin1-encoded table, and we need to change the table encoding to UTF-8. The following commands do it:
mysql> ALTER TABLE charset_test_latin1 MODIFY COLUMN char_col VARBINARY(50);
mysql> ALTER TABLE charset_test_latin1 MODIFY COLUMN char_col varchar(50) CHARACTER SET utf8;
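A sketch of how the repair can be verified afterwards, assuming the session is switched to utf8:

mysql> SET NAMES utf8;
mysql> SELECT id, hex(char_col), char_col, char_length(char_col) FROM charset_test_latin1;
-- hex(char_col) is still E8BF99E698AFE4B8ADE69687: the bytes were never changed, only relabeled,
-- so char_length(char_col) now reports 4 and 这是中文 displays correctly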