MySQL character set encoding

Source: Internet
Author: User

MySQL character set encoding Summary

An article about the MySQL character set was shared on our internal blog. I had not studied MySQL character sets before, and the article left me somewhat confused, so I went through the official MySQL documentation and wrote this summary with reference to Kaige's article. It mainly sorts out common character set problems in MySQL. If you find any errors, please point them out.

1. MySQL character set encoding

Character sets always go hand in hand with encodings. For the theory behind character sets and encodings, see my previous article. MySQL supports multiple character sets, and this article does not strictly distinguish between "character set" and "encoding". MySQL lets you set the character set at four levels: server, database, table, and column. The character set affects not only how data is stored but also how client programs interact with the database. Run show session variables like '%character%' in the mysql client to see the character set variables:

+--------------------------+--------------------------------------------------------+
| Variable_name            | Value                                                  |
+--------------------------+--------------------------------------------------------+
| character_set_client     | utf8                                                   |
| character_set_connection | utf8                                                   |
| character_set_database   | latin1                                                 |
| character_set_filesystem | binary                                                 |
| character_set_results    | utf8                                                   |
| character_set_server     | latin1                                                 |
| character_set_system     | utf8                                                   |
| character_sets_dir       | /usr/local/mysql-5.6.15-osx10.7-x86_64/share/charsets/ |
+--------------------------+--------------------------------------------------------+

Each character set in MySQL has a default collation. A character set can have several collations, but two different character sets cannot share the same collation. Because I usually use the default collation, I will not discuss collations here; I will add more later if I find anything new.

Next, let's look at what each of the character set variables listed above means:

  • character_set_client: the character set for statements that arrive from the client; the server uses it to parse the client's SQL statements. The session value of this variable is set from the character set the client requests when it connects to the server.
  • character_set_connection: the character set used for string literals that have no character set introducer.
  • character_set_results: the character set used for returning query results, such as result sets or error messages, to the client.
  • character_set_system: the character set the MySQL server uses to store metadata, usually utf8. Do not modify it.
  • character_sets_dir: the directory where the MySQL character set files are installed.
  • character_set_filesystem: the file system character set. It is used to interpret string literals that refer to file names, as in LOAD DATA INFILE and SELECT ... INTO OUTFILE statements and the LOAD_FILE() function. Before the file is opened, the file name is converted from character_set_client to character_set_filesystem. The default value is binary, which means no conversion is performed. For example, with character_set_client = gbk and the default character_set_filesystem, SELECT ... INTO OUTFILE 'file name' keeps the file name in the bytes the client sent; with character_set_filesystem = utf8, the file name is converted to UTF-8.
    A concrete case: my terminal encoding is UTF-8 and the system locale is zh_CN.UTF8. A database named test has a table t1 encoded as latin1, and I ran set names gbk in the mysql client. Without changing character_set_filesystem, SELECT * FROM t1 INTO OUTFILE '文件1' creates a file named "文件1" in the target directory. What is the encoding of that file name? The file name in the SQL statement is in the terminal encoding, UTF-8, so its bytes are '\xe6\x96\x87\xe4\xbb\xb61'. Because character_set_filesystem is binary, these bytes are not converted: the file name on disk is '\xe6\x96\x87\xe4\xbb\xb61', and it displays correctly on a zh_CN.UTF8 system. If character_set_filesystem = utf8 is set, the bytes '\xe6\x96\x87\xe4\xbb\xb61' are first decoded as GBK and then encoded as UTF-8, giving '\xe9\x8f\x82\xe5\x9b\xa6\xe6\xac\xa21', and the file name is garbled. So it is better to leave this variable at its default value.
  • character_set_server: the server's default character set. If no character set is specified when a database is created, character_set_server is used.
  • character_set_database: the character set of the default database. If there is no default database, the value is the same as character_set_server. In practice, this value is the character set of the database you are currently using. For example, after "use test", if the test database is encoded as latin1, this value is latin1; after switching with "use test2", the value becomes the character set of test2.

2. MySQL character set encoding level
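Looking back at the character_set_filesystem item in section 1, the two file-name outcomes can be checked outside MySQL. This is only a rough sketch in Python: it assumes that Python's gbk and utf-8 codecs behave like the MySQL conversions for these particular bytes, and the variable names are made up for illustration.

```python
# Bytes that a UTF-8 terminal sends for the file name '文件1'.
raw = "文件1".encode("utf-8")

# character_set_filesystem = binary: no conversion, the bytes are used as-is.
name_binary = raw

# character_set_client = gbk and character_set_filesystem = utf8:
# the bytes are (wrongly) read as GBK and then re-encoded as UTF-8.
name_converted = raw.decode("gbk").encode("utf-8")

print(name_binary)                      # original UTF-8 bytes
print(name_converted)                   # bytes of the garbled name
print(name_converted.decode("utf-8"))   # what the garbled name looks like
```

The second value matches the garbled byte sequence quoted in the article, which suggests the "decode as GBK, encode as UTF-8" reading of that step is right.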

The first part summarized how the MySQL documentation describes the character set variables. This part describes the levels at which MySQL sets a character set: server, database, table, and column (field).

Simply put, the server character set is specified by character_set_server. You can specify a character set when creating a database; if you do not, the one specified by character_set_server is used. For example, "create database t1 character set gbk" sets the character set of database t1 to gbk, so character_set_server is not used. If we run "create database t2" instead, "show create database t2" shows that t2 uses the character_set_server character set.

Similarly, a table can have its own character set, specified when the table is created. If none is specified, the database character set is used. For example, if we create table t11 in the database t1 above with "create table t11 (i int) character set utf8", t11 is encoded as utf8; without the character set clause, it would be gbk, the character set of database t1.

In addition, each column in a table can have its own character set. If none is specified for a column, the column uses the table's character set.
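The fallback order described above (column, then table, then database, then server) can be written down as a small lookup. This is a sketch of the rule only, not of how MySQL implements it: the function name effective_charset is hypothetical, and MySQL fixes the value at the moment an object is created rather than resolving it at query time.

```python
from typing import Optional


def effective_charset(server: str,
                      database: Optional[str] = None,
                      table: Optional[str] = None,
                      column: Optional[str] = None) -> str:
    """Return the most specific character set that was given explicitly."""
    for level in (column, table, database):
        if level is not None:
            return level
    # Nothing more specific was given, so the server default applies.
    return server


# Server default latin1; database t1 created with gbk; table t11 with utf8.
print(effective_charset("latin1"))                                # database t2
print(effective_charset("latin1", database="gbk"))                # database t1
print(effective_charset("latin1", database="gbk", table="utf8"))  # table t11
```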

3. MySQL connection character set

The character sets described so far rarely produce garbled text by themselves. Garbled text in MySQL usually comes from the variables character_set_client, character_set_connection, and character_set_results. You can set all three at once with the set names utf8 statement or the mysql client's charset utf8 command.

I did not understand these variables when I first met them; I understood them a little better after reading a lot of material. My understanding may still be wrong because I have not read the MySQL source code, so corrections on the exact mechanism are welcome.

According to the documentation, character set conversion on a MySQL connection involves three steps:

  • 1. character_set_client is taken to be the encoding of the SQL statement sent by the client. The server cannot know the actual encoding of the client's SQL statement, so it uses this variable as the statement's initial encoding. After receiving the statement, the server converts it to the encoding specified by character_set_connection (a string literal with an introducer such as _latin1 or _utf8 is not converted in this step). After the conversion, the statement is actually executed.
  • 2. Before the internal operation, the data in the SQL statement is converted from character_set_connection to the encoding of the corresponding column in the table.
  • 3. The result of the operation is converted from the internal (column) encoding to the character_set_results encoding.

The detailed conversion process is as follows:

Client program sends SQL statement
  |
  | Encoding: A, defined as "character_set_client"
  V
MySQL server - Conversion from encoding A to encoding B
  |
  | Encoding: B, defined as "character_set_connection"
  V
MySQL server - Execution to store data
MySQL server - Conversion from encoding B to encoding C
  |
  | Encoding: C, defined by text column encoding
  V
MySQL server - Storage
  ...
MySQL server - Storage
  |
  | Encoding: C, defined by text column encoding
  V
MySQL server - Execution to fetch data
MySQL server - Conversion from encoding C to encoding D
  |
  | Encoding: D, defined as "character_set_results"
  V
Client program receives result set
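The flow above can be imitated with plain byte conversions. What follows is a minimal sketch under some assumptions: the function names store and fetch are hypothetical, Python codecs stand in for MySQL character sets, and errors="replace" stands in for the way a non-strict MySQL substitutes '?' for characters the target cannot represent. The sample string "café" is only test data.

```python
def store(statement_bytes: bytes, client: str, connection: str, column: str) -> bytes:
    """Steps 1 and 2: encoding A (client) -> B (connection) -> C (column)."""
    text = statement_bytes.decode(client)                       # read bytes as encoding A
    as_connection = text.encode(connection, errors="replace")   # A -> B
    # B -> C; characters the column cannot represent become '?'
    return as_connection.decode(connection).encode(column, errors="replace")


def fetch(stored: bytes, column: str, results: str) -> bytes:
    """Step 3: encoding C (column) -> D (character_set_results)."""
    return stored.decode(column).encode(results, errors="replace")


sent = "café".encode("utf-8")  # what a UTF-8 client would send
for column in ("utf-8", "latin-1", "ascii"):
    stored = store(sent, "utf-8", "utf-8", column)
    returned = fetch(stored, column, "utf-8")
    print(column, stored.hex().upper(), returned.decode("utf-8"))
```

With a column that can hold every character, the value survives the round trip; with the ascii column the "é" is already lost at storage time, so no later conversion can bring it back.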

Next, let's analyze a case of garbled text in MySQL and what I believe causes it. If I am wrong, please point it out.

4. MySQL garbled instance analysis

4.1 Problematic instance

We create a test database db1 whose character set is latin1. Note that my terminal encoding is zh_CN.UTF-8 and the character set variables are set as shown in section 1. In db1 we create a table named test with the following SQL statement:

CREATE TABLE `test` (
  `gbk` varchar(2) CHARACTER SET gbk DEFAULT NULL,
  `utf8` varchar(2) CHARACTER SET utf8 DEFAULT NULL,
  `latin_utf8` varchar(6) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1;

Note that the table's character set is latin1, while the three columns differ: gbk, utf8, and latin1 (the latin_utf8 column inherits the table default). The table is built this way to verify MySQL's character set conversion process. Now for the key part:

mysql> insert into test values ("中文", "中文", "中文");
Query OK, 1 row affected, 1 warning (0.00 sec)

If you have MySQL installed, you can try this yourself. If strict mode is not enabled, the insert statement reports a warning like this:

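mysql> show warnings;
+---------+------+-------------------------------------------------------------------------------------+
| Level   | Code | Message                                                                             |
+---------+------+-------------------------------------------------------------------------------------+
| Warning | 1366 | Incorrect string value: '\xE4\xB8\xAD\xE6\x96\x87' for column 'latin_utf8' at row 1 |
+---------+------+-------------------------------------------------------------------------------------+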

We can first run a select to see the content of the test table:

mysql> select * from test;
+------+------+------------+
| gbk  | utf8 | latin_utf8 |
+------+------+------------+
| 中文 | 中文 | ??         |
+------+------+------------+

We can also view the bytes actually stored in the test table:

mysql> select hex(gbk), hex(utf8), hex(latin_utf8) from test;
+----------+--------------+-----------------+
| hex(gbk) | hex(utf8)    | hex(latin_utf8) |
+----------+--------------+-----------------+
| D6D0CEC4 | E4B8ADE69687 | 3F3F            |
+----------+--------------+-----------------+
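The three stored hex values can be reproduced without a MySQL server. This sketch assumes that Python's gbk, utf-8, and latin-1 codecs match MySQL's conversions for the two characters involved, and it uses errors="replace" as a stand-in for the '?' substitution MySQL performs when strict mode is off.

```python
# Bytes sent by the UTF-8 terminal for the inserted value.
s = "中文".encode("utf-8")

# character_set_connection is utf8, so each value is converted
# from utf8 to the encoding of its column.
gbk_col = s.decode("utf-8").encode("gbk")
utf8_col = s.decode("utf-8").encode("utf-8")
# latin1 cannot represent these characters; each one becomes '?'.
latin_col = s.decode("utf-8").encode("latin-1", errors="replace")

for name, value in (("gbk", gbk_col), ("utf8", utf8_col), ("latin_utf8", latin_col)):
    print(name, value.hex().upper())
```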

The select output shows the latin_utf8 column as garbled, and the hex function shows that the content stored in the latin_utf8 column is itself wrong. The cause is an error during encoding conversion. Let's walk through the whole conversion process using the principles above:

  • First, the mysql client sends the statement insert into test values ("中文", "中文", "中文"). The bytes of "中文" depend on the environment; mine is zh_CN.UTF-8, so "中文" is sent as \xE4\xB8\xAD\xE6\x96\x87.
  • The server treats the statement as utf8 because character_set_client = utf8. The first step converts the statement from character_set_client to character_set_connection; the two are the same here, so nothing is actually converted. (A literal with an introducer such as _latin1 or _utf8 is not converted either, because the introducer already specifies the encoding of the literal.)
  • Next, the data is stored. At this point all three values to be inserted are still the original bytes \xE4\xB8\xAD\xE6\x96\x87, and the second conversion takes place: from character_set_connection to the encoding of each column. The utf8 column has the same encoding as the connection, so no conversion is needed. The gbk column is encoded as gbk, so the original bytes s = "\xE4\xB8\xAD\xE6\x96\x87" are decoded as utf8 and encoded as GBK, that is, s.decode('utf8').encode('gbk'), and D6D0CEC4 is stored without any problem. The latin_utf8 column also needs conversion, but latin1 cannot represent these characters, so s.decode('utf8').encode('latin1') fails, and what ends up stored in the latin_utf8 column is ??, that is, 3F3F.
  • Finally, the select statement returns the result. This is the third conversion: from the column encoding to the encoding specified by character_set_results. The gbk and utf8 columns both display correctly: the gbk column goes through '\xD6\xD0\xCE\xC4'.decode('gbk').encode('utf8') on the way back, so it displays correctly on the utf8 mysql client, and the utf8 column has the same encoding as character_set_results, so \xE4\xB8\xAD\xE6\x96\x87 is returned as is. The latin_utf8 column stores 3F3F. The byte 3F is '?' in both latin1 and utf8, so the conversion leaves it as 3F3F and the final result is "??".

4.2 Solutions

This section addresses the problem from section 4.1. Based on the analysis above, if we do not want to change character_set_client and character_set_connection but still want the latin_utf8 column of the test table to store the data intact, a good way is to add an introducer. For more information about introducers, see the official MySQL documentation. The insert statement becomes:

mysql> insert into test values ("中文", "中文", _latin1 "中文");
Query OK, 1 row affected (0.02 sec)

Because the introducer _latin1 is specified for the latin_utf8 value, the first conversion from character_set_client to character_set_connection skips it, and the original utf8 bytes stay unchanged. The value is then stored in the latin1 column; the encodings match, so no conversion is needed and the latin_utf8 column stores \xE4\xB8\xAD\xE6\x96\x87. This can be verified with the following command:

mysql> select hex(gbk), hex(utf8), hex(latin_utf8) from test;
+----------+--------------+-----------------+
| hex(gbk) | hex(utf8)    | hex(latin_utf8) |
+----------+--------------+-----------------+
| D6D0CEC4 | E4B8ADE69687 | 3F3F            |
| D6D0CEC4 | E4B8ADE69687 | E4B8ADE69687    |
+----------+--------------+-----------------+

If we now select directly, will there be garbled text? The answer is yes, because the column encoding is converted to the character_set_results encoding during the query. The gbk and utf8 columns are fine, but the latin_utf8 value goes through s.decode('latin1').encode('utf8'), which garbles it:

mysql> select * from test;
+------+------+------------+
| gbk  | utf8 | latin_utf8 |
+------+------+------------+
| 中文 | 中文 | ??         |
| 中文 | 中文 | ä¸­æ–‡     |
+------+------+------------+
2 rows in set (0.01 sec)

The solution is also simple: add the binary keyword before the column in the select statement, which means the query result of this column is not converted to character_set_results:

mysql> select gbk, utf8, binary latin_utf8 from test;
+------+------+-------------------+
| gbk  | utf8 | binary latin_utf8 |
+------+------+-------------------+
| 中文 | 中文 | ??                |
| 中文 | 中文 | 中文              |
+------+------+-------------------+
2 rows in set (0.00 sec)
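Both select results for the row inserted with the introducer can be imitated with byte conversions. This is a sketch under stated assumptions: MySQL's latin1 corresponds to cp1252, so Python's cp1252 codec is used for the column, and "returning the bytes unconverted" is modelled by simply decoding the stored bytes as UTF-8 on the terminal side.

```python
# With the _latin1 introducer, the UTF-8 bytes land in the latin1 column unchanged.
stored = "中文".encode("utf-8")
print(stored.hex().upper())

# Plain SELECT: the column is converted from latin1 (cp1252) to
# character_set_results (utf8), so every stored byte becomes its own character.
converted = stored.decode("cp1252").encode("utf-8")
print(converted.hex().upper())
print(len(converted.decode("utf-8")))  # six characters instead of two

# SELECT binary latin_utf8: the bytes come back as-is, and the
# UTF-8 terminal decodes them to the original text.
print(stored.decode("utf-8"))
```

The trick works only because the terminal encoding happens to match the bytes that were originally sent; the column itself still holds text that MySQL considers to be six latin1 characters.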

5. Summary

MySQL's encoding system is complex. Based on the principles and test results above, character_set_client must match the encoding of the incoming data; otherwise garbled text may result. character_set_connection may differ from character_set_client, but keeping them the same is recommended to avoid other problems. If you have requirements for the encoding of results, set character_set_results accordingly. Personally, I find it most convenient to keep all three the same. In addition, if a column uses latin1, LIKE searches may run into problems, so it is best to choose a column encoding that fits your needs.

I put this summary together in a hurry, and there may be points I have misunderstood; please point them out. I would also like to thank Kaige: it was Kaige's blog post that first prompted me to study MySQL encoding. I will continue to update this article.

6. References
  • MySQL character set
  • In-depth MySQL character set settings
  • MySQL-Sending-Non-ASCII-Text-Big5
  • Viewing-Character-Set-Variables
