MySQL character encoding system (II) -- Data transmission encoding
The character encoding system of MySQL can be divided into two parts: one is about how to manage the encoding of character data when the database server stores data tables, and the other is about how to encode the data transmitted between the client and the database server. MySQL character encoding system (1) -- Data storage encoding discusses data storage encoding. This article discusses data transmission encoding.
MySQL clients can be divided into two types: one is the official client written in C language-MySQL command program; the other is the client written by common programmers using JDBC and other ctor APIs. The first type is discussed here.
Windows client
The character encoding processing section of the MySQL command program in Windows and Linux is not equivalent. it is the client character encoding conversion logic in Windows:
Three character variables exist on the server, while charset_info exist on the client.
When the client starts to connect to the server, the client sets charset_info as the specified encoding based on the configuration parameters, and notifies the server to set the three character variables to the same encoding.
Data transmission process
- The client reads a line of command text from the standard input in the console, which is encoded as the operating system code;
- The client transcodes the command from the system to the encoding set by the charset_info variable of the client;
- The client sends the command text to the server;
- The server decodes the received text to character_set_client encoding, which is usually consistent with charset_info of the client;
- The server transcodes the command text to character_set_connection;
- The server executes commands to generate results;
- Transcode the result to character_set_results and send it to the client;
- The client decodes the received result as charset_info encoding, which is usually consistent with character_set_results;
- The client transcodes the result to the operating system code and outputs it to the standard output on the console.
On Windows, the MySQL program uses the Unicode Console Read API when reading the Console. Therefore, the original string obtained by the program from the Console is actually UTF16 encoded, therefore, the "operating system code" here is not a Windows GBK, but should be regarded as UTF16.
Linux client
Is the character encoding conversion logic of the MySQL client in Linux:
It differs from Windows in that it does not forcibly convert the operating system encoding string from the terminal standard input to charset_info encoding, it will not forcibly convert the charset_info encoding result string output to the terminal to the operating system encoding. That is to say, the charset_info variable will be ignored by the MySQL program on Linux. Of course, the data transmission process of the Linux client is several steps less than that of the Windows client.
Garbled Trap simulation
According to the characteristics of the MySQL program on the Linux platform, it is easy to generate a possible trap: after inserting Chinese data into the database through the MySQL client in the Linux system, the query results are not garbled, however, garbled characters are obtained from querying the same table on the MySQL client of the Windows platform with the correct configuration.
We can simulate the above situation as follows:
Create a table that contains only one GBK string field and UTF8 string field. In Linux, start MySQL to connect to the database server, and change the three character variables of the server from the default UTF8 to GBK. Insert Chinese data to the database and select immediately. The result is normal:
However, when you use the Windows MySQL client for query, the results are garbled:
Garbled analysis
In combination with the previous data transmission process, you can know what the problem is:
- The client reads a line of utf8 encoded (Linux default) command text from the terminal, ignores the charset_info variable, and directly sends the text to the server;
- The server sets the three character variables to gbk by running the charset GBK command in advance. Therefore, the server considers the received text to be GBK encoded;
- Then, the server directly saves text strings to the data table without any transcoding, because the first field of the data table is GBK.
So far, the data table has a UTF8 string, while the server queries the string when it is GBK on the same Linux client:
- The strings in the table are directly sent to the client without any transcoding, because character_set_results is also GBK;
- After receiving the query result, the client ignores charset_info and directly outputs it to the terminal standard output without transcoding;
- The data obtained by the terminal is actually UTF-8 encoded, so the data is output normally.
When querying on a Windows client:
- The character string (UTF8) in the table is directly sent to the client without any transcoding, because character_set_results is also GBK;
- After receiving the query result, the client considers it as charset_info encoding (GBK at this time );
- The client transcodes the query result from charset_info to UTF16, and then calls the Unicode Console Write API output to see garbled characters.
Garbled "repair"
If the Windows client also wants to see the correct results, it should be intentionally incorrectly configured:
- Run the charset utf8 command to set charset_info and character to UTF8;
- Run the set names gbk command to set the character of the three servers to GBK;
- Now select, the results do not look garbled.