Basic Concepts
• Character refers to the smallest semantic symbol in human language. For example, 'A' and 'B;
• Given a series of characters, each character is assigned a value, which is used to represent the corresponding character. This value is the character encoding (encoding ). For example, if 'A' is given a value of 0 and 'B' is given a value of 1, 0 is the encoding of 'A;
• Given a series of characters and the corresponding encoding, a set of all these characters and encoding pairs is a character set ). For example, if the given character list is {'A', 'B'}, {'A' => 0, 'B' => 1} is a character set;
• Collation refers to the comparison rules between characters in the same character set;
• Only after confirming the character order can an equivalent character set be defined and the relationship between characters in size be defined;
• Each collation only corresponds to one character set, but one character set can correspond to multiple character sequences, one of which is the default collation );
• Names in the collation of MySQL follow naming conventions: names starting with character sets corresponding to the collation; names starting with _ Ci (Case Insensitive) and _ CS (case sensitive) or end with _ Bin (compare by encoding value. For example, in the collation "utf8_general_ci", "a" and "A" are equivalent;
MySQL Character Set settings
• System variables:
-Character_set_server: Default internal operation Character Set
-Character_set_client: Character set used by client source data
-Character_set_connection: Connection layer Character Set
-Character_set_results: Query Result Character Set
-Character_set_database: Default Character Set of the currently selected Database
-Character_set_system: System metadata (field name, etc.) Character Set
-AlsoCollation _The preceding variables are used to describe the collation.
• Use introducer to specify the character set of the text string:
-Format: [_ charset] 'string' [collate collation]
-Example:
• Select _ Latin1 'string ';
• Select _ utf8 'Hello 'collate utf8_general_ci;
-Text strings modified by introducer are directly converted to internal character sets for processing without extra transcoding during the request.
Character Set conversion process in MySQL
1. MySQL Server converts the request data from character_set_client to character_set_connection when receiving the request;
2. Before performing internal operations, convert the request data from character_set_connection to the internal operation character set. The method is as follows:
• Set the character set value for each data field;
• If the preceding value does not exist, use the default Character Set Value of the corresponding data table (MySQL extension, non-SQL standard );
• If the preceding value does not exist, use the default Character Set Value of the corresponding database;
• If the preceding value does not exist, use character_set_server to set the value.
3. Convert the operation result from the internal character set to character_set_results.
FAQs
• No connection character set is set before UTF-8 encoding is inserted to a data table whose default character set is utf8. During query, the connection character set is set to utf8.
-The default settings of the MySQL server are used during insertion. character_set_client, character_set_connection, and character_set_results are Latin1;
-The data to be inserted will go through the character set conversion process of Latin1 => Latin1 => utf8. During this process, each inserted Chinese character will be changed from the original three bytes to 6 bytes for storage;
-The query result will go through the utf8 => utf8 Character Set conversion process, and the 6 bytes saved will not be returned, resulting in garbled characters ......
• Set the connection character set to utf8 before inserting utf8-encoded data into a data table with the default Character Set Latin1
-During insertion, character_set_client, character_set_connection, and character_set_results are set to utf8;
-The inserted data is converted to the character set utf8 => utf8 => Latin1. If the raw data contains \ u0000 ~ Unicode characters outside the \ u00ff range will be converted to "?" because they cannot be expressed in the Latin1 character set. (0x3f) symbol. In future queries, the content cannot be restored regardless of the character set setting.
Methods for detecting Character Set Problems
• Show character set;
• Show collation;
• Show variables like 'character % ';
• Show variables like 'collation % ';
• SQL functions hex, length, and char_length
• SQL functions charset and collation
Suggestions for using MySQL character sets
• When creating databases/Tables and performing database operations, try to explicitly specify the character set used, instead of relying on the default settings of MySQL. Otherwise, MySQL upgrades may cause great problems;
• Although Latin1 can be used in databases and character sets to solve garbled characters in most cases, the disadvantage is that SQL operations cannot be performed in characters, generally, setting the database and connected character set to utf8 is a good choice;
• When using MySQL c api, you can use mysql_options to set the mysql_set_charset_name attribute to utf8 immediately after initializing the database handle, so that you do not need to explicitly use the set names statement to specify the character set to be connected, when mysql_ping is used to reconnect to a persistent connection, the connection character set is reset to utf8;
• For MySQL PHP APIs, page-level PHP programs generally run for a short period of time. After connecting to the database, you can explicitly use the set names statement to set a character set for connection. However, when using a persistent connection, note that the connection is smooth and the Set names statement is used to explicitly reset the character set after reconnection.
Other considerations
• Default_character_set in my. CNF only affects the connection character set when the MySQL command is used to connect to the server, and does not affect applications that use the libmysqlclient library!
• SQL function operations on fields are generally performed using the internal operation character set, which is not affected by the character set setting of the connection.
• The bare strings in SQL statements will be affected by the connected character set or introducer settings. For comparison and other operations, they may produce completely different results. Be careful!
The ultimate exposure to MySQL garbled characters
Hehe, In the age when the title party was rampant, it was impossible to draw a strong impression without making a sensational title. However, in my personal experience, it is not difficult to create an "ultimate secret" on this issue.
The garbled MySQL problem is probably the most frequently occurring problem at the beginning of system development. If you look at moves (sometimes wrong or wrong), you will always be able to solve the problem on the face, but the hidden dangers will not be exclusive. When may they come out? Someone provides a simple and thorough solution: all database tables are coded in UTF-8, it should have been peaceful, but sometimes there is still a problem.
Is this really complicated? Why is it so hard to tame? In fact, the problem is far simpler than imagined!
The key is to see through one point: the problem does not come out of the encoding methods of databases and tables, as long as the "connection" encoding method is completed.
· Connection encoding method
MySQL is smart enough, so long as you speak accurately with it, it will execute commands well, and you don't have to worry about the details of its internal storage method.
Therefore, you only need to do two things:
1. clearly tell MySQL what encoding method you will use in SQL statements;
2. When sending SQL statements, use the encoding method stated above accurately.
In order to do the first thing, the simplest way is to execute a "set names..." command after a database connection is established.
The second point is related to the specific programming language. For PHP, "string" is a value, which is a "Byte string" organized by certain encoding methods. If it does not meet the requirements, iconv () can be used for conversion. For Java, the string for writing SQL statements must be ucs2 encoded, you can use useunicode and characterencoding in the connection string to tell the driver which encoding method to convert the string value in the statement, or convert the string to the byte [] of the specific encoding method, and then use setbytes () and setbinarystream ()
To the preparedstatement.
As long as the above two points are completed, you will not encounter MySQL garbled issues. Of course, for the entire system, there are other links that may produce garbled characters, such as webpage encoding, which can be easily resolved as long as they are carefully identified.
· Encoding of database tables
As we can see above, to prevent MySQL from garbled characters, the key point is the encoding method of database connection, and the encoding method of database tables is not important. MySQL will automatically complete the encoding conversion.
So how do I select the encoding method when designing a database? In fact, we should consider several other factors at this time, such:
1. Character Set compatibility. If you want to save characters such as simplified, traditional, and Japanese in a field at the same time, You must select an encoding method that can accommodate these character sets at the same time. Generally, UTF-8 is the best choice.
2. storage efficiency. If you expect to save only Simplified Chinese, choosing GBK is more efficient than UTF-8 storage. However, it seems that this factor is not very important in most cases.
3. Reduce the encoding and conversion links during database access based on actual application conditions. As shown in the transmission process, "c Conversion" is the client's conversion (either the application or the driver, such as the JDBC driver ), "s conversion" is the server-side encoding conversion (automatically implemented by MySQL ).
Client Program ----------> client interface ----------> server interface ------------> database storage
Ucs2/GBK /... C Conversion code transmission s conversion UTF-8/GBK /...
Client Program <------------ client interface <---------- server interface <------------ database storage
Ucs2/GBK /... C Conversion code transmission s conversion UTF-8/GBK /...