MySQL Character Set Knowledge Summary
Character Set & character encoding
A character set is a collection of characters. There are many character sets, and each contains a different number of characters. The characters may be English letters, Chinese characters, or characters from other languages.
Common character sets include ASCII, LATIN1, GB2312, GBK, GB18030, and Unicode. A character encoding represents each character of a character set with one or more bytes. Each character set has its own encoding, so the same character generally has different byte values under different encodings.
ASCII is a character set based on the Latin alphabet. It uses the low 7 bits of one byte to represent a character, and the high bit is always 0. LATIN1 also uses one byte per character, but it uses the high bit as well, which extends the range of characters it can represent.
The GB2312, GBK, and GB18030 character sets support Chinese characters, and their coverage grows in that order: GB2312 < GBK < GB18030. GBK encodes a character in one or two bytes. Bytes 00-7F are identical to ASCII, and Chinese characters take two bytes, with the first byte in the range 81-FE so that it cannot be confused with 00-7F.
Unicode is an industry standard that aims to cover the written characters of all languages. It has several encodings, such as UTF-8, UTF-16, and UTF-32. UTF-8 uses 1-4 bytes per character, UTF-16 uses 2 or 4 bytes, and UTF-32 always uses 4 bytes.
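The byte counts above are easy to check. The following is a small Python sketch (Python is used only for illustration; the helper name byte_lengths and the sample characters are assumptions, not part of MySQL). The "-be" codec variants are used so that no byte order mark is counted.

```python
# Compare how many bytes one character needs under several encodings.
ENCODINGS = ["gbk", "gb18030", "utf-8", "utf-16-be", "utf-32-be"]

def byte_lengths(ch):
    """Return {encoding: byte count}; None where the character cannot be encoded."""
    result = {}
    for enc in ENCODINGS:
        try:
            result[enc] = len(ch.encode(enc))
        except UnicodeEncodeError:
            result[enc] = None  # the character set does not contain this character
    return result

# An ASCII letter, a common Chinese character, and an emoji outside the BMP.
for ch in ["A", "明", "😀"]:
    print(ch, byte_lengths(ch))
```

The emoji shows why UTF-16 is not a fixed 2-byte encoding and why GBK is a smaller character set than GB18030.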
MySQL and Character Sets
Wherever text is involved, character sets and encodings are involved too. In MySQL, an SQL statement is typed into the mysql client, sent to the MySQL server process over TCP/IP, and the data is finally written to files on the server. Characters are transmitted or stored at every one of these steps, and each step has a character set associated with it. MySQL exposes these settings as system variables. The following table lists the character set system variables and their meanings:
Variable name | Description
character_set_server | Default character set for internal operations
character_set_client | Character set of the data sent by the client
character_set_connection | Connection layer character set
character_set_results | Character set of query results
character_set_database | Default character set of the currently selected database
character_set_system | Character set of system metadata (field names, etc.)
MySQL character encoding and conversion process
If the preceding system variables are set differently, for example character_set_client is UTF8 and character_set_database is GBK, encoding conversion is required. So how does character set conversion work? Suppose the GBK string "小明" has to be stored as UTF8. For each Chinese character in "小明", MySQL looks up the corresponding code in the UTF8 encoding table and stores that; encoding conversion does not involve complex algorithms. MySQL character set conversion involves the following steps:
1) convert data from character_set_client settings to character_set_connection settings;
2) convert data from character_set_connection settings to the character set of the table column;
3) convert the operation result from the table field character set to character_set_results settings.
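The three steps above can be modelled as a chain of decode/encode operations. The following Python sketch is a toy model, not MySQL's implementation; the function names transcode, insert_value, and select_value are hypothetical, and Python codec names stand in for MySQL character set names.

```python
def transcode(data: bytes, src: str, dst: str) -> bytes:
    """Interpret bytes as characters of `src`, then re-encode them in `dst`."""
    if src == dst:
        return data  # identical settings: bytes pass through untouched
    return data.decode(src).encode(dst)

def insert_value(sent: bytes, client: str, connection: str, column: str) -> bytes:
    """Steps 1 and 2: client -> connection -> column. Returns the stored bytes."""
    in_connection = transcode(sent, client, connection)
    return transcode(in_connection, connection, column)

def select_value(stored: bytes, column: str, results: str) -> bytes:
    """Step 3: column -> results. Returns the bytes sent back to the client."""
    return transcode(stored, column, results)

# A correctly configured session: the terminal really sends UTF-8,
# the session variables say utf-8, and the column is gbk.
sent = "小明".encode("utf-8")
stored = insert_value(sent, client="utf-8", connection="utf-8", column="gbk")
returned = select_value(stored, column="gbk", results="utf-8")
print(len(sent), len(stored), returned == sent)
```

In this model, garbled text appears whenever the declared `client` or `results` value does not match the bytes the terminal actually sends or expects.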
Next, a common scenario illustrates the conversion process. The user runs the mysql command line in a terminal (for a remote connection, SecureCRT) and executes "insert into T values (1, '小明')". The cases below follow the binary content of the string '小明' as it is transferred.
A) SecureCRT (the user's terminal) uses the utf8 character set, character_set_client = gbk, character_set_connection = gbk, and table T uses the gbk character set.
Because character_set_client, character_set_connection, and the table character set are all GBK, no encoding conversion takes place. Therefore, although the table's character set is GBK, the bytes stored for "小明" are not GBK-encoded; they are the UTF8 byte stream sent by the terminal, and the two Chinese characters occupy 6 bytes. Reading is the reverse process and likewise involves no encoding conversion, so the query still displays "小明" correctly.
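Case A can be traced byte by byte. This is a minimal Python sketch under the assumption that the two-character string is 小明 (whose UTF-8 bytes are E5B08FE6988E) and that every server-side setting, including character_set_results, is gbk; the variable names are illustrative only.

```python
terminal = "utf-8"             # encoding the terminal (SecureCRT) actually uses
typed = "小明"

sent = typed.encode(terminal)  # bytes the terminal puts on the wire
stored = sent                  # client = connection = column = gbk: no conversion
returned = stored              # results = gbk as well: no conversion on the way back
shown = returned.decode(terminal)

print(stored.hex().upper(), len(stored))   # UTF-8 bytes sitting in a gbk column
print(shown)                               # the terminal still renders the original text
print(stored.decode("gbk"))                # what the server believes the column contains
```

The display is correct only because the same wrong assumption is applied symmetrically on write and read; the server itself sees different characters.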
B) Starting from A), change character_set_client to utf8 (for example with SET NAMES utf8, which also sets character_set_results) and query the inserted value.
We can see that the returned value is "灏忔槑". Because the table's character set is GBK and the client asks for UTF8, the server treats the stored byte stream E5B08FE6988E as the GBK characters "灏忔槑" and converts it to E7818FE5BF94E6A791, the UTF8 byte stream for those characters. The query result is therefore displayed as "灏忔槑" in SecureCRT; this is what we call garbled characters.
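The conversion in case B can be reproduced from the two hex strings quoted above. A minimal Python sketch, assuming the stored bytes are E5B08FE6988E and that Python's gbk codec matches MySQL's GBK mapping for these characters:

```python
stored = bytes.fromhex("E5B08FE6988E")   # UTF-8 bytes of 小明, held in a gbk column

# Server side: the column is declared gbk and the results are requested in utf8,
# so the bytes are read as GBK characters and re-encoded as UTF-8.
as_gbk_text = stored.decode("gbk")
returned = as_gbk_text.encode("utf-8")

# Terminal side: SecureCRT decodes what it receives as UTF-8.
shown = returned.decode("utf-8")

print(returned.hex().upper())
print(shown)
```

Six bytes become nine: each pair of stored bytes was taken for one GBK character, and each of those characters needs three bytes in UTF-8.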
C) Starting from B), set the SecureCRT character set to GBK to see the effect of the terminal's character set on the result.
We can see that a different garbled string is returned. The transfer on the server side is the same as in B): the server still returns the byte stream E7818FE5BF94E6A791. The only difference is in the display step: SecureCRT now decodes those bytes as GBK instead of UTF8, so what is shown changes.
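Case C changes only the final decoding step. In this Python sketch the returned bytes are taken from case B; errors="replace" is an assumption about how a terminal copes with a byte sequence that is not valid GBK (nine bytes cannot split evenly into two-byte characters), and the exact glyphs depend on the terminal.

```python
returned = bytes.fromhex("E7818FE5BF94E6A791")   # bytes the server sends, as in case B

shown_utf8_terminal = returned.decode("utf-8")                  # case B display
shown_gbk_terminal = returned.decode("gbk", errors="replace")   # case C display

print(shown_utf8_terminal)
print(shown_gbk_terminal)
print(shown_gbk_terminal == shown_utf8_terminal)
```

The server's output is byte-for-byte identical in B and C; only the terminal's interpretation differs, which is why the terminal encoding has to be considered together with the session variables.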
SQL statements related to character sets
1) view character set encoding settings
SHOW VARIABLES LIKE '%CHARACTER%';
2) Set character set encoding
SET NAMES xxx;
This statement sets the character set for the client session. It assigns three system variables at once: character_set_client, character_set_connection, and character_set_results.
3) modify the database Character Set
ALTER DATABASE DATABASENAME CHARACTER SET XXX;
This statement modifies only the database's default character set, which becomes the default for tables created afterwards. The character set of existing tables is not affected.
4) modify the character set of a table
ALTER TABLE TABLENAME CHARACTER SET XXX;
This statement only modifies the character set of the table and affects the default definition of new columns in the table. The character set of existing columns is not affected.
ALTER TABLE TABLENAME CONVERT TO CHARACTER SET XXX;
This statement modifies both the table character set and the character set of the existing columns, and converts the encoding of the existing data.
5) modify the column Character Set
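ALTER TABLE `TABLE_NAME` MODIFY COLUMN `COLUMN_NAME` COLUMN_TYPE CHARACTER SET xxx;
MODIFY COLUMN requires the full column definition, so COLUMN_TYPE stands for the column's existing data type (for example a VARCHAR type).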
6) query the binary encoding of Characters
SELECT HEX(COL_NAME) FROM TABLE_NAME;
SELECT LENGTH(COL_NAME) FROM TABLE_NAME;
For a GBK table, if one Chinese character occupies 3 bytes instead of 2, the character set is set incorrectly at some stage. Figure 1 shows such a case, caused by the terminal being UTF8 while the mysql client and the database are both GBK.
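The check described above can be scripted once the HEX() output is in hand. This Python sketch is only a heuristic, and it assumes the value consists solely of common Chinese characters (2 bytes each in GBK, 3 bytes each in UTF-8); the function names are hypothetical.

```python
def bytes_per_char(hex_value: str, char_count: int) -> float:
    """Average stored bytes per character, from HEX(col) and the expected character count."""
    return len(bytes.fromhex(hex_value)) / char_count

def looks_like_utf8_in_gbk_column(hex_value: str, char_count: int) -> bool:
    """True when an all-Chinese value in a GBK column uses 3 bytes per character."""
    return bytes_per_char(hex_value, char_count) == 3

# HEX() output of a two-character value stored through a misconfigured session.
print(looks_like_utf8_in_gbk_column("E5B08FE6988E", 2))

# The same two characters stored as real GBK take 2 bytes each.
good_hex = "小明".encode("gbk").hex()
print(looks_like_utf8_in_gbk_column(good_hex, 2))
```

Values that mix ASCII and Chinese characters need a different test, since ASCII takes one byte in both encodings.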
Default MySQL character set: latin1
MySQL 4.x uses the latin1 character set (also known as ISO-8859-1) by default; latin1 is a single-byte encoding. This raises a question: is there a problem writing Chinese characters to a latin1 table and reading them back? The answer is that there is no problem as long as the settings are right. Assume that SecureCRT is set to UTF8, and that character_set_client, character_set_connection, and the table character set are all set to latin1. As described in the third section, no character set conversion takes place when data is written or read in this case: the UTF8 bytes of the Chinese characters are written to the database unchanged, and after they are read back SecureCRT decodes them into the corresponding Chinese characters, so the user is not affected. However, if character_set_client and character_set_connection are inconsistent with the table character set, garbled characters may occur.
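Both outcomes can be modelled in a few lines. In this Python sketch, the codec "latin-1" (ISO-8859-1) stands in for MySQL's latin1, and errors="replace" is an assumption that models a lossy conversion by substituting "?"; depending on the SQL mode, a real server may reject the value instead.

```python
terminal = "utf-8"
sent = "小明".encode(terminal)      # 6 UTF-8 bytes from the terminal

# Consistent settings: session and column are all latin1, so the bytes
# are stored and returned unchanged, and the terminal decodes them again.
stored = sent
shown = stored.decode(terminal)

# Inconsistent settings: the session is declared utf8 but the column is latin1.
# The server must now convert, and latin1 has no Chinese characters.
lossy = sent.decode("utf-8").encode("latin-1", errors="replace")

print(shown, len(stored))
print(lossy)
```

The pass-through works for display, but the server does not know the real characters, so sorting, LENGTH-based limits, and later conversions operate on the raw bytes.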