MySQL character encoding mechanism, garbled Chinese characters, and solutions

Source: Internet
Author: User
Tags mysql manual

http://hzaufarm.tk/index.php/topic,7.0.html

I believe many readers try to be careful about character encoding, yet it is still a big headache when garbled text appears. Drawing on the experience of others and the explanations in the MySQL manual, this article uses concrete operations and examples to explain MySQL's character encoding mechanism and to solve the garbled-text problem.

[Symptom]

The web page xxx.php is saved as UTF-8 in EditPlus,
my.ini sets default-character-set=utf8 in both the [client] and [mysqld] sections,
the table is created with CREATE TABLE `xxx` (myname VARCHAR(255)) ENGINE=MyISAM DEFAULT CHARSET=utf8,
and xxx.php runs INSERT, UPDATE, and SELECT statements that contain Chinese text.
Everything seems fine, but the SELECT results are garbled when viewed in phpMyAdmin, garbled in a third-party tool such as SQLyog, and garbled in mysqldump output, which is frustrating. Of course, if you choose a BINARY/VARBINARY/BLOB type when creating the table, nothing appears garbled, because the data is stored as raw bytes and MySQL applies no character encoding to it.

[Finding the problem]
Although default-character-set=utf8 is set in my.ini, running the following commands shows something else:

mysql> SHOW VARIABLES LIKE 'character%';

+--------------------------+--------------------------+
| Variable_name            | Value                    |
+--------------------------+--------------------------+
| character_set_client     | latin1                   |
| character_set_connection | latin1                   |
| character_set_database   | utf8                     |
| character_set_filesystem | binary                   |
| character_set_results    | latin1                   |
| character_set_server     | utf8                     |
| character_set_system     | utf8                     |
| character_sets_dir       | D:\mysql\share\charsets\ |
+--------------------------+--------------------------+
8 rows in set (0.00 sec)

mysql> SHOW VARIABLES LIKE 'collation_%';

+----------------------+-------------------+
| Variable_name        | Value             |
+----------------------+-------------------+
| collation_connection | latin1_swedish_ci |
| collation_database   | utf8_general_ci   |
| collation_server     | utf8_general_ci   |
+----------------------+-------------------+
3 rows in set (0.00 sec)

Not every entry in the Value column is utf8. Some are still latin1, namely the client, connection, and results settings.

The following concepts come up repeatedly:
character_set_client: the character set of the statements sent by the client.
character_set_connection: after the server receives a statement, it converts it from character_set_client to character_set_connection.
character_set_results: the character set the server uses to return result sets and error messages to the client.

The relationship between these character sets is:
Input path: client → connection → server
Output path: server → connection → results

In other words, data on each path passes through three character sets. Take garbled output as an example, using the character sets in the tables above: utf8 data in the server is converted to latin1 for the connection and sent as latin1 in the results; if the page is declared as UTF-8, the browser then interprets those latin1 results as UTF-8. When characters cannot be represented in the target character set, as with Chinese text and latin1, the conversion is destructive and irreversible, so the original text cannot be recovered.
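The destructive step on the output path can be reproduced outside MySQL. The following is only a minimal Python sketch, not MySQL's own code: it assumes Python's latin-1 codec is a close enough stand-in for MySQL's latin1, and the function name to_results is made up for illustration.

```python
def to_results(text: str, results_charset: str) -> bytes:
    """Encode server-side text for the client, substituting '?' for
    characters that the results character set cannot represent."""
    return text.encode(results_charset, errors="replace")


stored = "中文 abc"                       # correctly stored utf8 data
as_latin1 = to_results(stored, "latin-1")
as_utf8 = to_results(stored, "utf-8")

print(as_latin1)                 # b'?? abc': the Chinese characters are gone
print(as_utf8.decode("utf-8"))   # the text survives when results is utf8
```

Once the characters have been replaced by '?', no later conversion on the page side can bring them back.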

An example on the input path: Chinese characters are entered on the xxx.php page. Because xxx.php is UTF-8 encoded, it submits the characters to MySQL as utf8 bytes. However, MySQL's client and connection character sets are both latin1, while the server is utf8. So when storing the data, MySQL first treats the bytes submitted by xxx.php as latin1 and then converts them from latin1 to utf8. When a third-party tool or phpMyAdmin selects from this table, what it reads as utf8 is utf8 text that has already been through a latin1 conversion, so of course it looks garbled. The solution is to make every step of the process utf8.
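The input-path mix-up can be imitated in a few lines of Python. This is a rough sketch under stated assumptions: store and fetch are hypothetical helpers that model the conversions described above, and Python's latin-1 codec stands in for MySQL's latin1 (which differs for a few byte values), so the byte counts apply to the sketch only.

```python
def store(client_bytes: bytes, client_charset: str, column_charset: str) -> bytes:
    """Input path: read the bytes as client_charset, store them as column_charset."""
    return client_bytes.decode(client_charset).encode(column_charset)


def fetch(stored: bytes, column_charset: str, results_charset: str) -> bytes:
    """Output path: convert stored data to results_charset for the client."""
    return stored.decode(column_charset).encode(results_charset)


page_bytes = "中文".encode("utf-8")     # what the UTF-8 page submits

# client/connection/results are latin1, the column is utf8
stored = store(page_bytes, "latin-1", "utf-8")
print(len(page_bytes), len(stored))    # 6 12: every byte was encoded again

# The same page reads through a latin1 session, which undoes the mistake
print(fetch(stored, "utf-8", "latin-1") == page_bytes)              # True

# A tool with a utf8 session receives the stored bytes unchanged
print(fetch(stored, "utf-8", "utf-8").decode("utf-8") == "中文")     # False
```

This would explain why xxx.php itself shows the text correctly while every other tool shows garbled characters.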

[Solve the problem]

1. Start with my.ini

# CLIENT SECTION
[client]
default-character-set=utf8
# SERVER SECTION
[mysqld]
# MySQL 5.5 and later use character-set-server=utf8 here instead
default-character-set=utf8

default-character-set=utf8 must be added to both sections above. By default, both use latin1.

Restart MySQL and run:

mysql> SHOW VARIABLES LIKE 'character%';
mysql> SHOW VARIABLES LIKE 'collation_%';

Make sure the character set and collation values are all utf8 (character_set_filesystem stays binary).

2. Set the character set and collation of the database, table, or column.
For example, specify utf8 when creating the table. A collation for the table or its columns is optional; if it is omitted, utf8_general_ci, the default collation for utf8, is used:

CREATE TABLE `tablename4` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `varchar1` varchar(255) DEFAULT NULL,
  `varbinary1` varbinary(255) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;

A few explanations:
1. The server side has four levels: server, database, table, and column. A character set and a collation can be set at each level. If a lower level sets its own character set, that setting is used; if it does not, it inherits the character set of the level above.
Therefore, once the database-level character set is set, the table-level character set can be omitted. If both the database level and the table level are set, data is stored according to the table-level setting.
2. The relationship between character set and collation.
A character set is a set of symbols and their encodings. A collation is a set of rules for comparing the characters in a character set.
A simple example: suppose there are four letters, 'A', 'B', 'a', and 'b', and we assign each a number: 'A' = 0, 'B' = 1, 'a' = 2, 'b' = 3. The letter 'A' is a symbol, and the number 0 is the encoding of 'A'. The four letters together with their encodings form a character set. A character set can have several collations, for example one that compares the encodings directly, or one that ignores case.
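The four-letter example can be written out as a small Python sketch. The names CHARSET, binary_key, and case_insensitive_key are invented for illustration and have nothing to do with MySQL's internals; the point is only that one character set can be compared under different rules.

```python
# The toy character set from the text: symbol -> encoding
CHARSET = {"A": 0, "B": 1, "a": 2, "b": 3}


def binary_key(s: str) -> list:
    """Binary collation: compare the encodings directly."""
    return [CHARSET[ch] for ch in s]


def case_insensitive_key(s: str) -> list:
    """Case-insensitive collation: treat 'a' as 'A' and 'b' as 'B'."""
    return [CHARSET[ch.upper()] for ch in s]


words = ["b", "A", "a", "B"]
print(sorted(words, key=binary_key))             # ['A', 'B', 'a', 'b']
print(sorted(words, key=case_insensitive_key))   # ['A', 'a', 'b', 'B']
```

Under the case-insensitive rule 'a' and 'A' compare as equal, so their relative order is simply the order in which they appeared in the input.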

MySQL selects the character set and collation as follows:

· If both CHARACTER SET X and COLLATE Y are specified, character set X and collation Y are used.
· If CHARACTER SET X is specified without COLLATE Y, character set X and its default collation are used.
· If COLLATE Y is specified without CHARACTER SET X, the character set that collation Y belongs to and collation Y are used.
· Otherwise, the character set and collation of the level above are used (for a database, those of the server).
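The selection rules, together with inheritance between levels, can be sketched as one small Python function. This is an illustration only: resolve and DEFAULT_COLLATION are hypothetical names, the table lists just three character sets with their MySQL 5.1 default collations, and the sketch does not reject a collation that does not belong to the given character set, as MySQL would.

```python
# Default collation of a few character sets
DEFAULT_COLLATION = {
    "utf8": "utf8_general_ci",
    "latin1": "latin1_swedish_ci",
    "gbk": "gbk_chinese_ci",
}


def resolve(charset=None, collation=None, parent=("latin1", "latin1_swedish_ci")):
    """Return the (character set, collation) pair an object ends up with."""
    if charset and collation:
        return charset, collation
    if charset:
        return charset, DEFAULT_COLLATION[charset]
    if collation:
        # a collation name starts with the name of its character set
        return collation.split("_", 1)[0], collation
    return parent                      # nothing specified: inherit


server = ("latin1", "latin1_swedish_ci")
database = resolve(charset="utf8", parent=server)
table = resolve(parent=database)
column = resolve(collation="gbk_chinese_ci", parent=table)
print(database)   # ('utf8', 'utf8_general_ci')
print(table)      # ('utf8', 'utf8_general_ci')
print(column)     # ('gbk', 'gbk_chinese_ci')
```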

3. Save the page file as UTF-8 and declare UTF-8 in the page:

For static pages, add <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
For PHP files: header('Content-Type: text/html; charset=UTF-8');
For JSP and Servlet:
A) Set the encoding used by the web container. Add the following code to the beginning of your servlet's doGet or doPost method:
request.setCharacterEncoding("UTF-8");
response.setCharacterEncoding("UTF-8");
B) Specify the encoding on each JSP page: <%@ page pageEncoding="UTF-8" %>
C) Append useUnicode=true&characterEncoding=UTF-8 to the URL used to connect to the database, for example:
url="jdbc:mysql://db1?useUnicode=true&characterEncoding=UTF-8",
In an XML file the ampersand must be escaped: url="jdbc:mysql://db1?useUnicode=true&amp;characterEncoding=UTF-8",

4. Execute mysql_query("SET NAMES utf8") before performing CRUD operations
Many people online say you must execute SET NAMES utf8. It seemed unreasonable to me to issue SET NAMES on every database connection, so whether it is really needed had to be tested. I looked through the MySQL 5.1 documentation, which says:

The SET NAMES 'x' statement is equivalent to these three statements:

mysql> SET character_set_client = x;
mysql> SET character_set_results = x;
mysql> SET character_set_connection = x;

Setting default-character-set=utf8 in the client section of my.ini changes these same three variables to utf8.
I tested this today. If the page is already UTF-8, SET NAMES is not needed, and the SQLyog database client does not need it either, because MySQL read default-character-set=utf8 from my.ini at initialization and set all three variables above to utf8. However, if the page uses a different encoding, SET NAMES must be run with the character set that matches the page.

The test code, xxx.php, is as follows:

<?php
// Uses the legacy mysql_* extension, as in the original test (removed in PHP 7)
header('Content-Type: text/html; charset=UTF-8');
mysql_connect("localhost", "root", "password") or die("Could not connect: " . mysql_error());
mysql_select_db("test");
mysql_query("SET NAMES utf8"); // this line can be removed
// test string plus a timestamp, so every run inserts a distinct row
$str = "CHN Software Development Co., Ltd., JPN has been released, and Kor has been released, rus has been written into zookeeper and zookeeper has been written into zookeeper" . time();
$sql = "INSERT INTO tablename4 (varchar1, varbinary1) VALUES ('" . $str . "', '" . $str . "')";
echo $sql . "<hr>";
mysql_query($sql);
$result = mysql_query("SELECT id, varchar1, varbinary1 FROM tablename4");
while ($row = mysql_fetch_array($result, MYSQL_BOTH)) {
    printf("ID: %s, varchar1: %s, varbinary1: %s <br>", $row[0], $row["varchar1"], $row["varbinary1"]);
}
mysql_free_result($result);
?>

With these settings, any utf8 characters inserted from the PHP page come back as the same Chinese characters, with no garbling, whether they are read from the PHP page, from phpMyAdmin, or from a third-party MySQL client. mysqldump output also shows proper Chinese characters. Problem solved.

[Cmd]
First, running mysql.exe inside cmd.exe on a Chinese Windows system is a special case. The default code page of cmd.exe on Chinese Windows is cp936 (GBK), which cannot display every utf8 character, so seeing garbled characters on this terminal is normal. The problem can be avoided in the shell terminals of Unix-like systems.
Second, the input and output encoding of cmd is fixed by the system; on a Chinese system it is GBK. (Compare cmd with a web browser: the browser switches its input and output encoding to match the encoding of the web page.) Therefore, to type and display statements containing Chinese characters, such as INSERT and SELECT, in cmd, make sure MySQL's client, connection, and results character sets are GBK or gb2312 by running SET NAMES gbk. If you import an SQL file with source, make sure the client and connection character sets match the encoding of the external SQL file (open it in Notepad and choose Save As to see the file's encoding). If the file is not GBK, for example UTF-8, run SET NAMES utf8 before source so that the Chinese data is imported correctly. A SELECT in cmd will still show garbled characters at that point, because the results character set is now utf8. That does not matter: the data stored in MySQL is correct, and the garbling comes only from a utf8 result set being displayed by cmd, which expects GBK.
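The display side of this can be imitated in Python. It is only a sketch: Python's gbk codec stands in for the cp936 console, and errors="replace" is used so that undecodable bytes become placeholder characters instead of raising an exception.

```python
text = "中文"

# SET NAMES utf8: the server returns UTF-8 bytes
result_bytes = text.encode("utf-8")

# cmd.exe on a Chinese system decodes whatever arrives as GBK
shown = result_bytes.decode("gbk", errors="replace")
print(shown == text)                               # False: garbled on screen

# SET NAMES gbk: the server returns GBK bytes, which the console understands
print(text.encode("gbk").decode("gbk") == text)    # True
```

The stored data is the same in both cases; only the bytes sent to the console differ.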

This left a deep impression on me. Today I helped an intern solve a problem with garbled characters in an imported database. When I received the database file, I first checked its statements and confirmed that it was utf8 encoded, then imported it from the cmd command line (the client, connection, results, database, and server character sets in my database are all utf8). I ran SET NAMES utf8 first and then source; no error was reported, but SELECT showed garbled Chinese. I checked the SQL file's encoding again: UTF-8, as expected.

Then it occurred to me that cmd.exe input on Windows is GBK, so I ran SET NAMES gbk. Running source again returned errors:

ERROR 1406 (22001): Data too long for column 'xxx' at row xx
ERROR 1366 (HY000): Incorrect string value: '\xAB\xA5...' for column 'xxx' at row xx

Then I remembered that the SQL file was UTF-8, so I converted it to ANSI and ran source again: no error, and SELECT displayed Chinese correctly.
After thinking it over, my explanation is as follows.
First import: SET NAMES utf8, then source. No error is reported, but SELECT shows garbled Chinese.

Cause: no error is reported because the SQL file is UTF-8 encoded and SET NAMES utf8 makes client, connection, and results all utf8, so the encodings agree and the data can be assumed to be stored correctly. SELECT output is garbled in cmd because cmd's encoding cannot be changed: it is always GBK. Client, connection, and results are all utf8, but the final display is GBK, so the encodings disagree and the text is garbled. My guess was that the data in the database was fine, and I planned to test that the next day.

The experiment the next morning confirmed it: I dropped the database and re-imported it the same way. No error was reported, and although SELECT still showed garbled characters in cmd, the data displayed normally in SQLyog and other clients.

Second import: SET NAMES gbk, then source. An error is reported.

Cause: the SQL file is UTF-8 encoded, but SET NAMES gbk changes client, connection, and results to GBK. The client declares its input to be GBK and then feeds the server a UTF-8 file, so the input does not match the declared encoding and the import fails.

Third import: convert the SQL file to ANSI first, then source. No error is reported, and SELECT is normal.

Cause: the SET NAMES gbk from the second import is still in effect, so client, connection, and results are all GBK, and the ANSI-encoded SQL file is GBK-compatible, so the import succeeds.
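The three import attempts can be summarized in a short Python sketch. It is a simplification under stated assumptions: import_ok is a hypothetical helper that only asks whether the file's bytes, read under the declared character set, give back the original text. It does not reproduce MySQL's actual error codes, and ANSI on a Chinese Windows system is taken to mean GBK.

```python
def import_ok(text: str, file_encoding: str, declared: str) -> bool:
    """True if bytes written as file_encoding survive being read as declared."""
    file_bytes = text.encode(file_encoding)
    try:
        return file_bytes.decode(declared) == text
    except UnicodeDecodeError:
        return False


attempts = [
    ("first: UTF-8 file, SET NAMES utf8", "utf-8", "utf-8"),
    ("second: UTF-8 file, SET NAMES gbk", "utf-8", "gbk"),
    ("third: ANSI (GBK) file, SET NAMES gbk", "gbk", "gbk"),
]
for label, file_encoding, declared in attempts:
    print(label, import_ok("中文", file_encoding, declared))
```

Only the second attempt fails, which matches the experiment: the data is stored correctly whenever the declared character set agrees with the file's real encoding.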

Finally, I recommend reading the MySQL manual to understand the theory, so that you have a basis for analysis when errors occur. I also suggest reading the English documentation. This is not to say the Chinese version is bad, but people make mistakes, and translation inevitably introduces errors. Reading the Chinese manual today, I found a translation error that was quite misleading, and copies found online may also be missing some content ......
 
