MySQL Character Set Problem


http://www.codesoil.net/tag/charset

 

Character Set Problem in PHP + MySQL 4.1+

Like many people, I discovered this problem only when I moved my blog. It is a very old problem, but I summarize the relevant background here to help those who run into it later.

[Symptom]

My blog was originally hosted on a server at home. Recently I began moving it to a rented host. First I copied the files, which went smoothly. Next I exported the database from MySQL at home and imported it into the hosting provider's MySQL. The MySQL versions differ (5.x at home, 4.1.x on the rented host), but PMA (phpMyAdmin) is installed on both sides.

So I ran "Export" in PMA at home, and was immediately puzzled: when I opened the generated SQL file, why were the posts in wp_posts garbled? Checking the databases and tables showed utf8_general_ci in the Collation column. I did not know much about collation, but the character encoding should surely be UTF-8, so how could the text be garbled? Never mind, I thought, import it into the rented host first and see. The result: it was still garbled after the import.

[Prerequisites]

To solve the problem, it helps to first learn or review some basics.

First, what are a character set and a collation in MySQL? ("Collation" might be better rendered as "comparison rules".)

A character set, as the name implies, is a set of characters together with the encoding assigned to each of them. For example, the simplified Chinese character set gb2312 consists of the simplified Chinese characters it defines and their codes.

A collation is a set of rules for comparing characters. With such rules you can sort data alphabetically, sort Chinese characters in pinyin order, and so on. Different sorting rules can clearly be applied to the same character set: Chinese characters, for example, can be sorted by pinyin or by stroke count. Unicode character sets in particular cover many languages, so they can be sorted according to the conventions of different languages. There is also binary comparison, which compares the encoded values of the characters directly.

This is easy to see in practice. For example, MySQL supports two collations for gb2312: gb2312_bin and gb2312_chinese_ci. The former compares binary values, while the latter is the general-purpose comparison rule for Chinese characters.
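The difference between a binary rule and a case-insensitive rule can be illustrated outside MySQL. The following Python sketch is only a rough analogy, not MySQL's actual comparison algorithm: byte-wise comparison stands in for a _bin collation, and comparison after case folding stands in for a _ci collation. The sample letters and the function names are made up for illustration.

```python
# Rough analogy only: comparing encoded bytes resembles a *_bin collation,
# while comparing case-folded strings resembles a *_ci collation.

words = ["b", "A", "a", "B"]  # hypothetical sample data

binary_order = sorted(words, key=lambda w: w.encode("utf-8"))
case_insensitive_order = sorted(words, key=str.casefold)


def equal_bin(x: str, y: str) -> bool:
    """Equality under a binary comparison: the encoded bytes must match."""
    return x.encode("utf-8") == y.encode("utf-8")


def equal_ci(x: str, y: str) -> bool:
    """Equality under a case-insensitive comparison."""
    return x.casefold() == y.casefold()


print(binary_order)            # uppercase letters sort before lowercase ones
print(case_insensitive_order)  # 'A'/'a' and 'b'/'B' are grouped together
print(equal_bin("a", "A"), equal_ci("a", "A"))
```

The same data therefore sorts, and matches in comparisons, differently depending on which rule is applied, which is why the collation matters as well as the character set.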

Each character set has a default collation; for utf8 it is utf8_general_ci. To list the character sets MySQL supports along with their default collations, use the SHOW CHARACTER SET statement:

mysql> SHOW CHARACTER SET;
+----------+--------------------------+-------------------+
| Charset  | Description              | Default collation |
+----------+--------------------------+-------------------+
| big5     | Big5 Traditional Chinese | big5_chinese_ci   |
| dec8     | DEC West European        | dec8_swedish_ci   |
| cp850    | DOS West European        | cp850_general_ci  |
...

Second, where does MySQL need these character sets and collations? Broadly, there are three kinds in a MySQL system: server (data), connection, and client. The division looks clear at first glance, but it is less straightforward than it seems, so each is introduced in turn below.

[1] Server (data). The character set and collation of stored data can be specified at four levels: server, database, table, and column. When MySQL accesses data in a column, it uses the column's character set and collation. If the column does not specify them, it falls back to the table's; if the table does not specify them, to the database's; and if the database does not specify them either, to the server's.

So where do the server's default character set and collation come from? They can be specified in the configuration file (my.ini) or as command-line parameters to mysqld (or mysqld-nt). If, unfortunately, you specify them in neither place, MySQL uses the default character set chosen when it was compiled: latin1.

Note, however, that if multi-language support is selected when installing MySQL (as is usual for Chinese users), the installer automatically sets the following in the configuration file:

default-character-set = utf8

In this case, unless another character set is explicitly specified, every database and table created will use utf8 as its data character set (and utf8_general_ci as its collation, since that is the default collation of utf8).
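The level-by-level fallback described in [1] can be sketched as a simple lookup. This is a minimal model of the rule as stated in the text, not MySQL's implementation; the function name and the use of None for "not specified" are assumptions made for illustration.

```python
from typing import Optional

# Compiled-in default mentioned in the text for the case where nothing is set.
COMPILED_DEFAULT = "latin1"


def effective_charset(column: Optional[str],
                      table: Optional[str],
                      database: Optional[str],
                      server: Optional[str] = None) -> str:
    """Return the first character set specified, from most to least specific.

    None means "not specified at this level" (a convention of this sketch).
    """
    for level in (column, table, database, server):
        if level is not None:
            return level
    return COMPILED_DEFAULT


print(effective_charset(None, None, "utf8"))        # database setting applies
print(effective_charset("gb2312", "utf8", "utf8"))  # column setting applies
print(effective_charset(None, None, None))          # compiled-in default applies
```

The most specific level that names a character set wins, so a single column can differ from the rest of its table or database.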

Related System Variables

character_set_server : Server character set
collation_server : Server collation
character_set_database : Database character set
collation_database : Database collation

 

[2] Client. For literal strings sent by the client (such as the values in INSERT and UPDATE statements), MySQL needs to know which encoding they are in. Likewise, the values MySQL returns to the client (for example, the results of a SELECT statement) can be delivered in a specified encoding.

Related System Variables

character_set_client : Character set of the text sent from the client
character_set_results : Character set used for sending results to the client

 

[3] Connection. The connection character set and collation are the character set into which MySQL converts text received from the client, and the rules it then uses for comparison. Note that when the text is compared with a column value in the database, the column's character set and collation take precedence.

Related System Variables

character_set_connection : Character set used for the connection
collation_connection : Collation used for the connection

 

[Problem Analysis]

With this background, we can analyze the original problem: why did data that should have been stored as UTF-8 end up "garbled" in the database? Why can WordPress nevertheless display this garbled data without any problem? And why does it stop displaying correctly once imported into the rented host?

First, let's look at how the MySQL system variables are set on my home server.

Note: some system variables depend on the client, so what you see from the MySQL command line differs from what PHP sees. PMA does not seem suitable for checking them either, since it appears to change the defaults itself. To view the system variables as seen by PHP acting as the client, you can write a small PHP script like the following:

// Uses the legacy mysql extension of the PHP 4/5 era.
mysql_connect('localhost', $user, $pass);
$query = "SHOW VARIABLES";
$result = mysql_query($query);

$result then holds all the system variables. On my home server I obtained the following (only the variables related to character sets are listed):

character_set_client      latin1
character_set_connection  latin1
character_set_database    utf8
character_set_filesystem  binary
character_set_results     latin1
character_set_server      utf8
character_set_system      utf8
collation_connection      latin1_swedish_ci
collation_database        utf8_general_ci
collation_server          utf8_general_ci

As you can see, the default client and connection encodings are latin1. In other words, although the text WordPress passes to MySQL is actually encoded in UTF-8, these system variables are not set accordingly, so MySQL treats the UTF-8 bytes as latin1 text; and because the database itself is utf8, that "latin1 text" is then converted to utf8. As a result, one Chinese character ends up occupying about 6 bytes: as UTF-8 it is 3 bytes, each byte is read as one latin1 character, and each of those characters typically becomes a 2-byte UTF-8 sequence. This explains why the database holds "garbled" data.
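The byte arithmetic can be checked with a short simulation. Python is used here only because it makes the bytes easy to inspect; it stands in for the conversion the server performs. One assumption to note: Python's "latin-1" codec is ISO-8859-1, whereas MySQL's latin1 is based on cp1252, but for the bytes of the sample character chosen here the two agree.

```python
# Simulate UTF-8 bytes being declared as latin1 and then stored in a utf8 column.

original = "中"                            # one Chinese character
sent_by_client = original.encode("utf-8")  # what WordPress actually sends

# The server believes the bytes are latin1, so each byte becomes one character.
misread = sent_by_client.decode("latin-1")

# The column is utf8, so those three characters are re-encoded as UTF-8.
stored = misread.encode("utf-8")

print(len(sent_by_client), sent_by_client.hex(" "))  # 3 e4 b8 ad
print(len(stored), stored.hex(" "))                  # 6 c3 a4 c2 b8 c2 ad
```

The three original bytes are still recoverable from the six stored ones, which is why the data looks garbled in an export yet is not actually lost.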

So why do these "garbled characters" look fine when WordPress displays them? Because character_set_results is also latin1. When returning data to WordPress, MySQL converts it from utf8 to latin1, and WordPress then treats that latin1 output as UTF-8, which is exactly the inverse of the process above.
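The round trip can be sketched as two functions that mirror the write path and the read path. This is a simulation under the same assumption as before (Python's ISO-8859-1 codec standing in for MySQL's latin1); the function names and the sample post text are hypothetical.

```python
def store_as_misdeclared_latin1(text: str) -> bytes:
    """Write path: client sends UTF-8, server reads it as latin1, stores utf8."""
    return text.encode("utf-8").decode("latin-1").encode("utf-8")


def read_with_latin1_results(stored: bytes) -> bytes:
    """Read path: server converts the stored utf8 data to latin1 for the client."""
    return stored.decode("utf-8").encode("latin-1")


post = "中文"  # hypothetical post content
stored = store_as_misdeclared_latin1(post)
returned = read_with_latin1_results(stored)

# The bytes handed back to the client are the original UTF-8 bytes again.
print(returned == post.encode("utf-8"))
print(returned.decode("utf-8"))
```

The two wrong conversions cancel out only as long as both directions use the same wrong character set, which is the condition that breaks on a differently configured server.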

So why does it not display correctly on the other server? Look at the system variable settings on the rented host:

character_set_client      ujis
character_set_connection  ujis
character_set_database    ujis
character_set_results     ujis
character_set_server      ujis
character_set_system      utf8
collation_connection      ujis_japanese_ci
collation_database        ujis_japanese_ci
collation_server          ujis_japanese_ci

The default client encoding here is ujis. That is, after MySQL retrieves the utf8 data it converts it to ujis before passing it to WordPress. Text that was originally UTF-8 has thus gone through a latin1, utf8, ujis chain of conversions and is long past recognition.
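The effect of the extra conversion can be approximated as follows. This is a sketch under assumptions: Python's "euc_jp" codec stands in for MySQL's ujis, "latin-1" stands in for MySQL's latin1, and unconvertible characters are replaced with "?" here, which may not match byte for byte what a real server returns.

```python
original = "中"

# Data as stored by the home server: UTF-8 misread as latin1, saved as utf8.
stored = original.encode("utf-8").decode("latin-1").encode("utf-8")

# On the rented host the result is converted to ujis (EUC-JP), not latin1.
returned = stored.decode("utf-8").encode("euc_jp", errors="replace")

# WordPress still interprets whatever it receives as UTF-8.
shown = returned.decode("utf-8", errors="replace")

print(returned != original.encode("utf-8"))  # the original bytes are gone
print(shown != original)                     # so the post cannot display correctly
```

Because the read path no longer mirrors the write path, the inverse step that used to hide the problem never happens.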

[Solution]

The solution has been mentioned on many forums and web pages, and the issue has been raised in the WordPress Trac.

Before solving the problem, though, I wanted to know whose fault it is: MySQL's, PHP's (in particular the php_mysql extension's), or WordPress's? Or is it a user configuration problem? I tend to think it is a WordPress problem. MySQL and PHP cannot know which character encoding WordPress uses, so they cannot change the client character set on its behalf. Asking ordinary WordPress users to set the character encoding themselves would be acceptable, but then a user interface should be provided rather than requiring them to edit the source code directly.

The solution (or rather workaround) is to modify wordpress\wp-includes\wp-db.php. In the function wpdb, a little past line 40, add one statement next to the call $this->select($dbname);

$this->query("SET NAMES latin1");

Executing SET NAMES 'x' is equivalent to the following three statements:

SET character_set_client = x;
SET character_set_results = x;
SET character_set_connection = x;

With this change, the imported WordPress posts display correctly even on the rented host, where the default client character set is ujis. Of course, this is not a complete solution. It merely goes along with the existing mistake: data that was converted to utf8 as if it were latin1 is simply converted back to that so-called latin1. Other programs will be unable to read the WordPress data correctly, and more importantly, the "utf8 data" stored in the database cannot be sorted by the rules that real utf8 data should follow.

The most thorough approach is to add the SET NAMES statement mentioned above when installing WordPress, setting the client character set to utf8:

$this->query("SET NAMES utf8");

However, posts already written to the database as latin1 will then no longer display correctly. To fix them, they must go through a utf8-to-latin1 conversion that undoes the earlier mistake. If there are many posts, consider writing a program to do the conversion; if there are only a few, convert them by hand.
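A conversion program of the kind suggested could look roughly like the sketch below. It is not the author's tool and makes several assumptions: the post text has already been read from the database as correctly decoded utf8 strings, each post was mangled exactly once, and MySQL's latin1 behaves like cp1252 with the five byte values that cp1252 leaves undefined passed straight through. The function names are hypothetical.

```python
def mysql_latin1_bytes(text: str) -> bytes:
    """Encode text roughly the way a MySQL latin1 conversion would.

    Assumption: latin1 is cp1252, except that characters cp1252 cannot
    encode are tried as ISO-8859-1 so bytes such as 0x8D survive.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode("cp1252")
        except UnicodeEncodeError:
            # Raises UnicodeEncodeError again if ch is not a latin1 character.
            out += ch.encode("latin-1")
    return bytes(out)


def repair_double_encoded(text: str) -> str:
    """Undo one round of 'UTF-8 misread as latin1'.

    Raises a ValueError subclass if the text does not look double-encoded,
    so correctly stored posts are not silently damaged.
    """
    return mysql_latin1_bytes(text).decode("utf-8")


garbled = "\u00e4\u00b8\u00ad\u00e6\u2013\u2021"  # how a mangled post might read back
print(repair_double_encoded(garbled))
```

Letting the function raise on text that is not double-encoded is a deliberate choice: a batch job can catch the error and skip posts that were already written correctly. Work on a backup copy before applying anything like this to real data.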

By the way, the Chinese edition of WordPress maintained by Chinese developers already includes this statement. The information above is therefore relevant mainly to those who use the English version of WordPress and to those who like to dig into the details.

Finally, a recommended reference article: Portable PHP-MySQL connection charset fix.
