Character set problem handling in Mysql data verification process _mysql

Source: Internet
Author: User

Scene:
Main Library Db:utf8 Character set
DB:GBK character set for standby library

Demand:
Verifies that the primary data is consistent and fixes

Validation process:
Set the main library connection to UTF8, set the standby connection to GBK, query separately, and compare the returned result sets by verbatim paragraphs.

Show Results:
The original character of the same characters, data verification that inconsistent.

Reason Analysis:
For the main library only, because the character set of the connection is UTF8, the character of the returned characters is encoded in the UTF8 format, the GBK format for the standby, and the comparison by the string comparison function strcasecmp in the program, obviously different character set encoding, the same characters have different binary, So the results are certainly not equal.

Further analysis:
So what kind of character set should be used to establish a connection in this case? GBK or UTF8. In fact, the choice of any kind of character set is OK, as long as access to the main library and keep the character set of the library can be consistent, the only difference is that, if the selected character set is inconsistent with the client's character set, may result in the normal display of characters, that is, characters appear garbled

We take the client's character set as an example, in detail three kinds of cases: "The client here can be considered securecrt"
Note: The green box represents the DB character set, the yellow box represents the connection character set, and the Orange box represents the client
In the first case:

This is the case where the main library returns the GBK encoding of the character, the UTF8 encoding of the return character of the library, and therefore the error occurs when the field is compared.

In the second case:

Access to the main library of the connection is unchanged, the standby connection from UTF8 to GBK, so the database will be returned to the DB of the character set to GBK returned to the client, then for the client, the same characters are through the GBK encoding, so the binary is equal, the checksum result is correct.

In the third case:

Access to the primary and standby are UTF8, so for the main library, the character encoding returned to the client is changed from GBK to UTF8, and the primary and standby are UTF8 encoded with the correct checksum results. However, because the client is GBK encoding, the characters of the returned characters are garbled, but the correctness of the verification result is not affected.

Repair:

Since you choose to access the same character set as any one by one of the primary reserve, it does not affect the correctness of the checksum, so does it affect the repair? Because the coding range of UTF8 is larger than the GBK encoding range, it is possible that some characters GBK cannot be represented if the GBK connection is used to access UTF8 encoded DB.

We take the second case that the main library is GBK, the standby is UTF8, and the GBK is used to access UTF8. Assuming that some of the characters are lost when the UTF8 is converted to GBK, the primary repository is definitely inconsistent because there are some characters GBK cannot be represented. Suppose the fix statement is as follows:

Update T set c1=master_value where C1=slave_value and id=?

where t represents the table name, the ID is the primary key that represents a row, master_value the value of the C1 column for the main library, slave_value the value of the standby C1 column. At this point, slave_value due to UTF8 to GBK has been lost, so the statement execution ultimately affects 0 rows of records, cannot be repaired.

Conclusion:

When a client accesses two different character sets for data validation, the connection takes a larger-range character set. For example, our common character set representation range is as follows:

Latin<gb2312<gbk<utf8

Attached: MySQL client and server communication time character set code conversion process

Related parameters:

–character_set_client: Character set used by client source data

–character_set_connection: Connection Layer Character Set

–character_set_results: Query result Character Set

–character_set_database: The default character set for the currently selected database

–character_set_system: System metadata (field name, etc.) character set

1. Client Request Server

1) Convert the client character set to connection character set

2 Convert connection character set to DB internal character set

2. Server returns results to client

1) Convert DB internal character set to connection character set

2) Convert connection character set to Character_set_results character set

3. Set the character set command: Set names character encoding

Specifies the character set that the client communicates with the server, including requests and returns.

SET NAMES ' x ' is equivalent to:

SET character_set_client = x;

SET character_set_results = x;

SET character_set_connection = x;

Figures:

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.