MySQL Character Set GBK to UTF8

Last Update:2015-05-04 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

In the production environment, the character set of the MySQL database needs to be upgraded for various reasons. For example, to support Chinese characters, the character set of latin1 must be upgraded to GBK, you need to upgrade GBK to UTF8. There are a lot of online migration processes. Today I want to talk about the impact of Character Set conversion on the business. I will take GBK conversion to UTF8 as an example.

There are two main points:

1. chinese characters occupy 2 bytes in GBK encoding and 3 bytes in UTF8 encoding, while mysql indexes require a total length of no more than 767 bytes, therefore, the number of index characters is shortened (383-> 255). In particular, for a unique index, the length of the index field must be less than 256 characters.

2. After encoding conversion, the field sorting changes.

This article mainly aims to explain how field sorting is affected after encoding conversion. The reason and analysis will be given based on the mysql source code. First, let's look at the test case. Assume that cmp_t (GBK encoding) and cmp_t2 (UTF8 encoding) are the tables before and after migration.

Test cases:

Operation

Cmp_t (GBK)

Cmp_t2 (UTF8)

GBK table:

Select c1, hex (c1) from cmp_t;

UTF8 table:

Select c1, hex (c1) from cmp_t2;

+ ------ + --------- +

| C1 | hex (c1) |

+ ------ + --------- +

| 1 | D2BB

| 2 | B6FE

| 3 | C8FD

| A | 61

| 1 | 31

+ ------ + --------- +

| C1 | hex (c1) |

+ ------ + --------- +

| 1 | E4B880

| 2 | E4BA8C

| 3 | E4B889

| A | 61

| 1 | 31

+ ------ + --------- +

GBK table:

Select c1, hex (c1) from cmp_t where c1> 'A' order by c1;

UTF8 table:

Select c1, hex (c1) from cmp_t2 where c1> 'A' order by c1;

+ ------ + --------- +

| C1 | hex (c1) |

+ ------ + --------- +

| 2 | B6FE |

| 3 | C8FD

| 1 | D2BB

+ ------ + --------- +

| C1 | hex (c1) |

+ ------ + --------- +

| 1 | E4B880

| 3 | E4B889

| 2 | E4BA8C

+ ------ + --------- +

We can obtain the following information from the result returned by the above operation:

The Chinese Character occupies 2 bytes in GBK encoding and 3 bytes in UTF8 encoding.
Numbers and English characters are only one byte in GBK and UTF8 encoding.
Different Chinese characters are encoded in UTF8 and GBK, and their order changes after sorting.

Principle Analysis:

Mysql uses the sortcmp function to compare strings, for GBK strings and UTF8 strings respectively use the interface my_strnncollsp_gbk and my_strnncollsp_utf8, these two functions are implemented in ctype-gbk.c and ctype-utf8.c, respectively, the implementation logic of the two functions is similar, but each function has its own set of comparison rules. Next I will mainly describe the comparison logic and comparison rules of my_strnncollsp_utf8.

Comparison logic:

Obtains the first byte of a string.
According to the UTF-8 encoding rules (Appendix 1), the characters are determined to consist of several bytes.
Calculate the character weighting value (2) based on a certain algorithm and compare the size. If the UTF8 encoding specification is not met, it is considered as garbled and compared in binary format (3)
Skip the number of bytes returned by Step 2 and compare the next character.

Appendix 1: [interface: my_utf8_uni]

According to the UTF-8 encoding rules, characters that conform to the encoding rules occupy 1-6 bytes.

Returns the first byte s of the string.

If s <0x80

Indicates that the character occupies 1 byte.

Elif s <0xe0

2 bytes of Characters

Elif s <0xf0

Represents three bytes of Characters

Else s <0xf8

4 bytes of Characters

Elif s <0xfc

Indicates that the number of characters occupies 5 bytes.

Elif s <0xfe

6 bytes of Characters

English and numeric characters are compatible with ASCII codes. The encoding value is less than 0x80, so only one byte is occupied. The first byte of utf8 encoding for Chinese characters is between [0xe0, 0xf0, so it occupies 3 bytes.

Appendix 2: utf8 encoding comparison rules

Value = (s [0] & 0x0f) <12) | (s [1] ^ 0x80) <6) | (s [2] ^ 0x80)

S [0], s [1], s [2] indicates the three bytes that comprise Chinese characters. Calculate the man characters involved in the comparison to obtain value1 and value2, compare the values of value1 and value2 to determine the character size.

Appendix 3: Binary comparison [interface: bincmp]

Memcmp function comparison, that is, byte comparison.

Therefore, if the business needs to rely on Chinese character comparison, you need to consider the risk of Character Set upgrade (GBK-> UTF8), mainly because the index or primary key contains string fields need special attention, if the string is determined to contain only numbers and characters, there is no problem.

This article permanently updates the link address:

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

MySQL Character Set GBK to UTF8

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

MySQL Character Set GBK to UTF8

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support