MySQL Character Set GBK to UTF8

Source: Internet
Author: User

MySQL Character Set GBK to UTF8

In the production environment, the character set of the MySQL database needs to be upgraded for various reasons. For example, to support Chinese characters, the character set of latin1 must be upgraded to GBK, you need to upgrade GBK to UTF8. There are a lot of online migration processes. Today I want to talk about the impact of Character Set conversion on the business. I will take GBK conversion to UTF8 as an example.

There are two main points:

1. chinese characters occupy 2 bytes in GBK encoding and 3 bytes in UTF8 encoding, while mysql indexes require a total length of no more than 767 bytes, therefore, the number of index characters is shortened (383-> 255). In particular, for a unique index, the length of the index field must be less than 256 characters.

2. After encoding conversion, the field sorting changes.

This article mainly aims to explain how field sorting is affected after encoding conversion. The reason and analysis will be given based on the mysql source code. First, let's look at the test case. Assume that cmp_t (GBK encoding) and cmp_t2 (UTF8 encoding) are the tables before and after migration.

Test cases:

Operation

Cmp_t (GBK)

Cmp_t2 (UTF8)

1

 

GBK table:

Select c1, hex (c1) from cmp_t;

 

UTF8 table:

Select c1, hex (c1) from cmp_t2;

+ ------ + --------- +

| C1 | hex (c1) |

+ ------ + --------- +

| 1 | D2BB

| 2 | B6FE

| 3 | C8FD

| A | 61

| 1 | 31

+ ------ + --------- +

+ ------ + --------- +

| C1 | hex (c1) |

+ ------ + --------- +

| 1 | E4B880

| 2 | E4BA8C

| 3 | E4B889

| A | 61

| 1 | 31

+ ------ + --------- +

2

GBK table:

Select c1, hex (c1) from cmp_t where c1> 'A' order by c1;

 

UTF8 table:

Select c1, hex (c1) from cmp_t2 where c1> 'A' order by c1;

+ ------ + --------- +

| C1 | hex (c1) |

+ ------ + --------- +

| 2 | B6FE |

| 3 | C8FD

| 1 | D2BB

+ ------ + --------- +

+ ------ + --------- +

| C1 | hex (c1) |

+ ------ + --------- +

| 1 | E4B880

| 3 | E4B889

| 2 | E4BA8C

+ ------ + --------- +

We can obtain the following information from the result returned by the above operation:

  1. The Chinese Character occupies 2 bytes in GBK encoding and 3 bytes in UTF8 encoding.
  2. Numbers and English characters are only one byte in GBK and UTF8 encoding.
  3. Different Chinese characters are encoded in UTF8 and GBK, and their order changes after sorting.

Principle Analysis:

Mysql uses the sortcmp function to compare strings, for GBK strings and UTF8 strings respectively use the interface my_strnncollsp_gbk and my_strnncollsp_utf8, these two functions are implemented in ctype-gbk.c and ctype-utf8.c, respectively, the implementation logic of the two functions is similar, but each function has its own set of comparison rules. Next I will mainly describe the comparison logic and comparison rules of my_strnncollsp_utf8.

Comparison logic:

  1. Obtains the first byte of a string.
  2. According to the UTF-8 encoding rules (Appendix 1), the characters are determined to consist of several bytes.
  3. Calculate the character weighting value (2) based on a certain algorithm and compare the size. If the UTF8 encoding specification is not met, it is considered as garbled and compared in binary format (3)
  4. Skip the number of bytes returned by Step 2 and compare the next character.

Appendix 1: [interface: my_utf8_uni]

According to the UTF-8 encoding rules, characters that conform to the encoding rules occupy 1-6 bytes.

Returns the first byte s of the string.

If s <0x80

Indicates that the character occupies 1 byte.

Elif s <0xe0

2 bytes of Characters

Elif s <0xf0

Represents three bytes of Characters

Else s <0xf8

4 bytes of Characters

Elif s <0xfc

Indicates that the number of characters occupies 5 bytes.

Elif s <0xfe

6 bytes of Characters

English and numeric characters are compatible with ASCII codes. The encoding value is less than 0x80, so only one byte is occupied. The first byte of utf8 encoding for Chinese characters is between [0xe0, 0xf0, so it occupies 3 bytes.

Appendix 2: utf8 encoding comparison rules

Value = (s [0] & 0x0f) <12) | (s [1] ^ 0x80) <6) | (s [2] ^ 0x80)

S [0], s [1], s [2] indicates the three bytes that comprise Chinese characters. Calculate the man characters involved in the comparison to obtain value1 and value2, compare the values of value1 and value2 to determine the character size.

Appendix 3: Binary comparison [interface: bincmp]

Memcmp function comparison, that is, byte comparison.

Therefore, if the business needs to rely on Chinese character comparison, you need to consider the risk of Character Set upgrade (GBK-> UTF8), mainly because the index or primary key contains string fields need special attention, if the string is determined to contain only numbers and characters, there is no problem.

This article permanently updates the link address:

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.