MySQL Chinese Character fields are sorted by pinyin

Last Update:2018-07-01 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

MySQL uses the default Character Set latin1. That is to say, the Chinese character fields are stored directly using the GBK internal code encoding. When you need to perform pinyin sorting for some fields with Chinese characters

MySQL uses the default Character Set of latin1. That is to say, the Chinese character fields are stored directly using the GBK internal code encoding, if you want to sort the fields with Chinese characters in pinyin (especially those with names), the order by keyword cannot be used by default.

After searching on the Internet, most of the methods on the Internet are for databases that use the utf8 character set. The main methods are as follows:

1) directly convert the field to gbk, for example:

SELECT * FROM table order by convert (chinese_field USING gbk );

Or simply change the corresponding field to the gbk character set.

I tested the above method in my database, or simply sorted by fields. The sorting result is not satisfactory.

2) look-up table

Create a new table to store the correspondence between the pinyin initials and the first Chinese characters using the initials. Then, write a function and convert it to gbk and then look up the table to obtain the initials of the first word of the field content.

I also tried this method, which is too troublesome and cannot be sorted correctly for my database.

Later, I searched some chinese character encoding materials and found that the GBK internal code encoding adopts the pinyin sorting method. (generally, 3755 first-level Chinese characters are sorted by pinyin, and second-level Chinese characters are not, however, considering that names and so on are commonly used Chinese characters, it is enough to sort the first-level Chinese characters correctly ). Based on this principle, it should be possible to sort by field directly (my database uses the Latin1 character set, and the Stored Chinese character is originally a GBK internal code), but after I try it, I find that it will not work. Refer to the look-up method in method 2 above. I converted the field content to hexadecimal encoding, and then arranged it. That's OK!

This is the final solution: SELECT * FROM table order by hex (chinese_field) is simple!

This is the result of data sorting output in my example, for example:

Appendix: Chinese character encoding method Overview

ASCII

The ASCII code is a 7-bit code with the encoding range of 0x00-0x7F. The ASCII character set includes English letters, Arabic numerals, punctuation marks, and other characters. 0x00-0x20 and 0x7F contain 33 control characters.

The system that only supports ASCII Code ignores the maximum bit of each byte and considers the low 7 bits as the valid bit. HZ character encoding was designed to transmit Chinese characters in a 7-digit ASCII system. In the early days, many email systems only support ASCII encoding. To transmit Chinese emails, BASE64 or other encoding methods must be used.

GB2312

GB2312 is designed based on the location code. The encoding table is divided into 94 areas, each of which corresponds to 94 characters. The combination of the area code and the location code of each character is the location code of the Chinese character. Generally, the location code is represented by a 10-digit number. For example, if the value is 1601, it indicates 16-digit and 1-digit. The corresponding character is "ah ". Add 0xA0 to the area code and bit Code respectively to get GB2312 encoding.

In the location code, the 01-09 area is the symbol and number area, the 16-87 area is the Chinese character area, and the 10-15 and 88-94 areas are undefined blank areas. It divides the recorded Chinese characters into two levels: the first level is 3755 commonly used Chinese characters, which are placed in the 16-55 area and arranged in the order of Chinese pinyin letters/PEN; the second-level Chinese characters are 3008 frequently used Chinese characters, which are placed in Area 56-87 and arranged in sequence by the beginning/strokes. First-level Chinese characters are sorted by pinyin. This gives you the range of a pinyin In the first-level Chinese character location. Many programs that can obtain pinyin based on Chinese characters are compiled based on this principle.

In addition to common simplified Chinese characters, the GB2312 Character Set also contains Greek letters, Japanese hirakana, Katakana letters, and Russian Spanish letters. You can use traditional Chinese characters to test whether some systems only support GB2312 encoding.

The encoding range of GB2312 is 0xA1A1-0x7E7E. After undefined areas are removed, the actual encoding range is 0xA1A1-0xF7FE.

The EUC-CN can be understood as an alias for GB2312, which is exactly the same as GB2312.

The location code should be considered as the definition of Character Set, defining the included characters and character location, while GB2312 and EUC-CN are the encoding of this character set in the actual computer environment. HZ and ISO-2022-CN are two types of codes corresponding to the location code Character Set, both of which use a 7-bit encoding space to support Chinese characters. The relationship between the location code and GB2312 encoding is a bit like that.

GBK

GBK encoding is a superset of GB2312 encoding, which is fully compatible with GB2312. At the same time, GBK contains all the CJK Chinese characters in the Unicode basic multilingual plane. Like GB2312, GBK also supports Greek letters, Japanese Kana letters, Russian letters, and other characters, but does not support tabulation characters (non-Chinese characters) in Korean ). GBK also contains the Chinese radical and vertical punctuation characters not included in GB2312.

The overall GBK encoding range is 0x8140-0xFEFE, excluding a combination of 0 x 7F for low bytes. The high byte range is 0x81-0xFE, and the low byte range is 0x40-7E and 0x80-0xFE.

The GBK character with a low byte of 0x40-0x7E has some special characteristics, because these characters occupy the location of the ASCII code, which may cause some system troubles.

Some systems use characters (such as "|") in 0x40-0x7E as special symbols. When locating these symbols, the system does not determine whether these symbols belong to a low byte of a gbk character, this will cause incorrect judgment. This problem does not exist in environments that support GB2312. It should be noted that the environment supporting GBK is smaller than a byte 0x80 and may not be ASCII characters. In addition, it is best to use ASCII characters smaller than 0x40 for some special characters, in this way, you can quickly locate a Chinese character without worrying about the other half of the Chinese character. Big5 encoding also has problems.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

MySQL Chinese Character fields are sorted by pinyin

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

MySQL Chinese Character fields are sorted by pinyin

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support