MySQL Chinese Character fields are sorted by pinyin

Source: Internet
Author: User
MySQL uses the default Character Set latin1. That is to say, the Chinese character fields are stored directly using the GBK internal code encoding. When you need to perform pinyin sorting for some fields with Chinese characters

MySQL uses the default Character Set latin1. That is to say, the Chinese character fields are stored directly using the GBK internal code encoding. When you need to perform pinyin sorting for some fields with Chinese characters

MySQL uses the default Character Set of latin1. That is to say, the Chinese character fields are stored directly using the GBK internal code encoding, if you want to sort the fields with Chinese characters in pinyin (especially those with names), the order by keyword cannot be used by default.

After searching on the Internet, most of the methods on the Internet are for databases that use the utf8 character set. The main methods are as follows:

1) directly convert the field to gbk, for example:

SELECT * FROM table order by convert (chinese_field USING gbk );

Or simply change the corresponding field to the gbk character set.

I tested the above method in my database, or simply sorted by fields. The sorting result is not satisfactory.

2) look-up table

Create a new table to store the correspondence between the pinyin initials and the first Chinese characters using the initials. Then, write a function and convert it to gbk and then look up the table to obtain the initials of the first word of the field content.

I also tried this method, which is too troublesome and cannot be sorted correctly for my database.

Later, I searched some chinese character encoding materials and found that the GBK internal code encoding adopts the pinyin sorting method. (generally, 3755 first-level Chinese characters are sorted by pinyin, and second-level Chinese characters are not, however, considering that names and so on are commonly used Chinese characters, it is enough to sort the first-level Chinese characters correctly ). Based on this principle, it should be possible to sort by field directly (my database uses the Latin1 character set, and the Stored Chinese character is originally a GBK internal code), but after I try it, I find that it will not work. Refer to the look-up method in method 2 above. I converted the field content to hexadecimal encoding, and then arranged it. That's OK!

This is the final solution: SELECT * FROM table order by hex (chinese_field) is simple!

This is the result of data sorting output in my example, for example:

Appendix: Chinese character encoding method Overview

ASCII

The ASCII code is a 7-bit code with the encoding range of 0x00-0x7F. The ASCII character set includes English letters, Arabic numerals, punctuation marks, and other characters. 0x00-0x20 and 0x7F contain 33 control characters.

The system that only supports ASCII Code ignores the maximum bit of each byte and considers the low 7 bits as the valid bit. HZ character encoding was designed to transmit Chinese characters in a 7-digit ASCII system. In the early days, many email systems only support ASCII encoding. To transmit Chinese emails, BASE64 or other encoding methods must be used.

GB2312

GB2312 is designed based on the location code. The encoding table is divided into 94 areas, each of which corresponds to 94 characters. The combination of the area code and the location code of each character is the location code of the Chinese character. Generally, the location code is represented by a 10-digit number. For example, if the value is 1601, it indicates 16-digit and 1-digit. The corresponding character is "ah ". Add 0xA0 to the area code and bit Code respectively to get GB2312 encoding.

In the location code, the 01-09 area is the symbol and number area, the 16-87 area is the Chinese character area, and the 10-15 and 88-94 areas are undefined blank areas. It divides the recorded Chinese characters into two levels: the first level is 3755 commonly used Chinese characters, which are placed in the 16-55 area and arranged in the order of Chinese pinyin letters/PEN; the second-level Chinese characters are 3008 frequently used Chinese characters, which are placed in Area 56-87 and arranged in sequence by the beginning/strokes. First-level Chinese characters are sorted by pinyin. This gives you the range of a pinyin In the first-level Chinese character location. Many programs that can obtain pinyin based on Chinese characters are compiled based on this principle.

In addition to common simplified Chinese characters, the GB2312 Character Set also contains Greek letters, Japanese hirakana, Katakana letters, and Russian Spanish letters. You can use traditional Chinese characters to test whether some systems only support GB2312 encoding.

The encoding range of GB2312 is 0xA1A1-0x7E7E. After undefined areas are removed, the actual encoding range is 0xA1A1-0xF7FE.

The EUC-CN can be understood as an alias for GB2312, which is exactly the same as GB2312.

The location code should be considered as the definition of Character Set, defining the included characters and character location, while GB2312 and EUC-CN are the encoding of this character set in the actual computer environment. HZ and ISO-2022-CN are two types of codes corresponding to the location code Character Set, both of which use a 7-bit encoding space to support Chinese characters. The relationship between the location code and GB2312 encoding is a bit like that.

GBK

GBK encoding is a superset of GB2312 encoding, which is fully compatible with GB2312. At the same time, GBK contains all the CJK Chinese characters in the Unicode basic multilingual plane. Like GB2312, GBK also supports Greek letters, Japanese Kana letters, Russian letters, and other characters, but does not support tabulation characters (non-Chinese characters) in Korean ). GBK also contains the Chinese radical and vertical punctuation characters not included in GB2312.

The overall GBK encoding range is 0x8140-0xFEFE, excluding a combination of 0 x 7F for low bytes. The high byte range is 0x81-0xFE, and the low byte range is 0x40-7E and 0x80-0xFE.

The GBK character with a low byte of 0x40-0x7E has some special characteristics, because these characters occupy the location of the ASCII code, which may cause some system troubles.

Some systems use characters (such as "|") in 0x40-0x7E as special symbols. When locating these symbols, the system does not determine whether these symbols belong to a low byte of a gbk character, this will cause incorrect judgment. This problem does not exist in environments that support GB2312. It should be noted that the environment supporting GBK is smaller than a byte 0x80 and may not be ASCII characters. In addition, it is best to use ASCII characters smaller than 0x40 for some special characters, in this way, you can quickly locate a Chinese character without worrying about the other half of the Chinese character. Big5 encoding also has problems.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.