A summary of collation Utf8_unicode_ci and utf8_general_ci in Mysql

A summary of collation Utf8_unicode_ci and utf8_general_ci in Mysql _mysql

Last Update:2017-01-19 Source: Internet

Author: User

Tags mysql client

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

It took so long to find out that I didn't know what the difference was between Utf_bin and Utf_general_ci.
CI is case insensitive, that is, "case-insensitive", A and a will be treated as the same in character judgments;
The bin is binary, and a and a will be treated differently.
For example you run:
SELECT * FROM table WHERE txt = ' a '
Then in Utf8_bin you can't find the line of txt = ' a ', and utf8_general_ci can.
Utf8_general_ci is not case-sensitive, you should use it when registering your username and mailbox.
Utf8_general_cs case sensitive, if the username and mailbox Use this will be a bad result
Utf8_bin: String Each string compiles storage with binary data. Case-sensitive and can store binary content

First, the Official document description
The following is an excerpt from the MySQL 5.1 Chinese manual about Utf8_unicode_ci and utf8_general_ci:

Copy Code code as follows:

Currently, the UTF8_UNICODE_CI proofing rules only partially support the Unicode collation algorithm. Some characters are still not supported. Also, the combination of tokens cannot be fully supported. This mainly affects some minority languages in Vietnam and Russia, such as: Udmurt, Tatar, Bashkir and Mari.

The main feature of UTF8_UNICODE_CI is support for extensions, i.e. when you think of a letter as equal to another letter combination. For example, in German and some other languages ' ß ' equals ' ss '.

Utf8_general_ci is a legacy proofing rule that does not support scaling. It can only be compared between characters. This means that the UTF8_GENERAL_CI proofing rules are relatively fast, but are less accurate than the collation rules used for UTF8_UNICODE_CI.

For example, the comparison of the following two proofing rules using UTF8_GENERAL_CI and Utf8_unicode_ci is equal:
ä= A
ö= O
ü= U

The difference between the two proofing rules is that the following equation is set up for utf8_general_ci:
ß= s

However, for utf8_unicode_ci the following equation is set up:
ß= SS

For a language, the UTF8 character set collation rules that are relevant to a specific language are executed only if the use of utf8_unicode_ci sorting is not good. For example, UTF8_UNICODE_CI works well for German and French, so you no longer need to create special UTF8 proofing rules for both languages.

Utf8_general_ci also applies with German and French, except ' ß ' equals ' s ', not ' SS '. If your application can accept these, then you should use UTF8_GENERAL_CI because it is fast. Otherwise, use utf8_unicode_ci because it is more accurate.

If you want to use the gb2312 encoding, it is recommended that you use Latin1 as the default character set for the datasheet, so that you can insert the data directly in the command line tool in Chinese and display it directly. Instead of using character sets such as gb2312 or GBK, and if you're worried about sorting queries, You can use the Binary property constraint, for example:

Copy Code code as follows:

CREATE TABLE my_table (name varchar () binary NOT NULL default ') Type=myisam default CharSet latin1;

Second, brief summary
There is no real difference between utf8_unicode_ci and utf8_general_ci in Chinese and English.
Utf8_general_ci proofreading speed, but the accuracy is slightly poor.
Utf8_unicode_ci accuracy is high, but the proofreading speed is slightly slow.

If your application has German, French or Russian, please be sure to use UTF8_UNICODE_CI. General use Utf8_general_ci is enough, until now also found no problem ...

Third, the detailed summary

1, for a language only when the use of utf8_unicode_ci sorting does not do well, the implementation of specific language-related UTF8 character set proofing rules. For example, UTF8_UNICODE_CI works well for German and French, so you no longer need to create special UTF8 proofing rules for both languages.
2, Utf8_general_ci also applies with German and French, except '? ' equals ' s ', not ' SS '. If your application can accept these, then you should use UTF8_GENERAL_CI because it is fast. Otherwise, use utf8_unicode_ci because it is more accurate.

With a word overview above this paragraph: utf8_unicode_ci more accurate, utf8_general_ci speed is relatively fast. Usually the accuracy of utf8_general_ci is enough for us to use, after I read a lot of program source, found that most of them are also used utf8_general_ci, so the general selection of new database utf8_general_ci can be

Iv. How to use UTF8 in MySQL5.0
Add the following parameters to the MY.CNF

Copy Code code as follows:

[Mysqld]
init_connect= ' SET NAMES utf8′
Default-character-set=utf8
Default-collation = Utf8_general_ci

Execute query mysql> show variables; Related to the following:

Copy Code code as follows:

collation_connection | Utf8_general_ci
Collation_database | Utf8_general_ci
Collation_server | Utf8_general_ci

Personal opinion, for the use of the database, Utf8-general has been enough accurate, and compared with the Utf8-unicode speed advantage, solid can be assured that the adoption of

Appendix 1: The Old data upgrade method
Take the original character set for Latin1 as an example, upgrade to become the UTF8 character set. Original table: old_table (Default charset=latin1), new table: new_table (Default Charset=utf8).
First step: Export old data

Copy Code code as follows:

Mysqldump--default-character-set=latin1-hlocalhost-uroot-b my_db--tables old_table > Old.sql

Step Two: Convert code (similar to Unix/linux environment)

Copy Code code as follows:

Iconv-t Utf-8-F gb2312-c old.sql > New.sql

Or you can remove the-f parameter and let Iconv automatically determine the original character set

Copy Code code as follows:

Iconv-t Utf-8-C old.sql > New.sql

In this case, assume that the original data is gb2312 encoded by default.
Step Three: Import
Modify Old.sql to add an SQL statement before the INSERT/UPDATE statement begins: "SET NAMES UTF8;", save.

Copy Code code as follows:

Mysql-hlocalhost-uroot my_db < New.sql

Done!!

Attach 2: MySQL clients that support viewing the UTF8 character set have
1.) Mysql-front, it is said that this project has been the MySQL AB stopped, I do not know why, if there are many cracked version can be downloaded (does not mean that I recommend the use of cracked version:-P).
2.) Navicat, another very good MySQL client, the Chinese version just came out, but also invited me to try, overall still good, but also need to pay.
3.) phpMyAdmin, open source PHP project, very good.
4.) Linux under the Terminal Tools (Linux terminal), the end of the character set to UTF8, connected to MySQL, the implementation of set NAMES UTF8; can also read and write UTF8 data.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More