MySQL character set issues and collation rules

Source: Internet
Author: User
Tags: naming convention

Character Set issues:

Basic concepts

• A character is the smallest meaningful symbol in a human language, such as 'A', 'B', and so on.
• Given a series of characters, each character is assigned a numeric value that represents it; this value is the character's encoding. For example, if we assign the character 'A' the value 0 and the character 'B' the value 1, then 0 is the encoding of 'A'.
• Given a series of characters and their encodings, the set of all these character/encoding pairs is a character set. For example, given the character list {'A', 'B'}, {'A' => 0, 'B' => 1} is a character set.
• A collation is the set of comparison rules between characters in the same character set.
• Once a collation is chosen, equivalence and ordering between the characters of a character set are defined.
• Each collation corresponds to exactly one character set, but a character set can have several collations, one of which is its default collation.
• Collation names in MySQL follow a naming convention: they begin with the name of the character set the collation belongs to and end with _ci (case insensitive), _cs (case sensitive), or _bin (binary, compared by encoded value). For example, under the collation "utf8_general_ci", the characters "a" and "A" are equivalent.
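The definitions above can be sketched in a few lines of Python. This is only a toy model, not how MySQL implements character sets: the names TOY_CHARSET, key_bin, and key_ci are hypothetical, and the two key functions only stand in for a _bin style and a _ci style collation over ASCII text.

```python
# Toy character set from the text: each character paired with its encoding.
TOY_CHARSET = {"A": 0, "B": 1}


def encode(text, charset=TOY_CHARSET):
    """Return the list of encoded values for text under the given character set."""
    return [charset[ch] for ch in text]


def key_bin(text):
    """_bin style comparison key: compare by encoded value (raw ASCII bytes)."""
    return text.encode("ascii")


def key_ci(text):
    """_ci style comparison key: fold case first, so 'a' and 'A' are equal."""
    return text.lower().encode("ascii")


def equal_under(key, left, right):
    """Two strings are equivalent under a collation if their keys match."""
    return key(left) == key(right)


print(encode("AB"))                    # [0, 1]
print(equal_under(key_ci, "a", "A"))   # True
print(equal_under(key_bin, "a", "A"))  # False
print(sorted(["b", "A", "a", "B"], key=key_bin))  # ['A', 'B', 'a', 'b']
```

The same character set gives different equality and ordering results depending on which key function is chosen, which is the point of having several collations per character set.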

MySQL Character set settings

• System variables:
– character_set_server: the server's default character set for internal operations
– character_set_client: the character set of the data sent by the client
– character_set_connection: the connection-layer character set
– character_set_results: the character set of query results
– character_set_database: the default character set of the currently selected database
– character_set_system: the character set of system metadata (field names and so on)
– There are also variables beginning with collation_ that describe the corresponding collations.

• Specify the character set of a string literal with an introducer:
– Format: [_charset] 'string' [COLLATE collation]
– For example:
SELECT _latin1 'string';
SELECT _utf8 'Hello' COLLATE utf8_general_ci;
– A string literal marked with an introducer is converted directly to the internal character set for processing, without extra transcoding during the request.

The character set conversion process in MySQL

1. When the MySQL server receives a request, it converts the request data from character_set_client to character_set_connection;
2. Before performing an internal operation, it converts the request data from character_set_connection to the internal operation character set, which is determined as follows:
- Use the CHARACTER SET value of each data field;
- If that value does not exist, use the DEFAULT CHARACTER SET value of the corresponding table (a MySQL extension, not part of the SQL standard);
- If that value does not exist, use the DEFAULT CHARACTER SET value of the corresponding database;
- If that value does not exist, use the value of character_set_server.

3. It converts the operation result from the internal operation character set to character_set_results.
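The three conversion steps can be imitated with Python's codecs to see what happens to the bytes. This is a simulation under stated assumptions, not MySQL's code: the function names are hypothetical, Python codec names ("utf-8", "latin-1") stand in for MySQL's utf8 and latin1, and unrepresentable characters are replaced with "?".

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Read bytes as `source` and re-encode them as `target` ('?' if impossible)."""
    return data.decode(source).encode(target, errors="replace")


def resolve_internal_charset(column=None, table=None, database=None,
                             server="latin-1"):
    """Step 2 lookup order: column, then table, then database, then server."""
    for candidate in (column, table, database):
        if candidate is not None:
            return candidate
    return server


def store(request: bytes, client: str, connection: str, internal: str) -> bytes:
    """Steps 1 and 2: client -> connection -> internal operation character set."""
    on_connection = transcode(request, client, connection)
    return transcode(on_connection, connection, internal)


def fetch(stored: bytes, internal: str, results: str) -> bytes:
    """Step 3: internal operation character set -> results character set."""
    return transcode(stored, internal, results)


# Hypothetical setup: the table is latin1, the connection is fully UTF-8.
request = "café".encode("utf-8")
internal = resolve_internal_charset(table="latin-1")
stored = store(request, "utf-8", "utf-8", internal)
result = fetch(stored, internal, "utf-8")
print(stored, result == request)
```

When every stage is declared correctly and the text fits in the internal character set, the round trip returns the original bytes; the problems described later arise when one of the declared character sets does not match the real data.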


Now let us go back and analyze the garbled-text problem we ran into:
a. Our field does not specify a character set, so the table's character set is used.
b. Our table does not specify a character set, so the database's character set is used by default.
c. Our database did not specify a character set when it was created, so the value of character_set_server is used.
d. We did not deliberately change character_set_server, so the MySQL default is used.
e. The MySQL default character set is latin1, so the data is stored as latin1 while our character_set_connection is UTF-8; garbled Chinese text on insert is therefore inevitable.

FAQ Resolution
• FAQ-1: UTF-8 encoded data is inserted into a table whose default character set is utf8 without first setting the connection character set, and the connection character set is set to utf8 when querying
– At insert time, following the MySQL server's default settings, character_set_client, character_set_connection, and character_set_results are all latin1;
– The inserted data goes through the latin1=>latin1=>utf8 character set conversion, during which each inserted Chinese character grows from its original 3 bytes to 6 bytes when saved;
– At query time, the result goes through the utf8=>utf8 conversion, which returns the saved 6 bytes unchanged, producing garbled text.
• FAQ-2: the connection character set is set to utf8 before UTF-8 encoded data is inserted into a table whose default character set is latin1 (this is the mistake we ran into)
– At insert time, following the connection character set setting, character_set_client, character_set_connection, and character_set_results are all utf8;
– The inserted data goes through the utf8=>utf8=>latin1 conversion. If the original data contains Unicode characters outside the \u0000~\u00FF range, they are converted to "?" (0x3F) because they cannot be represented in latin1, and they cannot be restored later, whatever the connection character set is set to.
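The byte counts in the two cases above can be reproduced with Python's codecs. This is a sketch under assumptions: Python's "latin-1" codec stands in for MySQL's latin1, the helper name transcode is hypothetical, and the repair step at the end is an observation about the bytes rather than a procedure taken from the text.

```python
def transcode(data: bytes, source: str, target: str) -> bytes:
    """Read bytes as `source` and re-encode them as `target` ('?' if impossible)."""
    return data.decode(source).encode(target, errors="replace")


original = "中".encode("utf-8")  # one Chinese character, 3 bytes: e4 b8 ad

# First case: connection left at latin1, table is utf8.
# latin1 => latin1 => utf8 treats each of the 3 bytes as its own character.
stored_case1 = transcode(transcode(original, "latin-1", "latin-1"),
                         "latin-1", "utf-8")
# Querying with a utf8 connection (utf8 => utf8) returns the 6 bytes unchanged.
returned_case1 = transcode(stored_case1, "utf-8", "utf-8")
# The original bytes are still recoverable by undoing the extra conversion.
repaired_case1 = returned_case1.decode("utf-8").encode("latin-1")

# Second case: connection is utf8, table is latin1.
# utf8 => utf8 => latin1 cannot represent the character and stores '?'.
stored_case2 = transcode(transcode(original, "utf-8", "utf-8"),
                         "utf-8", "latin-1")

print(len(original), len(stored_case1), stored_case1.hex())
print(repaired_case1 == original, stored_case2)
```

In the first case the information is inflated but still present; in the second case the stored byte is a literal question mark, so nothing is left to recover.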

Some means of detecting character set problems
• SHOW CHARACTER SET;
• SHOW COLLATION;
• SHOW VARIABLES LIKE 'character%';
• SHOW VARIABLES LIKE 'collation%';
• SQL functions HEX, LENGTH, CHAR_LENGTH
• SQL functions CHARSET, COLLATION
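To interpret what HEX, LENGTH, and CHAR_LENGTH report for a suspicious value, it can help to reproduce the numbers offline. The sketch below takes a hex string such as the one HEX(column) would return and computes rough Python analogues; the function names are hypothetical, and looks_double_encoded is only a heuristic that can misjudge text which legitimately contains sequences like "Ã©".

```python
def describe(stored: bytes, encoding: str = "utf-8") -> dict:
    """Rough analogues of HEX(), LENGTH() and CHAR_LENGTH() for a stored value."""
    return {
        "hex": stored.hex().upper(),
        "length": len(stored),                        # bytes
        "char_length": len(stored.decode(encoding)),  # characters
    }


def looks_double_encoded(stored: bytes) -> bool:
    """Heuristic: UTF-8 text whose latin-1 bytes are themselves valid UTF-8."""
    try:
        text = stored.decode("utf-8")
        recovered = text.encode("latin-1").decode("utf-8")
    except UnicodeError:
        return False
    return recovered != text


clean = bytes.fromhex("E4B8AD")           # "中" stored correctly as UTF-8
inflated = bytes.fromhex("C3A4C2B8C2AD")  # the same character converted twice

print(describe(clean), looks_double_encoded(clean))
print(describe(inflated), looks_double_encoded(inflated))
```

A single Chinese character that reports 6 bytes and 3 characters instead of 3 bytes and 1 character is a strong hint that the data was converted from latin1 to utf8 on the way in.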

Recommendations when using MySQL character sets
• When creating databases and tables and when performing database operations, explicitly state the character set to use rather than relying on MySQL's default settings; otherwise a MySQL upgrade may cause a lot of trouble.
• Using latin1 for both the database and the connection character set avoids garbled text in most cases, but the drawback is that SQL operations cannot work in units of characters. Using utf8 for both the database and the connection character set is generally the better choice.
• When using the MySQL C API, call mysql_options to set the MYSQL_SET_CHARSET_NAME option to utf8 immediately after initializing the database handle. You then do not need an explicit SET NAMES statement to specify the connection character set, and the connection character set is reset to utf8 when mysql_ping reconnects a broken long-lived connection.
• When using the MySQL PHP API, a typical page-level PHP script runs only briefly, so explicitly setting the connection character set once with a SET NAMES statement after connecting to the database is enough. When using a long-lived connection, however, keep the connection alive, and after any reconnect explicitly reset the connection character set with a SET NAMES statement.

Other Precautions
• The default_character_set setting in my.cnf only affects the connection character set used when the mysql command-line client connects to the server; it has no effect on applications that use the libmysqlclient library!
• SQL function operations on fields are usually performed in the internal operation character set and are not affected by the connection character set settings.
• Bare string literals in SQL statements are affected by the connection character set or by introducer settings, which can produce completely different results for operations such as comparisons, so be careful!

Summary
    From the analysis and recommendations above, it should now be clear how to solve our problem: specify the character set when creating the database rather than achieving the goal by changing the default configuration. You can of course also specify the character set per table, but that is easy to forget, and mistakes are even more likely when many people take part in the design.

Although changing the MySQL default character set is not the recommended solution, some ways to change it are given below, for reference only.

MySQL default character set
MySQL lets you specify a character set at the level of a database, a table, or a column. Traditional programs do not use such detailed configuration when creating databases and tables; they rely on the defaults.
(1) When MySQL is compiled, a default character set is specified, and that character set is latin1 (this applies to MySQL versions before 8.0, which changed the default to utf8mb4);
(2) When MySQL is installed, a default character set can be specified in the configuration file (my.ini); if it is not, the value is inherited from the one specified at compile time;
(3) When mysqld is started, a default character set can be specified in the command-line arguments; if it is not, the value is inherited from the configuration file. character_set_server is then set to this default character set;
(4) If you select multi-language support when installing MySQL, the installer automatically sets default_character_set in the configuration file to UTF-8, so that by default all columns of all tables in all databases are stored as UTF-8.
Viewing the default character set
    By default, MySQL's character set is latin1 (ISO_8859_1); the commands for viewing it were given above.
Modifying the default character set
(1) The simplest way is to change the character set settings in MySQL's my.ini file, such as
default-character-set = utf8
character_set_server = utf8
After making the change, restart the MySQL service.
(2) Another way is to change the character set with MySQL commands. Note that the collation_ variables take a collation name, not a character set name:
mysql> SET character_set_client = utf8;
mysql> SET character_set_connection = utf8;
mysql> SET character_set_database = utf8;
mysql> SET character_set_results = utf8;
mysql> SET character_set_server = utf8;
mysql> SET collation_connection = utf8_general_ci;
mysql> SET collation_database = utf8_general_ci;
mysql> SET collation_server = utf8_general_ci;
 
    If the table's default character set is utf8 and queries are sent in UTF-8 encoding, but the data in the database is still garbled, the problem may lie in the connection layer. The workaround is to execute the following statement before sending queries: SET NAMES 'utf8'; It is equivalent to the following three statements:
SET character_set_client = utf8;
SET character_set_results = utf8;
SET character_set_connection = utf8;

Collation (sorting) issues:

ci stands for case insensitive: a and A are treated as the same in character comparisons.
bin stands for binary: a and A are treated as different.

For example you run:

SELECT * FROM `table` WHERE txt = 'a';

Then with utf8_bin you will not find a row where txt = 'A', while with utf8_general_ci you will.

I. Some tips:
utf8_general_ci: case-insensitive; use this for fields such as user names and email addresses at registration.
utf8_general_cs: case-sensitive; using it for user names and email addresses leads to bad results. (This collation may not be available in your MySQL build; check the output of SHOW COLLATION.)
utf8_bin: strings are compared and stored as binary data; it is case-sensitive and can store binary content.

II. Brief summary
utf8_unicode_ci and utf8_general_ci make no real difference for Chinese and English.
utf8_general_ci: comparison is fast, but accuracy is slightly worse.
utf8_unicode_ci: accuracy is high, but comparison is slightly slower.
If your application handles German, French, or Russian, be sure to use utf8_unicode_ci. For general use utf8_general_ci is enough; so far no problems have been found with it.

III. Detailed summary
1. A language-specific utf8 collation is implemented only when sorting with utf8_unicode_ci does not work well for that language. For example, utf8_unicode_ci works very well for German and French, so there is no need to create special utf8 collations for these two languages.
2. utf8_general_ci is also usable for German and French, except that 'ß' equals 's' rather than 'ss'. If your application can accept this, use utf8_general_ci because it is fast. Otherwise, use utf8_unicode_ci because it is more accurate.
In one sentence: utf8_unicode_ci is more accurate, and utf8_general_ci is faster. The accuracy of utf8_general_ci is usually sufficient; having read the source code of many programs, I found that most of them use utf8_general_ci, so utf8_general_ci is generally a fine choice for a new database.
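The German sharp s (ß) case can be illustrated with a toy comparison in Python. This is an analogy only and not MySQL's collation algorithm: key_general and GENERAL_FOLD are hypothetical and model a collation that maps each character to exactly one character, while key_unicode uses Python's str.casefold, which applies Unicode full case folding and expands 'ß' to 'ss'.

```python
# Hypothetical one-to-one folding table for the "general" style comparison.
GENERAL_FOLD = {"ß": "s"}


def key_general(text: str) -> str:
    """Map every character to exactly one character (no expansions)."""
    return "".join(GENERAL_FOLD.get(ch, ch.lower()) for ch in text)


def key_unicode(text: str) -> str:
    """Use full case folding, which may expand one character into several."""
    return text.casefold()


for left, right in [("ß", "s"), ("ß", "ss"), ("Straße", "STRASSE")]:
    print(left, right,
          key_general(left) == key_general(right),
          key_unicode(left) == key_unicode(right))
```

A one-to-one mapping is cheaper to compute, which matches the speed advantage described above, but it cannot express that one character is equivalent to a two-character sequence.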

Note: The above content is excerpted from the Internet.
