MySQL Database Basics (II): MySQL Character Sets and Garbled Text

Source: Internet
Author: User

I. Character Sets and Encoding

1. Character set introduction

A character is any letter, ideograph, or symbol, including the scripts of various languages, punctuation, graphic symbols, digits, and so on.
A character set is a collection of characters. There are many character sets, each containing a different number of characters. Common ones include ASCII, GB2312, BIG5, GB18030, and Unicode. To process text from a given character set, a computer needs a character encoding so that it can recognize and store the text.
A character encoding maps each character in a character set to a specific representation (a bit pattern, for example) so that text can be stored in a computer and transmitted over a network. A common example is ASCII, which numbers the Latin letters, digits, and other symbols and represents each of them as a 7-bit binary value.
A collation is the set of comparison rules for the characters in a character set. Only once the collation is fixed can you say which characters are equal and how characters are ordered. A character set can have multiple collations. MySQL collation names start with the name of the character set, have a language or country name (or general) in the middle, and end with ci, cs, or bin. Collations ending in ci are case insensitive, those ending in cs are case sensitive, and those ending in bin compare characters by their binary code values.
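As a rough analogy for the ci and bin suffixes, the following Python sketch compares two strings first case-insensitively and then by their encoded bytes. The helper names equal_ci and equal_bin are hypothetical, and real MySQL collations apply far more detailed rules than str.casefold().

```python
def equal_ci(a: str, b: str) -> bool:
    # Analogy for a *_ci collation: ignore case when comparing.
    return a.casefold() == b.casefold()


def equal_bin(a: str, b: str, encoding: str = "utf-8") -> bool:
    # Analogy for a *_bin collation: compare the encoded byte values.
    return a.encode(encoding) == b.encode(encoding)


print(equal_ci("MySQL", "mysql"))   # True
print(equal_bin("MySQL", "mysql"))  # False
```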

2. ASCII encoding

ASCII is both a coded character set and a character encoding: the number assigned to each character in the set is stored directly in the computer as the character's value.
For example, in ASCII the uppercase letter A is number 65 in the table, so the encoded A is 0100 0001, the binary form of decimal 65.
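The relationship between the table position and the stored byte can be checked with a few lines of Python. This is only an illustration using Python's built-in ascii codec, not anything specific to MySQL.

```python
code = ord("A")                 # position of 'A' in the ASCII table
bits = format(code, "08b")      # the same number written as 8 binary digits
encoded = "A".encode("ascii")   # the single byte that is actually stored

print(code, bits, encoded.hex())  # 65 01000001 41
```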

3. Latin1 Character Set

The Latin1 character set extends ASCII. It still uses one byte per character but also uses the high bit, which enlarges the set of characters that can be represented.
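A small Python sketch of the same idea follows. It uses Python's latin-1 codec (ISO 8859-1), which is an assumption here: it may differ from MySQL's latin1 for a few code points, but not for the character shown.

```python
text = "\u00e9"  # e with acute accent: outside ASCII, inside Latin1
latin1_bytes = text.encode("latin-1")

print(len(latin1_bytes), latin1_bytes.hex())   # 1 e9
print(latin1_bytes[0] & 0x80 != 0)             # True: the high bit is set

# ASCII characters keep the same single byte in Latin1.
same_as_ascii = "A".encode("latin-1") == "A".encode("ascii")
print(same_as_ascii)                           # True
```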

4. UTF-8 encoding

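UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode (also known as the universal code). It was created by Ken Thompson in 1992 and is now standardized as RFC 3629. The original design encoded a character in 1 to 6 bytes; RFC 3629 restricts UTF-8 to 1 to 4 bytes, which is enough for every Unicode code point up to U+10FFFF. Note that MySQL's utf8 character set stores at most 3 bytes per character; utf8mb4 covers the full 4-byte range.
UTF-8 is a variable-length byte encoding. If a character is encoded in a single byte, the highest bit of that byte is 0. If it is encoded in several bytes, the number of consecutive 1 bits at the top of the first byte gives the number of bytes in the encoding, and every following byte starts with 10. The byte layouts are shown in the following table (the 5-byte and 6-byte forms belong to the original design and are not valid under RFC 3629):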
1 byte 0xxxxxxx
2 bytes 110xxxxx 10xxxxxx
3 bytes 1110xxxx 10xxxxxx 10xxxxxx
4 bytes 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
5 bytes 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
6 bytes 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
Therefore, the original 6-byte form can carry at most 31 bits of actual character code, which are the bits marked x in the table. Apart from the control bits (such as the leading 10 of each continuation byte), the x bits correspond one to one to the bits of the Unicode code point, in the same order.
To convert a Unicode code point to UTF-8, first drop the leading zero bits, then choose the shortest UTF-8 form that can hold the remaining bits. Characters in the basic ASCII set (Unicode is compatible with ASCII) therefore need only one UTF-8 byte (7 bits).
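The conversion procedure can be sketched in Python as follows. The function utf8_encode is a hypothetical helper written for illustration; it covers only the 1- to 4-byte layouts and does not reject surrogate code points, so in practice the built-in str.encode should be used.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point using the 1- to 4-byte layouts."""
    if code_point < 0x80:             # up to 7 bits: 0xxxxxxx
        return bytes([code_point])
    if code_point < 0x800:            # up to 11 bits: 110xxxxx + 1 continuation byte
        lead, count = 0xC0, 1
    elif code_point < 0x10000:        # up to 16 bits: 1110xxxx + 2 continuation bytes
        lead, count = 0xE0, 2
    elif code_point <= 0x10FFFF:      # up to 21 bits: 11110xxx + 3 continuation bytes
        lead, count = 0xF0, 3
    else:
        raise ValueError("code point out of range")
    tail = []
    for _ in range(count):
        tail.append(0x80 | (code_point & 0x3F))  # 10xxxxxx holding the low 6 bits
        code_point >>= 6
    # The bits that remain go into the lead byte; continuation bytes were
    # collected from lowest to highest, so reverse them.
    return bytes([lead | code_point] + tail[::-1])


print(utf8_encode(0x41).hex())    # 41
print(utf8_encode(0x5F88).hex())  # e5be88
```

Comparing the result with chr(code_point).encode("utf-8") is a simple way to check the sketch against Python's own encoder.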

5. Character Set compatibility

To view the encoding of a string in a given character set:
SELECT HEX(CONVERT('string' USING charset_name));
To view the characters that an encoded value represents in a given character set:
SELECT CONVERT(0xABCDXXX USING charset_name);
ASCII encoding example:
SELECT HEX(CONVERT('hello' USING ascii));
ASCII encoding of the string "hello": 0x68656c6c6f
Latin1 encoding example:
SELECT HEX(CONVERT('hello' USING latin1));
Latin1 encoding of the string "hello": 0x68656c6c6f
UTF-8 encoding example:
SELECT HEX(CONVERT('hello' USING utf8));
UTF-8 encoding of the string "hello": 0x68656c6c6f
GBK encoding example:
SELECT HEX(CONVERT('hello' USING gbk));
GBK encoding of the string "hello": 0x68656c6c6f
GB2312 encoding example:
SELECT HEX(CONVERT('hello' USING gb2312));
GB2312 encoding of the string "hello": 0x68656c6c6f
BIG5 encoding example:
SELECT HEX(CONVERT('hello' USING big5));
BIG5 encoding of the string "hello": 0x68656c6c6f
As the examples show, Latin1 is compatible with ASCII, and for characters in the ASCII range the UTF-8, GBK, GB2312, and BIG5 character sets produce the same bytes as ASCII.
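The same comparison can be reproduced outside MySQL. The sketch below uses Python's built-in codecs, which are assumed to behave like the MySQL character sets of the same name for ASCII-range text.

```python
codecs_to_try = ["ascii", "latin-1", "utf-8", "gbk", "gb2312", "big5"]

# Encode the same ASCII-only string with every codec and collect the hex form.
encoded = {name: "hello".encode(name).hex() for name in codecs_to_try}

for name, hex_value in encoded.items():
    print(f"{name:8} 0x{hex_value}")  # every line shows 0x68656c6c6f
```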

UTF-8 encoding of the Chinese string "很屌":
SELECT HEX(CONVERT('很屌' USING utf8));
UTF-8 encoding of "很屌": 0xe5be88e5b18c
SELECT CONVERT(0xe5be88e5b18c USING utf8);
Converts the UTF-8 encoded value 0xe5be88e5b18c back to the characters "很屌".
GBK encoding of the Chinese string "很屌":
SELECT HEX(CONVERT('很屌' USING gbk));
GBK encoding of "很屌": 0xbadc8cc5
SELECT CONVERT(0xbadc8cc5 USING gbk);
Converts the GBK encoded value 0xbadc8cc5 back to the characters "很屌".
GB2312 encoding of the Chinese string "很屌":
SELECT HEX(CONVERT('很屌' USING gb2312));
GB2312 encoding of "很屌": 0xbadc3f
SELECT CONVERT(0xbadc3f USING gbk);
Interprets the GB2312 result 0xbadc3f as GBK. The result is "很?" because the character "屌" does not exist in the GB2312 character set.
BIG5 encoding of the Chinese string "很屌":
SELECT HEX(CONVERT('很屌' USING big5));
BIG5 encoding of "很屌": 0xabdcce78
Latin1 encoding of the Chinese string "很屌":
SELECT HEX(CONVERT('很屌' USING latin1));
Latin1 encoding of "很屌": 0x3f3f
ASCII encoding of the Chinese string "很屌":
SELECT HEX(CONVERT('很屌' USING ascii));
ASCII encoding of "很屌": 0x3f3f
As the examples show, for Chinese characters the four encodings UTF-8, GBK, GB2312, and BIG5 are mutually incompatible, and treating bytes from one as another produces garbled text. When text in any of them is converted to ASCII or Latin1, each Chinese character becomes 0x3f, the character '?'.
GB2312 supports Simplified Chinese, BIG5 supports Traditional Chinese, GBK supports both Simplified and Traditional Chinese, and UTF-8 supports almost all characters.
GBK is a national-standard extension of GB2312 and is compatible with it. GB2312 is a subset of GBK, and GBK is a subset of GB18030.
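The Chinese examples can be approximated in Python as well. The sketch below encodes the two-character string U+5F88 U+5C4C (the string whose UTF-8 bytes are e5be88e5b18c in the examples above). It assumes that Python's codecs match the MySQL character sets of the same name, and it uses errors="replace" to imitate the 0x3f substitution; the hex values printed for gbk and big5 can be compared with the MySQL results listed earlier.

```python
text = "\u5f88\u5c4c"  # the two Chinese characters used in the examples

# Character sets that contain both characters.
for name in ["utf-8", "gbk", "big5"]:
    print(f"{name:8} 0x{text.encode(name).hex()}")

# Character sets that lack one or both characters: errors="replace"
# substitutes '?' (0x3f) for each character that cannot be encoded.
lossy = {
    name: text.encode(name, errors="replace").hex()
    for name in ["gb2312", "latin-1", "ascii"]
}
print(lossy)
```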

II. MySQL Character Sets

1. MySQL environment variables

Session variables:
Use SHOW VARIABLES LIKE '%char%'; to view the character set variables of the current session.

SET character_set_server = utf8;
SET character_set_database = utf8;
Use SET to choose different character sets. Values set this way are session-level: a newly opened session starts again from the default character sets.
Global variables:
Use SHOW GLOBAL VARIABLES LIKE '%char%'; to view the global character set variables.

SET GLOBAL character_set_database = utf8;
SET GLOBAL character_set_server = utf8;
Use SET GLOBAL to set the character set for sessions opened afterwards.
Use SHOW CHARSET to view the character sets that MySQL supports and the default collation of each.
After the MySQL service restarts, global values are reset to their defaults. To change them permanently:
Edit the MySQL configuration file /etc/my.cnf:
[mysqld]
character-set-server=utf8
[client]
default-character-set=utf8
[mysql]
default-character-set=utf8
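To double-check which character set options a configuration fragment contains, a short Python sketch with the standard configparser module can help. The MY_CNF string below simply repeats the fragment above; a real my.cnf may contain directives (such as include lines) that configparser cannot read, so treat this as an illustration only.

```python
import configparser

# Hypothetical copy of the fragment shown above, not a complete my.cnf.
MY_CNF = """
[mysqld]
character-set-server=utf8
[client]
default-character-set=utf8
[mysql]
default-character-set=utf8
"""

parser = configparser.ConfigParser(allow_no_value=True)
parser.read_string(MY_CNF)

# Collect the options of every group into a plain dictionary.
charset_options = {section: dict(parser[section]) for section in parser.sections()}
for section, options in charset_options.items():
    print(section, options)
```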

2. MySQL character sets

The MySQL server supports multiple character sets and lets you set them at several levels: server, database, table, and column.
Use the SQL statement SHOW VARIABLES LIKE '%char%'; to view the character set variables:
character_set_client: the character set used by the client. Requests that the client sends to the server are encoded in this character set.
character_set_connection: the character set of the client/server connection. When the MySQL server receives a query from the client, it converts the query to the character set named by character_set_connection.
character_set_database: the character set of the default database. If there is no default database, it has the same value as character_set_server.
character_set_results: the character set in which results are returned to the client. MySQL converts result sets and error messages to this character set before sending them.
character_set_server: the character set of the database server, that is, the default internal operation character set.
character_set_system: the character set used for system metadata (column names and so on).
When a client connects, it sends the name of the character set it wants to use to the MySQL server, which then sets character_set_connection, character_set_client, and character_set_results to that character set.
If you do not specify a character set when creating a database, the character_set_server character set is used by default.
If you do not specify a character set when creating a table, the character set of the current database is used by default.
If you do not specify a character set when creating a column, the character set of the current table is used by default.
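The fallback chain in the last three rules can be sketched as a small Python function. The name effective_charset and its parameters are hypothetical; the function only models the rule that the most specific level with an explicit setting wins.

```python
from typing import Optional


def effective_charset(server: str,
                      database: Optional[str] = None,
                      table: Optional[str] = None,
                      column: Optional[str] = None) -> str:
    # Check the most specific level first: column, then table, then database.
    for candidate in (column, table, database):
        if candidate is not None:
            return candidate
    # Nothing was specified, so the server character set applies.
    return server


print(effective_charset("latin1"))                        # latin1
print(effective_charset("latin1", database="utf8"))       # utf8
print(effective_charset("latin1", "utf8", column="gbk"))  # gbk
```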

3. Setting MySQL character sets

A. Server-level character set
Edit the MySQL server configuration file /etc/my.cnf:
[mysqld]
character_set_server=utf8
Restart the MySQL service for the change to take effect.
B. Database-level character set
Specify it when creating the database:
CREATE DATABASE db_name
[[DEFAULT] CHARACTER SET charset_name]
[[DEFAULT] COLLATE collation_name]
To change the character set of an existing database:
ALTER DATABASE db_name
[[DEFAULT] CHARACTER SET charset_name]
[[DEFAULT] COLLATE collation_name]
The change applies only to objects created in the database afterwards; existing tables keep their character sets.
C. Table-level character set
Specify it when creating the table:
CREATE TABLE tbl_name (column_list)
[[DEFAULT] CHARACTER SET charset_name]
[COLLATE collation_name]
To change the character set of a table:
ALTER TABLE tbl_name
[[DEFAULT] CHARACTER SET charset_name]
[COLLATE collation_name]
D. Column-level character set
To change the character set of an existing column:
ALTER TABLE table_name MODIFY
column_name {CHAR | VARCHAR | TEXT} (column_length)
[CHARACTER SET charset_name]
[COLLATE collation_name]
Client session setting: SET NAMES utf8; is equivalent to:
SET character_set_client = utf8;
SET character_set_connection = utf8;
SET character_set_results = utf8;
E. Client character set
Edit the MySQL configuration file /etc/my.cnf:
[client]
default-character-set=utf8
This is equivalent to running SET NAMES utf8; in each new client session.
It affects the session values of character_set_client, character_set_connection, and character_set_results.
The MySQL service does not need to be restarted for this change to take effect.

4. MySQL character set conversion process

A. When the MySQL server receives a request, it converts the request data from the character_set_client character set to the character_set_connection character set.
B. Before performing the internal operation, it converts the request data from character_set_connection to the internal operation character set, which is determined as follows:
- Use the CHARACTER SET value of the column;
- If that does not exist, use the DEFAULT CHARACTER SET value of the table;
- If that does not exist, use the DEFAULT CHARACTER SET value of the database;
- If that does not exist, use the character_set_server value.
C. It converts the result of the operation from the internal operation character set to the character_set_results character set.
D. The result, encoded in character_set_results, is sent to the client, which displays it using its own configured character set.
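The conversion steps can be simulated in Python to see where characters are lost. The function run_query and its parameters are hypothetical, the character set names are Python codec names rather than MySQL names, and errors="replace" stands in for the question mark substitution; this is a simplified model, not how the server is implemented.

```python
def run_query(request_bytes: bytes, client: str, connection: str,
              internal: str, results: str) -> bytes:
    # Step A: interpret the request as character_set_client and
    # convert it to character_set_connection.
    text = request_bytes.decode(client)
    in_connection = text.encode(connection, errors="replace")
    # Step B: convert to the internal operation character set.
    stored = in_connection.decode(connection).encode(internal, errors="replace")
    # Step C: convert the result to character_set_results before sending it.
    return stored.decode(internal).encode(results, errors="replace")


request = "\u5f88".encode("utf-8")  # one Chinese character sent as UTF-8

# Every step uses UTF-8: the bytes come back unchanged.
print(run_query(request, "utf-8", "utf-8", "utf-8", "utf-8") == request)  # True

# A Latin1 column cannot hold the character, so a question mark comes back.
print(run_query(request, "utf-8", "utf-8", "latin-1", "utf-8"))  # b'?'
```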

III. Garbled Text in MySQL

1. Causes of garbled text

Garbled text arises for the following reasons:
A. The encodings used at the corresponding steps of storing and reading the data are inconsistent.
B. The conversion between two character sets is lossy.
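Cause A is easy to reproduce in Python: write text with one encoding and read it back with another. The sketch below uses Python's utf-8 and latin-1 codecs as stand-ins for the MySQL character sets.

```python
original = "\u5f88\u5c4c"            # two Chinese characters
stored = original.encode("utf-8")    # written as six UTF-8 bytes

# Reading the same bytes as Latin1 yields six unrelated characters.
garbled = stored.decode("latin-1")
print(len(original), len(garbled))   # 2 6

# Latin1 maps every byte value to a character, so no information was lost
# and the mistake can be undone by reversing the two steps.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired == original)          # True
```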

2. Lossless encoding conversion

Suppose a character X represented in encoding A is converted to encoding B. If the character set of encoding B does not contain X, the conversion is lossy; otherwise it is lossless.
Because each character set supports a limited number of characters, the character sets differ in what they cover. When converting UTF-8 text to GBK, MySQL replaces any character that has no counterpart in the GBK character set with an error mark (0x3F, a question mark).
Whether a conversion is lossless depends on two conditions:
A. Whether the characters being converted exist in both character sets.
B. Whether the target character set preserves the original byte representation of characters it does not support.
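Condition A can be tested with a round trip. The helper is_lossless below is a hypothetical sketch built on Python's codecs, which are assumed to cover the same characters as the MySQL character sets of the same name.

```python
def is_lossless(text: str, target: str) -> bool:
    # A conversion is lossless if every character can be encoded in the
    # target character set and decodes back to the same text.
    try:
        return text.encode(target).decode(target) == text
    except UnicodeEncodeError:
        return False


sample = "\u5f88\u5c4c"  # the second character is missing from GB2312

print(is_lossless("hello", "gbk"))     # True
print(is_lossless(sample, "utf-8"))    # True
print(is_lossless(sample, "gb2312"))   # False
print(is_lossless(sample, "latin-1"))  # False
```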

Reference blog:
http://cenalulu.github.io/
