An analysis of insert failures and hangs for rare characters under MySQL's utf8 encoding

Last Update:2015-01-05 Source: Internet

Author: User



Original: http://my.oschina.net/leejun2005/blog/343353

Contents

1. Problem: MySQL fails when inserting some Chinese characters
2. Cause: MySQL's utf8 is not real UTF-8
3. Solutions
3.1 Upgrading MySQL and switching the character set from utf8 to utf8mb4
3.1.1 Modifying the table structure directly
3.1.2 Modifying the database default configuration
3.2 Forcibly filtering out rare characters
3.2.1 Shell filter
3.2.2 Filtering in Java
3.3 Avoiding garbled client output: binary storage and query
4. Application and system support for four-byte UTF-8 characters
5. The final question
6. References

1. Problem: MySQL fails when inserting some Chinese characters

Recently, some colleagues reported the following problem:

When run from a load script, the statement above hangs; when run from the web front end or the command line, it either throws

Incorrect string value: '\xF0\xA1\x8B\xBE\xE5\xA2...' for column 'name',

or the content stored in MySQL is truncated or garbled, while other Chinese text is stored normally.

This looks a little strange: the UTF-8 encoding is supposed to cover all Chinese characters, so the problem above should not occur.

2. Cause: MySQL's utf8 is not real UTF-8

Let's look at how the Chinese characters that fail to insert differ from those that insert normally:

You can see that the characters that fail to insert are 4 bytes long, while those that insert normally are only 3 bytes. But isn't UTF-8 a variable-length encoding that supports 1 to 4 bytes? Could the failure have something to do with this?
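As a rough illustration of that 3-byte versus 4-byte difference, the Python sketch below compares two characters. The common character 中 (U+4E2D) is an assumed example that does not appear in the original; U+212FE is the character that the bytes \xF0\xA1\x8B\xBE in the error message decode to.

```python
# Compare the UTF-8 byte length of a common and a rare Chinese character.
common = "\u4e2d"      # 中, an assumed example of a character inside the BMP
rare = "\U000212FE"    # the character behind \xF0\xA1\x8B\xBE in the error message


def utf8_len(ch: str) -> int:
    """Return how many bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


print(utf8_len(common))      # 3
print(utf8_len(rare))        # 4
print(rare.encode("utf-8"))  # b'\xf0\xa1\x8b\xbe'
```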

Let's check the official documentation to find out:

10.1.10.6 The utf8mb4 Character Set (4-Byte UTF-8 Unicode Encoding)
The character set named utf8 uses a maximum of three bytes per character and contains only BMP characters. As of MySQL 5.5.3, the utf8mb4 character set uses a maximum of four bytes per character and supports supplementary characters:

For a BMP character, utf8 and utf8mb4 have identical storage characteristics: same code values, same encoding, same length.
For a supplementary character, utf8 cannot store the character at all, while utf8mb4 requires four bytes to store it. Since utf8 cannot store the character at all, you do not have any supplementary characters in utf8 columns and you need not worry about converting characters or losing data when upgrading utf8 data from older versions of MySQL.
utf8mb4 is a superset of utf8.

As the official documentation shows, MySQL's utf8 character set supports characters of at most 3 bytes, and inserting a 4-byte character raises an error. The largest Unicode code point that three-byte UTF-8 can encode is 0xFFFF, which is the end of the Basic Multilingual Plane (BMP). In other words, any Unicode character outside the BMP cannot be stored with MySQL's utf8 character set. This includes emoji (Unicode pictographs common on iOS and Android phones, most of which lie outside the BMP), many rarely used Chinese characters, and any newly added Unicode characters outside the BMP.

The original UTF-8 format used one to six bytes and could encode up to 31 bits. The current UTF-8 specification uses only one to four bytes and can encode up to 21 bits, which is just enough to represent all 17 Unicode planes.
utf8 is a character set in MySQL that supports only UTF-8 characters of at most three bytes, that is, the Basic Multilingual Plane of Unicode.
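The numbers in the two paragraphs above can be checked with a little arithmetic. The sketch below is only a worked example; the constant and function names are made up for illustration.

```python
# 17 planes of 65,536 code points each give the full Unicode range.
PLANES = 17
CODE_POINTS_PER_PLANE = 0x10000

total = PLANES * CODE_POINTS_PER_PLANE     # 1,114,112 code points
max_code_point = total - 1                 # 0x10FFFF
bits_needed = max_code_point.bit_length()  # 21 bits, the 4-byte UTF-8 payload


def utf8_byte_count(code_point: int) -> int:
    """Number of bytes UTF-8 needs for a code point (1 to 4)."""
    if code_point < 0x80:
        return 1
    if code_point < 0x800:
        return 2
    if code_point < 0x10000:
        return 3  # last range that MySQL's utf8 can store
    return 4      # supplementary planes, needs utf8mb4


print(hex(max_code_point), bits_needed)   # 0x10ffff 21
print(utf8_byte_count(0xFFFF))            # 3
print(utf8_byte_count(0x10000))           # 4
```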
Why does utf8 in MySQL only support UTF-8 characters of at most three bytes? I thought about it, and it is probably because Unicode had no supplementary planes yet when MySQL development began. At the time, the Unicode committee was still dreaming that "65,535 characters are enough for the whole world". String lengths in MySQL are counted in characters, not bytes, and for the CHAR data type enough space must be reserved for the full declared length. With the utf8 character set, the reserved length is the length of the longest utf8 character multiplied by the string length, so it was natural to cap the utf8 character length at 3; for example, for CHAR(100) MySQL reserves 300 bytes. As for why later versions did not add 4-byte UTF-8 support to utf8, I think one reason is backward compatibility, and another is that characters outside the Basic Multilingual Plane really are rarely used.
To store 4-byte UTF-8 characters in MySQL, you need the utf8mb4 character set, which is supported only in version 5.5.3 and later (check the version with: select version();). I think that for better compatibility you should always use utf8mb4 instead of utf8. For CHAR data, utf8mb4 consumes more space, so MySQL's official recommendation is to use VARCHAR instead of CHAR.

3. Solutions

Now that we know the cause, let's go through the options for solving the problem described at the beginning.

3.1 Upgrading MySQL and switching the character set from utf8 to utf8mb4

Upgrade MySQL to 5.5.3 or later. To check the current version:

select version();
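If the check needs to happen in a script, the string returned by select version() can be compared against 5.5.3. This is a minimal sketch: the function name is hypothetical, and it assumes the version string starts with three dot-separated numbers, as in "5.5.3-m3" or "5.1.73-log".

```python
import re


def supports_utf8mb4(version: str) -> bool:
    """Return True if a MySQL version string denotes 5.5.3 or later."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognised version string: {version!r}")
    # Tuples compare element by element, so (5, 6, 21) >= (5, 5, 3) holds.
    return tuple(int(part) for part in match.groups()) >= (5, 5, 3)


print(supports_utf8mb4("5.1.73-log"))  # False
print(supports_utf8mb4("5.5.3-m3"))    # True
```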

MySQL added the utf8mb4 character set in 5.5.3. The name mb4 means "most bytes 4", and the character set exists specifically to support four-byte Unicode characters. Fortunately, utf8mb4 is a superset of utf8, so apart from switching the character set to utf8mb4, no other conversion is needed. Of course, if saving space matters, utf8 is generally enough.
So the sound approach is to adopt the solution that fits best right now, and then iterate gradually as new requirements appear.

3.1.1 Modifying the table structure directly

-- Change the database character set
ALTER DATABASE test CHARACTER SET = utf8mb4;
-- Change the table character set
ALTER TABLE test CONVERT TO CHARACTER SET utf8mb4;
-- Change the column character set
ALTER TABLE `test` CHANGE COLUMN `name` `name` varchar(12) CHARACTER SET utf8mb4;
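When several tables need the same change, the statements above can be generated rather than typed by hand. The sketch below only builds SQL strings and does not connect to a database; the function names are hypothetical, and the output should be reviewed before being run.

```python
def quote_identifier(name: str) -> str:
    """Quote a MySQL identifier with backticks, doubling embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def utf8mb4_conversion_sql(database: str, tables: list) -> list:
    """Build the ALTER statements that switch a database and its tables to utf8mb4."""
    statements = [
        f"ALTER DATABASE {quote_identifier(database)} CHARACTER SET = utf8mb4;"
    ]
    for table in tables:
        statements.append(
            f"ALTER TABLE {quote_identifier(table)} CONVERT TO CHARACTER SET utf8mb4;"
        )
    return statements


for statement in utf8mb4_conversion_sql("test", ["test"]):
    print(statement)
```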

3.1.2 Modifying the database default configuration

[client]

default-character-set = utf8mb4

[mysqld]

character-set-server=utf8mb4

collation-server=utf8mb4_unicode_ci

[mysql]

default-character-set = utf8mb4
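To confirm that a configuration file carries the settings listed above, something like the following sketch may help. It assumes the file is plain INI text that Python's configparser can read; a real my.cnf containing directives such as !includedir would need extra handling. The function name is hypothetical.

```python
import configparser

# The settings listed above, keyed by (section, option).
REQUIRED = {
    ("client", "default-character-set"): "utf8mb4",
    ("mysqld", "character-set-server"): "utf8mb4",
    ("mysqld", "collation-server"): "utf8mb4_unicode_ci",
    ("mysql", "default-character-set"): "utf8mb4",
}


def missing_utf8mb4_settings(cnf_text: str) -> list:
    """Return a description of every required setting that is absent or different."""
    parser = configparser.ConfigParser(
        allow_no_value=True, strict=False, interpolation=None
    )
    parser.read_string(cnf_text)
    problems = []
    for (section, option), expected in REQUIRED.items():
        actual = parser.get(section, option, fallback=None)
        if actual != expected:
            problems.append(
                f"[{section}] {option} should be {expected}, found {actual}"
            )
    return problems
```

An empty list means all four settings are present with the expected values.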

P.S. If you are using Java, you also need to upgrade the JDBC driver to mysql-connector-java-5.1.14.jar.

3.2 Forcibly filtering out rare characters

From a business and technical point of view, you can compromise and filter out the rare characters early. Such characters are used very little, and even if they are stored in the database, they cause many other problems when displayed or queried, so it is simpler to filter them out directly. That MySQL's utf8 does not support four bytes may be a historical burden on the one hand, and on the other hand is probably also meant to save space.

3.2.1 Shell filter

For example, in the load script we can first use sed, awk, Python or Perl to strip all four-byte rare characters, and only then load the data.
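The original filtering script is not reproduced here. As one possible version in Python, the sketch below removes every code point outside the BMP, which is exactly the set of characters that take four bytes in UTF-8. The function name and the optional replacement parameter are assumptions made for illustration.

```python
import re

# Any code point from U+10000 to U+10FFFF takes four bytes in UTF-8.
_SUPPLEMENTARY = re.compile("[\U00010000-\U0010FFFF]")


def strip_four_byte_chars(text: str, replacement: str = "") -> str:
    """Remove (or replace) characters that MySQL's 3-byte utf8 cannot store."""
    return _SUPPLEMENTARY.sub(replacement, text)


sample = "a\U000212FE\u5899b"  # 'a', a rare 4-byte character, 墙, 'b'
print(strip_four_byte_chars(sample) == "a\u5899b")  # True
```

In a load script, each input line could be passed through strip_four_byte_chars before it is written to the file that MySQL loads.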

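3.2.2 Filtering in Java

To check whether MySQL's utf8 can store every character in a string, test each code point against the BMP range:

// Returns true if every code point of str lies in the BMP (at most 3 bytes in UTF-8).
static boolean fitsMysqlUtf8(String str) {
    for (int i = 0; i < str.length(); ) {
        int c = str.codePointAt(i);
        if (c > 0xFFFF) return false;   // supplementary character, 4 bytes in UTF-8
        i += Character.charCount(c);    // advance past the whole code point
    }
    return true;
}

You can adapt it slightly as needed.

3.3 Avoiding garbled client output: binary storage and query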

If the web page or the terminal itself does not support four-byte UTF-8, you can work in binary mode to avoid garbled output:

create table t1 (name varchar(20) charset utf8mb4);

insert into t1 values (0xF0A09080);

set charset binary;

select * from t1;
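To see which character the hex literal 0xF0A09080 stands for, and to turn a character back into such a literal, the bytes can be decoded on the client side. This is a small sketch using only the Python standard library.

```python
# Decode the 4-byte literal used in the INSERT above.
raw = bytes.fromhex("F0A09080")
ch = raw.decode("utf-8")

print(len(raw))            # 4
print(f"U+{ord(ch):04X}")  # U+20400

# Going the other way gives the hex literal for any character.
literal = "0x" + ch.encode("utf-8").hex().upper()
print(literal)             # 0xF0A09080
```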

4. Application and system support for four-byte UTF-8 characters

Finally, let's summarize how well various systems support 4-byte UTF-8 characters:

Windows XP: none of the XP systems I tested support 4-byte UTF-8 characters; browsers display placeholders instead
Windows 7: supports 4-byte UTF-8 characters
Mac OS X: supports 4-byte UTF-8 characters
iPhone/iPad: supports 4-byte UTF-8 characters
Many database tools and shell terminals do not support 4-byte UTF-8 characters, for example Navicat and SecureCRT

Taking a PHP scenario as an example:

PHP connection charset set to utf8, MySQL column of type TEXT with character set utf8: the written content is truncated starting at the 4-byte UTF-8 character
PHP connection charset set to utf8mb4, MySQL column of type TEXT with character set utf8: the content is written in full, but each 4-byte UTF-8 character is replaced with a question mark "?"
PHP connection charset set to utf8mb4, MySQL column of type TEXT with character set utf8mb4: 4-byte UTF-8 characters are fully supported
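The three outcomes above can be mimicked with a toy model. The sketch below is not MySQL's or PHP's actual logic; it only reproduces the behaviour reported in the list, and the function name is hypothetical. Combinations that the list does not cover are rejected rather than guessed.

```python
def simulate_write(text: str, connection_charset: str, column_charset: str) -> str:
    """Toy model of what ends up stored for the three cases listed above."""

    def is_four_byte(ch: str) -> bool:
        return ord(ch) > 0xFFFF

    case = (connection_charset, column_charset)
    if case == ("utf8", "utf8"):
        # Content is cut off at the first 4-byte character.
        for index, ch in enumerate(text):
            if is_four_byte(ch):
                return text[:index]
        return text
    if case == ("utf8mb4", "utf8"):
        # Content is kept, but each 4-byte character becomes "?".
        return "".join("?" if is_four_byte(ch) else ch for ch in text)
    if case == ("utf8mb4", "utf8mb4"):
        return text
    raise NotImplementedError(f"combination not described in the article: {case}")


sample = "ab\U0001F600cd"
print(simulate_write(sample, "utf8", "utf8"))     # ab
print(simulate_write(sample, "utf8mb4", "utf8"))  # ab?cd
```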

In terms of platform support, as Windows XP is phased out, supporting 4-byte UTF-8 characters becomes necessary.
The official manual's notes on utf8mb4 state that utf8mb4 is a superset of utf8, so you can upgrade with confidence.

5. The final question

Having read this far, I wonder whether careful readers have noticed that the code in this article was posted as images. I rarely turn code into images in my own articles; the reason is ...

Ha, that really is bringing up the one thing you would rather not mention ...

6. References

[1] Talking about charset and character encoding

http://my.oschina.net/leejun2005/blog/232732#OSC_h3_4

[2] sed matching any ASCII code/hex byte

http://forums.gentoo.org/viewtopic-t-770576-start-0.html

[3] 10.1.10.6 The utf8mb4 Character Set (4-Byte UTF-8 Unicode Encoding)

http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html

[4] utf8 in MySQL

http://blog.haipo.me/?p=1149

[5] MySQL 4-byte utf8 character set problem

http://bbs.chinaunix.net/thread-3766853-1-1.html

[6] MySQL, UTF-8 and emoji

http://shadyxu.com/mysql_utf8.html

[7] Resolving the MySQL content truncation problem

http://www.momotime.me/2014/10/mysql-utf8mb4/

[8] Storing 4-byte characters in MySQL

http://www.web-tinker.com/article/20643.html

[9] MySQL throwing "Incorrect string value" with UTF-8 encoding

http://blog.csdn.net/tannasu/article/details/8064021

[10] Resolving the MySQL content truncation problem

http://daiweiyang.com/mysql_data_truncated/

[11] Converting Chinese characters to hexadecimal encoding in MySQL

http://blogread.cn/it/article/3016?f=wb


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
