An analysis of insert failures and hangs for rare characters under MySQL's utf8 encoding



1. Problem: MySQL throws an exception when inserting certain Chinese characters


Recently, some colleagues reported the following problem:






When the statement above is loaded into the database by a script, it hangs. When it is run from the web front end or the command line, it either throws

Incorrect string value: '\xF0\xA1\x8B\xBE\xE5\xA2...' for column 'name',

or the content stored in MySQL is truncated or garbled, while other Chinese text is stored normally.



That looks a little strange. UTF-8 is supposed to cover all Chinese characters, so the problems above should not occur.


2. Cause: this utf8 is not really UTF-8


Let's look at the difference between the Chinese characters that trigger the exception and those that insert normally:






You can see that the characters that fail to insert are 4 bytes long, while the ones that insert normally are only 3 bytes. But isn't UTF-8 a variable-length encoding that supports 1 to 4 bytes? Could that have something to do with it?
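The byte lengths can be checked outside MySQL as well. The following is a minimal Python sketch; the helper name `utf8_length` and the two sample characters are chosen here for illustration (U+212FE corresponds to the bytes `F0 A1 8B BE` that open the error message above).

```python
def utf8_length(ch: str) -> int:
    """Number of bytes needed to encode a single character in UTF-8."""
    return len(ch.encode("utf-8"))

common = "\u4e2d"      # a common Chinese character inside the BMP
rare = "\U000212FE"    # a rare CJK character outside the BMP

for ch in (common, rare):
    encoded = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {utf8_length(ch)} bytes ({encoded.hex().upper()})")
```
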



Let's check the official documentation to find out:





    • For a BMP character, utf8 and utf8mb4 have identical storage characteristics: same code values, same encoding, same length.
    • For a supplementary character, utf8 cannot store the character at all, while utf8mb4 requires four bytes to store it. Since utf8 cannot store the character at all, you do not have any supplementary characters in utf8 columns and you need not worry about converting characters or losing data when upgrading utf8 data from older versions of MySQL.
    • utf8mb4 is a superset of utf8.
Supplementary characters include emoji (special Unicode characters that are common on iOS and Android phones), many rarely used Chinese characters, and any newly added Unicode characters.


The original UTF-8 format used one to six bytes and could encode up to 31 bits. The latest UTF-8 specification uses only one to four bytes and can encode up to 21 bits, just enough to represent all 17 Unicode planes.
utf8 in MySQL is a character set that supports only UTF-8 characters of at most three bytes, that is, the Basic Multilingual Plane of Unicode.
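The relationship between code points, planes, and the three-byte limit can be sketched in a few lines of Python. The function names `plane` and `fits_mysql_utf8` are made up for this illustration; the only facts relied on are that Unicode ends at U+10FFFF and that plane 0 is the Basic Multilingual Plane.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point defined by Unicode

def plane(code_point: int) -> int:
    """Unicode plane number of a code point (0 = Basic Multilingual Plane)."""
    if not 0 <= code_point <= MAX_CODE_POINT:
        raise ValueError("not a Unicode code point")
    return code_point >> 16

def fits_mysql_utf8(ch: str) -> bool:
    """True if the character lies in plane 0, i.e. needs at most 3 bytes in UTF-8."""
    return plane(ord(ch)) == 0

print(plane(MAX_CODE_POINT) + 1)      # number of planes
print(MAX_CODE_POINT.bit_length())    # bits needed for the largest code point
print(fits_mysql_utf8("\u4e2d"), fits_mysql_utf8("\U0001F600"))
```
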
Why does utf8 in MySQL support only UTF-8 characters of at most three bytes? My guess is that when MySQL development started, Unicode had no supplementary planes yet. At the time, the Unicode Consortium was still dreaming that "65,535 characters are enough for the whole world". String lengths in MySQL are counted in characters, not bytes, and for the CHAR data type enough space must be reserved for the full string. With the utf8 character set, the reserved length is the length of the longest utf8 character multiplied by the string length, so it was natural to limit utf8 characters to 3 bytes; for example, MySQL reserves 300 bytes for a CHAR(100). As for why later versions did not add support for 4-byte UTF-8 characters to utf8, I think one reason is backward compatibility, and another is that characters outside the Basic Multilingual Plane really are rarely used.
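The reservation arithmetic for CHAR columns works out as follows. This is only a sketch of the multiplication described above; the table `MAX_BYTES_PER_CHAR` and the function `reserved_bytes` are illustrative names, and the space a real table uses also depends on the storage engine and row format.

```python
# Maximum bytes per character for the two MySQL character sets discussed here.
MAX_BYTES_PER_CHAR = {"utf8": 3, "utf8mb4": 4}

def reserved_bytes(char_length: int, charset: str) -> int:
    """Worst-case bytes reserved for a CHAR(char_length) column."""
    return char_length * MAX_BYTES_PER_CHAR[charset]

print(reserved_bytes(100, "utf8"))     # CHAR(100) with utf8
print(reserved_bytes(100, "utf8mb4"))  # CHAR(100) with utf8mb4
```
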
To store 4-byte UTF-8 characters in MySQL you need the utf8mb4 character set, which is supported only from version 5.5.3 onward (check the version with SELECT VERSION();). I think that for better compatibility you should always use utf8mb4 instead of utf8. For CHAR columns utf8mb4 consumes more space, so, following MySQL's official recommendation, use VARCHAR instead of CHAR.


3. Solutions


Now that we know the cause, let's go through the options for solving the problem described at the beginning.


3.1 Upgrading MySQL and switching the character set from utf8 to utf8mb4


Upgrade MySQL to 5.5.3 or later. To check the version of the current environment:


SELECT VERSION();
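If the check needs to be scripted, the version string returned by the query can be compared against 5.5.3. A minimal Python sketch, assuming a MySQL-style string such as `5.5.3-log`; the function name `supports_utf8mb4` is hypothetical and other server flavors may number their versions differently.

```python
import re

def supports_utf8mb4(version: str) -> bool:
    """True if a MySQL version string such as '5.5.3-log' is 5.5.3 or later."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    # Compare (major, minor, patch) as integers, not as text.
    return tuple(int(part) for part in match.groups()) >= (5, 5, 3)

print(supports_utf8mb4("5.1.73-log"))
print(supports_utf8mb4("5.5.3"))
```
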


MySQL added the utf8mb4 encoding in 5.5.3. mb4 means "most bytes 4", that is, at most four bytes per character, and the encoding was designed specifically to handle four-byte Unicode characters. Fortunately, utf8mb4 is a superset of utf8, so apart from switching the encoding to utf8mb4 no other conversion is needed. Of course, if saving space matters, utf8 is generally enough.
So a good approach is to adopt the solution that best fits the present, then iterate gradually as new requirements appear.


3.1.1 Modifying the table structure directly
-- Modify the database character set
ALTER DATABASE test CHARACTER SET = utf8mb4;
-- Modify the table character set
ALTER TABLE test CONVERT TO CHARACTER SET utf8mb4;
-- Modify the column character set
ALTER TABLE `test` CHANGE COLUMN `name` `name` VARCHAR(12) CHARACTER SET utf8mb4;
3.1.2 Modifying the database default configuration
[client]
default-character-set = utf8mb4
[mysqld]
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
[mysql]
default-character-set = utf8mb4
P.S. If you are using Java, you need to upgrade the JDBC driver to mysql-connector-java-5.1.14.jar.
3.2 Forcibly filtering out rare characters


From a business and technical point of view, a compromise is to filter out rare characters in advance. Such characters are used very little, and even if they are stored in the database they can still cause problems when displayed or queried, so it may be simpler to filter them out directly. MySQL's lack of four-byte support in utf8 is probably partly historical baggage and partly an attempt to save space.


3.2.1 Shell Filter


For example, the loading script can first use sed, awk, Python, or Perl to filter out all four-byte rare characters, and then load the data into the database:
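A minimal sketch of such a pre-load filter in Python, assuming the input has already been decoded to a `str`. The function name `strip_four_byte_chars` and the optional `replacement` parameter are illustrative. It relies on the fact that every code point above U+FFFF needs four bytes in UTF-8.

```python
import re

# Any code point above U+FFFF takes four bytes in UTF-8.
_SUPPLEMENTARY = re.compile("[\U00010000-\U0010FFFF]")

def strip_four_byte_chars(text: str, replacement: str = "") -> str:
    """Remove (or replace) every character that MySQL's 3-byte utf8 cannot store."""
    return _SUPPLEMENTARY.sub(replacement, text)

sample = "a\U000212FEb\u4e2d"
print(strip_four_byte_chars(sample))
print(strip_four_byte_chars(sample, "?"))
```

Passing a visible replacement such as "?" instead of deleting the character keeps the string length stable, which can make filtered rows easier to spot later.
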





3.2.2 Filtering operations in Java


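To check whether MySQL's utf8 can store every character of a string, the code is roughly:


// Returns true if every code point of str lies in the BMP (at most 3 bytes in UTF-8).
static boolean isBmpOnly(String str) {
    for (int i = 0; i < str.length(); i += Character.charCount(str.codePointAt(i))) {
        int c = str.codePointAt(i);
        if (c > 0xFFFF) {
            return false;
        }
    }
    return true;
}
You can adapt it slightly to your needs.


3.3 Avoiding garbled output on the client: binary storage and query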


If the web page or the terminal itself does not support four-byte UTF-8 characters, you can work in binary mode:


create table t1(name varchar(20) charset utf8mb4);
insert into t1 values(0xF0A09080);

set charset binary;
select * from t1;
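The hexadecimal literal used in the INSERT above is simply the UTF-8 byte sequence of the character. A small Python sketch for converting in both directions; the helper names `to_mysql_hex_literal` and `from_hex` are hypothetical.

```python
def to_mysql_hex_literal(text: str) -> str:
    """Render a non-empty string as a 0x... literal of its UTF-8 bytes."""
    if not text:
        raise ValueError("an empty string has no hexadecimal literal")
    return "0x" + text.encode("utf-8").hex().upper()

def from_hex(hex_digits: str) -> str:
    """Decode hex digits (without the 0x prefix) back to text."""
    return bytes.fromhex(hex_digits).decode("utf-8")

ch = from_hex("F0A09080")
print(f"U+{ord(ch):04X}")
print(to_mysql_hex_literal(ch))
```
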
4. Application and system support for four-byte UTF-8 characters


Finally, a note on system support for 4-byte UTF-8 characters:


    • Windows XP: the XP system I tested does not support 4-byte UTF-8 characters, and the browser does not display them correctly
    • Windows 7: supports 4-byte UTF-8 characters
    • Mac OS X: supports 4-byte UTF-8 characters
    • iPhone/iPad: supports 4-byte UTF-8 characters
Take PHP as an example:
    • PHP connection session encoding set to utf8, MySQL column is TEXT with character set utf8: the written content is truncated starting at the first 4-byte UTF-8 character
    • PHP connection session encoding set to utf8mb4, MySQL column is TEXT with character set utf8: the content is written in full, but each 4-byte UTF-8 character is replaced with a question mark "?"
    • PHP connection session encoding set to utf8mb4, MySQL column is TEXT with character set utf8mb4: 4-byte UTF-8 characters are fully supported
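The three PHP cases can be modeled in a few lines of Python to make the difference between truncation and replacement concrete. This is a simulation of the behavior described in the list, not of MySQL internals; the function names are hypothetical, and combinations not listed are deliberately rejected.

```python
def first_four_byte_index(text: str):
    """Index of the first character above U+FFFF, or None if there is none."""
    for i, ch in enumerate(text):
        if ord(ch) > 0xFFFF:
            return i
    return None

def simulate_write(text: str, session_charset: str, column_charset: str) -> str:
    """Model what ends up stored for the three session/column cases listed above."""
    case = (session_charset, column_charset)
    if case == ("utf8", "utf8"):
        cut = first_four_byte_index(text)
        return text if cut is None else text[:cut]   # truncated
    if case == ("utf8mb4", "utf8"):
        return "".join("?" if ord(ch) > 0xFFFF else ch for ch in text)
    if case == ("utf8mb4", "utf8mb4"):
        return text                                   # stored intact
    raise ValueError("combination not covered by the cases listed above")

sample = "ab\U0001F600cd"
for session, column in (("utf8", "utf8"), ("utf8mb4", "utf8"), ("utf8mb4", "utf8mb4")):
    print(session, column, repr(simulate_write(sample, session, column)))
```
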
In terms of platform support, as Windows XP is phased out, supporting 4-byte UTF-8 characters becomes necessary.
The official manual notes that utf8mb4 is a superset of utf8, so you can upgrade with confidence.


5. The final question


Having read this far, I wonder whether the careful reader has noticed that the code in this article is shown as images. Why? I rarely turn code into images in my own articles; it is because ...

Ha ha, talk about touching a sore spot ...





6. References

[1] Talking about charset and character encoding
http://my.oschina.net/leejun2005/blog/232732#OSC_h3_4

[2] sed matching any ASCII code/hex byte
http://forums.gentoo.org/viewtopic-t-770576-start-0.html

[3] 10.1.10.6 The utf8mb4 Character Set (4-Byte UTF-8 Unicode Encoding)
http://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html

[4] UTF8 in MySQL
http://blog.haipo.me/?p=1149

[5] MySQL 4-byte UTF8 character set problem
http://bbs.chinaunix.net/thread-3766853-1-1.html

[6] MySQL, UTF-8 and emoji
http://shadyxu.com/mysql_utf8.html

[7] Resolving the MySQL content truncation problem
http://www.momotime.me/2014/10/mysql-utf8mb4/

[8] MySQL stores 4-byte characters
http://www.web-tinker.com/article/20643.html

[9] About UTF-8 encoded MySQL throwing the "Incorrect string value" error
http://blog.csdn.net/tannasu/article/details/8064021

[10] Resolving the MySQL content truncation problem
http://daiweiyang.com/mysql_data_truncated/


