Incorrect string value garbled problem in MySQL solution

Last Update:2018-07-25 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Have you ever encountered an error similar to the following?

Java.sql.SQLException:Incorrect string value: ' \xf0\x9f\x92\x9c ' for column ' content ' at row 1.

The reason for this exception is that the UTF8 encoding in MySQL will store a character up to 3 bytes if the UTF8 of a character

Encoding takes up 4 bytes (most commonly the emoji emoji character in iOS), and then it is written to the database with an error.

MySQL, starting with version 5.5.3, supports a 4-byte UTF8 encoding, with the encoding name UTF8MB4 (mb4 means max bytes 4), which stores a character up to 4 bytes.

To prove this problem, you can execute the following sql:

select * frominformation_schema.CHARACTER_SETSwhere CHARACTER_SET_NAME like ‘utf8%‘

Results

Therefore, to resolve the above exception, you need to use UTF8MB4 encoding.

After you resolve the database encoding, you also need to resolve the encoding problem that is used by the Client Connection connection object.

Call the created connection object to execute the following sql:

conn.createStatement().execute("SET names ‘utf8mb4‘");

If you use a DataSource data source in your project, you only need to configure the data source, and here is an example of Apache's DBCP data source, which is configured as follows in the spring framework:

<!--data Source-<BeanId="DataSource"class="Org.apache.commons.dbcp.BasicDataSource" ><PropertyName="Driverclassname"Value="Com.mysql.jdbc.Driver" ></Property><PropertyName="url"Value="Jdbc:mysql://${${data-source.prefix}.data-source.host-name}:3306/${${data-source.prefix}.data-source.db-name} ? characterencoding=utf8&amp;autoreconnect=true&amp;failoverreadonly=false&amp;maxreconnects=10 &amp;allowmultiqueries=true "/><PropertyName="Username"Value="${${data-source.prefix}.data-source.username}"/><PropertyName="Password"Value="${${data-source.prefix}.data-source.password}"/><PropertyName="Maxactive"Value="/>"<PropertyName="Maxidle"Value="2"/><PropertyName="Testonborrow"Value="True"/><PropertyName="Testonreturn"Value="True"/><PropertyName="Testwhileidle"Value="True"/><property name= "ValidationQuery"  value= "select 1"  /> <!--  This configuration is used to perform the specified initialization connection Sql -->property name= " Connectioninitsqls "><list><< Span class= "Hljs-name" >value>set names  ' utf8mb4 ' </ value></list> </property></ BEAN>

The following explanations are referenced from the MySQL reference manual:

SET NAMES ' Charset_name '

Set names shows what character set is used in the SQL statement sent by the client.

Therefore, the set NAMES ' UTF8MB4 ' statement tells the server that "the information coming from this client will be in character set Utf8mb4". It also specifies a character set for the result that the server sends back to the client. (For example, if you use a SELECT statement, it indicates what character set the column values use.) ）

The SET NAMES ' x ' statement is equivalent to these three statements:

mysql> SET character_set_client = x;

mysql> SET character_set_results = x;

mysql> SET character_set_connection = x;

After this SQL statement is executed, subsequent statement created through this connection object will execute successfully.

In this case, the problem has been solved perfectly, but I have a new problem to associate with:

When the JVM runs virtual, the in-memory strings are encoded in utf-16, and the characters stored in the 4-byte utf-8 encoding for emoji expressions in iOS are stored in Java runtime.

So, I looked for a emoji character (the value of 4 bytes is 0xf0,0x9f,0x92,0x9c, respectively), and did the following experiment.

byte[] bytes =NewByte[] {(byte) 0xf0, (byte) 0x9F, (byte) 0x92, (byte) 0x9c}; string s = new string (bytes, charset.forname ("Utf-8")); System. out.println ("Length:" +s.length ());  for (int i=0;i<s.length (); i++) {int ch = s.charat (i); System. out.println ("0x" +integer.tohexstring (ch));}

The results of the implementation are as follows:

As can be seen from the results, the Unicode value (also called Codepoint code point, which is described in the following API) is a single character greater than 0xFFFF, and the JVM internally consumes 2 char (that is, 4 bytes) of storage.

All characters greater than 0xffff are all within the auxiliary plane of the UTF encoded table (the domain auxiliary platform corresponds to the base plane, referred to as BMP). Therefore, it is also good to judge whether a char in a string is a base plane character or part of a secondary plane character. Here are some of the APIs in Java.lang.Character:

In the following description, the code point is the Unicode value of the character

Character in API	Describe
Isvalidcodepoint (int codepoint): boolean	Determines whether the input code point is valid, all code points belonging to the UTF definition plane are valid
Isbmpcodepoint (int codepoint): boolean	Determine if the input code point belongs to the base plane, i.e.: 0x0000~0xffff
Issupplementarycodepoint (int codepoint): boolean	Determine if the input code point is a secondary plane, that is: code point >0XFFFF
Issurrogate (Char ch): Boolean	Determines whether the input character is part of a secondary plane character

It is also easy to get the code point of a character in a string, calling String.codepointat (int index): Int.

Finally, on the relationship between Unicode, UCS-2, UCS-4, UTF-8, UTF-16 encoding, readers should use Baidu as their own. There are too many articles, so there is not much to introduce here.

Resources:

MySQL utf8mb4 with emoji expression:

http://my.oschina.net/wingyiu/blog/153357

An analysis of the problem of the insertion failure/death of the uncommon characters under MySQL UTF8 encoding

Incorrect string value garbled problem in MySQL solution

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Incorrect string value garbled problem in MySQL solution

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

Incorrect string value garbled problem in MySQL solution

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support