Incorrect string value garbled problem in MySQL solution

Source: Internet
Author: User

Incorrect string value garbled problem in MySQL solution

Have you ever encountered an error similar to the following?

Java.sql.SQLException:Incorrect string value: ' \xf0\x9f\x92\x9c ' for column ' content ' at row 1.

The reason for this exception is that the UTF8 encoding in MySQL will store a character up to 3 bytes if the UTF8 of a character

Encoding takes up 4 bytes (most commonly the emoji emoji character in iOS), and then it is written to the database with an error.

MySQL, starting with version 5.5.3, supports a 4-byte UTF8 encoding, with the encoding name UTF8MB4 (mb4 means max bytes 4), which stores a character up to 4 bytes.

To prove this problem, you can execute the following sql:

select * frominformation_schema.CHARACTER_SETSwhere CHARACTER_SET_NAME like ‘utf8%‘

Results

Therefore, to resolve the above exception, you need to use UTF8MB4 encoding.

After you resolve the database encoding, you also need to resolve the encoding problem that is used by the Client Connection connection object.

Call the created connection object to execute the following sql:

conn.createStatement().execute("SET names ‘utf8mb4‘");

If you use a DataSource data source in your project, you only need to configure the data source, and here is an example of Apache's DBCP data source, which is configured as follows in the spring framework:

<!--data Source-<BeanId="DataSource"class="Org.apache.commons.dbcp.BasicDataSource" ><PropertyName="Driverclassname"Value="Com.mysql.jdbc.Driver" ></Property><PropertyName="url"Value="Jdbc:mysql://${${data-source.prefix}.data-source.host-name}:3306/${${data-source.prefix}.data-source.db-name} ? characterencoding=utf8&amp;autoreconnect=true&amp;failoverreadonly=false&amp;maxreconnects=10 &amp;allowmultiqueries=true "/><PropertyName="Username"Value="${${data-source.prefix}.data-source.username}"/><PropertyName="Password"Value="${${data-source.prefix}.data-source.password}"/><PropertyName="Maxactive"Value="/>"<PropertyName="Maxidle"Value="2"/><PropertyName="Testonborrow"Value="True"/><PropertyName="Testonreturn"Value="True"/><PropertyName="Testwhileidle"Value="True"/><property name= "ValidationQuery"  value= "select 1"  /> <!--  This configuration is used to perform the specified initialization connection Sql -->property name= " Connectioninitsqls "><list><< Span class= "Hljs-name" >value>set names  ' utf8mb4 ' </ value></list> </property></ BEAN>             

The following explanations are referenced from the MySQL reference manual:

SET NAMES ' Charset_name '

Set names shows what character set is used in the SQL statement sent by the client.

Therefore, the set NAMES ' UTF8MB4 ' statement tells the server that "the information coming from this client will be in character set Utf8mb4". It also specifies a character set for the result that the server sends back to the client. (For example, if you use a SELECT statement, it indicates what character set the column values use.) )

The SET NAMES ' x ' statement is equivalent to these three statements:

mysql> SET character_set_client = x;

mysql> SET character_set_results = x;

mysql> SET character_set_connection = x;

After this SQL statement is executed, subsequent statement created through this connection object will execute successfully.

In this case, the problem has been solved perfectly, but I have a new problem to associate with:

When the JVM runs virtual, the in-memory strings are encoded in utf-16, and the characters stored in the 4-byte utf-8 encoding for emoji expressions in iOS are stored in Java runtime.

So, I looked for a emoji character (the value of 4 bytes is 0xf0,0x9f,0x92,0x9c, respectively), and did the following experiment.

byte[] bytes =NewByte[] {(byte) 0xf0, (byte) 0x9F, (byte) 0x92, (byte) 0x9c}; string s = new string (bytes, charset.forname ("Utf-8")); System. out.println ("Length:" +s.length ());  for (int i=0;i<s.length (); i++) {int ch = s.charat (i); System. out.println ("0x" +integer.tohexstring (ch));}            

The results of the implementation are as follows:

As can be seen from the results, the Unicode value (also called Codepoint code point, which is described in the following API) is a single character greater than 0xFFFF, and the JVM internally consumes 2 char (that is, 4 bytes) of storage.

All characters greater than 0xffff are all within the auxiliary plane of the UTF encoded table (the domain auxiliary platform corresponds to the base plane, referred to as BMP). Therefore, it is also good to judge whether a char in a string is a base plane character or part of a secondary plane character. Here are some of the APIs in Java.lang.Character:

In the following description, the code point is the Unicode value of the character

Character in API Describe
Isvalidcodepoint (int codepoint): boolean Determines whether the input code point is valid, all code points belonging to the UTF definition plane are valid
Isbmpcodepoint (int codepoint): boolean Determine if the input code point belongs to the base plane, i.e.: 0x0000~0xffff
Issupplementarycodepoint (int codepoint): boolean Determine if the input code point is a secondary plane, that is: code point >0XFFFF
Issurrogate (Char ch): Boolean Determines whether the input character is part of a secondary plane character

It is also easy to get the code point of a character in a string, calling String.codepointat (int index): Int.

Finally, on the relationship between Unicode, UCS-2, UCS-4, UTF-8, UTF-16 encoding, readers should use Baidu as their own. There are too many articles, so there is not much to introduce here.

Resources:

    • MySQL utf8mb4 with emoji expression:

http://my.oschina.net/wingyiu/blog/153357

    • An analysis of the problem of the insertion failure/death of the uncommon characters under MySQL UTF8 encoding

Incorrect string value garbled problem in MySQL solution

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.