Incorrect string value garbled problem in MySQL solution
Have you ever encountered an error similar to the following?
Java.sql.SQLException:Incorrect string value: ' \xf0\x9f\x92\x9c ' for column ' content ' at row 1.
The reason for this exception is that the UTF8 encoding in MySQL will store a character up to 3 bytes if the UTF8 of a character
Encoding takes up 4 bytes (most commonly the emoji emoji character in iOS), and then it is written to the database with an error.
MySQL, starting with version 5.5.3, supports a 4-byte UTF8 encoding, with the encoding name UTF8MB4 (mb4 means max bytes 4), which stores a character up to 4 bytes.
To prove this problem, you can execute the following sql:
select * frominformation_schema.CHARACTER_SETSwhere CHARACTER_SET_NAME like ‘utf8%‘
Results
Therefore, to resolve the above exception, you need to use UTF8MB4 encoding.
After you resolve the database encoding, you also need to resolve the encoding problem that is used by the Client Connection connection object.
Call the created connection object to execute the following sql:
conn.createStatement().execute("SET names ‘utf8mb4‘");
If you use a DataSource data source in your project, you only need to configure the data source, and here is an example of Apache's DBCP data source, which is configured as follows in the spring framework:
<!--data Source-<BeanId="DataSource"class="Org.apache.commons.dbcp.BasicDataSource" ><PropertyName="Driverclassname"Value="Com.mysql.jdbc.Driver" ></Property><PropertyName="url"Value="Jdbc:mysql://${${data-source.prefix}.data-source.host-name}:3306/${${data-source.prefix}.data-source.db-name} ? characterencoding=utf8&autoreconnect=true&failoverreadonly=false&maxreconnects=10 &allowmultiqueries=true "/><PropertyName="Username"Value="${${data-source.prefix}.data-source.username}"/><PropertyName="Password"Value="${${data-source.prefix}.data-source.password}"/><PropertyName="Maxactive"Value="/>"<PropertyName="Maxidle"Value="2"/><PropertyName="Testonborrow"Value="True"/><PropertyName="Testonreturn"Value="True"/><PropertyName="Testwhileidle"Value="True"/><property name= "ValidationQuery" value= "select 1" /> <!-- This configuration is used to perform the specified initialization connection Sql -->property name= " Connectioninitsqls "><list><< Span class= "Hljs-name" >value>set names ' utf8mb4 ' </ value></list> </property></ BEAN>
The following explanations are referenced from the MySQL reference manual:
SET NAMES ' Charset_name '
Set names shows what character set is used in the SQL statement sent by the client.
Therefore, the set NAMES ' UTF8MB4 ' statement tells the server that "the information coming from this client will be in character set Utf8mb4". It also specifies a character set for the result that the server sends back to the client. (For example, if you use a SELECT statement, it indicates what character set the column values use.) )
The SET NAMES ' x ' statement is equivalent to these three statements:
mysql> SET character_set_client = x;
mysql> SET character_set_results = x;
mysql> SET character_set_connection = x;
After this SQL statement is executed, subsequent statement created through this connection object will execute successfully.
In this case, the problem has been solved perfectly, but I have a new problem to associate with:
When the JVM runs virtual, the in-memory strings are encoded in utf-16, and the characters stored in the 4-byte utf-8 encoding for emoji expressions in iOS are stored in Java runtime.
So, I looked for a emoji character (the value of 4 bytes is 0xf0,0x9f,0x92,0x9c, respectively), and did the following experiment.
byte[] bytes =NewByte[] {(byte) 0xf0, (byte) 0x9F, (byte) 0x92, (byte) 0x9c}; string s = new string (bytes, charset.forname ("Utf-8")); System. out.println ("Length:" +s.length ()); for (int i=0;i<s.length (); i++) {int ch = s.charat (i); System. out.println ("0x" +integer.tohexstring (ch));}
The results of the implementation are as follows:
As can be seen from the results, the Unicode value (also called Codepoint code point, which is described in the following API) is a single character greater than 0xFFFF, and the JVM internally consumes 2 char (that is, 4 bytes) of storage.
All characters greater than 0xffff are all within the auxiliary plane of the UTF encoded table (the domain auxiliary platform corresponds to the base plane, referred to as BMP). Therefore, it is also good to judge whether a char in a string is a base plane character or part of a secondary plane character. Here are some of the APIs in Java.lang.Character:
In the following description, the code point is the Unicode value of the character
Character in API |
Describe |
Isvalidcodepoint (int codepoint): boolean |
Determines whether the input code point is valid, all code points belonging to the UTF definition plane are valid |
Isbmpcodepoint (int codepoint): boolean |
Determine if the input code point belongs to the base plane, i.e.: 0x0000~0xffff |
Issupplementarycodepoint (int codepoint): boolean |
Determine if the input code point is a secondary plane, that is: code point >0XFFFF |
Issurrogate (Char ch): Boolean |
Determines whether the input character is part of a secondary plane character |
It is also easy to get the code point of a character in a string, calling String.codepointat (int index): Int.
Finally, on the relationship between Unicode, UCS-2, UCS-4, UTF-8, UTF-16 encoding, readers should use Baidu as their own. There are too many articles, so there is not much to introduce here.
Resources:
http://my.oschina.net/wingyiu/blog/153357
Incorrect string value garbled problem in MySQL solution