At first, the following phenomena were found:
mysql> show variables like ' character% '; +--------------------------+------------------- --------------------+| variable_name | Value |+ --------------------------+---------------------------------------+| character_set_client | latin1 | | character_set_connection | latin1 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | latin1 | | character_set_server | utf8mb4 | | character_set_system | utf8 | | character_sets_dir | /opt/mysql/server-5.6/share/charsets/ |+--------------------------+---------------------------------------+mysql> show create table t4\G*************************** 1. row *************************** Table: t4Create Table: CREATE TABLE ' T4 ' ( ' Data ' varchar ( default null) ENGINE=InnoDB DEFAULT CHARSET=latin1mysql> insert into t4 select ' \u+1f600 ';
Think it's strange how latin1 also support emoji characters? is not only UTF8MB4 support? So in the StackOverflow question, a netizen's answer feel reasonable, the answer is as follows:
I think you saved to and retrieved from the database a string of bytes that's interpreted by the terminal as a Unicode Character. Check the output of SELECT LENGTH (data), Char_length (data) from T4 to see what ' s happening. They should return different values for multi-byte characters and the same value forlatin1. –axiac hours ago
In addition inadvertently saw a blog, which said:
Throw a question to the table of the latin1 character set, is there a problem for the user to write and read Chinese characters? The answer is no problem, as long as it is properly set. Assuming that SECURECRT is set to Latin1 for both the utf8,character_set_client and the table character sets, referring to the analysis of section 3rd, the problem of character set encoding conversion is not involved in the process of reading and writing data to the user. The UTF8 character into a binary stream is written to the database, extracted, securecrt then the corresponding binary decoding to the corresponding Chinese characters, so it does not affect the user's use.
So now I think the above phenomenon is normal.
Because the default character set for the operating system is UTF8 (lang=en_us. UTF-8), and the client, connection, database are all latin1, so this all the way (from the terminal interface execution insert to save data into the table) there is no encoding conversion, the direct transmission is UTF8 encoded binary stream.
How to verify the above conclusions? So decided to modify the intermediate link character set, see what happens?
mysql> set names gbk;mysql> show variables like ' character% '; +------- -------------------+---------------------------------------+| variable_name | Value |+--------------------------+---------------------------------------+| character_set_client | gbk | | character_set_connection | gbk | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | gbk | | character_set_server | utf8mb4 | | character_set_system | utf8 | | character_sets_dir | /opt/mysql/server-5.6/share/charsets/ |+--------------------------+---------------------------------------+mysql> insert into t4 select ' \u+1f600 '; error 1366 (HY000): incorrect string value: ' \xf0\x9f\x98\x80 ' for column ' Data ' at row 1
Analysis:
Now the operating system is UTF8, client, connection is GBK, the field is Latin1, because the first is the UTF8 binary stream, and the client and connection are GBK, without transcoding, Therefore only in the last when saved to the table field needs to be changed from UTF8 to latin1, because Latin1 can not decode the UTF8 binary stream resulting in the above error.
What happens if you move the character set out of sync? As shown below:
mysql> set character_set_connection = latin1;mysql> show variables like ' character% '; +--------------------------+---------------------------------------+| variable_ name | value |+--------------------------+--------- ------------------------------+| character_set_client | gbk | | character_set_connection | latin1 | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | gbk | | character_set_server | utf8mb4 | | character_set_system | utf8 | | character_sets_dir | /opt/mysql/server-5.6/share/charsets/ |+--------------------------+---------------------------------------+
Now the client and connection are inconsistent, that is, the need to first utf8-->gbk-->latin1, then can now successfully insert the emoji character?
mysql> insert into T4 select ' \u+1f600 ';
Can be inserted, the query results are as follows:
Mysql> Select Data,hex (data) from t4;+------+-----------+| Data | Hex (data) |+------+-----------+|?? | 3f3f |+------+-----------+
It seems that in the process of UTF8-->GBK, the UTF8 encoded binary stream (F0 9f 98 80) is decoded into a '?? ', and '?? ' Can be latin1 successfully analyzed. But how to simulate the above transformation through a Java program?
Tried the following code but failed to reproduce it successfully.
When the UTF8 turned to GBK, he could not get '?? ', but ' Luozi 榾 '. Somehow?
What if the client in the example above is swapped with connection for a position, as follows:
mysql> show variables like ' character% '; +--------------------------+------------------- --------------------+| variable_name | Value |+ --------------------------+---------------------------------------+| character_set_client | latin1 | | character_set_connection | gbk | | character_set_database | latin1 | | character_set_filesystem | binary | | character_set_results | gbk | | character_set_server | utf8mb4 | | character_set_system | utf8 | | character_sets_dir | /opt/mysql/server-5.6/share/charsets/ |+--------------------------+---------------------------------------+
Now the transformation has become so: utf8-->latin1-->gbk-->latin1, from previous experience seems to be able to predict the first step of conversion should be error (incorrect string value: ' \xf0\x9f\x98 \x80 ' for column ' data ' at row 1, but the actual situation is:
mysql> insert into T4 select ' \u+1f600 '; Query OK, 1 row affected (0.01 sec) mysql> Select Data,hex (data) from t4;+------+-----------+| Data | Hex (data) |+------+-----------+|?? | 3f3f | | ???? | 3f3f3f3f |+------+-----------+
No error can still be inserted successfully, it seems as long as not the last step to insert a record in the table will not error, but this time to become 4 question marks.
The results of this simulation with the Java program are as follows:
Note:
Why use Cp1252 to represent latin1?
The main reference is the table:
Table 5.3 MySQL to Java Encoding Name translations
Mysql Latin1 also supports the illusion analysis of emoji characters