Mysql Latin1 also supports the illusion analysis of emoji characters

Source: Internet
Author: User

At first, the following phenomena were found:

mysql> show variables like  ' character% '; +--------------------------+------------------- --------------------+| variable_name             | Value                                  |+ --------------------------+---------------------------------------+| character_set_client      | latin1                                  | |  character_set_connection | latin1                                   | |  character_set_database   | latin1                                  | |  character_set_filesystem | binary                                  | |  character_set_results    | latin1                                  | |  character_set_server     | utf8mb4                                  | |  character_set_system     | utf8                                    | |  character_sets_dir       | /opt/mysql/server-5.6/share/charsets/  |+--------------------------+---------------------------------------+mysql> show create  table t4\G*************************** 1. row ***************************    Table: t4Create Table: CREATE TABLE  ' T4 '   (   ' Data '  varchar (  default null)  ENGINE=InnoDB DEFAULT CHARSET=latin1mysql> insert  into t4 select  ' \u+1f600 ';

Think it's strange how latin1 also support emoji characters? is not only UTF8MB4 support? So in the StackOverflow question, a netizen's answer feel reasonable, the answer is as follows:

I think you saved to and retrieved from the database a string of bytes that's interpreted by the terminal as a Unicode Character. Check the output of SELECT LENGTH (data), Char_length (data) from T4 to see what ' s happening. They should return different values for multi-byte characters and the same value forlatin1. –axiac hours ago

In addition inadvertently saw a blog, which said:

Throw a question to the table of the latin1 character set, is there a problem for the user to write and read Chinese characters? The answer is no problem, as long as it is properly set. Assuming that SECURECRT is set to Latin1 for both the utf8,character_set_client and the table character sets, referring to the analysis of section 3rd, the problem of character set encoding conversion is not involved in the process of reading and writing data to the user. The UTF8 character into a binary stream is written to the database, extracted, securecrt then the corresponding binary decoding to the corresponding Chinese characters, so it does not affect the user's use.

So now I think the above phenomenon is normal.

Because the default character set for the operating system is UTF8 (lang=en_us. UTF-8), and the client, connection, database are all latin1, so this all the way (from the terminal interface execution insert to save data into the table) there is no encoding conversion, the direct transmission is UTF8 encoded binary stream.

How to verify the above conclusions? So decided to modify the intermediate link character set, see what happens?

mysql> set names gbk;mysql> show variables like  ' character% '; +------- -------------------+---------------------------------------+| variable_name             | Value                                   |+--------------------------+---------------------------------------+|  character_set_client     | gbk                                     | |  character_set_connection | gbk                                    | |  character_set_database   | latin1                                  | |  character_set_filesystem | binary                                  | |  character_set_results    | gbk                                     | |  character_set_server     | utf8mb4                                 | |  character_set_system     | utf8                                    | |  character_sets_dir       | /opt/mysql/server-5.6/share/charsets/  |+--------------------------+---------------------------------------+mysql> insert into  t4 select  ' \u+1f600 '; error 1366  (HY000): incorrect string value:  ' \xf0\x9f\x98\x80 '  for  column  ' Data '  at row 1

Analysis:

Now the operating system is UTF8, client, connection is GBK, the field is Latin1, because the first is the UTF8 binary stream, and the client and connection are GBK, without transcoding, Therefore only in the last when saved to the table field needs to be changed from UTF8 to latin1, because Latin1 can not decode the UTF8 binary stream resulting in the above error.

What happens if you move the character set out of sync? As shown below:

mysql> set character_set_connection = latin1;mysql> show variables  like  ' character% '; +--------------------------+---------------------------------------+| variable_ name            | value                                   |+--------------------------+--------- ------------------------------+| character_set_client     | gbk                                     | |  character_set_connection | latin1                                 | |  character_set_database   | latin1                                  | |  character_set_filesystem | binary                                  | |  character_set_results    | gbk                                     | |  character_set_server     | utf8mb4                                 | |  character_set_system     | utf8                                    | |  character_sets_dir       | /opt/mysql/server-5.6/share/charsets/  |+--------------------------+---------------------------------------+

Now the client and connection are inconsistent, that is, the need to first utf8-->gbk-->latin1, then can now successfully insert the emoji character?

mysql> insert into T4 select ' \u+1f600 ';

Can be inserted, the query results are as follows:

Mysql> Select Data,hex (data) from t4;+------+-----------+| Data |   Hex (data) |+------+-----------+|?? | 3f3f |+------+-----------+

It seems that in the process of UTF8-->GBK, the UTF8 encoded binary stream (F0 9f 98 80) is decoded into a '?? ', and '?? ' Can be latin1 successfully analyzed. But how to simulate the above transformation through a Java program?

Tried the following code but failed to reproduce it successfully.

When the UTF8 turned to GBK, he could not get '?? ', but ' Luozi 榾 '. Somehow?

What if the client in the example above is swapped with connection for a position, as follows:

mysql> show variables like  ' character% '; +--------------------------+------------------- --------------------+| variable_name             | Value                                  |+ --------------------------+---------------------------------------+| character_set_client      | latin1                                  | |  character_set_connection | gbk                                     | |  character_set_database   | latin1                                  | |  character_set_filesystem | binary                                  | |  character_set_results    | gbk                                     | |  character_set_server     | utf8mb4                                 | |  character_set_system     | utf8                                    | |  character_sets_dir       | /opt/mysql/server-5.6/share/charsets/  |+--------------------------+---------------------------------------+

Now the transformation has become so: utf8-->latin1-->gbk-->latin1, from previous experience seems to be able to predict the first step of conversion should be error (incorrect string value: ' \xf0\x9f\x98 \x80 ' for column ' data ' at row 1, but the actual situation is:

mysql> insert into T4 select ' \u+1f600 '; Query OK, 1 row affected (0.01 sec) mysql> Select Data,hex (data) from t4;+------+-----------+| Data |   Hex (data) |+------+-----------+|?? | 3f3f | | ???? | 3f3f3f3f |+------+-----------+

No error can still be inserted successfully, it seems as long as not the last step to insert a record in the table will not error, but this time to become 4 question marks.

The results of this simulation with the Java program are as follows:


Note:

    1. Why use Cp1252 to represent latin1?

The main reference is the table:

Table 5.3 MySQL to Java Encoding Name translations



Mysql Latin1 also supports the illusion analysis of emoji characters

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.