Mysql Latin1 also supports the illusion analysis of emoji characters

Last Update:2015-05-13 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

At first, the following phenomena were found:

mysql> show variables like  ' character% '; +--------------------------+------------------- --------------------+| variable_name             | Value                                  |+ --------------------------+---------------------------------------+| character_set_client      | latin1                                  | |  character_set_connection | latin1                                   | |  character_set_database   | latin1                                  | |  character_set_filesystem | binary                                  | |  character_set_results    | latin1                                  | |  character_set_server     | utf8mb4                                  | |  character_set_system     | utf8                                    | |  character_sets_dir       | /opt/mysql/server-5.6/share/charsets/  |+--------------------------+---------------------------------------+mysql> show create  table t4\G*************************** 1. row ***************************    Table: t4Create Table: CREATE TABLE  ' T4 '   (   ' Data '  varchar (  default null)  ENGINE=InnoDB DEFAULT CHARSET=latin1mysql> insert  into t4 select  ' \u+1f600 ';

Think it's strange how latin1 also support emoji characters? is not only UTF8MB4 support? So in the StackOverflow question, a netizen's answer feel reasonable, the answer is as follows:

I think you saved to and retrieved from the database a string of bytes that's interpreted by the terminal as a Unicode Character. Check the output of SELECT LENGTH (data), Char_length (data) from T4 to see what ' s happening. They should return different values for multi-byte characters and the same value forlatin1. –axiac hours ago

In addition inadvertently saw a blog, which said:

Throw a question to the table of the latin1 character set, is there a problem for the user to write and read Chinese characters? The answer is no problem, as long as it is properly set. Assuming that SECURECRT is set to Latin1 for both the utf8,character_set_client and the table character sets, referring to the analysis of section 3rd, the problem of character set encoding conversion is not involved in the process of reading and writing data to the user. The UTF8 character into a binary stream is written to the database, extracted, securecrt then the corresponding binary decoding to the corresponding Chinese characters, so it does not affect the user's use.

So now I think the above phenomenon is normal.

Because the default character set for the operating system is UTF8 (lang=en_us. UTF-8), and the client, connection, database are all latin1, so this all the way (from the terminal interface execution insert to save data into the table) there is no encoding conversion, the direct transmission is UTF8 encoded binary stream.

How to verify the above conclusions? So decided to modify the intermediate link character set, see what happens?

mysql> set names gbk;mysql> show variables like  ' character% '; +------- -------------------+---------------------------------------+| variable_name             | Value                                   |+--------------------------+---------------------------------------+|  character_set_client     | gbk                                     | |  character_set_connection | gbk                                    | |  character_set_database   | latin1                                  | |  character_set_filesystem | binary                                  | |  character_set_results    | gbk                                     | |  character_set_server     | utf8mb4                                 | |  character_set_system     | utf8                                    | |  character_sets_dir       | /opt/mysql/server-5.6/share/charsets/  |+--------------------------+---------------------------------------+mysql> insert into  t4 select  ' \u+1f600 '; error 1366  (HY000): incorrect string value:  ' \xf0\x9f\x98\x80 '  for  column  ' Data '  at row 1

Analysis:

Now the operating system is UTF8, client, connection is GBK, the field is Latin1, because the first is the UTF8 binary stream, and the client and connection are GBK, without transcoding, Therefore only in the last when saved to the table field needs to be changed from UTF8 to latin1, because Latin1 can not decode the UTF8 binary stream resulting in the above error.

What happens if you move the character set out of sync? As shown below:

mysql> set character_set_connection = latin1;mysql> show variables  like  ' character% '; +--------------------------+---------------------------------------+| variable_ name            | value                                   |+--------------------------+--------- ------------------------------+| character_set_client     | gbk                                     | |  character_set_connection | latin1                                 | |  character_set_database   | latin1                                  | |  character_set_filesystem | binary                                  | |  character_set_results    | gbk                                     | |  character_set_server     | utf8mb4                                 | |  character_set_system     | utf8                                    | |  character_sets_dir       | /opt/mysql/server-5.6/share/charsets/  |+--------------------------+---------------------------------------+

Now the client and connection are inconsistent, that is, the need to first utf8-->gbk-->latin1, then can now successfully insert the emoji character?

mysql> insert into T4 select ' \u+1f600 ';

Can be inserted, the query results are as follows:

Mysql> Select Data,hex (data) from t4;+------+-----------+| Data |   Hex (data) |+------+-----------+|?? | 3f3f |+------+-----------+

It seems that in the process of UTF8-->GBK, the UTF8 encoded binary stream (F0 9f 98 80) is decoded into a '?? ', and '?? ' Can be latin1 successfully analyzed. But how to simulate the above transformation through a Java program?

Tried the following code but failed to reproduce it successfully.

When the UTF8 turned to GBK, he could not get '?? ', but ' Luozi 榾 '. Somehow?

What if the client in the example above is swapped with connection for a position, as follows:

mysql> show variables like  ' character% '; +--------------------------+------------------- --------------------+| variable_name             | Value                                  |+ --------------------------+---------------------------------------+| character_set_client      | latin1                                  | |  character_set_connection | gbk                                     | |  character_set_database   | latin1                                  | |  character_set_filesystem | binary                                  | |  character_set_results    | gbk                                     | |  character_set_server     | utf8mb4                                 | |  character_set_system     | utf8                                    | |  character_sets_dir       | /opt/mysql/server-5.6/share/charsets/  |+--------------------------+---------------------------------------+

Now the transformation has become so: utf8-->latin1-->gbk-->latin1, from previous experience seems to be able to predict the first step of conversion should be error (incorrect string value: ' \xf0\x9f\x98 \x80 ' for column ' data ' at row 1, but the actual situation is:

mysql> insert into T4 select ' \u+1f600 '; Query OK, 1 row affected (0.01 sec) mysql> Select Data,hex (data) from t4;+------+-----------+| Data |   Hex (data) |+------+-----------+|?? | 3f3f | | ???? | 3f3f3f3f |+------+-----------+

No error can still be inserted successfully, it seems as long as not the last step to insert a record in the table will not error, but this time to become 4 question marks.

The results of this simulation with the Java program are as follows:

Note:

Why use Cp1252 to represent latin1?

The main reference is the table:

Table 5.3 MySQL to Java Encoding Name translations

Mysql Latin1 also supports the illusion analysis of emoji characters

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Mysql Latin1 also supports the illusion analysis of emoji characters

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

Mysql Latin1 also supports the illusion analysis of emoji characters

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support