標籤:
起初發現了如下的現象:
mysql> show variables like ‘character%‘;+--------------------------+---------------------------------------+| Variable_name | Value |+--------------------------+---------------------------------------+| character_set_client | latin1 || character_set_connection | latin1 || character_set_database | latin1 || character_set_filesystem | binary || character_set_results | latin1 || character_set_server | utf8mb4 || character_set_system | utf8 || character_sets_dir | /opt/mysql/server-5.6/share/charsets/ |+--------------------------+---------------------------------------+mysql> show create table t4\G*************************** 1. row *************************** Table: t4Create Table: CREATE TABLE `t4` ( `data` varchar(100) DEFAULT NULL) ENGINE=InnoDB DEFAULT CHARSET=latin1mysql> insert into t4 select ‘\U+1F600‘;
覺得很奇怪怎麼latin1也支援emoji字元了呢?不是只有utf8mb4才支援嗎? 於是在StackOverFlow上提問,一個網友的回答覺得有道理,回答如下:
I think you saved into and retrieved from the database a string of bytes that is interpreted by the terminal as an Unicode character. Check the output of SELECT LENGTH(data), CHAR_LENGTH(data) FROM t4 to see what‘s happening. They should return different values for multi-byte characters and the same value forlatin1. – axiac 19 hours ago
在加上無意中看到了一篇部落格, 其中說道:
拋一個問題,latin1字元集的表,使用者寫入和讀取漢字是否有問題?答案是只要合理設定,沒有問題。假設SecureCRT為UTF8,character_set_client和表字元集均設定為latin1,參考第3節的分析,那麼使用者讀取和寫入資料的過程中,並不涉及字元集編碼轉換的問題,將UTF8的漢字字元轉為二進位流寫入database,提取出來後,secureCRT再將對應的二進位解碼為對應的漢字,所以不影響使用者的使用。
於是現在覺得上述現象很正常。
因為作業系統預設的字元集為utf8(LANG=en_US.UTF-8), 而client、connection、database均為latin1, 於是這一路(從終端介面執行insert到儲存資料到表中)都沒有編碼轉換,直接傳輸的是utf8編碼後的二進位流。
怎麼驗證上述結論呢? 於是決定修改中間環節的字元集,看會發生什麼?
mysql> set names gbk;mysql> show variables like ‘character%‘;+--------------------------+---------------------------------------+| Variable_name | Value |+--------------------------+---------------------------------------+| character_set_client | gbk || character_set_connection | gbk || character_set_database | latin1 || character_set_filesystem | binary || character_set_results | gbk || character_set_server | utf8mb4 || character_set_system | utf8 || character_sets_dir | /opt/mysql/server-5.6/share/charsets/ |+--------------------------+---------------------------------------+mysql> insert into t4 select ‘\U+1F600‘;ERROR 1366 (HY000): Incorrect string value: ‘\xF0\x9F\x98\x80‘ for column ‘data‘ at row 1
分析:
現在作業系統是utf8, client、connection是gbk, 欄位是latin1, 因為一開始是utf8二進位流,且client和connection均為gbk,無需轉碼,故只在最後當儲存到表欄位中時需要由utf8轉為latin1,由於latin1不能解碼該utf8二進位流故導致了上述報錯。
若將字元集不一致的情況再往前挪一步會怎樣呢? 如下所示:
mysql> set character_set_connection = latin1;mysql> show variables like ‘character%‘;+--------------------------+---------------------------------------+| Variable_name | Value |+--------------------------+---------------------------------------+| character_set_client | gbk || character_set_connection | latin1 || character_set_database | latin1 || character_set_filesystem | binary || character_set_results | gbk || character_set_server | utf8mb4 || character_set_system | utf8 || character_sets_dir | /opt/mysql/server-5.6/share/charsets/ |+--------------------------+---------------------------------------+
現在client和connection就不一致了,就是說需要先將utf8-->gbk-->latin1, 那麼現在能成功插入emoji字元嗎?
mysql> insert into t4 select ‘\U+1F600‘;
可以插入,查詢結果如下:
mysql> select data,hex(data) from t4;+------+-----------+| data | hex(data) |+------+-----------+| ?? | 3F3F |+------+-----------+
似乎在utf8-->gbk的過程中,將utf8編碼後的二進位流(f0 9f 98 80)解碼成了‘??’,而‘??’能被latin1成功解析。但如何通過java程式類比上述的轉換呢?
試了下面的代碼 但未能成功再現。
即將utf8轉為gbk時未能得到‘??’,而是‘餜榾’。不知何故?
那假如將上例中的client與connection交換一下位置呢,如下所示:
mysql> show variables like ‘character%‘;+--------------------------+---------------------------------------+| Variable_name | Value |+--------------------------+---------------------------------------+| character_set_client | latin1 || character_set_connection | gbk || character_set_database | latin1 || character_set_filesystem | binary || character_set_results | gbk || character_set_server | utf8mb4 || character_set_system | utf8 || character_sets_dir | /opt/mysql/server-5.6/share/charsets/ |+--------------------------+---------------------------------------+
現在的轉化流變成這樣了:utf8-->latin1-->gbk-->latin1, 從之前的經驗似乎可以預測進行第一步轉化時就應該報錯(Incorrect string value: ‘\xF0\x9F\x98\x80‘ for column ‘data‘ at row 1),但實際情況是:
mysql> insert into t4 select ‘\U+1F600‘;Query OK, 1 row affected (0.01 sec)mysql> select data,hex(data) from t4;+------+-----------+| data | hex(data) |+------+-----------+| ?? | 3F3F || ???? | 3F3F3F3F |+------+-----------+
並未報錯仍能成功插入, 似乎只要不是最後一步往表裡插入記錄就不會報錯,但這次變成4個問號了。
這次用java程式類比的結果如下所示:
註:
為什麼用Cp1252表示latin1?
主要是參考了該表格:
Table 5.3 MySQL to Java Encoding Name Translations
Mysql latin1也支援emoji字元的錯覺分析