Problem description: Messages crawled from Sina Weibo are saved to MySQL. The target column is a varchar and its character set is utf8. Some inserts succeed and others fail with the error shown in the title.
Searching online, many people say this is an encoding problem and suggest changing the encoding or column type (GBK, UTF-8, BLOB and so on), but few give a detailed answer. The real cause turned up on an English-language website. Link 1 Link 2
Error reason: The error message contains the bytes 0xF0 0x9F 0x98 0x84, which together form a single 4-byte sequence in UTF-8 (see the UTF-8 encoding specification). Ordinary Chinese characters take no more than 3 bytes, so where does a 4-byte character come from? It is an emoji typed with a smartphone input method. Why does the insert fail? Because utf8 in MySQL is not full UTF-8: it can only store sequences of 1 to 3 bytes. To store 4-byte characters, the character set must be utf8mb4. utf8mb4 requires MySQL 5.5.3 or later, so check the server version first.
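To make the byte counts concrete, here is a small sketch that prints the UTF-8 bytes of a common Chinese character and of the emoji from the error message. The class and method names (Utf8LengthDemo, utf8Hex) are made up for this illustration.

```java
import java.nio.charset.StandardCharsets;

final class Utf8LengthDemo {
    // Returns the UTF-8 bytes of a string as space-separated uppercase hex.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF because Java bytes are signed.
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String chinese = "\u4E2D";      // U+4E2D, fits in 3 bytes
        String emoji = "\uD83D\uDE04";  // U+1F604, written as a surrogate pair
        System.out.println(utf8Hex(chinese)); // E4 B8 AD
        System.out.println(utf8Hex(emoji));   // F0 9F 98 84
    }
}
```

The second line of output matches the bytes in the error message, which is the sequence that a MySQL utf8 column cannot store.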
Solution:
1 Use the utf8mb4 character set
To use this approach, upgrade MySQL first if the version is lower than 5.5.3, then change the character set of the affected columns to utf8mb4. If you connect to the database through Connector/J, the connection must use utf8mb4 as well: set character_set_server=utf8mb4 (a MySQL server setting) in the configuration.
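The column change can be sketched from Java as follows. This is only an illustration under assumptions: the table name weibo_message, the class Utf8mb4Migration and the collation utf8mb4_unicode_ci are chosen for the example and do not come from the original setup. Running the conversion needs a reachable MySQL server and the Connector/J driver on the classpath, so main only prints the statement unless connection details are passed in.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.SQLException;
import java.sql.Statement;

final class Utf8mb4Migration {
    // Builds the DDL that converts all character columns of a table to utf8mb4.
    static String convertTableSql(String table) {
        // Accept only plain identifiers, since the name is concatenated into SQL.
        if (!table.matches("[A-Za-z0-9_]+")) {
            throw new IllegalArgumentException("unexpected table name: " + table);
        }
        return "ALTER TABLE `" + table
                + "` CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci";
    }

    // Runs the conversion over JDBC.
    static void convert(String jdbcUrl, String user, String password, String table)
            throws SQLException {
        try (Connection conn = DriverManager.getConnection(jdbcUrl, user, password);
             Statement st = conn.createStatement()) {
            st.execute(convertTableSql(table));
        }
    }

    public static void main(String[] args) throws SQLException {
        String table = "weibo_message"; // hypothetical table name
        System.out.println(convertTableSql(table));
        if (args.length == 3) {
            // args: JDBC URL, user, password
            convert(args[0], args[1], args[2], table);
        }
    }
}
```

Converting a large table rewrites it, and indexed varchar columns may hit the index key length limit once each character can take 4 bytes, so it is worth trying this on a copy first.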
2 Define custom filtering rules that remove the 4-byte UTF-8 characters found in the text or convert them to something else.
Here is a test example that replaces each 4-byte character with 0000.
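// b_text is a byte[] holding the UTF-8 bytes of the text
for (int i = 0; i < b_text.length; i++)
{
    // A lead byte of the form 11110xxx starts a 4-byte sequence
    if ((b_text[i] & 0xF8) == 0xF0) {
        // Overwrite all four bytes with ASCII '0' (0x30)
        for (int j = 0; j < 4 && i + j < b_text.length; j++) {
            b_text[i + j] = 0x30;
        }
        i += 3; // skip the three continuation bytes just replaced
    }
}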
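The same filtering can also be done on the String itself, before it is encoded to bytes. The sketch below is an alternative to the byte loop, not the original method: it relies on the fact that the characters needing 4 bytes in UTF-8 are exactly the code points above U+FFFF. The names SupplementaryFilter and replaceFourByteChars are invented for this example.

```java
final class SupplementaryFilter {
    // Replaces every code point above U+FFFF (4 bytes in UTF-8) with the replacement.
    static String replaceFourByteChars(String text, String replacement) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            if (Character.isSupplementaryCodePoint(cp)) {
                sb.append(replacement);
            } else {
                sb.appendCodePoint(cp);
            }
            // Advance by 1 char for BMP code points, 2 for surrogate pairs.
            i += Character.charCount(cp);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "ok\uD83D\uDE04!"; // contains U+1F604
        System.out.println(replaceFourByteChars(text, "0000")); // ok0000!
    }
}
```

Working on code points avoids indexing past the end of a truncated byte array and lets the replacement be any string, including the empty string to drop the character.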