Problem description: Messages crawled from Sina Weibo are saved to a MySQL database. The corresponding column is a varchar with the utf-8 character encoding. Some inserts succeed and others fail with the error shown in the title.
Searching online, some people say it is an encoding problem and suggest changing the encoding or column type, for example to GBK, utf-8, or BLOB, but few give a detailed answer. I only found the real cause of the error on an English-language website. Link 1 Link 2
Error reason: the error message contains the bytes 0xF0 0x9F 0x98 0x84, which form a 4-byte sequence in UTF-8 (see the UTF-8 encoding specification). Ordinary Chinese characters generally take no more than 3 bytes, so why do 4 bytes appear? These bytes are an emoji from a smartphone input method. Why does the insert fail? Because utf8 in MySQL is not true UTF-8: it can only store sequences of up to 3 bytes per character. To store 4-byte characters you have to use utf8mb4. Before using utf8mb4, first make sure the MySQL version is not lower than 5.5.3.
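To make the byte counts concrete, here is a minimal sketch that prints the UTF-8 bytes of one Chinese character and of the emoji from the error message. The class and method names (Utf8Length, utf8Hex) are made up for illustration, and the choice of U+4E2D as the sample Chinese character is an assumption.

```java
import java.nio.charset.StandardCharsets;

class Utf8Length {
    // Returns the UTF-8 bytes of s as space-separated hex values.
    static String utf8Hex(String s) {
        StringBuilder sb = new StringBuilder();
        for (byte b : s.getBytes(StandardCharsets.UTF_8)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask with 0xFF so negative bytes print as two hex digits.
            sb.append(String.format("0x%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String chinese = "\u4E2D";      // U+4E2D, a common Chinese character
        String emoji = "\uD83D\uDE04";  // U+1F604, written as a surrogate pair
        System.out.println(utf8Hex(chinese)); // 0xE4 0xB8 0xAD (3 bytes)
        System.out.println(utf8Hex(emoji));   // 0xF0 0x9F 0x98 0x84 (4 bytes)
    }
}
```

The second line of output matches the bytes in the error message, which is why only rows containing emoji fail to insert.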
Solution:
1) Use the utf8mb4 character set
To use this strategy, if the MySQL version is lower than 5.5.3, upgrade first, then change the affected columns to utf8mb4. If you connect to the database through Connector/J, the connection also has to use utf8mb4: set character_set_server=utf8mb4 in the configuration (character_set_server is a MySQL server variable, so it belongs in the server configuration rather than in the JDBC URL).
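As a rough sketch of the column change, the helper below builds and runs an ALTER TABLE ... CONVERT TO CHARACTER SET statement over JDBC. The class name Utf8mb4Migration, the table name weibo_message used in the note below, and the utf8mb4_unicode_ci collation are assumptions for illustration and do not come from the original setup.

```java
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

class Utf8mb4Migration {
    // Builds the statement that converts all character columns of a table
    // to utf8mb4. The table name is concatenated into the SQL, so it must
    // come from trusted code, never from user input.
    // Assumption: utf8mb4_unicode_ci is an acceptable collation.
    static String convertTableSql(String table) {
        return "ALTER TABLE `" + table + "` CONVERT TO CHARACTER SET utf8mb4"
                + " COLLATE utf8mb4_unicode_ci";
    }

    // Runs the conversion on an already opened connection.
    static void convertTable(Connection conn, String table) throws SQLException {
        try (Statement st = conn.createStatement()) {
            st.executeUpdate(convertTableSql(table));
        }
    }
}
```

Calling Utf8mb4Migration.convertTable(conn, "weibo_message") would rewrite the whole table, so on a large table it is worth trying on a copy first.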
2) Use a custom filter rule that removes the four-byte UTF-8 characters that appear in the text or converts them to a custom replacement.
The following test example replaces each 4-byte character with 0000, that is, each of its four bytes becomes the ASCII digit 0 (0x30).
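for (int i = 0; i < b_text.length; i++)
{
    // 11110xxx marks the lead byte of a 4-byte UTF-8 sequence
    if ((b_text[i] & 0xF8) == 0xF0) {
        // overwrite the four bytes with ASCII '0' (0x30), staying in bounds
        for (int j = 0; j < 4 && i + j < b_text.length; j++) {
            b_text[i + j] = 0x30;
        }
        i += 3;
    }
}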
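The same filter can also be written at the String level instead of on raw bytes. This is only a sketch of an alternative, not the code used in the original test: every character that needs 4 bytes in UTF-8 is a supplementary code point (above U+FFFF), so it can be detected with Character.isSupplementaryCodePoint. The class and method names (FourByteFilter, replaceSupplementary) are made up for illustration, and the sketch needs Java 8 or later.

```java
class FourByteFilter {
    // Replaces every code point above U+FFFF (the ones that take 4 bytes
    // in UTF-8) with the given replacement; pass "" to drop them.
    static String replaceSupplementary(String text, String replacement) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (Character.isSupplementaryCodePoint(cp)) {
                sb.append(replacement);
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "ok\uD83D\uDE04!"; // "ok", U+1F604, "!"
        System.out.println(replaceSupplementary(text, ""));     // ok!
        System.out.println(replaceSupplementary(text, "0000")); // ok0000!
    }
}
```

Working on code points avoids having to re-decode the byte array afterwards, and the replacement no longer has to be exactly four bytes long.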
Unicode emoji database insert problem, Incorrect string value: '\xf0\x9f\x98\x84\xf0\x9f