Recently I ran into a strange problem: when MySQL inserted a UTF-8 string, the string was truncated. Printing the executed SQL statement showed that the string was cut off at a character that could not be displayed. At first I suspected an illegal UTF-8 sequence, but hexdump showed that it was a legal UTF-8 character whose encoding was 4 bytes long. I then decoded it to its Unicode code point, and a quick Google search revealed that it was an emoji (emoji are a special group of Unicode characters that are common on iOS and Android phones).
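The inspection described above (look at the raw bytes, confirm they are legal UTF-8, then decode to a code point) can be sketched in Python. This is only an illustration: the helper name describe_char and the sample bytes F0 9F 98 80 are assumptions, not the character from the original incident.

```python
import unicodedata


def describe_char(raw: bytes) -> dict:
    """Decode a single UTF-8 encoded character and report what it is.

    describe_char is a hypothetical helper written for this example.
    """
    # decode() raises UnicodeDecodeError if the bytes are not legal UTF-8,
    # which is the first thing the article wanted to rule out.
    ch = raw.decode("utf-8")
    if len(ch) != 1:
        raise ValueError("expected the bytes of exactly one character")
    return {
        "hex": " ".join(f"{b:02x}" for b in raw),    # hexdump-style view
        "byte_length": len(raw),                     # 4 for the emoji case
        "code_point": f"U+{ord(ch):04X}",            # the decoded Unicode value
        "name": unicodedata.name(ch, "<unnamed>"),   # Unicode character name
        "outside_bmp": ord(ch) > 0xFFFF,             # True for 4-byte characters
    }


# Sample input (an assumption): the four bytes of one emoji.
info = describe_char(b"\xf0\x9f\x98\x80")
for key, value in info.items():
    print(f"{key}: {value}")
```

A byte length of 4 together with a code point above U+FFFF is the pattern that matches the truncated character in this story.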
Why was the emoji truncated? As I understood it, MySQL's utf8 should accept any valid UTF-8 string. Then I vaguely recalled that earlier versions of MySQL only supported UTF-8 sequences of up to 3 bytes. A check of the documentation confirmed it:
- utf8, a UTF-8 encoding of the Unicode character set using one to three bytes per character.
The latest version of MySQL behaves the same way as the old versions. The largest Unicode code point that three-byte UTF-8 can encode is 0xFFFF, which is the end of the Basic Multilingual Plane (BMP). In other words, Unicode characters outside the BMP cannot be stored in MySQL's utf8 character set. This includes the emoji above, many rare Chinese characters, and any newly added Unicode characters that are assigned outside the BMP.
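As a rough sketch of that rule, the following Python checks whether a string stays inside the BMP and mimics the truncation described earlier. The function names are hypothetical, and the truncation is only a simulation of the symptom reported in this article; how a real server reacts may depend on its version and SQL mode.

```python
BMP_MAX = 0xFFFF  # largest code point a three-byte UTF-8 sequence can encode


def first_non_bmp_index(text: str):
    """Return the index of the first character outside the BMP, or None."""
    for index, ch in enumerate(text):
        if ord(ch) > BMP_MAX:
            return index
    return None


def fits_mysql_utf8(text: str) -> bool:
    """True if every character fits in a three-byte-per-character charset."""
    return first_non_bmp_index(text) is None


def simulate_truncation(text: str) -> str:
    """Cut the string at the first non-BMP character (simulation only)."""
    index = first_non_bmp_index(text)
    return text if index is None else text[:index]


sample = "hi \U0001F600 there"          # contains one emoji outside the BMP
print(fits_mysql_utf8("caf\u00e9"))     # accented Latin letter, inside the BMP
print(fits_mysql_utf8(sample))
print(repr(simulate_truncation(sample)))
```

Running such a check on the application side before an insert is one way to find the offending character without reading a hexdump.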
MySQL's utf8 is not the same thing as UTF-8.
UTF-8 (8-bit Unicode Transformation Format) is an encoding format for Unicode. The original UTF-8 design used one to six bytes and could encode up to 31 bits. The current UTF-8 specification uses only one to four bytes and can encode up to 21 bits, which is enough to represent all 17 Unicode planes.
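The bit counts above follow from the UTF-8 byte layout: a one-byte sequence carries 7 bits, and in a longer sequence the lead byte carries 7 minus n bits while each continuation byte carries 6. A small Python sketch of that arithmetic (the function names are made up for this example):

```python
def payload_bits(n_bytes: int) -> int:
    """Number of code point bits an n-byte UTF-8 sequence can carry."""
    if not 1 <= n_bytes <= 6:
        raise ValueError("UTF-8 sequences were never longer than 6 bytes")
    if n_bytes == 1:
        return 7  # 0xxxxxxx
    # Lead byte: n one-bits, a zero bit, then (7 - n) payload bits.
    # Each of the (n - 1) continuation bytes (10xxxxxx) carries 6 bits.
    return (7 - n_bytes) + 6 * (n_bytes - 1)


def utf8_length(code_point: int) -> int:
    """Bytes needed for a code point under the current four-byte limit."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("not a Unicode code point")
    for n in range(1, 5):
        if code_point < (1 << payload_bits(n)):
            return n
    raise AssertionError("unreachable: 0x10FFFF fits in 21 bits")


for n in range(1, 7):
    print(n, "byte(s):", payload_bits(n), "bits")

# Cross-check against Python's own encoder for a few code points.
for cp in (0x41, 0xE9, 0x4E2D, 0x1F600):
    assert utf8_length(cp) == len(chr(cp).encode("utf-8"))
```

Three bytes give 16 bits, which is exactly the range 0x0000 to 0xFFFF, and 17 planes of 65,536 code points end at 0x10FFFF, which fits within 21 bits.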
utf8 is a character set in MySQL that supports only UTF-8 characters of up to three bytes, that is, the Basic Multilingual Plane in Unicode.
Why does utf8 in MySQL support only UTF-8 characters of at most three bytes? My guess is that when MySQL development began, Unicode had no supplementary planes yet; at that time the Unicode committee was still dreaming that "65535 characters are enough for the whole world". String lengths in MySQL are counted in characters rather than bytes, and a CHAR column must reserve enough space for the longest possible string. With the utf8 character set, the space reserved is the maximum byte length of a utf8 character multiplied by the declared length. Since the maximum length of a utf8 character is 3 bytes, MySQL reserves 300 bytes for CHAR(100), for example. As for why later versions did not extend utf8 to 4-byte UTF-8 characters, I think one reason is backward compatibility, and another is that characters outside the Basic Multilingual Plane are rarely used.
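The CHAR arithmetic is simple enough to write down. The sketch below only models the rule as stated in this article (declared length times the maximum bytes per character); the table of widths and the function name are assumptions for illustration, and a real storage engine may lay rows out differently.

```python
# Maximum bytes per character for the two character sets discussed here.
MAX_BYTES_PER_CHAR = {
    "utf8": 3,     # MySQL's three-byte subset of UTF-8
    "utf8mb4": 4,  # full four-byte UTF-8
}


def char_reserved_bytes(declared_length: int, charset: str) -> int:
    """Bytes reserved for CHAR(declared_length) under the rule above."""
    if declared_length < 0:
        raise ValueError("declared length must be non-negative")
    return declared_length * MAX_BYTES_PER_CHAR[charset]


for charset in ("utf8", "utf8mb4"):
    print(f"CHAR(100) with {charset}: {char_reserved_bytes(100, charset)} bytes")
```

Under this rule a fourth byte per character costs an extra 100 bytes for every CHAR(100) column in every row, whether or not any 4-byte character is ever stored.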
To store 4-byte UTF-8 characters in MySQL, you need the utf8mb4 character set, which is only supported in MySQL 5.5 and later. I think utf8mb4 should always be used instead of utf8. For CHAR columns, utf8mb4 consumes more space, so, following MySQL's official recommendation, use VARCHAR instead of CHAR.
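As a hedged sketch of what switching to utf8mb4 might involve, the Python below only builds the SQL text; it does not connect to a server. The table name messages, the default collation utf8mb4_unicode_ci, and the helper name are assumptions chosen for illustration. Try such statements on a copy of the data first, since converting a table rewrites it.

```python
import re

# Accept only simple identifiers so the generated SQL cannot be injected into.
_IDENTIFIER = re.compile(r"^[A-Za-z0-9_]+$")

# Statement a client can send so the connection itself also uses utf8mb4.
SET_CONNECTION_CHARSET = "SET NAMES utf8mb4;"


def convert_table_sql(table: str, collation: str = "utf8mb4_unicode_ci") -> str:
    """Build an ALTER TABLE statement converting a table to utf8mb4.

    The table and collation names are supplied by the caller; the default
    collation is an assumption, not a recommendation from the article.
    """
    if not _IDENTIFIER.match(table):
        raise ValueError(f"unexpected table name: {table!r}")
    if not _IDENTIFIER.match(collation):
        raise ValueError(f"unexpected collation name: {collation!r}")
    return (
        f"ALTER TABLE `{table}` "
        f"CONVERT TO CHARACTER SET utf8mb4 COLLATE {collation};"
    )


print(SET_CONNECTION_CHARSET)
print(convert_table_sql("messages"))  # "messages" is a hypothetical table
```

Both the column character set and the connection character set matter here: a utf8mb4 column does not help if the client connection still declares utf8.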
Original article address: utf8 in Mysql. Thank you for sharing it.