I recently ran into a bug: I was trying to save a UTF-8 string into a "utf8"-encoded MariaDB database through Rails, and got a bizarre error:
    Incorrect string value: '...' for column 'summary' at row 1
My client was UTF-8, the server was UTF-8, the database was UTF-8, and even the string being saved, "< ...", was perfectly valid UTF-8.
The crux of the problem is that MySQL's "utf8" is not actually UTF-8.
"utf8" supports at most three bytes per character, while real UTF-8 allows up to four bytes per character.
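You can reproduce the failure directly in SQL. The following is a minimal sketch (the table name is made up; the column name matches the error above), assuming strict SQL mode:

    CREATE TABLE articles (
        summary TEXT
    ) CHARACTER SET utf8;

    -- A 4-byte character such as an emoji is rejected by MySQL's "utf8":
    INSERT INTO articles (summary) VALUES ('😃');
    -- ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x98\x83' for column 'summary' at row 1

    -- Converted to utf8mb4, the same insert succeeds:
    ALTER TABLE articles CONVERT TO CHARACTER SET utf8mb4;
    INSERT INTO articles (summary) VALUES ('😃');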
MySQL never fixed this bug. Instead, in 2010 they released a character set called "utf8mb4" that sidesteps the problem.
Of course, they didn't publicize the new character set (perhaps because the bug embarrassed them), so developers on the web are still being advised to use "utf8", and that advice is wrong.
A brief summary: all MySQL and MariaDB users currently using "utf8" should switch to "utf8mb4" and never use "utf8" again.
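For new databases and tables, that simply means declaring "utf8mb4" up front. A minimal sketch; the names and the collation choice are only examples:

    CREATE DATABASE myapp
        CHARACTER SET utf8mb4
        COLLATE utf8mb4_unicode_ci;

    CREATE TABLE articles (
        summary TEXT
    ) CHARACTER SET utf8mb4
      COLLATE utf8mb4_unicode_ci;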
What is encoding? What is UTF-8?
As we all know, computers use 0s and 1s to store text. For example, the character "C" is stored as "01000011". To display this character, the computer goes through two steps:
1. The computer reads "01000011" and gets the number 67, because 67 is encoded as "01000011".
2. The computer looks up 67 in the Unicode character set and finds "C".
The same thing happens in reverse when I type "C":
1. My computer maps "C" to 67 in the Unicode character set.
2. My computer encodes 67 as "01000011" and sends it to the web server.
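Both directions can be seen from a MySQL or MariaDB client using standard string functions; a small sketch to make the mapping concrete:

    SELECT ORD('C');                          -- 67: the number behind the character
    SELECT HEX(CONVERT('C' USING utf8mb4));   -- '43', i.e. 01000011 in binary
    SELECT CHAR(67 USING utf8mb4);            -- 'C': turning the number back into a character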
Almost all web applications use the Unicode character set, because there is no reason to use any other.
The Unicode character set contains more than a million characters. The simplest encoding is UTF-32, which uses 32 bits per character. That is the easiest approach, because computers have always treated 32 bits as a number, and processing numbers is what computers do best. The problem is that it wastes space.
UTF-8 saves space: in UTF-8, the character "C" takes only 8 bits, while some rarely used characters, such as most emoji, take 32 bits. Other characters take 16 or 24 bits. An article like this one, encoded in UTF-8, takes roughly one quarter of the space it would take in UTF-32.
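You can observe the variable width from a MySQL client, assuming the connection character set is utf8mb4: LENGTH() counts bytes, CHAR_LENGTH() counts characters.

    SELECT LENGTH('C'),  CHAR_LENGTH('C');    -- 1 byte,  1 character
    SELECT LENGTH('é'),  CHAR_LENGTH('é');    -- 2 bytes, 1 character
    SELECT LENGTH('€'),  CHAR_LENGTH('€');    -- 3 bytes, 1 character
    SELECT LENGTH('😃'), CHAR_LENGTH('😃');   -- 4 bytes, 1 character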
MySQL's "utf8" character set is incompatible with other programs; what it calls "UTF-8" may really be a pile of ...
A Brief History of MySQL
Why did the MySQL developers make "utf8" invalid? We may be able to find the answer in the commit logs.
MySQL has supported UTF-8 since version 4.1, released in 2003; the UTF-8 standard used today (RFC 3629) only appeared afterwards.
The older UTF-8 standard (RFC 2279) allowed up to 6 bytes per character, and on March 28, 2002, MySQL developers implemented RFC 2279 in the first MySQL 4.1 preview release.
Then, in September, they made a small adjustment to the MySQL source code: "UTF8 now supports sequences of at most 3 bytes."
Who committed that change, and why? Nobody knows. When MySQL migrated to Git (it originally used BitKeeper), many author names in the MySQL code base were lost, and the September 2003 mailing list offers no clues that explain the change.
But I can try to guess.
In 2002, MySQL offered its users a deal: if you could guarantee that every row in a table used the same number of bytes, MySQL could give you a big performance boost. To get it, you had to declare your text columns as "CHAR"; a CHAR column always holds the same number of characters. If the inserted text is shorter than the declared length, MySQL pads it with trailing spaces; if it is longer, the excess is truncated.
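As a rough illustration of that trade-off (the table and column names here are made up, and whether an over-long value errors or is truncated depends on the SQL mode):

    CREATE TABLE fixed_width_demo (
        code CHAR(3)
    );

    INSERT INTO fixed_width_demo VALUES ('ab');    -- stored padded with a trailing space
    INSERT INTO fixed_width_demo VALUES ('abcd');  -- rejected in strict mode, otherwise truncated to 'abc'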
When the MySQL developers first tried UTF-8, they reserved 6 bytes per character: a CHAR(1) column used 6 bytes, a CHAR(2) column used 12 bytes, and so on.
It should be said that their initial behavior was correct; unfortunately that version was never released. But the behavior was documented, the documentation circulated widely, and everyone who knew UTF-8 agreed with what it said.
However, the MySQL developers were apparently worried about users doing both of these things:
Defining columns as CHAR (CHAR is an antique today, but back then using CHAR made MySQL faster; that stopped being true in 2005).
Setting the encoding of those CHAR columns to "utf8".
My guess is that the MySQL developers wanted to help users who wanted both space savings and speed, but they botched the "utf8" character set in the process.
The result is that nobody wins. Users who wanted both space and speed got CHAR columns in "utf8" that actually used more space and were slower than expected. And users who wanted correctness got an "utf8" encoding that cannot store characters such as emoji at all.
Once this broken character set was released, MySQL could not fix it, because doing so would have required every user to rebuild their databases. Eventually, in 2010, MySQL released "utf8mb4" to support real UTF-8.
If you are using MySQL or MariaDB, do not use the "utf8" encoding; use "utf8mb4" instead. Here is a guide for converting the character encoding of an existing database from "utf8" to "utf8mb4":
https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4
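In broad strokes, the conversion that guide describes boils down to statements like the following (names are placeholders; back up first, and note that indexes on long VARCHAR columns may need attention because utf8mb4 uses more bytes per character):

    ALTER DATABASE myapp
        CHARACTER SET = utf8mb4
        COLLATE = utf8mb4_unicode_ci;

    ALTER TABLE articles
        CONVERT TO CHARACTER SET utf8mb4
        COLLATE utf8mb4_unicode_ci;

    -- The client and server connection character sets should also be utf8mb4,
    -- e.g. default-character-set = utf8mb4 under [client] and
    -- character-set-server = utf8mb4 under [mysqld] in my.cnf.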
Reproduced from InfoQ. Original title: Never use "utf8" in MySQL.