Turn from: (translate) friends with MySQL please do not use "UTF8", please use "UTF8MB4"
Friends with MySQL Please do not use "UTF8", please use "UTF8MB4"
Today I'm trying to insert a UTF-8 encoded string into a MARIADB database that uses "UTF8" encoding, rails throws an odd exception:
Incorrect string value: ' \xf0\x9f\x98\x83 < ... ' for column ' summary ' at row 1
Everything is utf-8:utf-8. Client,utf-8 server, UTF-8 encoded database, using UTF-8 character set. “?? <, ... "is a valid UTF-8 string.
But the point is: MySQL database "UTF8" is not the real concept of UTF-8.
The "UTF8" encoding in MySQL only supports a maximum of 3 bytes per character. The real people who are using the UTF-8 encoding are supposed to be able to support 4 bytes per character.
The MySQL developer did not fix the bug. They added a workaround in 2010: A new character set "utf8mb4"
Of course, they were not released (probably because the bug was a little awkward). Now many guidelines recommend users to use "UTF8" is actually wrong.
To put it simply:
The "Utf8mb4" in MySQL is the true meaning of "UTF-8".
MySQL's "UTF8" is a "special character encoding". This encoding is not preserved for many Unicode characters.
I strongly recommend that MySQL and mariadb users use "UTF8MB4" instead of "UTF8".
What is encoding? What is UTF-8?
Joel on software has one of my favorite introductions, and I'm streamlining the description as follows:
The computer uses 0 and 1 to store text. For example, the first paragraph of the first character is stored as "01000011" for "C", the computer is selected by "C" in the following two steps:
The computer reads "01000011" and calculates that this is the number 67.
The computer confirms the "C" represented by 67 by looking up the Unicode character set.
The same thing happened when I typed in C.
The computer maps "C" to 67 through the Unicode character set.
The computer sends the 67 encoding to "01000011" to the Web server.
Almost all programs and Internet applications use the Unicode character set.
There are more than 1 million characters in the Unicode character set ("C" and "??" are two different characters.) )。 UTF-32 is the simplest encoding method that uses 32 bits to represent each character. This coding is simple, but it is not practical, obviously wasted too much space.
UTF-8 is more space-saving than UTF-32. In UTF-8, characters such as "C" Occupy 8bits, "??" This takes up the amount of bits. Other characters occupy 16 or more bits. such as this article with UTF-8 storage than with the UTF-32 to save about 4 times times the space. A smaller footprint also means that the load will be 4 times times faster.
The "UTF8" character set in MySQL is not the same as other application behaviors. For example, there is no way to say "??".
A little bit about the history of MySQL
Why MySQL developers have developed a strange "utf8". We can speculate by submitting a log.
MySQL supports UTF-8 starting from version 4.1. That was 2003 years earlier than today's UTF-8 RFC 3629 standard.
Prior to the UTF-8 Standard, RFC 2279 stated that 6 bytes represent one character. The MySQL developer coded in 2002.3.28 to implement RFC 2279. and released the Pre-pre-release MySQL 4.1
Then in September there was a mysterious byte adjustment. "UTF8 now works with up To3 byte sequences only."
Who submitted the update? Why? I can't answer that. MySQL has lost the old author information after moving to git (MySQL used the same bitkeeper as the Linux kernel)
But I can probably guess why.
Back in 2002, if the user can guarantee that each row in the table has the same number of bytes, MySQL can increase the user's speed. To get this boost, the user needs to define the column "CHAR" that holds the text. A "CHAR" column always has the same number of characters. If fewer characters are deposited, they will be blank at the end. If too much data is stored, the extra characters will be discarded.
When the MySQL developers first tried to implement UTF-8 with 6 bytes per character, they realized that the column of char (1) would take 6 bytes, char (2) would take 12 bytes, and so on.
It is clear that this is not being used in the right way, and any developer who understands UTF-8 will agree.
My guess is that MySQL developers violate the "UTF8" code to help those 1 who try to optimize space and speed, 2) try to optimize the space and speed of those who fail.
This is a change that no one has benefited from. Those who want faster performance and smaller space are still getting bigger and slower implementations than they used to be, while those who want the right "UTF8" get a "??" Implementations that are not stored.
After MySQL released the wrong version, it did not fix it: Because then many users would be forced to rebuild their database. MySQL eventually updated a "utf8mb4"-named UTF-8 implementation in 2010.
Why it's so frustrating
Why are you so fucked?
I've had a really fucked-up this week. I met a bug that was hard to find because I was fooled by the name "UTF8". And I am not a case, I found that almost every recommended use of "UTF8" article is wrong.
The name "UTF8" is still wrong in MySQL. This is a proprietary implementation. This creates new problems and does not solve the problems that he should solve.
If you use MySQL or MariaDB, do not use "UTF8", should always use "utf8mb4", otherwise you will have a headache someday.
Friends with MySQL Please do not use "UTF8", please use "UTF8MB4"