Never use utf8 in MySQL

Last Update:2020-10-29 Source: Internet

Author: User

Keywords mysql mysql database mysql tutorial

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Recently I encountered a bug. I tried to save a UTF-8 string in MariaDB encoded with "utf8" through Rails, and then I got a strange error:

Incorrect string value: ‘\xF0\x9F\x98\x83 <…’ for column ‘summary’ at row 1

I use a UTF-8-encoded client, the server is also UTF-8-encoded, and so is the database. Even the string "<..." to be saved is also a legal UTF-8.

The crux of the problem is that MySQL's "utf8" is not actually UTF-8.

"Utf8" only supports up to three bytes per character, while the real UTF-8 is up to four bytes per character. MySQL has not fixed this bug. They released a character set called "utf8mb4" in 2010 to bypass this problem. Of course, they did not advertise the new character set widely (perhaps because this bug made them feel embarrassed), so that developers are still advised to use "utf8" on the Internet, but these suggestions are all wrong.

A brief summary is as follows:

1. The "utf8mb4" of MySQL is the real "UTF-8".
2. MySQL's "utf8" is a kind of "exclusive encoding", it can encode not many Unicode characters.

I want to clarify here: All MySQL and MariaDB users who are using "utf8" should switch to "utf8mb4" and never use "utf8" again.

What is encoding? What is UTF-8?
We all know that computers use 0 and 1 to store text. For example, the character "C" is saved as "01000011", then the computer needs to go through two steps when displaying this character:

1. The computer reads "01000011" and gets the number 67, because 67 is encoded as "01000011".
2. The computer searches for 67 in the Unicode character set and finds "C".

same:

1. My computer maps "C" to 67 in the Unicode character set.
2. My computer encodes 67 into "01000011" and sends it to the Web server.

Almost all web applications use the Unicode character set, because there is no reason to use other character sets.

The Unicode character set contains millions of characters. The simplest encoding is UTF-32, which uses 32 bits for each character. This is the easiest way to do this, because computers have always regarded 32 bits as numbers, and computers are best at processing numbers. But the problem is that this is a waste of space.

UTF-8 can save space. In UTF-8, the character "C" only needs 8 bits, and some uncommon characters, such as "", require 32 bits. Other characters may use 16 bits or 24 bits. For an article like this one, if UTF-8 is used, the space occupied is only about a quarter of that of UTF-32.

MySQL's "utf8" character set is incompatible with other programs. Its so-called character set may be really a mess...

A brief history of MySQL
Why did MySQL developers invalidate "utf8"? We may be able to find the answer in the commit log.

MySQL started to support UTF-8 from version 4.1, which was 2003, and the UTF-8 standard (RFC 3629) used today was only later.
The old version of the UTF-8 standard (RFC 2279) supports up to 6 bytes per character. On March 28, 2002, MySQL developers used RFC 2279 in the first MySQL 4.1 preview release.

In September of the same year, they made an adjustment to the MySQL source code: "UTF8 now only supports sequences of up to 3 bytes." Who submitted these codes? Why would he do this? This problem is unknown. After migrating to Git (MySQL first used BitKeeper), the names of many committers in the MySQL code base were lost. There was no clue to explain this change in the September 2003 mailing list.

But I can try to guess.

In 2002, MySQL made a decision: If users can ensure that each row of the data table uses the same number of bytes, then MySQL can achieve a big improvement in performance. For this, the user needs to define the text column as "CHAR", and each "CHAR" column always has the same number of characters. If the inserted characters are less than the defined number, MySQL will fill in spaces at the end. If the inserted characters exceed the defined number, the excess part will be truncated.

When MySQL developers first tried UTF-8, they used 6 bytes per character, CHAR(1) used 6 bytes, CHAR(2) used 12 bytes, and so on.

It should be said that their initial behavior is correct, but unfortunately this version has not been released. But it is written like this on the document, and it is widely circulated, and everyone who knows UTF-8 agrees with what is written in the document. But obviously, MySQL developers or vendors are worried that users will do these two things:

1 Use CHAR to define the column (in the view of now, CHAR is already old fashioned, but at that time, using CHAR in MySQL will be faster, but this has not been the case since 2005).
2 Set the encoding of the CHAR column to "utf8".

My guess is that the MySQL developers originally wanted to help users who wanted a win-win situation in terms of space and speed, but they messed up the "utf8" encoding.

So the result is no winner. Those users who want to win both in terms of space and speed, when they use the "utf8" CHAR column, they actually use more space than expected and the speed is slower than expected. But users who want correctness cannot save characters like "" when they use "utf8" encoding.

After this illegal character set was released, MySQL could not fix it, because it would require all users to rebuild their databases. Finally, MySQL re-released "utf8mb4" in 2010 to support true UTF-8.

Why does this matter make people so crazy
Because of this problem, I went crazy for a whole week. I was fooled by "utf8" and it took a lot of time to find this bug. But I am definitely not the only one. Almost all articles on the Internet regard "utf8" as the real UTF-8. "Utf8" can only be regarded as a proprietary character set, it has brought us new problems, but has not been solved.

Write at the end
If you are using MySQL or MariaDB, do not use "utf8" encoding, use "utf8mb4" instead. Here is a guide for converting the character encoding of an existing database from "utf8" to "utf8mb4":

https://mathiasbynens.be/notes/mysql-utf8mb4#utf8-to-utf8mb4

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More