About using UTF-8 fields in MySQL

https://www.adayinthelifeof.nl/2010/12/04/about-using-utf-8-fields-in-mysql/

I sometimes hear: "Make everything UTF-8 in your database, and all will be fine". This so-called advice could not be further from the truth. Indeed, using UTF-8 will take care of internationalization and code-page problems, but it comes with a price, which may be too high, especially if you have never realized it's there.

Indexing is everything, or at least: good indexing makes or breaks your database. The fact remains: the smaller your indexes, the more index records can be loaded into memory and the faster the searches will be. So using small indexes pays off. Period. But what has this got to do with UTF-8?

First off: beware of the VARCHAR

As you know, a VARCHAR field can hold a variable amount of data, for which you have to supply the maximum amount that you can store. So a VARCHAR(255) can hold 255 characters, but if you store only 5 characters, it will only use 5 characters of data. The other 250 are not lost. This is completely different from using a CHAR(255), where storing a 5-character string results in padding of 250 characters. So VARCHAR() has a big advantage over CHAR() if you have variable-sized strings. But you have to realize that this advantage is for disk storage only. It does not apply to any other data structure that MySQL uses internally or for indexes.
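To make the disk-storage difference concrete, here is a minimal Python sketch of the arithmetic. It is not MySQL code: the function names are made up for this illustration, it assumes a single-byte charset (latin1), and it counts the small length prefix that MySQL stores in front of each VARCHAR value.

```python
def char_disk_bytes(value: str, declared_length: int) -> int:
    """CHAR(n): the value is right-padded with spaces up to the declared length."""
    padded = value.ljust(declared_length)
    return len(padded.encode("latin-1"))


def varchar_disk_bytes(value: str, declared_length: int) -> int:
    """VARCHAR(n): only the actual bytes are stored, plus a length prefix."""
    data = value.encode("latin-1")
    if len(data) > declared_length:
        raise ValueError("value is longer than the declared column length")
    # Assumption: 1 prefix byte when the column can never exceed 255 bytes, else 2.
    length_prefix = 1 if declared_length <= 255 else 2
    return len(data) + length_prefix


print(char_disk_bytes("Jones", 255))     # padded to the full declared width
print(varchar_disk_bytes("Jones", 255))  # 5 data bytes + 1 length byte
```

With a 5-character name the CHAR(255) column costs 255 bytes on disk, while the VARCHAR(255) column costs 6.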

How MySQL treats VARCHARs

When MySQL needs to sort records, it must create some space for sorting that data. This space allocation is done before the actual sorting takes place. This, however, means that MySQL needs to know in advance how much memory to allocate. When we need to sort VARCHAR fields, MySQL allocates for the worst-case memory usage, which is the maximum size a VARCHAR field can take. For example: when you declare a field as VARCHAR(N), MySQL will reserve space for N characters plus an additional 1 or 2 bytes for holding the length of the string (1 when the length is 255 or less, 2 otherwise). So this busts the myth that "you can safely use VARCHAR(255) for all fields without problems".
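The worst-case reservation described above can be sketched in a few lines of Python. This is a simplified model of what the text describes, not MySQL's actual allocator; `reserved_sort_bytes` is a hypothetical helper, and the rule for 1 versus 2 length bytes is applied to the maximum byte length of the column.

```python
def reserved_sort_bytes(max_chars: int, max_bytes_per_char: int = 1) -> int:
    """Bytes reserved per value when sorting, independent of the actual content."""
    max_data_bytes = max_chars * max_bytes_per_char
    # Assumption from the text: 1 length byte up to 255 bytes, 2 otherwise.
    length_bytes = 1 if max_data_bytes <= 255 else 2
    return max_data_bytes + length_bytes


name = "Joshua"
print(len(name.encode("latin-1")))  # bytes the value really needs
print(reserved_sort_bytes(255))     # bytes reserved for a VARCHAR(255)
print(reserved_sort_bytes(50))      # bytes reserved for a VARCHAR(50)
```

In this model a 6-byte name in a VARCHAR(255) column still claims 256 bytes of sort space, while the same name in a VARCHAR(50) column claims 51.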

Characters and bytes: or the UTF-8 problem

Did you notice that I talk about "characters" and "bytes"? That's because those two terms are not the same. A byte equals 8 bits, and can hold any number ranging from 0 through 255 (or -128..127, if you have read my two's complement blog). The size of a character, however, depends on the character encoding used, and here is where the UTF-8 "problem" kicks in. Back in the days when people stored strings in a latin1 charset, every character could be stored in a single byte. Thus: a VARCHAR(N) would be N bytes (+1 for the length). But one byte is not enough to cover all the characters in the world (for instance, Arabic and Japanese characters cannot be stored in latin1). That's why UTF-8 uses multiple bytes for some characters. The "standard" characters are stored in 1 byte, so most UTF-8 strings are almost the same size as latin1 strings, but when you need different characters it can use up to 4 bytes per character. If you'd like to know more about UTF-8, there are excellent other blogs about it.
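The difference between characters and bytes is easy to see outside the database. The following Python sketch encodes four sample characters with Python's own "latin-1" and "utf-8" codecs; the helper `encoded_length` and the sample characters are chosen for this illustration only.

```python
from typing import Optional


def encoded_length(ch: str, encoding: str) -> Optional[int]:
    """Return the number of bytes `ch` needs, or None if it cannot be encoded."""
    try:
        return len(ch.encode(encoding))
    except UnicodeEncodeError:
        return None


samples = {
    "a": "a",                 # plain ASCII letter
    "e-acute": "\u00e9",      # Latin small letter e with acute
    "kanji": "\u65e5",        # a Japanese character
    "emoji": "\U0001F600",    # a character outside the Basic Multilingual Plane
}

for label, ch in samples.items():
    print(label, encoded_length(ch, "latin-1"), encoded_length(ch, "utf-8"))
```

The ASCII letter takes 1 byte in both encodings, the accented letter takes 1 byte in latin1 but 2 in UTF-8, and the last two cannot be stored in latin1 at all while needing 3 and 4 bytes in UTF-8.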

You just have to realize that MySQL's utf8 character set only uses a maximum of 3 bytes per character, which means that not all UTF-8 characters can be stored in it, but most of the possible UTF-8 characters aren't used anyway. That's why it might get confusing when you read up on UTF-8 and its 4 bytes, and then see the 3 bytes that MySQL uses. (Since MySQL 5.5.3 there is a separate utf8mb4 character set that does cover the full 4-byte range.)
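If you want to know up front whether a string would survive a 3-byte limit, you can check it before it ever reaches the database. This is a small Python sketch with a hypothetical helper name; it only looks at the UTF-8 byte length of each character and says nothing about how MySQL reacts when a character does not fit.

```python
def fits_in_three_bytes(text: str) -> bool:
    """True if every character of `text` needs at most 3 bytes in UTF-8."""
    return all(len(ch.encode("utf-8")) <= 3 for ch in text)


print(fits_in_three_bytes("Joshua"))        # ASCII only
print(fits_in_three_bytes("\u65e5\u672c"))  # two 3-byte Japanese characters
print(fits_in_three_bytes("\U0001F600"))    # a 4-byte character
```

The first two strings fit, the last one does not.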

Let's define a table with an index:

CREATE TABLE `tbl` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `first_name` varchar(100) CHARACTER SET latin1 COLLATE latin1_general_ci NOT NULL,
  `last_name` varchar(100) CHARACTER SET latin1 COLLATE latin1_general_ci NOT NULL,
  `birth_date` date NOT NULL,
  PRIMARY KEY (`id`),
  KEY `first_name` (`first_name`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;

This creates a simple table with a primary index on `id` and an index on `first_name`. You need to add at least 2 rows, otherwise the EXPLAIN won't work correctly for this example. So add some data and find out which index will be used when issuing the following query:

EXPLAIN SELECT * FROM `tbl` WHERE `first_name` LIKE 'Joshua';

Output:

+----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key        | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
|  1 | SIMPLE      | tbl   | range | first_name    | first_name | 102     | NULL |    1 | Using where |
+----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
1 row in set (0.00 sec)

The most important field is key_len. This field is 102 bytes: 100 bytes for the VARCHAR(100), since it's encoded with latin1, and the additional 2 bytes are the length bytes.

Now, let's adjust the fields to UTF-8:

ALTER TABLE `tbl` CHANGE `first_name` `first_name` VARCHAR(100) CHARACTER SET utf8 NOT NULL;
EXPLAIN SELECT * FROM `tbl` WHERE `first_name` LIKE 'Joshua';

Output:

+----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
| id | select_type | table | type  | possible_keys | key        | key_len | ref  | rows | Extra       |
+----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
|  1 | SIMPLE      | tbl   | range | first_name    | first_name | 302     | NULL |    1 | Using where |
+----+-------------+-------+-------+---------------+------------+---------+------+------+-------------+
1 row in set (0.00 sec)

Immediately you should see the impact. The key_len is 200 bytes larger, which means that we can hold fewer index records in memory, which means more disk reads, which means a slower database.
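The two key_len values follow directly from the column definition, and a back-of-the-envelope calculation shows what they do to the number of keys that fit in memory. The Python sketch below is only that: the helper names and the 1 MiB buffer size are assumptions for illustration, and it ignores row pointers and index page overhead that a real key cache also has to hold.

```python
def key_len(max_chars: int, max_bytes_per_char: int) -> int:
    """Worst-case key size: maximum data bytes plus the 2 length bytes seen in EXPLAIN."""
    return max_chars * max_bytes_per_char + 2


def keys_per_buffer(buffer_bytes: int, key_bytes: int) -> int:
    """How many whole keys of `key_bytes` fit into `buffer_bytes`."""
    return buffer_bytes // key_bytes


latin1_key = key_len(100, 1)  # VARCHAR(100) in latin1
utf8_key = key_len(100, 3)    # VARCHAR(100) in MySQL's 3-byte utf8

buffer_bytes = 1024 * 1024    # hypothetical 1 MiB of index memory

print(latin1_key, keys_per_buffer(buffer_bytes, latin1_key))
print(utf8_key, keys_per_buffer(buffer_bytes, utf8_key))
```

Under these assumptions the same megabyte holds roughly a third as many UTF-8 keys as latin1 keys, even when every stored name is plain ASCII.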

But it doesn't stop at the indexes. As said, this limitation applies to all internal buffers. All temporary sorting uses fixed-length buffers, and tables that are sorted in memory when using latin1 could, with UTF-8, just as easily be moved to a temporary table on disk because of their size. The query would then perform less efficiently because of the extra disk reads and writes.

Conclusion:

MySQL and its internal workings can be insanely complex. It's important to never assume anything and to test everything. Don't convert everything to UTF-8 just because you can. Make sure you have good reasons not to use a single-byte encoding like latin1. If you need to use the UTF-8 encoding, then make sure that you use the correct sizes. Don't make everything VARCHAR(255) just so you can store really long names. The penalties for "disrespecting" the database can and will be severe. :)
