Questions about character encoding?

Source: Internet
Author: User
A Chinese character takes 2 bytes in GB2312, while in a Unicode encoding (UTF-8) a character takes 1-3 bytes and an English letter takes 1 byte. MySQL does not count this way: the length of a varchar counts each character as one, whether it is Chinese or English. So why do people treat 1 Chinese character as equivalent to 2 English characters?

Reply content:

http://xfhnever.com/blog/2014/12/20/encodingformat/ (a short overview of the various encoding formats)

varchar(20) specifies a length in characters, not bytes.
In MySQL the character set is specified for the table, for example with CHARSET=gbk:

 CREATE TABLE `test_type` (
   `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
   PRIMARY KEY (`id`)
 ) ENGINE=InnoDB AUTO_INCREMENT=7 DEFAULT CHARSET=gbk COLLATE=gbk_bin;

Why treat 1 Chinese character as equivalent to 2 English characters?

This convention is generally based on the natural length of the string, that is, its display width: one Chinese character occupies the same width as 2 English letters or digits. Ordinary users think in terms of characters; counting bytes is a job for programs and programmers, not for users. So programs generally follow the convention that one Chinese character has the length of 2 English letters, while the storage space actually occupied depends on the encoding and the environment.

For reference, see the PHP function mb_strwidth().

The earliest encoding is ASCII, an encoding designed for English. ASCII defines the values 0-127, 128 codes in total, which is naturally enough for English with its 26 letters. The question is what to do when another Western language uses characters beyond those 26 letters. Since the meaning of the codes up to 0x7F had already been fixed, a language such as Russian, whose alphabet has nothing in common with English, needs an extension set on top of ASCII. As encodings developed, each region adopted its own system for its native language. China uses encodings in the ISO 2022 family: GB2312 for commonly used Chinese characters, GBK for a larger set, and GB18030 for the largest set.
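
The need for extension sets can be illustrated with a short Python sketch. The choice of the letter and of the two Russian single-byte encodings (KOI8-R and Windows-1251) is an assumption made for the example; the original answer does not name any specific extension set.

```python
letter = "Я"  # Cyrillic capital letter YA, not present in ASCII

try:
    letter.encode("ascii")
    fits_ascii = True
except UnicodeEncodeError:
    fits_ascii = False

# Two regional single-byte encodings place the letter at different
# byte values, both above the ASCII range (0x00-0x7F).
koi8 = letter.encode("koi8_r")
cp1251 = letter.encode("cp1251")

# The ASCII range itself is shared by both extension sets.
shared = "A".encode("koi8_r") == "A".encode("cp1251") == b"A"

print(fits_ascii, koi8.hex(), cp1251.hex(), shared)
```

Because the two encodings assign different bytes to the same letter, text written in one and read in the other comes out wrong, which is the kind of conflict regional encodings create.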

A moment's thought shows that one byte (8 bits, 256 values) can only encode a writing system like English, while Chinese has thousands of characters in common use and tens of thousands overall. So at least 2 bytes are required.
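
The arithmetic behind this can be checked directly. In the sketch below, the range U+4E00 to U+9FA5 is chosen as an example of a basic block of Chinese ideographs; it is an assumption for illustration and is not mentioned in the original answer.

```python
one_byte = 2 ** 8    # distinct values in 1 byte
two_bytes = 2 ** 16  # distinct values in 2 bytes

# Number of code points in the example ideograph range U+4E00..U+9FA5.
cjk_count = 0x9FA5 - 0x4E00 + 1

print(one_byte, two_bytes, cjk_count)  # 256 65536 20902
print(len("中".encode("gbk")))          # 2
```

About twenty thousand ideographs cannot fit into 256 values but fit comfortably into 65,536, which is why GBK uses 2 bytes per Chinese character.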

Unicode was created to unify the world's character encoding spaces so that encodings no longer conflict. Without a single agreed encoding, the same bytes mean different things to different readers: for example, GBK-encoded text interpreted as UTF-8 usually comes out garbled, and some byte sequences even happen to be valid in both, just as different characters. Unicode has several encoding forms: UTF-8, UTF-16, UTF-32, and even UTF-7. When people say "Unicode encoding" they often mean UTF-16 (this is the usage on Windows), because it encodes almost all commonly used characters at a uniform width; outside of extreme cases, we can assume a character fits in one 2-byte UTF-16 unit. In UTF-16, a Chinese character and an English letter each take 2 bytes, so they are the same length; only characters outside the Basic Multilingual Plane, such as rare ideographs and emoji, take 4 bytes. Windows uses this property internally to process text efficiently. UTF-8 is also very widely used: its advantage is that it saves storage space, since ASCII text takes 1 byte per character; its drawback is that characters vary in width, which makes decoding more complex.
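
The byte lengths described above, and the ambiguity of reading the same bytes under two encodings, can be sketched in Python. The helper name byte_lengths, the sample characters, and the raw byte pair 0xC4 0xA3 are all choices made for this illustration, not part of the original answer.

```python
def byte_lengths(ch: str) -> dict:
    """Bytes needed for one character in three Unicode encoding forms."""
    # The "-le" variants are used so that no byte order mark is counted.
    return {enc: len(ch.encode(enc)) for enc in ("utf-8", "utf-16-le", "utf-32-le")}

# "\U0001F600" is an emoji outside the Basic Multilingual Plane.
for ch in ("A", "中", "\U0001F600"):
    print(repr(ch), byte_lengths(ch))

# The same two bytes are valid in both GBK and UTF-8, but decode to
# different characters depending on which encoding the reader assumes.
raw = b"\xc4\xa3"
as_gbk = raw.decode("gbk")
as_utf8 = raw.decode("utf-8")
print(as_gbk == as_utf8)  # False
```

In UTF-16 both "A" and "中" take 2 bytes, while the emoji takes 4; in UTF-8 they take 1, 3, and 4 bytes respectively.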

MySQL handles encodings at multiple levels: the character set can be specified for the server, the database, the table, and the column, as well as for the client connection.

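In practice, it is best to use UTF-8 at every level (in MySQL this means the utf8mb4 character set, since the older utf8 character set stores at most 3 bytes per character). For in-memory processing UTF-16 is a reasonable choice, and several languages, such as Java and JavaScript, appear to have made the same choice for their strings.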
