UTF-8 and GBK encoding differences

Source: Internet
Author: User

UTF-8:

UTF-8 is a variable-length, multi-byte encoding of Unicode that uses one to four bytes per character. ASCII characters, including English letters, take 8 bits (one byte), and most commonly used Chinese characters take 24 bits (three bytes).
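
The byte counts can be checked directly. The following is a minimal sketch using Python's built-in `utf-8` codec; the sample characters are chosen only for illustration.

```python
# Byte lengths of individual characters when encoded as UTF-8.
samples = ["A", "é", "中", "😀"]
utf8_lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in utf8_lengths.items():
    print(f"{ch!r}: {n} byte(s) = {n * 8} bits")
# 'A' takes 1 byte, 'é' 2 bytes, '中' 3 bytes, '😀' 4 bytes
```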

GBK:

GBK is an extension of the national standard GB2312 and remains backward compatible with it. It is a variable-width encoding: ASCII characters such as English letters take one byte, while Chinese characters take two bytes. To distinguish a double-byte character from ASCII, the highest bit of its first byte is set to 1. GBK is a Chinese national encoding that covers the GB2312 characters plus many additional Chinese characters, but not the characters of most other scripts, so it is less versatile than UTF-8. On the other hand, Chinese text stored in a UTF-8 database takes more space than the same text stored in GBK.
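
As a rough illustration, Python's built-in `gbk` codec can show the byte widths and the high-bit rule. In this codec an ASCII letter encodes to a single byte, and only the Chinese character takes two. The sample characters are assumptions made for the example.

```python
# Encode one ASCII letter and one Chinese character with the gbk codec.
ascii_bytes = "A".encode("gbk")
hanzi_bytes = "中".encode("gbk")
print(len(ascii_bytes), len(hanzi_bytes))

# The first byte of a double-byte character has its highest bit set;
# an ASCII byte does not.
lead_has_high_bit = bool(hanzi_bytes[0] & 0x80)
ascii_has_high_bit = bool(ascii_bytes[0] & 0x80)

# Characters outside GBK's repertoire cannot be encoded at all.
try:
    "😀".encode("gbk")
    emoji_encodable = True
except UnicodeEncodeError:
    emoji_encodable = False
```

The failed emoji encoding is one concrete sense in which GBK is less versatile than UTF-8.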


Most web pages use UTF-8. A page consists largely of HTML markup, which is ASCII and takes one byte per character in UTF-8, so the extra space UTF-8 needs for Chinese text has little effect on the overall page size.
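
To get a feel for the page-size effect, the sketch below encodes a small, made-up HTML fragment in both encodings. The markup and the sample sentence are hypothetical; real pages will have different ratios of markup to text.

```python
# A hypothetical page: mostly ASCII markup around a short Chinese sentence.
html = (
    "<!DOCTYPE html><html><head>"
    "<title>Demo</title></head><body>"
    "<div class=\"content\"><p>你好，世界</p></div>"
    "</body></html>"
)

utf8_size = len(html.encode("utf-8"))
gbk_size = len(html.encode("gbk"))

# Only the five non-ASCII characters differ: 3 bytes each in UTF-8,
# 2 bytes each in GBK, so the difference is 5 bytes.
print(utf8_size, gbk_size, utf8_size - gbk_size)
```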

Where the length of varchar(30) is counted in bytes, a UTF-8 database column can store at most 10 Chinese characters, because each Chinese character occupies three bytes.
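
The arithmetic behind a byte-limited column can be sketched as follows. The helper `chars_that_fit` is a hypothetical name introduced for this example; it is not a database function and only counts encoded bytes.

```python
def chars_that_fit(text: str, encoding: str, byte_limit: int) -> int:
    """Return how many leading characters of text fit in byte_limit bytes."""
    used = 0
    for count, ch in enumerate(text):
        size = len(ch.encode(encoding))
        if used + size > byte_limit:
            return count
        used += size
    return len(text)

hanzi = "中" * 40
print(chars_that_fit(hanzi, "utf-8", 30))     # 10
print(chars_that_fit(hanzi, "gbk", 30))       # 15
print(chars_that_fit("a" * 40, "utf-8", 30))  # 30
```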



In MySQL 4.0 and earlier, varchar(20) means 20 bytes. If UTF-8 Chinese characters are stored, only 6 fit (3 bytes for each Chinese character). In MySQL 5.0 and later, varchar(20) means 20 characters: it can hold 20 digits, letters, or Chinese characters, even though each Chinese character takes 3 bytes in UTF-8. The data in a varchar column can occupy at most 65532 bytes. In short, a varchar(20) in MySQL 4 never holds more than 20 bytes, whereas in MySQL 5 the number of bytes it occupies depends on the encoding. The specific rules are as follows:
A) Storage limit
A varchar field stores its content preceded by 1 to 2 bytes that record the actual length (2 bytes if the length exceeds 255). The maximum length therefore cannot exceed 65535 bytes.
B) Encoding length limit
If the character set is GBK, each character occupies at most 2 bytes, so the maximum length cannot exceed 32766 characters;
If the character set is utf8, each character occupies at most 3 bytes, so the maximum length cannot exceed 21844 characters.
If a definition exceeds these limits, the varchar field is converted to a text type and a warning is generated. In strict SQL mode, the definition is rejected with an error instead.
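
The character limits follow from simple division. The sketch below assumes a 65535-byte ceiling, a 2-byte length prefix, and 1 byte for the NULL flag, which leaves 65532 usable bytes. The constant and function names are hypothetical and exist only for this calculation.

```python
MAX_ROW_BYTES = 65535     # overall byte ceiling
LENGTH_PREFIX_BYTES = 2   # length prefix for values longer than 255 bytes
NULL_FLAG_BYTES = 1       # assumed: the column is nullable

def max_varchar_chars(bytes_per_char: int) -> int:
    """Largest character count whose worst-case size fits the usable bytes."""
    usable = MAX_ROW_BYTES - LENGTH_PREFIX_BYTES - NULL_FLAG_BYTES
    return usable // bytes_per_char

print(max_varchar_chars(2))  # GBK, up to 2 bytes per character: 32766
print(max_varchar_chars(3))  # utf8, up to 3 bytes per character: 21844
```

Because the 65532 usable bytes are divided rather than the full 65535, the 3-byte result is 21844.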


For the C language

The size of int depends on the compiler. The ANSI standard only requires int to be at least 16 bits (2 bytes) wide. In TC (Turbo C), an int occupies 2 bytes, whereas in VC (Visual C++) it occupies 4 bytes.
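
One way to see the int size on a given machine without writing C is Python's `ctypes` module. This is only a sketch: it reports the size used by the C compiler and platform that built the running Python interpreter, not the size under any other compiler.

```python
import ctypes

# Size in bytes of the C int type on this platform's C implementation.
int_bytes = ctypes.sizeof(ctypes.c_int)
print(f"int occupies {int_bytes} bytes ({int_bytes * 8} bits) here")
```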

