Why is GBK still used when UTF-8 is the international standard?

Source: Internet
Author: User
Tags: comparison table

Preface:

I was always puzzled by encoding. Wouldn't it be better to keep data readable at a glance? Why encode at all? Isn't it pointless, a waste of productivity?

The more I learned, the more ignorant I felt. There are thousands of languages in the world. How can they all be represented on the Internet? Faced with Chinese, ASCII is pale and powerless. More bytes are needed to represent all these kinds of text, and that calls for a specification;

At the same time, encoding and decoding also help data travel safely, and the technique has gradually spread to many areas, such as base64 in email. This article mainly discusses one question: with UTF-8 as the international encoding, why do many Chinese websites stick to GBK?

Body:

UTF-8 is an international encoding. It is highly universal, so users abroad can browse Chinese websites without trouble;

GBK is a domestic encoding that extends GB2312, which targets simplified Chinese characters. It is less universal than UTF-8, but it takes less space for Chinese text;

Advice like the following circulates on the Internet: if your website's users are mainly in China, use the GBK version, because compared with the UTF-8 version it saves space and is stable.

In fact, the GBK version and the UTF-8 version offer the same functionality. Only the encoding differs.

GBK is a variable-width encoding of one or two bytes: English (ASCII) characters take a single byte, and Chinese characters take two bytes. The highest bit of the first byte of a Chinese character is set to 1, which is how Chinese characters are told apart from ASCII.
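The high-bit rule can be seen by looking at the raw bytes. The following is a minimal sketch using Python's built-in gbk codec; the sample string and the helper name split_gbk are illustrative assumptions, and the four-byte forms of the later GB18030 standard are ignored.

```python
def split_gbk(data: bytes) -> list[bytes]:
    """Split GBK bytes into per-character chunks using the high bit of the lead byte."""
    chunks = []
    i = 0
    while i < len(data):
        # High bit clear: a single-byte ASCII character.
        # High bit set: the lead byte of a two-byte character.
        width = 2 if data[i] & 0x80 else 1
        chunks.append(data[i:i + width])
        i += width
    return chunks


encoded = "A中".encode("gbk")
print(encoded.hex())        # 41d6d0
print(split_gbk(encoded))   # [b'A', b'\xd6\xd0']
```

The letter stays one byte (0x41, high bit 0), while the Chinese character becomes two bytes whose first byte has the high bit set.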

UTF-8 is also variable-length. It is a multi-byte encoding designed for international text: it uses 8 bits (one byte) for English characters and 24 bits (three bytes) for most Chinese characters. Websites that are mostly English lose nothing by using UTF-8, while Chinese-heavy text is about half again as large as in GBK.
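The space trade-off is easy to measure. This sketch compares encoded sizes for a few sample strings; the samples and the function name byte_sizes are made up for illustration.

```python
samples = {
    "english": "hello",
    "chinese": "你好世界",
    "mixed": "hello 世界",
}


def byte_sizes(text: str) -> dict[str, int]:
    """Return the encoded length in bytes of text under GBK and UTF-8."""
    return {codec: len(text.encode(codec)) for codec in ("gbk", "utf-8")}


for label, text in samples.items():
    print(label, byte_sizes(text))
# english {'gbk': 5, 'utf-8': 5}
# chinese {'gbk': 8, 'utf-8': 12}
# mixed {'gbk': 10, 'utf-8': 12}
```

English text costs the same in both encodings; only the Chinese characters make the UTF-8 version larger.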

GBK covers the Chinese characters in everyday use (more than 20,000 of them), while UTF-8 can encode every character in Unicode, which covers the writing systems of the whole world.

GBK is an extension of the national standard GB2312 and remains compatible with it. UTF-8 encoded text can be displayed in any country by a browser that supports the UTF-8 character set.
For example, a UTF-8 encoded page shows its Chinese characters correctly in a foreign user's Internet Explorer, without that user having to download the Chinese language support package. For websites that are mostly English, UTF-8 costs nothing extra, because English letters take one byte in both encodings; the difference is in Chinese text, where each character takes two bytes in GBK and three in UTF-8.

Simply put, UTF-8 covers the scripts of the whole world (think of it as one very large font), so Japanese, Korean and others can all be used. GBK and GB2312 cover essentially Chinese characters, and differ in how many they contain. If convenience matters most, UTF-8 is the better choice.

The principle in brief
In memory, every string is a sequence of Unicode characters, whether it holds English letters, Chinese characters, or anything else.
Unicode strings can be converted into bytes. The conversion is done by an encoding such as GBK, UTF-8, or GB2312. When the bytes are converted back into a string, the same encoding must be used.
For example, if the client converts a Unicode string to a binary stream (that is, bytes) with UTF-8 and the server then uses GBK to convert that binary stream back to a (Unicode) string, garbled characters appear. The bytes must be decoded with the same encoding that produced them.
           GBK (encode)            GBK (decode)
Unicode ----------------> byte[] ----------------> Unicode
string                binary stream                string
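The round trip in the diagram, and what happens when the two sides disagree, can be sketched as follows. The variable names are illustrative; errors="replace" is used so that the mismatched decode never raises and simply shows whatever garbage comes out.

```python
text = "中文"

wire = text.encode("utf-8")                     # client: string -> bytes
wrong = wire.decode("gbk", errors="replace")    # server decodes with the wrong table
right = wire.decode("utf-8")                    # server decodes with the matching table

print(repr(wrong))      # garbled text, not the original string
print(right == text)    # True
```

The bytes on the wire are identical in both cases; only the table used to read them differs.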

UTF stands for Unicode Transformation Format.
UTF-8 is a variable-length character encoding for Unicode.
GBK: the Chinese national standard extension code. It keeps all the Chinese characters of the original GB 2312-80 at their original code positions,
and it also covers the 20,902 Chinese characters of the original Unicode.
GBK is a Chinese character encoding specification developed in mainland China that extends the national standard to match the Chinese character repertoire of UCS.
GB 2312, or GB 2312-80, is the Chinese national standard character set for simplified Chinese.
It contains a total of 6,763 Chinese characters.
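The relationship between GB2312 and GBK can be checked with Python's built-in codecs. In this sketch the helper can_encode and the two sample characters are illustrative choices: 汉 is a simplified character, 漢 is its traditional form.

```python
def can_encode(ch: str, codec: str) -> bool:
    """Return True if ch has a code in the given character encoding."""
    try:
        ch.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# A GB2312 character keeps the same bytes in GBK (backward compatibility).
print("汉".encode("gb2312") == "汉".encode("gbk"))             # True

# The traditional form is missing from GB2312 but present in GBK.
print(can_encode("漢", "gb2312"), can_encode("漢", "gbk"))     # False True
```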

The difference is simple: they are different character sets.
A character set is really just a comparison table. The Unicode code for the character "啊" is "554a", while its GBK code is "b0a1".
What is a comparison table? A simple example is a class roster. "Zhang San" corresponds to No. 10. The computer cannot store "Zhang San" itself, so it stores "10"; when it reads a 10 back, it looks it up in the comparison table and returns "Zhang San".
In the school-wide roster, however, "Zhang San" is No. 15. Different rosters are different character sets.
Now suppose you have stored "10". If you look it up in the class roster, you get "Zhang San" back. If you look it up in the school roster instead, you get somebody else, perhaps a foreign student whose name you cannot read, and it looks like garbage to you. In fact nothing has been scrambled: take the 10 back, look it up in the class roster, and "Zhang San" comes out again. NOTE: if the school roster has no No. 10 at all, then by convention you are handed "????". At that point the text really is garbled, and it cannot be recovered.
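The unrecoverable "????" case corresponds to encoding a character that the target table does not contain. A minimal sketch, assuming Korean text as the sample (GBK has no Hangul) and Python's errors="replace" policy as the agreed placeholder:

```python
text = "한국어"

# GBK has no entry for these characters, so each one becomes "?".
lossy = text.encode("gbk", errors="replace")
print(lossy)                  # b'???'

# Decoding cannot bring the original back: the information is gone.
print(lossy.decode("gbk"))    # ???
```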
Different rosters are different character sets. GB2312 is one roster. Later, two students transferred into the class and were added to it. The teacher said the roster was no longer the same one, so it was renamed GBK.
There is a student named "Zhang San". In China his number is "10". Later he went abroad, and the United Nations gave him another number, "20". One number comes from GBK and the other from Unicode (which UTF-8 stores), but both belong to the same person; only the numbers differ.
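The "same person, different numbers" idea can be checked for the character 啊 (the "554a" and "b0a1" example). This is only a quick look-up with Python's built-in codecs, not a description of how the tables are stored.

```python
ch = "啊"

print(hex(ord(ch)))              # 0x554a  (Unicode code point)
print(ch.encode("gbk").hex())    # b0a1    (GBK code)

# Both numbers lead back to the same character when read with the right table.
print(bytes.fromhex("b0a1").decode("gbk") == chr(0x554A))   # True
```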
PS: UTF-8 is an encoding form of Unicode. Given a character's Unicode code point, its UTF-8 bytes can be computed directly.
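The calculation can be sketched by hand. The function name utf8_from_code_point is hypothetical, and the sketch skips validation (surrogates and values above U+10FFFF are not rejected); in practice str.encode("utf-8") does this work.

```python
def utf8_from_code_point(cp: int) -> bytes:
    """Compute the UTF-8 bytes for a Unicode code point (no validation)."""
    if cp < 0x80:        # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])


print(utf8_from_code_point(0x554A).hex())   # e5958a
print("啊".encode("utf-8").hex())            # e5958a
```

By contrast, a GBK code cannot be computed from the code point this way; it has to be looked up in a table.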
