What is the difference between GB18030 and UTF-8?
Reply content:
What is the difference between GB18030 and UTF-8?
GB18030 is a Chinese standard (GB means "national standard"), and it specifies how each character is represented as bytes. Unicode only gives each character a number and does not specify how to represent (or store) it; UTF-8 specifies the representation. So GB18030 and Unicode + UTF-8 are two different ways of representing characters. One is a standard set by China and the other is a standard set by international organizations.
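To make the distinction concrete, here is a minimal Python sketch using the standard library codecs. The sample character 中 is only an illustrative choice, not something from the answer above. The Unicode number is the same in both cases, but the two encodings store it as different bytes.

```python
# One character, one Unicode number, two different byte representations.
ch = "中"  # sample character chosen for illustration

code_point = ord(ch)               # the number Unicode assigns to the character
gb_bytes = ch.encode("gb18030")    # bytes under the Chinese national standard
utf8_bytes = ch.encode("utf-8")    # bytes under UTF-8

print(f"U+{code_point:04X}")       # U+4E2D
print(gb_bytes.hex())              # d6d0
print(utf8_bytes.hex())            # e4b8ad

# Decoding each byte string with its own codec recovers the same character.
same_char = gb_bytes.decode("gb18030") == utf8_bytes.decode("utf-8") == ch
print(same_char)                   # True
```

Decoding the bytes with the wrong codec is what produces the familiar garbled text.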
When computers were invented, nobody expected them to become as widespread as they are now, so only 128 symbols, control characters included, were defined. That set is ASCII.
Later, computers spread to non-English-speaking countries, which wanted their own languages displayed by computers too. ASCII uses only 128 values, but a byte has 8 bits and can hold 256, so 128 values were left over, and those were used to represent each country's own script. Different countries and regions assigned different characters to these 128 positions (in practice 96, since 0x80 to 0x9F are reserved for control codes), producing ISO 8859-1 (Latin-1), ISO 8859-2 (Latin-2), ... ISO 8859-16 (Latin-10). (Well, some of the parts are not called Latin.)
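As a small illustration of why these single-byte standards clash, the following Python sketch decodes one and the same byte under three ISO 8859 parts. The byte value 0xE9 and the choice of parts 1, 5 and 7 are arbitrary examples.

```python
# The same byte value means different things under different ISO 8859 parts.
raw = bytes([0xE9])  # one byte from the upper half (0x80 and above)

latin1 = raw.decode("iso-8859-1")    # Western European part: é
cyrillic = raw.decode("iso-8859-5")  # Cyrillic part: a Cyrillic letter
greek = raw.decode("iso-8859-7")     # Greek part: a Greek letter

for name, ch in (("Latin-1", latin1), ("Cyrillic", cyrillic), ("Greek", greek)):
    print(name, ch, f"U+{ord(ch):04X}")

# Bytes below 0x80 are plain ASCII in every part, so they always agree.
ascii_views = {b"A".decode(c) for c in ("iso-8859-1", "iso-8859-5", "iso-8859-7")}
print(ascii_views)
```

Without knowing which part a file was written in, the upper-half bytes cannot be interpreted reliably.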
East Asia is a different story; take China as the example. 128 extra values are clearly not enough for all Chinese characters, so a Chinese character has to take two bytes. The rule was that original ASCII is still represented by one byte, and two consecutive bytes (both 128 or above) represent one Chinese character. That gives at most 128 × 128 = 16384 Chinese characters (actually far fewer). This is called GB2312.
Later, people realized this was wasteful. As long as the first byte is 128 or above, two consecutive bytes can indicate a Chinese character without ambiguity: if the current byte is below 128, it is standard ASCII; if it is 128 or above, the current byte and the next byte together represent a Chinese character. That raises the ceiling to 128 × 256 = 32768, and more Chinese characters were added. This is called GBK. On that basis, still more characters were added, and that version is called GB18030. (East Asia also has the stories of Big5 and CJK.)
Different countries had different standards, which made exchanging text inconvenient. So two organizations (I forget what they were called) set out to unify all the character sets; one of them thought the other was doing well and voluntarily withdrew. The result is called Unicode.
However, Unicode only specifies the number of each character and does not specify how to represent it. For example, A is number 65; it can be stored as the single byte 0x41, as two bytes 0x00 0x41, or as four bytes 0x00 0x00 0x00 0x41, and when several bytes are used there is also the question of which byte comes first. So there are several different standards for representing Unicode. UTF-8 is one of them (finally we get to UTF-8). How it represents characters is a bit complex: it is a variable-length encoding in which some characters take one byte (compatible with ASCII, which is why UTF-8 is the more popular choice), some take two, some three... For the exact rule that converts a number into several bytes, search Baidu.
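For readers who want the rule for turning a number into bytes without searching, here is a minimal Python sketch of the UTF-8 bit layout. `encode_utf8` is a hypothetical name for this illustration; it skips validation (surrogates, range checks), and in real code `str.encode("utf-8")` does the job.

```python
def encode_utf8(cp):
    """Encode one Unicode code point as UTF-8 bytes (no validation)."""
    if cp < 0x80:       # up to 7 bits: 0xxxxxxx, identical to ASCII
        return bytes([cp])
    if cp < 0x800:      # up to 11 bits: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:    # up to 16 bits: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # up to 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


samples = "A\u00e9\u4e2d\U0001F600"  # 1-, 2-, 3- and 4-byte examples
results = {ch: encode_utf8(ord(ch)) for ch in samples}
for ch, encoded in results.items():
    # Each hand-made result should match Python's built-in encoder.
    print(f"U+{ord(ch):04X}", encoded.hex(), encoded == ch.encode("utf-8"))
```

The leading bits of the first byte announce how many bytes follow, and every continuation byte starts with 10, so a decoder can always find character boundaries.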
Extended: there are two other standards, UTF-16 and UTF-32. UTF-16 represents every character in two or four bytes (this is the encoding used by Java and JavaScript; there is also the story of the fixed-length UCS-2 encoding here), and UTF-32 represents every character in four bytes, so neither is compatible with ASCII. Since they are multi-byte, they run into the issue of byte order...
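The byte-order issue is easy to see with Python's built-in codecs. This is only an illustrative sketch; the sample characters are arbitrary.

```python
a = "A"

utf16_be = a.encode("utf-16-be")   # big-endian: high byte first
utf16_le = a.encode("utf-16-le")   # little-endian: low byte first
utf32_be = a.encode("utf-32-be")   # always four bytes per character

print(utf16_be.hex())  # 0041
print(utf16_le.hex())  # 4100
print(utf32_be.hex())  # 00000041

# Without an explicit order, Python prepends a byte order mark (BOM)
# so that a reader can tell which order was used.
with_bom = a.encode("utf-16")
print(with_bom[:2] in (b"\xff\xfe", b"\xfe\xff"))  # True

# Characters beyond U+FFFF take four bytes in UTF-16 (a surrogate pair).
emoji16 = "\U0001F600".encode("utf-16-be")
print(emoji16.hex())   # d83dde00
```

None of these outputs equals the single ASCII byte 0x41, which is the incompatibility with ASCII mentioned above.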
I strongly agree with @zonxin, but I would like to add:
Cherish your life and stay away from GBK/GB2312 (of course that is a little extreme, but that's my take!)
Of course, UTF-8 has its own problem: Microsoft came up with UTF-8 with BOM as well as UTF-8 without BOM. Damn, is being a programmer ever easy!
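The BOM difference can be reproduced with Python's `utf-8-sig` codec. A minimal sketch with an arbitrary sample string:

```python
import codecs

text = "hi"
without_bom = text.encode("utf-8")      # plain UTF-8
with_bom = text.encode("utf-8-sig")     # UTF-8 with the 3-byte BOM prepended

print(without_bom.hex())                # 6869
print(with_bom.hex())                   # efbbbf6869
print(with_bom.startswith(codecs.BOM_UTF8))  # True

# Decoding BOM-prefixed bytes as plain UTF-8 leaves an invisible U+FEFF
# character at the start of the string, a classic source of odd bugs.
naive = with_bom.decode("utf-8")
print(repr(naive))                      # '\ufeffhi'

# The utf-8-sig codec strips the BOM if present and tolerates its absence.
clean = with_bom.decode("utf-8-sig")
also_clean = without_bom.decode("utf-8-sig")
print(clean == also_clean == text)      # True
```

Reading input with `utf-8-sig` is a common way to accept both variants when the source of a file is unknown.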
The former is a Chinese character set, and the latter is an encoding of Unicode (the "universal code"); they are completely different character set encodings! I suggest you get a solid understanding of character encoding. For more information, see this article.
The former is a character set, which is equivalent to "what to say", and the latter is an encoding method, which is equivalent to "how to say it".