Unicode, GBK, UTF-8 differences

Source: Internet
Author: User
The differences between Unicode, GBK, and UTF-8 in simple terms: Unicode, GBK, and Big5 codes are encoded values, while UTF-8, UTF-16, and so on are representations of such a value. The first three codes are mutually incompatible: for the same Chinese character, the three code values are completely different. For example, the Unicode value of 汉 ("Han") is 6C49 while its GBK value is BABA, and UTF-8 is merely a form for expressing the Unicode value. UTF-8 is organized purely around Unicode, so to convert GBK to UTF-8 you must first convert to Unicode and then to UTF-8.
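The two-step conversion described above (GBK to Unicode, then Unicode to UTF-8) can be sketched in Python. This is only an illustration: the helper name gbk_to_utf8 is hypothetical, and the sample bytes BA BA are the GB code for 汉 quoted later in this article.

```python
def gbk_to_utf8(data):
    """Hypothetical helper: GBK bytes -> Unicode text -> UTF-8 bytes."""
    text = data.decode("gbk")       # step 1: GBK bytes to Unicode code points
    return text.encode("utf-8")     # step 2: Unicode code points to UTF-8 bytes

han_gbk = bytes.fromhex("BABA")     # GB code of U+6C49
han_utf8 = gbk_to_utf8(han_gbk)
print(han_utf8.hex().upper())       # E6B189
```

There is no direct GBK-to-UTF-8 table in this sketch; Unicode is always the intermediate form.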
For details, refer to the following article.

Talking about Unicode encoding, with brief explanations of terms such as UCS, UTF, BMP, and BOM.
This is a light read written by a programmer for programmers. "Light" here means that it lets you understand some previously unclear concepts fairly painlessly and build up your knowledge, much like leveling up in an RPG. Two questions motivated this article:

Question 1:
Using Save As in Windows Notepad, you can convert a file between the GBK, Unicode, Unicode big endian, and UTF-8 encodings. The result is a .txt file in every case, so how does Windows identify the encoding?

I found that TXT files saved as Unicode, Unicode big endian, and UTF-8 begin with a few extra bytes: FF FE (Unicode), FE FF (Unicode big endian), and EF BB BF (UTF-8). But what standard are these markers based on?

Question 2:
I recently came across a ConvertUTF.c online that implements conversion among UTF-32, UTF-16, and UTF-8. I already understood encodings such as Unicode (UCS-2), GBK, and UTF-8, but this program left me a little confused: I could not remember how UTF-16 and UCS-2 are related.
After consulting the relevant material, I finally cleared up these questions and learned some details of Unicode along the way. So I wrote this article for friends with similar questions. I have tried to keep it easy to understand, but readers are expected to know what a byte is and what hexadecimal is.

0. Big endian and little endian
Big endian and little endian are two different ways for a CPU to handle multi-byte numbers. For example, the Unicode code of the character 汉 ("Han") is 6C49. When writing it to a file, should 6C come first or 49? Writing 6C first is big endian; writing 49 first is little endian.
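The two byte orders can be seen directly by serializing the code 6C49 both ways. A minimal Python sketch (the variable names are illustrative only):

```python
import sys

code = 0x6C49

big = code.to_bytes(2, byteorder="big")        # high byte first
little = code.to_bytes(2, byteorder="little")  # low byte first

print(big.hex().upper())     # 6C49
print(little.hex().upper())  # 496C
print(sys.byteorder)         # byte order of the machine running this script
```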

The word "endian" comes from Gulliver's Travels. The civil war in Lilliput arose over whether an egg should be cracked at the big end (Big-Endian) or the little end (Little-Endian). Because of this there were six rebellions, in which one emperor lost his life and another lost his throne.

In Chinese, "endian" is generally translated as "byte order", and big endian and little endian are informally called "big tail" and "small tail".

1. Character encoding and internal code, with an introduction to Chinese character encoding
Characters must be encoded before a computer can process them. The default encoding a computer uses is its internal code. Early computers used 7-bit ASCII. To process Chinese characters, programmers designed GB2312 for Simplified Chinese and Big5 for Traditional Chinese.

GB2312 (1980) contains 7445 characters in total: 6763 Chinese characters and 682 other symbols. The internal code range of the Chinese character area is B0-F7 for the high byte and A1-FE for the low byte, occupying 72*94 = 6768 code points, of which the five code points D7FA-D7FE are unassigned.
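The arithmetic above can be checked in a few lines. This is a small sketch whose variable names are chosen only for illustration:

```python
rows = 0xF7 - 0xB0 + 1            # number of possible high bytes: 72
cols = 0xFE - 0xA1 + 1            # number of possible low bytes: 94
total = rows * cols               # code points in the Chinese character area

unassigned = 0xD7FE - 0xD7FA + 1  # the five empty code points D7FA-D7FE
hanzi = total - unassigned

print(rows, cols, total, hanzi)   # 72 94 6768 6763
```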

GB2312 supports too few Chinese characters. The 1995 Chinese character extension specification GBK 1.0 contains 21886 characters, divided into a Chinese character area and a graphic symbol area. The Chinese character area contains 21003 characters.

From ASCII and GB2312 to GBK, these encodings are backward compatible: the same character always has the same code in each of them, and the later standards support more characters. In these encodings, English and Chinese can be processed uniformly. What marks a Chinese character's code is that the highest bit of its high byte is not 0. In programmers' terms, GB2312 and GBK are both double-byte character sets (DBCS).
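This compatibility can be observed with Python's built-in codecs. The sketch below is illustrative only: the helper name encodings is hypothetical, and the three sample characters (A, 汉 at U+6C49, and 丂 at U+4E02, which GBK added) were picked for the demonstration.

```python
CODECS = ("gb2312", "gbk", "gb18030")

def encodings(ch):
    """Encode ch with each codec; None means the codec has no code for it."""
    result = {}
    for name in CODECS:
        try:
            result[name] = ch.encode(name)
        except UnicodeEncodeError:
            result[name] = None
    return result

for ch in ("A", "\u6c49", "\u4e02"):
    shown = {k: (v.hex().upper() if v else None) for k, v in encodings(ch).items()}
    print(f"U+{ord(ch):04X}", shown)

# The high byte of a Chinese character's code has its highest bit set.
lead = "\u6c49".encode("gbk")[0]
print(bool(lead & 0x80))
```

A character that exists in the older standard keeps the same bytes in the newer ones; a character added later simply has no code in the older standard.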

In 2000, GB18030 became the official national standard, replacing GBK 1.0. The standard includes 27484 Chinese characters, as well as Tibetan, Mongolian, Uyghur, and other major minority scripts. In terms of Chinese characters, GB18030 adds the 6582 Chinese characters of CJK Extension A (Unicode 0x3400-0x4DB5) to the 20902 Chinese characters of GB13000.1, for a total of 27484.

CJK stands for Chinese, Japanese, and Korean. To save code points, Unicode encodes the characters of these three languages in a unified way. GB13000.1 is the Chinese version of ISO/IEC 10646-1, which is equivalent to Unicode 1.1.

GB18030 uses single-byte, double-byte, and 4-byte codes. The single-byte and double-byte codes are fully compatible with GBK. The 4-byte code points hold the Chinese characters of CJK Extension A. For example, UCS 0x3400 is encoded in GB18030 as 8139EE39, and UCS 0x3401 as 8139EF30.
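Values like these can be checked with Python's built-in gb18030 codec, which prints the four-byte codes for the two code points mentioned above. This is only a sketch; the codec name is Python's own, and the variable names are illustrative.

```python
for code_point in (0x3400, 0x3401):
    encoded = chr(code_point).encode("gb18030")
    print(f"U+{code_point:04X} -> {encoded.hex().upper()} ({len(encoded)} bytes)")

# A character that GBK already covers keeps its two-byte GBK code.
two_byte = "\u6c49".encode("gb18030")
four_byte = [chr(cp).encode("gb18030") for cp in (0x3400, 0x3401)]
print(two_byte.hex().upper())   # BABA
```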

Microsoft provides a GB18030 upgrade package, but it only supplies a new font that supports the 6582 Chinese characters of CJK Extension A, NSimSun-18030 (新宋体-18030), and does not change the internal code. The internal code of Windows is still GBK.

Here are some details:

The original GB2312 standard defines only location codes (区位码). To get from a location code to the internal code, A0 must be added to both the high byte and the low byte.

For any character encoding, the order of encoding units is specified by the encoding scheme and has nothing to do with endianness. For example, the encoding unit of GBK is the byte, and two bytes represent one Chinese character. The order of those two bytes is fixed and is not affected by the CPU's byte order. The encoding unit of UTF-16 is the word (two bytes): the order of words is specified by the encoding scheme, but the arrangement of bytes within a word is affected by endianness. UTF-16 is covered later.

The highest bit of both bytes of a GB2312 code is 1, but that leaves room for only 128*128 = 16384 code points. Therefore, in GBK and GB18030 the highest bit of the low byte may be 0. This does not affect the parsing of a DBCS character stream: when reading the stream, as soon as a byte with its highest bit set is encountered, the next two bytes can be taken as one double-byte code, regardless of the highest bit of the low byte.
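The parsing rule just described can be sketched as a small splitter. The function name split_dbcs is hypothetical, and the sketch covers only single-byte and double-byte codes (not the 4-byte codes of GB18030):

```python
def split_dbcs(data):
    """Split a DBCS byte stream into one-byte and two-byte code units."""
    units = []
    i = 0
    while i < len(data):
        if data[i] & 0x80:              # highest bit set: start of a double-byte code
            units.append(data[i:i + 2])
            i += 2
        else:                           # plain ASCII byte
            units.append(data[i:i + 1])
            i += 1
    return units

# "a", BA BA (a GB2312 code), 81 40 (a GBK code whose low byte is below 0x80), "b"
stream = b"a" + bytes.fromhex("BABA") + bytes.fromhex("8140") + b"b"
units = split_dbcs(stream)
print([u.hex().upper() for u in units])   # ['61', 'BABA', '8140', '62']
```

Note that 0x40 in the third unit is also the ASCII code of "@"; it is read correctly only because the parser has already seen the high byte 0x81.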

2. Unicode, UCS, and UTF
The encodings from ASCII through GB2312 and GBK to GB18030 are backward compatible. Unicode, however, is compatible only with ASCII (more precisely, with ISO-8859-1) and is not compatible with the GB codes. For example, the Unicode code of 汉 ("Han") is 6C49, while its GB code is BABA.

Unicode is also a character encoding, but it is designed by international organizations and can accommodate all the world's languages and scripts. The formal name of Unicode is "Universal Multiple-Octet Coded Character Set", abbreviated UCS. UCS can also be read as an abbreviation of "Unicode Character Set".

According to Wikipedia (http://zh.wikipedia.org/wiki/), two organizations historically tried to design Unicode independently: the International Organization for Standardization (ISO) and a consortium of software manufacturers (unicode.org). ISO developed the ISO 10646 project, and the Unicode Consortium developed the Unicode project.

Around 1991, both sides realized that the world does not need two incompatible character sets. They began to merge their work and cooperate on a single code table. Since Unicode 2.0, the Unicode project has used the same character repertoire and code points as ISO 10646-1.

Both projects still exist and publish their standards independently. At the time of writing, the latest version from the Unicode Consortium is Unicode 4.1.0 (2005), and the ISO standard is ISO 10646.

UCS only specifies how characters are encoded; it does not specify how the codes are transmitted or stored. For example, the UCS code of 汉 ("Han") is 6C49. I could transmit and store it as four ASCII digits, or I could use UTF-8 and represent it with the three consecutive bytes E6 B1 89. What matters is that both communicating parties agree. UTF-8, UTF-7, and UTF-16 are all widely accepted schemes. A particular benefit of UTF-8 is that it is fully compatible with ASCII, so the first 128 characters of ISO-8859-1 keep their single-byte codes. UTF stands for "UCS Transformation Format".

IETF RFC 2781 and RFC 3629 describe the UTF-16 and UTF-8 encoding methods in the consistent RFC style: clear, lively, and rigorous. I can never remember that IETF stands for Internet Engineering Task Force, but the RFCs maintained by the IETF are the basis of all specifications on the Internet.

2.1. Internal code and code page
Currently, the Windows kernel supports the Unicode character set, so at the kernel level it can support all the world's languages and scripts. However, because a large number of existing programs and documents use language-specific encodings such as GBK, Windows cannot drop support for existing encodings and switch everything to Unicode.

Windows uses code pages to adapt to different countries and regions. A code page can be understood as the internal code mentioned above. The code page corresponding to GBK is CP936.

Microsoft has also defined a code page for GB18030: CP54936. However, because GB18030 includes 4-byte codes while Windows code pages support only single-byte and double-byte encodings, this code page cannot actually be used.

3. UCS-2, UCS-4, BMP
UCS comes in two formats: UCS-2 and UCS-4. As the names suggest, UCS-2 encodes with two bytes and UCS-4 with four bytes (actually only 31 bits; the highest bit must be 0). Let's do some simple arithmetic:

UCS-2 has 2^16 = 65536 code points; UCS-4 has 2^31 = 2147483648 code points.

UCS-4 is divided into 2^7 = 128 groups according to the highest byte (whose highest bit is 0). Each group is divided into 256 planes according to the second-highest byte. Each plane is divided into 256 rows according to the third byte, and each row contains 256 cells. Cells in the same row differ only in the last byte; the other bytes are the same.
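The group/plane/row/cell structure is simply the four bytes of the UCS-4 code read from high to low. A minimal sketch (the function name ucs4_parts is hypothetical):

```python
def ucs4_parts(code):
    """Split a 31-bit UCS-4 code into (group, plane, row, cell)."""
    group = (code >> 24) & 0x7F   # highest byte; its top bit is always 0
    plane = (code >> 16) & 0xFF   # second-highest byte
    row = (code >> 8) & 0xFF      # third byte
    cell = code & 0xFF            # last byte
    return group, plane, row, cell

print(ucs4_parts(0x6C49))         # (0, 0, 108, 73), i.e. row 0x6C, cell 0x49
print(2 ** 7, 2 ** 16, 2 ** 31)   # 128 65536 2147483648
```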

Plane 0 of group 0 is called the Basic Multilingual Plane, or BMP. In other words, the code points of UCS-4 whose two high bytes are 0 make up the BMP.

Strip the two leading zero bytes from a BMP code in UCS-4 and you get UCS-2. Prepend two zero bytes to a two-byte UCS-2 code and you get the BMP code in UCS-4. Note that this two-byte form covers only the BMP: characters that Unicode assigns outside the BMP have no UCS-2 form.
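The relationship between the two-byte and four-byte forms can be illustrated with Python's big-endian UTF-16 and UTF-32 codecs, which for a BMP character produce exactly the two-byte and four-byte values discussed here. This is only a sketch; using these codecs as stand-ins for UCS-2 and UCS-4 is an assumption made for the demonstration.

```python
ch = "\u6c49"

two_bytes = ch.encode("utf-16-be")    # stands in for UCS-2 here
four_bytes = ch.encode("utf-32-be")   # stands in for UCS-4 here

print(two_bytes.hex().upper())        # 6C49
print(four_bytes.hex().upper())       # 00006C49

padded = b"\x00\x00" + two_bytes      # prepend two zero bytes
stripped = four_bytes[2:]             # remove the two leading zero bytes
```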

4. UTF Encoding

UTF-8 encodes UCS using 8 bits as its encoding unit. The encoding from UCS-2 to UTF-8 is as follows:

UCS-2 code (hex)    UTF-8 byte stream (binary)
0000-007F           0xxxxxxx
0080-07FF           110xxxxx 10xxxxxx
0800-FFFF           1110xxxx 10xxxxxx 10xxxxxx

For example, the Unicode code of 汉 ("Han") is 6C49. 6C49 lies in the range 0800-FFFF, so the 3-byte template must be used: 1110xxxx 10xxxxxx 10xxxxxx. Written in binary, 6C49 is 0110 110001 001001. Substituting this bit stream for the x's in the template, in order, gives 11100110 10110001 10001001, that is, E6 B1 89.
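The template substitution can be written out as bit operations. The sketch below follows the three-row table above; the function name ucs2_to_utf8 is hypothetical, and for brevity it does not reject the surrogate range D800-DFFF.

```python
def ucs2_to_utf8(code):
    """Encode a UCS-2 code (0x0000-0xFFFF) using the three templates."""
    if code <= 0x7F:                              # 0xxxxxxx
        return bytes([code])
    if code <= 0x7FF:                             # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code >> 6),
                      0x80 | (code & 0x3F)])
    return bytes([0xE0 | (code >> 12),            # 1110xxxx
                  0x80 | ((code >> 6) & 0x3F),    # 10xxxxxx
                  0x80 | (code & 0x3F)])          # 10xxxxxx

print(ucs2_to_utf8(0x6C49).hex().upper())         # E6B189
```

The result can be compared with Python's own encoder, for example chr(0x6C49).encode("utf-8").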

Readers can use Notepad to check whether this encoding is correct. Note that UltraEdit automatically converts a UTF-8 text file to UTF-16 when opening it, which can be confusing; this option can be disabled in its settings. A better tool is Hex Workshop.

UTF-16 encodes UCS using 16 bits as its encoding unit. For a UCS code less than 0x10000, the UTF-16 encoding is simply the 16-bit unsigned integer corresponding to that code. For UCS codes not less than 0x10000, an algorithm is defined. However, the codes used in UCS-2, or in the BMP of UCS-4, are always less than 0x10000, so for these characters UTF-16 and UCS-2 can be considered essentially the same. But UCS-2 is only a coding scheme, while UTF-16 is used for actual transmission, so the question of byte order has to be considered.
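The algorithm for codes not less than 0x10000 is the surrogate-pair scheme described in RFC 2781. A minimal sketch follows; the function name utf16_units is hypothetical, and the sample code point 0x20000 is chosen only for illustration.

```python
def utf16_units(code):
    """Return the list of 16-bit UTF-16 units for a UCS code up to 0x10FFFF."""
    if code < 0x10000:
        return [code]                    # one unit, equal to the code itself
    v = code - 0x10000                   # 20-bit value
    high = 0xD800 | (v >> 10)            # upper 10 bits
    low = 0xDC00 | (v & 0x3FF)           # lower 10 bits
    return [high, low]

print([f"{u:04X}" for u in utf16_units(0x6C49)])    # ['6C49']
print([f"{u:04X}" for u in utf16_units(0x20000)])   # ['D840', 'DC00']
```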

5. UTF byte order and BOM
UTF-8 uses the byte as its encoding unit, so it has no byte order problem. UTF-16 uses two bytes as its encoding unit, so before interpreting a UTF-16 text you must know the byte order of each encoding unit. For example, the Unicode code of 奎 ("Kui") is 594E and that of 乙 ("Yi") is 4E59. If we receive the UTF-16 byte stream 59 4E, is it 奎 or 乙?
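The ambiguity is easy to reproduce: the same two bytes decode to different characters depending on the byte order assumed. A short sketch using Python's explicit-endian UTF-16 codecs:

```python
data = bytes.fromhex("594E")

as_big = data.decode("utf-16-be")       # read as 59 4E -> U+594E
as_little = data.decode("utf-16-le")    # read as 4E 59 -> U+4E59

print(f"U+{ord(as_big):04X}")           # U+594E
print(f"U+{ord(as_little):04X}")        # U+4E59
```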

The method the Unicode specification recommends for marking byte order is the BOM. This BOM is not the "bill of materials" BOM but the byte order mark. The BOM is a rather clever idea:

UCS contains a character called "ZERO WIDTH NO-BREAK SPACE", whose code is FEFF. FFFE, on the other hand, is not a character in UCS, so it should never appear in actual transmission. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before transmitting the byte stream.

This way, if the receiver receives FEFF, the byte stream is big-endian; if it receives FFFE, the byte stream is little-endian. For this reason the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.

UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (readers can verify this with the encoding method described earlier). So if the receiver gets a byte stream that starts with EF BB BF, it knows the stream is UTF-8 encoded.

Windows uses BOM to mark the encoding of text files.
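BOM-based detection of this kind can be sketched as follows. This is not a description of how Notepad itself is implemented; the function name sniff_bom is hypothetical, and only the three BOMs mentioned in this article are checked (UTF-32 is ignored).

```python
import codecs

BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]

def sniff_bom(data):
    """Return (encoding name or None, data with any leading BOM removed)."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, data[len(bom):]
    return None, data

name, body = sniff_bom(bytes.fromhex("FFFE496C"))
print(name, f"U+{ord(body.decode(name)):04X}")     # utf-16-le U+6C49
print("\ufeff".encode("utf-8").hex().upper())      # EFBBBF
```

When no BOM is found the sketch returns None, and the caller has to fall back on some default encoding.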

6. Further references
This article draws mainly on "Short overview of ISO-IEC 10646 and Unicode" (http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html).

I also found two documents that looked good, but I did not read them because I had already found the answers to my initial questions:

"Understanding Unicode: A general introduction to the Unicode Standard" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter04a)
"Character set encoding basics: Understanding character set encodings and legacy encodings" (http://scripts.sil.org/cms/scripts/page.php?site_id=nrsi&item_id=IWS-Chapter03)
I have written software packages for converting among UTF-8, UCS-2, and GBK, in versions that use the Windows API and versions that do not. When I have time, I will tidy them up and put them on my personal homepage (http://fmddlmyy.home4u.china.com).

I started writing this article only after thinking all the questions through, and I expected to finish it quickly. I did not expect that choosing the wording and verifying details would take so long; the writing, begun in the afternoon, ran on far longer than planned. I hope some readers will benefit from it.

Appendix 1: Location code, GB2312, internal code, and code page
Some readers may have questions about this sentence in the article:
"The original GB2312 standard defines only location codes (区位码). To get from a location code to the internal code, A0 must be added to both the high byte and the low byte."

I will explain in detail:

The original GB2312 refers to the 1980 national standard of the People's Republic of China, GB 2312-80, "Chinese Character Coded Character Set for Information Interchange". This standard uses two numbers to encode Chinese characters and Chinese symbols. The first number is called the "area" (区) and the second the "position" (位), so the code is also called the location code (区位码). Areas 1-9 hold Chinese symbols, areas 16-55 hold level-1 Chinese characters, and areas 56-87 hold level-2 Chinese characters. Windows still has a location-code input method; for example, entering 1601 produces 啊 ("ah"). (This input method automatically recognizes both hexadecimal GB2312 codes and decimal location codes, so entering B0A1 also produces 啊.)

Internal code refers to the character encoding used inside the operating system. In early operating systems the internal code was language-dependent. Windows now supports Unicode internally and uses code pages to adapt to various languages, so the concept of "internal code" has become rather vague. Microsoft generally refers to the encoding specified by the default code page as the internal code.

There is no official definition of the term "internal code", and "code page" is just a term used by Microsoft. As programmers, we only need to know what these things are; there is no need to dwell on the terminology.

A code page is a character encoding for one language or script. For example, the code page of GBK is CP936, that of Big5 is CP950, and that of GB2312 is CP20936.

Windows has the concept of a default code page, that is, the encoding used by default to interpret characters. For example, suppose Windows Notepad opens a text file whose content is the byte stream BA, BA, D7, D6. How should Windows interpret it?

Should it be interpreted as Unicode, GBK, Big5, or ISO-8859-1? Interpreted as GBK, it yields the two characters 汉字. Interpreted in another encoding, the corresponding characters may not be found, or the wrong characters may be found. "Wrong" here means inconsistent with what the author of the text intended, and that is how garbled text arises.

The answer is that Windows interprets the byte stream in a text file according to the current default code page. The default code page can be set through the regional options in the Control Panel. The ANSI option in Notepad's Save As dialog actually saves the file in the encoding of the default code page.
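The effect of choosing one interpretation or another can be reproduced by decoding the same four bytes with several codecs. This is only a sketch: the list of codecs is an arbitrary selection, and errors="replace" is used so that undecodable bytes show up as replacement characters instead of raising an exception.

```python
raw = bytes.fromhex("BABAD7D6")

for name in ("gbk", "big5", "latin-1", "utf-8"):
    text = raw.decode(name, errors="replace")
    # ascii() keeps the output printable on any console
    print(f"{name:8} -> {ascii(text)}")

as_gbk = raw.decode("gbk")          # the two intended Chinese characters
as_latin1 = raw.decode("latin-1")   # four unrelated Latin-1 characters
```

Only the GBK interpretation gives back the text the author intended; the others are what a reader sees as garbled characters.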

Internally, Windows now uses Unicode, so it can technically support multiple code pages at the same time. As long as a file declares the encoding it uses and the user has the corresponding code page installed, Windows can display it correctly. For example, an HTML file can specify its charset.

Some HTML authors, especially English-speaking ones, assume that everyone in the world uses English and do not specify a charset in the file. If such a file uses characters in the range 0x80-0xFF and Chinese Windows interprets it as GBK by default, garbled characters appear. In that case, adding a charset declaration to the HTML file is enough, for example:
<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">
If the code page the original author used is compatible with ISO-8859-1, there will be no garbled characters.

Returning to location codes: the location code of 啊 ("ah") is 1601, which written in hexadecimal is 0x10, 0x01. This conflicts with the ASCII codes widely used by computers. To stay compatible with the ASCII range 00-7F, A0 is added to both the high byte and the low byte of the location code, so the code of 啊 becomes B0A1. The code with the two A0s added is what is generally called GB2312 encoding, even though the original GB2312 standard never mentions this.
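The conversion from location code to internal code is a one-line computation. A minimal sketch (the function name location_to_internal is hypothetical; Python's gb2312 codec is used only to check the result):

```python
def location_to_internal(area, position):
    """Add 0xA0 to the area and position numbers of a GB2312 location code."""
    return bytes([area + 0xA0, position + 0xA0])

code = location_to_internal(16, 1)               # location code 1601
print(code.hex().upper())                        # B0A1
print(f"U+{ord(code.decode('gb2312')):04X}")     # U+554A
```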
