The differences between UTF-8, GBK and GB2312, and related tips

Source: Internet
Author: User
Tags: rfc
UTF-8 covers the characters used by every country in the world; it is an international encoding with strong universality. Text encoded in UTF-8 can be displayed by any browser that supports the UTF-8 character set, in any country. For example, if a page is encoded in UTF-8, a foreign user running English IE can still see the Chinese text without having to download the IE Chinese language support pack.

GBK is a national standard built on top of GB2312 and compatible with it. In GBK, every Chinese character is represented with two bytes, and the high bit is set to 1 to distinguish Chinese characters from ASCII. GBK contains all the commonly used Chinese characters; it is a national encoding, less universal than UTF-8, but UTF-8 text takes up more space in a database than GBK.

GBK, GB2312 and UTF-8 can only be converted to each other by going through Unicode:

GBK, GB2312 → Unicode → UTF-8

UTF-8 → Unicode → GBK, GB2312
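As a small illustration of this two-step path, here is a minimal sketch in Python (my own example, not from the original article): Python's str type holds Unicode, so decoding GBK bytes and re-encoding them as UTF-8 is exactly the GBK → Unicode → UTF-8 route shown above.

# Sketch: converting GBK to UTF-8 always passes through Unicode.
gbk_bytes = "汉字".encode("gbk")            # b'\xba\xba\xd7\xd6'
unicode_text = gbk_bytes.decode("gbk")      # the intermediate Unicode string
utf8_bytes = unicode_text.encode("utf-8")   # b'\xe6\xb1\x89\xe5\xad\x97'
print(gbk_bytes.hex(), "->", utf8_bytes.hex())

# And the reverse route, UTF-8 -> Unicode -> GBK:
assert utf8_bytes.decode("utf-8").encode("gbk") == gbk_bytes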

For a website or forum, if most of the characters are English, UTF-8 is recommended to save space. However, many forum plug-ins currently only support GBK.
The differences between the encodings, explained in detail
In short, Unicode, GBK and Big5 are code values (character sets), while UTF-8 and UTF-16 are ways of representing those values. The first three are mutually incompatible: the same character has completely different code values in each. For example, the Unicode value of "汉" (Han) is not the same as its GBK value; suppose the Unicode value is A040 and the GBK value is B030. UTF-8 is merely a form in which the value is expressed, and it is organized solely around Unicode, so to convert from GBK to UTF-8 you must first convert to Unicode and then from Unicode to UTF-8.

For details, see the article reposted below.

Talking about Unicode: a brief explanation of UCS, UTF, BMP, BOM and other terms
This is a light read written by a programmer for programmers. "Light" here means that it makes some previously fuzzy concepts fairly easy to understand and raises your knowledge a notch, a bit like levelling up in an RPG. The motivation for putting this article together was two questions:

Question one:
Using "Save As" in Windows Notepad you can convert between the GBK, Unicode, Unicode big endian and UTF-8 encodings. They are all just TXT files, so how does Windows recognize which encoding was used?

I had noticed earlier that TXT files saved as Unicode, Unicode big endian and UTF-8 start with a few extra bytes, namely FF FE (Unicode), FE FF (Unicode big endian) and EF BB BF (UTF-8). But what standard are these markers based on?

Question two:
I recently came across ConvertUTF.c on the net, which implements conversion between the UTF-32, UTF-16 and UTF-8 encodings. I already knew about encodings such as Unicode (UCS-2), GBK and UTF-8, but this program confused me a little: I couldn't work out what relationship UTF-16 has with UCS-2.
After checking the relevant material I finally cleared these questions up, and along the way learned some of the details of Unicode. I am writing this up for friends who have had similar questions. I have tried to keep the article easy to follow, but readers are expected to know what a byte is and what hexadecimal is.

0, Big endian and little endian
Big endian and little endian are two different ways for a CPU to handle multi-byte numbers. For example, the Unicode code of the character "汉" (Han) is 6C49. When writing it to a file, should 6C be written first or 49? If 6C is written first, that is big endian; if 49 is written first, that is little endian.
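As a quick check, here is a sketch using Python's struct module to write the same 16-bit value 0x6C49 under the two byte orders:

import struct

value = 0x6C49                          # Unicode code of "汉"
print(struct.pack(">H", value).hex())   # big endian: '6c49', the 6C byte comes first
print(struct.pack("<H", value).hex())   # little endian: '496c', the 49 byte comes first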

The word "endian" is derived from Gulliver's Travels. Lilliput's civil war stems from eating eggs from the big Head (Big-endian) knock Open or from the beginning (Little-endian), which has occurred six times rebellion, one Emperor gave life, another lost the throne.

We generally translate endian as "byte order", and call big endian and little endian the "big-end" and "little-end" orders.

1, Character encoding and inner code, with a brief introduction to Chinese character encodings
Characters must be encoded before a computer can process them. The default encoding a computer uses is its inner code. Early computers used 7-bit ASCII. To handle Chinese characters, programmers designed GB2312 for Simplified Chinese and Big5 for Traditional Chinese.

GB2312 (1980) contains 7,445 characters in total, including 6,763 Chinese characters and 682 other symbols. The inner-code range of the Chinese character area is B0-F7 for the high byte and A1-FE for the low byte, occupying 72*94 = 6,768 code positions, of which 5 are vacant: D7FA-D7FE.

GB2312 supports too few Chinese characters. The 1995 Chinese character extension specification GBK 1.0 covers 21,886 symbols, divided into a Chinese character area and a graphic symbol area. The Chinese character area contains 21,003 characters.

From ASCII and GB2312 to GBK, these encodings are backward compatible: the same character always has the same encoding in all of them, and each later standard supports more characters. In these encodings, English and Chinese can be handled uniformly; the way to tell that a byte begins a Chinese character is that the high bit of the high byte is not 0. In programmers' jargon, GB2312 and GBK both belong to the double-byte character sets (DBCS).

The 2000 standard GB18030 is the official national standard that replaced GBK 1.0. It contains 27,484 Chinese characters and also covers Tibetan, Mongolian, Uyghur and other major minority scripts. On top of the 20,902 Chinese characters of GB13000.1, GB18030 adds the 6,582 characters of CJK Extension A (Unicode 0x3400-0x4DB5), giving 27,484 Chinese characters in total.

CJK stands for China, Japan and Korea: to save code positions, Unicode encodes the characters of the three languages in a unified way. GB13000.1 is the Chinese version of ISO/IEC 10646-1, equivalent to Unicode 1.1.

The GB18030 encoding uses single-byte, double-byte and four-byte schemes, and its single-byte and double-byte parts are fully compatible with GBK. The four-byte part encodes, among others, the 6,582 characters of CJK Extension A. For example, UCS 0x3400 is encoded in GB18030 as 81 39 EF 30, and UCS 0x3401 as 81 39 EF 31.
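If you want to check the four-byte case yourself, Python's built-in gb18030 codec can serve as a quick sketch (assuming its mapping table matches the standard; the expected values are the ones quoted above):

print("\u3400".encode("gb18030").hex())   # expected: 8139ef30
print("\u3401".encode("gb18030").hex())   # expected: 8139ef31
print("汉".encode("gb18030") == "汉".encode("gbk"))   # True: the double-byte part is GBK-compatible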

Microsoft has provided a GB18030 upgrade package, but it only provides a new font (新宋体-18030, i.e. NSimSun-18030) that supports the 6,582 characters of CJK Extension A; it does not change the inner code. The inner code of Windows is still GBK.

Here are some details:

The original text of GB2312 is actually the location code; to get from the location code to the inner code, A0 must be added to the high byte and the low byte respectively.

For any character encoding, the order of the encoding units is specified by the encoding scheme and is not affected by endianness. For example, the encoding unit of GBK is the byte, and a Chinese character is represented by two bytes; the order of those two bytes is fixed and is not affected by CPU byte order. The encoding unit of UTF-16 is the 16-bit word (two bytes): the order between words is specified by the encoding scheme, but the byte arrangement inside a word is affected by endianness. UTF-16 is introduced later.

Both bytes of GB2312 have their high bits set to 1, but only 128*128 = 16,384 code positions satisfy that condition, so in GBK and GB18030 the high bit of the low byte may not be 1. This does not affect parsing of a DBCS character stream: when reading such a stream, as soon as you encounter a byte whose high bit is 1, you can treat the next two bytes as one double-byte character, without caring about the high bit of the low byte.
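A sketch of that parsing rule (the helper name split_dbcs is made up for illustration): walk the byte stream and, whenever a byte has its high bit set, consume two bytes as one character.

def split_dbcs(stream: bytes):
    """Split a GB2312/GBK byte stream into per-character byte groups."""
    chars = []
    i = 0
    while i < len(stream):
        if stream[i] & 0x80:              # high bit set: lead byte of a double-byte character
            chars.append(stream[i:i + 2])
            i += 2
        else:                             # plain ASCII byte
            chars.append(stream[i:i + 1])
            i += 1
    return chars

print(split_dbcs("GB汉字ok".encode("gbk")))
# [b'G', b'B', b'\xba\xba', b'\xd7\xd6', b'o', b'k']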

2. Unicode, UCS and UTF
The encodings above, from ASCII and GB2312 to GBK and GB18030, are backward compatible. Unicode is only compatible with ASCII (more precisely, with ISO-8859-1) and is not compatible with the GB encodings. For example, the Unicode code of the character "汉" is 6C49, while its GB code is BABA.

Unicode is also a character encoding method, but it was designed by international organizations to accommodate all the languages and scripts of the world. The formal name of Unicode is "Universal Multiple-Octet Coded Character Set", abbreviated UCS. UCS can also be regarded as short for "Unicode Character Set".

According to Wikipedia (http://zh.wikipedia.org/wiki/): historically, two organizations tried to design a unified character set independently, namely the International Organization for Standardization (ISO) and a consortium of software manufacturers (unicode.org). ISO developed the ISO 10646 project, and the Unicode Consortium developed the Unicode project.

Around 1991, the two sides realized that the world did not need two incompatible character sets, so they began to merge their work and cooperate on a single code table. From Unicode 2.0 on, the Unicode project has used the same character repertoire and code points as ISO 10646-1.

Both projects still exist today and publish their standards independently. The latest version from the Unicode Consortium is Unicode 4.1.0 (2005); the latest ISO standard is ISO 10646-3:2003.

UCS only specifies how characters are assigned codes; it does not specify how the codes are transmitted or stored. For example, the UCS code of "汉" is 6C49. I could transmit and store it as the four ASCII digits "6C49", or represent it in UTF-8 as three consecutive bytes: E6 B1 89. The key point is that both sides of the communication agree on the scheme. UTF-8, UTF-7 and UTF-16 are all widely accepted schemes. A particular benefit of UTF-8 is that it is fully compatible with ASCII. UTF is short for "UCS Transformation Format".

The IETF's RFC 2781 and RFC 3629 describe the UTF-16 and UTF-8 encodings in the consistent RFC style: clear, crisp and rigorous. I can never quite remember that IETF stands for Internet Engineering Task Force, but the RFCs it maintains are the basis of all specifications on the Internet.

2.1, Inner code and code page
The Windows kernel already supports the Unicode character set, so the kernel can support every script in the world. However, because a large number of existing programs and documents use encodings of particular languages, such as GBK, it is not feasible for Windows to drop support for the existing encodings and switch entirely to Unicode.

Windows uses code pages to adapt to different countries and regions. A code page can be understood as the inner code mentioned earlier. The code page corresponding to GBK is CP936.

Microsoft has also defined a code page for GB18030: CP54936. However, because GB18030 contains some four-byte encodings while Windows code pages only support single-byte and double-byte encodings, this code page cannot actually be used.

3, UCS-2, UCS-4, BMP
UCS comes in two forms: UCS-2 and UCS-4. As the names suggest, UCS-2 encodes with two bytes and UCS-4 with four bytes (actually only 31 bits are used; the highest bit must be 0). Let's do some simple arithmetic:

UCS-2 has 2^16 = 65,536 code positions; UCS-4 has 2^31 = 2,147,483,648 code positions.

UCS-4 is divided into 2^7 = 128 groups according to the highest byte (whose highest bit is 0). Each group is divided into 256 planes according to the next-highest byte. Each plane is divided into 256 rows according to the third byte, and each row contains 256 cells. Of course, cells in the same row differ only in the last byte; the rest is identical.

Plane 0 of group 0 is called the Basic Multilingual Plane, or BMP. In other words, the code positions of UCS-4 whose two high bytes are 0 form the BMP.

Removing the two leading zero bytes from the BMP of UCS-4 gives UCS-2; prepending two zero bytes to a UCS-2 code gives the corresponding BMP code of UCS-4. In the current UCS-4 specification, no characters are yet allocated outside the BMP.
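The decomposition above is just byte arithmetic. A small sketch (illustrative only, with a made-up helper name) makes the group/plane/row/cell split and the UCS-2 relationship explicit:

def ucs4_parts(code: int):
    """Split a UCS-4 code position into (group, plane, row, cell)."""
    return (code >> 24) & 0x7F, (code >> 16) & 0xFF, (code >> 8) & 0xFF, code & 0xFF

print(ucs4_parts(0x00006C49))          # (0, 0, 108, 73): group 0, plane 0, so "汉" lies in the BMP
assert 0x00006C49 & 0xFFFF == 0x6C49   # dropping the two zero bytes gives the UCS-2 code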

4, UTF encodings

UTF-8 encodes UCS in 8-bit units. The mapping from UCS-2 to UTF-8 is as follows:

UCS-2 code (hex)    UTF-8 byte stream (binary)
0000-007F           0xxxxxxx
0080-07FF           110xxxxx 10xxxxxx
0800-FFFF           1110xxxx 10xxxxxx 10xxxxxx

For example, the Unicode code of "汉" is 6C49. Since 6C49 lies between 0800 and FFFF, the three-byte template must be used: 1110xxxx 10xxxxxx 10xxxxxx. Writing 6C49 in binary gives 0110 1100 0100 1001. Splitting this into groups of 4, 6 and 6 bits (0110, 110001, 001001) and substituting them for the x's in the template, from left to right, gives 11100110 10110001 10001001, i.e. E6 B1 89.
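The same bit-filling can be done mechanically. A sketch using only the three-byte template above, checked against Python's own UTF-8 encoder:

code = 0x6C49   # "汉": between 0800 and FFFF, so the three-byte template applies

byte1 = 0b11100000 | (code >> 12)           # 1110xxxx <- top 4 bits
byte2 = 0b10000000 | ((code >> 6) & 0x3F)   # 10xxxxxx <- middle 6 bits
byte3 = 0b10000000 | (code & 0x3F)          # 10xxxxxx <- low 6 bits

print(bytes([byte1, byte2, byte3]).hex())   # e6b189
print("汉".encode("utf-8").hex())           # e6b189, the same result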

Readers can use Notepad to test whether our encoding is correct. Note that UltraEdit automatically converts to UTF-16 when it opens a UTF-8 encoded file, which can be confusing; you can turn this option off in its settings. A better tool is Hex Workshop.

UTF-16 encodes UCS in 16-bit units. For UCS codes less than 0x10000, the UTF-16 encoding is simply the 16-bit unsigned integer equal to the UCS code. For UCS codes of 0x10000 and above, an algorithm is defined. Since the UCS-2 actually in use, i.e. the BMP of UCS-4, is necessarily below 0x10000, for now UTF-16 and UCS-2 can be regarded as basically the same thing. But UCS-2 is only an encoding scheme, whereas UTF-16 is used for actual transmission, so the question of byte order has to be considered.

5, UTF byte order and BOM
UTF-8 uses the byte as its encoding unit, so it has no byte-order problem. UTF-16 uses two bytes as its encoding unit, so before interpreting a UTF-16 text we must first work out the byte order of each unit. For example, the Unicode code of "奎" (kui) is 594E and that of "乙" (yi) is 4E59. If we receive the UTF-16 byte stream "59 4E", is it "奎" or "乙"?

The method the Unicode specification recommends for marking byte order is the BOM. This BOM is not the "Bill of Material" BOM but the Byte Order Mark, and it is a rather clever idea:

The UCS character set contains a character called ZERO WIDTH NO-BREAK SPACE, whose code is FEFF. FFFE, on the other hand, is not a valid UCS character, so it should never appear in an actual transmission. The UCS specification recommends transmitting the character ZERO WIDTH NO-BREAK SPACE before transmitting the byte stream.

Thus, if the receiver sees FEFF, the byte stream is big-endian; if it sees FFFE, the byte stream is little-endian. That is why the character ZERO WIDTH NO-BREAK SPACE is also called the BOM.

UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding itself. The UTF-8 encoding of the character ZERO WIDTH NO-BREAK SPACE is EF BB BF (readers can verify this with the encoding method described above). So if the receiver sees a byte stream that begins with EF BB BF, it knows the stream is UTF-8.

Windows uses a BOM to mark the encoding of a text file.
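This also answers Question one at the beginning: Notepad writes one of these signatures in front of the text, and a reader can simply dispatch on them. A minimal sketch (the function name and the file name sample.txt are made up here; the fallback to GBK assumes a Chinese Windows default code page):

def guess_encoding_by_bom(data: bytes) -> str:
    """Guess a text file's encoding from its leading BOM bytes."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"   # EF BB BF: UTF-8 (this codec strips the signature)
    if data.startswith(b"\xff\xfe") or data.startswith(b"\xfe\xff"):
        return "utf-16"      # FF FE / FE FF: the utf-16 codec reads the BOM itself
    return "gbk"             # no BOM: assume the default ANSI code page (CP936)

with open("sample.txt", "rb") as f:
    raw = f.read()
print(raw.decode(guess_encoding_by_bom(raw)))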

6. Further reference materials
The main reference for this article is "A short overview of ISO/IEC 10646 and Unicode" (http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html).

I also found two articles that looked good, but since I had already got answers to my original questions, I did not read them:

"Understanding Unicode A General Introduction to the Unicode Standard" (http://scripts.sil.org/cms/scripts/page.php? SITE_ID=NRSI&ITEM_ID=IWS-CHAPTER04A)
"Character Set Encoding Basics Understanding Character Set encodings and Legacy encodings" (HTTP://SCRIPTS.SIL.ORG/CMS/SC RIPTS/PAGE.PHP?SITE_ID=NRSI&ITEM_ID=IWS-CHAPTER03)
I have written some software packages for converting between UTF-8, UCS-2 and GBK, including versions that use the Windows API and versions that do not. If I have time later, I will tidy them up and put them on my personal homepage (http://fmddlmyy.home4u.china.com).

I had thought all these questions through before starting to write, and expected to finish the article quickly. But weighing the wording and verifying the details took a long time, and I ended up writing from 1:30 to 9:00. I hope some readers will benefit from it.

Appendix 1: The location code, GB2312, inner code and code page
Some friends have had questions about this sentence in the article:
"The original text of GB2312 is actually the location code; to get from the location code to the inner code, A0 must be added to the high byte and the low byte respectively."

Let me explain in more detail:

"GB2312 of the original" refers to a national 1980 standard "The People's Republic of China National Standard Information Interchange Chinese character coded character set basic Set Gb2312-80". This standard uses two numbers to encode Chinese characters and Chinese symbols. The first number is called "District", and the second number is called "bit". So also known as location code. 1-9 is a Chinese symbol, 16-55 is a class of Chinese characters, and 56-87 is a two-level Chinese character. Now windows also has location input method, such as input 1601 to get "ah". (This location input method can automatically identify the 16-GB2312 and 10-in-system location code, which means that the input b0a1 will also get "ah". )

Inner code refers to the character encoding used inside the operating system. The inner code of early operating systems was language-dependent. Today's Windows supports Unicode internally and uses code pages to accommodate the various languages, so the concept of "inner code" has become blurrier. Microsoft generally refers to the encoding of the default code page as the inner code.

There is no official definition of the term "inner code", and "code page" is simply the name Microsoft uses. As programmers, we just need to know what these things are; there is no need to study the terminology too deeply.

A so-called code page is a character encoding for the text of a particular language. For example, GBK's code page is CP936, Big5's code page is CP950, and GB2312's code page is CP20936.

Windows has the notion of a default code page, i.e. the encoding used to interpret characters when none is specified. For example, suppose Windows Notepad opens a text file whose content is the byte stream BA, BA, D7, D6. How should Windows interpret it?

Should it be interpreted as Unicode, as GBK, as Big5, or as ISO-8859-1? If it is interpreted as GBK, you get the two characters "汉字" ("Chinese characters"). Interpreted under other encodings, the corresponding characters may not exist or the wrong characters may be found. "Wrong" here means not matching what the author of the text intended, which is when garbled text (mojibake) appears.
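The effect is easy to reproduce. A sketch showing the same byte stream interpreted under several code pages (only the GBK result is the text the author intended):

raw = bytes([0xBA, 0xBA, 0xD7, 0xD6])   # the byte stream Notepad sees

for codec in ("gbk", "big5", "latin-1"):
    print(codec, "->", raw.decode(codec, errors="replace"))
# gbk     -> 汉字 (the intended text)
# big5    -> two unrelated characters, i.e. mojibake
# latin-1 -> ºº×Ö (four unrelated Latin-1 characters)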

The answer is that Windows interprets the byte stream in a text file according to the current default code page, which can be set via the Regional Options in Control Panel. The "ANSI" item in Notepad's Save dialog actually means saving with the encoding of the default code page.

The inner code of Windows is Unicode, so technically Windows can support several code pages at the same time. As long as a file declares which encoding it uses and the user has installed the corresponding code page, Windows can display it correctly; for example, you can specify the charset in an HTML file.

Some HTML authors, especially English-speaking ones, assume that everyone in the world uses English and do not specify a charset in the file. If such an author uses characters in the range 0x80-0xFF, a Chinese Windows system will interpret them according to the default GBK code page and garbled text will appear. In that case, simply add a charset declaration to the HTML file, for example:
<meta http-equiv="Content-Type" content="text/html; charset=iso8859-1">
As long as the code page the original author used is compatible with ISO-8859-1, no garbled text will appear.

One more word about the location code: the location code of "啊" is 1601, which written in hexadecimal is 0x10, 0x01. This conflicts with the ASCII encoding so widely used on computers, so in order to stay compatible with the ASCII range 00-7F, A0 is added to both the high byte and the low byte of the location code; the code of "啊" thus becomes B0A1. The encoding with the two A0s added is also commonly called the GB2312 encoding, although the original text of GB2312 never mentions this at all.
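A sketch of that arithmetic (the helper name is made up), taking the zone and position as decimal numbers:

def location_to_gb2312(zone: int, position: int) -> bytes:
    """Convert a GB2312 location code (zone, position) to its inner code by adding A0 to each byte."""
    return bytes([zone + 0xA0, position + 0xA0])

print(location_to_gb2312(16, 1).hex())           # b0a1, the inner code of "啊"
print(location_to_gb2312(16, 1).decode("gbk"))   # 啊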
