Introduction to character encoding: ASCII, Unicode, UTF-8, GB2312, and how Unicode and UTF-8 convert


Reference:
http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html
http://www.cnblogs.com/mjgforever/archive/2008/02/27/1083135.html

1. ASCII code

Inside a computer, all information is ultimately represented as a binary string. Each bit has two states, 0 and 1, so eight bits can be combined into 256 states; a group of eight bits is called a byte. In other words, one byte can represent 256 different states, each corresponding to one symbol, for 256 symbols in total, from 00000000 to 11111111.

In the 1960s, the United States developed a character encoding that standardized the relationship between English characters and bit patterns. It is known as ASCII and is still in use today.

ASCII specifies 128 characters in total. For example, the space character is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 33 control characters that cannot be printed) use only the low 7 bits of a byte; the highest bit is always 0.
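The two values quoted above can be checked with a short Python sketch. It only uses the built-in `ord` and `format` functions; the variable names are illustrative.

```python
# Look up the ASCII codes quoted above and show them as 8-bit binary.
for ch in (" ", "A"):
    code = ord(ch)
    print(repr(ch), code, format(code, "08b"))

# Decode all 128 ASCII byte values and collect the top bit of each code.
# Every code fits in 7 bits, so the only top-bit value seen is 0.
all_ascii = bytes(range(128)).decode("ascii")
top_bits = {ord(c) >> 7 for c in all_ascii}
print(len(all_ascii), top_bits)
```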

2. Non-ASCII encoding

128 symbols are enough for English, but not for other languages. In French, for example, letters can carry accent marks, which ASCII cannot represent. Some European countries therefore decided to use the idle highest bit of the byte to add new symbols. For example, the code for é in the French encoding is 130 (binary 10000010). In this way, the encodings used in these European countries can represent at most 256 symbols.

This creates a new problem. Different countries use different letters, so although they all use 256-symbol encodings, the same code stands for different letters. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, however, 0-127 represent the same symbols; only the range 128-255 differs.
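The clash over byte 130 can be reproduced in Python. The text does not name the specific encodings, so this sketch assumes the legacy DOS code pages `cp437` (Western European), `cp862` (Hebrew), and `cp866` (Russian) as stand-ins.

```python
# The single byte 130 (hex 82), decoded under three assumed code pages.
raw = bytes([130])
decoded = {name: raw.decode(name) for name in ("cp437", "cp862", "cp866")}
for name, ch in decoded.items():
    print(name, ch, f"U+{ord(ch):04X}")

# Below 128 the three code pages agree with ASCII, e.g. byte 65 is "A".
low_half_same = all(bytes([65]).decode(name) == "A" for name in decoded)
print(low_half_same)
```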

Asian languages use far more symbols; there are roughly 100,000 Chinese characters. One byte can represent only 256 symbols, which is clearly not enough, so multiple bytes must be used for a single symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per Chinese character and can therefore represent at most 256 x 256 = 65,536 symbols in theory.

Chinese encoding deserves a dedicated discussion and is not the main subject of this note. The point here is only that, although the GB family of Chinese encodings also uses multiple bytes per symbol, it is unrelated to Unicode and UTF-8.

3. Unicode

Work on the Universal Character Set (UCS) began in April 1984, when ISO established the ISO/IEC JTC1/SC2/WG2 working group to encode the scripts and symbols of every country. In 1991, a group of American multinational companies founded the Unicode Consortium, which in October 1991 agreed with WG2 to adopt the same coded character set. In the version described here, Unicode is a 16-bit encoding system whose character set matches the ISO 10646 BMP (Basic Multilingual Plane).

Unicode passed DIS (Draft International Standard) in June 1992. Version 2.0, published in 1996, contains 6,811 symbols, 20,902 Chinese characters, 11,172 Korean Hangul syllables, 6,400 user-defined character positions, and 20,249 reserved positions, for a total of 65,534.

In this 16-bit form, every character occupies the same amount of space. For example, the English letter "a" and the Chinese character "好" ("good") both take two bytes.

In this form, Unicode can represent the characters of every language with a fixed-length, two-byte encoding (a four-byte form also exists), and this includes the English alphabet. It is therefore not byte-compatible with ISO 8859-1, nor with any other earlier encoding. Compared with ISO 8859-1, however, the Unicode encoding simply adds a zero byte in front: the letter "a" becomes "00 61".
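The "extra zero byte" can be seen in a small Python sketch. Python has no codec named UCS-2, so this assumes `utf-16-be`, which produces the same two bytes for characters in the Basic Multilingual Plane.

```python
# One byte in ISO 8859-1 (Latin-1) versus two bytes in 16-bit Unicode.
latin1 = "a".encode("latin-1")
ucs2_be = "a".encode("utf-16-be")  # assumption: stands in for UCS-2, big endian

print(latin1.hex(), ucs2_be.hex())  # 61 and 0061
```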

Note that fixed-length encodings are convenient for computer processing (GB2312/GBK is not fixed-length), and Unicode can represent all characters, so much software, such as Java, uses Unicode internally.

Unicode is of course a very large set; it now has room for more than one million symbols. Each symbol has its own code: for example, U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严 ("strict"). For the full symbol tables, see http://www.unicode.org/ or a dedicated Chinese character table such as http://www.chi2ko.com/tool/CJK.htm
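The three code points above can be looked up with Python's standard `unicodedata` module. This is only a quick check of the examples, not a substitute for the official tables.

```python
import unicodedata

# Map each code point from the text to its character and official name.
names = {cp: unicodedata.name(chr(cp)) for cp in (0x0639, 0x0041, 0x4E25)}
for cp, name in names.items():
    print(f"U+{cp:04X}", chr(cp), name)
```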

4. Problems with Unicode

It is important to note that Unicode is just a set of symbols, which only specifies the binary code of the symbol, but does not specify how the binary code should be stored.

For example, the Unicode code of the Chinese character 严 ("strict") is hexadecimal 4E25, which is 15 bits long in binary (100111000100101). Representing this symbol therefore takes at least 2 bytes. Larger code values may take 3 bytes, 4 bytes, or more.

There are two serious problems here:

    • The first problem is how to tell Unicode apart from ASCII. How does the computer know that three bytes represent one symbol rather than three separate symbols?

    • The second problem is storage. One byte is enough for an English letter. If Unicode required every symbol to use three or four bytes, two or three bytes of every English letter would be 0, which is a great waste: English text files would grow to several times their size, which is unacceptable.

These problems had two results: 1) several ways of storing Unicode appeared, so there are many different binary formats that can represent Unicode; 2) Unicode was not widely adopted for a long time, until the rise of the Internet.
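Both points can be illustrated with Python's built-in codecs. The sample string and the choice of `utf-16-be` and `utf-32-be` as fixed-width formats are assumptions made for this sketch.

```python
# Storage cost of plain English text under fixed-width encodings.
text = "hello world"
sizes = {enc: len(text.encode(enc)) for enc in ("ascii", "utf-16-be", "utf-32-be")}
print(sizes)

# One code point (U+4E25), several different byte sequences on disk.
forms = {enc: "\u4e25".encode(enc).hex() for enc in ("utf-16-be", "utf-16-le", "utf-8")}
print(forms)
```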

5. UTF-8

The spread of the Internet created strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet. To repeat the key point: UTF-8 is one way of implementing Unicode.

The most notable feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, with the length depending on the symbol.

The encoding rules of UTF-8 are simple; there are only two:

    1. For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code. For English letters, UTF-8 and ASCII are therefore identical. (Standard ASCII, also called basic ASCII, uses a 7-bit binary number to represent all uppercase and lowercase letters, the digits 0 through 9, punctuation, and the special control characters used in American English.)

    2. For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of every following byte are set to 10. All remaining bits are filled with the symbol's Unicode code.

The following table summarizes the encoding rules. The letter x marks the bits available for encoding.

Unicode symbol range | UTF-8 encoding
(hexadecimal)        | (binary)
---------------------+-------------------------------------
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
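The table translates directly into a range check. The sketch below uses a hypothetical helper, `utf8_length`, and assumes U+10FFFF as the upper limit of Unicode; it ignores the surrogate range, which cannot be encoded on its own.

```python
def utf8_length(code_point: int) -> int:
    """Return how many bytes UTF-8 needs for a code point, per the table."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if code_point <= 0x7F:
        return 1
    if code_point <= 0x7FF:
        return 2
    if code_point <= 0xFFFF:
        return 3
    return 4

# Compare against Python's own encoder for one sample from each row.
samples = (0x41, 0xE9, 0x4E25, 0x1F600)
lengths = [(utf8_length(cp), len(chr(cp).encode("utf-8"))) for cp in samples]
print(lengths)
```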

Each row of the table can be derived in three steps:
1. Determine where the Unicode range starts.
2. Determine how many bytes UTF-8 uses for it.
3. Determine where the Unicode range ends.
Analysis

0000 0000-0000 007F (hex) | 0xxxxxxx (binary) correspondence:

    1. The Unicode range starts at 0000 0000.
    2. Since the range starts at 0, UTF-8 starts with one byte.
    3. By the UTF-8 rules, the first bit of a single-byte symbol is set to 0, leaving 7 bits for data. The maximum 7-bit binary value is 1111111, which is 7F in hexadecimal.

0000 0080-0000 07FF (hex) | 110xxxxx 10xxxxxx (binary) correspondence:

    1. The Unicode range starts at 0000 0080.
    2. Since one UTF-8 byte can only reach 0000 007F, codes from 0000 0080 onward must use two bytes.
    3. By the UTF-8 rules, for an n-byte symbol (n > 1) the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of each following byte are set to 10. A two-byte UTF-8 sequence must therefore have the form 110xxxxx 10xxxxxx, where the x positions (xxxxx xxxxxx) hold the data. The maximum 11-bit binary value is 111 1111 1111, which is 7FF in hexadecimal. So two-byte UTF-8 covers the Unicode range 0000 0080-0000 07FF.

The following uses the Chinese character 严 ("strict") as an example to show how UTF-8 encoding works.

The Unicode code of 严 is 4E25 (binary 100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 needs three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last bit of 严, fill in the x positions of the format from back to front, and pad any leftover positions with 0. The result is that the UTF-8 encoding of 严 is "11100100 10111000 10100101", which is E4B8A5 in hexadecimal.
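The fill-from-the-back procedure can be written out by hand. The function below, `encode_utf8`, is a hypothetical name for a minimal sketch; it assumes a valid code point and does not reject surrogates, which Python's built-in encoder does.

```python
def encode_utf8(code_point: int) -> bytes:
    """Encode one code point by filling the x bits from the last byte backwards."""
    if not 0 <= code_point <= 0x10FFFF:
        raise ValueError("code point outside the Unicode range")
    if code_point <= 0x7F:
        return bytes([code_point])  # single byte: 0xxxxxxx
    if code_point <= 0x7FF:
        n, lead = 2, 0b11000000
    elif code_point <= 0xFFFF:
        n, lead = 3, 0b11100000
    else:
        n, lead = 4, 0b11110000
    out = []
    for _ in range(n - 1):
        # Each continuation byte is 10xxxxxx and carries the low 6 bits.
        out.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6
    # Whatever is left fits in the x bits of the leading byte.
    out.append(lead | code_point)
    return bytes(reversed(out))

encoded = encode_utf8(0x4E25)
print(" ".join(format(b, "08b") for b in encoded), encoded.hex().upper())
```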

6. Conversion between Unicode and UTF-8

As the example in the previous section shows, the Unicode code of 严 ("strict") is 4E25 while its UTF-8 encoding is E4B8A5; the two are not the same. Converting between them can be done by a program.
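As one possible sketch of such a program, the reverse direction (UTF-8 bytes back to a Unicode code) can be written as follows. `decode_utf8_char` is a hypothetical helper that handles a single character and skips some validation a real decoder performs, such as rejecting overlong forms.

```python
def decode_utf8_char(data: bytes) -> int:
    """Decode the UTF-8 bytes of exactly one character to its code point."""
    first = data[0]
    if first < 0x80:                 # 0xxxxxxx
        n, cp = 1, first
    elif first >> 5 == 0b110:        # 110xxxxx
        n, cp = 2, first & 0b00011111
    elif first >> 4 == 0b1110:       # 1110xxxx
        n, cp = 3, first & 0b00001111
    elif first >> 3 == 0b11110:      # 11110xxx
        n, cp = 4, first & 0b00000111
    else:
        raise ValueError("invalid leading byte")
    if len(data) != n:
        raise ValueError("wrong number of bytes for this leading byte")
    for b in data[1:]:
        if b >> 6 != 0b10:
            raise ValueError("continuation byte must start with 10")
        cp = (cp << 6) | (b & 0b00111111)
    return cp

code = decode_utf8_char(bytes.fromhex("E4B8A5"))
print(f"{code:04X}")

# In practice the built-in codec performs the same conversion.
builtin = bytes.fromhex("E4B8A5").decode("utf-8")
```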

On Windows, one of the simplest ways to convert is to use the built-in Notepad program, Notepad.exe. After opening a file, click "Save As" on the "File" menu; a dialog box appears with an "Encoding" drop-down list at the bottom.

It offers four options: ANSI, Unicode, Unicode big endian, and UTF-8.

1) ANSI is the default encoding. For English files it is ASCII; for Simplified Chinese files it is GB2312 (only on the Simplified Chinese version of Windows; the Traditional Chinese version uses Big5).

2) Unicode refers to the UCS-2 encoding, which stores each character's Unicode code directly in two bytes. This option uses the little endian format.

3) The Unicode big endian encoding corresponds to the previous option. In the next section I will explain the meaning of little endian and big endian.

4) UTF-8 encoding, which is the encoding method mentioned in the previous section.

After selecting the encoding, click the "Save" button and the file is converted to that encoding immediately.

7. Little endian and big endian

As mentioned in the previous section, Unicode codes can be stored directly in the UCS-2 format. Take the Chinese character 严 ("strict") as an example: its Unicode code is 4E25, which needs two bytes, one byte 4E and the other 25. Storing 4E first and 25 second is the big endian way; storing 25 first and 4E second is the little endian way.

These two odd names come from Gulliver's Travels by the British writer Jonathan Swift. In the book, a civil war breaks out in Lilliput over whether an egg should be cracked open at the big end (Big-endian) or at the little end (Little-endian). Six wars were fought over this question; one emperor lost his life and another lost his throne.

Therefore, storing the first (high) byte in front is "big endian", and storing the second (low) byte in front is "little endian".

Then, naturally, there is a problem: How does the computer know which encoding to use for a particular file?

The Unicode specification defines a character that is placed at the very beginning of a file to indicate the byte order. It is named ZERO WIDTH NO-BREAK SPACE and has the code FEFF. This is exactly two bytes, and FF is 1 larger than FE.

If the first two bytes of a text file are FE FF, the file is big endian; if the first two bytes are FF FE, the file is little endian.
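This check is easy to express in Python using the byte-order-mark constants in the standard `codecs` module. The function name `detect_utf16_byte_order` is hypothetical, and the sketch only covers the two-byte case described above.

```python
import codecs

def detect_utf16_byte_order(data: bytes) -> str:
    """Classify a byte string by its first two bytes (FE FF or FF FE)."""
    if data.startswith(codecs.BOM_UTF16_BE):  # b"\xfe\xff"
        return "big endian"
    if data.startswith(codecs.BOM_UTF16_LE):  # b"\xff\xfe"
        return "little endian"
    return "unknown"

print(detect_utf16_byte_order(bytes.fromhex("FEFF4E25")))
print(detect_utf16_byte_order(bytes.fromhex("FFFE254E")))
```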

8. Example

Open Notepad (Notepad.exe), create a new text file containing the single character 严 ("strict"), and save it in turn with the ANSI, Unicode, Unicode big endian, and UTF-8 encodings.

Then use the hex mode of the text editor UltraEdit to see how each file is encoded internally.

1) ANSI: the file is two bytes, "D1 CF", which is the GB2312 encoding of 严. This also implies that GB2312 is stored big endian.

2) Unicode: the file is four bytes, "FF FE 25 4E", where "FF FE" indicates little endian storage and the actual code is 4E25.

3) Unicode big endian: the file is four bytes, "FE FF 4E 25", where "FE FF" indicates big endian storage.

4) UTF-8: the file is six bytes, "EF BB BF E4 B8 A5". The first three bytes, "EF BB BF", indicate that this is UTF-8, and the last three, "E4 B8 A5", are the actual encoding of 严; its storage order matches the encoding order.
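The same four byte sequences can be reproduced without Notepad. This sketch assumes that Python's `gb2312`, `utf-16-le`, `utf-16-be`, and `utf-8-sig` codecs correspond to the four Notepad options; the byte-order marks for the two Unicode options are added explicitly.

```python
import codecs

ch = "\u4e25"  # the character used in the example, U+4E25

# Assumed mapping from Notepad's four options to Python codecs.
saved = {
    "ANSI": ch.encode("gb2312"),
    "Unicode": codecs.BOM_UTF16_LE + ch.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + ch.encode("utf-16-be"),
    "UTF-8": ch.encode("utf-8-sig"),  # utf-8-sig prepends EF BB BF
}
for name, data in saved.items():
    print(f"{name:20}", data.hex(" ").upper())
```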

9. GB code

9.1 GB2312-80

The full name is GB2312-80, "Chinese Character Coded Character Set for Information Interchange, Basic Set". Published in 1980, it is the national standard for Chinese information processing and is mandatory in mainland China and in overseas regions that use Simplified Chinese (such as Singapore). P-Windows 3.2 and Apple OS use GB2312 as their basic Chinese character encoding; Windows 95/98 use GBK as their basic Chinese character encoding but remain compatible with GB2312.
Double-byte encoding
Range: A1A1~FEFE
A1-A9: Symbol area with 682 symbols
B0-F7: Chinese character area with 6,763 Chinese characters

9.2 GB2312

GB2312 (1980) contains 7,445 characters in total: 6,763 Chinese characters and 682 other symbols. In internal code, the Chinese character area has high bytes B0-F7 and low bytes A1-FE, which gives 72*94=6768 code positions, of which the 5 positions D7FA-D7FE are unused. GB2312-80 encodes each character in two bytes. In the 7-bit national standard code the highest bit of each byte is 0; the internal code ranges quoted above have the highest bit of both bytes set to 1. GB2312-80 is commonly abbreviated as GB code.
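The arithmetic behind these figures can be checked in a few lines. The constant names below are illustrative, and the last step assumes Python's built-in `gb2312` codec to confirm that a sample character falls inside the stated ranges.

```python
# Byte ranges of the Chinese character area, as given in the text.
HIGH = range(0xB0, 0xF7 + 1)
LOW = range(0xA1, 0xFE + 1)

positions = len(HIGH) * len(LOW)             # 72 * 94
unused = len(range(0xD7FA, 0xD7FE + 1))      # the 5 empty positions
hanzi = positions - unused
print(len(HIGH), len(LOW), positions, hanzi)

# A sample character: both bytes land inside the ranges above.
hi, lo = "\u4e25".encode("gb2312")
print(f"{hi:02X} {lo:02X}", hi in HIGH, lo in LOW)
```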

GB2312 supports too few Chinese characters. The 1995 Chinese character extension specification GBK 1.0 contains 21,886 symbols, divided into a Chinese character area and a graphic symbol area. The Chinese character area contains 21,003 characters.

9.3 GB12345-90

In 1990 the traditional-character encoding standard GB12345-90, "Chinese Character Coded Character Set for Information Interchange, First Supplementary Set", was issued to standardize the use of traditional characters where they are needed, such as the collation of ancient books. The standard contains 6,866 Chinese characters (103 more than GB2312; most fonts from other vendors do not include these characters), of which more than 2,200 are purely traditional characters.
Double-byte encoding
Range: A1A1~FEFE
A1-A9: Symbol area, with vertical-layout symbols added
B0-F9: Chinese character area with 6,866 Chinese characters

9.4 GBK

GBK (Chinese Internal Code Specification) is a Chinese internal code extension specification developed in mainland China, with a character repertoire equivalent to UCS. GBK can represent both traditional and simplified characters, whereas GB2312 can represent only simplified characters; GBK is compatible with GB2312. The GBK Working Group produced the GBK specification between October and December 1995. The specification contains 21,003 Chinese characters and 883 symbols, and provides 1,894 code positions for user-defined characters, with simplified and traditional characters in a single set. The Simplified Chinese versions of Windows 95/98 use GBK as the font-facing code and connect it to the underlying fonts through a one-to-one GBK-to-UCS mapping table.

English name: Chinese Internal Code Specification
Chinese name: Chinese Internal Code Extension Specification, version 1.0
Double-byte encoding; an extension of GB2312-80 that keeps the GB2312-80 code positions
Range: 8140~FEFE (excluding xx7F), a total of 23,940 code positions
Contains 21,003 Chinese characters, including all CJK Chinese characters in ISO/IEC 10646-1
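The code position count and the GB2312 compatibility can be sanity-checked in Python. This sketch assumes the built-in `gbk` and `gb2312` codecs, and uses U+4E25 and its traditional form U+56B4 as sample characters chosen for illustration.

```python
# Lead bytes 81-FE, trail bytes 40-FE excluding 7F.
lead_bytes = range(0x81, 0xFE + 1)
trail_bytes = [b for b in range(0x40, 0xFE + 1) if b != 0x7F]
code_positions = len(lead_bytes) * len(trail_bytes)
print(len(lead_bytes), len(trail_bytes), code_positions)

simplified, traditional = "\u4e25", "\u56b4"

# A GB2312 character keeps the same two bytes in GBK.
same_bytes = simplified.encode("gbk") == simplified.encode("gb2312")

# The traditional form is encodable in GBK but not in GB2312.
gbk_traditional = traditional.encode("gbk")
try:
    traditional.encode("gb2312")
    gb2312_has_traditional = True
except UnicodeEncodeError:
    gb2312_has_traditional = False
print(same_bytes, gb2312_has_traditional)
```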
