Character encoding Ultimate Note: ASCII, Unicode, UTF-8, UTF-16, UCS, BOM, Endian

Source: Internet
Author: User

Very detailed and very good; reposted here for study:

Reprinted from: http://www.cnblogs.com/lidabo/archive/2013/11/27/3446518.html

1. Character encoding and internal code, with a brief introduction to Chinese character encodings

Characters must be encoded before a computer can process them. The default encoding a computer uses is its internal code. Early computers used 7-bit ASCII; to handle Chinese characters, programmers designed GB2312 for Simplified Chinese and Big5 for Traditional Chinese.

GB2312 (1980) contains 7,445 characters in total: 6,763 Chinese characters and 682 other symbols. In the Chinese character area, the internal code has a high byte in the range B0-F7 and a low byte in the range A1-FE, which gives 72*94=6768 code points. Five of them, D7FA-D7FE, are unassigned.

GB2312 supports too few Chinese characters. The 1995 Chinese character extension specification GBK 1.0 contains 21,886 symbols, divided into a Chinese character area and a graphic symbol area; the Chinese character area contains 21,003 characters. GB18030 (2000) is the official national standard that replaces GBK 1.0. It contains 27,484 Chinese characters as well as the scripts of major ethnic minorities such as Tibetan, Mongolian, and Uyghur. PC platforms are now required to support GB18030, while embedded products are not, so mobile phones and MP3 players generally support only GB2312.

From ASCII and GB2312 through GBK to GB18030, each encoding is backward compatible with the previous one: the same character always has the same code in all of them, and each later standard supports more characters. English and Chinese can be handled uniformly in these encodings. A Chinese character is recognized by the highest bit of its high byte being 1. In programmers' terms, GB2312, GBK, and GB18030 are all double-byte character sets (DBCS); strictly speaking, GB18030 also defines four-byte codes, which the discussion below leaves aside.

Some Chinese versions of Windows still use GBK as the default internal code and can be upgraded to GB18030 with the GB18030 upgrade package. However, the characters GB18030 adds over GBK are ones ordinary users rarely need, so we usually still say GBK when referring to the internal code of Chinese Windows.

Here are some details:

The original GB2312 text defines location codes (row and cell numbers). To get from a location code to the internal code, add A0 to both the high byte and the low byte.
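The rule can be checked with a short Python sketch. The function name `quwei_to_internal` is made up for illustration; the example relies on 啊 having location code 16-01 in GB2312 and on Python's built-in `gb2312` codec.

```python
def quwei_to_internal(qu: int, wei: int) -> bytes:
    """Convert a GB2312 location code (row, cell) to the two-byte internal code."""
    # Add 0xA0 to the row (high byte) and to the cell (low byte).
    return bytes([qu + 0xA0, wei + 0xA0])


code = quwei_to_internal(16, 1)   # location code 16-01
print(code.hex().upper())         # B0A1
print(code.decode("gb2312"))      # 啊
```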

In a DBCS, the GB internal code is always stored big endian, that is, high byte first.

Both bytes of a GB2312 code have their highest bit set to 1, but only 128*128=16384 code points satisfy this condition. GBK and GB18030 therefore also use low bytes whose highest bit is not 1. This does not affect parsing of a DBCS byte stream: whenever the reader encounters a byte with its highest bit set, it can treat that byte and the next one as a double-byte code, regardless of the highest bit of the low byte.
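A minimal sketch of that parsing rule in Python is shown here. The function name `split_dbcs` is hypothetical, and the sketch assumes pure one-byte and two-byte codes (as in GB2312 and GBK); it does not handle the four-byte codes of GB18030.

```python
def split_dbcs(data: bytes) -> list[bytes]:
    """Split a DBCS byte stream into one-byte and two-byte units."""
    units = []
    i = 0
    while i < len(data):
        # High bit set: this byte and the next one form a double-byte code.
        if data[i] & 0x80 and i + 1 < len(data):
            units.append(data[i:i + 2])
            i += 2
        else:
            # High bit clear (or a dangling last byte): a single-byte unit.
            units.append(data[i:i + 1])
            i += 1
    return units


sample = "a汉b".encode("gbk")
print([unit.hex().upper() for unit in split_dbcs(sample)])  # ['61', 'BABA', '62']
```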

2. Unicode, UCS, and UTF

The encodings mentioned so far, from ASCII and GB2312 through GBK to GB18030, are backward compatible with one another. Unicode is compatible only with ASCII (more precisely, with ISO-8859-1) and is incompatible with the GB codes. For example, the Unicode code of the character 汉 ("Han") is 6C49, while its GB code is BABA.
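The two codes for 汉 can be compared directly in Python. This is only an illustration using the built-in `ord` function and the `gb2312` codec; the variable names are arbitrary.

```python
ch = "汉"
unicode_hex = format(ord(ch), "04X")        # Unicode code point in hex
gb_hex = ch.encode("gb2312").hex().upper()  # GB internal code in hex
print(unicode_hex, gb_hex)                  # 6C49 BABA
```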

Unicode is also a character encoding, but one designed by international organizations as a coding scheme that can accommodate all the world's languages. The formal name of Unicode is "Universal Multiple-Octet Coded Character Set", abbreviated UCS. UCS can also be read as an abbreviation of "Unicode Character Set".

According to Wikipedia, two organizations historically tried to design Unicode independently: the International Organization for Standardization (ISO) and an association of software manufacturers (unicode.org). ISO developed the ISO 10646 project, and the Unicode Consortium developed the Unicode project.

Around 1991, both sides recognized that the world did not need two incompatible character sets. They began to merge their work and cooperate on a single code table. Starting with Unicode 2.0, the Unicode project has used the same character repertoire and code points as ISO 10646-1.

Both projects still exist and publish their standards independently. At the time the source article was written, the latest Unicode Consortium version was Unicode 4.1.0 (2005) and the latest ISO standard was 10646-3:2003.

UCS specifies how multiple bytes are used to represent the various characters. How these codes are transmitted is specified by the UTF (UCS Transformation Format) specifications; common ones include UTF-8, UTF-7, and UTF-16.

The IETF's RFC 2781 and RFC 3629, in the usual RFC style, describe the UTF-16 and UTF-8 encodings clearly, crisply, and rigorously. IETF stands for Internet Engineering Task Force, and the RFCs it maintains are the basis for all specifications on the Internet.

1. ASCII code
We know that inside a computer all information is ultimately represented as a binary string. Each bit has two states, 0 and 1, so eight bits can combine into 256 states; this unit is called a byte. In other words, one byte can represent 256 different states, each corresponding to one symbol, for 256 symbols in all, from 00000000 to 11111111.

In the 1960s, the United States developed a character encoding that standardized the relationship between English characters and bit patterns. This is ASCII, which is still in use today.

ASCII specifies 128 characters in total; for example, the space is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 32 non-printing control symbols) occupy only the low 7 bits of a byte; the leading bit is always 0.
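The two examples above are easy to reproduce. The helper `ascii_bits` below is a made-up name for this sketch; it prints the 8-bit pattern of an ASCII character and rejects anything above 127.

```python
def ascii_bits(ch: str) -> str:
    """Return the 8-bit binary pattern of a single ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError(f"{ch!r} is not an ASCII character")
    return format(code, "08b")


print(ord(" "), ascii_bits(" "))  # 32 00100000
print(ord("A"), ascii_bits("A"))  # 65 01000001
```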

2. Non-ASCII encoding

128 symbols are enough to encode English, but not enough for other languages. French, for example, puts diacritical marks above letters, which ASCII cannot represent. So some European countries decided to use the idle highest bit of the byte to encode new symbols. For example, é in French is encoded as 130 (binary 10000010). In this way, the encodings used in these European countries can represent up to 256 symbols.

This creates new problems, however. Different countries have different letters, so although they all use 256-symbol encodings, the same code stands for different letters. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, though, 0-127 represent the same symbols; only the range 128-255 differs.
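This can be seen with Python's legacy code page codecs. The article does not name specific encodings; the choice of `cp437` (Western European DOS), `cp862` (Hebrew DOS), and `cp866` (Russian DOS) here is an assumption made for illustration.

```python
byte_130 = bytes([130])

# The same byte value decodes to a different letter in each code page.
decoded = {codec: byte_130.decode(codec) for codec in ("cp437", "cp862", "cp866")}
for codec, letter in decoded.items():
    print(codec, letter)

# Bytes in the range 0-127 decode identically in all three.
same_low_half = {bytes([65]).decode(codec) for codec in decoded}
print(same_low_half)  # {'A'}
```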

Asian languages use even more symbols; Chinese alone has as many as about 100,000 characters. One byte can represent only 256 symbols, which is certainly not enough, so multiple bytes must be used for one symbol. For example, the common Simplified Chinese encoding GB2312 uses two bytes per Chinese character, so in theory it can represent at most 256x256=65536 symbols.

Chinese encoding deserves an article of its own and is not covered in this note. The only point made here is that, although the GB family of encodings also uses multiple bytes per symbol, it is unrelated to Unicode and UTF-8.

3. Unicode

The formal name of Unicode is "Universal Multiple-Octet Coded Character Set", abbreviated UCS. UCS-2, which uses 2 bytes per character, is what is commonly used now; UCS-4 was defined in case 2 bytes prove insufficient in the future. The code space that UCS-2 covers is known as the Basic Multilingual Plane. Converting UCS-2 to UCS-4 simply means prepending two zero bytes. UCS-4 is mainly needed for characters in the supplementary planes.

As mentioned in the previous section, many encodings exist in the world, and the same binary number can be interpreted as different symbols. To open a text file you therefore need to know its encoding; interpreting it with the wrong one produces garbled text. Why do e-mails often appear garbled? Because the sender and the recipient are using different encodings.

Imagine an encoding that includes every symbol in the world and gives each a unique code: the garbled-text problem would disappear. That is Unicode; as its name suggests, it is an encoding of all symbols.

Unicode is of course a very large set, with room for more than one million symbols. (On Windows, text saved as "Unicode" is stored little endian by default.) Each symbol has its own code: for example, U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严 (yan, "strict"). For the full table of symbols, consult unicode.org or a dedicated Chinese character code table.

4. Problems with Unicode

Note that Unicode is only a character set: it specifies the binary code of each symbol, but not how that code should be stored.

For example, the Unicode code of the Chinese character 严 is hexadecimal 4E25, which is 15 bits long in binary (100111000100101), so this symbol needs at least 2 bytes. Symbols with larger codes may need 3 or 4 bytes, or even more.

This raises two serious problems. First, how can Unicode be distinguished from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols? Second, one byte is enough for an English letter; if Unicode required every symbol to use three or four bytes, every English letter would carry two or three zero bytes. That would be a great waste of storage, making text files two or three times larger, which is unacceptable.
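The storage cost is easy to measure. The sketch below uses Python's `utf-32-be` codec as a stand-in for "every symbol takes four bytes"; that choice is an assumption for illustration, not something the article specifies.

```python
text = "hello"

ascii_bytes = text.encode("ascii")      # one byte per letter
fixed4_bytes = text.encode("utf-32-be") # four bytes per letter, no BOM

print(len(ascii_bytes), len(fixed4_bytes))  # 5 20
print(fixed4_bytes[:4].hex().upper())       # 00000068 (three zero bytes, then 'h')
```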

The results were: 1) multiple ways of storing Unicode appeared, that is, many different binary formats can represent Unicode; 2) Unicode could not be widely adopted for a long time, until the Internet arrived.

5. UTF-8

The spread of the Internet created strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet. To repeat: UTF-8 is one way of implementing Unicode.

The most notable feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes to represent a symbol, and the number of bytes varies with the symbol.

The encoding rules of UTF-8 are simple; there are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits are the symbol's Unicode code. So for English letters, the UTF-8 encoding is identical to ASCII.

2) For an n-byte symbol (n>1), the first n bits of the first byte are set to 1 and bit n+1 is set to 0; the first two bits of each following byte are set to 10. All remaining bits are filled with the symbol's Unicode code.

The following table summarizes the encoding rules; the letter x marks the bits available for the code.

Unicode symbol range | UTF-8 encoding
(hexadecimal)        | (binary)
---------------------+-------------------------------------
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

Below, the Chinese character 严 is again used as an example to show how UTF-8 encoding works.

The Unicode code of 严 is 4E25 (100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 takes three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary digit of 严, fill in the x positions from back to front, padding the leftover positions with 0. The result is that the UTF-8 encoding of 严 is "11100100 10111000 10100101", which is E4B8A5 in hexadecimal.
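The table and the worked example can be turned into a small hand-written encoder. This is a sketch for illustration only: the function name `utf8_encode` is made up, it does not reject surrogate code points, and in practice Python's built-in `str.encode("utf-8")` should be used.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by filling the x bits of the table."""
    if code_point < 0 or code_point > 0x10FFFF:
        raise ValueError("code point out of range")
    if code_point <= 0x7F:        # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:      # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (code_point >> 18),
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])


encoded = utf8_encode(0x4E25)
print(encoded.hex().upper())                     # E4B8A5
print(" ".join(f"{b:08b}" for b in encoded))     # 11100100 10111000 10100101
```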

6. Conversion between Unicode and UTF-8

From the example in the previous section, the Unicode code of 严 is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. Converting between them can be done by a program.

On Windows, one of the simplest ways to convert is to use the built-in Notepad program, Notepad.exe. After opening a file, click "Save As" in the "File" menu; a dialog box appears with an "Encoding" drop-down at the bottom.

It has four options: ANSI, Unicode, Unicode big endian, and UTF-8.

1) ANSI is the default encoding. For English files it is ASCII; for Simplified Chinese files it is GB2312 (only on the Simplified Chinese version of Windows; the Traditional Chinese version uses Big5).

2) Unicode refers to the UCS-2 encoding, which stores each character's Unicode code directly in two bytes. This option uses the little endian format.

3) Unicode big endian corresponds to the previous option with the opposite byte order. The next section explains the meaning of little endian and big endian.

4) UTF-8 is the encoding described in the previous section.

After selecting the encoding, click the "Save" button and the file's encoding is converted immediately.
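The same kind of conversion can be scripted instead of done through Notepad. The sketch below is one possible way to do it in Python; the function name `convert_encoding` and the file names are hypothetical, and the demo runs inside a temporary directory.

```python
import tempfile
from pathlib import Path


def convert_encoding(src: Path, dst: Path, src_encoding: str, dst_encoding: str) -> None:
    """Read src as text in src_encoding and write it to dst in dst_encoding."""
    text = src.read_text(encoding=src_encoding)
    dst.write_text(text, encoding=dst_encoding)


with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "gb.txt"
    dst = Path(tmp) / "utf8.txt"
    src.write_bytes(b"\xd1\xcf")          # 严 stored as GB2312
    convert_encoding(src, dst, "gb2312", "utf-8")
    converted = dst.read_bytes()

print(converted.hex().upper())            # E4B8A5
```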

7. Little endian and big endian

As mentioned in the previous section, Unicode codes can be stored directly in UCS-2 format. Take the Chinese character 严 as an example: its Unicode code is 4E25, which needs two bytes, one 4E and the other 25. Storing 4E first and 25 second is the big endian way; storing 25 first and 4E second is the little endian way.

These two odd names come from Gulliver's Travels by the British writer Jonathan Swift. In the book, civil war breaks out in Lilliput over whether eggs should be cracked open at the big end (big-endian) or at the little end (little-endian). War broke out six times over this matter; one emperor lost his life and another lost his throne.

So storing the first (high-order) byte first is "big endian", and storing the second (low-order) byte first is "little endian".
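The two byte orders for 4E25 can be produced with `int.to_bytes` and compared with Python's `utf-16-be` and `utf-16-le` codecs, which write no byte order mark. The variable names are arbitrary.

```python
code_point = 0x4E25  # 严

big = code_point.to_bytes(2, "big")        # high byte first
little = code_point.to_bytes(2, "little")  # low byte first

print(big.hex().upper())     # 4E25
print(little.hex().upper())  # 254E
print(big == "严".encode("utf-16-be"), little == "严".encode("utf-16-le"))  # True True
```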

Then a question naturally arises: how does the computer know which byte order a given file uses?

The Unicode specification defines a character placed at the very beginning of a file to indicate the byte order. It is named "ZERO WIDTH NO-BREAK SPACE" and is represented by FEFF. This is exactly two bytes, and FF is 1 greater than FE.

If the first two bytes of a text file are FE FF, the file is big endian; if they are FF FE, the file is little endian.
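A minimal check of the first two bytes might look like the sketch below. The function name `detect_byte_order` is hypothetical, and the sketch only covers the two-byte case described here (it does not try to tell UTF-16 apart from other encodings that start with the same bytes).

```python
from typing import Optional


def detect_byte_order(data: bytes) -> Optional[str]:
    """Return 'big' or 'little' based on a leading FE FF / FF FE, else None."""
    if data[:2] == b"\xfe\xff":
        return "big"
    if data[:2] == b"\xff\xfe":
        return "little"
    return None  # no byte order mark found


print(detect_byte_order(b"\xfe\xff\x4e\x25"))  # big
print(detect_byte_order(b"\xff\xfe\x25\x4e"))  # little
print(detect_byte_order(b"\x4e\x25"))          # None
```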

8. UCS-2, UCS-4, BMP

UCS has two forms: UCS-2 and UCS-4. As the names imply, UCS-2 encodes with two bytes and UCS-4 with four bytes (actually only 31 bits; the highest bit must be 0). Let's do some simple arithmetic:

UCS-2 has 2^16=65536 code points; UCS-4 has 2^31=2147483648 code points.

UCS-4 is divided into 2^7=128 groups according to the highest byte, whose top bit is always 0. Each group is divided into 256 planes according to the second-highest byte. Each plane is divided into 256 rows according to the third byte, and each row contains 256 cells. Cells in the same row differ only in the last byte.
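The group/plane/row/cell split is just a matter of taking the four bytes apart. The function name `ucs4_parts` below is made up for this sketch.

```python
def ucs4_parts(code: int) -> tuple[int, int, int, int]:
    """Split a 31-bit UCS-4 code into (group, plane, row, cell)."""
    if not 0 <= code < 2**31:
        raise ValueError("UCS-4 codes are limited to 31 bits")
    group = (code >> 24) & 0x7F  # highest byte, top bit always 0
    plane = (code >> 16) & 0xFF  # second-highest byte
    row = (code >> 8) & 0xFF     # third byte
    cell = code & 0xFF           # last byte
    return group, plane, row, cell


print(ucs4_parts(0x4E25))   # (0, 0, 78, 37): row 0x4E, cell 0x25
print(ucs4_parts(0x1F600))  # (0, 1, 246, 0): plane 1
```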

Plane 0 of group 0 is called the Basic Multilingual Plane, or BMP. In other words, the UCS-4 code points whose two high bytes are 0 make up the BMP.

Removing the two leading zero bytes from a UCS-4 code in the BMP gives UCS-2; adding two zero bytes in front of a UCS-2 code gives the corresponding UCS-4 code in the BMP. Characters have since been allocated outside the BMP as well, in the supplementary planes, and those cannot be represented in UCS-2.
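The conversion in both directions can be sketched on big endian byte strings. The function names are hypothetical; the result is compared with Python's `utf-32-be` codec, which produces the same four bytes for BMP characters.

```python
def ucs2_to_ucs4(ucs2: bytes) -> bytes:
    """Prepend two zero bytes to a 2-byte big endian UCS-2 code."""
    if len(ucs2) != 2:
        raise ValueError("expected exactly 2 bytes")
    return b"\x00\x00" + ucs2


def ucs4_to_ucs2(ucs4: bytes) -> bytes:
    """Drop the two leading zero bytes of a 4-byte UCS-4 code in the BMP."""
    if len(ucs4) != 4 or ucs4[:2] != b"\x00\x00":
        raise ValueError("code is not in the BMP")
    return ucs4[2:]


ucs4 = ucs2_to_ucs4(b"\x4e\x25")
print(ucs4.hex().upper())                # 00004E25
print(ucs4_to_ucs2(ucs4).hex().upper())  # 4E25
```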

9. UTF byte order and BOM

UTF-8 uses the byte as its code unit, so it has no byte order problem. UTF-16 uses two bytes as its code unit, so before interpreting UTF-16 text you must first know the byte order of each code unit. For example, the Unicode code of 奎 ("Kui") is 594E and the Unicode code of 乙 ("Yi") is 4E59. If we receive the UTF-16 byte stream "594E", is it 奎 or 乙?
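The ambiguity is easy to demonstrate: the same two bytes decode to different characters depending on the byte order the reader assumes. The variable names below are arbitrary.

```python
stream = bytes.fromhex("594E")

as_big_endian = stream.decode("utf-16-be")     # read as code unit 0x594E
as_little_endian = stream.decode("utf-16-le")  # read as code unit 0x4E59

print(as_big_endian, as_little_endian)  # 奎 乙
```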

The method the Unicode specification recommends for marking byte order is the BOM. BOM here does not mean "Bill of Materials" but Byte Order Mark. It is a rather clever idea:

UCS contains a character called "ZERO WIDTH NO-BREAK SPACE", whose code is FEFF. FFFE, on the other hand, is not a character in UCS, so it should never appear in actual transmission. The UCS specification recommends transmitting the character "ZERO WIDTH NO-BREAK SPACE" before transmitting the byte stream.

If the recipient receives FEFF, the byte stream is big-endian; if it receives FFFE, the byte stream is little-endian. For this reason the character "ZERO WIDTH NO-BREAK SPACE" is also called the BOM.

UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to indicate the encoding. The UTF-8 encoding of "ZERO WIDTH NO-BREAK SPACE" is EF BB BF (readers can verify this with the encoding method described earlier). So if the receiver gets a byte stream that begins with EF BB BF, it knows the stream is UTF-8 encoded.
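The verification suggested above takes one line in Python. The sketch also shows the standard `utf-8-sig` codec, which removes a leading BOM when decoding; the variable names are arbitrary.

```python
import codecs

bom_utf8 = "\ufeff".encode("utf-8")  # ZERO WIDTH NO-BREAK SPACE in UTF-8
print(bom_utf8.hex().upper())        # EFBBBF
print(bom_utf8 == codecs.BOM_UTF8)   # True

data = codecs.BOM_UTF8 + "严".encode("utf-8")
with_bom = data.decode("utf-8")         # the BOM stays as the first character
without_bom = data.decode("utf-8-sig")  # the BOM is stripped
print(len(with_bom), len(without_bom))  # 2 1
```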

Windows uses a BOM to mark the way a text file is encoded.


10. Example

Here is an example.

Open Notepad (Notepad.exe), create a new text file whose content is the single character 严, and save it in turn with the ANSI, Unicode, Unicode big endian, and UTF-8 encodings.

Then use the hex view of the text editor UltraEdit to inspect how each file is encoded internally.

1) ANSI: the file is the two bytes "D1 CF", which is the GB2312 encoding of 严; this also shows that GB2312 is stored big endian.

2) Unicode: the file is the four bytes "FF FE 25 4E", where "FF FE" indicates little endian storage and the actual code is 4E25.

3) Unicode big endian: the file is the four bytes "FE FF 4E 25", where "FE FF" indicates big endian storage.

4) UTF-8: the file is the six bytes "EF BB BF E4 B8 A5". The first three bytes "EF BB BF" indicate UTF-8 encoding, and the last three, "E4 B8 A5", are the encoding of 严 itself, stored in the same order as the encoding.
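The four byte sequences can be reproduced without Notepad or UltraEdit. The sketch below assumes that Notepad's four options correspond to GB2312, UTF-16 little endian with BOM, UTF-16 big endian with BOM, and UTF-8 with BOM; the dictionary keys are just labels.

```python
import codecs

ch = "严"
saved = {
    "ANSI (GB2312)": ch.encode("gb2312"),
    "Unicode": codecs.BOM_UTF16_LE + ch.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + ch.encode("utf-16-be"),
    "UTF-8": ch.encode("utf-8-sig"),  # utf-8-sig writes the EF BB BF prefix
}

for name, data in saved.items():
    print(f"{name}: {data.hex(' ').upper()}")
```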

11. Further reading

* The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (the most basic knowledge about character sets)

* Talk about Unicode encoding

* RFC 3629: UTF-8, a transformation format of ISO 10646 (the specification of how UTF-8 is implemented)

Source: http://www.ruanyifeng.com/blog/2007/10/ascii_unicode_and_utf-8.html

The main reference for this material is "Short overview of ISO-IEC 10646 and Unicode" (http://www.nada.kth.se/i18n/ucs/unicode-iso10646-oview.html).

I also found two other promising documents, but since I had already answered the questions I started with, I did not read them:

"Understanding Unicode: A General Introduction to the Unicode Standard" (http://scripts.sil.org/cms/scripts/page.php?SITE_ID=NRSI&ITEM_ID=IWS-CHAPTER04A)

"Character Set Encoding Basics: Understanding Character Set Encodings and Legacy Encodings" (http://scripts.sil.org/cms/scripts/page.php?SITE_ID=NRSI&ITEM_ID=IWS-CHAPTER03)

Character encoding: Unicode/UTF-8/UTF-16/UCS/Endian/BMP/BOM
Unicode (Universal Multiple-Octet Coded Character Set): currently the most popular and most promising character encoding specification, because it resolves the conflicts between the encodings of different languages.

Origin of Unicode:

The original character encoding, ASCII (8 bits, with the highest bit 0), can represent only 128 characters, which is fine for English letters, digits, and some symbols. But the world has more than one language, and even extended ASCII, which also uses codes with the highest bit set to 1, has only 256 characters.

For complex scripts such as Chinese, Japanese, Korean, and Arabic, that is nowhere near enough.

So each country developed its own ASCII-compatible encoding specification. These are the various ANSI codes, such as China's GB2312, which uses two extended ASCII bytes to represent one Chinese character. These ANSI codes cannot coexist, however, because their definitions overlap; to use different languages freely there has to be a new encoding that assigns codes uniformly to all scripts.

ISO (the International Organization for Standardization) and the Unicode Consortium (an association of software manufacturers) each started this work separately, as the ISO 10646 project and the Unicode project. They later merged their results and now use the same character repertoire and code points, but both projects still exist and publish their standards independently.

UCS (Unicode Character Set):

This is the name for Unicode within ISO. It has two encodings: UCS-2 (Unicode), which represents a character in 2 bytes, and UCS-4 (Unicode-32), which represents a character in 4 bytes. UCS-4 extends UCS-2 by adding 2 high-order bytes. Even the older UCS-2 can represent 2^16=65536 characters, enough for basically all commonly used characters of every country, so UCS-2 is what is mostly used at present.

UTF (UCS Transformation Format):

Unicode uses 2 bytes per character while ASCII uses 1 byte, so the two conflict in many ways and code that used to process ASCII would have to be rewritten. Moreover, C uses \0 as the string terminator, but the encodings of many Unicode characters contain a \0 byte, so C string functions cannot handle Unicode correctly. UTF was created to make Unicode practical; the most common forms are UTF-8 and UTF-16.
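The \0 problem can be imitated in Python. The helper `c_strlen` is a made-up stand-in for C's `strlen` (it counts bytes up to the first zero byte); it is only meant to show why two-byte text confuses byte-oriented C string functions.

```python
def c_strlen(buf: bytes) -> int:
    """Mimic C strlen: count the bytes before the first zero byte."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end


two_byte = "AB".encode("utf-16-le")  # b'A\x00B\x00'
utf8 = "AB".encode("utf-8")          # b'AB'

print(c_strlen(two_byte))  # 1: the string appears to end after 'A'
print(c_strlen(utf8))      # 2
```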

For characters in the BMP, UTF-16 is identical to the Unicode (UCS-2) code itself, and UTF-32 is the same as UCS-4. The most important form is UTF-8, which is fully compatible with ASCII. UTF-8 is a variable-length encoding: the number of bytes is not fixed and is determined by the first byte. A first byte starting with 0 means one byte, 110 means 2 bytes, 1110 means 3 bytes, and 11110 means 4 bytes; every following byte starts with 10. This avoids ambiguity, and single-byte English characters remain plain ASCII. The original design allowed up to 6 bytes per character, but UTF-8 as standardized in RFC 3629 uses at most 4 bytes, and characters in the BMP (up to 0xFFFF) need at most 3 bytes.
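Reading the length from the first byte can be sketched as below. The function name `utf8_sequence_length` is hypothetical; the 4-byte case follows the 11110xxx row of the table in the UTF-8 section earlier in this note.

```python
def utf8_sequence_length(first_byte: int) -> int:
    """Return how many bytes a UTF-8 sequence has, judged by its first byte."""
    if (first_byte & 0x80) == 0x00:  # 0xxxxxxx
        return 1
    if (first_byte & 0xE0) == 0xC0:  # 110xxxxx
        return 2
    if (first_byte & 0xF0) == 0xE0:  # 1110xxxx
        return 3
    if (first_byte & 0xF8) == 0xF0:  # 11110xxx
        return 4
    # 10xxxxxx is a continuation byte; anything else is not a valid lead byte.
    raise ValueError(f"not a valid first byte: {first_byte:#04x}")


for ch in ["A", "é", "严"]:
    print(ch, utf8_sequence_length(ch.encode("utf-8")[0]))
```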

Converting Unicode to UTF-8

Unicode code range | UTF-8 encoding (write the Unicode code in binary and fill in the x positions)
0000-007F          | 0xxxxxxx
0080-07FF          | 110xxxxx 10xxxxxx
0800-FFFF          | 1110xxxx 10xxxxxx 10xxxxxx

The Unicode codes of commonly used Chinese characters fall in the range 0800-FFFF, so they are encoded in 3 bytes.

Big endian and little endian:

Unicode storage has a byte order problem: a multi-byte number can be stored with the high-order byte first or with the low-order byte first. This depends on the CPU; x86 processors store values in reverse order, low-order byte first, which is little endian. For example, the Unicode code of the character 莫 ("Mo") is 0x83AB; stored little endian, its bytes are in the order 0xAB 0x83.
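Python's `struct` module makes the byte order explicit: `<` packs little endian, `>` packs big endian, and `sys.byteorder` reports what the current machine uses natively. This is a small illustration; the variable names are arbitrary.

```python
import struct
import sys

value = 0x83AB  # 莫

little = struct.pack("<H", value)  # low-order byte first
big = struct.pack(">H", value)     # high-order byte first

print(little.hex().upper())  # AB83
print(big.hex().upper())     # 83AB
print(sys.byteorder)         # 'little' on x86 machines
```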

BOM (Byte Order Mark):

Because Unicode storage has this byte order problem, an invisible character (ZERO WIDTH NO-BREAK SPACE, 0xFEFF) is inserted at the front of Unicode text as a flag to distinguish the two byte orders. If the text starts with the bytes FE FF it is big endian; if it starts with FF FE it is little endian.

UTF-8 does not need a BOM to indicate byte order, but a BOM can be used to mark the encoding. When text begins with 0xEFBBBF, the computer can process it directly as UTF-8 without any further detection.

BMP (Basic multilingual Plane):

This concept comes from the way the Unicode code space is partitioned.

Taking UCS-4 as an example:

The highest bit of the first byte is always 0; the remaining 7 bits give 2^7=128 groups.

The second byte gives 2^8=256 planes within each group.

The third byte gives 256 rows within each plane.

The fourth byte gives 256 cells within each row.

Plane 0 of group 0 is the BMP, that is, the UCS-4 codes whose first two bytes are 0x0000. Removing the 0x0000 from a UCS-4 code in the BMP gives its UCS-2 encoding. In other words, UCS-2 is a subset of UCS-4, and the BMP is the position of UCS-2 within UCS-4. This also gives the method for converting UCS-2 to UCS-4: insert the 2 bytes 0x0000 in front of the UCS-2 code.

Source: http://blog.csdn.net/zzcv_/archive/2007/06/03/1636085.aspx
