Character set encoding: how much do you know about ASCII, Unicode, and UTF-8?

Source: Internet
Author: User
How much do you know about the ASCII, Unicode, and UTF-8 character encodings? This article explains the problems each one addresses, how they relate and convert to one another, and works through an example.

I. ASCII code

Inside a computer, all information is ultimately stored as binary values. Each bit has two states, 0 and 1, so eight bits can be combined into 256 states; a group of eight bits is called a byte. In other words, one byte can represent 256 different states, each corresponding to one symbol, for 256 symbols in total, from 00000000 to 11111111.

In the 1960s, the United States developed a character encoding that standardized the mapping between English characters and bit patterns. This is ASCII, and it is still in use today.

ASCII defines 128 characters in total; for example, the space is 32 (binary 00100000) and the uppercase letter A is 65 (binary 01000001). These 128 symbols (including 33 non-printable control characters, codes 0 to 31 and 127) use only the low 7 bits of a byte; the first (highest) bit is always 0.
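
As a quick check of the values above, here is a minimal Python sketch. Python is not used in the original article and is chosen here only for illustration; the helper name `ascii_bits` is hypothetical.

```python
def ascii_bits(ch: str) -> str:
    """Return the 8-bit binary string for a single ASCII character."""
    code = ord(ch)
    if code > 127:
        raise ValueError("not an ASCII character")
    return format(code, "08b")

# Print the character, its decimal code, and its 8-bit pattern.
for ch in (" ", "A"):
    print(repr(ch), ord(ch), ascii_bits(ch))
```

Every ASCII character produces a pattern whose first bit is 0, which matches the 7-bit description above.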

(Tables not reproduced here: ASCII control characters; ASCII printable characters.)

II. Non-ASCII encoding

128 symbols are enough to encode English, but not other languages. French, for example, has letters with diacritical marks above them, which ASCII cannot represent. So some European countries decided to use the unused highest bit of the byte to encode new symbols. For example, é in a French encoding is 130 (binary 10000010). Encodings of this kind, used across Europe, can represent at most 256 symbols.

This created a new problem. Different countries have different letters, so although they all use 256-symbol encodings, the same code stands for different letters. For example, 130 represents é in the French encoding, the letter Gimel (ג) in the Hebrew encoding, and yet another symbol in the Russian encoding. In all of these encodings, however, 0 to 127 represent the same symbols; only the 128 to 255 range differs.
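
The effect can be reproduced with a short Python sketch. The article does not say which specific encodings it has in mind; the DOS code pages cp437, cp862, and cp866 are assumed here because they happen to match the French, Hebrew, and Russian examples.

```python
# The same byte value (130) decoded under three legacy 8-bit code pages.
# The choice of code pages is an assumption made for illustration.
raw = bytes([130])
results = {name: raw.decode(name) for name in ("cp437", "cp862", "cp866")}
for name, ch in results.items():
    print(name, ch)

# A byte below 128 decodes to the same character in all three.
same = {bytes([65]).decode(name) for name in results}
print(same)
```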

Asian languages use far more symbols; there are about 100,000 Chinese characters. One byte can represent only 256 symbols, which is clearly not enough, so multiple bytes must be used to represent one symbol. For example, the common encoding for Simplified Chinese is GB2312, which uses two bytes per Chinese character, so in theory it can represent at most 256 x 256 = 65,536 symbols.

Chinese encoding deserves an article of its own and is not covered in this note. The only point made here is that, although GB-family encodings also use multiple bytes per symbol, they are unrelated to Unicode and UTF-8.

III. Unicode

As the previous section explained, there are many encodings in the world, and the same binary value can be interpreted as different symbols. So to open a text file you must know its encoding; if you interpret it with the wrong one, you get garbled text. Why do e-mails often show up garbled? Because the sender and the recipient are using different encodings.

Now imagine a single encoding that includes every symbol in the world and gives each one a unique code; the garbling problem would disappear. That is Unicode: as its name suggests, an encoding for all symbols.

Unicode is, of course, a very large set; at its current size it can hold more than one million symbols. Each symbol has its own code: for example, U+0639 is the Arabic letter Ain, U+0041 is the English uppercase letter A, and U+4E25 is the Chinese character 严. For the full correspondence tables, consult unicode.org or a dedicated table of Chinese characters.
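
In Python, the built-in `ord()` and `chr()` convert between a character and its Unicode code. A minimal sketch using the three code points above (the helper name `code_point_label` is hypothetical):

```python
def code_point_label(ch: str) -> str:
    """Format a character's Unicode code in the usual U+XXXX notation."""
    return f"U+{ord(ch):04X}"

# Arabic letter Ain, uppercase A, and the Chinese character U+4E25.
for ch in ("\u0639", "A", "\u4e25"):
    print(code_point_label(ch), ch)
```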

IV. Issues with Unicode

Note that Unicode is only a symbol set: it specifies the binary code of each symbol, but not how that code should be stored.

For example, the Unicode code of the Chinese character 严 is hexadecimal 4E25, which is a full 15 bits in binary (100111000100101), so representing this symbol takes at least 2 bytes. Other, larger symbols may need 3 or 4 bytes, or even more.
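
The bit count can be verified directly. In this sketch, `min_bytes` is a hypothetical helper that rounds the bit length up to whole bytes; it says nothing about how any real encoding lays those bytes out.

```python
def min_bytes(code_point: int) -> int:
    """Smallest number of whole bytes that can hold the raw code value."""
    return max(1, (code_point.bit_length() + 7) // 8)

cp = 0x4E25
# Binary digits, number of bits, and minimum bytes for U+4E25.
print(bin(cp)[2:], cp.bit_length(), min_bytes(cp))
```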

This raises two serious problems. First, how can Unicode be distinguished from ASCII? How does the computer know that three bytes represent one symbol rather than three separate symbols? Second, we already know that one byte is enough for an English letter. If Unicode required every symbol to use three or four bytes, then two or three bytes of every English letter would be 0, which is a huge waste of storage: a text file would be two to three times larger than necessary, which is unacceptable.

These problems had two consequences: 1) Multiple ways of storing Unicode appeared, that is, there are several different binary formats for representing Unicode. 2) Unicode could not be widely adopted for a long time, until the Internet arrived.

V. UTF-8

The spread of the Internet created a strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 (two or four bytes per character) and UTF-32 (four bytes per character), but they are rarely used on the Internet. To repeat the relationship: UTF-8 is one of the ways Unicode is implemented.

One of the key features of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, with the number of bytes depending on the symbol.
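
To make the size differences concrete, the following Python sketch counts the bytes each implementation needs for four sample characters. The samples and the helper name `encoded_lengths` are chosen here for illustration; the big endian codec variants are used so that no byte order mark is added to the count.

```python
def encoded_lengths(ch: str) -> dict:
    """Bytes needed for one character in UTF-8, UTF-16 and UTF-32."""
    return {enc: len(ch.encode(enc))
            for enc in ("utf-8", "utf-16-be", "utf-32-be")}

# A (U+0041), e with acute (U+00E9), U+4E25, and an emoji (U+1F600).
for ch in ("A", "\u00e9", "\u4e25", "\U0001F600"):
    print(f"U+{ord(ch):04X}", encoded_lengths(ch))
```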

The encoding rules of UTF-8 are simple; there are only two:

1. For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits hold the symbol's Unicode code. So for English letters, UTF-8 and ASCII are identical.

2. For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n + 1 is set to 0, and the first two bits of every following byte are set to 10. All remaining bits hold the symbol's Unicode code.

The following table summarizes the encoding rules; the letter x marks the bits available for encoding.

    Unicode range (hexadecimal)  | UTF-8 encoding (binary)
    -----------------------------+-------------------------------------
    0000 0000-0000 007F          | 0xxxxxxx
    0000 0080-0000 07FF          | 110xxxxx 10xxxxxx
    0000 0800-0000 FFFF          | 1110xxxx 10xxxxxx 10xxxxxx
    0001 0000-0010 FFFF          | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

With the table above, interpreting UTF-8 is straightforward. If the first bit of a byte is 0, the byte is a character on its own; if the first bit is 1, the number of consecutive leading 1s tells how many bytes the current character occupies.
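
The interpretation rule can be sketched in Python as follows. The function names are hypothetical, and this is only an illustration of reading the leading bits: it does not check that the following bytes really start with 10.

```python
def utf8_sequence_length(lead: int) -> int:
    """Number of bytes in a UTF-8 sequence, judged from its first byte."""
    if (lead & 0b10000000) == 0:
        return 1                      # 0xxxxxxx
    if (lead & 0b11100000) == 0b11000000:
        return 2                      # 110xxxxx
    if (lead & 0b11110000) == 0b11100000:
        return 3                      # 1110xxxx
    if (lead & 0b11111000) == 0b11110000:
        return 4                      # 11110xxx
    raise ValueError("continuation byte or invalid first byte")

def split_utf8(data: bytes) -> list:
    """Cut a UTF-8 byte string into one chunk per character."""
    chunks = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        chunks.append(data[i:i + n])
        i += n
    return chunks

# One ASCII letter followed by one three-byte character.
print(split_utf8(b"A\xe4\xb8\xa5"))
```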

Below, the Chinese character 严 is again used as an example to show how UTF-8 encoding is carried out.

The Unicode code of 严 is 4E25 (100111000100101). From the table above, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of 严 takes three bytes, in the format 1110xxxx 10xxxxxx 10xxxxxx. Then, starting from the last binary digit of 严, fill in the x positions of the format from back to front, and pad any leftover positions with 0. The result is that the UTF-8 encoding of 严 is 11100100 10111000 10100101, which is E4B8A5 in hexadecimal.
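
The fill-in procedure can be written out with shifts and masks. This is a minimal sketch for illustration, not a replacement for Python's built-in `str.encode`; the name `utf8_encode` is hypothetical, and the sketch does not reject the surrogate range D800 to DFFF as a strict encoder would.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code value as UTF-8 by filling the x bits."""
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x800:
        return bytes([0xC0 | (cp >> 6),
                      0x80 | (cp & 0x3F)])
    if cp < 0x10000:
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    if cp <= 0x10FFFF:
        return bytes([0xF0 | (cp >> 18),
                      0x80 | ((cp >> 12) & 0x3F),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    raise ValueError("code value out of Unicode range")

print(utf8_encode(0x4E25).hex())
```

For ordinary characters the result agrees with `chr(cp).encode("utf-8")`.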

VI. Conversion between Unicode and UTF-8

From the example in the previous section, the Unicode code of 严 is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. Conversion between them can be done by a program.
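
As one possible illustration of such a program, the sketch below converts a single UTF-8 sequence back to its Unicode code by stripping the marker bits and joining the payload bits. The name `utf8_decode_one` is hypothetical, and the sketch does not reject overlong forms as a strict decoder would.

```python
def utf8_decode_one(data: bytes) -> int:
    """Decode exactly one UTF-8 sequence into its Unicode code value."""
    lead = data[0]
    if lead < 0x80:
        n, cp = 1, lead
    elif (lead >> 5) == 0b110:
        n, cp = 2, lead & 0x1F
    elif (lead >> 4) == 0b1110:
        n, cp = 3, lead & 0x0F
    elif (lead >> 3) == 0b11110:
        n, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid first byte")
    if len(data) != n:
        raise ValueError("wrong number of bytes")
    for b in data[1:]:
        if (b >> 6) != 0b10:
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)   # append the 6 payload bits
    return cp

print(f"{utf8_decode_one(bytes.fromhex('E4B8A5')):04X}")
```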

On Windows, one of the simplest ways to convert is to use the built-in Notepad program, notepad.exe. After opening a file, choose Save As from the File menu; a dialog box appears with an Encoding drop-down list at the bottom.

There are four options: ANSI, Unicode, Unicode big endian, and UTF-8.

    • ANSI is the default. For English files it is ASCII; for Simplified Chinese files it is GB2312 (this applies only to the Simplified Chinese edition of Windows; the Traditional Chinese edition uses Big5).

    • Unicode here means the UCS-2 encoding used by Notepad, which stores each character's Unicode code directly in two bytes. This option uses little endian byte order.

    • Unicode big endian is the counterpart of the previous option. The next section explains what little endian and big endian mean.

    • UTF-8 is the encoding described earlier.

After choosing an encoding, click the Save button and the file is converted to that encoding immediately.
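
Outside Notepad, the same kind of conversion can be scripted. The following is a minimal Python sketch, not part of the original article; the function name `convert_encoding` and the file names in the usage comment are hypothetical.

```python
from pathlib import Path

def convert_encoding(src, dst, src_encoding, dst_encoding):
    """Read a text file in one encoding and write it out in another."""
    text = Path(src).read_text(encoding=src_encoding)
    Path(dst).write_text(text, encoding=dst_encoding)

# Example call (hypothetical file names):
# convert_encoding("input_gb2312.txt", "output_utf8.txt", "gb2312", "utf-8")
```

The caller has to know the source encoding; as noted earlier, decoding with the wrong one produces garbled text or an error.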

VII. Little endian and big endian

As mentioned in the previous section, the UCS-2 format can store Unicode codes directly (for code points not exceeding 0xFFFF). Take the Chinese character 严 as an example: its Unicode code is 4E25, which needs two bytes to store, one byte being 4E and the other 25. If 4E is stored first and 25 second, that is big endian; if 25 is stored first and 4E second, that is little endian.

These two odd names come from Gulliver's Travels by the British writer Swift. In the book, civil war breaks out in Lilliput over whether eggs should be cracked open at the big end (big-endian) or the little end (little-endian). Over this matter war broke out six times; one emperor lost his life and another lost his throne.

So storing the first byte first is the "big end" way (big endian), and storing the second byte first is the "little end" way (little endian).
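
The two byte orders are easy to see in Python with `int.to_bytes`. In this sketch, `ucs2_bytes` is a hypothetical helper; it covers only codes up to 0xFFFF, as UCS-2 does.

```python
def ucs2_bytes(ch: str, byteorder: str) -> bytes:
    """Store a character's Unicode code in two bytes, 'big' or 'little'."""
    cp = ord(ch)
    if cp > 0xFFFF:
        raise ValueError("code does not fit in two bytes")
    return cp.to_bytes(2, byteorder)

ch = "\u4e25"  # the Chinese character U+4E25
print(ucs2_bytes(ch, "big").hex(), ucs2_bytes(ch, "little").hex())
```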

This naturally raises a question: how does the computer know which byte order a given file uses?

The Unicode specification defines a character that can be placed at the very start of a file to indicate the byte order. This character is named "zero width no-break space" and is written FEFF. It is exactly two bytes, and FF is 1 larger than FE.

If the first two bytes of a text file are FE FF, the file is big endian; if the first two bytes are FF FE, the file is little endian.
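
A check of this kind might look like the following Python sketch. The function name is hypothetical; the two constants come from the standard `codecs` module. Note that the sketch looks only at the first two bytes, so it cannot tell a UTF-16 little endian file from a UTF-32 little endian one, whose mark also begins with FF FE.

```python
import codecs

def detect_utf16_byte_order(data: bytes):
    """Return 'big', 'little', or None based on the first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):   # FE FF
        return "big"
    if data.startswith(codecs.BOM_UTF16_LE):   # FF FE
        return "little"
    return None

print(detect_utf16_byte_order(b"\xfe\xff\x4e\x25"))
print(detect_utf16_byte_order(b"\xff\xfe\x25\x4e"))
```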

VIII. Example

Here is an example.

Open Notepad (notepad.exe), create a new text file whose content is the single character 严, and save it in turn with the ANSI, Unicode, Unicode big endian, and UTF-8 encodings.

Then use the hex view in the text editor UltraEdit to see how each file is encoded internally.

    • ANSI: the file is two bytes, D1 CF, which is the GB2312 encoding of 严. This also shows that GB2312 stores its bytes in big endian order.

    • Unicode: the file is four bytes, FF FE 25 4E, where FF FE indicates little endian storage and the actual code is 4E25.

    • Unicode big endian: the file is four bytes, FE FF 4E 25, where FE FF indicates big endian storage.

    • UTF-8: the file is six bytes, EF BB BF E4 B8 A5. The first three bytes, EF BB BF, indicate that this is UTF-8; the last three, E4 B8 A5, are the encoding of 严 itself, stored in the same order as the encoding.
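
The four byte sequences can be reproduced without Notepad. The sketch below assumes that Python's `gb2312`, `utf-16-le`, `utf-16-be`, and `utf-8-sig` codecs correspond to the four Notepad options described above; that mapping is an assumption made for illustration, and `bytes.hex` with a separator needs Python 3.8 or later.

```python
import codecs

ch = "\u4e25"  # the Chinese character U+4E25
saved = {
    "ANSI (GB2312)": ch.encode("gb2312"),
    "Unicode": codecs.BOM_UTF16_LE + ch.encode("utf-16-le"),
    "Unicode big endian": codecs.BOM_UTF16_BE + ch.encode("utf-16-be"),
    "UTF-8": ch.encode("utf-8-sig"),   # utf-8-sig prepends EF BB BF
}
for label, data in saved.items():
    print(label, data.hex(" ").upper())
```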

IX. Further reading

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (the most basic knowledge about character sets)

Talking about Unicode encoding

RFC 3629: UTF-8, a transformation format of ISO 10646 (how UTF-8 is specified)
