Unicode and ASCII

Source: Internet
Author: User

1. Features of ASCII

ASCII is an encoding standard for English characters. Each ASCII character occupies 1 byte, so a single byte can represent at most 256 values (00H-FFH). English needs only the first 128 (00H-7FH, highest bit 0). The other 128 values (80H-FFH, highest bit 1) are referred to as "extended ASCII" and are commonly used for box-drawing characters, some accented characters, and other symbols.
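As a small illustration of the one-byte-per-character claim, the following Python sketch encodes an English word as ASCII and shows that a character outside 00H-7FH has no ASCII code. The sample strings and variable names are chosen only for this example.

```python
text = "Hello"
encoded = text.encode("ascii")            # one byte per English character
codes = [format(b, "02X") + "H" for b in encoded]

print(len(text), len(encoded))            # 5 5
print(codes)                              # ['48H', '65H', '6CH', '6CH', '6FH']

# Characters outside 00H-7FH have no ASCII code at all.
try:
    "é".encode("ascii")
    fits_in_ascii = True
except UnicodeEncodeError:
    fits_in_ascii = False
print(fits_in_ascii)                      # False
```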

For more complex writing systems such as Chinese, 256 characters are obviously not enough. Each country therefore developed its own text encoding standard. The Chinese standard is called GB2312-80. It is compatible with ASCII and takes advantage of the fact that extended ASCII was never truly standardized: it represents each Chinese character with two extended-ASCII bytes, which distinguishes Chinese text from the ASCII part of the code.
This approach has problems, the biggest being that the Chinese encoding overlaps with extended ASCII. Many programs use the extended-ASCII box-drawing characters to draw tables; when such software runs on a Chinese system, the table borders are mistaken for Chinese characters and come out garbled. In addition, because each country and region has its own encoding rules and these conflict with one another, exchanging information between countries and regions becomes very difficult.
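The overlap described above can be reproduced in a few lines of Python. This is only a sketch: it assumes the table-drawing program used code page 437 (the original IBM PC extended ASCII set, available in Python as the `cp437` codec), which the text does not name.

```python
# Four double-line box-drawing characters, as a DOS-era program might emit them.
table_border = "\u2550" * 4
raw = table_border.encode("cp437")        # one extended-ASCII byte per character

# A Chinese system reads the same bytes as GB2312 double-byte characters.
misread = raw.decode("gb2312")

print(raw.hex())                          # cdcdcdcd
print(len(table_border), len(misread))    # 4 2
print(misread)                            # Chinese characters instead of a border
```

Every pair of border bytes is consumed as one Chinese character, which is why the table turns into garbled text.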

2. The Origin of Unicode

Solving this problem properly cannot start from extended ASCII. Unicode came into being as a new encoding system that considers Chinese, French, German, and all other scripts together and assigns every character of every script its own code.

3. What is Unicode

Unicode, like ASCII, is a character encoding standard. Its original design used two bytes (0000H-FFFFH), which holds 65,536 characters; the code space was later extended to 10FFFFH (more than a million code points), which is enough for the characters of all the world's languages. In Unicode, every character is treated as a single unit and has a unique Unicode code.

4. Benefits of using Unicode

Using Unicode lets your project support multiple languages at the same time, that is, it internationalizes your project: text does not come out garbled on systems set to different languages.

Unicode and UTF-8

1. ASCII code

Inside the computer, all the information is ultimately represented as a binary string.

Each bit has two states, 0 and 1, so eight bits can be combined into 256 states. A group of eight bits is called a byte.

A byte can therefore represent 256 different states, each of which can correspond to a symbol: 256 symbols in total, from 00000000 to 11111111.

In the 1960s, the United States developed a character encoding that standardized the relationship between English characters and bit patterns.

This is the ASCII code, which is still in use today.

The ASCII code specifies encodings for 128 characters in total.

These 128 symbols use only the last 7 bits of a byte; the first bit is always 0.
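The bit layout can be made visible by printing a few ASCII characters as eight binary digits. The sample characters below are arbitrary, and the helper name `high_bit_is_zero` is hypothetical.

```python
def high_bit_is_zero(byte_value: int) -> bool:
    """True when the first (highest) bit of the byte is 0."""
    return (byte_value & 0x80) == 0

# Show the 8-bit pattern of a few ASCII characters.
bit_patterns = {ch: format(ord(ch), "08b") for ch in "Az"}
print(bit_patterns)                       # {'A': '01000001', 'z': '01111010'}

# All 128 ASCII codes leave the first bit clear; 80H and above do not.
print(all(high_bit_is_zero(code) for code in range(128)))   # True
print(high_bit_is_zero(0x80))                               # False
```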

When ASCII was designed, each character was given a full byte, but the first bit was left empty, so only 7 bits were used. If that bit is used as well, 8 bits can represent 256 characters. This is how extended ASCII was born: it keeps the original 7-bit codes and adds new ones that use the first bit.

2. Non-ASCII encoding

128 symbols are enough to encode English, but not enough to represent other languages.

As a result, some European countries decided to use the idle highest bit of the byte to add new symbols.

In this way, the encodings used in these European countries can represent at most 256 symbols.

However, this creates a new problem.

Different countries have different letters, so even though they all use 256-symbol encodings, the same code stands for different letters in different countries.
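To see the conflict concretely, the sketch below decodes one and the same byte with three single-byte encodings that ship with Python. The choice of byte (E9H) and of these three code pages is only an illustration; the text does not name specific encodings.

```python
byte = bytes([0xE9])                      # one byte with the highest bit set

readings = {
    "latin-1": byte.decode("latin-1"),        # Western European
    "cp1251": byte.decode("cp1251"),          # Cyrillic
    "iso8859_7": byte.decode("iso8859_7"),    # Greek
}

for encoding, letter in readings.items():
    print(encoding, letter, f"U+{ord(letter):04X}")
```

The same byte yields three different letters, so a file is only readable if the reader already knows which encoding the writer used.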

3. Unicode

As mentioned in the previous section, there are many encodings in the world, and the same binary number can be interpreted as different symbols.

Imagine an encoding that includes every symbol in the world and gives each one a unique code: the garbling problem would disappear. This is Unicode, which, as its name suggests, is an encoding of all symbols.

Unicode is of course a very large set; it now has room for more than one million symbols. Each symbol has its own code: for example, U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严 ("strict"). For the full symbol table, consult unicode.org or a dedicated Chinese character table.
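The three code points mentioned above can be looked up with Python's built-in `ord` and the standard `unicodedata` module. This is just a quick check, not a substitute for the tables at unicode.org.

```python
import sys
import unicodedata

# Print the code point and official name of each example character.
for ch in ["\u0639", "A", "\u4e25"]:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# Size of the Unicode code space as seen by Python: 0 through 10FFFFH.
code_space = sys.maxunicode + 1
print(code_space)                         # 1114112
```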

4. Problems with Unicode

Unicode is only a symbol set: it specifies the binary code of each symbol but not how that code should be stored.

This raises two serious problems. The first is how to tell Unicode apart from ASCII.

How does the computer know that three bytes represent one symbol rather than three separate symbols?

The second is that one byte is enough for an English letter. If Unicode required every symbol to be stored in three or four bytes, two or three bytes of every English letter would be 0. That is a great waste of storage: an English text file would become three or four times as large, which is unacceptable.

These problems had two consequences:

1) Several storage formats for Unicode appeared, that is, there are many different binary formats that can represent Unicode.

2) Unicode could not be widely adopted for a long time, until the advent of the Internet.
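The storage waste described in this section is easy to measure. The sketch below uses UTF-32 as a stand-in for a fixed "four bytes per symbol" rule, which is an assumption made for illustration; the sample string is arbitrary.

```python
text = "Hello, world"

one_byte_each = text.encode("ascii")      # one byte per English character
four_bytes_each = text.encode("utf-32-be")  # fixed four bytes per symbol, no BOM

print(len(one_byte_each), len(four_bytes_each))   # 12 48

# Three of every four bytes are zero for English text.
zero_bytes = four_bytes_each.count(0)
print(zero_bytes)                                 # 36
```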

5. UTF-8

UTF-8 is the most widely used implementation of Unicode on the Internet.

Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet.

To repeat, the relationship is that UTF-8 is one of the ways of implementing Unicode.

The most notable feature of UTF-8 is that it is a variable-length encoding.

It uses 1 to 4 bytes to represent a symbol, and the number of bytes depends on the symbol.

The encoding rules of UTF-8 are simple; there are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits are the symbol's Unicode code.

So for English letters, the UTF-8 encoding and the ASCII code are identical.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of every following byte are set to 10.

All the remaining bits are filled with the symbol's Unicode code.

The following table summarizes the encoding rules, and the letter x represents the bits that are available for encoding.

Unicode symbol range | UTF-8 encoding
(hexadecimal)        | (binary)
---------------------+-------------------------------------
0000 0000-0000 007F  | 0xxxxxxx
0000 0080-0000 07FF  | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF  | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF  | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
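The table can be turned into code almost row by row. The following is a minimal sketch of an encoder for a single code point; the function name `utf8_encode` is hypothetical, and real programs would simply call `str.encode("utf-8")`. Rejecting the range D800H-DFFFH is an extra rule that the table does not show (those values are reserved for UTF-16 surrogates).

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by following the table row by row."""
    if not 0 <= code_point <= 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
        raise ValueError(f"not an encodable code point: {code_point:#x}")
    if code_point <= 0x7F:                # 0xxxxxxx
        return bytes([code_point])
    if code_point <= 0x7FF:               # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (code_point >> 6),
                      0x80 | (code_point & 0x3F)])
    if code_point <= 0xFFFF:              # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (code_point >> 12),
                      0x80 | ((code_point >> 6) & 0x3F),
                      0x80 | (code_point & 0x3F)])
    return bytes([0xF0 | (code_point >> 18),      # 11110xxx + three 10xxxxxx
                  0x80 | ((code_point >> 12) & 0x3F),
                  0x80 | ((code_point >> 6) & 0x3F),
                  0x80 | (code_point & 0x3F)])

# U+4E25 falls in the third row, so it needs three bytes.
print(utf8_encode(0x4E25).hex())          # e4b8a5
print(utf8_encode(0x41).hex())            # 41
```

For U+4E25 the sixteen bits 0100 1110 0010 0101 are split 4 + 6 + 6 and dropped into the x positions of the third row, giving 11100100 10111000 10100101, that is E4 B8 A5.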

With this table, interpreting UTF-8 is straightforward. If the first bit of a byte is 0, the byte is a complete character on its own.

If the first bit is 1, the number of consecutive 1s at the start of the byte is the number of bytes the current character occupies.
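The reading rule can be sketched as a small decoder: count the leading 1 bits of the first byte, then collect six bits from each following byte. The names `sequence_length` and `utf8_decode` are hypothetical, and the sketch deliberately omits checks a strict decoder needs (for example rejecting overlong forms); in practice `bytes.decode("utf-8")` does all of this.

```python
def sequence_length(first_byte: int) -> int:
    """Number of bytes in the character that starts with first_byte."""
    if (first_byte & 0x80) == 0:
        return 1                          # 0xxxxxxx: a single-byte character
    count, mask = 0, 0x80
    while first_byte & mask:              # count consecutive leading 1 bits
        count += 1
        mask >>= 1
    if count == 1 or count > 4:
        raise ValueError(f"invalid first byte: {first_byte:#04x}")
    return count

def utf8_decode(data: bytes) -> list:
    """Return the list of code points stored in data."""
    code_points = []
    i = 0
    while i < len(data):
        n = sequence_length(data[i])
        if i + n > len(data):
            raise ValueError("truncated sequence")
        if n == 1:
            value = data[i]
        else:
            value = data[i] & (0xFF >> (n + 1))   # payload bits of first byte
            for b in data[i + 1:i + n]:
                if (b & 0xC0) != 0x80:            # must look like 10xxxxxx
                    raise ValueError("invalid continuation byte")
                value = (value << 6) | (b & 0x3F)
        code_points.append(value)
        i += n
    return code_points

print([hex(cp) for cp in utf8_decode(b"A\xe4\xb8\xa5")])   # ['0x41', '0x4e25']
```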

6. Little Endian and Big Endian

Unicode codes can be stored directly in the UCS-2 format.

Take the Chinese character 严 (yan, "strict") as an example. Its Unicode code is 4E25, which needs two bytes of storage: one byte is 4E and the other is 25.

Storing 4E first and 25 second is the big endian way; storing 25 first and 4E second is the little endian way.

In other words, putting the first (most significant) byte in front is big endian, and putting the second byte in front is little endian.

The Unicode specification defines a character that can be placed at the start of a file to indicate the byte order. It is called the "zero width no-break space" and is denoted by FEFF. It is exactly two bytes long, and FF is 1 larger than FE.

If the first two bytes of a text file are FE FF, the file is big endian; if the first two bytes are FF FE, the file is little endian.
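Both byte orders, and the marker check described above, can be tried out in Python. The sketch uses the `utf-16-be` and `utf-16-le` codecs, which give the same two bytes as UCS-2 for this character; the function name `detect_byte_order` is hypothetical, and it only looks at the two-byte marker discussed here.

```python
import codecs

yan = "\u4e25"
big = yan.encode("utf-16-be")             # first byte in front
little = yan.encode("utf-16-le")          # second byte in front
print(big.hex(), little.hex())            # 4e25 254e

def detect_byte_order(data: bytes) -> str:
    """Guess the byte order of two-byte Unicode text from its first two bytes."""
    if data.startswith(codecs.BOM_UTF16_BE):      # FE FF
        return "big endian"
    if data.startswith(codecs.BOM_UTF16_LE):      # FF FE
        return "little endian"
    return "unknown"

print(detect_byte_order(codecs.BOM_UTF16_BE + big))       # big endian
print(detect_byte_order(codecs.BOM_UTF16_LE + little))    # little endian
```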

Although Unicode can hold more than a million characters, it is only a huge character set. It specifies the binary code of each symbol but gives no detailed storage rule. For example, a character stored in three bytes could also be read as three ASCII characters. And since an ASCII character needs only one byte, requiring three bytes for every character would waste two bytes on each one. Leaving such details unspecified leads to inconsistent ways of storing Unicode.

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding that refines and optimizes the storage of Unicode; it is one way of implementing Unicode. UTF-16 and UTF-32 also implement Unicode, but UTF-8 is the most widely used.

UTF-8 uses 1 to 4 bytes to represent a character, varying the length according to its own rules in order to store Unicode characters flexibly.
