(20161010) Garbled web pages and character encoding methods

Source: Internet
Author: User
Tags: control characters

A web page usually comes out garbled because its text was encoded with one character encoding and decoded with another.
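The effect is easy to reproduce. The following is a minimal sketch in Python (a language chosen here for illustration; the article itself names none). It encodes one Chinese character as UTF-8 and then decodes the bytes with Latin-1, which was picked arbitrarily as the mismatched encoding.

```python
# Encode with one encoding, decode with another: the result is garbled text.
text = "\u4e25"  # the Chinese character 严 ("strict")

raw = text.encode("utf-8")        # the bytes as they would be stored or sent
garbled = raw.decode("latin-1")   # wrong guess: each byte becomes one Latin-1 character
restored = raw.decode("utf-8")    # right guess: the original character comes back

print(raw.hex())                  # e4b8a5
print(ascii(garbled))             # '\xe4\xb8\xa5', which displays as ä¸¥
print(restored == text)           # True
```

The bytes are the same in both cases; only the interpretation differs.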

Character encoding is a foundation of computer technology. To a computer, all information is a binary sequence of 0s and 1s. It cannot recognize or store characters directly, so characters must be encoded before the computer can process them.

One, two concepts: character set and character encoding

Character set: intuitively, a predefined mapping between a collection of characters and numbers (binary sequences).

The more common character sets are ASCII, GBK, Unicode, and so on.

Defining the relationship between characters and numbers does not mean that the computer must store each character's number directly. We need a further rule that transforms these code values into a form better suited to storage and network transmission.

Character encoding is that rule: it specifies how the number corresponding to each character is encoded and stored as a binary sequence.

So a character set is an agreement, and a character encoding is a rule for implementing that character set in practice. That is why a single character set can have several different encodings.
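The distinction can be seen directly in Python. In this sketch, `ord` returns the number the Unicode character set assigns to a character, and three standard codecs then store that same number in three different ways. The choice of character and of codecs is only for illustration.

```python
ch = "\u4e25"  # 严

# Character set: the character is mapped to one number.
code_point = ord(ch)
print(hex(code_point))  # 0x4e25

# Character encoding: the same number is stored as different byte sequences.
encodings = ["utf-8", "utf-16-be", "utf-32-be"]
encoded = {name: ch.encode(name) for name in encodings}
for name, data in encoded.items():
    print(name, data.hex(), len(data), "bytes")
# utf-8 e4b8a5 3 bytes
# utf-16-be 4e25 2 bytes
# utf-32-be 00004e25 4 bytes
```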

Two, common character sets and encodings:

Unicode, ASCII, GBK, GB2312, UTF-8

Three, about ASCII

ASCII is the code specification developed in the United States in the 1960s to define the relationship between English characters and binary values. It represents 128 characters: English letters, Arabic numerals, punctuation and other symbols, and the non-printing control characters (codes 0-31, plus 127 for delete). Each character occupies one byte, but only the low 7 bits carry the character (2^7 = 128); the highest bit is always 0.
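As a quick check of the 7-bit layout, the sketch below prints the bit pattern of a few ASCII characters and confirms that the highest bit of each byte is 0. The sample string is arbitrary.

```python
samples = "Az09 ~"
codes = list(samples.encode("ascii"))            # one byte per character
bits = [format(code, "08b") for code in codes]   # 8-bit binary strings

for ch, code, pattern in zip(samples, codes, bits):
    print(repr(ch), code, pattern)               # first line: 'A' 65 01000001

# Every ASCII byte is below 128, so its highest bit is 0.
all_high_bits_zero = all(pattern[0] == "0" for pattern in bits)
print(all_high_bits_zero)                        # True

# A character outside the 128 cannot be encoded as ASCII at all.
try:
    "\u00e9".encode("ascii")                     # é
    encodable = True
except UnicodeEncodeError:
    encodable = False
print(encodable)                                 # False
```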

Four, extended ASCII code

128 characters are enough for English, but some European languages use accented letters, so 7 bits are not enough. Some European countries therefore decided to use the idle highest bit of the byte to add new symbols. For example, in an encoding used for French, é is 130 (binary 10000010). Encodings of this kind can represent at most 256 symbols. This creates a new problem: different countries have different letters, so although they all use 256 symbols, the symbols are not the same. For example, 130 represents é in the French encoding, the letter Gimel (ג) in a Hebrew encoding, and yet another symbol in a Russian encoding. In all of these encodings, 0-127 represent the same symbols; only 128-255 differ. This problem directly prompted the creation of Unicode.
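The clash over byte 130 can be reproduced with legacy code pages that ship with Python. The article does not say which encodings it has in mind, so the three code pages below (cp437, cp862, cp866) are assumed stand-ins for the "French", "Hebrew", and "Russian" encodings.

```python
raw = bytes([130])  # binary 10000010

# Assumed stand-ins: cp437 (original IBM PC code page, includes é),
# cp862 (DOS Hebrew), cp866 (DOS Cyrillic).
code_pages = ["cp437", "cp862", "cp866"]
decoded = {page: raw.decode(page) for page in code_pages}

for page, ch in decoded.items():
    print(page, ascii(ch))
# cp437 '\xe9'     (é)
# cp862 '\u05d2'   (Hebrew letter Gimel)
# cp866 '\u0412'   (Cyrillic capital letter Ve)

# Bytes 0-127 decode to the same text in all three code pages.
low = bytes(range(128))
same_low_half = len({low.decode(page) for page in code_pages}) == 1
print(same_low_half)  # True
```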

Five, the Unicode symbol set

As the previous section showed, there are many encodings in the world, and the same binary number can be interpreted as different symbols. To open a text file you therefore have to know its encoding; if you interpret it with the wrong one, the text comes out garbled. Why do e-mails often appear garbled? Because the sender and the recipient are using different encodings. Unicode is meant to solve this: it contains all the symbols of the world, and each symbol has a unique code. For example, U+0639 is the Arabic letter Ain, U+0041 is the English capital letter A, and U+4E25 is the Chinese character 严 ("strict"). For the full symbol table, consult unicode.org or a dedicated Chinese character table. Many people speak of "Unicode encoding", but Unicode is really a symbol set (the set of all symbols in the world), not a new encoding method.
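The three code points mentioned above can be looked up with Python's standard `unicodedata` module. This is only a sketch for checking individual characters; it is not a substitute for the tables on unicode.org.

```python
import unicodedata

code_points = [0x0639, 0x0041, 0x4E25]
names = {cp: unicodedata.name(chr(cp)) for cp in code_points}

for cp, name in names.items():
    print(f"U+{cp:04X}", name)
# U+0639 ARABIC LETTER AIN
# U+0041 LATIN CAPITAL LETTER A
# U+4E25 CJK UNIFIED IDEOGRAPH-4E25

# ord and chr convert between a character and its code point.
print(ord("A") == 0x41, chr(0x41))  # True A
```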

Precisely because Unicode contains all characters, some of them can be represented with a single byte while others need more than one. That raises two questions. First, given two bytes of data, how does the computer know whether they represent one Chinese character or two English letters? Second, if every character is given the same storage length, say Unicode is always stored in 2 bytes, then for each English character, which needs only 1 byte, the other byte is all zeros, which wastes a great deal of storage space.
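Both questions can be made concrete. In the sketch below, Python's `utf-16-be` codec stands in for the fixed 2-byte scheme the paragraph imagines, and `utf-32-be` for a 4-byte one; that mapping is an assumption made for illustration.

```python
# Question 1: the same two bytes can be two letters or one Chinese character.
pair = bytes([0x4E, 0x25])
as_two_letters = pair.decode("ascii")         # 'N%'
as_one_character = pair.decode("utf-16-be")   # '\u4e25', the character 严
print(as_two_letters, ascii(as_one_character))

# Question 2: a fixed width pads English text with zero bytes.
english = "hello"
sizes = {name: len(english.encode(name)) for name in ["ascii", "utf-16-be", "utf-32-be"]}
print(sizes)                                  # {'ascii': 5, 'utf-16-be': 10, 'utf-32-be': 20}
print(english.encode("utf-16-be").hex())      # 00680065006c006c006f
```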

These two problems had two consequences: 1) several ways of storing Unicode appeared, that is, many different binary formats can be used to represent Unicode; 2) Unicode could not be widely adopted for a long time, until the Internet arrived.

Six, UTF-8

The spread of the Internet created strong demand for a unified encoding. UTF-8 is the most widely used implementation of Unicode on the Internet. Other implementations include UTF-16 and UTF-32, but they are rarely used on the Internet. To repeat: UTF-8 is one of the ways Unicode is implemented.

The defining feature of UTF-8 is that it is a variable-length encoding. It uses 1 to 4 bytes per symbol, and the number of bytes depends on the symbol.
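The variable length is easy to observe. The four sample characters below are chosen only so that each byte length from 1 to 4 appears once.

```python
samples = {
    "A": "A",                 # U+0041
    "e-acute": "\u00e9",      # é, U+00E9
    "yan": "\u4e25",          # 严, U+4E25
    "emoji": "\U0001F600",    # U+1F600
}
lengths = {label: len(ch.encode("utf-8")) for label, ch in samples.items()}
print(lengths)  # {'A': 1, 'e-acute': 2, 'yan': 3, 'emoji': 4}
```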

The encoding rules of UTF-8 are simple; there are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0 and the remaining 7 bits are the symbol's Unicode code point. So for English letters, UTF-8 and ASCII are identical.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1, bit n+1 is set to 0, and the first two bits of every following byte are set to 10. All the remaining bits are filled with the symbol's Unicode code point.
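The two rules can be turned into a small encoder. The function below is a hand-written sketch for illustration, and `utf8_encode` is a hypothetical name; in practice `str.encode("utf-8")` does this job. The code point ranges that decide n (below 0x80, below 0x800, below 0x10000, up to 0x10FFFF) follow from how many free bits each layout has, and the sketch does not reject surrogate code points the way the built-in codec does.

```python
def utf8_encode(code_point: int) -> bytes:
    """Encode one Unicode code point by applying the two rules by hand."""
    # Rule 1: single byte, leading bit 0, the other 7 bits hold the code point.
    if code_point < 0x80:
        return bytes([code_point])

    # Rule 2: choose n from the number of bits the code point needs.
    if code_point < 0x800:
        n = 2
    elif code_point < 0x10000:
        n = 3
    elif code_point <= 0x10FFFF:
        n = 4
    else:
        raise ValueError("code point out of Unicode range")

    # Following bytes: prefix 10 plus 6 payload bits, filled from the low end.
    tail = []
    for _ in range(n - 1):
        tail.append(0b10000000 | (code_point & 0b00111111))
        code_point >>= 6

    # First byte: n ones, then a zero, then the remaining payload bits.
    prefix = (0xFF << (8 - n)) & 0xFF
    return bytes([prefix | code_point] + tail[::-1])


encoded = utf8_encode(0x4E25)                          # 严
print([format(b, "08b") for b in encoded])
# ['11100100', '10111000', '10100101']
print(encoded == "\u4e25".encode("utf-8"))            # True
```

In the output, the first byte starts with 1110 (n = 3) and each following byte starts with 10, as rule 2 requires.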

Seven, GBK/GB2312/GB18030

GB2312 and GBK are both encodings for Chinese characters. GB2312 covers only simplified characters, a little over 6,000 of them, while GBK extends it to more than 20,000 Chinese characters, including traditional ones. GB18030 is not a traditional-character encoding; it extends GBK further so that every Unicode character can be encoded. GB2312 and GBK store each Chinese character in two bytes; GB18030 uses one, two, or four bytes.
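The coverage of the three encodings can be probed with the `gb2312`, `gbk`, and `gb18030` codecs in Python's standard library. The sample characters are arbitrary, and the sketch reports only how these codecs behave.

```python
simplified = "\u4e25"     # 严
traditional = "\u56b4"    # 嚴, the traditional form of the same character
emoji = "\U0001F600"      # a character outside the GBK repertoire

gbk_bytes = simplified.encode("gbk")
print(gbk_bytes.hex(), len(gbk_bytes))       # d1cf 2

# GB2312 has no code for the traditional form; GBK does.
try:
    traditional.encode("gb2312")
    in_gb2312 = True
except UnicodeEncodeError:
    in_gb2312 = False
print(in_gb2312)                              # False
print(len(traditional.encode("gbk")))         # 2

# GB18030 falls back to four bytes for characters GBK cannot express.
print(len(emoji.encode("gb18030")))           # 4
```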

Summary:

ASCII: used for English. Each character takes 1 byte, in which the first bit is 0 and the other 7 bits store data, for 128 characters in total.

Extended ASCII: used for more European scripts. All 8 bits store data, for 256 characters in total.

GBK/GB2312/GB18030: used for Chinese characters. GB2312 covers simplified Chinese, GBK adds traditional characters, and GB18030 can encode every Unicode character.

Unicode: contains all the characters in the world; it is a character set, not an encoding.

UTF-8: one of the ways of implementing Unicode. It uses 1-4 bytes per symbol, and the number of bytes depends on the symbol.
