Encoding problems frequently encountered in HTML and javascript

In our daily front-end development work we constantly deal with HTML, javascript, css and other languages. Just like a natural language, a computer language has its own alphabet, syntax, lexicon, and encoding methods.

Here I will briefly discuss the encoding problems that often come up in day-to-day front-end work with HTML and javascript.
Inside a computer, all stored information is represented as binary code. Encoding is the conversion between that stored binary and what we actually recognize: the English letters and Chinese characters displayed on the screen.

Two basic concepts are needed to describe this: charset and character encoding:

Charset (character set): a table of mappings between symbols and numbers. It determines, for example, that 107 is the 'k' in 'koubei' and 21475 is the '口' in '口碑' (word of mouth). Different tables define different mappings; ascii, gb2312, and Unicode are all character sets. Through such a table of numbers and characters, we can convert a binary number into a particular character.
Character encoding: the encoding method. Do we use \u53e3 or %E5%8F%A3 to represent the character numbered 21475 (口)? That is determined by the character encoding.
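
A minimal sketch, runnable in any browser console, makes the distinction concrete (the values for 口 match the examples later in this article):

'k'.charCodeAt(0); // 107: the number 'k' maps to in the character set
'口'.charCodeAt(0); // 21475: the number 口 maps to
'\u53e3'; // '口': one way of encoding number 21475, as a javascript escape
encodeURIComponent('口'); // '%E5%8F%A3': another way, UTF-8 percent-encoding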

For Americans, the set of commonly used characters is small. They developed the ASCII character set, short for American Standard Code for Information Interchange, which uses the 128 numbers 0-127 (2 to the power of 7, 0x00-0x7F) to represent commonly used characters such as 123abc. That takes 7 bits; with the first bit reserved (as a sign bit, so that two's complement can represent negative numbers and the like), 8 bits make up one byte. The Americans were just a little stingy: if a byte had been designed as 16 or 32 bits from the very beginning, there would be fewer problems in the world, but at the time they thought 8 bits was plenty, since it could already represent 128 different characters!

Computers were invented by Americans, so they saved effort and encoded only their own everyday symbols. But when computers went international, the problem surfaced. Take China as an example: there are tens of thousands of Chinese characters. What should we do?

The existing 8-bit, one-byte system is the foundation and cannot be broken; a byte cannot simply be widened to 16 bits or the like, because the change would be too disruptive. The only other path is to use multiple single-byte codes to represent one character: MBCS (Multi-Byte Character Set).
With the concept of MBCS we can represent more characters. For example, with 2 bytes we have 16 bits, so in theory we can represent 2^16 = 65536 characters. But how are these codes assigned to characters? For example, the Unicode code of the 口 in 口碑 is 21475; who decides that? The character set, i.e. the charset just introduced. ascii is the most basic character set; on top of it we have MBCS character sets such as gb2312 and big5 for Simplified and Traditional Chinese. Finally, an organization called the Unicode Consortium decided to create a character set including all characters (UCS, Universal Character Set) and a corresponding encoding standard: Unicode. It has published the international Unicode standard since 1991 (ISBN 0-321-18578-1), and ISO also took part in this standardization with ISO/IEC 10646: the Universal Character Set. In short, Unicode is a character standard that covers essentially all existing symbols on earth and is now in wide use. The ECMA standard also stipulates that the javascript language uses Unicode for its characters internally (which means javascript variable and function names are allowed to be Chinese!).
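
As a quick illustration of that last point, here is a hypothetical snippet with Chinese identifiers; it is legal javascript precisely because identifier characters are drawn from Unicode:

// Identifiers may use Unicode letters, so Chinese names are legal.
var 口碑 = 'koubei';
function 问候(名字) { return '你好, ' + 名字; }
alert(问候(口碑)); // shows '你好, koubei'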

Developers in China run into many problems described as "converting between gbk, gb2312, and UTF-8". Strictly speaking, that phrasing is not accurate: gbk and gb2312 are character sets, while UTF-8 is an encoding method (character encoding), namely an encoding of the UCS character set defined by the Unicode standard. Because web pages that use the Unicode character set are mainly encoded as UTF-8, the character set and the encoding are often lumped together; in fact, this is imprecise.

With Unicode, at least until human civilization meets aliens, we hold a universal key, so use it. The most widely used Unicode encoding method today is UTF-8 (8-bit UCS/Unicode Transformation Format), and it has a few particularly good properties:

It is universal: it can encode every character in the Unicode character set.
It is a variable-length character encoding, compatible with ascii.
The second advantage means previous pure-ascii systems remain compatible and incur no extra storage (suppose a fixed-length encoding were used instead, specifying that every character consists of two bytes: the storage occupied by ascii characters would double).
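
A small check of that storage claim, assuming a Node.js environment for Buffer (nothing in this article depends on Node; it is just convenient for counting bytes):

Buffer.byteLength('hello', 'utf8'); // 5: pure ascii text stays one byte per character
Buffer.byteLength('hello', 'ucs2'); // 10: a fixed two-byte encoding doubles it
Buffer.byteLength('口', 'utf8'); // 3: non-ascii characters pay the variable-length price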

To describe UTF-8 clearly, it is convenient to introduce a table:

U-00000000-U-0000007F: 0xxxxxxx
U-00000080-U-000007FF: 110xxxxx 10xxxxxx
U-00000800-U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000-U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000-U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000-U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

To understand this table, let's start with the first two rows.

U-00000000-U-0000007F: 0xxxxxxx
The first row means: if the binary value of a UTF-8 encoded byte is 0xxxxxxx, i.e. it starts with 0 and is between 0 and 127 in decimal, then that byte represents a single character by itself, with exactly the same meaning as in ascii. Every other byte of UTF-8 encoded binary has the form 1xxxxxxx: it starts with 1, is greater than 127, and at least 2 such bytes are needed to represent one symbol. The first bit of a byte is therefore a switch indicating whether the byte is an ascii code; this is the compatibility just mentioned. In the English definition, these are two properties of UTF-8 encoding:

UCS characters U+0000 to U+007F (ASCII) are encoded simply as bytes 0x00 to 0x7F (ASCII compatibility). This means that files and strings which contain only 7-bit ASCII characters have the same encoding under both ASCII and UTF-8.
All UCS characters > U+007F are encoded as a sequence of several bytes, each of which has the most significant bit set. Therefore, no ASCII byte (0x00-0x7F) can appear as part of any other character.
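
In code, that "switch" is simply the most significant bit of each byte; a minimal sketch (isAsciiByte is a hypothetical helper, not a built-in):

// Is this byte a plain ascii byte? 0xxxxxxx means yes.
function isAsciiByte(b) { return (b & 0x80) === 0; }
isAsciiByte(0x61); // true: 'a'
isAsciiByte(0xE5); // false: the first byte of 口 in UTF-8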

Then let's look at the second line:

U-00000080-U-000007FF: 110xxxxx 10xxxxxx
Look first at the first byte, 110xxxxx. It says: I am not an ascii code (my first bit is not 0); I am the first byte of a multi-byte character (my second bit is 1); the character I belong to consists of two bytes (my third bit is 0); and from the fourth bit onward is where the character information is stored.
Now the second byte, 10xxxxxx. It says: I am not an ascii code (my first bit is not 0); I am not the first byte of a multi-byte character (my second bit is 0); and from the third bit onward is where the character information is stored.

From this example we can work out the general scheme. In UTF-8, a symbol may occupy two to six bytes inside a long run of consecutive binary bytes, so compared with ascii, where one byte is one symbol, we need room to store two extra pieces of information: 1. the starting position of a symbol, a kind of starter (in biology, the AUG codon that marks where protein translation begins); 2. the number of bytes the symbol occupies (if every symbol has a starter, this length could in principle be omitted, but carrying the length improves fault tolerance when some bytes are lost). The solution: use the second bit of a byte to mark whether the byte is the starting byte of a character (the first bit is already taken: 0 means ascii, 1 means non-ascii), so the first byte of a multi-byte symbol must be 11xxxxxx, a binary number between 192 and 255. The length information then starts from the third bit: a third bit of 0 means the character occupies 2 bytes, and each additional leading 1 adds one more byte to the character's length. UTF-8 allows at most 6 bytes, whose starter needs four more 1 bits than the 2-byte starter 110xxxxx, so that starter is 1111110x, as the table above shows.
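
That length rule translates directly into a sketch that reads the starter byte (sequenceLength is a hypothetical helper, written against the 6-byte table above):

// How many bytes does a UTF-8 sequence use, judging by its first byte?
function sequenceLength(firstByte) {
  if ((firstByte & 0x80) === 0x00) return 1; // 0xxxxxxx: ascii
  if ((firstByte & 0xE0) === 0xC0) return 2; // 110xxxxx
  if ((firstByte & 0xF0) === 0xE0) return 3; // 1110xxxx
  if ((firstByte & 0xF8) === 0xF0) return 4; // 11110xxx
  if ((firstByte & 0xFC) === 0xF8) return 5; // 111110xx
  if ((firstByte & 0xFE) === 0xFC) return 6; // 1111110x
  throw new Error('not a starter byte (e.g. a 10xxxxxx continuation byte)');
}
sequenceLength(0xE5); // 3: 口 occupies three bytes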
Let's look at the standard defined in English:

The first byte of a multibyte sequence that represents a non-ASCII character is always in the range 0xC0 to 0xFD and it indicates how many bytes follow for this character. All further bytes in a multibyte sequence are in the range 0x80 to 0xBF. This allows easy resynchronization and makes the encoding stateless and robust against missing bytes.

The real information bits (the character's number in the charset) are placed, in binary order, into the 'x' positions of the table above. Take what Chinese programmers deal with most, Chinese characters: their range is U-00000800-U-0000FFFF, and the table shows that UTF-8 uses three bytes in this range (so a UTF-8 encoded Chinese character takes more storage than in the gb2312 character set's EUC-CN encoding, where each character occupies 2 bytes). Take the character 口 as an example; its number in Unicode is:
口: 21475 = 0x53E3 = binary 101001111100011
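
Following the third row of the table, those bits, padded to 16 as 0101001111100011, are split into 0101 / 001111 / 100011 and dropped into 1110xxxx 10xxxxxx 10xxxxxx, giving the bytes 0xE5 0x8F 0xA3. A sketch of that packing (utf8ThreeBytes is a hypothetical helper handling only the three-byte range U-00000800 to U-0000FFFF):

function utf8ThreeBytes(codePoint) {
  return [
    0xE0 | (codePoint >> 12),         // 1110xxxx: top 4 bits
    0x80 | ((codePoint >> 6) & 0x3F), // 10xxxxxx: middle 6 bits
    0x80 | (codePoint & 0x3F)         // 10xxxxxx: low 6 bits
  ].map(function (b) { return b.toString(16).toUpperCase(); });
}
utf8ThreeBytes(0x53E3); // ['E5', '8F', 'A3'], matching encodeURI('口') below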

In javascript, run the following code (use the firebug console, or edit an HTML file and insert the code into a script tag):

alert('\u53e3'); // shows '口'
alert(escape('口')); // shows '%u53E3'
alert(String.fromCharCode(21475)); // shows '口'
alert('口'.charCodeAt(0)); // shows '21475'
alert(encodeURI('口')); // shows '%E5%8F%A3'

We can see that a string can be written in the form \u + hexadecimal Unicode code, while the fromCharCode method accepts the decimal Unicode code and yields the character 口.

The second alert yields '%u53E3'. This is a non-standard Unicode notation sometimes seen in place of URI percent-encoding; however, it was rejected by the W3C and appears in no RFC. Only the ECMA-262 standard specifies the behavior of escape, so it is best treated as legacy.
What is interesting is the fifth alert, which yields '%E5%8F%A3'. What is this, and where does it come from?

This is the percent-encoding commonly used in URIs, as defined in RFC 3986.
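
Going the other way, decodeURIComponent undoes this percent-encoding; a quick check:

decodeURIComponent('%E5%8F%A3'); // '口': the three UTF-8 bytes decoded back
encodeURIComponent('口') === '%E5%8F%A3'; // true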
