Unicode character set support in JavaScript, explained in detail

Last Update:2015-01-01 Source: Internet

Author: User


Last month I gave a talk that covered the Unicode character set in detail, along with the JavaScript language's support for it. Below is the script of that talk.

1. What is Unicode?

Unicode comes from a very simple idea: put every character in the world into one set. As long as a computer supports this character set, it can display all characters, and garbled text disappears.

It starts from 0 and assigns a number to each symbol, called a "code point". For example, code point 0 is the null character (all binary bits are 0).

    U+0000 = null

In the line above, U+ indicates that the hexadecimal number following it is a Unicode code point.

At the time of writing, the latest version of Unicode is 7.0, which contains 109449 symbols, of which 74500 are Chinese, Japanese, and Korean (CJK) characters. Roughly speaking, more than two thirds of the symbols in use in the world come from East Asian scripts. For example, the code point of the Chinese character "好" (good) is 597D in hexadecimal.

    U+597D = 好

With so many symbols, Unicode is not defined all at once but in partitions. Each partition holds 65536 (2^16) code points and is called a plane. There are currently 17 planes, so the whole Unicode code space holds 17 × 65536 = 1114112 code points, and any code point fits in 21 bits (5 bits for the plane, 16 bits for the position within it).

The first 65536 code points form the Basic Multilingual Plane (BMP). Its code points range from 0 to 2^16 - 1, written in hexadecimal as U+0000 to U+FFFF. All of the most common characters are placed on this plane, which was the first plane Unicode defined and published.

All remaining characters are placed on the supplementary planes, whose code points range from U+010000 to U+10FFFF.
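The plane layout described above can be sketched in a few lines. This is only an illustration: the helper name planeOf is hypothetical and does not come from the talk.

```javascript
// Hypothetical helper (not from the original talk): returns the index of
// the plane that a code point belongs to.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('not a Unicode code point: ' + codePoint);
  }
  // Each plane holds 0x10000 (65536) code points.
  return Math.floor(codePoint / 0x10000);
}

console.log(planeOf(0x597D));   // 0  (basic plane)
console.log(planeOf(0x1D306));  // 1  (first supplementary plane)
console.log(planeOf(0x10FFFF)); // 16 (last of the 17 planes)
```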

2. UTF-32 and UTF-8

Unicode only specifies the code point of each character. How that code point is represented in bytes is a matter of encoding.

The most intuitive encoding represents every code point with four bytes whose content corresponds directly to the code point. This encoding is called UTF-32. For example, code point 0 is represented by four zero bytes, and code point 597D is padded with two leading zero bytes.

    U+0000 = 0x0000 0000
    U+597D = 0x0000 597D
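As a rough sketch of the rule above, the following writes a code point as four bytes. It assumes big-endian byte order, which the talk does not specify, and the helper names toUtf32Bytes and hex are hypothetical.

```javascript
// Hypothetical helper: encodes one code point as 4 bytes.
// Big-endian order is an assumption made for this illustration.
function toUtf32Bytes(codePoint) {
  const bytes = new Uint8Array(4);
  new DataView(bytes.buffer).setUint32(0, codePoint, false); // false = big-endian
  return bytes;
}

// Hypothetical helper: formats bytes as two-digit hexadecimal numbers.
function hex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join(' ');
}

console.log(hex(toUtf32Bytes(0x0000))); // 00 00 00 00
console.log(hex(toUtf32Bytes(0x597D))); // 00 00 59 7d
```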

The advantages of UTF-32 are that the conversion rule is simple and intuitive and that lookups are efficient. The disadvantage is wasted space: the same English text takes four times as much space as it does in ASCII. This drawback is fatal. Hardly anyone actually uses this encoding, and the HTML5 standard explicitly states that web pages must not be encoded as UTF-32.

What people really needed was a space-saving encoding, which led to the birth of UTF-8. UTF-8 is a variable-length encoding in which a character takes from 1 to 4 bytes. The more common a character is, the fewer bytes it takes. The first 128 characters take a single byte each, exactly as in ASCII.

    Code point range       Bytes
    0x0000 - 0x007F        1
    0x0080 - 0x07FF        2
    0x0800 - 0xFFFF        3
    0x010000 - 0x10FFFF    4
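The table above translates directly into a lookup. The following is a minimal sketch; the helper name utf8ByteLength is hypothetical, and it only reports the byte count rather than performing the actual UTF-8 conversion.

```javascript
// Hypothetical helper: number of bytes UTF-8 needs for one code point,
// following the range table above. It does not produce the bytes themselves.
function utf8ByteLength(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('not a Unicode code point: ' + codePoint);
  }
  if (codePoint <= 0x7F) return 1;
  if (codePoint <= 0x7FF) return 2;
  if (codePoint <= 0xFFFF) return 3;
  return 4;
}

console.log(utf8ByteLength(0x41));    // 1 (the letter A, same as ASCII)
console.log(utf8ByteLength(0x597D));  // 3
console.log(utf8ByteLength(0x1D306)); // 4
```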

Because it saves space, UTF-8 has become the most common encoding on the web. It has little to do with today's topic, though, so I will not go into depth. For the details of the conversion, see the article "Character Encoding Notes".

3. Introduction to UTF-16

UTF-16 sits between UTF-32 and UTF-8 and combines the characteristics of fixed-length and variable-length encodings.

Its encoding rule is simple: characters in the basic plane take 2 bytes, and characters in the supplementary planes take 4 bytes. In other words, a UTF-16 character is either 2 bytes long (U+0000 to U+FFFF) or 4 bytes long (U+010000 to U+10FFFF).

This raises a question: when we encounter two bytes, how do we know whether they form a character on their own or must be interpreted together with the next two bytes?

The solution is clever, though I do not know whether it was designed that way on purpose. In the basic plane, the range from U+D800 to U+DFFF is an empty segment: these code points do not correspond to any character. The empty segment can therefore be used to map characters of the supplementary planes.

Specifically, the supplementary planes hold 2^20 code points, so at least 20 binary bits are needed to address them. UTF-16 splits those 20 bits in half. The first 10 bits are mapped into U+D800 to U+DBFF (a space of 2^10) and are called the high surrogate (H). The last 10 bits are mapped into U+DC00 to U+DFFF (also a space of 2^10) and are called the low surrogate (L). In effect, one supplementary-plane character is split into two basic-plane values.

Therefore, when we encounter two bytes whose value lies between U+D800 and U+DBFF, we can conclude that the value of the two bytes that follow should lie between U+DC00 and U+DFFF, and that the four bytes must be interpreted together.
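The range test described above can be sketched as follows. The function name classifyCodeUnit and the labels it returns are hypothetical, chosen only for this illustration.

```javascript
// Hypothetical helper: decides how a 2-byte UTF-16 value should be read,
// based on the reserved range U+D800 to U+DFFF described above.
function classifyCodeUnit(unit) {
  if (unit >= 0xD800 && unit <= 0xDBFF) return 'high';   // first half of a 4-byte character
  if (unit >= 0xDC00 && unit <= 0xDFFF) return 'low';    // second half of a 4-byte character
  return 'single';                                       // a complete basic-plane character
}

console.log(classifyCodeUnit(0x597D)); // single
console.log(classifyCodeUnit(0xD834)); // high
console.log(classifyCodeUnit(0xDF06)); // low
```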

4. Transcoding formula for UTF-16

When converting a Unicode code point to UTF-16, first determine whether it is a basic-plane character or a supplementary-plane character. If it is the former, the code point is written out directly in hexadecimal, two bytes long.

    U+597D = 0x597D

For supplementary-plane characters, Unicode 3.0 gives the transcoding formula.

    H = Math.floor((c - 0x10000) / 0x400) + 0xD800
    L = (c - 0x10000) % 0x400 + 0xDC00

Take the character 𝌆 as an example. It is a supplementary-plane character with code point U+1D306, and its conversion to UTF-16 goes as follows.

    H = Math.floor((0x1D306 - 0x10000) / 0x400) + 0xD800 = 0xD834
    L = (0x1D306 - 0x10000) % 0x400 + 0xDC00 = 0xDF06

Therefore, the UTF-16 encoding of 𝌆 is 0xD834 DF06, four bytes long.
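The formula can be wrapped in a function and checked against the worked example. In this sketch the names toSurrogatePair and fromSurrogatePair are hypothetical, and the reverse conversion is not stated in the talk; it is simply the formula solved for c.

```javascript
// Hypothetical helper: applies the transcoding formula to a
// supplementary-plane code point c and returns [H, L].
function toSurrogatePair(c) {
  if (!Number.isInteger(c) || c < 0x10000 || c > 0x10FFFF) {
    throw new RangeError('not a supplementary-plane code point: ' + c);
  }
  const H = Math.floor((c - 0x10000) / 0x400) + 0xD800;
  const L = ((c - 0x10000) % 0x400) + 0xDC00;
  return [H, L];
}

// Hypothetical helper: the same formula solved for c (an addition for
// illustration, not part of the original talk).
function fromSurrogatePair(H, L) {
  return (H - 0xD800) * 0x400 + (L - 0xDC00) + 0x10000;
}

const [H, L] = toSurrogatePair(0x1D306);
console.log(H.toString(16), L.toString(16));         // d834 df06
console.log(fromSurrogatePair(H, L).toString(16));   // 1d306
```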

5. What encoding does JavaScript use?

JavaScript uses the Unicode character set, but supports only one encoding.

That encoding is not UTF-16, not UTF-8, and not UTF-32. JavaScript uses none of them.

JavaScript uses UCS-2!

6. UCS-2 encoding

Why does UCS-2 suddenly come up? That takes a bit of history.

Before the Internet existed, two teams each set out to create a unified character set. One was the Unicode team, founded in 1989; the other was the UCS team, founded in 1988. When they discovered each other's existence, they quickly reached an agreement: the world does not need two unified character sets.

In October 1991, the two teams decided to merge their character sets. From then on only one character set would be published, namely Unicode, and the previously published character sets would be revised so that UCS code points are exactly the same as Unicode's.

In practice, UCS was developing faster than Unicode at the time. As early as 1990 it published its first encoding, UCS-2, which uses 2 bytes to represent the characters that had already been assigned code points. (At that time there was only one plane, the basic plane, so 2 bytes were enough.) UTF-16 was not published until July 1996, and it was explicitly declared a superset of UCS-2: basic-plane characters keep their UCS-2 encoding, and supplementary-plane characters get a 4-byte representation.

Put simply, UTF-16 replaced UCS-2, or UCS-2 was absorbed into UTF-16. So today there is only UTF-16, and no UCS-2.

7. Background of JavaScript

So why did JavaScript not choose the more advanced UTF-16, and instead use the already obsolete UCS-2?

The answer is simple: it was not unwilling, it was unable. When the JavaScript language appeared, UTF-16 did not yet exist.

In May 1995, Brendan Eich designed the JavaScript language in 10 days. In October, the first interpreter engine was released. In November of the following year, Netscape formally submitted the language standard to ECMA (for the whole story, see the article "JavaScript birth notes"). Compare this with the publication date of UTF-16 (July 1996) and it is clear that Netscape had no other choice at the time: UCS-2 was the only encoding available.

8. Limitations of JavaScript character functions

Because JavaScript can only handle UCS-2, every character in the language is 2 bytes. A 4-byte character is treated as two double-byte characters. JavaScript's character functions are all affected by this and cannot return correct results.

Take the character 𝌆 again. Its UTF-16 encoding is the 4 bytes 0xD834 DF06. The problem is that this 4-byte encoding is not part of UCS-2, so JavaScript does not recognize it and sees only two separate characters, U+D834 and U+DF06. As mentioned above, these two code points are empty, so JavaScript treats 𝌆 as a string made up of two empty characters.

Calling length, charAt(0), and charCodeAt(0) on this character shows that JavaScript considers its length to be 2, that the first character it returns is an empty (unprintable) character, and that the code point of that first character is 0xD834. All of these results are wrong!
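The behavior described above can be reproduced with a short snippet. The variable name tetragram is chosen for this illustration, and the character is written with escape sequences so the example does not depend on the file's encoding.

```javascript
// 𝌆 (U+1D306), written as its two UTF-16 halves.
const tetragram = '\uD834\uDF06';

console.log(tetragram.length);                      // 2, although it is one character
console.log(tetragram.charAt(0) === '\uD834');      // true: only the first half is returned
console.log(tetragram.charCodeAt(0).toString(16));  // d834
console.log(tetragram.charCodeAt(1).toString(16));  // df06
```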

To solve this problem, you have to test the code point and adjust manually. The following is the correct way to traverse a string.

    while (++index < length) {
      // ...
      if (charCode >= 0xD800 && charCode <= 0xDBFF) {
        // First half of a 4-byte character: read it together with the next value.
        output.push(character + string.charAt(++index));
      } else {
        output.push(character);
      }
    }

The code above shows that, when traversing a string, you must test each code point: whenever it falls in the range 0xD800 to 0xDBFF, it has to be read together with the 2 bytes that follow.
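The loop above leaves out how its variables are set up. One way to complete it into a runnable function is sketched below. The wrapper name splitIntoCharacters and the initialization of index, character, and charCode are assumptions, and the extra check that the next value really lies in 0xDC00 to 0xDFFF is an addition that the original loop does not make.

```javascript
// Hypothetical completion of the traversal loop: splits a string into an
// array of whole characters, keeping both halves of a 4-byte character together.
function splitIntoCharacters(string) {
  const output = [];
  const length = string.length;
  let index = -1;
  while (++index < length) {
    const character = string.charAt(index);
    const charCode = character.charCodeAt(0);
    const nextCode = index + 1 < length ? string.charCodeAt(index + 1) : -1;
    const isPair =
      charCode >= 0xD800 && charCode <= 0xDBFF && // first half
      nextCode >= 0xDC00 && nextCode <= 0xDFFF;   // second half (added safety check)
    if (isPair) {
      output.push(character + string.charAt(++index));
    } else {
      output.push(character);
    }
  }
  return output;
}

const parts = splitIntoCharacters('a\uD834\uDF06b');
console.log(parts.length);                 // 3
console.log(parts[1] === '\uD834\uDF06');  // true
```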

Similar problems exist in all of JavaScript's character functions.

    String.prototype.replace()
    String.prototype.substring()
    String.prototype.slice()
    ...

The functions above work correctly only for 2-byte code points. To handle 4-byte code points correctly, you have to write your own versions one by one, each checking the code point range of the current character.

9. ECMAScript 6

The next version of JavaScript, ECMAScript 6 (ES6), greatly strengthens Unicode support and basically solves this problem.

(1) Correct Character Recognition

ES6 recognizes 4-byte code points automatically, so traversing a string is much simpler.

    for (let s of string) {
      // ...
    }

However, to stay compatible, the length property keeps its original behavior. To get the correct length of a string, you can use the following.

    Array.from(string).length
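A short check of both points, using a sample string chosen for this illustration (the variable names sample and seen are not from the talk):

```javascript
// 'a', 𝌆 (U+1D306, written as its two halves), 'b'
const sample = 'a\uD834\uDF06b';

// for...of visits whole characters, not 2-byte halves.
const seen = [];
for (const s of sample) {
  seen.push(s);
}

console.log(seen.length);                // 3
console.log(sample.length);              // 4, the old behavior is kept
console.log(Array.from(sample).length);  // 3
```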

(2) code point representation

JavaScript lets you write a Unicode character directly by its code point, in the form "backslash + u + code point".

    '好' === '\u597D' // true

However, this notation does not work for 4-byte code points. ES6 fixes the problem: the code point is recognized correctly as long as it is placed in braces.
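The difference between the two notations can be seen in a small comparison (the variable names are chosen for this illustration):

```javascript
// Without braces, only the first four hex digits are read as the code point;
// the trailing '6' is an ordinary character.
const withoutBraces = '\u1D306';
console.log(withoutBraces === '\u1D30' + '6');   // true
console.log(withoutBraces.length);               // 2

// With braces (ES6), the whole number is read as one code point.
const withBraces = '\u{1D306}';
console.log(withBraces === '\uD834\uDF06');      // true

// The brace form also works for basic-plane characters.
console.log('\u{597D}' === '\u597D');            // true
```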

(3) string processing functions

ES6 adds several functions dedicated to handling 4-byte code points.

    String.fromCodePoint(): returns the string for a given Unicode code point.
    String.prototype.codePointAt(): returns the code point of a character in a string.
    String.prototype.at(): returns the character at a given position in a string.

Note that String.prototype.at() was only a proposal at the time and is not part of the final ES6 standard.
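A brief sketch of the first two functions, compared with their older counterparts (the variable name fromPoint is chosen for this illustration):

```javascript
// String.fromCodePoint accepts code points above 0xFFFF.
const fromPoint = String.fromCodePoint(0x1D306);
console.log(fromPoint === '\uD834\uDF06');                  // true

// The older String.fromCharCode keeps only the low 16 bits.
console.log(String.fromCharCode(0x1D306) === '\uD306');     // true

// codePointAt reads both halves when it starts on the first half...
console.log(fromPoint.codePointAt(0).toString(16));         // 1d306
// ...while charCodeAt returns only the first half.
console.log(fromPoint.charCodeAt(0).toString(16));          // d834

// The position is still counted in 2-byte units, so position 1 is the second half.
console.log(fromPoint.codePointAt(1).toString(16));         // df06
```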

(4) Regular Expression

ES6 provides the u flag, which adds support for 4-byte code points in regular expressions.
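The effect of the flag is easiest to see on the dot, which matches exactly one character. The test string below is chosen for this illustration.

```javascript
// 𝌆 (U+1D306), written as its two halves.
const symbol = '\uD834\uDF06';

// Without the u flag, the dot matches one 2-byte unit, so the whole
// string does not match a single dot.
console.log(/^.$/.test(symbol));    // false

// With the u flag, the dot matches the whole 4-byte character.
console.log(/^.$/u.test(symbol));   // true

// The brace notation for code points is available inside the pattern too,
// including in character ranges.
console.log(/\u{1D306}/u.test(symbol));                // true
console.log(/^[\u{1D300}-\u{1D310}]$/u.test(symbol));  // true
```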

(5) Unicode normalization

Some characters consist of a letter plus an additional mark. For example, in the Chinese pinyin letter Ǒ, the tone mark above the letter is an additional mark. Such marks are very important in many European languages.

Unicode provides two ways to represent them. One is a single character that already carries the mark, so one code point represents one character; for example, the code point of Ǒ is U+01D1. The other treats the mark as a separate code point that is displayed combined with the base character, so two code points represent one character; for example, Ǒ can be written as O (U+004F) + ˇ (U+030C).

    // Method 1
    '\u01D1'
    // 'Ǒ'

    // Method 2
    '\u004F\u030C'
    // 'Ǒ'

The two representations look the same and mean the same, and should be treated as equivalent. JavaScript, however, cannot tell that they are.

    '\u01D1' === '\u004F\u030C'
    // false

ES6 provides the normalize method, which performs "Unicode normalization" and converts both representations into the same sequence.

    '\u01D1'.normalize() === '\u004F\u030C'.normalize() // true
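The normalize method also accepts a form name. The sketch below uses 'NFC' (composed) and 'NFD' (decomposed), which are standard form names but are not mentioned in the talk; the variable names are chosen for this illustration.

```javascript
const composed = '\u01D1';          // Ǒ as one code point
const decomposed = '\u004F\u030C';  // O followed by the combining mark

console.log(composed === decomposed);                   // false

// NFC merges the letter and the mark into one code point.
console.log(decomposed.normalize('NFC') === composed);  // true
console.log(decomposed.normalize('NFC').length);        // 1

// NFD splits the single code point into letter plus mark.
console.log(composed.normalize('NFD') === decomposed);  // true
console.log(composed.normalize('NFD').length);          // 2
```

Calling normalize() with no argument uses NFC.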

For more information about ES6, see ECMAScript 6.

======================================

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
