Unicode and JavaScript in detail


Last month I gave a talk about the Unicode character set and JavaScript's support for it. Here is the transcript of that talk.

I. What is Unicode?

Unicode grew out of a very simple idea: put every character in the world into a single character set. As long as a computer supports this one character set, it can display every character, and garbled text disappears.

Unicode starts at 0 and assigns a number to each symbol; this number is called the code point. For example, the symbol at code point 0 is null (meaning all bits are 0).

U+0000 = null

Here the prefix U+ indicates a Unicode code point, and the number that follows it is the code point in hexadecimal.

Currently, the latest version of Unicode is 7.0, which contains a total of 109,449 symbols, 74,500 of them CJK (Chinese, Japanese and Korean) characters. Roughly speaking, more than two-thirds of the world's existing symbols come from East Asian scripts. For example, the Chinese character 好 ("good") has the hexadecimal code point 597D.

U+597D = 好 (good)

With so many symbols, Unicode was not defined all at once but in partitions. Each partition can hold 65,536 (2^16) characters and is called a plane. There are currently 17 planes (roughly 2^5), so the entire Unicode character set is now about 2^21 code points in size.

The first 65,536 code points form the Basic Multilingual Plane (abbreviated BMP). Its code point range runs from 0 to 2^16-1, written in hexadecimal as U+0000 to U+FFFF. All of the most common characters are placed on this plane, which was the first plane Unicode defined and published.

The remaining characters are placed on the supplementary planes (abbreviated SMP), whose code points range from U+010000 to U+10FFFF.

II. UTF-32 and UTF-8

Unicode only specifies each character's code point; how that code point is represented as a sequence of bytes is a matter of encoding.

The most straightforward encoding uses four bytes for every code point, with the byte values corresponding directly to the code point. This encoding is called UTF-32. For example, code point 0 is encoded as four bytes of 0, and code point 597D is the same value preceded by two bytes of 0.

U+0000 = 0x0000 0000
U+597D = 0x0000 597D

The advantage of UTF-32 is that the conversion rule is simple and direct and lookup by position is fast. The drawback is wasted space: the same English text becomes four times as large as it would be in ASCII. This drawback is fatal in practice; virtually nobody uses this encoding, and the HTML5 standard explicitly states that web pages must not be encoded in UTF-32.

What people really want is a space-efficient encoding, which led to the birth of UTF-8. UTF-8 is a variable-length encoding whose characters range from 1 byte to 4 bytes long. The more commonly used a character is, the shorter its encoding: the first 128 characters use only 1 byte each, exactly the same as ASCII.

Code point range       Bytes
0x0000 - 0x007F        1
0x0080 - 0x07FF        2
0x0800 - 0xFFFF        3
0x010000 - 0x10FFFF    4

Because it saves space, UTF-8 has become the most common web page encoding on the Internet. Since it is not the focus of today's topic, I will not go into the specific transcoding rules here; see my earlier Character Encoding Notes for details.
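
As an aside, the byte counts in the table above can be checked with the standard TextEncoder API (available in modern browsers and Node.js), which encodes a JavaScript string as UTF-8:

var utf8 = new TextEncoder();
utf8.encode('A').length   // 1  (U+0041, ASCII range)
utf8.encode('é').length   // 2  (U+00E9)
utf8.encode('好').length  // 3  (U+597D)
utf8.encode('𝌆').length  // 4  (U+1D306)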

III. Introduction to UTF-16

UTF-16 sits between UTF-32 and UTF-8 and combines the characteristics of both: it is partly fixed-length and partly variable-length.

Its encoding rule is simple: characters in the basic plane occupy 2 bytes, and characters in the supplementary planes occupy 4 bytes. In other words, a UTF-16 encoded character is either 2 bytes (U+0000 to U+FFFF) or 4 bytes (U+010000 to U+10FFFF) long.

This raises a question: when we encounter two bytes, how do we tell whether they stand for a character on their own or must be interpreted together with another two bytes?

Quite cleverly (I do not know whether it was a deliberate design), the range U+D800 to U+DFFF in the basic plane is an empty segment: these code points do not correspond to any character. This empty segment can therefore be used to encode the characters of the supplementary planes.

Specifically, the supplementary planes contain 2^20 code points, so at least 20 bits are needed to address them. UTF-16 splits these 20 bits into two halves: the first 10 bits are mapped into U+D800 to U+DBFF (a space of size 2^10), called the high surrogate (H), and the last 10 bits are mapped into U+DC00 to U+DFFF (also of size 2^10), called the low surrogate (L). In other words, one supplementary-plane character is split into, and represented by, two basic-plane code units.

So, when we encounter two bytes whose code point lies between U+D800 and U+DBFF, we know that the code point of the next two bytes must lie between U+DC00 and U+DFFF, and that the four bytes have to be interpreted together.

IV. The transcoding formula of UTF-16

When a Unicode code point is converted to UTF-16, the first step is to distinguish whether it is a basic-plane character or a supplementary-plane character. If it is the former, the code point is simply written out in its hexadecimal form, two bytes long.

U+597D = 0x597D

If it is a supplementary-plane character, Unicode version 3.0 gives the following transcoding formula.

H = Math.floor((c - 0x10000) / 0x400) + 0xD800
L = (c - 0x10000) % 0x400 + 0xDC00

Take the character 𝌆 as an example. It is a supplementary-plane character with code point U+1D306, and the calculation that converts it to UTF-16 goes as follows.

H = Math.floor((0x1D306 - 0x10000) / 0x400) + 0xD800 = 0xD834
L = (0x1D306 - 0x10000) % 0x400 + 0xDC00 = 0xDF06

Therefore, the UTF-16 encoding of 𝌆 is 0xD834 0xDF06, four bytes in length.
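
The formula is easy to express in JavaScript. The sketch below is my own illustration rather than part of the standard; the function names are hypothetical:

function toSurrogatePair(c) {
  // c is a supplementary-plane code point (0x10000 - 0x10FFFF)
  var h = Math.floor((c - 0x10000) / 0x400) + 0xD800;
  var l = (c - 0x10000) % 0x400 + 0xDC00;
  return [h, l];
}

function fromSurrogatePair(h, l) {
  // the reverse direction: surrogate pair back to a code point
  return (h - 0xD800) * 0x400 + (l - 0xDC00) + 0x10000;
}

toSurrogatePair(0x1D306).map(function (n) { return n.toString(16); }) // ['d834', 'df06']
fromSurrogatePair(0xD834, 0xDF06).toString(16)                        // '1d306'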

V. What encoding does JavaScript use?

The JavaScript language uses the Unicode character set, but only one encoding method is supported.

That encoding is neither UTF-16 nor UTF-8 nor UTF-32; none of the encodings described above is what JavaScript uses.

JavaScript uses UCS-2!

VI. The UCS-2 encoding

Where did UCS-2 suddenly come from? That takes a little history.

Before the Internet existed, two teams independently set out to build a unified character set. One was the Unicode team, founded in 1989; the other was the UCS team, set up even earlier, in 1988. When they discovered each other, they quickly agreed: the world does not need two incompatible unified character sets.

In October 1991, the two teams decided to merge their character sets. From then on only one character set would be published, namely Unicode, and the previously published UCS would be revised so that its code points matched Unicode exactly.

In practice UCS had been developing faster than Unicode: as early as 1990 it published its first encoding, UCS-2, which uses 2 bytes to represent any character that already has a code point. (At the time there was only one plane, the basic plane, so 2 bytes were enough.) UTF-16 was not released until July 1996, and it was explicitly declared a superset of UCS-2: basic-plane characters keep their UCS-2 encoding, and supplementary-plane characters get a 4-byte representation.

Simply put, the relationship between the two is that UTF-16 superseded UCS-2, or that UCS-2 was absorbed into UTF-16. So today there is only UTF-16; UCS-2 no longer exists on its own.

VII. The birth of JavaScript

So why did JavaScript not choose the more advanced UTF-16, and instead use UCS-2, which has since been retired?

The answer is simple: not that it did not want to, but that it could not. UTF-16 did not yet exist when the JavaScript language appeared.

In May 1995, Brendan Eich designed the JavaScript language in 10 days; in October of that year the first interpreter was released; and in November of the following year Netscape formally submitted the language standard to ECMA (see "The Birth of JavaScript" for the full story). Compare these dates with the release date of UTF-16 (July 1996) and you will see that Netscape had no other choice at the time: UCS-2 was the only encoding available!

VIII. Limitations of JavaScript's character functions

Since JavaScript can only handle UCS-2, every character in the language is 2 bytes; a 4-byte character is treated as two 2-byte characters. JavaScript's character functions are all affected by this and cannot return correct results.

Take the character 𝌆 again as an example: its UTF-16 encoding is the 4-byte sequence 0xD834 0xDF06. The problem is that this 4-byte form does not exist in UCS-2, so JavaScript sees it as two separate characters, U+D834 and U+DF06. As mentioned above, these two code points are empty, so JavaScript regards 𝌆 as a string of two empty characters!
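
For example, here is a minimal sketch of what an ES5 engine reports (the variable name is only for illustration):

var s = '𝌆';

s.length          // 2
s.charAt(0)       // '' (a lone surrogate, nothing displayable)
s.charAt(1)       // ''
s.charCodeAt(0)   // 55348 (0xD834)
s.charCodeAt(1)   // 57094 (0xDF06)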

The code above shows that JavaScript thinks the length of the character is 2, that the first character retrieved is an empty character, and that the code point of the first character retrieved is 0xD834. None of these results is what we want!

To solve this problem, you have to check the code point and adjust by hand. The following is the correct way to traverse a string.

while (++index < length) {
  var character = string.charAt(index);
  var charCode = character.charCodeAt(0);
  if (charCode >= 0xD800 && charCode <= 0xDBFF) {
    // a high surrogate: read the following code unit together with it
    output.push(character + string.charAt(++index));
  } else {
    output.push(character);
  }
}

The code above shows that, when traversing a string, you must check the code point: whenever it falls in the range 0xD800 to 0xDBFF, it has to be read together with the following two bytes.

A similar problem exists in all JavaScript character manipulation functions.

  • String.prototype.replace()
  • String.prototype.substring()
  • String.prototype.slice()
  • ...

These functions only work correctly for 2-byte code points. To handle 4-byte code points properly, you have to roll your own versions, checking the code point range of the current character each time.
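
As an illustration (my own sketch, not code from the talk), here is a hand-rolled, surrogate-aware length function in the ES5 style described above; the function name is hypothetical:

function realLength(string) {
  var count = 0;
  for (var i = 0; i < string.length; i++) {
    var charCode = string.charCodeAt(i);
    if (charCode >= 0xD800 && charCode <= 0xDBFF) {
      i++; // skip the low surrogate that follows a high surrogate
    }
    count++;
  }
  return count;
}

realLength('a𝌆b') // 3, whereas 'a𝌆b'.length is 4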

IX. ECMAScript 6

The next version of JavaScript, ECMAScript 6 (abbreviated ES6), greatly enhances Unicode support and basically solves these problems.

(1) Correct character recognition

ES6 can automatically recognize 4-byte characters, which makes traversing a string much simpler.

for (let s of string) {
  // ...
}
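
For example (a quick check of my own, assuming an ES6 environment):

for (let s of 'a𝌆b') {
  console.log(s);
}
// a
// 𝌆
// b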

However, for compatibility, the length property keeps its original behavior. To get the correct length of a string, you can use the following approach.

Array.from(string).length
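
For example (again assuming an ES6 environment):

'𝌆'.length             // 2
Array.from('𝌆').length // 1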

(2) Code point notation

JavaScript allows a Unicode character to be written directly by its code point, in the form "backslash + u + code point".

'好' === '\u597d' // true

However, this notation does not work for 4-byte characters. ES6 fixes the problem: as long as the code point is placed inside curly braces, it is recognized correctly.
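
For example (assuming an ES6 environment):

'\u{597D}' === '好'  // true
'\u{1D306}' === '𝌆' // true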

(3) String-handling functions

ES6 adds several functions specifically for handling 4-byte code points; a short example follows the list.

  • String.fromCodePoint(): returns the character corresponding to a Unicode code point
  • String.prototype.codePointAt(): returns the code point corresponding to a character
  • String.prototype.at(): returns the character at a given position in the string
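
A brief sketch of the first two (assuming an ES6 environment):

String.fromCodePoint(0x1D306)      // '𝌆'
'𝌆'.codePointAt(0)                 // 119558
'𝌆'.codePointAt(0).toString(16)    // '1d306'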

(4) Regular expressions

ES6 provides the u flag, which adds support for 4-byte code points to regular expressions.
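
For example (assuming an ES6 environment), without the flag the dot matches only a single 2-byte code unit, while with it the dot matches the whole character:

/^.$/.test('𝌆')  // false
/^.$/u.test('𝌆') // true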

(5) Unicode Normalization

Some characters carry marks in addition to the base letter. For example, in the Hanyu Pinyin character ǒ, the tone mark above the letter is such an attached mark. For many European languages, these diacritical marks are essential.

Unicode provides two ways to represent such characters. One is a single precomposed character that already includes the mark, i.e. one code point per character; for example, the code point of Ǒ is U+01D1. The other treats the mark as a code point of its own and composes it with the base character for display, i.e. two code points represent one character; for example, Ǒ can be written as O (U+004F) + ˇ (U+030C).

// Method one
'\u01d1' // 'Ǒ'

// Method two
'\u004f\u030c' // 'Ǒ'

These two representations are equivalent both visually and semantically and should be treated as equal. However, JavaScript cannot tell them apart.

'\u01d1' === '\u004f\u030c' // false

ES6 provides the normalize method to perform "Unicode normalization", which converts the two representations into the same sequence.

'\u01d1'.normalize() === '\u004f\u030c'.normalize() // true

For more information about ES6, see the ECMAScript 6 primer.

==========================

That is the content of my talk; the slides from that day can be found here.
