JavaScript language support for the Unicode character set

Source: Internet
Author: User

Last month I gave a talk that covered the Unicode character set and the JavaScript language's support for it in detail. Below are the lecture notes from that talk.


I. What is Unicode?

Unicode stems from a very simple idea: include all the characters in the world in one set. As long as a computer supports this one character set, it can display every character, and garbled text becomes a thing of the past.

It starts at 0 and assigns a number to each symbol; this number is called the code point. For example, the symbol at code point 0 is null (all of its bits are 0).

U+0000 = null

In this notation, U+ indicates that the hexadecimal number that follows is a Unicode code point.

Currently, the latest version of Unicode is 7.0, which contains 109,449 symbols in total, of which 74,500 are Chinese, Japanese, and Korean characters. Roughly speaking, more than two-thirds of the world's existing symbols come from East Asian scripts. For example, the code point of the Chinese character 好 ("good") is hexadecimal 597D.

U+597D = 好

With so many symbols, Unicode was not defined all at once but in partitions. Each partition, called a plane, holds 65,536 (2^16) characters. There are currently 17 planes, so the code space holds 17 × 65,536 = 1,114,112 code points, and any code point fits in 21 bits (5 bits to select the plane plus 16 bits within it).

The first 65,536 code points form the basic plane (abbreviated BMP). Its code points range from 0 to 2^16 - 1, written in hexadecimal as U+0000 to U+FFFF. All of the most common characters are placed in this plane, which was the first plane that Unicode defined and published.

The remaining characters are placed in the auxiliary (supplementary) planes, abbreviated SMP here, and their code points range from U+010000 to U+10FFFF.

II. UTF-32 and UTF-8

Unicode only specifies the code point of each character. How that code point is represented as a sequence of bytes is a matter for the encoding method.

The most intuitive encoding represents every code point with four bytes whose contents correspond directly to the code point's value. This encoding is called UTF-32. For example, code point 0 is represented by four zero bytes, and code point 597D is padded with two leading zero bytes.

U+0000 = 0x0000 0000
U+597D = 0x0000 597D

The advantage of UTF-32 is that the conversion rule is simple and intuitive and lookup is efficient. The disadvantage is wasted space: the same English text is four times larger than in ASCII. This shortcoming is fatal. In practice almost no one uses this encoding, and the HTML5 standard explicitly states that web pages must not be encoded as UTF-32.
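The four-byte layout described above can be sketched in a few lines. This is only an illustration: the function names toUtf32Bytes and toHex are made up for this example, and the big-endian byte order is an assumption chosen to match the way the bytes are written out in the text.

```javascript
// Hypothetical helper: encode one code point as four big-endian bytes.
// Only a range check is performed; this is a sketch, not a full encoder.
function toUtf32Bytes(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a Unicode code point: ' + codePoint);
  }
  return [
    (codePoint >>> 24) & 0xFF,
    (codePoint >>> 16) & 0xFF,
    (codePoint >>> 8) & 0xFF,
    codePoint & 0xFF,
  ];
}

// Hypothetical helper: format bytes as two-digit uppercase hex.
function toHex(bytes) {
  return bytes
    .map((b) => b.toString(16).padStart(2, '0').toUpperCase())
    .join(' ');
}

console.log(toHex(toUtf32Bytes(0x0000))); // 00 00 00 00
console.log(toHex(toUtf32Bytes(0x597D))); // 00 00 59 7D
```

Every character costs four bytes regardless of its value, which is exactly the waste of space mentioned above.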

What people really needed was a space-saving encoding, which led to the birth of UTF-8. UTF-8 is a variable-length encoding in which a character takes from 1 to 4 bytes. The more commonly used a character is, the fewer bytes it takes; the first 128 characters use only 1 byte each and are identical to ASCII.

Code point range       Bytes
0x0000 - 0x007F        1
0x0080 - 0x07FF        2
0x0800 - 0xFFFF        3
0x010000 - 0x10FFFF    4
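The ranges above translate directly into a lookup. The following is a minimal sketch; utf8ByteLength is a name invented for this example, and it only reports the length, it does not produce the actual UTF-8 bytes.

```javascript
// Hypothetical helper: number of bytes UTF-8 needs for one code point,
// following the range table above.
function utf8ByteLength(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a Unicode code point: ' + codePoint);
  }
  if (codePoint <= 0x7F) return 1;
  if (codePoint <= 0x7FF) return 2;
  if (codePoint <= 0xFFFF) return 3;
  return 4;
}

console.log(utf8ByteLength(0x41));    // 1 (the letter A, same as ASCII)
console.log(utf8ByteLength(0x597D));  // 3
console.log(utf8ByteLength(0x1D306)); // 4
```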

Because it saves so much space, UTF-8 has become the most common web page encoding on the Internet. However, it has little to do with today's topic, so I will not go into it; for the specific conversion method, see "character coding notes."

III. Introduction to UTF-16

UTF-16 sits between UTF-32 and UTF-8 and combines the characteristics of fixed-length and variable-length encodings.

Its encoding rule is simple: characters in the basic plane occupy 2 bytes, and characters in the auxiliary planes occupy 4 bytes. That is, a UTF-16 encoding is either 2 bytes long (U+0000 to U+FFFF) or 4 bytes long (U+010000 to U+10FFFF).

This raises a question: when we encounter two bytes, how do we know whether they form a character on their own or must be read together with the following two bytes?

The answer is ingenious, and I do not know whether it was designed this way on purpose: in the basic plane, the range U+D800 to U+DFFF is an empty segment, that is, these code points do not correspond to any character. This empty segment can therefore be used to map the characters of the auxiliary planes.

Specifically, the auxiliary planes contain 2^20 character positions, so 20 bits are needed to address them. UTF-16 splits these 20 bits into two halves: the first 10 bits are mapped into U+D800 to U+DBFF (a space of size 2^10) and are called the high surrogate (H), and the last 10 bits are mapped into U+DC00 to U+DFFF (also of size 2^10) and are called the low surrogate (L). In other words, one auxiliary-plane character is split into, and represented by, two basic-plane code points.

So when we encounter two bytes whose code point lies between U+D800 and U+DBFF, we can conclude that the code point of the next two bytes must lie between U+DC00 and U+DFFF, and that the four bytes must be read together.
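The decision described above comes down to two range checks on a 16-bit unit. Here is a small sketch; the function name classifyCodeUnit and its return labels are invented for this illustration.

```javascript
// Hypothetical helper: classify one 16-bit UTF-16 code unit.
function classifyCodeUnit(unit) {
  if (unit >= 0xD800 && unit <= 0xDBFF) {
    return 'high'; // first half of a 4-byte character, read the next unit too
  }
  if (unit >= 0xDC00 && unit <= 0xDFFF) {
    return 'low'; // second half of a 4-byte character
  }
  return 'bmp'; // a complete basic-plane character on its own
}

console.log(classifyCodeUnit(0x597D)); // bmp
console.log(classifyCodeUnit(0xD834)); // high
console.log(classifyCodeUnit(0xDF06)); // low
```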

IV. The UTF-16 conversion formula

When converting a Unicode code point to UTF-16, first determine whether it is a basic-plane character or an auxiliary-plane character. If it is the former, the code point is converted directly into the corresponding hexadecimal form, two bytes long.

U+597D = 0x597D

If it is an auxiliary-plane character, Unicode version 3.0 gives a conversion formula.

H = Math.floor((c - 0x10000) / 0x400) + 0xD800
L = (c - 0x10000) % 0x400 + 0xDC00

Take the character 𝌆 as an example. It is an auxiliary-plane character with code point U+1D306, and converting it to UTF-16 proceeds as follows.

H = Math.floor((0x1D306 - 0x10000) / 0x400) + 0xD800 = 0xD834
L = (0x1D306 - 0x10000) % 0x400 + 0xDC00 = 0xDF06

So the UTF-16 encoding of the character 𝌆 is 0xD834 DF06, which is four bytes long.
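The formula can be wrapped in a pair of functions to check the arithmetic in both directions. This is a sketch: the names toSurrogatePair and fromSurrogatePair are made up here, and the reverse function is simply the formula solved for c, which the text itself does not spell out.

```javascript
// Hypothetical helper: split an auxiliary-plane code point into H and L
// using the formula above.
function toSurrogatePair(c) {
  if (!Number.isInteger(c) || c < 0x10000 || c > 0x10FFFF) {
    throw new RangeError('Not an auxiliary-plane code point: ' + c);
  }
  const offset = c - 0x10000;
  const high = Math.floor(offset / 0x400) + 0xD800;
  const low = (offset % 0x400) + 0xDC00;
  return [high, low];
}

// Hypothetical helper: the same formula inverted (an addition for this sketch).
function fromSurrogatePair(high, low) {
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

const [high, low] = toSurrogatePair(0x1D306);
console.log(high.toString(16), low.toString(16));        // d834 df06
console.log(fromSurrogatePair(high, low).toString(16));  // 1d306
```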

V. Which encoding does JavaScript use?

The JavaScript language uses the Unicode character set, but supports only one encoding.

That encoding is not UTF-16, not UTF-8, and not UTF-32. JavaScript uses none of the encodings described above.

JavaScript uses UCS-2!

VI. The UCS-2 encoding

Why has UCS-2 suddenly appeared out of nowhere? This requires a little history.

Before the Internet existed, two teams independently set out to build a unified character set. One was the Unicode team, established in 1989; the other was the UCS team, established even earlier, in 1988. When they discovered each other's existence they quickly reached an agreement: the world does not need two unified character sets.

In October 1991 the two teams decided to merge their character sets. From then on only one character set, Unicode, would be released, the previously published character set would be revised, and UCS code points would be exactly the same as Unicode's.

In reality, UCS was progressing faster than Unicode at the time: as early as 1990 it had published its first encoding, UCS-2, which used 2 bytes to represent the characters that had already been assigned code points. (At that time there was only one plane, the basic plane, so 2 bytes were enough.) UTF-16 was not published until July 1996, and it was explicitly declared to be a superset of UCS-2: basic-plane characters keep their UCS-2 encoding, and auxiliary-plane characters get a 4-byte representation.

Put simply, UTF-16 replaced UCS-2, or UCS-2 was absorbed into UTF-16. So today there is only UTF-16 and no UCS-2.

VII. The background of JavaScript's birth

So why did JavaScript not choose the more advanced UTF-16, and instead use the already obsolete UCS-2?

The answer is simple: it was not a matter of being unwilling, but of being unable. When the JavaScript language appeared, UTF-16 did not yet exist.

In May 1995 Brendan Eich designed the JavaScript language in 10 days; in October the first interpreter engine came out; and in November of the following year Netscape formally submitted the language standard to ECMA (see "JavaScript nativity" for the whole story). Compare this with the release date of UTF-16 (July 1996) and it becomes clear that Netscape had no other choice: UCS-2 was the only encoding available!

VIII. Limitations of JavaScript character functions

Because JavaScript can only handle UCS-2, every character in the language is 2 bytes, and a character that needs 4 bytes is treated as two 2-byte characters. JavaScript's character functions are all affected by this and cannot return correct results.

Take the character 𝌆 again: its UTF-16 encoding is the 4-byte sequence 0xD834 DF06. The problem is that a 4-byte encoding does not belong to UCS-2, so JavaScript does not recognize it and sees it only as two separate characters, U+D834 and U+DF06. As mentioned earlier, these two code points are empty, so JavaScript considers 𝌆 to be a string made of two empty characters!

JavaScript reports the length of this character as 2, the first character it retrieves (with charAt(0)) is an empty character, and the code point it reports for the first character (with charCodeAt(0)) is 0xD834. None of these results is correct!
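The behavior just described is easy to reproduce. In this short sketch the character is written with escape sequences for its two 2-byte halves, so the example does not depend on how the source file itself is encoded.

```javascript
// U+1D306 written as its two UTF-16 code units.
const tetragram = '\uD834\uDF06';

console.log(tetragram.length);                       // 2, although it is one character
console.log(tetragram.charAt(0) === '\uD834');       // true: only the first half comes back
console.log(tetragram.charCodeAt(0).toString(16));   // d834
console.log(tetragram.charCodeAt(1).toString(16));   // df06
```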

To solve this problem, you must check the code point and adjust manually. The following is the correct way to traverse a string.

while (++index < length) {
  // ...
  if (charCode >= 0xD800 && charCode <= 0xDBFF) {
    // first half of a 4-byte character: read it together with the next unit
    output.push(character + string.charAt(++index));
  } else {
    output.push(character);
  }
}

The code above shows that when traversing a string you must check each code point: whenever it falls in the range 0xD800 to 0xDBFF, it must be read together with the following 2 bytes.
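For reference, the loop can be completed into a standalone function along these lines. This is a sketch under assumptions: the function name getSymbols and the variable setup are filled in for illustration, and the extra check that the next unit really is a low half (0xDC00 to 0xDFFF) is an addition that the fragment above does not make.

```javascript
// Hypothetical helper: split a string into an array of whole characters.
function getSymbols(string) {
  const output = [];
  const length = string.length;
  let index = -1;
  while (++index < length) {
    const character = string.charAt(index);
    const charCode = character.charCodeAt(0);
    // NaN when there is no next unit, which makes the range test below fail.
    const next = index + 1 < length ? string.charCodeAt(index + 1) : NaN;
    if (charCode >= 0xD800 && charCode <= 0xDBFF &&
        next >= 0xDC00 && next <= 0xDFFF) {
      // High half followed by a low half: read the two units together.
      output.push(character + string.charAt(++index));
    } else {
      output.push(character);
    }
  }
  return output;
}

const symbols = getSymbols('a\uD834\uDF06b');
console.log(symbols.length);                   // 3
console.log(symbols[1] === '\uD834\uDF06');    // true
```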

A similar problem exists in all JavaScript character manipulation functions.

String.prototype.replace()
String.prototype.substring()
String.prototype.slice()
...

The functions above are only valid for 2-byte code points. To handle 4-byte code points correctly, you must write your own versions that check the code point range of the current character.
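As one possible shape for such a hand-written version, here is a sketch of a slice that counts whole characters instead of 2-byte units. The name codePointSlice is invented for this example, and the regular expression is just one common way to pair up the two halves; it is not taken from the talk.

```javascript
// Hypothetical helper: like String.prototype.slice, but start and end
// count whole characters rather than 2-byte units.
function codePointSlice(string, start, end) {
  // Match either a high half followed by a low half, or any single unit.
  const symbols = string.match(/[\uD800-\uDBFF][\uDC00-\uDFFF]|[\s\S]/g) || [];
  return symbols.slice(start, end).join('');
}

const text = 'a\uD834\uDF06b';
console.log(text.slice(0, 2) === 'a\uD834');                  // true: the built-in cuts the character in half
console.log(codePointSlice(text, 0, 2) === 'a\uD834\uDF06');  // true
```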

IX. ECMAScript 6

The next version of JavaScript, ECMAScript 6 (ES6), greatly enhances Unicode support, and basically solves the problem.

(1) Correct recognition of characters

ES6 can automatically identify 4-byte code points. Therefore, traversing the string is much simpler.

for (let s of string) {
  // ...
}

However, to remain compatible, the length property keeps its original behavior. To get the correct length of a string, you can use the following approach.

Array.from(string).length
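A short sketch that puts the loop and the length calculation side by side; the sample string and variable names are made up for the illustration.

```javascript
// 'a', U+1D306 (written as its two 2-byte halves), 'b'
const sample = 'a\uD834\uDF06b';

const seen = [];
for (const s of sample) {
  seen.push(s); // each iteration yields one whole character
}

console.log(seen.length);               // 3
console.log(sample.length);             // 4, the original 2-byte-unit count
console.log(Array.from(sample).length); // 3
```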

(2) Code point notation

JavaScript allows a Unicode character to be written directly as its code point, using the form "backslash + u + code point".

'好' === '\u597D' // true

However, this notation does not work for 4-byte code points. ES6 fixes the problem: as long as the code point is placed inside curly braces, it is recognized correctly.
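The difference between the two notations can be seen in a brief sketch; the variable names are made up for the illustration.

```javascript
// With curly braces the whole code point is read.
const braced = '\u{1D306}';
console.log(braced === '\uD834\uDF06'); // true

// Without braces only four hex digits are read, so this is
// the character U+1D30 followed by the digit '6'.
const unbraced = '\u1D306';
console.log(unbraced.length);             // 2
console.log(unbraced === '\u1D30' + '6'); // true

// The braces also work for 2-byte code points.
console.log('\u{597D}' === '\u597D');     // true
```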

(3) String handling functions

ES6 has added several functions that specialize in handling 4-byte code points.

String.fromCodePoint(): returns the corresponding character from a Unicode code point
String.prototype.codePointAt(): returns the corresponding code point from a character
String.prototype.at(): returns the character at the given position of the string (listed in the talk, but not part of the final ES6 standard; the at() method standardized later, in ES2022, indexes 2-byte units)
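A brief sketch of the first two functions, compared with their older 2-byte counterparts; the variable name is made up for the illustration.

```javascript
const fromPoint = String.fromCodePoint(0x1D306);
console.log(fromPoint === '\uD834\uDF06');            // true
console.log(fromPoint.codePointAt(0).toString(16));   // 1d306

// Starting in the middle of the character returns only the second half.
console.log(fromPoint.codePointAt(1).toString(16));   // df06

// The older function keeps only the low 16 bits of its argument.
console.log(String.fromCharCode(0x1D306) === '\uD306'); // true
```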

(4) Regular expressions

ES6 provides the u modifier, which adds 4-byte code point support to regular expressions.
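A minimal sketch of the difference the modifier makes, using the dot, which is meant to match exactly one character; the variable name is made up for the illustration.

```javascript
const astral = '\uD834\uDF06'; // U+1D306

console.log(/^.$/.test(astral));          // false: the dot sees two 2-byte units
console.log(/^.$/u.test(astral));         // true: the dot sees one character
console.log(/\u{1D306}/u.test(astral));   // true: curly-brace escapes work with u
```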

(5) Unicode normalization

Some characters carry an additional mark on top of the letter. For example, in the Ǒ of Hanyu Pinyin, the tone mark above the letter is an additional mark. For many European languages, such marks are very important.

Unicode provides two ways to represent them. One is a single character that already includes the mark, where one code point represents the whole character, such as Ǒ at code point U+01D1. The other treats the additional mark as a code point of its own that is combined with the base character, so that two code points represent one character: Ǒ can be written as O (U+004F) + ˇ (U+030C).

// Method one
'\u01D1'
// 'Ǒ'

// Method two
'\u004F\u030C'
// 'Ǒ'

These two representations are identical both visually and semantically and should be treated as equivalent. However, JavaScript cannot tell that they are.

'\u01D1' === '\u004F\u030C'
// false

ES6 provides the normalize method, which performs "Unicode normalization", converting both representations into the same sequence.

'\u01D1'.normalize() === '\u004F\u030C'.normalize() // true
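The comparison can be expanded into a short sketch that also shows which sequence each form settles on. The explicit 'NFC' and 'NFD' arguments are form names accepted by normalize ('NFC' is the default); the variable names are made up for the illustration.

```javascript
const single = '\u01D1';         // one code point
const combined = '\u004F\u030C'; // base letter plus combining mark

console.log(single === combined);                                   // false
console.log(single.normalize('NFC') === combined.normalize('NFC')); // true

// NFC prefers the single code point, NFD the decomposed pair.
console.log(combined.normalize('NFC').length); // 1
console.log(single.normalize('NFD').length);   // 2
```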

For more information on ES6, please see "ECMAScript 6".
