Unicode that every JavaScript developer should understand

Last Update: 2017-05-11 Source: Internet

Author: User


Contents:

1. Ideas behind Unicode

2. Basic Unicode concepts

2.1 Characters and code points

2.2 Unicode planes

2.3 Code units

2.4 Surrogate pairs

2.5 Combining characters

3. Unicode in JavaScript

3.1 Escape sequences

3.2 String comparison

3.3 String length

3.4 Character positioning

3.5 Regular expression matching

4. Conclusion

1. Ideas behind Unicode

Let's start with the most basic question: how are you able to read and understand this article? The answer is simple: you know these words and what they mean.

And how do you know what these words mean? The answer is again simple: you (the reader) and I (the author) share the same understanding of the connection between these images (displayed on the screen) and the words they represent (their meaning).

For computers, the principle is almost the same. There is only one difference: computers do not understand the meaning of these words (letters); they simply treat them as specific sequences of bits.

Let's imagine a scenario where the computer User1 sends the message 'hello' to the computer User2.

The computer does not know the meaning of these letters. Therefore, User1 converts the message 'hello' into the sequence of numbers 0x68 0x65 0x6C 0x6C 0x6F. Each letter corresponds to a number: h corresponds to 0x68, e corresponds to 0x65, and so on.

It then sends these numbers to User2.

When User2 receives the sequence 0x68 0x65 0x6C 0x6C 0x6F, it uses the same correspondence between letters and numbers to reconstruct the message, and 'hello' is displayed correctly.
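The exchange above can be sketched in a few lines of JavaScript. This is only an illustration: the variable names are made up for this example, and it uses the standard codePointAt() and String.fromCodePoint() methods to stand in for the shared letter-to-number mapping.

```javascript
// Sender side: turn each letter of the message into its number.
const message = 'hello';
const numbers = Array.from(message, (letter) => letter.codePointAt(0));

// Show the numbers in the 0xNN notation used in the text.
const asHex = numbers.map(
  (n) => '0x' + n.toString(16).toUpperCase().padStart(2, '0')
);
console.log(asHex.join(' ')); // 0x68 0x65 0x6C 0x6C 0x6F

// Receiver side: apply the same letter/number mapping in reverse.
const received = String.fromCodePoint(...numbers);
console.log(received); // hello
```

Both sides only ever exchange numbers; the text is recovered because they agree on the same mapping.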

This agreement between different computers on the correspondence between letters and numbers is what Unicode standardizes.

According to Unicode, h is an abstract character named LATIN SMALL LETTER H. This abstract character corresponds to the number 0x68, which is a code point written as U+0068. These concepts are described in the next section.

Unicode provides a list of abstract characters (a character set) and assigns each character a unique identifier, its code point (a coded character set).

2. Basic Unicode concepts

The website www.unicode.org states:

Unicode provides a unique number for every character, no matter what the platform, no matter what the program, no matter what the language.

Unicode is a universal character set. It defines the characters of most of the world's writing systems and assigns a unique number (a code point) to each character.

Unicode covers the characters of most modern languages, punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, and emoji.

The first version, Unicode 1.0, was released in October 1991 and contained 7,161 characters. Version 9.0 (released in June 2016), the latest at the time of writing, provides codes for 128,172 characters.

The universal and open approach of Unicode solves a problem that existed for a long time: vendors implemented many different character sets and encodings, which were difficult to handle.

Creating an application that supports every character set and encoding is complex, not to mention that the encoding you choose may not support all the languages you need.

If you think Unicode is hard, think about programming without it.

I still remember picking character sets and encodings at random to read the contents of a file. It all came down to luck!

2.1 characters and code points

Abstract characters (text characters) are information units used to organize, manage, or represent text data.

A character in Unicode is an abstract concept. Each abstract character has a corresponding name, such as LATIN SMALL LETTER A. The rendered form (glyph) of this abstract character is a.

A code point is a number assigned to an abstract character.

Code points are written in the form U+<hex>, where U+ is a prefix that stands for Unicode and <hex> is a hexadecimal number. For example, both U+0041 and U+2603 are code points.
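As a rough sketch of this notation, the two helpers below convert between a plain number and the U+ form. The names toUPlus and fromUPlus are hypothetical, introduced only for this example; they are not built-in JavaScript functions.

```javascript
// Hypothetical helper: format a code point number in U+<hex> notation,
// padded to at least four hexadecimal digits.
function toUPlus(codePoint) {
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

// Hypothetical helper: parse U+<hex> notation back into a number.
function fromUPlus(text) {
  return parseInt(text.slice(2), 16);
}

console.log(toUPlus(0x41));   // U+0041
console.log(toUPlus(0x2603)); // U+2603
console.log(fromUPlus('U+0068')); // 104
console.log(String.fromCodePoint(fromUPlus('U+0068'))); // h
```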

Code points range from U+0000 to U+10FFFF.

Remember that a code point is just a number. Keep this in mind whenever you think about Unicode.

A code point is like the index of an array element.

The magic of Unicode is that it associates code points with abstract characters. For example, the abstract character at U+0041 is LATIN CAPITAL LETTER A (rendered as A), and the abstract character at U+2603 is SNOWMAN (rendered as ☃).

Note that not all code points have abstract characters. There are 1,114,112 available code points, but only 128,237 of them have abstract characters assigned.
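The size of the code point range and the "array index" analogy can be checked with a short sketch. The constant names are chosen for this example; String.fromCodePoint() is the standard method that looks up the character for a given number and rejects values outside the range.

```javascript
// The full code point range, as plain numbers.
const FIRST_CODE_POINT = 0x0000;
const LAST_CODE_POINT = 0x10FFFF;

// Number of available code points in the range.
const totalCodePoints = LAST_CODE_POINT - FIRST_CODE_POINT + 1;
console.log(totalCodePoints); // 1114112

// Like an array index, a code point selects one abstract character.
const capitalA = String.fromCodePoint(0x0041);
const snowman = String.fromCodePoint(0x2603);
console.log(capitalA, snowman); // A ☃

// A number past the end of the range is not a code point at all.
let rejected = false;
try {
  String.fromCodePoint(LAST_CODE_POINT + 1);
} catch (error) {
  rejected = error instanceof RangeError;
}
console.log(rejected); // true
```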

2.2 Unicode planes

A plane is the range from U+n0000 to U+nFFFF, that is, 65,536 (10000 in hexadecimal) consecutive Unicode code points. The value of n ranges from 0 to 10 in hexadecimal (0 to 16 in decimal).

These planes divide the Unicode code points into 17 groups of equal size:

Plane 0 contains code points from U+0000 to U+FFFF

Plane 1 contains code points from U+10000 to U+1FFFF

...

Plane 16 contains code points from U+100000 to U+10FFFF
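Because every plane holds exactly 0x10000 code points, the plane of a code point can be found with a single division. The sketch below assumes two helper names, planeOf and planeRange, that exist only for this illustration.

```javascript
const PLANE_SIZE = 0x10000; // 65,536 code points per plane

// Hypothetical helper: the plane number is the code point divided by
// the plane size, rounded down.
function planeOf(codePoint) {
  return Math.floor(codePoint / PLANE_SIZE);
}

// Hypothetical helper: first and last code point of plane n.
function planeRange(n) {
  return [n * PLANE_SIZE, n * PLANE_SIZE + PLANE_SIZE - 1];
}

console.log(planeOf(0x0041));   // 0
console.log(planeOf(0x1F600));  // 1
console.log(planeOf(0x10FFFF)); // 16

// Print the range of plane 16 in hexadecimal.
console.log(planeRange(16).map((n) => n.toString(16).toUpperCase()));
// [ '100000', '10FFFF' ]
```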

Basic Multilingual Plane

Plane 0 is special. It is called the Basic Multilingual Plane, or BMP for short. It contains characters from most modern languages (Basic Latin, Cyrillic, Greek, etc.) and a large number of symbols.

As described above, code points in the Basic Multilingual Plane range from U+0000 to U+FFFF and are written with up to four hexadecimal digits.

Most of the time, developers work with characters in the BMP. It contains the characters needed in most cases.

Some characters in the BMP:

e corresponds to code point U+0065, abstract character name: LATIN SMALL LETTER E

| corresponds to code point U+007C, abstract character name: VERTICAL BAR

■ corresponds to code point U+25A0, abstract character name: BLACK SQUARE

☂ corresponds to code point U+2602, abstract character name: UMBRELLA
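A quick way to confirm that these characters sit in the BMP is to compare their code points with U+FFFF. In the sketch below, isInBmp is a hypothetical helper name, and the last two characters are written as \u escapes so the snippet does not depend on the file's encoding.

```javascript
// Hypothetical helper: a code point is in the BMP if it is at most 0xFFFF.
function isInBmp(codePoint) {
  return codePoint >= 0x0000 && codePoint <= 0xFFFF;
}

// e, |, BLACK SQUARE (U+25A0), UMBRELLA (U+2602)
const bmpSamples = ['e', '|', '\u25A0', '\u2602'];

const report = bmpSamples.map((character) => {
  const codePoint = character.codePointAt(0);
  const hex = codePoint.toString(16).toUpperCase().padStart(4, '0');
  return 'U+' + hex + ' in BMP: ' + isInBmp(codePoint);
});

console.log(report.join('\n'));
// U+0065 in BMP: true
// U+007C in BMP: true
// U+25A0 in BMP: true
// U+2602 in BMP: true
```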

Astral planes

The 16 planes after the BMP (plane 1, plane 2, ..., plane 16) are called astral planes or supplementary planes.

Code points in the astral planes are called astral code points. They range from U+10000 to U+10FFFF.

An astral code point has five or six hexadecimal digits: U+ddddd or U+dddddd.

Let's look at some characters from the astral planes:
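As an illustrative sketch, the code below takes three code points beyond U+FFFF and checks how many hexadecimal digits each one needs. The sample values (U+1D11E MUSICAL SYMBOL G CLEF, U+1F600 GRINNING FACE, and U+10FFFF, the last code point) are picked for this example, and isBeyondBmp is a hypothetical helper name.

```javascript
// Hypothetical helper: true for code points in planes 1 to 16.
function isBeyondBmp(codePoint) {
  return codePoint > 0xFFFF && codePoint <= 0x10FFFF;
}

const beyondBmpSamples = [0x1D11E, 0x1F600, 0x10FFFF];

const digitCounts = beyondBmpSamples.map((codePoint) => {
  const hex = codePoint.toString(16).toUpperCase();
  console.log('U+' + hex, 'digits:', hex.length, 'beyond BMP:', isBeyondBmp(codePoint));
  return hex.length;
});
// U+1D11E digits: 5 beyond BMP: true
// U+1F600 digits: 5 beyond BMP: true
// U+10FFFF digits: 6 beyond BMP: true

// The character for such a code point can still be produced and read back.
const grinningFace = String.fromCodePoint(0x1F600);
console.log(grinningFace.codePointAt(0) === 0x1F600); // true
```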

This article is an English version of an article originally published in Chinese on aliyun.com.
