What every JavaScript developer should understand about Unicode

Source: Internet
Author: User
Contents:

1. The idea behind Unicode

2. Basic Unicode concepts

2.1 Characters and code points

2.2 Unicode planes

2.3 Code units

2.4 Surrogate pairs

2.5 Combining marks

3. Unicode in JavaScript

3.1 Escape sequences

3.2 String comparison

3.3 String length

3.4 Character positioning

3.5 Regular expression matching

4. Conclusion

1. The idea behind Unicode

Let's start with the most basic question: how are you able to read and understand this article? The answer is simple: you know these words and their meanings.

How, then, do you understand the meaning of these words? The answer is also simple: you (the reader) and I (the author) share the same understanding of the connection between these images (what is displayed on the screen) and the words of the language (their meaning).

For computers, this principle is almost the same. There is only one difference: computers do not understand the meaning of these words (letters), but simply understand them as specific bit sequences.

Let's imagine a scenario in which User1's computer sends the message 'hello' to User2's computer.

The computer does not know the meaning of these letters. Therefore, User1's computer converts the message 'hello' into the sequence of numbers 0x68 0x65 0x6C 0x6C 0x6F. Each letter corresponds to one number: h corresponds to 0x68, e corresponds to 0x65, and so on.

It then sends these numbers to User2's computer.

When User2's computer receives the sequence 0x68 0x65 0x6C 0x6C 0x6F, it uses the same correspondence between letters and numbers to reconstruct the message, and 'hello' is displayed correctly.
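The exchange above can be sketched in JavaScript. This is only an illustration, not how a real network protocol works: the function names encodeMessage and decodeMessage are made up for this example, and the shared "agreement" is assumed to be simply the code point of each character.

```javascript
// Hypothetical helpers: the names are illustrative, not part of any API.

// User1's side: turn each character into its agreed-upon number.
function encodeMessage(text) {
  return Array.from(text, (ch) => ch.codePointAt(0));
}

// User2's side: use the same agreement to turn numbers back into characters.
function decodeMessage(numbers) {
  return String.fromCodePoint(...numbers);
}

const sent = encodeMessage('hello');

// Show the numbers in the same hexadecimal form used in the text.
const asHex = sent.map(
  (n) => '0x' + n.toString(16).toUpperCase().padStart(2, '0')
);
console.log(asHex.join(' ')); // 0x68 0x65 0x6C 0x6C 0x6F

console.log(decodeMessage(sent)); // hello
```

If the two sides used different correspondences, decodeMessage would still return a string, but not the one that was sent. That is the problem a shared standard removes.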

This agreement between different computers on the correspondence between letters and numbers is what Unicode standardizes.

In Unicode terms, h is an abstract character named LATIN SMALL LETTER H. This abstract character corresponds to the number 0x68, which is a code point written as U+0068. These concepts are described in the next section.

Unicode provides a list of abstract characters (a character set) and assigns each character a unique identifier, its code point (which makes it a coded character set).

2. Basic Unicode concepts

The website www.unicode.org states:

Unicode assigns a unique number to each character, regardless of the platform, the program, or the language.

Unicode is a universal character set. It defines the characters of most of the world's writing systems and assigns a unique number (code point) to each character.

Unicode includes characters from most modern languages, as well as punctuation marks, diacritics, mathematical symbols, technical symbols, arrows, and emoji.

Unicode 1.0, the first version, was released in October 1991 and contained 7161 characters. The latest version at the time of writing, 9.0 (released in June 2016), provides codes for 128172 characters.

The universality and openness of Unicode solve a long-standing problem: vendors used to implement many different character sets and encodings, which were difficult to handle.

Creating an application that supports all character sets and encodings is complex, not to mention that the encoding you select may not support all the languages you need.

If you think Unicode is hard, think about programming without Unicode. It would be far more difficult.

I still remember picking character sets and encodings at random to read the contents of files. It was pure luck!

2.1 Characters and code points

An abstract character (or simply a character) is a unit of information used to organize, control, or represent textual data.

A character in Unicode is an abstract concept. Each abstract character has a corresponding name, such as LATIN SMALL LETTER A. The rendered form (glyph) of this abstract character is a. (A glyph is the visual representation of a character.)

A code point is a number assigned to an abstract character.

Code points are written in the format U+<hex>, where U+ is a prefix that stands for Unicode and <hex> is a hexadecimal number. For example, both U+0041 and U+2603 are code points.

Code points range from U+0000 to U+10FFFF.

Remember that a code point is simply a number. Keep this in mind when thinking about Unicode.

A code point is like the index of an element in an array.

The magic of Unicode is that it associates code points with abstract characters. For example, U+0041 corresponds to the abstract character LATIN CAPITAL LETTER A (rendered as A), and U+2603 corresponds to the abstract character SNOWMAN (rendered as ☃).
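As a quick illustration of moving between characters and code points, the sketch below uses the standard methods codePointAt and String.fromCodePoint. The helper formatCodePoint is a name invented for this example; it only prints a number in the U+ notation described above.

```javascript
// Hypothetical helper: formats a code point in the U+<hex> notation,
// padding to at least four hexadecimal digits.
function formatCodePoint(codePoint) {
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

// From character to code point.
console.log(formatCodePoint('A'.codePointAt(0))); // U+0041
console.log(formatCodePoint('☃'.codePointAt(0))); // U+2603

// From code point back to character.
console.log(String.fromCodePoint(0x0041)); // A
console.log(String.fromCodePoint(0x2603)); // ☃
```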

Note that not all code points have an associated abstract character. There are 1114112 code points available (the range from U+0000 to U+10FFFF), but only 128237 of them have an abstract character assigned.
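The total number of code points follows directly from the range U+0000 to U+10FFFF. The short calculation below is only arithmetic; the count of assigned characters is the figure quoted in the text, not something JavaScript can report.

```javascript
// Code points run from U+0000 to U+10FFFF inclusive.
const totalCodePoints = 0x10FFFF + 1;
console.log(totalCodePoints); // 1114112

// Figure quoted in the text (not computed here).
const assignedCharacters = 128237;
const share = (assignedCharacters / totalCodePoints) * 100;
console.log(share.toFixed(1) + '%'); // 11.5%
```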

2.2 Unicode planes

A plane is the range from U+n0000 to U+nFFFF, that is, 65536 (0x10000) consecutive Unicode code points. The value of n ranges from 0x0 to 0x10.

These planes divide Unicode code points into 17 equal-size collections:

Plane 0 contains code points from U+0000 to U+FFFF

Plane 1 contains code points from U+10000 to U+1FFFF

...

Plane 16 contains code points from U+100000 to U+10FFFF
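Because every plane holds exactly 0x10000 code points, the plane number can be computed from a code point by integer division. The following is a minimal sketch; planeOf is a name chosen for this example, not a built-in function.

```javascript
// Hypothetical helper: each plane holds 0x10000 (65536) code points,
// so the plane number is the code point divided by 0x10000, rounded down.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('Not a valid Unicode code point');
  }
  return Math.floor(codePoint / 0x10000);
}

console.log(planeOf(0x0041));   // 0
console.log(planeOf(0xFFFF));   // 0
console.log(planeOf(0x10000));  // 1
console.log(planeOf(0x10FFFF)); // 16

// 17 planes of 65536 code points each cover the whole code point range.
console.log(17 * 0x10000); // 1114112
```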

Basic Multilingual Plane

Plane 0 is special. It is called the Basic Multilingual Plane, or BMP for short. It contains characters from most modern languages (basic Latin letters, Spanish letters, Greek letters, etc.) and a large number of symbols.

As described above, the code points in the Basic Multilingual Plane range from U+0000 to U+FFFF, so they have at most four hexadecimal digits.

Most of the time, developers work with characters in the BMP. It contains the characters needed in most cases.

Some characters in the BMP:

e: code point U+0065, abstract character name LATIN SMALL LETTER E

|: code point U+007C, abstract character name VERTICAL LINE

■: code point U+25A0, abstract character name BLACK SQUARE

☂: code point U+2602, abstract character name UMBRELLA

Astral planes

The 16 planes beyond the BMP (plane 1, plane 2, ..., plane 16) are called astral planes or supplementary planes.

The code points in the astral planes are called astral code points. They range from U+10000 to U+10FFFF.

Astral code points have five or six hexadecimal digits: U+ddddd or U+dddddd.
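The distinction between the BMP and the planes beyond it comes down to a simple range check. In the sketch below, isAstral is a name invented for this example, and U+1F600 (GRINNING FACE) is a code point picked here for illustration rather than one taken from the text.

```javascript
// Hypothetical helper: a code point is astral when it lies beyond the BMP,
// that is, above U+FFFF and no higher than U+10FFFF.
function isAstral(codePoint) {
  return codePoint > 0xFFFF && codePoint <= 0x10FFFF;
}

console.log(isAstral(0x2602));   // false (UMBRELLA is in the BMP)
console.log(isAstral(0x1F600));  // true
console.log(isAstral(0x10FFFF)); // true

// Astral code points need five or six hexadecimal digits.
console.log((0x1F600).toString(16).length);  // 5
console.log((0x10FFFF).toString(16).length); // 6
```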

Let's look at some characters in the astral planes:
