Reference article:
http://www.ruanyifeng.com/blog/2014/12/unicode.html
Unicode comes from a very simple idea: include all the characters of the world in a single set. As long as a computer supports this character set, it can display every character, and garbled text goes away.
Unicode starts at 0 and assigns a number to each symbol. This number is called a code point.
U+0000 = null
The prefix U+ indicates that the hexadecimal number immediately following it is a Unicode code point.
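To make the U+ notation concrete, here is a minimal sketch that prints the code point of a character in that form. The helper name toUPlus is hypothetical and not from the reference article; it relies on codePointAt, which is one of the ES6 additions discussed later in these notes.

```javascript
// Hypothetical helper: format the code point of the first character as "U+XXXX".
function toUPlus(char) {
  const codePoint = char.codePointAt(0); // numeric code point of the first character
  // Hexadecimal, upper case, padded to at least four digits as is customary.
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

console.log(toUPlus('\0'));     // U+0000 (the null character)
console.log(toUPlus('A'));      // U+0041
console.log(toUPlus('\u597D')); // U+597D
```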
The JavaScript language uses the Unicode character set, but only one encoding method is supported.
JavaScript uses UCS-2.
Since JavaScript can only handle UCS-2 encoding, every character in the language is treated as 2 bytes, and a 4-byte character is treated as two double-byte characters. JavaScript's character functions are affected by this and cannot return correct results for such characters.
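The effect can be seen with any character outside the 16-bit range. The sketch below uses U+20BB7 as an arbitrary sample (the variable names are illustrative only) and shows the older string functions reporting two 2-byte units instead of one character.

```javascript
// U+20BB7 lies outside the 16-bit range, so it is stored as two 2-byte units.
const sample = '\uD842\uDFB7';

const unitCount = sample.length;         // 2, although there is only one character
const firstUnit = sample.charCodeAt(0);  // 0xD842 (55362), the first half only
const secondUnit = sample.charCodeAt(1); // 0xDFB7 (57271), the second half only
const firstPiece = sample.charAt(0);     // a lone half, not a displayable character

console.log(unitCount, firstUnit.toString(16), secondUnit.toString(16));
```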
ECMAScript 6 (abbreviated ES6) greatly enhanced Unicode support, which basically solves this problem.
(1) Correct character recognition
ES6 can automatically recognize 4-byte code points. As a result, traversing a string is much simpler.
for (let s of string) {
  // ...
}
However, to maintain compatibility, the length property keeps its original behavior. To get the correct length of a string, you can use the following method.
Array.from(string).length
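As a small illustration of the difference, the sketch below walks the same string by index and with for...of. The sample string and variable names are assumptions chosen for the example.

```javascript
// One 4-byte character (U+20BB7) between two ordinary letters.
const text = 'a\u{20BB7}b';

// Index-based traversal sees the 4-byte character as two separate pieces.
const unitsSeen = [];
for (let i = 0; i < text.length; i++) {
  unitsSeen.push(text[i]);
}

// for...of traversal yields whole characters.
const charsSeen = [];
for (const ch of text) {
  charsSeen.push(ch);
}

console.log(text.length);             // 4
console.log(charsSeen.length);        // 3
console.log(Array.from(text).length); // 3
```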
(2) Code point notation
JavaScript allows a Unicode character to be written directly as its code point, in the form "backslash + u + code point".
'好' === '\u597D' // true
However, this notation does not work for 4-byte code points. ES6 fixed this problem: as long as the code point is placed inside curly braces, it is recognized correctly.
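A short sketch of the two notations, using U+20BB7 as an arbitrary 4-byte sample (the variable names are illustrative only):

```javascript
// Before ES6, a 4-byte code point had to be written as two 2-byte halves.
const viaHalves = '\uD842\uDFB7';

// ES6 curly-brace form: the whole code point goes inside the braces.
const viaBraces = '\u{20BB7}';

// Without braces, only the first four hex digits are read as the code point,
// so this is the character U+20BB followed by the digit "7".
const misparsed = '\u20BB7';

console.log(viaHalves === viaBraces);       // true
console.log(misparsed === '\u20BB' + '7');  // true
```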
(3) String processing functions
ES6 has added several functions that specialize in handling 4-byte code points.
- String.fromCodePoint(): Returns the corresponding character from a Unicode code point
- String.prototype.codePointAt(): Returns the corresponding code point from the character
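- String.prototype.at(): Returns the character at the given position of the string (this was a proposal when the reference article was written; it did not ship in ES6, and the at() method standardized later in ES2022 indexes 2-byte units, not code points)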
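The first two functions in the list can be compared with their older counterparts as follows. This is a minimal sketch; the sample code point U+20BB7 and the variable names are assumptions for illustration.

```javascript
// Build a 4-byte character from its code point, then read the code point back.
const built = String.fromCodePoint(0x20BB7);
const readBack = built.codePointAt(0); // 0x20BB7 (134071)

// The older functions only work with 2-byte units.
const oldRead = built.charCodeAt(0);          // 0xD842, the first half only
const oldBuilt = String.fromCharCode(0x20BB7); // value is truncated to 16 bits: U+0BB7

console.log(readBack.toString(16)); // 20bb7
console.log(oldRead.toString(16));  // d842
console.log(oldBuilt === '\u0BB7'); // true
```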
(4) Regular expressions
ES6 provides a u modifier that adds support for 4-byte code points to regular expressions.
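A minimal sketch of what the u modifier changes, again using U+20BB7 as an arbitrary sample character (variable names are illustrative only):

```javascript
const fourByteChar = '\u{20BB7}';

// Without u, "." matches a single 2-byte unit, so one 4-byte character fails ^.$
const withoutU = /^.$/.test(fourByteChar); // false

// With u, "." matches the whole code point.
const withU = /^.$/u.test(fourByteChar);   // true

// The curly-brace code point notation is also available inside a u pattern.
const bracesInPattern = /\u{20BB7}/u.test(fourByteChar); // true

console.log(withoutU, withU, bracesInPattern);
```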
(5) Unicode Normalization
Some characters carry an additional mark on top of the base letter. For example, in the Ǒ of Hanyu Pinyin, the tone mark above the letter is an attached symbol. In many European languages, such marks are very important.
Unicode provides two ways to represent these characters. One is a single precomposed character that already includes the mark, that is, one code point per character: the code point of Ǒ is U+01D1. The other treats the mark as a code point of its own that is displayed combined with the main character, that is, two code points per character: Ǒ can be written as O (U+004F) + ˇ (U+030C).
// Method one
'\u01D1' // 'Ǒ'
// Method two
'\u004F\u030C' // 'Ǒ'
These two representations are equivalent both visually and semantically, and should be treated as the same. However, JavaScript cannot tell that they are equivalent.
'\u01D1' === '\u004F\u030C' // false
ES6 provides a normalize method that performs "Unicode normalization", which turns both representations into the same sequence.
'\u01D1'.normalize() === '\u004F\u030C'.normalize() // true
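The normalize method also accepts an optional form name. The sketch below assumes the standard 'NFC' (composed) and 'NFD' (decomposed) forms, which the notes above do not mention, and shows that both spellings converge to the same sequence in either form. Variable names are illustrative only.

```javascript
const precomposed = '\u01D1';     // one code point
const combining = '\u004F\u030C'; // base letter + combining mark

// NFC (the default form) composes to the single code point.
const nfcA = precomposed.normalize('NFC');
const nfcB = combining.normalize('NFC');

// NFD decomposes to base letter + combining mark.
const nfdA = precomposed.normalize('NFD');
const nfdB = combining.normalize('NFD');

console.log(nfcA === nfcB, nfcA.length); // true 1
console.log(nfdA === nfdB, nfdA.length); // true 2
```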
Unicode and JavaScript