Unicode basics: Unicode assigns a unique integer, between 0 and 1,114,111, to every character of every writing system in the world; in Unicode terminology this integer is called a code point.
In this respect Unicode is hardly different from other character encodings (for example, ASCII).
The difference is that ASCII maps each index to a single, unique binary representation, whereas Unicode allows a code point to have several different binary encodings.
Different encodings trade off between the amount of storage a string requires and the speed of operations such as indexing.
The most popular Unicode encodings today are UTF-8, UTF-16, and UTF-32.
Historically, Unicode badly underestimated how many code points it would need.
Initially it was thought that 2^16 code points would suffice, which led to UCS-2, the original 16-bit encoding standard. Every code point fit in a 16-bit number, so the simplest approach was to map each code point one-to-one to the element that encodes it, known as a code unit.
A UCS-2 string is therefore made up of individual 16-bit code units, each corresponding to a single Unicode code point. The main benefit of this encoding is that indexing into a string is a cheap, constant-time operation: getting the nth code point simply selects the nth 16-bit element of the array.
Consider the following example:
a string in which each character's code point lies within the original 16-bit range. For such Unicode strings, code points and encoded elements (code units) match exactly.
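A minimal sketch of the idea (the original example is not reproduced here; the plain ASCII string "hello" is my own stand-in):

```js
// Every character of "hello" lies in the original 16-bit range,
// so each code unit is also a complete code point.
var str = "hello";
console.log(str.length);        // 5 — five code units, five code points
console.log(str.charCodeAt(0)); // 104, the code point of "h"
console.log(str.charCodeAt(4)); // 111, the code point of "o"
```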
JS strings use elements of exactly this 16-bit encoding. If the world were still as it was in the early 1990s, each element of a JS string would correspond to a single code point.
Unicode has since been extended from 2^16 to more than 2^20 code points. The expanded range is organized into 17 subranges of 2^16 code points each.
The first subrange, called the Basic Multilingual Plane (BMP), contains the original 2^16 code points. The remaining 16 ranges are called supplementary planes.
Once the range of code points expanded, UCS-2 became obsolete: it had to be extended to represent the additional code points. Its successor, UTF-16, is mostly similar to it.
UTF-16 uses surrogate pairs to represent the additional code points: a pair of 16-bit code units together encode a single code point that is greater than or equal to 2^16.
(In case the terminology is getting muddled: a surrogate pair is two 16-bit code units; a code unit is one encoded element, which under UCS-2 mapped one-to-one to a code point.)
As an example:
The code point of the musical treble clef symbol "𝄞" is U+1D11E (the conventional Unicode hexadecimal notation for code point 119070).
In UTF-16 it is represented by the two code units 0xD834 and 0xDD1E; the code point can be decoded by combining selected bits from the two code units. (The encoding cleverly guarantees that these surrogates can never be confused with valid BMP code points, so a surrogate pair can always be recognized even when starting from somewhere in the middle of a string.)
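A small sketch of how this looks from JS (my own illustration; the bit arithmetic is the standard UTF-16 decoding formula, and codePointAt is a later ES2015 addition shown only for comparison):

```js
var clef = "\uD834\uDD1E"; // the treble clef "𝄞", U+1D11E

console.log(clef.length);                     // 2 — two code units
console.log(clef.charCodeAt(0).toString(16)); // "d834", the high surrogate
console.log(clef.charCodeAt(1).toString(16)); // "dd1e", the low surrogate

// Combining the selected bits of the two surrogates recovers the code point:
var hi = clef.charCodeAt(0), lo = clef.charCodeAt(1);
var codePoint = (hi - 0xD800) * 0x400 + (lo - 0xDC00) + 0x10000;
console.log(codePoint);                       // 119070, i.e. U+1D11E

// ES2015's codePointAt does the same decoding for you:
console.log(clef.codePointAt(0));             // 119070
```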
Because encoding a single code point in UTF-16 takes either one or two 16-bit code units, UTF-16 is a variable-length encoding.
- A string of n code points varies in memory size, depending on which code points it contains.
- Finding the nth code point of a string is no longer a constant-time operation.
- In general the search has to scan from the beginning of the string (see the sketch below).
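A sketch of why that scan is needed, counting code points by hand using the surrogate ranges (the helper nthCodePoint is my own illustration, not from the book):

```js
// Hypothetical helper: find the nth code point by scanning from the start.
// A code unit in 0xD800–0xDBFF is a high surrogate followed by a low
// surrogate, so the two units together count as one code point.
function nthCodePoint(str, n) {
  var i = 0;
  while (i < str.length) {
    var unit = str.charCodeAt(i);
    var isPair = unit >= 0xD800 && unit <= 0xDBFF;
    if (n === 0) {
      return isPair ? str.slice(i, i + 2) : str.charAt(i);
    }
    n--;
    i += isPair ? 2 : 1; // a pair consumes two array elements
  }
  return undefined;
}

console.log(nthCodePoint("x\uD834\uDD1Ey", 1)); // "𝄞" (two code units)
console.log(nthCodePoint("x\uD834\uDD1Ey", 2)); // "y"
```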
By the time Unicode expanded in size, JS had already committed to 16-bit string elements. String properties and methods operate at the level of code units, not code points.
So whenever a string contains code points from the supplementary planes, JS represents each such code point as two elements rather than one (the two halves of its UTF-16 surrogate pair).
Each element of a JS string is a 16-bit code unit.
Extracting an element from a string gives you a code unit, not a code point.
Regular expressions work at the code-unit level as well: the single-character pattern ( "." ) matches a single code unit.
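For example (my own illustration; the sample string is an assumed stand-in):

```js
var song = "clef: \uD834\uDD1E"; // "clef: 𝄞"

console.log(song.length);        // 8, not 7 — the clef counts as two elements
console.log(song.charAt(6));     // "\uD834", half of the surrogate pair
console.log(song.charCodeAt(7)); // 56606 (0xDD1E), a lone low surrogate

// The "." pattern matches one code unit, so the clef needs two dots:
console.log(/^.{7}$/.test(song)); // false
console.log(/^.{8}$/.test(song)); // true
```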
The built-in JS string type works at the code unit level, but that does not prevent individual APIs from being aware of code points and surrogate pairs; some of the standard ECMAScript library functions do handle surrogate pairs correctly.
For example, the URI-manipulation functions: encodeURI, decodeURI, encodeURIComponent, and decodeURIComponent.
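A quick check (the byte sequence %F0%9D%84%9E is the standard UTF-8 encoding of U+1D11E):

```js
// encodeURIComponent treats the surrogate pair as a single code point
// and emits the four UTF-8 bytes of U+1D11E:
console.log(encodeURIComponent("\uD834\uDD1E"));         // "%F0%9D%84%9E"
console.log(decodeURIComponent("%F0%9D%84%9E").length);  // 2 — back to the pair
```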
Tips
- JS strings consist of sequences of 16-bit code units, not of Unicode code points.
- Unicode code points 2^16 and above are represented in JS by two code units, known as a surrogate pair.
- Surrogate pairs throw off string element counts, affecting length, charAt, charCodeAt, and regular expression patterns such as ".".
- Use third-party libraries for writing code-point-aware string manipulation.
- Whenever you use a library that works with strings, consult its documentation to see how it handles the full range of code points.
Postscript
This section really threw me; I didn't fully understand it, and in my usual work environment I have never run into a bug of this kind.
If the page encoding is UTF-8 or GBK, can the issues described above simply be ignored?
For now I only know roughly how strings are stored in memory in each of these cases.
To dig deeper, you can search the Internet for more information.
Further reading
I found a few articles on the Internet that cover this; read them yourself if you're interested.
- A brief summary of Unicode, UTF-8, and UTF-16
- Unicode (UTF-8, UTF-16): easily confused concepts
- Why is UTF-8 encoding more widely used than UTF-16 encoding?
- The differences and relationships between UTF-8, GBK, UTF8, and GB2312
[Effective JavaScript notes] Item 7: Think of strings as sequences of 16-bit code units