This series is my reading notes on Effective JavaScript.
Mention Unicode and many programmers will find it cumbersome, but in essence Unicode is not complicated. Every character in every writing system in the world is represented by an integer value in the range 0 to 1114111. In Unicode terminology this value is called a code point. When it comes to mapping characters to integer values, Unicode is no different from other encodings such as ASCII.
However, Unicode has many encoding forms, whereas ASCII has only one:
| Character set | Encoding method |
| --- | --- |
| ASCII | ASCII encoding |
| Unicode | UTF-8, UTF-16, UTF-32, etc. |
So why does Unicode have so many encodings? Because the time and space requirements of string operations differ from one situation to another.
When Unicode was first designed, it was estimated that all code points could be represented with 2^16, i.e. 65536, values. The resulting encoding is UCS-2, the original 16-bit encoding for Unicode. With it, each code point is represented by a single 16-bit value, which is referred to as a code unit. The advantage of this representation is that indexing into a Unicode string takes constant time, because every character is 16 bits, i.e. 2 bytes, wide.
Because of the convenience of this encoding, some platforms such as Java and JavaScript adopted it. As a result, each element of a JavaScript string is represented by 16 bits (2 bytes).
As the Unicode character set grew, 65536 values were no longer enough: the number of code points now exceeds 2^16. As a result, the code point space is organized into 17 sub-ranges, each consisting of 2^16 code points. (17 * 2^16 = 1114112, so the current Unicode code point range is 0-1114111.)
The first sub-range holds the characters of the original UCS-2 set and is known as the Basic Multilingual Plane (BMP). The remaining 16 sub-ranges are called supplementary planes.
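The arithmetic above can be checked with a few lines of JavaScript. This is only a sketch: `PLANE_SIZE`, `PLANE_COUNT` and `planeOf` are names made up for these notes, not part of any standard API.

```javascript
// Hypothetical constants and helper for illustration only.
const PLANE_SIZE = 0x10000; // 2^16 = 65536 code points per sub-range (plane)
const PLANE_COUNT = 17;     // 1 BMP + 16 supplementary planes

// Returns the index (0-16) of the plane a code point falls into.
function planeOf(codePoint) {
  const limit = PLANE_SIZE * PLANE_COUNT; // 1114112
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint >= limit) {
    throw new RangeError("not a Unicode code point: " + codePoint);
  }
  return Math.floor(codePoint / PLANE_SIZE);
}

console.log(PLANE_SIZE * PLANE_COUNT); // 1114112
console.log(planeOf(0x41));            // 0  (BMP)
console.log(planeOf(0x1D11E));         // 1  (first supplementary plane)
console.log(planeOf(0x10FFFF));        // 16 (last plane, code point 1114111)
```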
To represent more characters, UCS-2's successor, UTF-16, is designed like this:
A character whose code point is greater than or equal to 65536 is represented by a pair of 16-bit code units, called a surrogate pair. A character whose code point is less than 65536 still needs only one 16-bit code unit. UTF-16 is therefore a variable-length encoding, so indexing by code point is no longer a constant-time operation. It usually requires scanning forward from the beginning of the string.
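To make the pairing concrete, here is a minimal sketch of how a code point can be split into UTF-16 code units. The function name `toUtf16CodeUnits` is my own, and the sketch assumes the input is a valid code point outside the surrogate range 0xD800-0xDFFF, which it does not check.

```javascript
// Hypothetical helper: returns the UTF-16 code units for one code point.
// Assumes codePoint is in 0..0x10FFFF and not itself a surrogate value.
function toUtf16CodeUnits(codePoint) {
  if (codePoint < 0x10000) {
    return [codePoint]; // BMP: a single 16-bit code unit is enough
  }
  const offset = codePoint - 0x10000;     // leaves a 20-bit value
  const high = 0xD800 + (offset >> 10);   // top 10 bits -> high surrogate
  const low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits -> low surrogate
  return [high, low];
}

const units = toUtf16CodeUnits(0x1D11E); // U+1D11E MUSICAL SYMBOL G CLEF
console.log(units.map((u) => u.toString(16))); // [ 'd834', 'dd1e' ]
console.log(String.fromCharCode(...units) === "\uD834\uDD1E"); // true
```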
In JavaScript, a string's length property and its charAt and charCodeAt methods all work in terms of code units rather than code points. Therefore, when JavaScript needs to represent a code point in a supplementary plane, it uses two code units. In short:
A JavaScript string is composed of 16-bit code units.
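A quick way to see this is to try a single character from a supplementary plane. The example below uses U+1D11E (the musical G clef); the variable name `clef` is just for illustration.

```javascript
// One code point, written as its two UTF-16 code units.
const clef = "\uD834\uDD1E"; // U+1D11E MUSICAL SYMBOL G CLEF

console.log(clef.length);                     // 2, not 1
console.log(clef.charCodeAt(0).toString(16)); // "d834" (high surrogate)
console.log(clef.charCodeAt(1).toString(16)); // "dd1e" (low surrogate)
console.log(clef.charAt(0) === clef);         // false: charAt returns half of the pair
```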
So handling code points outside the BMP can cause problems, because you can't rely on the length property or on the charAt and charCodeAt methods. That is the time to consider using a mature third-party library.
Summary:
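To show what such a library has to do, here is a rough sketch of counting and indexing by code point. It is not taken from any particular library; `isHighSurrogate`, `isLowSurrogate`, `codePointLength` and `codePointAtIndex` are hypothetical names, and unpaired surrogates are simply treated as single code points.

```javascript
// Hypothetical helpers for illustration only.
function isHighSurrogate(unit) { return unit >= 0xD800 && unit <= 0xDBFF; }
function isLowSurrogate(unit) { return unit >= 0xDC00 && unit <= 0xDFFF; }

// Counts code points by scanning from the start of the string.
function codePointLength(str) {
  let count = 0;
  for (let i = 0; i < str.length; i++) {
    if (isHighSurrogate(str.charCodeAt(i)) &&
        i + 1 < str.length && isLowSurrogate(str.charCodeAt(i + 1))) {
      i++; // skip the low half of a surrogate pair
    }
    count++;
  }
  return count;
}

// Returns the code point at the given code point index, or undefined.
function codePointAtIndex(str, index) {
  let count = 0;
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    let codePoint = unit;
    if (isHighSurrogate(unit) && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (isLowSurrogate(next)) {
        // Recombine the two 10-bit halves and add back the 0x10000 offset.
        codePoint = 0x10000 + ((unit - 0xD800) << 10) + (next - 0xDC00);
        i++;
      }
    }
    if (count === index) return codePoint;
    count++;
  }
  return undefined;
}

const text = "a\uD834\uDD1Eb"; // "a", U+1D11E, "b"
console.log(text.length);                            // 4 code units
console.log(codePointLength(text));                  // 3 code points
console.log(codePointAtIndex(text, 1).toString(16)); // "1d11e"
```

Both helpers walk the string from the start, which is exactly the linear-time cost of a variable-length encoding.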
- JavaScript strings are composed of 16-bit code units, not Unicode code points.
- Code points of 65536 and above are represented in JavaScript by two code units, known as a surrogate pair.
- Surrogate pairs affect how length, charAt and charCodeAt work.
- When processing strings that contain code points greater than 65535, consider using a third-party library and consult its documentation.