Effective JavaScript String Encoding Item 7

Last Update:2014-08-21 Source: Internet

Author: User

This series is a set of reading notes on Effective JavaScript.

When Unicode comes up, many programmers find it cumbersome, but in essence Unicode is not complicated. Every character in every writing system in the world is assigned an integer value in the range 0 to 1114111. In Unicode terminology this value is called a code point. In mapping characters to integer values, Unicode is no different from other encodings such as ASCII.
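As a small illustration of the character-to-integer mapping, the sketch below reads the code points of two characters. The variable names are only illustrative, and it relies on charCodeAt returning the code point directly, which holds for characters below 65536 such as the two used here.

```javascript
// Each character maps to one integer code point.
// For these two characters, charCodeAt(0) returns that integer directly.
const codeOfA = "A".charCodeAt(0);          // Latin capital letter A
const codeOfZhong = "\u4E2D".charCodeAt(0); // CJK character U+4E2D

// The largest code point, written in hexadecimal.
const MAX_CODE_POINT = 0x10FFFF;

console.log(codeOfA);        // 65
console.log(codeOfZhong);    // 20013
console.log(MAX_CODE_POINT); // 1114111
```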

However, Unicode can be encoded in many ways, while ASCII has only one:

Character set	Encoding methods
ASCII	ASCII encoding
Unicode	UTF-8, UTF-16, UTF-32, etc.

So why does Unicode have so many encodings? Because different situations place different demands on the time and space cost of string operations.

When Unicode was first designed, it was estimated that all code points could be represented with 2 to the 16th power, i.e. 65536, values. The resulting encoding, UCS-2, was the original 16-bit encoding for Unicode. With it, each code point is represented by a single 16-bit value, which is called a code unit. The advantage of this representation is that indexing into a Unicode string takes constant time, because every character is 16 bits, i.e. 2 bytes.

Because this encoding is so convenient, some platforms such as Java and JavaScript adopted it. As a result, each element of a JavaScript string is a 16-bit (2-byte) value.

As the Unicode character set grew, 65536 values were no longer enough: the number of characters in Unicode now exceeds 2 to the 16th power. As a result, the code point space was organized into 17 sub-ranges of 2^16 code points each. (17 * 2^16 = 1114112, so the current Unicode code point range is 0 to 1114111.)

The first sub-range holds the characters of the original UCS-2 character set and is known as the Basic Multilingual Plane (BMP). The remaining 16 sub-ranges are called the supplementary planes.
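The arithmetic behind the planes can be checked with a short sketch. The helper planeOf and the two constants are hypothetical names introduced here for illustration; they are not part of any standard API.

```javascript
const PLANE_SIZE = 0x10000; // 2^16 = 65536 code points per plane
const PLANE_COUNT = 17;     // plane 0 is the BMP, planes 1 to 16 are supplementary

// Returns the index of the plane that contains the given code point.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 ||
      codePoint >= PLANE_SIZE * PLANE_COUNT) {
    throw new RangeError("not a valid code point: " + codePoint);
  }
  return Math.floor(codePoint / PLANE_SIZE);
}

console.log(PLANE_SIZE * PLANE_COUNT); // 1114112
console.log(planeOf(0x4E2D));          // 0  (inside the BMP)
console.log(planeOf(0x1F600));         // 1  (first supplementary plane)
console.log(planeOf(0x10FFFF));        // 16 (last plane)
```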

To represent more characters, UTF-16, the successor to UCS-2, is designed like this:

A character whose code point is greater than or equal to 65536 is represented by a pair of 16-bit code units, known as a surrogate pair. A character whose code point is less than 65536 still needs only one 16-bit code unit. UTF-16 is therefore a variable-length encoding, so indexing by code point is no longer a constant-time operation: it generally requires scanning forward from the beginning of the string.
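The text does not spell out how the pair of code units is computed. The sketch below follows the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into two 10-bit halves. The function name toCodeUnits is hypothetical, and for simplicity the sketch does not reject inputs that fall inside the surrogate range itself.

```javascript
// Encodes one code point as an array of one or two 16-bit code units.
function toCodeUnits(codePoint) {
  if (codePoint < 0x10000) {
    return [codePoint]; // BMP: a single code unit
  }
  const offset = codePoint - 0x10000;      // 20-bit value
  const high = 0xD800 + (offset >> 10);    // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF);   // bottom 10 bits
  return [high, low];
}

const units = toCodeUnits(0x1F600);
console.log(units.map(u => u.toString(16))); // [ 'd83d', 'de00' ]
console.log(toCodeUnits(0x4E2D).length);     // 1
```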

In JavaScript, a string's length property and its charAt and charCodeAt methods all work at the level of code units rather than code points. So when JavaScript needs to represent a code point in a supplementary plane, it uses two code units. In short:

A JavaScript string is a sequence of 16-bit code units.
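A quick way to see this is to build a string that contains one supplementary-plane character. The example below uses U+1F600, written with explicit escapes for its two code units so that the source file's own encoding does not matter. The variable name is only illustrative.

```javascript
// "a", then U+1F600 as two code units, then "b": three characters to a reader.
const sample = "a\uD83D\uDE00b";

console.log(sample.length);                     // 4, not 3
console.log(sample.charCodeAt(1).toString(16)); // d83d (first half of the pair)
console.log(sample.charCodeAt(2).toString(16)); // de00 (second half of the pair)
console.log(sample.charAt(1) === "\uD83D");     // true: charAt returns half a character
```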

So code points outside the BMP can cause problems, because you cannot rely on the length property or the charAt and charCodeAt methods to count or index characters. At that point, consider using a mature third-party library.
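To give a sense of what code point-aware code has to do, here is a minimal sketch of counting code points by hand. The function countCodePoints is a hypothetical helper written for this note, not taken from any library, and it treats an unpaired surrogate as one code point. A real library handles many more cases.

```javascript
// Counts code points by treating each valid surrogate pair as one unit.
function countCodePoints(str) {
  let count = 0;
  for (let i = 0; i < str.length; i++) {
    const unit = str.charCodeAt(i);
    const isHigh = unit >= 0xD800 && unit <= 0xDBFF;
    if (isHigh && i + 1 < str.length) {
      const next = str.charCodeAt(i + 1);
      if (next >= 0xDC00 && next <= 0xDFFF) {
        i++; // skip the low surrogate that completes the pair
      }
    }
    count++;
  }
  return count;
}

console.log(countCodePoints("abc"));             // 3
console.log(countCodePoints("a\uD83D\uDE00b"));  // 3, while .length is 4
```

Note that the loop visits every code unit, which matches the earlier point that code point operations on UTF-16 are not constant time.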

Summary:

JavaScript strings consist of 16-bit code units, not Unicode code points.
Code points of 65536 and above are represented in JavaScript by two code units, known as a surrogate pair.
Surrogate pairs throw off the results of length, charAt, and charCodeAt.
When processing strings that contain code points above 65535, consider using a third-party library and consult its documentation.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
