Unicode and JavaScript (1)

Source: Internet
Author: User

Last month, I gave a talk that covered the Unicode character set in detail, along with JavaScript's support for it. Below are the notes from that talk.

1. What is Unicode?

Unicode comes from a very simple idea: put all the characters in the world into a single set. As long as a computer supports this character set, it can display every character, and garbled text disappears.

It starts from 0 and assigns a number to each symbol, which is called its "code point". For example, the symbol at code point 0 is null (all binary bits are 0).

 
 
  U+0000 = null

In the line above, U+ indicates that the hexadecimal number that follows is a Unicode code point.

Currently, the latest version of Unicode is 7.0, which includes a total of 109,449 symbols, of which 74,500 are CJK (Chinese, Japanese, and Korean) characters. Roughly speaking, more than two-thirds of the symbols in use in the world come from East Asian scripts. For example, the code point of the Chinese character "好" (meaning "good") is 597D in hexadecimal.

 
 
  U+597D = 好
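As a small illustration in JavaScript, a character can be mapped to its U+ notation and back. This is only a sketch: it assumes an engine that provides `String.prototype.codePointAt`, `String.fromCodePoint` (ES2015), and `padStart` (ES2017), and the helper name `toUPlus` is made up for this example.

```javascript
// Hypothetical helper: format a code point in U+ notation,
// padded to at least four hexadecimal digits.
function toUPlus(codePoint) {
  return 'U+' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

const cp = '好'.codePointAt(0);             // numeric code point of the character
console.log(toUPlus(cp));                   // U+597D
console.log(toUPlus(0));                    // U+0000
console.log(String.fromCodePoint(0x597D));  // 好
```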

With so many symbols, Unicode was not defined all at once but in partitions. Each partition holds 65,536 (2^16) code points and is called a plane. There are currently 17 planes, so the whole Unicode code space holds 17 × 2^16 = 1,114,112 code points, and any code point fits in 21 bits.

The first 65,536 code points make up the basic plane (abbreviated BMP). Its code points range from 0 to 2^16 - 1, or U+0000 to U+FFFF in hexadecimal. All the most common characters are placed in this plane, which was the first plane Unicode defined and published.

The remaining characters are placed in the supplementary planes (abbreviated SMP in this article), whose code points range from U+010000 to U+10FFFF.
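Since every plane holds 0x10000 code points, the plane a code point belongs to can be found with one division. The following is a minimal sketch; `planeOf` is a hypothetical helper name, not part of any standard API.

```javascript
// Hypothetical helper: return the plane number (0 to 16) of a code point.
// Each plane holds 0x10000 (65536) code points.
function planeOf(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('not a Unicode code point: ' + codePoint);
  }
  return Math.floor(codePoint / 0x10000);
}

console.log(planeOf(0x597D));   // 0  (basic plane)
console.log(planeOf(0x1D306));  // 1  (outside the basic plane)
console.log(planeOf(0x10FFFF)); // 16 (the last plane)
```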

2. UTF-32 and UTF-8

Unicode only specifies the code point of each character. How that code point is represented in bytes is a matter of the encoding method.

The most intuitive encoding method represents every code point with four bytes whose content is simply the code point's value. This encoding is called UTF-32. For example, code point 0 is represented by four zero bytes, and code point 597D is padded with two leading zero bytes.

 
 
  U+0000 = 0x0000 0000
  U+597D = 0x0000 597D
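The two lines above can be reproduced with a short JavaScript sketch. It assumes big-endian byte order (most significant byte first, matching the way the values are written above), and both helper names are invented for illustration.

```javascript
// Hypothetical helper: encode one code point as four big-endian bytes.
function encodeUtf32BE(codePoint) {
  const buffer = new ArrayBuffer(4);
  new DataView(buffer).setUint32(0, codePoint, false); // false = big-endian (assumption)
  return new Uint8Array(buffer);
}

// Hypothetical helper: format bytes as two-digit hexadecimal for display.
function toHex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).padStart(2, '0')).join(' ');
}

console.log(toHex(encodeUtf32BE(0x0000))); // 00 00 00 00
console.log(toHex(encodeUtf32BE(0x597D))); // 00 00 59 7d
```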

The advantages of UTF-32 are that its conversion rule is simple and intuitive, and lookup is efficient. The disadvantage is that it wastes space: English text takes four times as much space as the same content in ASCII. This disadvantage is fatal. As a result, almost no one actually uses this encoding, and the HTML5 standard explicitly states that web pages must not be encoded in UTF-32.

What people really needed was a space-saving encoding, which led to the birth of UTF-8. UTF-8 is a variable-length encoding in which a character takes from 1 to 4 bytes. The more common a character is, the fewer bytes it takes: the first 128 characters are represented with a single byte each, exactly as in ASCII.

Number range          Bytes
0x0000-0x007F         1
0x0080-0x07FF         2
0x0800-0xFFFF         3
0x010000-0x10FFFF     4
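The table above translates directly into a range check. The sketch below only computes the byte count, not the actual UTF-8 bytes, and `utf8ByteLength` is a hypothetical helper name.

```javascript
// Hypothetical helper: number of bytes UTF-8 needs for one code point,
// following the ranges in the table above.
function utf8ByteLength(codePoint) {
  if (codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError('not a Unicode code point: ' + codePoint);
  }
  if (codePoint <= 0x7F) return 1;
  if (codePoint <= 0x7FF) return 2;
  if (codePoint <= 0xFFFF) return 3;
  return 4;
}

console.log(utf8ByteLength(0x41));    // 1 (the letter A)
console.log(utf8ByteLength(0x597D));  // 3
console.log(utf8ByteLength(0x1D306)); // 4
```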

3. Introduction to UTF-16

Because it saves space, UTF-8 has become the most common web page encoding on the Internet. However, it has little to do with today's topic, so I will not go into depth. For the details of the transcoding method, see the character encoding notes I wrote many years ago.

UTF-16 sits between UTF-32 and UTF-8 and combines the characteristics of fixed-length and variable-length encodings.

Its encoding rule is simple: a character in the basic plane occupies 2 bytes, and a character in a supplementary plane occupies 4 bytes. In other words, a UTF-16 encoded character is either 2 bytes long (U+0000 to U+FFFF) or 4 bytes long (U+010000 to U+10FFFF).
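The 2-or-4-byte rule is a single comparison. The sketch below uses a made-up helper, `utf16ByteLength`, and cross-checks it against JavaScript's own string length, which counts 16-bit units (this assumes an engine with `String.fromCodePoint`).

```javascript
// Hypothetical helper: number of bytes UTF-16 needs for one code point.
function utf16ByteLength(codePoint) {
  return codePoint <= 0xFFFF ? 2 : 4;
}

console.log(utf16ByteLength(0x597D));  // 2
console.log(utf16ByteLength(0x1D306)); // 4

// A JavaScript string's length counts 16-bit units, so it agrees:
console.log(String.fromCodePoint(0x597D).length * 2);  // 2
console.log(String.fromCodePoint(0x1D306).length * 2); // 4
```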

This raises a question: when we encounter two bytes, how can we tell whether they form a character on their own or must be interpreted together with the two bytes that follow?

The solution is clever, though I do not know whether it was designed that way on purpose. In the basic plane, the range from U+D800 to U+DFFF is an empty segment; that is, these code points do not correspond to any character. This empty segment can therefore be used to map the characters of the supplementary planes.

Specifically, the supplementary planes hold a total of 2^20 code points, so at least 20 binary bits are needed to address them. UTF-16 splits these 20 bits into two halves. The first 10 bits are mapped into U+D800 to U+DBFF (a space of size 2^10) and are called the high surrogate (H); the last 10 bits are mapped into U+DC00 to U+DFFF (also of size 2^10) and are called the low surrogate (L). This means that one character in a supplementary plane is represented by two code points from the basic plane.

Therefore, when we encounter two bytes whose value lies between U+D800 and U+DBFF, we can conclude that the two bytes immediately following should lie between U+DC00 and U+DFFF, and that the four bytes must be interpreted together.
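The detection rule just described can be sketched as a small decoder over an array of 16-bit values. All three function names are hypothetical, and passing an unpaired half through unchanged is an assumption of this sketch rather than something the text specifies.

```javascript
function isHighSurrogate(unit) {
  return unit >= 0xD800 && unit <= 0xDBFF;
}

function isLowSurrogate(unit) {
  return unit >= 0xDC00 && unit <= 0xDFFF;
}

// Hypothetical helper: turn an array of 16-bit units into code points.
function decodeUtf16(units) {
  const codePoints = [];
  for (let i = 0; i < units.length; i++) {
    const unit = units[i];
    const next = units[i + 1];
    if (isHighSurrogate(unit) && i + 1 < units.length && isLowSurrogate(next)) {
      // Recombine the two 10-bit halves and add back the 0x10000 offset.
      codePoints.push(0x10000 + (unit - 0xD800) * 0x400 + (next - 0xDC00));
      i++; // the low half has been consumed
    } else {
      // Basic-plane value, or an unpaired half passed through as-is (assumption).
      codePoints.push(unit);
    }
  }
  return codePoints;
}

const decoded = decodeUtf16([0x597D, 0xD834, 0xDF06]);
console.log(decoded.map((c) => c.toString(16)).join(' ')); // 597d 1d306
```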

4. Transcoding formula for UTF-16

When converting a Unicode code point to UTF-16, first determine whether it is a basic-plane character or a supplementary-plane character. If it is the former, the code point is written directly in the corresponding hexadecimal form, two bytes long.

 
 
  U+597D = 0x597D

For supplementary-plane characters, Unicode 3.0 gives the following transcoding formulas, where c is the code point.

 
 
  H = Math.floor((c - 0x10000) / 0x400) + 0xD800
  L = (c - 0x10000) % 0x400 + 0xDC00

Take the character 𝌆 as an example. It is a supplementary-plane character with code point U+1D306, and its conversion to UTF-16 is calculated as follows.

 
 
  H = Math.floor((0x1D306 - 0x10000) / 0x400) + 0xD800 = 0xD834
  L = (0x1D306 - 0x10000) % 0x400 + 0xDC00 = 0xDF06

Therefore, the UTF-16 encoding of this character is 0xD834 DF06, four bytes long.
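The calculation above can be wrapped in a function and cross-checked against the engine's own strings. This is a sketch only: `toUtf16Units` is a hypothetical name, the input is assumed to be a valid code point outside the U+D800 to U+DFFF range, and the cross-check assumes `String.fromCodePoint` is available.

```javascript
// Hypothetical helper: convert a code point to its UTF-16 16-bit units.
function toUtf16Units(c) {
  if (c <= 0xFFFF) {
    return [c]; // basic plane: the code point itself, two bytes
  }
  const offset = c - 0x10000;
  const H = Math.floor(offset / 0x400) + 0xD800; // first 10 bits
  const L = (offset % 0x400) + 0xDC00;           // last 10 bits
  return [H, L];
}

const [H, L] = toUtf16Units(0x1D306);
console.log(H.toString(16), L.toString(16)); // d834 df06

// Cross-check against the units the engine stores for the same character.
const s = String.fromCodePoint(0x1D306);
console.log(s.charCodeAt(0) === H, s.charCodeAt(1) === L); // true true
```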

