Unicode in JavaScript

By Jinya

[For more information, see http://blog.csdn.net/ei1_nino]

Glossary:

 

BMP: Basic Multilingual Plane, also referred to as "plane 0"

UCS: Universal Character Set

ISO: International Organization for Standardization

UTF: UCS Transformation Format

BOM: Byte Order Mark

CJK: Chinese, Japanese, and Korean, as in CJK Unified Ideographs

BE: Big Endian

LE: Little Endian

 

 

 

I. Introduction

 

Unicode (also called the unified code, universal code, or single code) is a character encoding standard used on computers. It was created to overcome the limitations of traditional character encoding schemes: it assigns a single, unique number to every character in every language, so that text can be converted and processed across languages and platforms. The Unicode Consortium was incorporated in January 1991, and version 1.0 of the standard was published in October 1991.

 

II. UCS

 

The Universal Character Set (UCS) is the standard character set defined by ISO 10646 (ISO/IEC 10646). UCS-2 encodes each character in 2 bytes, and UCS-4 encodes each character in 4 bytes.

UCS-4 is divided into 2^7 = 128 groups according to the highest byte, whose top bit is always 0. Each group is divided into 256 planes according to the second-highest byte. Each plane is divided into 256 rows according to the third byte, and each row has 256 cells. Plane 0 of group 0 is called the BMP (Basic Multilingual Plane). For a UCS-4 code in the BMP, the first two bytes are zero, and removing them gives the UCS-2 code.
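The group/plane/row/cell split described above can be sketched with a few bit shifts. This is only an illustration: splitUcs4 and isInBmp are hypothetical helper names, not standard APIs.

```javascript
// Split a 4-byte UCS-4 code into the four fields described above.
// splitUcs4 is a hypothetical helper name used only for this sketch.
function splitUcs4(code) {
  return {
    group: (code >>> 24) & 0x7f, // highest byte; its top bit is always 0
    plane: (code >>> 16) & 0xff, // second-highest byte
    row: (code >>> 8) & 0xff,    // third byte
    cell: code & 0xff,           // lowest byte
  };
}

// A code is in the BMP when it lies in plane 0 of group 0.
function isInBmp(code) {
  const { group, plane } = splitUcs4(code);
  return group === 0 && plane === 0;
}

console.log(splitUcs4(0x6c49));
console.log(isInBmp(0x6c49), isInBmp(0x20c30)); // true false
```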

 

III. Unicode

 

The Unicode standard aims to accommodate every Chinese character in the Kangxi Dictionary within its 32-bit encoding.

 

Unicode extends the ASCII character set. Strict ASCII represents each character in 7 bits, and the character sets commonly used on computers are 8 bits wide, whereas Unicode was designed as a full 16-bit character set (it has since been extended beyond 16 bits, up to the code point 0x10FFFF). This allows Unicode to represent the letters, ideographs, and other symbols of all the world's written languages that are likely to be used in computer communication. Unicode was originally intended as a supplement to ASCII and, if possible, its eventual replacement. Considering that ASCII is the most dominant standard in computing, that is certainly an ambitious goal.

Unicode affects every part of the computer industry, but its greatest impact is probably on operating systems and programming languages. In this respect, progress is already under way: Windows NT supports Unicode from the ground up (unfortunately, Windows 98 supports only a small part of it), and the C programming language, which is inherently bound to ANSI, supports Unicode through wide character sets.

IV. UTF-8

 

The bytes FF and FE never appear in UTF-8 encoded text, so they can be used to mark text as UTF-16 or UTF-32 (see BOM). UTF-8 itself is independent of byte order.

 

UTF-8 encodes Unicode one byte at a time. The mapping from Unicode to UTF-8 is as follows:

Unicode code point (hex)    UTF-8 byte stream (binary)
000000-00007F               0xxxxxxx                               (7 x)
000080-0007FF               110xxxxx 10xxxxxx                      (11 x)
000800-00FFFF               1110xxxx 10xxxxxx 10xxxxxx             (16 x)
010000-10FFFF               11110xxx 10xxxxxx 10xxxxxx 10xxxxxx    (21 x)

 

UTF-8 uses encodings of different lengths for characters in different ranges. For characters in the range 0x00-0x7F, which lie in plane 0 (the BMP), the UTF-8 encoding is exactly the same as the ASCII encoding.

-> "\x32"      "2"
-> "\u0032"    "2"

The maximum length of a UTF-8 encoding is 4 bytes. The table above shows that the 4-byte template has 21 x positions, so it can hold a 21-bit binary number.

The largest Unicode code point is 0x10FFFF, which needs only 21 bits.

Example 1: The Unicode code point of the Chinese character "汉" is 0x6C49. 0x6C49 lies between 0x0800 and 0xFFFF, so it uses the 3-byte template: 1110xxxx 10xxxxxx 10xxxxxx.

Write 0x6C49 in binary: 0110 1100 0100 1001. Replacing the x positions in the template with this bit stream in order gives 11100110 10110001 10001001, that is, E6 B1 89.

-> encodeURI("汉")    "%E6%B1%89"

Example 2: The Unicode code point 0x20C30 lies between 0x010000 and 0x10FFFF, so it uses the 4-byte template: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Write 0x20C30 as a 21-bit binary number (padded with leading zeros if it has fewer than 21 bits): 0 0010 0000 1100 0011 0000. Replacing the x positions in the template with this bit stream in order gives 11110000 10100000 10110000 10110000, that is, F0 A0 B0 B0.
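The template substitution used in the two examples can be sketched as a small function. This is a minimal sketch, not a complete encoder: encodeUtf8 and toHex are hypothetical helper names, and the function does not reject the surrogate range D800-DFFF.

```javascript
// Encode one Unicode code point into UTF-8 bytes by filling the
// templates from the table: 0xxxxxxx, 110xxxxx 10xxxxxx, and so on.
// encodeUtf8 is a hypothetical helper name used only for this sketch.
function encodeUtf8(codePoint) {
  if (codePoint < 0 || codePoint > 0x10ffff) {
    throw new RangeError("code point out of range");
  }
  if (codePoint <= 0x7f) {
    return [codePoint];
  }
  if (codePoint <= 0x7ff) {
    return [0xc0 | (codePoint >> 6), 0x80 | (codePoint & 0x3f)];
  }
  if (codePoint <= 0xffff) {
    return [
      0xe0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3f),
      0x80 | (codePoint & 0x3f),
    ];
  }
  return [
    0xf0 | (codePoint >> 18),
    0x80 | ((codePoint >> 12) & 0x3f),
    0x80 | ((codePoint >> 6) & 0x3f),
    0x80 | (codePoint & 0x3f),
  ];
}

// Format bytes as upper-case hex pairs, e.g. "E6 B1 89".
function toHex(bytes) {
  return bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
}

console.log(toHex(encodeUtf8(0x6c49)));  // E6 B1 89
console.log(toHex(encodeUtf8(0x20c30))); // F0 A0 B0 B0
```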

 

V. The Love Between JavaScript and Unicode

 

String.fromCharCode converts a numeric character code, for example one written in hexadecimal, into a string.

Inside a string literal, \u followed by four hexadecimal digits is converted into the corresponding character.

-> String.fromCharCode("0x4e01")    "丁"
-> 0x4e01.toString(10)              "19969"
-> String.fromCharCode(19969)       "丁"
-> "0x4e01".toString()              "0x4e01"
-> "\u4e01".toString()              "丁"
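One caveat worth illustrating: String.fromCharCode works on 16-bit code units, so it cannot directly produce a character outside the BMP such as 0x20C30. The sketch below assumes an ES2015 or later environment, where String.fromCodePoint and codePointAt are available.

```javascript
// Inside the BMP, fromCharCode and the \u escape give the same character.
const ding = String.fromCharCode(0x4e01);
console.log(ding === "\u4e01"); // true

// Outside the BMP, fromCharCode keeps only the low 16 bits of its argument.
const truncated = String.fromCharCode(0x20c30);
console.log(truncated.charCodeAt(0).toString(16)); // c30

// fromCodePoint produces the full character as a surrogate pair.
const full = String.fromCodePoint(0x20c30);
console.log(full.length);                      // 2
console.log(full.codePointAt(0).toString(16)); // 20c30
console.log(full === "\ud843\udc30");          // true
```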

 
VI. Output

 

Output "\u4e01"

-> eval('"\u4e01"')        "丁"
   = "\u4e01" -> "丁" -> eval('"丁"')
-> eval('"\\u4e01"')       "丁"
   = "\\u4e01" -> "\u4e01" -> eval('"\u4e01"')
-> eval('"\\\u4e01"')      "丁"
   = "\\\u4e01" -> "\丁" -> eval('"\丁"')
-> eval('"\\\\u4e01"')     "\u4e01"
   = "\\\\u4e01" -> "\\u4e01" -> eval('"\\u4e01"')

-> '\u4e01'                "丁"
-> '\\u4e01'               "\u4e01"
-> '\\\u4e01'              "\丁"
-> '\\\\u4e01'             "\\u4e01"

Output "\丁"

-> "\丁"                   "丁"
-> "\\丁"                  "\丁"
-> "\\\丁"                 "\丁"
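The difference between the character and its escaped spelling can also be checked without eval. The sketch below uses JSON.parse as a stand-in for eval, on the assumption that the text between the quotes is a valid JSON string; the variable names are illustrative only.

```javascript
// "\u4e01" is one character; "\\u4e01" is the six characters \ u 4 e 0 1.
const character = "\u4e01";
const escapedText = "\\u4e01";

console.log(character.length);   // 1
console.log(escapedText.length); // 6

// Parsing the escaped text as a string literal yields the character,
// which is what eval('"\\u4e01"') does in the examples above.
const parsed = JSON.parse('"' + escapedText + '"');
console.log(parsed === character); // true

// A backslash before a character with no special meaning is dropped.
console.log("\丁" === character); // true
```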


 

VII. BOM

 

There are two byte orders: "Big Endian" (BE) and "Little Endian" (LE).

Depending on the byte order, UTF-16 can be serialized as UTF-16LE or UTF-16BE, and UTF-32 as UTF-32LE or UTF-32BE. For example:

Unicode code point    UTF-16LE       UTF-16BE       UTF-32LE       UTF-32BE
0x006C49              49 6C          6C 49          49 6C 00 00    00 00 6C 49
0x020C30              43 D8 30 DC    D8 43 DC 30    30 0C 02 00    00 02 0C 30
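The rows of this table can be reproduced with a short sketch. The helper names (toUtf16CodeUnits, utf16Bytes, utf32Bytes, hex) are hypothetical, and the surrogate-pair arithmetic is the standard UTF-16 rule for code points above 0xFFFF, which the table applies to 0x020C30.

```javascript
// Split a code point into one UTF-16 code unit, or a surrogate pair
// when it lies above the BMP.
function toUtf16CodeUnits(codePoint) {
  if (codePoint <= 0xffff) {
    return [codePoint];
  }
  const offset = codePoint - 0x10000;
  return [0xd800 + (offset >> 10), 0xdc00 + (offset & 0x3ff)];
}

// Serialize each 16-bit unit in the requested byte order.
function utf16Bytes(codePoint, littleEndian) {
  const bytes = [];
  for (const unit of toUtf16CodeUnits(codePoint)) {
    const high = unit >> 8;
    const low = unit & 0xff;
    if (littleEndian) {
      bytes.push(low, high);
    } else {
      bytes.push(high, low);
    }
  }
  return bytes;
}

// UTF-32 is the code point itself as four bytes.
function utf32Bytes(codePoint, littleEndian) {
  const bigEndian = [
    (codePoint >>> 24) & 0xff,
    (codePoint >>> 16) & 0xff,
    (codePoint >>> 8) & 0xff,
    codePoint & 0xff,
  ];
  return littleEndian ? bigEndian.reverse() : bigEndian;
}

function hex(bytes) {
  return bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
}

console.log(hex(utf16Bytes(0x20c30, true)));  // 43 D8 30 DC
console.log(hex(utf16Bytes(0x20c30, false))); // D8 43 DC 30
console.log(hex(utf32Bytes(0x20c30, true)));  // 30 0C 02 00
console.log(hex(utf32Bytes(0x6c49, false)));  // 00 00 6C 49
```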

 

We recommend using a BOM (Byte Order Mark) to indicate the byte order: before the byte stream itself, transmit the character "ZERO WIDTH NO-BREAK SPACE", whose code is FEFF. The byte-swapped values FFFE (UTF-16) and FFFE0000 (UTF-32) are not defined as characters in Unicode and should not appear in actual transmission, so the receiver can tell the byte order from which form it sees.

The following table lists the BOM of each UTF encoding:

UTF encoding         Byte Order Mark (BOM)
UTF-8 without BOM    None
UTF-8 with BOM       EF BB BF
UTF-16LE             FF FE
UTF-16BE             FE FF
UTF-32LE             FF FE 00 00
UTF-32BE             00 00 FE FF
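A receiver could use this table to guess the encoding from the first bytes of a stream. The following is a minimal sketch with a hypothetical detectBom helper. It checks the UTF-32LE mark before the UTF-16LE mark because both begin with FF FE, and it therefore assumes that UTF-16LE text does not start with a NUL character right after its BOM.

```javascript
// Return the encoding named by a leading BOM, or null when there is none.
// detectBom is a hypothetical helper name used only for this sketch.
function detectBom(bytes) {
  const startsWith = (signature) => signature.every((b, i) => bytes[i] === b);

  if (startsWith([0xef, 0xbb, 0xbf])) return "UTF-8";
  // The 4-byte marks must be tested before the 2-byte marks.
  if (startsWith([0xff, 0xfe, 0x00, 0x00])) return "UTF-32LE";
  if (startsWith([0x00, 0x00, 0xfe, 0xff])) return "UTF-32BE";
  if (startsWith([0xff, 0xfe])) return "UTF-16LE";
  if (startsWith([0xfe, 0xff])) return "UTF-16BE";
  return null;
}

console.log(detectBom([0xef, 0xbb, 0xbf, 0xe6, 0xb1, 0x89])); // UTF-8
console.log(detectBom([0xff, 0xfe, 0x49, 0x6c]));             // UTF-16LE
console.log(detectBom([0xe6, 0xb1, 0x89]));                   // null
```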

 

VIII. Discussion

 

Why does a Chinese character take 3 bytes in UTF-8?

 

4E00-9FBF: CJK Unified Ideographs

Code points in the range 000800-00FFFF are encoded in UTF-8 as 1110xxxx 10xxxxxx 10xxxxxx, which is 3 bytes.

 

English text is represented in ASCII, and the ASCII encoding is exactly the same as the UTF-8 encoding for the range 0x00-0x7F.

Code points in the range 000000-00007F are encoded in UTF-8 as 0xxxxxxx, which is 1 byte.
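The byte counts can be checked directly. This sketch assumes an environment that provides the TextEncoder global (modern browsers and recent Node.js versions); utf8Length is a hypothetical helper name.

```javascript
// Count how many bytes a string occupies once encoded as UTF-8.
const encoder = new TextEncoder(); // always encodes to UTF-8

function utf8Length(text) {
  return encoder.encode(text).length;
}

console.log(utf8Length("A"));      // 1
console.log(utf8Length("\u4e01")); // 3
console.log(utf8Length("A\u4e01")); // 4
```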

 

 

How do you pick a random Chinese character?

 

-> 0x4e00.toString(10)    "19968"
-> 0x9FBF.toString(10)    "40895"
-> 40895 - 19968          20927
-> String.fromCharCode(19968 + Math.round(Math.random() * 20927))
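The same idea can be wrapped in a reusable function. This sketch uses Math.floor over the full span rather than Math.round, because rounding gives the first and last characters only half the chance of the others. randomChineseChar and its optional random parameter are hypothetical names introduced to make the function testable.

```javascript
// Range used in this article for CJK Unified Ideographs.
const FIRST_CJK = 0x4e00;
const LAST_CJK = 0x9fbf;

// Pick one character uniformly from FIRST_CJK..LAST_CJK.
// The random argument defaults to Math.random and must return a
// number in [0, 1).
function randomChineseChar(random = Math.random) {
  const span = LAST_CJK - FIRST_CJK + 1; // 20928 characters
  return String.fromCharCode(FIRST_CJK + Math.floor(random() * span));
}

console.log(randomChineseChar());
```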


 

Does UTF-8 need a BOM header?

 

UTF-8 has the same byte order on every system, so it does not actually need a BOM. In PHP, however, nothing may be output before a session is created, so the BOM must be removed from PHP files encoded in UTF-8.
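Removing the mark is a matter of dropping the three bytes EF BB BF from the start of the file, or the character FEFF from the start of the decoded text. A minimal sketch, with hypothetical helper names and with file reading and writing left out:

```javascript
// Drop a leading UTF-8 BOM (EF BB BF) from an array of bytes.
function stripUtf8BomBytes(bytes) {
  const hasBom = bytes[0] === 0xef && bytes[1] === 0xbb && bytes[2] === 0xbf;
  return hasBom ? bytes.slice(3) : bytes;
}

// Drop a leading U+FEFF from text that has already been decoded.
function stripBom(text) {
  return text.charCodeAt(0) === 0xfeff ? text.slice(1) : text;
}

console.log(stripUtf8BomBytes([0xef, 0xbb, 0xbf, 0x3c, 0x3f])); // [ 60, 63 ]
console.log(stripBom("\ufeff<?php") === "<?php");               // true
```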

 

Is Unicode in HTML the same as in JavaScript?

 

In HTML, "&#" followed by the Unicode number and ";" gives the corresponding character. Use "&#x" for a hexadecimal number.

 

document.write("&#x4e01;") => 丁

document.write("&#19968;") => 一
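Going the other way, from a character to its HTML numeric reference, takes only the character's code point. The helper names below are hypothetical, and codePointAt assumes an ES2015 or later environment.

```javascript
// Build the hexadecimal reference, e.g. "&#x4e01;".
function toHexEntity(character) {
  return "&#x" + character.codePointAt(0).toString(16) + ";";
}

// Build the decimal reference, e.g. "&#19968;".
function toDecimalEntity(character) {
  return "&#" + character.codePointAt(0) + ";";
}

console.log(toHexEntity("\u4e01"));     // &#x4e01;
console.log(toDecimalEntity("\u4e00")); // &#19968;
```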

 

How do you find Chinese characters?

-> "么么哒,memda".match(/[\u4e00-\u9FBF]/img)    ["么", "么", "哒"]
 

 

Length?

-> "么".length         1
-> "\u4e01".length     1

How do you get the Unicode code of a Chinese character?

-> "么".charCodeAt(0).toString(16)
"4e48"
-> var a = "么么哒,meme"
-> a.replace(/[\u4e00-\u9fbf]/img, function ($) { return "\\u" + $.charCodeAt(0).toString(16); })
"\u4e48\u4e48\u54d2,meme"
-> parseInt(encodeURI("丁").split("%").slice(1).map(function (v) { return parseInt(v, 16).toString(2).replace(/^1*0/, ""); }).join(""), 2).toString(16)
"4e01"
= encodeURI("丁") => "%E4%B8%81" => ["E4", "B8", "81"] => ["0100", "111000", "000001"] => "0100111000000001" => 19969 => "4e01"
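The last one-liner strips the UTF-8 marker bits with string operations. The same decoding can be sketched with bit masks, which also covers 1-byte and 4-byte sequences. decodeUtf8 and percentEncodedToBytes are hypothetical helper names, and percentEncodedToBytes assumes that every byte of its input is percent-encoded, which holds for encodeURI applied to a single Chinese character.

```javascript
// Decode the first UTF-8 sequence in an array of bytes to a code point.
function decodeUtf8(bytes) {
  const first = bytes[0];
  let codePoint;
  let continuationCount;

  if (first < 0x80) {
    codePoint = first;            // 0xxxxxxx
    continuationCount = 0;
  } else if ((first & 0xe0) === 0xc0) {
    codePoint = first & 0x1f;     // 110xxxxx
    continuationCount = 1;
  } else if ((first & 0xf0) === 0xe0) {
    codePoint = first & 0x0f;     // 1110xxxx
    continuationCount = 2;
  } else if ((first & 0xf8) === 0xf0) {
    codePoint = first & 0x07;     // 11110xxx
    continuationCount = 3;
  } else {
    throw new RangeError("invalid lead byte");
  }

  for (let i = 1; i <= continuationCount; i++) {
    const next = bytes[i];
    if (next === undefined || (next & 0xc0) !== 0x80) {
      throw new RangeError("invalid continuation byte");
    }
    codePoint = (codePoint << 6) | (next & 0x3f); // append 6 payload bits
  }
  return codePoint;
}

// Turn "%E4%B8%81" into [0xE4, 0xB8, 0x81].
function percentEncodedToBytes(text) {
  return text.split("%").slice(1).map((pair) => parseInt(pair, 16));
}

const bytes = percentEncodedToBytes(encodeURI("\u4e01"));
console.log(decodeUtf8(bytes).toString(16)); // 4e01
```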

 

 

