Unicode in JavaScript

Last Update:2015-02-09 Source: Internet

Author: User



By Jinya

[For more information, see http://blog.csdn.net/ei1_nino]

Glossary:

BMP: Basic Multilingual Plane, also referred to as "plane 0"

UCS: Universal Character Set

ISO: International Organization for Standardization

UTF: UCS Transformation Format

BOM: Byte Order Mark

CJK: Chinese, Japanese, and Korean, as in CJK Unified Ideographs

BE: Big Endian

LE: Little Endian

I. Introduction

Unicode (also called the unified code or universal code) is a character encoding standard used on computers. It was created to address the limitations of traditional character encoding schemes. It assigns a uniform, unique code to every character of every language, to meet the requirements of cross-language and cross-platform text conversion and processing. Work on it began in the late 1980s, and version 1.0 of the standard was published in October 1991.

II. UCS

The Universal Character Set is a standard character set defined by ISO 10646 (or ISO/IEC 10646). UCS-2 encodes each character in two bytes, and UCS-4 encodes each character in four bytes.

UCS-4 is divided into 2^7 = 128 groups according to the highest byte, whose top bit is always 0. Each group is further divided into 256 planes according to the second-highest byte. Each plane is divided into 256 rows according to the third byte, and each row has 256 cells. Plane 0 of group 0 is called the BMP (Basic Multilingual Plane). For a character in the BMP, the first two bytes of its UCS-4 code are both zero, and removing them gives its UCS-2 code.
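The byte-by-byte split described above can be sketched in JavaScript. This is only an illustration: the function name `ucs4Parts` is made up for this example and is not part of any standard API.

```javascript
// Hypothetical helper: split a UCS-4 code into group, plane, row and cell.
function ucs4Parts(code) {
  return {
    group: (code >>> 24) & 0x7f, // highest byte, top bit always 0 -> 128 groups
    plane: (code >>> 16) & 0xff, // second-highest byte -> 256 planes per group
    row: (code >>> 8) & 0xff,    // third byte -> 256 rows per plane
    cell: code & 0xff,           // lowest byte -> 256 cells per row
  };
}

console.log(ucs4Parts(0x6c49));  // group 0, plane 0: inside the BMP
console.log(ucs4Parts(0x20c30)); // group 0, plane 2: outside the BMP
```

When both `group` and `plane` are 0, the character is in the BMP and the remaining `row` and `cell` bytes are its UCS-2 code.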

III. Unicode

One goal of the Unicode standard is to fit all the Chinese characters of the Kangxi Dictionary into its 32-bit code space.

Unicode is an extension of the ASCII character set. In strict ASCII, each character is represented in 7 bits, or in the 8-bit width that computers commonly use, while Unicode was originally designed as a full 16-bit character set (it has since grown to cover code points up to 0x10FFFF). This enables Unicode to represent the letters, ideographs, and other symbols of all the world's written languages that may be used in computer communication. Unicode was originally intended to supplement ASCII and, if possible, eventually replace it. Considering that ASCII is the most dominant standard in computing, that is an ambitious goal.

Unicode affects every part of the computer industry, but its greatest impact is probably on operating systems and programming languages. On that front, progress is under way. Windows NT supports Unicode from the ground up (unfortunately, Windows 98 supports only a small part of it). The C programming language, as standardized by ANSI, supports Unicode through its wide character set.

IV. UTF-8

The bytes FF and FE never appear in UTF-8 encoded text, so they can be used to mark text as UTF-16 or UTF-32 (see BOM). UTF-8 itself is independent of byte order.

UTF-8 encodes Unicode in units of one byte. The encoding from Unicode to UTF-8 is as follows:

Unicode encoding (hexadecimal)	UTF-8 byte stream (binary)
000000-00007F	0xxxxxxx (7 x)
000080-0007FF	110xxxxx 10xxxxxx (11 x)
000800-00FFFF	1110xxxx 10xxxxxx 10xxxxxx (16 x)
010000-10FFFF	11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 x)

UTF-8 uses encodings of different lengths for characters in different ranges. For the characters between 0x00 and 0x7F, at the start of plane 0 (the BMP), the UTF-8 encoding is exactly the same as the ASCII encoding.

-> "\x32"      "2"
-> "\u0032"    "2"

The maximum length of a UTF-8 encoding is 4 bytes. From the table above, we can see that the 4-byte template has 21 x's, so it can hold a 21-bit binary number.

The largest Unicode code point is 0x10FFFF, which needs only 21 bits.

Example 1: The Unicode code of the Chinese character 汉 is 0x6C49. 0x6C49 is between 0x0800 and 0xFFFF, so it uses the 3-byte template: 1110xxxx 10xxxxxx 10xxxxxx.

Write 0x6C49 in binary: 0110 1100 0100 1001. Replace the x's in the template with this bit stream in order, and you get 11100110 10110001 10001001, that is, E6 B1 89.

-> encodeURI("汉")    "%E6%B1%89"

Example 2: The Unicode code 0x20C30 is between 0x10000 and 0x10FFFF, so it uses the 4-byte template: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Write 0x20C30 as a 21-bit binary number (padding with leading zeros if it has fewer than 21 bits): 0 0010 0000 1100 0011 0000. Replace the x's in the template with this bit stream in order: 11110000 10100000 10110000 10110000, that is, F0 A0 B0 B0.
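The two worked examples can be reproduced with a small encoder that applies the templates from the table directly. This is a minimal sketch: `utf8Bytes` and `toHex` are names invented here, and the code assumes its input is a valid code point (it does not reject surrogates or values above 0x10FFFF).

```javascript
// Hypothetical helper: encode one code point using the UTF-8 templates.
function utf8Bytes(cp) {
  if (cp <= 0x7f) return [cp];                                   // 0xxxxxxx
  if (cp <= 0x7ff) return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]; // 110xxxxx 10xxxxxx
  if (cp <= 0xffff) {
    // 1110xxxx 10xxxxxx 10xxxxxx
    return [0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f)];
  }
  // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
  return [
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

// Hypothetical helper: format bytes as space-separated uppercase hex pairs.
function toHex(bytes) {
  return bytes.map(function (b) {
    return b.toString(16).toUpperCase().padStart(2, "0");
  }).join(" ");
}

console.log(toHex(utf8Bytes(0x6c49)));  // E6 B1 89
console.log(toHex(utf8Bytes(0x20c30))); // F0 A0 B0 B0
```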

V. The love between JavaScript and Unicode

You can use the String.fromCharCode method to convert a numeric code unit value, for example one written in hexadecimal, into a string.

In a string literal, \u followed by four hexadecimal digits also produces the corresponding character.

-> String.fromCharCode("0x4e01")    "丁"
-> 0x4e01.toString(10)    "19969"
-> String.fromCharCode(19969)    "丁"
-> "0x4e01".toString()    "0x4e01"
-> "\u4e01".toString()    "丁"

VI. Output

Output "\u4e01"

Each round of string-literal parsing consumes one level of backslash escaping, so the number of backslashes decides whether you end up with the character 丁 or the literal text \u4e01. Each line below shows the expression, its result, and the intermediate string that eval actually receives.

-> eval('"\u4e01"')    "丁"    = "\u4e01" -> "丁" -> eval('"丁"')
-> eval('"\\u4e01"')    "丁"    = "\\u4e01" -> "\u4e01" -> eval('"\u4e01"')
-> eval('"\\\u4e01"')    "丁"    = "\\\u4e01" -> "\丁" -> eval('"\丁"')
-> eval('"\\\\u4e01"')    "\u4e01"    = "\\\\u4e01" -> "\\u4e01" -> eval('"\\u4e01"')

-> '\u4e01'    "丁"
-> '\\u4e01'    "\u4e01"
-> '\\\u4e01'    "\丁"
-> '\\\\u4e01'    "\\u4e01"

Output "\丁"

-> "\丁"    "丁"
-> "\\丁"    "\丁"
-> "\\\丁"    "\丁"

VII. BOM

There are two types of byte order: "Big Endian" (BE) and "Little Endian" (LE).

Depending on the byte order, UTF-16 can be implemented as UTF-16LE or UTF-16BE, and UTF-32 can be implemented as UTF-32LE or UTF-32BE. For example:

Unicode encoding	UTF-16LE	UTF-16BE	UTF-32LE	UTF-32BE
0x006C49	49 6C	6C 49	49 6C 00 00	00 00 6C 49
0x020C30	43 D8 30 DC	D8 43 DC 30	30 0C 02 00	00 02 0C 30
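The UTF-16 values for 0x6C49 and 0x20C30 shown above can be derived with the following sketch. Code points above 0xFFFF are split into a surrogate pair, and each 16-bit unit is then written in the chosen byte order. The helper names `utf16Units`, `utf16Bytes`, and `hexBytes` are assumptions made for this example.

```javascript
// Hypothetical helper: UTF-16 code units for one code point.
function utf16Units(cp) {
  if (cp < 0x10000) return [cp];        // BMP: a single 16-bit unit
  var v = cp - 0x10000;                 // 20 bits remain
  return [0xd800 + (v >> 10),           // high surrogate: top 10 bits
          0xdc00 + (v & 0x3ff)];        // low surrogate: bottom 10 bits
}

// Hypothetical helper: serialize the units as bytes in LE or BE order.
function utf16Bytes(cp, littleEndian) {
  var bytes = [];
  utf16Units(cp).forEach(function (unit) {
    var hi = unit >> 8;
    var lo = unit & 0xff;
    if (littleEndian) bytes.push(lo, hi);
    else bytes.push(hi, lo);
  });
  return bytes;
}

// Hypothetical helper: format bytes as space-separated uppercase hex pairs.
function hexBytes(bytes) {
  return bytes.map(function (b) {
    return b.toString(16).toUpperCase().padStart(2, "0");
  }).join(" ");
}

console.log(hexBytes(utf16Bytes(0x6c49, true)));   // 49 6C
console.log(hexBytes(utf16Bytes(0x20c30, true)));  // 43 D8 30 DC
console.log(hexBytes(utf16Bytes(0x20c30, false))); // D8 43 DC 30
```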

We recommend using a BOM (Byte Order Mark) to indicate the byte order. That is, before a byte stream is transmitted, the BOM character "ZERO WIDTH NO-BREAK SPACE" is sent first. Its code is FEFF. The byte-swapped values FFFE (UTF-16) and FFFE0000 (UTF-32) are not assigned characters in Unicode and should not appear in actual transmission, so a receiver that reads one of them knows the byte order is reversed.

The following table lists the BOM of various UTF codes:

UTF Encoding	Byte Order Mark (BOM)
UTF-8 without BOM	None
UTF-8 with BOM	EF BB BF
UTF-16LE	FF FE
UTF-16BE	FE FF
UTF-32LE	FF FE 00 00
UTF-32BE	00 00 FE FF
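As an illustration of how the table can be used, the sketch below guesses an encoding from the first bytes of a stream. `detectBom` is a hypothetical name, and the input is assumed to be a plain array of byte values.

```javascript
// Hypothetical helper: return the encoding named by a leading BOM, or null.
function detectBom(bytes) {
  function startsWith(signature) {
    return signature.every(function (b, i) { return bytes[i] === b; });
  }
  // UTF-32LE is tested before UTF-16LE because both begin with FF FE.
  if (startsWith([0xff, 0xfe, 0x00, 0x00])) return "UTF-32LE";
  if (startsWith([0x00, 0x00, 0xfe, 0xff])) return "UTF-32BE";
  if (startsWith([0xef, 0xbb, 0xbf])) return "UTF-8";
  if (startsWith([0xff, 0xfe])) return "UTF-16LE";
  if (startsWith([0xfe, 0xff])) return "UTF-16BE";
  return null; // no BOM found
}

console.log(detectBom([0xef, 0xbb, 0xbf, 0x41])); // UTF-8
console.log(detectBom([0xff, 0xfe, 0x49, 0x6c])); // UTF-16LE
console.log(detectBom([0x41, 0x42]));             // null
```

Note that FF FE 00 00 is ambiguous in principle: it could also be a UTF-16LE BOM followed by a NUL character. This sketch simply prefers the UTF-32LE reading.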

VIII. Discussion

Why does a Chinese character take 3 bytes?

4E00-9FBF: CJK Unified Ideographs

Code points in the range 000800-00FFFF are encoded in UTF-8 as 1110xxxx 10xxxxxx 10xxxxxx, which is three bytes.

English is represented in ASCII, and the ASCII encoding is exactly the same as the UTF-8 encoding. Its range is 0x00-0x7F.

Code points in the range 000000-00007F are encoded in UTF-8 as 0xxxxxxx, which is one byte.

How to pick a random Chinese character?

-> 0x4e00.toString(10)    "19968"
-> 0x9FBF.toString(10)    "40895"
-> 40895 - 19968    20927
-> String.fromCharCode(19968 + Math.floor(Math.random() * (20927 + 1)))

Does UTF-8 need a BOM header?

UTF-8 has the same byte order on all systems, so it does not actually need a BOM. In PHP, moreover, nothing may be output before a session is created, and a BOM at the start of a file counts as output. Therefore, the BOM must be removed from PHP files encoded in UTF-8.

Is Unicode in HTML the same as in JavaScript?

In HTML, "&#" + the Unicode number + ";" gives the corresponding character. The number is decimal, or hexadecimal when prefixed with x.

document.write("&#x4e01;") => 丁

document.write("&#19969;") => 丁
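Going the other way, from JavaScript text to HTML numeric character references, can be sketched as follows. The function name `toHtmlEntities` is made up for this example. It iterates by code point, so characters outside the BMP are handled as well.

```javascript
// Hypothetical helper: turn every character into a hexadecimal
// numeric character reference such as &#x4e01;
function toHtmlEntities(text) {
  return Array.from(text).map(function (ch) {
    return "&#x" + ch.codePointAt(0).toString(16) + ";";
  }).join("");
}

console.log(toHtmlEntities("\u4e01"));       // &#x4e01;
console.log(toHtmlEntities("\uD843\uDC30")); // &#x20c30;
```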

How to find Chinese characters?

-> "么么哒, memeda".match(/[\u4e00-\u9FBF]/img)    ["么", "么", "哒"]

Length?

-> "么".length    1
-> "\u4e01".length    1
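A length of 1 holds for characters in the BMP. JavaScript strings count UTF-16 code units, so a character outside the BMP, such as the 0x20C30 example used earlier, has a length of 2. The short sketch below shows the difference; the variable names are arbitrary.

```javascript
var bmpChar = "\u4e01";          // U+4E01, inside the BMP
var astralChar = "\uD843\uDC30"; // U+20C30, written as a surrogate pair

console.log(bmpChar.length);                          // 1
console.log(astralChar.length);                       // 2 (two UTF-16 code units)
console.log(Array.from(astralChar).length);           // 1 (iteration is by code point)
console.log(astralChar.codePointAt(0).toString(16));  // 20c30
console.log(astralChar.charCodeAt(0).toString(16));   // d843 (only the high surrogate)
```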

How can I get the Unicode encoding of a Chinese character?

-> "么".charCodeAt(0).toString(16)    "4e48"
-> var a = "么么哒, meme"
-> a.replace(/[\u4e00-\u9fbf]/img, function ($) { return "\\u" + $.charCodeAt(0).toString(16); })    "\u4e48\u4e48\u54d2, meme"
-> parseInt(encodeURI("丁").split("%").slice(1).map(function (v) { return parseInt(v, 16).toString(2).replace(/^1*0/, ""); }).join(""), 2).toString(16)    "4e01"

The last expression works as follows: encodeURI("丁") => "%E4%B8%81" => ["E4", "B8", "81"] => ["0100", "111000", "000001"] => "0100111000000001" => 19969 => "4e01"
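The same decoding idea, stripping the template bits from each UTF-8 byte and joining the remaining x bits, can also be written with bit operations instead of string manipulation. This is a sketch under stated assumptions: `codePointFromUtf8` and `percentBytes` are hypothetical names, the input is assumed to be one well-formed UTF-8 sequence, and `percentBytes` only works for non-ASCII characters because ASCII letters are not percent-encoded.

```javascript
// Hypothetical helper: rebuild a code point from the UTF-8 bytes of one character.
function codePointFromUtf8(bytes) {
  var first = bytes[0];
  if (first < 0x80) return first; // 0xxxxxxx: the byte is the code point
  // The lead byte tells how long the sequence is.
  var length = first >= 0xf0 ? 4 : first >= 0xe0 ? 3 : 2;
  var cp = first & (0xff >> (length + 1)); // keep only the x bits of the lead byte
  for (var i = 1; i < length; i++) {
    cp = (cp << 6) | (bytes[i] & 0x3f);    // append 6 x bits per continuation byte
  }
  return cp;
}

// Hypothetical helper: get the UTF-8 bytes of a non-ASCII character via percent-encoding.
function percentBytes(ch) {
  return encodeURIComponent(ch).split("%").slice(1).map(function (h) {
    return parseInt(h, 16);
  });
}

console.log(codePointFromUtf8(percentBytes("\u4e01")).toString(16));       // 4e01
console.log(codePointFromUtf8(percentBytes("\uD843\uDC30")).toString(16)); // 20c30
```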


