Unicode in JavaScript
By Jinya
[For more information, see http://blog.csdn.net/ei1_nino]
Glossary:
BMP: Basic Multilingual Plane, also called "plane 0"
UCS: Universal Character Set
ISO: International Organization for Standardization
UTF: UCS Transformation Format
BOM: Byte Order Mark
CJK: Chinese, Japanese, and Korean, as in CJK Unified Ideographs
BE: Big Endian
LE: Little Endian
I. Introduction
Unicode (also rendered as "unified code", "universal code", or "single code") is a character encoding standard used on computers. It was created to overcome the limitations of traditional character encoding schemes: it assigns a uniform, unique binary code to every character of every language, so that text can be converted and processed across languages and platforms. Development started in December 1990, and the standard was officially announced in December 1994.
II. UCS
The Universal Character Set is the standard character set defined by ISO 10646 (also written ISO/IEC 10646). UCS-2 encodes each character in two bytes, and UCS-4 encodes each character in four bytes.
UCS-4 is divided into 2^7 = 128 groups according to its highest byte, whose top bit is always 0. Each group is divided into 256 planes according to the second-highest byte. Each plane is divided into 256 rows according to the third byte, and each row has 256 cells. Plane 0 of group 0 is called the BMP (Basic Multilingual Plane). If the first two bytes of a UCS-4 code are both zero, the code lies in the BMP, and dropping those two bytes gives the UCS-2 code.
III. Unicode
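The group/plane/row/cell layout can be sketched with a few bit operations. This is a minimal illustration only; the function names `splitUcs4` and `isBmp` and the field names are made up for this example and are not part of any standard API.

```javascript
// Split a UCS-4 code into the four octets described above.
// Names are hypothetical, chosen only for this sketch.
function splitUcs4(code) {
  return {
    group: (code >>> 24) & 0x7f, // highest byte, top bit is 0, so 128 groups
    plane: (code >>> 16) & 0xff, // second-highest byte, 256 planes
    row: (code >>> 8) & 0xff,    // third byte, 256 rows
    cell: code & 0xff,           // lowest byte, 256 cells
  };
}

// A code lies in the BMP when both the group and the plane are 0.
function isBmp(code) {
  const { group, plane } = splitUcs4(code);
  return group === 0 && plane === 0;
}

console.log(splitUcs4(0x6c49)); // { group: 0, plane: 0, row: 108, cell: 73 }
console.log(isBmp(0x6c49));     // true
console.log(isBmp(0x20c30));    // false (plane 2)
```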
The Unicode standard aims to include all the Chinese characters of the Kangxi Dictionary in its 32-bit encoding.
Unicode is an extension of the ASCII character set. Strict ASCII represents each character in 7 bits, and the character sets commonly used on computers are 8 bits wide, whereas Unicode was originally designed as a full 16-bit character set (it has since grown to cover code points up to 0x10FFFF). This lets Unicode represent the letters, ideographs, and other symbols of every written language in the world that might be used in computer communication. Unicode was first intended as a supplement to ASCII and, if possible, its eventual replacement. Given that ASCII is the most dominant standard in computing, that is an ambitious goal.
Unicode affects every part of the computer industry, but its greatest impact is probably on operating systems and programming languages. On that front, progress is under way. Windows NT supports Unicode from the ground up (unfortunately, Windows 98 supports only a small part of it). The C programming language, as standardized by ANSI, supports Unicode through its support for wide character sets.
IV. UTF-8
The bytes FF and FE never appear in UTF-8 encoded text, so they can be used to signal that a text is UTF-16 or UTF-32 (see the BOM section). UTF-8 itself is independent of byte order.
UTF-8 encodes Unicode code points as sequences of bytes. The mapping from Unicode to UTF-8 is as follows:
Unicode encoding (hexadecimal) | UTF-8 byte stream (binary)
000000-00007F | 0xxxxxxx (7 x)
000080-0007FF | 110xxxxx 10xxxxxx (11 x)
000800-00FFFF | 1110xxxx 10xxxxxx 10xxxxxx (16 x)
010000-10FFFF | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 x)
UTF-8 uses encodings of different lengths for characters in different ranges. For characters between 0x00 and 0x7F, which lie in plane 0 (the BMP), the UTF-8 encoding is exactly the same as the ASCII encoding.
-> "\x32"
"2"
-> "\u0032"
"2"
The maximum length of a UTF-8 encoding is 4 bytes. The table above shows that the 4-byte template has 21 x positions, so it can hold a 21-bit binary number.
The largest Unicode code point is 0x10FFFF, which needs only 21 bits.
Example 1: The Unicode code of the Chinese character 汉 is 0x6C49. 0x6C49 lies between 0x0800 and 0xFFFF, so it uses the 3-byte template: 1110xxxx 10xxxxxx 10xxxxxx.
Write 0x6C49 in binary: 0110 1100 0100 1001. Replace the x positions in the template with this bit stream in order, and you get 11100110 10110001 10001001, that is, E6 B1 89.
-> encodeURI("汉")
"%E6%B1%89"
Example 2: The Unicode code 0x20C30 lies between 0x10000 and 0x10FFFF, so it uses the 4-byte template: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Write 0x20C30 as a 21-bit binary number (padding with leading zeros up to 21 bits): 0 0010 0000 1100 0011 0000. Replace the x positions in the template with this bit stream in order: 11110000 10100000 10110000 10110000, that is, F0 A0 B0 B0.
V. The love between JavaScript and Unicode
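The template substitution used in the two examples can be sketched as a small function. This is an illustrative hand-rolled encoder, not how any particular engine implements it; `utf8Bytes` and `hex` are hypothetical helper names, and the sketch does not reject surrogate code points (0xD800-0xDFFF) as a strict encoder would.

```javascript
// Encode one Unicode code point into UTF-8 bytes by filling the
// templates from the table above. Hypothetical helper for illustration.
function utf8Bytes(cp) {
  if (cp < 0 || cp > 0x10ffff) throw new RangeError("code point out of range");
  if (cp <= 0x7f) return [cp];                            // 0xxxxxxx
  if (cp <= 0x7ff) {
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)];        // 110xxxxx 10xxxxxx
  }
  if (cp <= 0xffff) {
    return [
      0xe0 | (cp >> 12),                                  // 1110xxxx
      0x80 | ((cp >> 6) & 0x3f),                          // 10xxxxxx
      0x80 | (cp & 0x3f),                                 // 10xxxxxx
    ];
  }
  return [
    0xf0 | (cp >> 18),                                    // 11110xxx
    0x80 | ((cp >> 12) & 0x3f),                           // 10xxxxxx
    0x80 | ((cp >> 6) & 0x3f),                            // 10xxxxxx
    0x80 | (cp & 0x3f),                                   // 10xxxxxx
  ];
}

// Format bytes as upper-case hex pairs, e.g. "E6 B1 89".
const hex = (bytes) =>
  bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(hex(utf8Bytes(0x6c49)));  // E6 B1 89
console.log(hex(utf8Bytes(0x20c30))); // F0 A0 B0 B0
```

Both results match the byte sequences worked out by hand in Example 1 and Example 2.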
The String.fromCharCode method converts a character code, written in hexadecimal or decimal, to a string.
In a string literal, \u followed by four hexadecimal digits produces the corresponding character.
-> String.fromCharCode("0x4e01")
"丁"
-> 0x4e01.toString(10)
"19969"
-> String.fromCharCode(19969)
"丁"
-> "0x4e01".toString()
"0x4e01"
-> "\u4e01".toString()
"丁"
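One caveat worth illustrating: String.fromCharCode works on 16-bit code units, so it only covers the BMP directly. The sketch below, which uses the standard String.fromCodePoint (ES2015) and the code point 0x20C30 from the UTF-8 examples, shows the difference. The variable names are arbitrary.

```javascript
// U+4E01 is in the BMP, so one 16-bit code unit is enough.
const bmp = String.fromCharCode(0x4e01);

// fromCharCode keeps only the low 16 bits, so 0x20C30 becomes 0x0C30.
const truncated = String.fromCharCode(0x20c30);

// fromCodePoint handles code points above 0xFFFF by emitting a surrogate pair.
const full = String.fromCodePoint(0x20c30);

console.log(bmp === "\u4e01");                     // true
console.log(truncated.charCodeAt(0).toString(16)); // "c30"
console.log(full.length);                          // 2
console.log(full === "\u{20c30}");                 // true
```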
VI. Output
Output "\u4e01"
-> eval('"\u4e01"')
"丁"
(the outer literal already holds "丁", so eval sees "丁")
-> eval('"\\u4e01"')
"丁"
(the outer literal holds "\u4e01", which eval parses as 丁)
-> eval('"\\\u4e01"')
"丁"
(the outer literal holds "\丁", and eval treats \丁 as plain 丁)
-> eval('"\\\\u4e01"')
"\u4e01"
(the outer literal holds "\\u4e01", which eval parses as a backslash followed by u4e01)
-> '\u4e01'
"丁"
-> '\\u4e01'
"\u4e01"
-> '\\\u4e01'
"\丁"
-> '\\\\u4e01'
"\\u4e01"
Output "\丁"
-> "\丁"
"丁"
-> "\\丁"
"\丁"
-> "\\\丁"
"\丁"
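The effect of each extra backslash is easier to see by checking string lengths. The sketch below uses U+4E01 and, as an assumption for illustration, substitutes JSON.parse for eval as the second parsing step; JSON only accepts a subset of JavaScript escapes, so it is used here only for the valid \uXXXX case.

```javascript
// One to four backslashes in front of u4e01, as written in source code.
const levels = ["\u4e01", "\\u4e01", "\\\u4e01", "\\\\u4e01"];

// Lengths reveal what each literal really holds:
// 1: the character itself
// 6: a backslash followed by the five characters u4e01
// 2: a backslash followed by the character
// 7: two backslashes followed by u4e01
console.log(levels.map((s) => s.length)); // [ 1, 6, 2, 7 ]

// A second parsing step turns the 6-character text back into one character.
const reparsed = JSON.parse('"' + levels[1] + '"');
console.log(reparsed === "\u4e01"); // true
```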
VII. BOM
There are two byte orders: "Big Endian" (BE) and "Little Endian" (LE).
Depending on the byte order, UTF-16 can be serialized as UTF-16LE or UTF-16BE, and UTF-32 as UTF-32LE or UTF-32BE. For example:
Unicode encoding | UTF-16LE | UTF-16BE | UTF-32LE | UTF-32BE
0x006C49 | 49 6C | 6C 49 | 49 6C 00 00 | 00 00 6C 49
0x020C30 | 43 D8 30 DC | D8 43 DC 30 | 30 0C 02 00 | 00 02 0C 30
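The rows of this table can be reproduced with a short sketch built on the standard DataView API, whose setUint16 and setUint32 methods take a little-endian flag. The helper names `utf16Units`, `utf16Bytes`, `utf32Bytes`, and `toHex` are hypothetical and exist only for this example.

```javascript
// Split a code point into UTF-16 code units (a surrogate pair above 0xFFFF).
function utf16Units(cp) {
  if (cp <= 0xffff) return [cp];
  const v = cp - 0x10000;
  return [0xd800 + (v >> 10), 0xdc00 + (v & 0x3ff)];
}

// Serialize the code units in the requested byte order.
function utf16Bytes(cp, littleEndian) {
  const units = utf16Units(cp);
  const view = new DataView(new ArrayBuffer(units.length * 2));
  units.forEach((u, i) => view.setUint16(i * 2, u, littleEndian));
  return Array.from(new Uint8Array(view.buffer));
}

// UTF-32 stores the code point itself in four bytes.
function utf32Bytes(cp, littleEndian) {
  const view = new DataView(new ArrayBuffer(4));
  view.setUint32(0, cp, littleEndian);
  return Array.from(new Uint8Array(view.buffer));
}

const toHex = (bytes) =>
  bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");

console.log(toHex(utf16Bytes(0x20c30, true)));  // 43 D8 30 DC
console.log(toHex(utf16Bytes(0x20c30, false))); // D8 43 DC 30
console.log(toHex(utf32Bytes(0x20c30, true)));  // 30 0C 02 00
console.log(toHex(utf32Bytes(0x20c30, false))); // 00 02 0C 30
```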
We recommend using a BOM (Byte Order Mark) to indicate the byte order. Before the byte stream itself, the sender transmits the character ZERO WIDTH NO-BREAK SPACE, whose code is FEFF. Its byte-swapped forms, FFFE (UTF-16) and FFFE0000 (UTF-32), are not defined characters in Unicode and should not appear in actual transmission, so the receiver can tell the byte order from which form arrives.
The following table lists the BOM of various UTF codes:
UTF encoding | Byte Order Mark (BOM)
UTF-8 without BOM | None
UTF-8 with BOM | EF BB BF
UTF-16LE | FF FE
UTF-16BE | FE FF
UTF-32LE | FF FE 00 00
UTF-32BE | 00 00 FE FF
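A BOM check based on this table might look like the sketch below. `detectBom` is a hypothetical helper, not a built-in. One design choice matters: the UTF-32LE signature FF FE 00 00 begins with the UTF-16LE signature FF FE, so the longer signature has to be tested first. The sketch assumes that four such bytes mean UTF-32LE, although they could also be a UTF-16LE BOM followed by U+0000.

```javascript
// Return the encoding named by a leading BOM, or null if none is found.
// Hypothetical helper; accepts an array or Uint8Array of bytes.
function detectBom(bytes) {
  const startsWith = (sig) => sig.every((b, i) => bytes[i] === b);
  if (startsWith([0xef, 0xbb, 0xbf])) return "UTF-8";
  // Test the 4-byte signatures before the 2-byte ones they overlap with.
  if (startsWith([0xff, 0xfe, 0x00, 0x00])) return "UTF-32LE";
  if (startsWith([0x00, 0x00, 0xfe, 0xff])) return "UTF-32BE";
  if (startsWith([0xff, 0xfe])) return "UTF-16LE";
  if (startsWith([0xfe, 0xff])) return "UTF-16BE";
  return null;
}

console.log(detectBom([0xef, 0xbb, 0xbf, 0xe6, 0xb1, 0x89])); // "UTF-8"
console.log(detectBom([0xff, 0xfe, 0x49, 0x6c]));             // "UTF-16LE"
console.log(detectBom([0xe6, 0xb1, 0x89]));                   // null
```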
VIII. Discussion
Why does Chinese take 3 bytes?
4E00-9FBF: CJK Unified Ideographs
Code points in the range 000800-00FFFF are encoded in UTF-8 as 1110xxxx 10xxxxxx 10xxxxxx, which is three bytes.
English text is covered by ASCII, and the ASCII encoding is exactly the same as the UTF-8 encoding for that range, 0x00-0x7F.
Code points in the range 000000-00007F are encoded in UTF-8 as 0xxxxxxx, which is one byte.
How to pick a random Chinese character?
-> 0x4e00.toString(10)
"19968"
-> 0x9FBF.toString(10)
"40895"
-> 40895 - 19968
20927
-> String.fromCharCode(19968 + Math.floor(Math.random() * (20927 + 1)))
Do you need a BOM header?
UTF-8 has the same byte order on every system, so it does not actually need a BOM. In PHP, however, no output is allowed before a session is created, and a BOM at the start of a file counts as output. The BOM must therefore be removed from PHP files encoded in UTF-8.
Is Unicode in HTML the same as in JavaScript?
In HTML, "&#" followed by the Unicode number and ";" gives the corresponding character. Prefix the number with x to write it in hexadecimal.
document.write("&#x4e01;") => 丁
document.write("&#19968;") => 一
How to find Chinese characters?
-> "么么哒, meme".match(/[\u4e00-\u9FBF]/img)
["么", "么", "哒"]
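The range-based pattern above only covers the BMP block 4E00-9FBF. As a possible alternative, and not something the original post uses, engines that support ES2018 Unicode property escapes can match Han characters by script with the u flag, which also reaches characters outside the BMP such as U+20C30.

```javascript
// The range-based pattern from the text, without the redundant i and m flags.
const byRange = /[\u4e00-\u9FBF]/g;

// Script-based pattern; needs the u flag and ES2018 property escapes.
const byScript = /\p{Script=Han}/gu;

const sample = "\u4e48\u4e48\u54d2, meme"; // the three Han characters, then Latin text
console.log(sample.match(byRange).length);  // 3
console.log(sample.match(byScript).length); // 3

// A character from plane 2 is missed by the range but found by the script test.
const supplementary = "\u{20c30}";
console.log(supplementary.match(byRange));         // null
console.log(supplementary.match(byScript).length); // 1
```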
Length?
-> "么".length
1
-> "\u4e01".length
1
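The length of 1 holds for BMP characters only, because length counts UTF-16 code units rather than characters. The sketch below uses the code point 0x20C30 from the UTF-8 and UTF-16 examples to show the difference, relying on the standard string iterator and codePointAt (both ES2015).

```javascript
const s = "\u{20c30}"; // one character outside the BMP

console.log(s.length);                      // 2, the two surrogate code units
console.log([...s].length);                 // 1, the iterator walks code points
console.log(s.charCodeAt(0).toString(16));  // "d843", the high surrogate
console.log(s.charCodeAt(1).toString(16));  // "dc30", the low surrogate
console.log(s.codePointAt(0).toString(16)); // "20c30", the full code point
```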
How can I get the Unicode encoding of a Chinese character?
-> "么".charCodeAt(0).toString(16)
"4e48"
-> var a = "么么哒, meme"
-> a.replace(/[\u4e00-\u9fbf]/img, function ($) { return "\\u" + $.charCodeAt(0).toString(16); })
"\u4e48\u4e48\u54d2, meme"
-> parseInt(encodeURI("丁").split("%").slice(1).map(function (v) { return parseInt(v, 16).toString(2).replace(/^1*0/, ""); }).join(""), 2).toString(16)
"4e01"
The last expression recovers the code from the UTF-8 bytes by stripping each byte's template prefix and joining the remaining bits:
encodeURI("丁") => "%E4%B8%81" => ["E4", "B8", "81"] => ["0100", "111000", "000001"] => "0100111000000001" => 19969 => "4e01"
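A more general variant of the replace call can be sketched as below. `toUnicodeEscapes` is a hypothetical helper; unlike the pattern in the text, it escapes every non-ASCII code unit rather than only the 4E00-9FBF range, and it pads to four hexadecimal digits so the result stays a valid \uXXXX escape for small values.

```javascript
// Replace every non-ASCII UTF-16 code unit with a \uXXXX escape.
// Hypothetical helper for illustration only.
function toUnicodeEscapes(text) {
  let out = "";
  for (let i = 0; i < text.length; i++) {
    const unit = text.charCodeAt(i);
    out += unit > 0x7f
      ? "\\u" + unit.toString(16).padStart(4, "0")
      : text[i];
  }
  return out;
}

console.log(toUnicodeEscapes("\u4e48\u4e48\u54d2, meme")); // \u4e48\u4e48\u54d2, meme
console.log(toUnicodeEscapes("\u{20c30}"));                // \ud843\udc30
console.log(toUnicodeEscapes("\u00e9"));                   // \u00e9
```

Characters outside the BMP come out as two escapes, one per surrogate, which JavaScript reads back as the original character.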
References:
Http://www.cnblogs.com/ecalf/archive/2012/09/04/unicode.html
Http://baike.baidu.com/link? Url = Response
Https://github.com/chenjinya/matrix