Unicode in JavaScript
by Jinya
"Reprint please indicate the source, Http://blog.csdn.net/EI__Nino"
Noun Explanation:
BMP: (basicmultilingual Plane) It is also referred to as "0th plane", Plane 0
UCS: Universal Character Set (Universal Character set, UCS)
ISO: International Organization for Standardization (ISO)
Utf:ucs Transformation Format,
Bom:byte Order Mark byte order
CJK: Unified Ideographic Symbol (CJK Unified ideographs)
Be:big Endian Big-endian
Le:little Endian Small End
First, Introduction
Unicode (Uniform Code, universal Code, single code) is a character encoding used on a computer. Unicode is created to address the limitations of traditional character encoding schemes, which set a uniform and unique binary encoding for each character in each language to meet the requirements of cross-language, cross-platform text conversion and processing. Research and development began in 1990, officially announced in 1994.
Second, UCS
The universal Character set (Universalcharacter set, UCS) is a standard character set defined by ISO 10646 (or ISO/IEC 10646) standards. The UCS-2 is encoded in two bytes, and the UCS-4 is encoded in 4 bytes.
The UCS-4 is divided into 27 = 128 groups According to the highest byte maximum of 0. Each group is then divided into 256 planes (plane) based on the sub-high byte. Each plane is divided into 256 rows (row) According to the 3rd byte, with 256 code bits (cells) per line. The plane 0 of group 0 is called BMP (Basic multilingualplane). If the first two bytes of the UCS-4 are all zeros, then the UCS-4 bmp is removed by removing the previous two 0 bytes UCS-2.
Third, Unicode
The Unicode standard prepares all characters of the Kangxi Dictionary into Unicode 32bit encoding.
Unicode extended from ASCII character set. In strict ASCII, each character is represented by a 7-bit element, or a 8-bit width is commonly used on a computer, whereas Unicode uses a full 16-bit character set. This enables Unicode to represent characters, hieroglyphs, and other symbols that may be used in computer communication in all the written languages of the world. Unicode was originally intended as a complement to ASCII and, if possible, would eventually replace it. Given that ASCII is the most dominant criterion in a computer, it is a very high goal indeed.
Unicode affects every part of the computer industry, but it may have the greatest impact on operating systems and programming languages. From this point of view, we are already on the road. Windows NT supports Unicode from the bottom level (unfortunately, Windows 98 is only a small subset of Unicode support). The C programming language, which is inherently ANSI-bound, supports Unicode through support for wide-character sets.
Iv. UTF-8
Bytes ff and Fe never appear in UTF-8 encoding, so they can be used to indicate that UTF-16 or UTF-32 text (see BOM) UTF-8 is byte-order independent.
UTF-8 encodes Unicode in bytes. The encoding from Unicode to UTF-8 is as follows:
Unicode encoding (hexadecimal) |
UTF-8 byte stream (binary) |
000000-00007f |
0xxxxxxx (7x) |
000080-0007ff |
110xxxxx 10xxxxxx (11x) |
000800-00ffff |
1110xxxx 10xxxxxx 10xxxxxx (16x) |
010000-10ffff |
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21x) |
UTF-8 is characterized by the use of different lengths of encoding for different ranges of characters. For The characters between 0x00-0x7f, in the 0 -number plane,BMP,UTF-8 encoded with ASCII The encoding is exactly the same.
"\x32" 2, " \u0032"
The maximum length of a UTF-8 encoding is 4 bytes. As can be seen from the table above, the 4-byte template has 21 X, which can hold 21-bit binary digits.
The Unicode maximum code bit 0X10FFFF is also only 21 bits.
Example 1: The Unicode encoding of the word "Han" is 0x6c49. 0x6c49 between 0X0800-0XFFFF, using a 3-byte template: 1110xxxx 10xxxxxx 10xxxxxx.
The 0x6c49 is written in binary is:0110 1100 1001, with this bit stream in turn instead of the template x, get: 11100110 Ten110001 001001, namely E6 B1 89.
encodeURI ("Han") "%e6%b1%89"
Example 2:unicode encoding 0x20c30 between 0X010000-0X10FFFF, using a 4-byte template: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Write the 0X20C30 as a 21-bit binary number (less than 21 bits on the front 0): 0 0010 0000 1100 0011 0000, using this bitstream in turn instead of the X in the template, get: 11110000 10100000 10110000 10110000, i.e. F0 A0 B0 B0.
V. The love between Javascript and Unicode
You can use the String.fromCharCode method to convert any number of numbers into a string
\u plus a hexadecimal number can be converted to a string
String.fromCharCode ("0x4e01") "Ding", 0x4e01.tostring (Ten) "19969", String.fromCharCode (19969) " Ding", "0x4e01". ToString () "0x4e01", "\u4e01". ToString () " Ding", "\u4e01". ToString ()
Vi.. Output
Output "\U4E01"
->eval (' "\u4e01" ') "ding" = = "\u4e01"-"Ding"->eval (' "Ding" ') ->eval (' "\\u4e01" ') "ding" = = "\\u4e01"- > "\u4e01"->eval (' \u4e01 '), eval (' "\\\u4e01" ') "ding" = = "\\\u4e01", "\ Ding"->eval (' "\ Ding")- > eval (' "\\\\u4e01" ') "\u4e01" = = "\\\\u4e01", "\\u4e01"->eval (' \\u4e01 '), ' \u4e01 ' "ding "\\u4e01", " \u4e01", "\\\u4e01" "\ Ding", "\\\\u4e01" "\\U4E01"
Output "\ Ding"
, "\ Ding" "ding", "\ Ding" "\ Ding", "\\\ Ding"
Vii. BOM
There are two kinds of byte order, namely "Big Endian, be" and "Small End" (Little Endian, LE).
Depending on the byte order, the UTF-16 can be implemented as Utf-16le or utf-16be,utf-32 can be implemented as Utf-32le or UTF-32BE. For example:
Unicode Coding |
Utf-16le |
Utf-16be |
Utf32-le |
utf32-be |
0x006c49 |
49 6 C |
6c |
49 6C xx |
00 6C |
0x020c30 |
Approx. D8 DC |
D8 DC 30 |
0C 02 00 |
0C 30 |
The Unicode standard recommends using the BOM (byte order Mark) to differentiate the byte order, that is, the character "0 wide non-disruptive space" that is used as the BOM before transmitting the stream of bytes. The encoding of this character is Feff, and the reverse Fffe (UTF-16) and FFFE0000 (UTF-32) are undefined code bits in Unicode and should not appear in the actual transmission.
The following table is a variety of UTF-encoded BOMs:
UTF Coding |
Byte Order Mark (BOM) |
UTF-8 without BOM |
No |
UTF-8 with BOM |
EF BB BF |
Utf-16le |
FF FE |
Utf-16be |
FE FF |
Utf-32le |
FF FE 00 00 |
Utf-32be |
XX-FE FF |
Viii. Discussion
Why does Chinese account for 3 bytes?
Chinese range 4E00-9FBF:CJK Unified ideographic symbol (CJK Unified ideographs)
The UTF-8 binary representation within the Unicode encoding 000800-00ffff is: 1110xxxx 10xxxxxx 10xxxxxx.
English is represented in ASCII, whereas ASCII-encoded representations are identical to UTF-8 encodings. Their range is between the 0x00-0x7f.
The UTF-8 binary representation in Unicode encoding 000000-00007f is: 0xxxxxxx.
Get Chinese randomly?
0x4e00.tostring () 19968-> 0x9fbf.tostring (Ten) 40895-> 40895-19968 20927 String.fromCharCode (19968+math.round (Math.random () *20927)
Do you need BOM header?
Its byte order is the same in all systems, so it doesn't actually need a BOM. However, PHP, the session before the creation of the need for no output, so the general Utf-8 encoded PHP files are to remove the BOM header
are Unicode and JavaScript representations in HTML the same way?
"" "+ Unicode number to get corresponding characters
document.write ("& #x4e01;") +-Ding
document.write ("& #19968;") +-Ding
How do I find Chinese?
"Da, Memda". Match (/[\U4E00-\U9FBF]/IMG) ["", "", "" da "]
Length?
"" ". Length 1--" \u4e01 ". length
How to obtain Unicode encoding from kanji?
. charCodeAt (0). toString (+) "4e48", var a = "Da, meme", A.replace (/[\u4e00-\u9fbf]/img, Function ($) {return "\\u" +$.charcodeat (0). toString (+);}) " \u4e48\u4e48\u54d2,meme "- parseint (encodeURI (" D "). Split ("% "). Slice (1). Map (function (v) {return parseint (V , (+). ToString (2). Replace (/^1*0/, ""); }). Join (""), 2). toString (16)
Reference:
Http://www.cnblogs.com/ecalf/archive/2012/09/04/unicode.html
Http://baike.baidu.com/link?url=4mS6twL-TXFMhtFZ8tnP9luwk-HoNQKf7sGA8KiIKI-K0Pkd0K3iRtbe0scF0BP4QGFp2b4EqqzPrJU5R24e1a
Https://github.com/chenjinya/matrix
Unicode in JavaScript