Unicode in JavaScript


by Jinya

"When reprinting, please credit the source: http://blog.csdn.net/EI__Nino"

Glossary:

BMP: Basic Multilingual Plane, also called "plane 0"

UCS: Universal Character Set

ISO: International Organization for Standardization

UTF: UCS Transformation Format

BOM: Byte Order Mark

CJK: Chinese, Japanese, Korean (as in CJK Unified Ideographs)

BE: big-endian

LE: little-endian

I. Introduction

Unicode (also rendered as Uniform Code, Universal Code, or Single Code) is a character encoding standard used on computers. It was created to overcome the limitations of traditional character encoding schemes: it assigns a uniform, unique code to every character of every language, so that text can be converted and processed across languages and platforms. Work on it began in the late 1980s, and version 1.0 of the standard was published in 1991.

II. UCS

The Universal Character Set (UCS) is the standard character set defined by ISO 10646 (ISO/IEC 10646). UCS-2 encodes each character in two bytes, and UCS-4 encodes each character in four bytes.

UCS-4 is divided into 2^7 = 128 groups according to the highest byte, whose most significant bit is always 0. Each group is divided into 256 planes according to the second-highest byte. Each plane is divided into 256 rows according to the third byte, and each row has 256 cells (code positions). Plane 0 of group 0 is called the BMP (Basic Multilingual Plane). If the first two bytes of a UCS-4 code are both zero, the code lies in the BMP, and dropping those two zero bytes gives the UCS-2 code.
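The group/plane/row/cell split described above can be sketched with a few shifts and masks. This is only an illustration: the helper names `ucs4Parts` and `isBmp` are made up here and are not part of any standard API.

```javascript
// Hypothetical helper: split a UCS-4 value into the four fields described above.
function ucs4Parts(codePoint) {
  return {
    group: (codePoint >>> 24) & 0x7f, // highest byte, top bit always 0
    plane: (codePoint >>> 16) & 0xff, // second-highest byte
    row: (codePoint >>> 8) & 0xff,    // third byte
    cell: codePoint & 0xff,           // lowest byte
  };
}

// A code lies in the BMP when its group and plane are both 0.
function isBmp(codePoint) {
  const parts = ucs4Parts(codePoint);
  return parts.group === 0 && parts.plane === 0;
}

console.log(ucs4Parts(0x6c49)); // { group: 0, plane: 0, row: 108, cell: 73 }
console.log(isBmp(0x20c30));    // false (plane 2)
```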

III. Unicode

The Unicode standard includes the characters of the Kangxi Dictionary in its code space.

Unicode grew out of the ASCII character set. In strict ASCII each character is represented by 7 bits (computers commonly store it in 8 bits), whereas Unicode was originally designed as a full 16-bit character set; its code space has since been extended and now runs up to 0x10FFFF. This lets Unicode represent the letters, ideographs, and other symbols of all the world's written languages that may be used in computer communication. Unicode was originally intended as a supplement to ASCII and, if possible, as its eventual replacement. Given that ASCII is the most dominant standard in computing, that is a very ambitious goal.

Unicode affects every part of the computer industry, but its greatest impact is probably on operating systems and programming languages. In this respect we are already on the way. Windows NT supports Unicode from the ground up (unfortunately, Windows 98 supports only a small part of Unicode). The C programming language, which is inherently tied to ANSI, supports Unicode through its wide-character support.

IV. UTF-8

The bytes FF and FE never appear in UTF-8 encoded text, so they can be used to mark text as UTF-16 or UTF-32 (see BOM). UTF-8 itself is independent of byte order.

UTF-8 encodes Unicode using 8-bit bytes as its unit. The mapping from Unicode to UTF-8 is as follows:

Unicode code point (hexadecimal)    UTF-8 byte stream (binary)

000000-00007F                       0xxxxxxx (7 x)

000080-0007FF                       110xxxxx 10xxxxxx (11 x)

000800-00FFFF                       1110xxxx 10xxxxxx 10xxxxxx (16 x)

010000-10FFFF                       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (21 x)

UTF-8 is characterized by using encodings of different lengths for different ranges of characters. For the characters between 0x00 and 0x7F, at the start of plane 0 (the BMP), the UTF-8 encoding is exactly the same as the ASCII encoding.

"\x32"      -> "2"
"\u0032"    -> "2"

The maximum length of a UTF-8 encoding is 4 bytes. As the table above shows, the 4-byte template has 21 x's, so it can hold a 21-bit binary number. The largest Unicode code point, 0x10FFFF, also needs only 21 bits.

Example 1: The Unicode code point of the character "汉" (Han) is 0x6C49. 0x6C49 lies between 0x0800 and 0xFFFF, so it uses the 3-byte template: 1110xxxx 10xxxxxx 10xxxxxx.

Written in binary, 0x6C49 is 0110 1100 0100 1001. Substituting this bit stream for the x's in the template in order gives 11100110 10110001 10001001, that is, E6 B1 89.

encodeURI("汉")    -> "%E6%B1%89"

Example 2: The Unicode code point 0x20C30 lies between 0x010000 and 0x10FFFF, so it uses the 4-byte template: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx. Written as a 21-bit binary number (padded with leading zeros), 0x20C30 is 0 0010 0000 1100 0011 0000. Substituting this bit stream for the x's in the template in order gives 11110000 10100000 10110000 10110000, that is, F0 A0 B0 B0.
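The two worked examples can be reproduced with a small hand-written encoder that follows the four templates in the table. This is a minimal sketch: `utf8Bytes` and `toHex` are names invented for this illustration, and the function does not reject surrogate values (0xD800-0xDFFF) as a strict encoder would.

```javascript
// Hypothetical helper: encode one code point with the four UTF-8 templates.
function utf8Bytes(cp) {
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10ffff) {
    throw new RangeError("code point out of range: " + cp);
  }
  if (cp <= 0x7f) {
    return [cp];                                   // 0xxxxxxx
  }
  if (cp <= 0x7ff) {
    return [0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)]; // 110xxxxx 10xxxxxx
  }
  if (cp <= 0xffff) {
    return [                                       // 1110xxxx 10xxxxxx 10xxxxxx
      0xe0 | (cp >> 12),
      0x80 | ((cp >> 6) & 0x3f),
      0x80 | (cp & 0x3f),
    ];
  }
  return [                                         // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    0xf0 | (cp >> 18),
    0x80 | ((cp >> 12) & 0x3f),
    0x80 | ((cp >> 6) & 0x3f),
    0x80 | (cp & 0x3f),
  ];
}

// Format a byte array as upper-case hex pairs.
function toHex(bytes) {
  return bytes.map((b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
}

console.log(toHex(utf8Bytes(0x6c49)));  // E6 B1 89
console.log(toHex(utf8Bytes(0x20c30))); // F0 A0 B0 B0
```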

V. The love affair between JavaScript and Unicode

The String.fromCharCode method converts any number of numeric character codes into a string.

Inside a string literal, \u followed by four hexadecimal digits denotes the character with that code.

String.fromCharCode("0x4e01")    -> "丁"
0x4e01.toString(10)              -> "19969"
String.fromCharCode(19969)       -> "丁"
"0x4e01".toString()              -> "0x4e01"
"\u4e01".toString()              -> "丁"


VI. Output

Output "\u4e01"

Plain string literals (the right-hand side is the resulting text):

"\u4e01"       -> 丁
"\\u4e01"      -> \u4e01
"\\\u4e01"     -> \丁
"\\\\u4e01"    -> \\u4e01

The same literals passed through eval, which parses the text a second time:

eval('"\u4e01"')       -> 丁        (eval sees "丁")
eval('"\\u4e01"')      -> 丁        (eval sees "\u4e01")
eval('"\\\u4e01"')     -> 丁        (eval sees "\丁")
eval('"\\\\u4e01"')    -> \u4e01    (eval sees "\\u4e01")

Output "\丁"

"\丁"      -> 丁
"\\丁"     -> \丁
"\\\丁"    -> \丁
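The escaping levels above can be checked without eval. The sketch below uses String.raw and JSON.parse as stand-ins; that substitution is a choice made for this illustration, not something the examples above rely on.

```javascript
const literal = "\u4e01";       // the escape is processed: one character, 丁
const escaped = "\\u4e01";      // the backslash is escaped: six characters, \u4e01
const raw = String.raw`\u4e01`; // raw template literal: the same six characters

console.log(literal.length, escaped.length, raw === escaped); // 1 6 true

// JSON string literals use the same \uXXXX escape, so JSON.parse can turn
// the six-character text back into 丁 without calling eval.
const decoded = JSON.parse('"' + escaped + '"');
console.log(decoded === literal); // true
```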


VII. BOM

There are two byte orders: big-endian (BE) and little-endian (LE).

Depending on the byte order, UTF-16 can be implemented as UTF-16LE or UTF-16BE, and UTF-32 can be implemented as UTF-32LE or UTF-32BE. For example:

Unicode code point    UTF-16LE       UTF-16BE       UTF-32LE       UTF-32BE

0x006C49              49 6C          6C 49          49 6C 00 00    00 00 6C 49

0x020C30              43 D8 30 DC    D8 43 DC 30    30 0C 02 00    00 02 0C 30
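The byte sequences for the two code points 0x6C49 and 0x20C30 can be reproduced with a DataView, whose setters take an explicit little-endian flag. This is a sketch only; `utf16Bytes`, `utf32Bytes`, and `hex` are hypothetical helper names.

```javascript
// Format bytes as upper-case hex pairs.
function hex(bytes) {
  return Array.from(bytes, (b) => b.toString(16).toUpperCase().padStart(2, "0")).join(" ");
}

// JavaScript strings are sequences of UTF-16 code units, so each unit is written as-is.
function utf16Bytes(text, littleEndian) {
  const view = new DataView(new ArrayBuffer(text.length * 2));
  for (let i = 0; i < text.length; i++) {
    view.setUint16(i * 2, text.charCodeAt(i), littleEndian);
  }
  return hex(new Uint8Array(view.buffer));
}

// UTF-32 stores one 4-byte value per code point, so iterate by code point.
function utf32Bytes(text, littleEndian) {
  const codePoints = Array.from(text, (ch) => ch.codePointAt(0));
  const view = new DataView(new ArrayBuffer(codePoints.length * 4));
  codePoints.forEach((cp, i) => view.setUint32(i * 4, cp, littleEndian));
  return hex(new Uint8Array(view.buffer));
}

const han = "\u6c49";                         // 0x6C49
const astral = String.fromCodePoint(0x20c30); // 0x20C30

console.log(utf16Bytes(han, true), "|", utf16Bytes(han, false));       // 49 6C | 6C 49
console.log(utf16Bytes(astral, true), "|", utf16Bytes(astral, false)); // 43 D8 30 DC | D8 43 DC 30
console.log(utf32Bytes(astral, true), "|", utf32Bytes(astral, false)); // 30 0C 02 00 | 00 02 0C 30
```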

The Unicode standard recommends using a BOM (Byte Order Mark) to tell the byte orders apart: before the byte stream is transmitted, the character "zero width no-break space" is sent first as the BOM. This character's code is FEFF, whereas its byte-swapped forms FFFE (UTF-16) and FFFE0000 (UTF-32) are not defined as characters in Unicode and should not appear in real transmissions.

The following table lists the BOM for each UTF encoding:

UTF encoding         Byte Order Mark (BOM)

UTF-8 without BOM    none

UTF-8 with BOM       EF BB BF

UTF-16LE             FF FE

UTF-16BE             FE FF

UTF-32LE             FF FE 00 00

UTF-32BE             00 00 FE FF
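As a rough illustration of how the table can be used, the sketch below sniffs the BOM at the start of a byte array. `detectBom` is a hypothetical helper, and the result is only a guess: FF FE 00 00 could also be UTF-16LE text that begins with a NUL character.

```javascript
// Hypothetical helper: return the encoding named by a leading BOM, or null.
function detectBom(bytes) {
  const startsWith = (...signature) => signature.every((b, i) => bytes[i] === b);

  if (startsWith(0xef, 0xbb, 0xbf)) return "UTF-8";
  // The UTF-32 checks must come first: FF FE 00 00 also starts with FF FE.
  if (startsWith(0xff, 0xfe, 0x00, 0x00)) return "UTF-32LE";
  if (startsWith(0x00, 0x00, 0xfe, 0xff)) return "UTF-32BE";
  if (startsWith(0xff, 0xfe)) return "UTF-16LE";
  if (startsWith(0xfe, 0xff)) return "UTF-16BE";
  return null; // no BOM found
}

console.log(detectBom([0xef, 0xbb, 0xbf, 0xe6, 0xb1, 0x89])); // UTF-8
console.log(detectBom([0xff, 0xfe, 0x49, 0x6c]));             // UTF-16LE
console.log(detectBom([0xe6, 0xb1, 0x89]));                   // null
```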

VIII. Discussion

Why does a Chinese character take 3 bytes?

Chinese characters fall in the range 4E00-9FBF, the CJK Unified Ideographs (this is the range used in this article; the block itself extends to 9FFF).

Code points in the range 000800-00FFFF are represented in UTF-8 as 1110xxxx 10xxxxxx 10xxxxxx, which is three bytes.

English is represented in ASCII, and the ASCII encoding is identical to the UTF-8 encoding for those characters. Their range is 0x00-0x7F.

Code points in the range 000000-00007F are represented in UTF-8 as 0xxxxxxx, which is a single byte.

How to get a random Chinese character?

0x4e00.toString(10)    -> "19968"
0x9fbf.toString(10)    -> "40895"
40895 - 19968          -> 20927
String.fromCharCode(19968 + Math.round(Math.random() * 20927))
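A slightly more careful variant of the one-liner is sketched below. It swaps Math.round for Math.floor over 20928 slots so that the first and last characters are as likely as the rest. The function name `randomCjk` and its injectable `random` parameter are assumptions added here to make the sketch testable.

```javascript
const FIRST = 0x4e00; // 19968
const LAST = 0x9fbf;  // 40895

// Hypothetical helper: pick one character uniformly from FIRST..LAST.
// `random` defaults to Math.random and must return a number in [0, 1).
function randomCjk(random = Math.random) {
  const offset = Math.floor(random() * (LAST - FIRST + 1));
  return String.fromCharCode(FIRST + offset);
}

console.log(randomCjk());           // some character between U+4E00 and U+9FBF
console.log(randomCjk(() => 0));    // 一 (U+4E00)
```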


Is a BOM header needed?

UTF-8 has the same byte order on every system, so it does not actually need a BOM. In PHP, however, nothing may be output before a session is created, so UTF-8 encoded PHP files generally have the BOM header removed.

Is Unicode written the same way in HTML as in JavaScript?

In HTML, "&#x" plus a hexadecimal number (or "&#" plus a decimal number), followed by ";", gives the corresponding character.

document.write("&#x4e01;")    -> 丁

document.write("&#19969;")    -> 丁
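Going the other way, from characters to HTML numeric character references, might look like the following. `toHtmlEntities` is a name invented for this sketch, and it converts every character, including plain ASCII.

```javascript
// Hypothetical helper: write each code point of `text` as an HTML numeric
// character reference, hexadecimal by default or decimal when hex is false.
function toHtmlEntities(text, hex = true) {
  return Array.from(text, (ch) => {
    const cp = ch.codePointAt(0);
    return hex ? "&#x" + cp.toString(16) + ";" : "&#" + cp + ";";
  }).join("");
}

console.log(toHtmlEntities("丁"));        // &#x4e01;
console.log(toHtmlEntities("丁", false)); // &#19969;
```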

How do I find Chinese characters?

"么么哒,meme".match(/[\u4e00-\u9fbf]/img)    -> ["么", "么", "哒"]
 

Length?

"丁".length         -> 1
"\u4e01".length    -> 1

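One caveat about length: it counts UTF-16 code units, not characters, so a code point outside the BMP (such as 0x20C30 from the UTF-8 example) reports a length of 2. A short sketch:

```javascript
const bmpChar = "\u4e01";                         // 丁, inside the BMP
const astralChar = String.fromCodePoint(0x20c30); // outside the BMP

console.log(bmpChar.length);                         // 1
console.log(astralChar.length);                      // 2 (surrogate pair D843 DC30)
console.log(Array.from(astralChar).length);          // 1 (iteration is by code point)
console.log(astralChar.codePointAt(0).toString(16)); // 20c30
console.log(astralChar.charCodeAt(0).toString(16));  // d843
```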
How to obtain the Unicode code of a Chinese character?

"么".charCodeAt(0).toString(16)    -> "4e48"

var a = "么么哒,meme";
a.replace(/[\u4e00-\u9fbf]/img, function ($) { return "\\u" + $.charCodeAt(0).toString(16); })    -> "\u4e48\u4e48\u54d2,meme"

parseInt(encodeURI("丁").split("%").slice(1).map(function (v) { return parseInt(v, 16).toString(2).replace(/^1*0/, ""); }).join(""), 2).toString(16)    -> "4e01"
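The replace-based conversion can be generalized to any non-ASCII text, with each code padded to four hexadecimal digits so that codes below 0x1000 also produce valid escapes. `toUnicodeEscapes` is a hypothetical name, and the sketch works on UTF-16 code units, so a character outside the BMP becomes two escapes.

```javascript
// Hypothetical helper: replace every non-ASCII UTF-16 code unit with \uXXXX.
function toUnicodeEscapes(text) {
  return text.replace(/[^\x00-\x7f]/g, (unit) =>
    "\\u" + unit.charCodeAt(0).toString(16).padStart(4, "0")
  );
}

console.log(toUnicodeEscapes("么么哒,meme")); // \u4e48\u4e48\u54d2,meme
console.log(toUnicodeEscapes("é"));           // \u00e9
```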

