Implementing UTF-8 encoding in JavaScript

Last Update:2016-07-01 Source: Internet

Author: User

JavaScript character set:

JavaScript programs are written using the Unicode character set. Unicode is a superset of ASCII and Latin-1 and supports almost every language in use. ECMAScript 3 requires implementations to support Unicode 2.1 or later, and ECMAScript 5 requires Unicode 3 or later. JavaScript strings are therefore Unicode strings; internally, each string is a sequence of 16-bit UTF-16 code units.

UTF-8

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode, and it is also a prefix code.

It can represent any character in the Unicode standard, and its single-byte encodings are identical to ASCII, so software written to process ASCII text can keep working with little or no modification. As a result, it has gradually become the preferred encoding for storing and sending text in email, web pages, and other applications.

At present, most websites use UTF-8 encoding.

Converting a JavaScript Unicode string into a UTF-8 encoded string

As the title says, this scenario is very common. For example, when sending binary data to a server, the server may require the content to be UTF-8 encoded. In that case, we must convert the JavaScript Unicode string into UTF-8 bytes in our own code.

Conversion Method

Before converting, we must understand that the encoding JavaScript uses for strings has a fixed structure: every code unit is 16 bits (2 bytes).

If you do not believe it, try the charCodeAt method of String and see how many bytes the returned charCode occupies.

• The value of an English character fits in 1 byte; the value of a Chinese character needs 2 bytes
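A quick way to try this out is sketched below. The helper name `bytesNeeded` is made up for illustration; it only reports whether the value returned by `charCodeAt` fits in one byte or needs two.

```javascript
// Hypothetical helper: how many bytes are needed to hold a charCode value.
// charCodeAt always returns a 16-bit value, so the answer is 1 or 2.
function bytesNeeded(charCode) {
  return charCode <= 0xff ? 1 : 2;
}

const english = 'A'.charCodeAt(0);  // 65
const chinese = '中'.charCodeAt(0); // 20013

console.log(english, bytesNeeded(english)); // 65 1
console.log(chinese, bytesNeeded(chinese)); // 20013 2
```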

UTF-8, by contrast, is variable-length: the number of bytes depends on the size of each character's code.

The following list shows how many bytes a single character occupies. In the original UTF-8 design, a single Unicode character can take up to 6 bytes after encoding. (The current standard, RFC 3629, stops at 4 bytes, because Unicode ends at 0x10FFFF.)

• 1 byte: Unicode code 0-127
• 2 bytes: Unicode code 128-2047
• 3 bytes: Unicode code 2048-0xFFFF
• 4 bytes: Unicode code 65536-0x1FFFFF
• 5 bytes: Unicode code 0x200000-0x3FFFFFF
• 6 bytes: Unicode code 0x4000000-0x7FFFFFFF
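The ranges listed above translate directly into a lookup function. This is only a sketch; `utf8ByteLength` is a name chosen here for illustration, and it follows the six-range table as given rather than the stricter 4-byte limit of current UTF-8.

```javascript
// Hypothetical helper: number of UTF-8 bytes for a Unicode code value,
// following the six ranges listed above.
function utf8ByteLength(code) {
  if (code < 0 || code > 0x7fffffff) {
    throw new RangeError('code out of range: ' + code);
  }
  if (code <= 0x7f) return 1;      // 0-127
  if (code <= 0x7ff) return 2;     // 128-2047
  if (code <= 0xffff) return 3;    // 2048-0xFFFF
  if (code <= 0x1fffff) return 4;  // 65536-0x1FFFFF
  if (code <= 0x3ffffff) return 5; // 0x200000-0x3FFFFFF
  return 6;                        // 0x4000000-0x7FFFFFFF
}

console.log(utf8ByteLength(65));      // 1
console.log(utf8ByteLength(20013));   // 3
console.log(utf8ByteLength(0x1f600)); // 4
```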

[Image: UTF-8 byte length by Unicode code range]

Because the Unicode codes of English letters and other ASCII characters are 0-127, these characters have the same value and length in UTF-8 as in ASCII: exactly 1 byte. That is why UTF-8 is a superset of ASCII!

Now let's discuss Chinese characters. Because the Unicode code range of Chinese characters is 0x2E80-0x9FFF, a Chinese character in this range takes 3 bytes in UTF-8.

How do we convert a Chinese character from its two-byte Unicode value into three UTF-8 bytes?

Suppose we need to convert the Chinese character "中" into UTF-8.

1. Get the Unicode value of Chinese Characters

    var str = '中';
    var charCode = str.charCodeAt(0);
    console.log(charCode); // => 20013

2. Determine the UTF8 length based on the size.

From the previous step, we obtained the charCode 20013 for the Chinese character "中". 20013 lies in the range 2048-0xFFFF, so "中" occupies three bytes in UTF-8.

3. Fill in the bits

Since we know that "中" requires three bytes, how do we get these three bytes?

This is done by filling the bits of the code into a fixed template for each length. The templates are as follows; "x" marks a vacant position to be filled with a bit of the code.

• 0xxxxxxx
• 110xxxxx 10xxxxxx
• 1110xxxx 10xxxxxx 10xxxxxx
• 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
• 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
• 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Note: have you noticed? In a multi-byte sequence, the number of leading 1 bits in the first byte gives the total number of bytes in the UTF-8 encoding. Decoding UTF-8 back to Unicode relies on exactly this feature.
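To make the leading-1s rule concrete, here is a minimal sketch that reads the sequence length from a first byte. The function name `sequenceLength` is an assumption for illustration, not something from the article's own code.

```javascript
// Hypothetical helper: read the total byte count of a UTF-8 sequence
// from its first byte by counting the leading 1 bits.
function sequenceLength(firstByte) {
  let ones = 0;
  for (let mask = 0x80; mask !== 0 && (firstByte & mask) !== 0; mask >>= 1) {
    ones++;
  }
  if (ones === 0) return 1; // 0xxxxxxx: a single-byte character
  if (ones === 1 || ones > 6) {
    // 10xxxxxx is a continuation byte; 7 or 8 leading 1s match no pattern
    throw new RangeError('not a valid first byte: ' + firstByte);
  }
  return ones; // 110xxxxx -> 2, 1110xxxx -> 3, and so on
}

console.log(sequenceLength(0x41)); // 1
console.log(sequenceLength(0xe4)); // 3
```

A decoder can call this on the first byte and then consume that many bytes in total before moving on to the next character.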

Let's take a simple example and convert the English letter "A" into UTF-8.

1. The charCode of "A" is 65.
2. 65 is in the range 0 to 127, so "A" occupies one byte.
3. In UTF-8, the template for one byte is 0xxxxxxx, where x marks the vacant positions to be filled.
4. Convert 65 to binary to get 1000001.
5. Fill 1000001 into the vacant positions of 0xxxxxxx from front to back to get 01000001.
6. Convert 01000001 back to a string to get "A".
7. So the UTF-8 encoding of "A" is still "A".
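The steps for "A" can be replayed in a few lines. This is just a sketch of the walkthrough above using string operations on the binary digits; the variable names are chosen here for illustration.

```javascript
const charCode = 'A'.charCodeAt(0);                 // step 1: 65
const bits = charCode.toString(2).padStart(7, '0'); // step 4: '1000001'
const filled = '0' + bits;                          // step 5: fill 0xxxxxxx -> '01000001'
const byte = parseInt(filled, 2);                   // back to a number: 65

console.log(filled, byte, String.fromCharCode(byte)); // 01000001 65 A
```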

This small example confirms once again that UTF-8 is a superset of ASCII!

Now let's go back to the Chinese character "中". We already know that the charCode of "中" is 20013, whose binary value is 01001110 00101101:

    var code = 20013;
    code.toString(2); // => '100111000101101', equivalent to 01001110 00101101

Then we fill in "中" the same way as for "A" above.
Fill 01001110 00101101 into 1110xxxx 10xxxxxx 10xxxxxx from front to back to get 11100100 10111000 10101101.
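The same fill can be written out as string slicing over the 16 binary digits. This is a sketch only; `fillThreeByteTemplate` is a hypothetical name, and it deliberately works on digit strings so that each step matches the description above.

```javascript
// Hypothetical helper: fill a 16-bit code into the 3-byte pattern
// 1110xxxx 10xxxxxx 10xxxxxx, from front to back.
function fillThreeByteTemplate(code) {
  if (code < 0x800 || code > 0xffff) {
    throw new RangeError('code does not take exactly 3 bytes: ' + code);
  }
  const bits = code.toString(2).padStart(16, '0'); // e.g. '0100111000101101'
  return [
    '1110' + bits.slice(0, 4),  // first 4 bits
    '10' + bits.slice(4, 10),   // next 6 bits
    '10' + bits.slice(10, 16),  // last 6 bits
  ];
}

console.log(fillThreeByteTemplate(20013));
// [ '11100100', '10111000', '10101101' ]
```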

4. Obtain the UTF-8 encoded content.

From the steps above, we get the three UTF-8 bytes of "中": 11100100 10111000 10101101.

Converting each byte to hexadecimal gives 0xE4 0xB8 0xAD.
So 0xE4 0xB8 0xAD is the final UTF-8 encoding.
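As a small illustration of the binary-to-hexadecimal step, the three binary strings can be converted like this (variable names are chosen here for the example):

```javascript
const binaryBytes = ['11100100', '10111000', '10101101'];

// Parse each binary string, then print it as two uppercase hex digits.
const hexBytes = binaryBytes.map(
  (b) => '0x' + parseInt(b, 2).toString(16).toUpperCase().padStart(2, '0')
);

console.log(hexBytes.join(' ')); // 0xE4 0xB8 0xAD
```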

We can use Buffer in Node.js to verify the result.

    var buffer = Buffer.from('中'); // new Buffer() is deprecated
    console.log(buffer.length); // => 3
    console.log(buffer); // => <Buffer e4 b8 ad>
    // Finally, three bytes 0xe4 0xb8 0xad are obtained.

Because hexadecimal is case-insensitive, this is exactly the same as 0xE4 0xB8 0xAD.
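Outside Node.js, the same check can be done with the standard `TextEncoder`, which is available in modern browsers (and as a global in recent Node.js versions). A minimal sketch:

```javascript
// TextEncoder always encodes to UTF-8 and returns a Uint8Array.
const bytes = Array.from(new TextEncoder().encode('中'));

console.log(bytes);                            // [ 228, 184, 173 ]
console.log(bytes.map((b) => b.toString(16))); // [ 'e4', 'b8', 'ad' ]
```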

Now let's put the encoding logic above into a function.

    // Format a string as UTF-8 encoded bytes.
    // If isGetBytes is true, return only the bytes; otherwise prefix
    // them with a two-byte length.
    var writeUTF = function (str, isGetBytes) {
        var back = [];
        var byteSize = 0;
        for (var i = 0; i < str.length; i++) {
            var code = str.charCodeAt(i);
            if (0x00 <= code && code <= 0x7f) {
                // 1 byte: 0xxxxxxx
                byteSize += 1;
                back.push(code);
            } else if (0x80 <= code && code <= 0x7ff) {
                // 2 bytes: 110xxxxx 10xxxxxx
                byteSize += 2;
                back.push(192 | (31 & (code >> 6)));
                back.push(128 | (63 & code));
            } else if ((0x800 <= code && code <= 0xd7ff) ||
                       (0xe000 <= code && code <= 0xffff)) {
                // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
                byteSize += 3;
                back.push(224 | (15 & (code >> 12)));
                back.push(128 | (63 & (code >> 6)));
                back.push(128 | (63 & code));
            }
            // Code units 0xd800-0xdfff (surrogates) are skipped.
        }
        for (i = 0; i < back.length; i++) {
            back[i] &= 0xff;
        }
        if (isGetBytes) {
            return back;
        }
        if (byteSize <= 0xff) {
            return [0, byteSize].concat(back);
        } else {
            return [byteSize >> 8, byteSize & 0xff].concat(back);
        }
    };

    writeUTF('中'); // => [0, 3, 228, 184, 173]
    // The first two numbers give the length of the UTF-8 bytes that follow.
    // Because the length is 3, the first two bytes are `0, 3`.
    // The content is `228, 184, 173`, which in hexadecimal is `0xE4 0xB8 0xAD`.

    // Read UTF-8 encoded bytes and convert them to a Unicode string.
    var readUTF = function (arr) {
        if (typeof arr === 'string') {
            return arr;
        }
        // The original called this.init(arr), which is not shown in the
        // article. Assumption: it strips the two-byte length prefix that
        // writeUTF adds.
        var UTF = '', _arr = arr.slice(2);
        for (var i = 0; i < _arr.length; i++) {
            var one = _arr[i].toString(2),
                v = one.match(/^1+?(?=0)/); // leading 1s of the first byte
            if (v && one.length === 8) {
                var bytesLength = v[0].length;
                // Skip the leading 1s and the 0 that follows them
                // (the original 7 - bytesLength is only right for 3 bytes).
                var store = one.slice(bytesLength + 1);
                for (var st = 1; st < bytesLength; st++) {
                    // Drop the "10" prefix of each continuation byte.
                    store += _arr[st + i].toString(2).slice(2);
                }
                UTF += String.fromCharCode(parseInt(store, 2));
                i += bytesLength - 1;
            } else {
                UTF += String.fromCharCode(_arr[i]);
            }
        }
        return UTF;
    };

    readUTF([0, 3, 228, 184, 173]); // => '中'
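The functions above work on 16-bit code units and skip the surrogate range, so characters above 0xFFFF (which need 4 bytes) are dropped. One possible extension, sketched here under the assumption that ES2015 string iteration and `codePointAt` are available, encodes by code point instead. The name `encodeUtf8` is made up for this example, and lone surrogates are not treated specially.

```javascript
// Sketch: encode by code point so that a surrogate pair becomes one
// 4-byte sequence. Returns the bytes only, with no length prefix.
function encodeUtf8(str) {
  const out = [];
  for (const ch of str) { // for...of iterates by code point
    const cp = ch.codePointAt(0);
    if (cp <= 0x7f) {
      out.push(cp);
    } else if (cp <= 0x7ff) {
      out.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f));
    } else if (cp <= 0xffff) {
      out.push(0xe0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3f), 0x80 | (cp & 0x3f));
    } else {
      out.push(
        0xf0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3f),
        0x80 | ((cp >> 6) & 0x3f),
        0x80 | (cp & 0x3f)
      );
    }
  }
  return out;
}

console.log(encodeUtf8('中'));         // [ 228, 184, 173 ]
console.log(encodeUtf8('\u{1F600}')); // [ 240, 159, 152, 128 ]
```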

Another way to get the UTF-8 bytes of Chinese characters

There is another, simpler way to convert Chinese characters into UTF-8 bytes. The browser provides a method for it, and it is one that everyone already uses. What is it? encodeURI. Of course, encodeURIComponent works too.

That's the method. So how does it turn a Unicode-encoded Chinese character into UTF-8 bytes?

    var str = '中';
    var code = encodeURI(str);
    console.log(code); // => %E4%B8%AD

Notice that we get an escaped string, and its content is the same as the bytes obtained above.

Next, we convert %E4%B8%AD into an array of numbers.

    var codeList = code.split('%').slice(1); // drop the empty string before the first '%'
    codeList = codeList.map(item => parseInt(item, 16));
    console.log(codeList); // => [228, 184, 173]

It's that simple!

What is the principle of this simple method?

This involves the encoding of the query string in a URI. By convention, the query string in a URI must be transmitted as UTF-8, while JavaScript strings are Unicode, so the browser provides the encodeURI/encodeURIComponent methods. These methods first convert non-English characters to UTF-8 bytes (English letters and digits are left as they are), then put a % in front of each byte and join the results. That is why escaping the Chinese character "中" gives "%E4%B8%AD".

Okay, that's all there is to the principle.

However, this method has a disadvantage: it only escapes non-English characters. So when the string also contains English characters, the escaped output alone does not give us all the UTF-8 bytes; we also have to handle the unescaped English characters ourselves.
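One way to work around this limitation, offered only as a sketch, is to walk the escaped string and treat the two cases separately: a `%XX` group is one byte, and any unescaped character is plain ASCII, so its char code is already its UTF-8 byte. The helper name `utf8BytesViaEscape` is hypothetical. Note that `encodeURIComponent` throws a `URIError` if the string contains a lone surrogate.

```javascript
// Hypothetical helper: get the UTF-8 bytes of any string via encodeURIComponent.
function utf8BytesViaEscape(str) {
  const escaped = encodeURIComponent(str);
  const bytes = [];
  for (let i = 0; i < escaped.length; i++) {
    if (escaped[i] === '%') {
      // '%XX': the two hex digits are one UTF-8 byte
      bytes.push(parseInt(escaped.slice(i + 1, i + 3), 16));
      i += 2;
    } else {
      // Unescaped characters are ASCII, so the char code is the byte
      bytes.push(escaped.charCodeAt(i));
    }
  }
  return bytes;
}

console.log(utf8BytesViaEscape('A中')); // [ 65, 228, 184, 173 ]
```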

So what if we want to parse the bytes back? Use decodeURI/decodeURIComponent.

    var codeList = [228, 184, 173];
    var code = codeList.map(item => '%' + item.toString(16)).join('');
    decodeURI(code); // => '中'
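The snippet above works for bytes of 0x10 and higher, but `toString(16)` yields a single digit for smaller bytes, which would produce an invalid escape such as `%a`. A slightly more defensive sketch pads each byte to two hex digits and uses `decodeURIComponent`, which decodes every escape; the name `utf8BytesToString` is an assumption for illustration.

```javascript
// Hypothetical helper: decode an array of UTF-8 bytes back into a string.
function utf8BytesToString(bytes) {
  const escaped = bytes
    .map((b) => '%' + b.toString(16).padStart(2, '0')) // always two hex digits
    .join('');
  return decodeURIComponent(escaped); // throws URIError on invalid UTF-8
}

console.log(utf8BytesToString([228, 184, 173])); // 中
console.log(utf8BytesToString([65, 228, 184, 173])); // A中
```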

That concludes this introduction to UTF-8 encoding.
I hope it helps you understand how UTF-8 encoding works.


This article is an English version of an article originally published in Chinese on aliyun.com.
