Implementing UTF-8 encoding in JavaScript

Source: Internet
Author: User

The JavaScript character set:

JavaScript programs are written using the Unicode character set. Unicode is a superset of ASCII and Latin-1 and supports almost every language in use on Earth. ECMAScript 3 requires JavaScript implementations to support Unicode 2.1 or later, and ECMAScript 5 requires Unicode 3 or later. So the JavaScript programs we write, and the strings in them, are all Unicode.

UTF-8

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode, and also a prefix code.

It can represent any character in the Unicode standard, and its single-byte encodings are still compatible with ASCII, so software written to handle ASCII text can keep working with no changes or only minor ones. As a result, it has gradually become the preferred encoding for e-mail, web pages, and other applications that store or transmit text.

Most websites today are encoded in UTF-8.

Converting a Unicode string produced by JavaScript into a UTF-8 encoded string

As the title says, this scenario is very common. For example, we send a piece of binary data to a server, and the server requires the content to be UTF-8 encoded. In that case we have to convert the JavaScript Unicode string into UTF-8 bytes in our own code.

Conversion methods

Before converting, we need to understand that a JavaScript Unicode string has a fixed-width structure: every element of the string is one 16-bit code unit.

If you don't believe it, call the String method charCodeAt and see how many bytes the returned charCode needs.

• An English character fits in 1 byte; a Chinese character needs 2 bytes
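
As a rough illustration of the point above, the following sketch prints the charCode of an English and a Chinese character, together with the number of bytes that value needs. The helper name bytesNeeded is made up for this example and does not come from the original article.

```javascript
// bytesNeeded is a hypothetical helper: it reports how many bytes are
// required to hold the numeric value returned by charCodeAt.
function bytesNeeded(ch) {
  const code = ch.charCodeAt(0); // one 16-bit code unit, 0..65535
  return code <= 0xff ? 1 : 2;
}

console.log('A'.charCodeAt(0), bytesNeeded('A'));   // 65 1
console.log('中'.charCodeAt(0), bytesNeeded('中')); // 20013 2
```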

UTF-8, by contrast, is variable-length: the number of bytes depends on the numeric value of the character's Unicode code.

The list below shows how many bytes each range of values takes. In the original UTF-8 design a single Unicode character could take up to 6 bytes; the current standard (RFC 3629) stops at 0x10FFFF, so only the 1 to 4 byte forms occur in practice.

• 1 byte: Unicode code is 0-127
• 2 bytes: Unicode code is 128-2047
• 3 bytes: Unicode code is 2048-0xFFFF
• 4 bytes: Unicode code is 65536-0x1FFFFF
• 5 bytes: Unicode code is 0x200000-0x3FFFFFF
• 6 bytes: Unicode code is 0x4000000-0x7FFFFFFF
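
The ranges in this list can be turned into a small lookup function. This is only a sketch of the table as written; the name utf8ByteLength is an assumption for illustration, and the 5 and 6 byte branches are kept solely to mirror the list.

```javascript
// utf8ByteLength is a hypothetical helper that mirrors the list above.
// It takes a Unicode code value and returns the number of UTF-8 bytes.
function utf8ByteLength(code) {
  if (code <= 0x7f) return 1;       // 0-127
  if (code <= 0x7ff) return 2;      // 128-2047
  if (code <= 0xffff) return 3;     // 2048-0xFFFF
  if (code <= 0x1fffff) return 4;   // 65536-0x1FFFFF
  if (code <= 0x3ffffff) return 5;  // historical 5-byte form
  return 6;                         // historical 6-byte form
}

console.log(utf8ByteLength(65));    // 1
console.log(utf8ByteLength(20013)); // 3
```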


Because the Unicode codes of English letters and English punctuation are in 0-127, English text has the same length and the same bytes in ASCII and in UTF-8, taking only 1 byte per character. That is why UTF-8 is said to be compatible with ASCII.

Now let's discuss Chinese characters. Their Unicode codes are in the range 0x2E80-0x9FFF, so a Chinese character takes 3 bytes in UTF-8.

So how does a Chinese character go from 2 bytes of Unicode to 3 bytes of UTF-8?

Suppose I need to convert the Chinese character "中" to UTF-8.

1. Get the Unicode value of the Chinese character

var str = '中';
var charCode = str.charCodeAt(0);
console.log(charCode); // => 20013

2. Determine the UTF-8 length from that value

In the previous step we found that the charCode of "中" is 20013. Since 20013 lies in the range 2048-0xFFFF, "中" takes 3 bytes in UTF-8.

3. Fill in the bits

Now that we know "中" takes 3 bytes, how do we get those 3 bytes?

This is done by filling the bits of the character's code into a fixed template.

The templates are listed below; "x" marks an empty position that is filled with a bit of the character's code.

0xxxxxxx
110xxxxx 10xxxxxx
1110xxxx 10xxxxxx 10xxxxxx
11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

Note: Have you noticed something? The number of leading 1 bits in the first byte tells you how many bytes the whole UTF-8 sequence occupies. Decoding UTF-8 back into Unicode relies on exactly this feature.
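
To make that feature concrete, here is a minimal sketch that counts the leading 1 bits of a first byte. The function name sequenceLength and the choice to return 0 for a continuation byte (10xxxxxx) are assumptions made for this example.

```javascript
// sequenceLength is a hypothetical helper. Given the first byte of a
// UTF-8 sequence, it returns how many bytes the sequence occupies.
function sequenceLength(firstByte) {
  let ones = 0;
  // Walk the bits from the highest one down until the first 0 bit.
  for (let mask = 0x80; mask !== 0 && (firstByte & mask) !== 0; mask >>= 1) {
    ones++;
  }
  if (ones === 0) return 1; // 0xxxxxxx: a single-byte (ASCII) character
  if (ones === 1) return 0; // 10xxxxxx: a continuation byte, not a first byte
  return ones;              // 110xxxxx -> 2, 1110xxxx -> 3, and so on
}

console.log(sequenceLength(0x41)); // 1
console.log(sequenceLength(0xe4)); // 3
```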

Let's start with a simple example: converting the English letter "A" to UTF-8.

1. The charCode of "A" is 65.
2. 65 lies in 0-127, so "A" takes 1 byte.
3. The 1-byte UTF-8 template is 0xxxxxxx, where x marks the positions to fill.
4. Convert 65 to binary to get 1000001.
5. Fill the bits of 1000001 into the x positions of 0xxxxxxx in order, front to back, to get 01000001.
6. Convert 01000001 back to a character to get "A".
7. So the UTF-8 encoding of "A" is still "A".

This small example confirms once more that UTF-8 is compatible with ASCII.

Now back to the Chinese character "中". We already have its charCode, 20013, which in binary is 01001110 00101101:

var code = 20013;
code.toString(2);
// => 100111000101101, which is 01001110 00101101 once padded to 16 bits

Then we fill in the bits for "中" the same way we did for "A".
Fill 01001110 00101101 into 1110xxxx 10xxxxxx 10xxxxxx in order, front to back, to get 11100100 10111000 10101101.
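
The bit-filling step can be reproduced with plain string operations. This is only a sketch for the 3-byte case; the variable names are chosen for this example, and padStart requires an ES2017 runtime.

```javascript
const code = '中'.charCodeAt(0);                  // 20013
const bits = code.toString(2).padStart(16, '0');  // '0100111000101101'

// Fill the 16 bits into the template 1110xxxx 10xxxxxx 10xxxxxx:
// 4 bits go into the first byte, 6 into the second, 6 into the third.
const filled = [
  '1110' + bits.slice(0, 4),
  '10' + bits.slice(4, 10),
  '10' + bits.slice(10, 16),
];

console.log(filled.join(' '));
// 11100100 10111000 10101101
console.log(filled.map(b => '0x' + parseInt(b, 2).toString(16)).join(' '));
// 0xe4 0xb8 0xad
```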

4. Get the UTF-8 encoded content

After the steps above we have the three UTF-8 bytes of "中": 11100100 10111000 10101101.

Converting each byte to hexadecimal gives 0xE4 0xB8 0xAD.
These bytes, 0xE4 0xB8 0xAD, are the UTF-8 encoding we were after.

Let's use a Node.js Buffer to verify the result.

var buffer = Buffer.from('中'); // new Buffer(string) is deprecated; Buffer.from uses UTF-8 by default
console.log(buffer.length); // => 3
console.log(buffer); // => <Buffer e4 b8 ad>
// the three bytes are 0xe4 0xb8 0xad

Hexadecimal is case-insensitive, so this is exactly the 0xE4 0xB8 0xAD we calculated.
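
Buffer exists only in Node.js. In a browser, roughly the same check could be done with the standard TextEncoder API, which always produces UTF-8. TextEncoder is not used in the original article; it appears here only as an alternative way to verify the bytes.

```javascript
// TextEncoder is a global in modern browsers and in recent Node.js versions.
const bytes = new TextEncoder().encode('中'); // a Uint8Array of UTF-8 bytes

console.log(bytes.length); // 3
console.log(Array.from(bytes, b => b.toString(16)).join(' ')); // e4 b8 ad
```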

Now let's write the encoding logic above as a function.

// Format a string as UTF-8 encoded bytes
var writeUTF = function (str, isGetBytes) {
  var back = [];
  var byteSize = 0;
  for (var i = 0; i < str.length; i++) {
    var code = str.charCodeAt(i);
    if (0x00 <= code && code <= 0x7f) {
      // 1 byte: 0xxxxxxx
      byteSize += 1;
      back.push(code);
    } else if (0x80 <= code && code <= 0x7ff) {
      // 2 bytes: 110xxxxx 10xxxxxx
      byteSize += 2;
      back.push(192 | (31 & (code >> 6)));
      back.push(128 | (63 & code));
    } else if ((0x800 <= code && code <= 0xd7ff) || (0xe000 <= code && code <= 0xffff)) {
      // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
      // (surrogate code units 0xd800-0xdfff are skipped)
      byteSize += 3;
      back.push(224 | (15 & (code >> 12)));
      back.push(128 | (63 & (code >> 6)));
      back.push(128 | (63 & code));
    }
  }
  for (i = 0; i < back.length; i++) {
    back[i] &= 0xff;
  }
  if (isGetBytes) {
    return back;
  }
  // Prefix the bytes with their total length as a 2-byte big-endian number
  if (byteSize <= 0xff) {
    return [0, byteSize].concat(back);
  } else {
    return [byteSize >> 8, byteSize & 0xff].concat(back);
  }
};

writeUTF('中'); // => [0, 3, 228, 184, 173]
// The first two bytes give the length of the UTF-8 bytes that follow. The length is 3, so they are 0, 3.
// The content is 228, 184, 173, which in hexadecimal is 0xe4 0xb8 0xad.

// Read UTF-8 encoded bytes and convert them to a Unicode string
var readUTF = function (arr) {
  if (typeof arr === 'string') {
    return arr;
  }
  // The original code called this.init(arr), a helper that is not shown in this article.
  // Assumption: it drops the 2-byte length prefix written by writeUTF.
  var UTF = '', _arr = arr.slice(2);
  for (var i = 0; i < _arr.length; i++) {
    var one = _arr[i].toString(2),
        v = one.match(/^1+?(?=0)/); // leading 1 bits = number of bytes in the sequence
    if (v && one.length == 8) {
      var bytesLength = v[0].length;
      // The payload bits of the first byte come after the leading 1s and the 0
      var store = one.slice(bytesLength + 1);
      for (var st = 1; st < bytesLength; st++) {
        // Each following byte is 10xxxxxx: drop the leading "10"
        store += _arr[st + i].toString(2).slice(2);
      }
      UTF += String.fromCharCode(parseInt(store, 2));
      i += bytesLength - 1;
    } else {
      UTF += String.fromCharCode(_arr[i]);
    }
  }
  return UTF;
};

readUTF([0, 3, 228, 184, 173]); // => '中'
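
The two functions above work on 16-bit charCode values, so characters outside the range 0-0xFFFF (stored in JavaScript as surrogate pairs) are not covered. The sketch below shows one possible way to extend the same bit templates to 4-byte sequences. The names utf8Encode and utf8Decode are made up for this example. It leaves out the length prefix and does almost no validation of malformed input.

```javascript
// Hypothetical helper: encode a string as UTF-8 bytes, one code point at a time.
function utf8Encode(str) {
  const bytes = [];
  for (const ch of str) {            // for...of walks code points, not code units
    const cp = ch.codePointAt(0);
    if (cp <= 0x7f) {
      bytes.push(cp);                                   // 0xxxxxxx
    } else if (cp <= 0x7ff) {
      bytes.push(0xc0 | (cp >> 6), 0x80 | (cp & 0x3f)); // 110xxxxx 10xxxxxx
    } else if (cp <= 0xffff) {
      bytes.push(0xe0 | (cp >> 12),                     // 1110xxxx
                 0x80 | ((cp >> 6) & 0x3f),
                 0x80 | (cp & 0x3f));
    } else {
      bytes.push(0xf0 | (cp >> 18),                     // 11110xxx
                 0x80 | ((cp >> 12) & 0x3f),
                 0x80 | ((cp >> 6) & 0x3f),
                 0x80 | (cp & 0x3f));
    }
  }
  return bytes;
}

// Hypothetical helper: decode UTF-8 bytes back into a string.
function utf8Decode(bytes) {
  let out = '';
  for (let i = 0; i < bytes.length; ) {
    const b = bytes[i];
    let len, cp;
    if (b < 0x80) { len = 1; cp = b; }
    else if (b >= 0xc0 && b < 0xe0) { len = 2; cp = b & 0x1f; }
    else if (b >= 0xe0 && b < 0xf0) { len = 3; cp = b & 0x0f; }
    else if (b >= 0xf0 && b < 0xf8) { len = 4; cp = b & 0x07; }
    else throw new Error('Unexpected byte at index ' + i);
    // Append the low 6 bits of every continuation byte (10xxxxxx).
    for (let k = 1; k < len; k++) {
      cp = (cp << 6) | (bytes[i + k] & 0x3f);
    }
    out += String.fromCodePoint(cp);
    i += len;
  }
  return out;
}

console.log(utf8Encode('中'));              // [ 228, 184, 173 ]
console.log(utf8Decode([228, 184, 173]));   // 中
```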

Another way to get the UTF-8 bytes of Chinese text

There is another, simpler way to convert Chinese text to UTF-8 bytes. The browser already provides a method for it, and we use it all the time. Which one? encodeURI. Of course, encodeURIComponent works too.

Yes, that is the one. So how does it turn Unicode Chinese text into UTF-8 bytes?

var str = '中';

var code = encodeURI(str);

console.log(code); // => %E4%B8%AD

Did you notice? We get an escaped string, and its contents are the same bytes we computed in the previous section.

Let's convert %E4%B8%AD into an array of numbers.

var codeList = code.split('%').slice(1); // drop the empty string before the first '%'

codeList = codeList.map(item => parseInt(item, 16));

console.log(codeList); // => [228, 184, 173]

Simple, isn't it?

What is the principle behind this simple method?

This comes down to how the query string in a URI is encoded. The query string of a URI has to be transmitted as UTF-8, while JavaScript strings are Unicode, so the browser gives us the encodeURI/encodeURIComponent methods. They first convert non-English characters (think about it: why only the non-English ones?) into UTF-8 bytes, then put a % in front of each byte and join the results. That is how the Chinese character "中" is escaped to "%E4%B8%AD".

Well, that's the whole principle; there is nothing more to it.

However, this approach has a drawback: it only escapes non-English characters. When we also need the UTF-8 bytes of English characters, it does not do the job by itself, and we have to convert the unescaped English characters separately.
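
One possible workaround, sketched here rather than taken from the original article: walk the escaped string, read each %XX group as a byte, and use the charCode of every character that was left unescaped (those are all ASCII, so the charCode is already the UTF-8 byte). The function name utf8BytesViaEscape is an assumption for this example. Note that encodeURIComponent throws a URIError for strings containing a lone surrogate.

```javascript
// Hypothetical helper built on encodeURIComponent.
function utf8BytesViaEscape(str) {
  const escaped = encodeURIComponent(str);
  const bytes = [];
  for (let i = 0; i < escaped.length; i++) {
    if (escaped[i] === '%') {
      // "%E4" -> 0xE4; skip over the two hex digits just consumed
      bytes.push(parseInt(escaped.slice(i + 1, i + 3), 16));
      i += 2;
    } else {
      // Unescaped characters are plain ASCII
      bytes.push(escaped.charCodeAt(i));
    }
  }
  return bytes;
}

console.log(utf8BytesViaEscape('A中')); // [ 65, 228, 184, 173 ]
```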

And how do we convert back? Just use decodeURI/decodeURIComponent.

var codeList = [228, 184, 173];

var code = codeList.map(item => '%' + item.toString(16)).join('');

decodeURI(code); // => '中'
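
The snippet above works for bytes of 0x10 and higher. For smaller bytes, toString(16) returns a single digit, which does not form a valid %XX escape. A slightly more defensive sketch pads every byte to two hex digits; the function name stringFromUtf8Bytes is made up for this example, and decodeURIComponent still throws a URIError if the bytes are not valid UTF-8.

```javascript
// Hypothetical helper: turn an array of UTF-8 bytes back into a string.
function stringFromUtf8Bytes(bytes) {
  const escaped = bytes
    .map(b => '%' + b.toString(16).padStart(2, '0')) // 10 -> "%0a", 228 -> "%e4"
    .join('');
  return decodeURIComponent(escaped);
}

console.log(stringFromUtf8Bytes([65, 228, 184, 173])); // A中
```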

That's all for this introduction to UTF-8 encoding.
I hope it helps you understand how UTF-8 encoding works.
