Utf8 encoding algorithm reprinted _ basic knowledge

Source: Internet
Author: User
Utf8 encoding algorithm reprinted unicode character set is the most comprehensive character set in the world, including almost all the characters in the world. In fact, it can be understood that the unicode character set is a huge table that orchestrates the characters and punctuation marks of various languages in the world, then, sort each character in a certain order (unfortunately, this order is not in the Chinese pinyin order ). With this huge table, most characters in the world have a unicode internal code (integer. The computer records the unicode code of the character and then delivers it to the operating system, the operating system maps the unicode code to the character font dot matrix to convert the internal code into a font dot matrix and display it on our screen.

Utf8 is a common encoding method. using utf8 encoding in web development can completely solve character set problems. In fact, utf8 is a physical implementation of the unicode character set. it describes how to efficiently store the unicode internal code (that is, the sequence code of the characters mentioned above in the character set), RFC2044 (http: // www.ietf.org/rfc/rfc2044.txt? Number = 2044) describes how to convert an internal code to an utf8 format algorithm. It doesn't matter if the English is not good. you can see the conversion table immediately:

UCS-4 range (hex.) UTF-8 octet sequence (binary)
0000 0000-0000 007F 0 xxxxxxx
0000 0080-0000 07FF 110 xxxxx 10 xxxxxx
0000 0800-0000 FFFF 1110 xxxx 10 xxxxxx 10 xxxxxx
0001 0000-001F FFFF 11110xxx 10 xxxxxx 10 xxxxxx 10 xxxxxx
0020 -03ff FFFF 111110xx 10 xxxxxx 10 xxxxxx 10 xxxxxx 10 xxxxxx
0400 2017-7fff FFFF 1111110x 10 xxxxxx... 10 xxxxxx

The left side of the above table is the unicode internal code in hexadecimal notation. the hexadecimal number "7FFF FFFF" in the last row is the maximum value of the internal code that utf8 can represent, the 10-digit number is as follows: 2147483647 (large enough :))[Sorry, this article was originally written incorrectly and has been corrected.]. The right column of the table above is the binary format of utf8, and the conversion rules are clear at a glance. Let me give the algorithm directly (js code ):


Function toUtf8 (code)
{
Var iByte = 0;
Var I = 0;
Result = "";
While (code> 0x7f)
{
IByte = code % 0x40;
Code = (code-iByte)/0x40;
Result = "%" + (iByte | 0x80). toString (16). toUpperCase () + result;
I ++;
}
Prefix = [0x0, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc];
If (I> prefix. length)
{
I = 5;
}
Result = "" + (code | prefix [I]). toString (16). toUpperCase () + result;
Return result;
}


For example, the unicode character "Han" is 6C49, which is expressed as a large integer and then converted into multi-byte encoding 110110001001001:
Observe the binary code sequence of this integer (001001)
Retrieve from and forward
If the binary sequence has only the last seven digits (less than 128 characters, that is, ascii characters), the last seven digits of the binary number are taken to form an utf8 character.
The binary sequence of the preceding character "Han" is greater than 7 digits. therefore, take the last 6 digits (1001001) and add 10 to form an utf8 byte (10 001001, hexadecimal 89 ).
The remaining binary sequence (110,110001) takes 6 digits from the back forward and adds 10 to form a utf8 byte (10 110001,16th hexadecimal B1 ).
The remaining binary sequence (110) is used to take 6 digits from the back and forward. because there are less than 6 digits, the number is equal to 1110000, and the character 11100110,16hexadecimal E6 is obtained.
Finally, utf8 encoding is obtained. the hexadecimal representation is E6B189.

[Application field]
Although most of these standard algorithms have been implemented by development tool providers or libraries, we still need to implement these algorithms on our own in some cases:
Some browsers (ie5) do not support the encodeURI function, There are two ways to submit Chinese characters using ajax:
Characters such as "% uXXXX" are converted by escape. the server uses the preceding algorithm to convert the unicode sequence number after u into utf8 characters.
Use the above algorithm in combination with escape to implement the encodeURI function directly on the client (this solution is recommended)

Implement the json-rpc service
Json is a javascript object in the form of a direct amount, where the string must be a unicode character, and the Chinese character must be converted to "\ uXXXX. Therefore, we need to convert the server characters in json format. For php, there are now two open source project JSON-PHP and PHP-JSON.
Json-rpc is an rpc protocol in json format, which can be easily used in ajax Projects. json-rpc.org is an open source implementation.

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.