UTF8 Encoding Algorithm (Reprint) _ Basic Knowledge

Source: Internet
Author: User
Tags character set
The Unicode character set is the most comprehensive character set in use, covering almost all of the world's writing systems. In effect, Unicode is one huge table that collects the characters and punctuation marks of all languages and assigns each character a number in a fixed order (unfortunately, for Chinese this order is not alphabetical). With this table, most characters in the world have a corresponding Unicode code (an integer). The computer represents a character by storing its Unicode code and hands that code to the operating system, which maps it to a glyph bitmap in a font and draws the glyph on the screen.
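As a small illustration of the "character to integer" mapping described above, the sketch below looks up the code of one character and converts the integer back to the character. It only uses the standard `charCodeAt` and `String.fromCharCode` functions, which work on UTF-16 code units, so it assumes a character inside the Basic Multilingual Plane; the variable names are illustrative.

```javascript
// Look up the Unicode code of a character (BMP characters only,
// because charCodeAt returns a single UTF-16 code unit).
var ch = "汉";
var code = ch.charCodeAt(0);           // the integer assigned to the character
var hexCode = code.toString(16);       // the same integer written in hexadecimal

// Going the other way: from the integer back to the character.
var roundTrip = String.fromCharCode(0x6C49);

console.log(hexCode, roundTrip);
```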

UTF8 is a commonly used encoding, and using UTF8 consistently in web development avoids most character set problems. UTF8 is a physical implementation of the Unicode character set: it describes how to store a Unicode code (the character's number in the table described above) efficiently as bytes. RFC2044 (http://www.ietf.org/rfc/rfc2044.txt?number=2044) describes the algorithm for converting a Unicode code to UTF8. If the English is hard going, the conversion table below makes the idea clear at a glance:

UCS-4 range (hex.)     UTF-8 octet sequence (binary)
0000 0000-0000 007F    0xxxxxxx
0000 0080-0000 07FF    110xxxxx 10xxxxxx
0000 0800-0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF    1111110x 10xxxxxx ... 10xxxxxx

The left column of the table is the Unicode code in hexadecimal. The value "7FFF FFFF" in the last row is the largest code UTF8 can express, which in decimal is 2147483647 (big enough :)). [Sorry, this was stated incorrectly when the article was first written and has since been corrected.] The right column is the binary layout of the UTF8 bytes, and the conversion rules are straightforward. Here is the algorithm directly (JS code):


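function ToUtf8(code)
{
    // Lead-byte markers, indexed by the number of continuation bytes.
    var prefix = [0x00, 0xC0, 0xE0, 0xF0, 0xF8, 0xFC];
    // Largest value that still fits in the lead byte for each length.
    var limit = [0x7F, 0x1F, 0x0F, 0x07, 0x03, 0x01];
    var i = 0;
    var result = "";
    // Peel off 6 bits at a time from the low end; each group becomes
    // a continuation byte of the form 10xxxxxx.
    while (i < 5 && code > limit[i])
    {
        var ibyte = code % 0x40;
        code = (code - ibyte) / 0x40;
        result = "%" + (ibyte | 0x80).toString(16).toUpperCase() + result;
        i++;
    }
    // The remaining bits go into the lead byte (zero-padded below 0x10).
    var lead = (code | prefix[i]).toString(16).toUpperCase();
    result = "%" + (lead.length < 2 ? "0" : "") + lead + result;
    return result;
}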
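The ranges in the left column of the conversion table decide how many bytes a code needs. A minimal sketch of that lookup is shown below; the function name `utf8Length` is made up for illustration and follows the six-row RFC2044 table used in this article (the later RFC 3629 restricts UTF-8 to codes up to 10FFFF, that is, at most four bytes).

```javascript
// Number of UTF-8 bytes needed for a code, following the
// six-row RFC 2044 table (hypothetical helper for illustration).
function utf8Length(code) {
    if (code < 0x80) return 1;        // 0000 0000-0000 007F
    if (code < 0x800) return 2;       // 0000 0080-0000 07FF
    if (code < 0x10000) return 3;     // 0000 0800-0000 FFFF
    if (code < 0x200000) return 4;    // 0001 0000-001F FFFF
    if (code < 0x4000000) return 5;   // 0020 0000-03FF FFFF
    return 6;                         // 0400 0000-7FFF FFFF
}

console.log(utf8Length(0x41), utf8Length(0x6C49), utf8Length(0x7FFFFFFF));
```

The boundary values are worth checking by hand: 07FF still fits in two bytes, while 0800 is the first code that needs three.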


For example, the Unicode code of the character 汉 ("Han") is 6C49. Written as a binary integer, this code is 110110001001001, and it is converted to a multi-byte encoding as follows:
Split the binary sequence into groups of 6 bits, working from the low-order end toward the high-order end: (110, 110001, 001001).
If the binary sequence has at most 7 digits (the value is less than 128, that is, an ASCII character), the 7 bits are used directly as a single UTF8 byte.
The binary sequence of 汉 is longer than 7 digits, so take the last 6 digits (001001) and put 10 in front of them to form a UTF8 byte (10 001001, hexadecimal 89).
From the remaining sequence (110, 110001), again take the last 6 digits and put 10 in front of them to form a UTF8 byte (10 110001, hexadecimal B1).
The remaining sequence (110) is short enough to fit in the lead byte of a three-byte sequence, so it is ORed with 11100000, giving 11100110, hexadecimal E6.
Putting the bytes together gives the UTF8 encoding, which in hexadecimal is E6B189.
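The worked example above can be reproduced with shifts and masks. The sketch below hard-codes the three-byte case for the code 6C49 only, so it is not a general encoder; the variable names are illustrative.

```javascript
// Three-byte UTF-8 encoding of 0x6C49, one step per byte.
var code = 0x6C49;                      // binary 110 110001 001001

var b3 = 0x80 | (code & 0x3F);          // last 6 bits   -> 10 001001
var b2 = 0x80 | ((code >> 6) & 0x3F);   // middle 6 bits -> 10 110001
var b1 = 0xE0 | (code >> 12);           // leading bits  -> 1110 0110

var hex = [b1, b2, b3].map(function (b) {
    return b.toString(16).toUpperCase();
}).join("");

console.log(hex);
```

In environments that provide `encodeURIComponent`, the result can be cross-checked against the percent-encoded form of the same character.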

Applications
Although developers and libraries have already implemented most standard algorithms like this one, there are still times when we need to implement it ourselves:
Some browsers (IE5) do not support the encodeURI function. In that case there are two options for submitting Chinese characters with Ajax:
      Escape the characters with escape to get strings of the form "%uxxxx", and have the server use the algorithm above to convert the Unicode code after the u to UTF8 bytes.
      Combine escape with the algorithm above to implement the encodeURI function directly on the client (this is the recommended option).
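A rough sketch of the second option is shown below: it takes the output of `escape` and rewrites every "%uxxxx" (and every non-ASCII "%xx") sequence as percent-encoded UTF8 bytes. The names `hex2`, `percentUtf8` and `escapeToUtf8` are hypothetical helpers, not part of any library. The sketch assumes characters in the Basic Multilingual Plane only: surrogate pairs produced by `escape` are not combined, and it does not reproduce the exact set of characters that `encodeURI` leaves unescaped.

```javascript
// Format one byte as "%XX".
function hex2(b) {
    var h = b.toString(16).toUpperCase();
    return "%" + (h.length < 2 ? "0" + h : h);
}

// Percent-encoded UTF-8 for a code in the range 0x80..0xFFFF.
function percentUtf8(code) {
    if (code < 0x800) {
        return hex2(0xC0 | (code >> 6)) + hex2(0x80 | (code & 0x3F));
    }
    return hex2(0xE0 | (code >> 12)) +
           hex2(0x80 | ((code >> 6) & 0x3F)) +
           hex2(0x80 | (code & 0x3F));
}

// Rewrite escape()-style output ("%uXXXX" and "%XX") as UTF-8 percent-encoding.
function escapeToUtf8(escaped) {
    return escaped.replace(/%u([0-9A-Fa-f]{4})|%([0-9A-Fa-f]{2})/g,
        function (match, u4, h2) {
            var code = parseInt(u4 || h2, 16);
            if (code < 0x80) return match;   // ASCII escapes stay as they are
            return percentUtf8(code);
        });
}

console.log(escapeToUtf8("%u6C49%u5B57%20a"));
```

On the client this would be called as `escapeToUtf8(escape(text))`; the same conversion can run on the server if the first option is chosen.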

Implementing JSON-RPC services
JSON is JavaScript's object literal notation. Its strings are Unicode strings, and Chinese characters are written in the "\uxxxx" form, so the characters on the server side have to be converted before they are sent as JSON. For PHP, there are currently two open source projects: json-php and Php-json.
JSON-RPC is an RPC protocol that uses JSON as its data format, and it is convenient to use in Ajax projects. json-rpc.org is an open source implementation.
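The "\uxxxx" form mentioned above can be produced with a few lines of code. The sketch below is written in JavaScript for consistency with the rest of the article, although the same idea applies on the server; `toJsonAscii` is a hypothetical helper name. It is meant to be applied to text that is already valid JSON (for example the output of `JSON.stringify`, where available) and only replaces the non-ASCII characters.

```javascript
// Replace every non-ASCII UTF-16 code unit in a JSON text
// with its "\uxxxx" escape (hypothetical helper for illustration).
function toJsonAscii(jsonText) {
    var out = "";
    for (var i = 0; i < jsonText.length; i++) {
        var c = jsonText.charCodeAt(i);
        if (c < 0x80) {
            out += jsonText.charAt(i);                     // ASCII is copied through
        } else {
            var h = c.toString(16);
            out += "\\u" + "0000".slice(h.length) + h;     // zero-pad to 4 hex digits
        }
    }
    return out;
}

var wire = toJsonAscii(JSON.stringify({ name: "汉" }));
console.log(wire);
```

Because the loop works on UTF-16 code units, a character outside the Basic Multilingual Plane becomes two consecutive "\uxxxx" escapes, which is how JSON represents such characters.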
