UTF-8 encoding algorithm

Source: Internet
Author: User

The Unicode character set is the most comprehensive character set in existence and covers almost every character in use worldwide. It can be thought of as one huge table that collects the characters and punctuation marks of the world's languages and arranges them in a fixed order (unfortunately, that order is not Chinese pinyin order). With this table, almost every character has a Unicode internal code, which is an integer. The computer records a character's Unicode code and hands it to the operating system, which maps the code to a glyph bitmap in a font and displays it on the screen.

UTF-8 is a common encoding. Using UTF-8 consistently in web development avoids most character set problems. UTF-8 is a physical implementation of the Unicode character set: it describes how to store the Unicode internal code (the sequence number of a character in the character set, as mentioned above) efficiently. RFC 2044 (http://www.ietf.org/rfc/rfc2044.txt?number=2044) describes the algorithm for converting an internal code to UTF-8. Even if your English is not good, the conversion table is easy to read:

UCS-4 range (hex)      UTF-8 octet sequence (binary)
0000 0000-0000 007F    0xxxxxxx
0000 0080-0000 07FF    110xxxxx 10xxxxxx
0000 0800-0000 FFFF    1110xxxx 10xxxxxx 10xxxxxx
0001 0000-001F FFFF    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
0020 0000-03FF FFFF    111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
0400 0000-7FFF FFFF    1111110x 10xxxxxx ... 10xxxxxx
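
The upper bound of each row follows from counting the x bits in its pattern. The sketch below derives those bounds; the helper name maxCodeForLength is hypothetical and not part of the original article.

```javascript
// Hypothetical helper: largest code value that fits in an n-byte
// sequence of the table above (n from 1 to 6).
function maxCodeForLength(n) {
    // A 1-byte sequence has 7 payload bits. An n-byte sequence has
    // (7 - n) bits in the lead byte plus 6 bits per continuation byte.
    var bits = (n === 1) ? 7 : (7 - n) + 6 * (n - 1);
    return Math.pow(2, bits) - 1;
}

for (var n = 1; n <= 6; n++) {
    console.log(n + " byte(s): up to 0x" + maxCodeForLength(n).toString(16).toUpperCase());
}
```

For n = 6 this gives 31 payload bits, which is where the last row's upper bound of 7FFF FFFF comes from.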

The left column of the table is the Unicode internal code in hexadecimal. The value "7FFF FFFF" in the last row is the largest internal code this scheme can represent; in decimal it is 2147483647, which is large enough :) [This figure was wrong in the original version of this article and has been corrected.] Note that RFC 3629, which later replaced RFC 2044, restricts UTF-8 to code points up to 0010 FFFF, so only the first four rows are used in practice today. The right column is the binary layout of the UTF-8 bytes, and the conversion rule can be read directly from it. Here is the algorithm (JavaScript code):

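// Convert a Unicode internal code (an integer) to percent-encoded UTF-8.
function toUTF8(code)
{
    // Lead-byte prefixes for sequences of 1 to 6 bytes (see the table).
    var prefix = [0x0, 0xc0, 0xe0, 0xf0, 0xf8, 0xfc];
    // Largest value the lead byte can still hold for each sequence length.
    var maxLead = [0x7f, 0x1f, 0x0f, 0x07, 0x03, 0x01];
    var iByte = 0;
    var i = 0; // number of continuation bytes produced so far
    var result = "";
    // Peel off 6 bits at a time until the rest fits into the lead byte.
    while (i < 5 && code > maxLead[i])
    {
        iByte = code % 0x40;
        code = (code - iByte) / 0x40;
        result = "%" + (iByte | 0x80).toString(16).toUpperCase() + result;
        i++;
    }
    var lead = (code | prefix[i]).toString(16).toUpperCase();
    if (lead.length < 2)
    {
        lead = "0" + lead; // keep two hex digits for small ASCII codes
    }
    result = "%" + lead + result;
    return result;
}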
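
Where a modern engine is available, the built-in encodeURIComponent produces percent-encoded UTF-8 as well, so it can serve as a cross-check for a hand-written encoder. The following is only a sketch: the helper name builtinUTF8 is hypothetical, and String.fromCodePoint assumes an ES2015 or later engine.

```javascript
// Hypothetical cross-check helper: percent-encoded UTF-8 of one code point,
// computed by the engine's built-in encoder.
function builtinUTF8(codePoint) {
    return encodeURIComponent(String.fromCodePoint(codePoint));
}

console.log(builtinUTF8(0x6C49)); // %E6%B1%89
console.log(builtinUTF8(0x4E25)); // %E4%B8%A5
```

Two differences are worth keeping in mind: encodeURIComponent leaves unreserved ASCII characters such as letters and digits unescaped, and it only accepts code points up to 0x10FFFF.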

For example, the Unicode code of the Chinese character "汉" (Han) is 6C49. Treated as an integer, its binary form is 110110001001001, and it is converted to a multi-byte encoding as follows:
Take the binary sequence of this integer (110 110001 001001) and consume it from the back to the front.
If the sequence has at most seven significant bits (a value below 128, that is, an ASCII character), those seven bits form a single UTF-8 byte.
The sequence for "汉" is longer than seven bits, so take the last 6 bits (001001) and prefix them with 10 to form a UTF-8 byte (10 001001, hexadecimal 89).
From the remaining sequence (110 110001), again take the last 6 bits and prefix them with 10 to form a UTF-8 byte (10 110001, hexadecimal B1).
The remaining sequence (110) has fewer than 6 bits, so it is combined with the lead-byte pattern 11100000, which gives 11100110, hexadecimal E6.
The final UTF-8 encoding, in hexadecimal, is E6B189.
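
The same steps can be written with bit operations. This sketch covers only the three-byte case (codes from 0x0800 to 0xFFFF), and the variable names are illustrative.

```javascript
// Three-byte case only: valid for codes 0x0800 to 0xFFFF.
var code = 0x6C49;                       // the character discussed above
var byte3 = 0x80 | (code & 0x3F);        // last 6 bits   -> 10 001001
var byte2 = 0x80 | ((code >> 6) & 0x3F); // middle 6 bits -> 10 110001
var byte1 = 0xE0 | (code >> 12);         // leading bits  -> 1110 0110
var hex = [byte1, byte2, byte3].map(function (b) {
    return b.toString(16).toUpperCase();
}).join("");
console.log(hex); // E6B189
```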

[Application field]
Although development tools and libraries already implement most of these standard algorithms, there are still cases where we need to implement them ourselves:
Some browsers (IE5) do not support the encodeURI function. There are two ways to submit Chinese characters using Ajax:
Convert the characters with escape, which produces sequences of the form "%uXXXX". The server then uses the algorithm above to convert the Unicode number after the "u" into UTF-8 bytes.
Use the algorithm above together with escape to implement encodeURI directly on the client (this solution is recommended).
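
As a rough sketch of the conversion step, the code below rewrites "%uXXXX" sequences as percent-encoded UTF-8. The names pct, encodeUnit and escapedToUTF8 are hypothetical. It makes two simplifying assumptions: each "%uXXXX" is treated as an independent code in the range 0x0000 to 0xFFFF (surrogate pairs are not combined), and the "%XX" sequences that escape emits for characters 0x80 to 0xFF are left untouched.

```javascript
// Format one byte as a percent-encoded pair such as "%E6".
function pct(b) {
    return "%" + (b < 16 ? "0" : "") + b.toString(16).toUpperCase();
}

// Encode one code in the range 0x0000-0xFFFF as percent-encoded UTF-8.
function encodeUnit(code) {
    if (code < 0x80) return pct(code);
    if (code < 0x800) return pct(0xC0 | (code >> 6)) + pct(0x80 | (code & 0x3F));
    return pct(0xE0 | (code >> 12)) +
           pct(0x80 | ((code >> 6) & 0x3F)) +
           pct(0x80 | (code & 0x3F));
}

// Hypothetical helper: rewrite each "%uXXXX" sequence produced by escape().
function escapedToUTF8(escaped) {
    return escaped.replace(/%u([0-9A-Fa-f]{4})/g, function (match, hex) {
        return encodeUnit(parseInt(hex, 16));
    });
}

console.log(escapedToUTF8("name=%u6C49%u5B57")); // name=%E6%B1%89%E5%AD%97
```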

Implement the JSON-RPC service
JSON represents data as JavaScript object literals. Its strings are Unicode text, and non-ASCII characters such as Chinese characters can be written as "\uXXXX" escapes, which keeps the payload safe for transports that only handle ASCII. The server therefore needs to convert its characters when producing JSON. For PHP, there are currently two open source projects: JSON-PHP and PHP-JSON.
JSON-RPC is an RPC protocol that uses the JSON format and is easy to use in Ajax projects. json-rpc.org is an open source implementation.
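
A minimal sketch of the "\uXXXX" conversion, written in JavaScript for illustration rather than PHP. The helper name escapeNonASCII is hypothetical; it works on UTF-16 code units, so a character outside the basic plane comes out as two escapes, which JSON permits.

```javascript
// Hypothetical helper: replace every non-ASCII UTF-16 code unit with \uXXXX.
function escapeNonASCII(text) {
    return text.replace(/[\u0080-\uffff]/g, function (ch) {
        var hex = ch.charCodeAt(0).toString(16);
        return "\\u" + "0000".slice(hex.length) + hex;
    });
}

// "\u6C49\u5B57" is the two-character string for U+6C49 and U+5B57.
var json = escapeNonASCII(JSON.stringify({ name: "\u6C49\u5B57" }));
console.log(json); // {"name":"\u6c49\u5b57"}
console.log(JSON.parse(json).name === "\u6C49\u5B57"); // true
```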

Reference 2:

5 Unicode encoding
5.1 Usage

A standard used to encode all characters.
5.2 Overview

An industry standard that assigns codes to the characters of all the world's languages; it can represent about 1 million different symbols.

The latest version at the time of writing was Unicode 5.0.
5.3 Basic principles

Unicode itself only assigns codes; a mapping method defines how those codes are represented as bytes. There are two families of mapping methods: UTF (Unicode Transformation Format) encodings and UCS (Universal Character Set) encodings. UTF-8 and UTF-16 are the two most widely used encodings.

UTF-8 is a superset of ASCII. Its most distinctive feature is that it is a variable-length encoding: it uses 1 to 4 bytes to represent a symbol, and the number of bytes depends on the symbol.

The UTF-8 encoding rules are simple. There are only two:

1) For a single-byte symbol, the first bit of the byte is set to 0, and the remaining seven bits hold the Unicode code of the symbol. For English letters, therefore, the UTF-8 encoding is the same as the ASCII code.

2) For an n-byte symbol (n > 1), the first n bits of the first byte are set to 1 and bit n + 1 is set to 0; the first two bits of every following byte are set to 10. All remaining bits hold the Unicode code of the symbol.

The following table summarizes the encoding rules. The letter x marks the bits available for encoding.

Unicode symbol range  | UTF-8 encoding
(hexadecimal)         | (binary)
----------------------+---------------------------------------------
0000 0000-0000 007F   | 0xxxxxxx
0000 0080-0000 07FF   | 110xxxxx 10xxxxxx
0000 0800-0000 FFFF   | 1110xxxx 10xxxxxx 10xxxxxx
0001 0000-0010 FFFF   | 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
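
Read in the other direction, the two rules let a decoder tell from the first byte alone how long a sequence is. A minimal sketch, with the hypothetical helper name sequenceLength:

```javascript
// Hypothetical helper: number of bytes in the sequence that starts with
// this byte, or 0 if the byte is a continuation byte (10xxxxxx) or does
// not match any row of the table.
function sequenceLength(firstByte) {
    if ((firstByte & 0x80) === 0x00) return 1; // 0xxxxxxx
    if ((firstByte & 0xE0) === 0xC0) return 2; // 110xxxxx
    if ((firstByte & 0xF0) === 0xE0) return 3; // 1110xxxx
    if ((firstByte & 0xF8) === 0xF0) return 4; // 11110xxx
    return 0;
}

console.log(sequenceLength(0x41)); // 1
console.log(sequenceLength(0xE4)); // 3
console.log(sequenceLength(0xB8)); // 0
```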

The following uses the Chinese character "严" (yan, "strict") as an example to show how UTF-8 encoding works.

The Unicode code of "严" is 4E25 (binary 100111000100101). According to the table above, 4E25 falls in the range of the third row (0000 0800-0000 FFFF), so the UTF-8 encoding of "严" needs three bytes, in the format "1110xxxx 10xxxxxx 10xxxxxx". Then, starting from the last binary bit of "严", fill in the x positions of the format from back to front, and fill any leftover x positions with 0. The UTF-8 encoding of "严" is therefore "11100100 10111000 10100101", which is E4B8A5 in hexadecimal.
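
The fill-in procedure can be followed literally on strings. This is an illustrative sketch; the function name fillTemplate is hypothetical.

```javascript
// Fill the "x" slots of a UTF-8 template with the bits of a code,
// working from the last bit backwards and padding leftover slots with 0.
function fillTemplate(template, code) {
    var bits = code.toString(2).split("");
    var out = template.split("");
    for (var i = out.length - 1; i >= 0; i--) {
        if (out[i] === "x") {
            out[i] = bits.length > 0 ? bits.pop() : "0";
        }
    }
    return out.join("");
}

var filled = fillTemplate("1110xxxx 10xxxxxx 10xxxxxx", 0x4E25);
console.log(filled); // 11100100 10111000 10100101

var hex = filled.split(" ").map(function (b) {
    return parseInt(b, 2).toString(16).toUpperCase();
}).join("");
console.log(hex); // E4B8A5
```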

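When a code such as 4E25 is stored directly in two bytes, the byte order matters. Storing the first byte first (4E 25) is the "big endian" way; storing the second byte first (25 4E) is the "little endian" way.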

Naturally, a question arises: how does a computer know which of the two byte orders a file uses?

The Unicode specification defines a character that is placed at the very beginning of a file to indicate the byte order. It is called "zero width no-break space" and has the code FEFF. This is exactly two bytes, and FF is 1 larger than FE.

If the first two bytes of a text file are FE FF, the file is big endian. If the first two bytes are FF FE, the file is little endian.
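
A minimal sketch of that check. The helper name detectByteOrder is hypothetical, and the input is assumed to be an array of byte values already read from the file.

```javascript
// Hypothetical helper: inspect the first two bytes of a file's contents.
function detectByteOrder(bytes) {
    if (bytes.length >= 2 && bytes[0] === 0xFE && bytes[1] === 0xFF) {
        return "big endian";
    }
    if (bytes.length >= 2 && bytes[0] === 0xFF && bytes[1] === 0xFE) {
        return "little endian";
    }
    return "unknown"; // no byte order mark at the start
}

console.log(detectByteOrder([0xFE, 0xFF, 0x4E, 0x25])); // big endian
console.log(detectByteOrder([0xFF, 0xFE, 0x25, 0x4E])); // little endian
```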

Conversion between Unicode and UTF-8

As the example shows, the Unicode code of "严" is 4E25 while its UTF-8 encoding is E4B8A5; the two are different. A program can convert between them.

On Windows, one of the simplest ways to convert is the built-in Notepad program, notepad.exe. After opening a file, choose "Save As" from the "File" menu. A dialog box appears with an "Encoding" drop-down list at the bottom.

There are four options: ANSI, Unicode, Unicode big endian and UTF-8.

1) ANSI is the default encoding. For English files it is ASCII; for Simplified Chinese files it is GB2312 (this applies only to the Simplified Chinese version of Windows; the Traditional Chinese version uses Big5).

2) Unicode refers to the UCS-2 encoding, which stores a character's Unicode code directly in two bytes. This option uses the little endian format.

3) Unicode big endian corresponds to the previous option, but uses the big endian byte order.

4) UTF-8 is the encoding described above.

After selecting the encoding, click the "Save" button and the file's encoding is converted immediately.

Open notepad.exe, create a text file whose content is the single character "严", and save it in turn with the ANSI, Unicode, Unicode big endian and UTF-8 encodings.

Then use the hex mode of the text editor UltraEdit to inspect how each file is encoded internally.

1) ANSI: the file consists of two bytes, "D1 CF", which is the GB2312 encoding of "严". This also implies that GB2312 is stored big endian.

2) Unicode: the file consists of four bytes, "FF FE 25 4E". "FF FE" indicates little endian storage, and the actual code is 4E25.

3) Unicode big endian: the file consists of four bytes, "FE FF 4E 25". "FE FF" indicates big endian storage.

4) UTF-8: the file consists of six bytes, "EF BB BF E4 B8 A5". The first three bytes, "EF BB BF", indicate that this is UTF-8; the last three, "E4 B8 A5", are the encoding of "严", stored in the same order as the encoding itself.
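
The Unicode, Unicode big endian and UTF-8 dumps can be reproduced without Notepad. The sketch below uses hypothetical helper names (toHex, utf16WithBOM), handles only characters that fit in one 16-bit unit, and assumes an environment that provides TextEncoder (current browsers and Node.js). The ANSI case is not covered because it needs a GB2312 lookup table.

```javascript
// Format an array of bytes as "FF FE 25 4E".
function toHex(bytes) {
    return bytes.map(function (b) {
        return (b < 16 ? "0" : "") + b.toString(16).toUpperCase();
    }).join(" ");
}

// Two-byte encoding of a string, preceded by the byte order mark FEFF.
function utf16WithBOM(text, littleEndian) {
    var units = [0xFEFF];
    for (var i = 0; i < text.length; i++) {
        units.push(text.charCodeAt(i));
    }
    var bytes = [];
    units.forEach(function (u) {
        var hi = u >> 8, lo = u & 0xFF;
        if (littleEndian) { bytes.push(lo, hi); } else { bytes.push(hi, lo); }
    });
    return bytes;
}

var yan = "\u4E25"; // the character used in the example above
console.log(toHex(utf16WithBOM(yan, true)));  // FF FE 25 4E
console.log(toHex(utf16WithBOM(yan, false))); // FE FF 4E 25

// TextEncoder always produces UTF-8; the mark is encoded like any character.
var utf8 = Array.from(new TextEncoder().encode("\uFEFF" + yan));
console.log(toHex(utf8)); // EF BB BF E4 B8 A5
```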
5.4 Encoding and decoding code
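
The source gives no code under this heading. As a stand-in, here is a minimal sketch of an encoder and decoder for the four-row table in 5.3. The function names encodeCodePoint and decodeBytes are hypothetical, and the decoder is deliberately lenient: it checks the lead and continuation bit patterns but does not reject overlong forms or surrogate values, which a strict decoder must do.

```javascript
// Encode one code point (0 to 0x10FFFF) into an array of UTF-8 bytes.
function encodeCodePoint(cp) {
    if (cp < 0x80) return [cp];
    if (cp < 0x800) return [0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)];
    if (cp < 0x10000) {
        return [0xE0 | (cp >> 12), 0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
    }
    return [0xF0 | (cp >> 18), 0x80 | ((cp >> 12) & 0x3F),
            0x80 | ((cp >> 6) & 0x3F), 0x80 | (cp & 0x3F)];
}

// Decode an array of UTF-8 bytes into an array of code points.
function decodeBytes(bytes) {
    var out = [];
    var i = 0;
    while (i < bytes.length) {
        var b = bytes[i++];
        var extra, cp;
        if (b < 0x80) { extra = 0; cp = b; }
        else if ((b & 0xE0) === 0xC0) { extra = 1; cp = b & 0x1F; }
        else if ((b & 0xF0) === 0xE0) { extra = 2; cp = b & 0x0F; }
        else if ((b & 0xF8) === 0xF0) { extra = 3; cp = b & 0x07; }
        else throw new Error("unexpected byte at offset " + (i - 1));
        while (extra-- > 0) {
            var c = bytes[i++];
            if (c === undefined || (c & 0xC0) !== 0x80) {
                throw new Error("bad continuation byte");
            }
            cp = (cp << 6) | (c & 0x3F); // append 6 payload bits
        }
        out.push(cp);
    }
    return out;
}

console.log(encodeCodePoint(0x4E25));         // bytes E4 B8 A5
console.log(decodeBytes([0xE4, 0xB8, 0xA5])); // one code point, 0x4E25
```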


Reference 3:
Character set code ranges:
var _charset = {
    'cjk': ['\u4e00', '\u9fa5'], // Chinese characters [一-龥]
    'num': ['\u0030', '\u0039'], // digits [0-9]
    'lal': ['\u0061', '\u007a'], // lowercase letters [a-z]
    'ual': ['\u0041', '\u005a'], // uppercase letters [A-Z]
    'asc': ['\u0020', '\u007e']  // printable ASCII characters
};
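
One possible use of such a table is to test whether a string stays inside a given range. The sketch below repeats the ranges so that it is self-contained; the names ranges and inRange are hypothetical.

```javascript
// Ranges repeated from the table above so the example runs on its own.
var ranges = {
    cjk: ['\u4e00', '\u9fa5'],
    num: ['\u0030', '\u0039'],
    lal: ['\u0061', '\u007a'],
    ual: ['\u0041', '\u005a'],
    asc: ['\u0020', '\u007e']
};

// Hypothetical helper: true if text is non-empty and every character
// lies within the named range (single characters compare by code unit).
function inRange(text, name) {
    var r = ranges[name];
    for (var i = 0; i < text.length; i++) {
        var ch = text.charAt(i);
        if (ch < r[0] || ch > r[1]) return false;
    }
    return text.length > 0;
}

console.log(inRange("\u4e25", "cjk")); // true
console.log(inRange("2007", "num"));   // true
console.log(inRange("utf8", "lal"));   // false, "8" is not a lowercase letter
```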
