How to convert between Unicode and UTF-8 encoding in PHP

Source: Internet
Author: User
This article shows how to implement Unicode encoding and decoding of strings in PHP. When I needed Unicode conversion, I looked through the PHP library functions and did not find one that encodes and decodes a character string this way (the mbstring extension, where installed, provides mb_convert_encoding() and, since PHP 7.2, mb_ord() and mb_chr()). If you cannot find one, implement it yourself...
Differences between Unicode and UTF-8 encoding

Unicode is a character set that assigns a number (code point) to every character, while UTF-8 is one of the encodings of Unicode. The "Unicode encoding" discussed in this article is the fixed two-byte form (UCS-2), which covers the 16-bit BMP (Basic Multilingual Plane); UTF-8 is variable-length. A Chinese character in the BMP therefore takes two bytes in the two-byte form and three bytes in UTF-8, one byte more.
As originally designed, a UTF-8 character could be up to 6 bytes long; the current standard (RFC 3629) limits it to 4 bytes, enough to reach U+10FFFF. BMP characters need at most 3 bytes. Let's take a look at the original UTF-8 encoding table:

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The x positions are filled with the bits of the character's code number in binary; the further right an x is, the less significant the bit it holds. Only the shortest multi-byte sequence that can represent the code number may be used. Note that in a multi-byte sequence, the number of leading 1 bits in the first byte equals the number of bytes in the whole sequence. The first row starts with 0 to stay ASCII-compatible and is a single byte. The second row is a two-byte sequence, the third row is a three-byte sequence (Chinese characters, for example), and so on. (Put simply, the count of leading 1s in the first byte is the byte count.)
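The leading-1 rule can be sketched in PHP as below. This is only an illustration: both function names are hypothetical helpers, not part of the article or of PHP, and they cover just the first four rows of the table (the range that current UTF-8 allows). Input is not validated beyond that.

```php
<?php
// Hypothetical helper (not from the article): how many bytes a UTF-8
// sequence has, judged only by its first byte. Returns 0 for a
// continuation byte (10xxxxxx) or a lead byte beyond the four-byte row.
function utf8_length_from_lead_byte(int $byte): int
{
    if (($byte & 0x80) === 0x00) { return 1; } // 0xxxxxxx
    if (($byte & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    if (($byte & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($byte & 0xF8) === 0xF0) { return 4; } // 11110xxx
    return 0;
}

// Hypothetical helper: the shortest sequence length for a code point,
// following the ranges in the first four rows of the table.
function utf8_bytes_needed(int $code): int
{
    if ($code < 0x80)    { return 1; }
    if ($code < 0x800)   { return 2; }
    if ($code < 0x10000) { return 3; }
    return 4;
}

echo utf8_length_from_lead_byte(0xE4), "\n"; // 3 (0xE4 = 11100100)
echo utf8_bytes_needed(0x4F60), "\n";        // 3
```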

How to convert Unicode to UTF-8?

To convert Unicode to UTF-8, you first need to know how the two differ. In UTF-8, a character whose value is less than 0x80 (128) is an ASCII character and occupies one byte; it needs no conversion because UTF-8 is compatible with ASCII. Now take the Chinese character "你", whose Unicode code is "\u4F60". Write it in binary, 100111101100000, and then fill it into the UTF-8 pattern. Take the bits from the low end to the high end, six at a time; the last (highest) group gets whatever bits remain and is padded on the left with 0. The binary above splits as follows, and adding the 1110 and 10 prefixes of the three-byte row gives the UTF-8 bytes:

Unicode: 0100 111101 100000 (4F60)

UTF-8: 11100100 10111101 10100000 (E4BDA0)
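The grouping shown above can be reproduced with a few string operations. The sketch below is purely illustrative and assumes a code point in the three-byte range (0x0800 to 0xFFFF); the function name is hypothetical.

```php
<?php
// Hypothetical helper: split a 16-bit code point into the 4 + 6 + 6 bit
// groups of the three-byte pattern and prepend the 1110 / 10 prefixes.
// Returns the three bytes as binary strings.
function three_byte_bit_layout(int $code): array
{
    // Left-pad to 16 bits so the highest group always has 4 digits.
    $bits = str_pad(decbin($code), 16, '0', STR_PAD_LEFT);
    return [
        '1110' . substr($bits, 0, 4),  // lead byte: highest 4 bits
        '10'   . substr($bits, 4, 6),  // continuation: middle 6 bits
        '10'   . substr($bits, 10, 6), // continuation: lowest 6 bits
    ];
}

$bytes = three_byte_bit_layout(0x4F60);
echo implode(' ', $bytes), "\n"; // 11100100 10111101 10100000

// Convert each binary string to two hex digits.
$hex = '';
foreach ($bytes as $b) {
    $hex .= sprintf('%02X', bindec($b));
}
echo $hex, "\n"; // E4BDA0
```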

The conversion from Unicode to UTF-8 is easy to see from the above. Knowing the UTF-8 format, you can also perform the inverse operation: extract the bits from their positions in each byte according to the format and reassemble them into the Unicode code. Both directions can be done with bit shifts. For example, because the code of "你" is at least 0x800 and less than 0x10000, it is stored in three bytes. For the first byte, shift the code right by 12 bits and OR (|) it with 11100000 (0xE0), the lead-byte pattern of the three-byte format. For the second byte, shift the code right by 6 bits; the result still carries the bits that went into the first byte, so AND (&) it with 111111 (0x3F) to keep only six bits, then OR it with 10000000 (0x80). The third byte needs no shift: AND the code with 111111 (0x3F) to keep the lowest six bits, then OR it with 10000000 (0x80).
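The same shift-and-mask steps extend to the other rows of the table. The following is a minimal sketch, not the article's own function: codepoint_to_utf8 is a hypothetical name, and the sketch assumes the four-byte limit of current UTF-8 and rejects surrogates.

```php
<?php
// Hypothetical helper (an assumption, not from the article): encode one
// Unicode code point as a UTF-8 byte string, one to four bytes long.
function codepoint_to_utf8(int $code): string
{
    if ($code < 0 || $code > 0x10FFFF || ($code >= 0xD800 && $code <= 0xDFFF)) {
        throw new InvalidArgumentException('Not a valid Unicode scalar value');
    }
    if ($code < 0x80) {
        return chr($code); // ASCII: stored as-is
    }
    if ($code < 0x800) {
        return chr(0xC0 | ($code >> 6))
             . chr(0x80 | ($code & 0x3F));
    }
    if ($code < 0x10000) {
        return chr(0xE0 | ($code >> 12))
             . chr(0x80 | (($code >> 6) & 0x3F))
             . chr(0x80 | ($code & 0x3F));
    }
    return chr(0xF0 | ($code >> 18))
         . chr(0x80 | (($code >> 12) & 0x3F))
         . chr(0x80 | (($code >> 6) & 0x3F))
         . chr(0x80 | ($code & 0x3F));
}

echo bin2hex(codepoint_to_utf8(0x4F60)), "\n"; // e4bda0
```

Taking an integer rather than a hex string keeps the bitwise operations free of conversions; a caller holding '4f60' can pass hexdec('4f60').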

How to convert UTF-8 back to Unicode?

The conversion from UTF-8 to Unicode is also done with shifts: pull the payload bits out of their positions in the UTF-8 format. In the example above, "你" is three bytes, so each byte is processed in turn, from high to low. In UTF-8, "你" is 11100100 10111101 10100000. Starting from the high end, the first byte 11100100 must yield "0100". This is simple: AND (&) it with 1111 (0x0F). (The mask 11111 (0x1F) gives the same result here, because the extra bit it keeps is always 0 in a three-byte lead byte.) Because each of the two following bytes carries six bits, these four bits belong above the lowest 12 bits, so shift the result left by 12; this gives 0100 000000 000000. The second byte must yield "111101": AND 10111101 with 111111 (0x3F), shift the result left by 6 bits, and OR (|) it with the value from the first byte, giving 0100 111101 000000. Finally, AND the third byte with 111111 (0x3F) to get "100000" and OR it with the previous result, giving 0100 111101 100000, which is 4F60.
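To make the intermediate values visible, here is a small illustrative trace of the three steps. The function name is hypothetical, and the sketch assumes its input is exactly one three-byte UTF-8 character.

```php
<?php
// Hypothetical helper: return the partial result after each byte of a
// three-byte UTF-8 character has been merged in.
function decode_three_byte_steps(string $char): array
{
    $step1 = (ord($char[0]) & 0x0F) << 12;           // payload of the lead byte
    $step2 = $step1 | ((ord($char[1]) & 0x3F) << 6); // add the middle six bits
    $step3 = $step2 | (ord($char[2]) & 0x3F);        // add the lowest six bits
    return [$step1, $step2, $step3];
}

// "\xE4\xBD\xA0" is the UTF-8 byte sequence E4 BD A0 from the example.
foreach (decode_three_byte_steps("\xE4\xBD\xA0") as $value) {
    printf("%016b  0x%04X\n", $value, $value);
}
// 0100000000000000  0x4000
// 0100111101000000  0x4F40
// 0100111101100000  0x4F60
```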

PHP code implementation:

/**
 * Convert a single three-byte UTF-8 character to its Unicode code.
 * @param string $utf8_str One UTF-8 character (three bytes)
 * @return string Unicode code in hexadecimal
 */
function utf8_str_to_unicode($utf8_str) {
    // 0x0F keeps the four payload bits of the lead byte 1110xxxx
    $unicode = (ord($utf8_str[0]) & 0x0F) << 12;
    $unicode |= (ord($utf8_str[1]) & 0x3F) << 6;
    $unicode |= (ord($utf8_str[2]) & 0x3F);
    return dechex($unicode);
}

/**
 * Convert a Unicode code (0x0800 to 0xFFFF) to a UTF-8 character.
 * @param string $unicode_str Unicode code in hexadecimal
 * @return string UTF-8 character
 */
function unicode_str_to_utf8($unicode_str) {
    // The code must be an integer for the bitwise operations to work
    $code = intval(hexdec($unicode_str));
    $ord_1 = 0xE0 | ($code >> 12);
    $ord_2 = 0x80 | (($code >> 6) & 0x3F);
    $ord_3 = 0x80 | ($code & 0x3F);
    return chr($ord_1) . chr($ord_2) . chr($ord_3);
}

Test:

$utf8_str = '我';
// This is the Unicode code of the Chinese character "你"
$unicode_str = '4f60';
// Outputs 6211
echo utf8_str_to_unicode($utf8_str) . "\n";
// Outputs the Chinese character "你"
echo unicode_str_to_utf8($unicode_str);

These conversions were tested with Chinese (non-ASCII) characters and only support converting a single character at a time (one complete UTF-8 character or one complete Unicode code). I hope this is helpful for your learning.
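One possible way around the single-character limit is to walk the string byte by byte and use each lead byte to decide how many bytes to consume. The sketch below rests on assumptions: utf8_string_to_hex_codes is a hypothetical name, the input is expected to be UTF-8 of at most four bytes per character, and overlong forms are not rejected.

```php
<?php
// Hypothetical helper: decode every character of a UTF-8 string and
// return the code points as lowercase hex strings.
function utf8_string_to_hex_codes(string $s): array
{
    $codes = [];
    $n = strlen($s);
    $i = 0;
    while ($i < $n) {
        $b0 = ord($s[$i]);
        if ($b0 < 0x80) {
            $len = 1; $code = $b0;          // 0xxxxxxx
        } elseif (($b0 & 0xE0) === 0xC0) {
            $len = 2; $code = $b0 & 0x1F;   // 110xxxxx
        } elseif (($b0 & 0xF0) === 0xE0) {
            $len = 3; $code = $b0 & 0x0F;   // 1110xxxx
        } elseif (($b0 & 0xF8) === 0xF0) {
            $len = 4; $code = $b0 & 0x07;   // 11110xxx
        } else {
            throw new InvalidArgumentException("Invalid lead byte at offset $i");
        }
        if ($i + $len > $n) {
            throw new InvalidArgumentException("Truncated sequence at offset $i");
        }
        // Each continuation byte contributes its low six bits.
        for ($k = 1; $k < $len; $k++) {
            $b = ord($s[$i + $k]);
            if (($b & 0xC0) !== 0x80) {
                throw new InvalidArgumentException("Invalid continuation byte");
            }
            $code = ($code << 6) | ($b & 0x3F);
        }
        $codes[] = sprintf('%04x', $code);
        $i += $len;
    }
    return $codes;
}

// The bytes below are the UTF-8 forms of "你" and "我", followed by "A".
echo implode(' ', utf8_string_to_hex_codes("\xE4\xBD\xA0\xE6\x88\x91" . "A")), "\n";
// 4f60 6211 0041
```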
