Converting between Unicode and UTF-8 encoding in PHP
Recently I needed to convert between Unicode codes and UTF-8. I looked through the PHP library functions and did not find one that encodes and decodes strings this way. If you cannot find one, implement it yourself...
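Differences between Unicode and UTF-8 encoding
Unicode is a character set that assigns every character a numeric code, while UTF-8 is one of the encodings of Unicode. What is loosely called "Unicode encoding" here is the two-byte form (UCS-2, or UTF-16 for characters in the Basic Multilingual Plane), which always uses two bytes per character, while UTF-8 is variable-length. For Chinese characters, the two-byte form takes one byte less than UTF-8: two bytes instead of three.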
In the original design, a UTF-8 encoded character can be up to 6 bytes long (RFC 3629 later restricted UTF-8 to 4 bytes, covering codes up to U+10FFFF), but 16-bit BMP (Basic Multilingual Plane) characters need at most 3 bytes. Let's take a look at the UTF-8 encoding table:
U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
The x positions are filled with the bits of the character's code in binary. The further right an x is, the less significant the bit it holds. Only the shortest multi-byte sequence that can represent the code should be used. Note that in a multi-byte sequence, the number of leading 1 bits in the first byte is the number of bytes in the whole sequence. The first row starts with 0 to stay ASCII-compatible and is a single byte. The second row is a two-byte sequence, the third row is a three-byte sequence (used for Chinese characters, for example), and so on. (In my opinion, we can simply treat the number of leading 1s in the first byte as the number of bytes.)
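The leading-1 rule can be sketched in PHP roughly as follows. The function name utf8_sequence_length is my own (it is not part of the code later in this post), and it only looks at the first byte without checking the bytes that follow.

```php
<?php
// Hypothetical helper: number of bytes in a UTF-8 sequence, read from its
// first byte. Returns 0 when the byte cannot start a sequence.
function utf8_sequence_length(string $lead): int
{
    $byte = ord($lead[0]);
    if (($byte & 0x80) === 0x00) {
        return 1; // 0xxxxxxx: a single ASCII byte
    }
    $count = 0;
    // Count consecutive 1 bits starting from the most significant bit.
    for ($mask = 0x80; $mask !== 0 && ($byte & $mask) !== 0; $mask >>= 1) {
        $count++;
    }
    // One leading 1 (10xxxxxx) is a continuation byte; more than six
    // leading 1s does not appear in the table above.
    return ($count >= 2 && $count <= 6) ? $count : 0;
}
```

For instance, utf8_sequence_length("\xE4") gives 3, because 0xE4 is 11100100 and starts with three 1 bits.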
How to convert Unicode to UTF-8?
To convert Unicode to UTF-8, you first need to know where they differ. In UTF-8, a character whose code is less than 0x80 (128) is an ASCII character and occupies one byte; it needs no conversion, because UTF-8 is compatible with ASCII. For other characters, take the Chinese character 你 ("you") as an example. Its Unicode code is \u4F60. Convert that to binary, 100111101100000, and then arrange the bits according to the UTF-8 format. The bits are taken from the low end to the high end, 6 at a time, and placed into the x positions; if the bits left over for the first byte do not fill all of its x positions, pad them with 0. The binary above is split up as follows:
unicode: 100111101100000               4F60
utf-8:   11100100,10111101,10100000    E4BDA0
The above shows quite intuitively how Unicode is converted to UTF-8. Of course, once you know the UTF-8 format you can also do the inverse: extract the bits from the corresponding positions according to the format and turn them back into the Unicode code. Both directions can be done with bit shifts. In the example, the code of 你 is greater than 0x800 and less than 0x10000, so it is stored in three bytes. For the first (highest) byte, shift the code right by 12 bits and OR (|) the result with the three-byte lead pattern 11100000 (0xE0). For the second byte, shift the code right by 6 bits; the bits that belong to the first byte are still there, so AND (&) with 111111 (0x3F) to keep only the low six bits, then OR (|) with 10000000 (0x80). The third byte needs no shift: AND (&) the code with 111111 (0x3F) to take the last six bits, then OR (|) with 10000000 (0x80).
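Here is a small sketch of those three steps applied to 0x4F60, one byte at a time. The variable names are only for illustration.

```php
<?php
// Encode U+4F60 into three UTF-8 bytes, one step at a time.
$code = 0x4F60;

$byte1 = 0xE0 | ($code >> 12);          // 1110xxxx: top 4 bits of the code
$byte2 = 0x80 | (($code >> 6) & 0x3F);  // 10xxxxxx: middle 6 bits
$byte3 = 0x80 | ($code & 0x3F);         // 10xxxxxx: low 6 bits

printf("%08b %08b %08b\n", $byte1, $byte2, $byte3); // 11100100 10111101 10100000
printf("%02X%02X%02X\n", $byte1, $byte2, $byte3);   // E4BDA0
```

The printed bits match the split shown in the example above.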
How to convert UTF-8 back to Unicode?
Converting from UTF-8 back to Unicode is also done by shifting: pull the bits out of the corresponding positions of the UTF-8 format. In the example above, 你 is three bytes, so each byte has to be processed, from high to low. In UTF-8, 你 is 11100100, 10111101, 10100000. Starting from the high end, the first byte 11100100 has to give up its 0100. This is very simple: AND (&) it with 1111 (0x0F). The mask 11111 (0x1F) gives the same result here, because the fifth-lowest bit of a three-byte lead byte is always 0. With three bytes, these highest bits sit above the low 12 bits, because each of the other two bytes contributes six bits. So the result also has to be shifted left by 12 bits, giving 0100 000000 000000. The second byte has to give up 111101: AND (&) the byte 10111101 with 111111 (0x3F), shift the result left by 6 bits, and OR (|) it with the result from the first byte, giving 0100 111101 000000. In the same way, AND (&) the last byte with 111111 (0x3F) and OR (|) it with the previous result to fill in 100000. The final result is 0100 111101 100000, that is, 0x4F60.
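The same walk-through as a short sketch, using the three bytes of the example. The intermediate variable names are mine, and the comments show the bits each step contributes.

```php
<?php
// Decode the three UTF-8 bytes E4 BD A0 back to the Unicode code.
$bytes = "\xE4\xBD\xA0";

$high = (ord($bytes[0]) & 0x0F) << 12; // 0100 000000 000000
$mid  = (ord($bytes[1]) & 0x3F) << 6;  //      111101 000000
$low  =  ord($bytes[2]) & 0x3F;        //             100000

$code = $high | $mid | $low;
printf("%04X\n", $code); // 4F60
```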
PHP code implementation
/**
 * Convert a UTF-8 character to its Unicode code
 * @param string $utf8_str A single three-byte UTF-8 character
 * @return string The Unicode code in hexadecimal
 */
function utf8_str_to_unicode($utf8_str) {
    // 0x0F keeps the four code bits of the 1110xxxx lead byte
    // (0x1F gives the same result for a valid three-byte lead byte)
    $unicode = (ord($utf8_str[0]) & 0x0F) << 12;
    $unicode |= (ord($utf8_str[1]) & 0x3F) << 6;
    $unicode |= (ord($utf8_str[2]) & 0x3F);
    return dechex($unicode);
}

/**
 * Convert a Unicode code to a UTF-8 character
 * @param string $unicode_str The Unicode code in hexadecimal
 * @return string The three-byte UTF-8 character
 */
function unicode_to_utf8($unicode_str) {
    // The code must be an integer so that the bitwise operations work correctly
    $code = intval(hexdec($unicode_str));
    $ord_1 = 0xE0 | ($code >> 12);
    $ord_2 = 0x80 | (($code >> 6) & 0x3F);
    $ord_3 = 0x80 | ($code & 0x3F);
    return chr($ord_1) . chr($ord_2) . chr($ord_3);
}
Testing
$utf8_str = '我';
// This is the Unicode code of the Chinese character 你
$unicode_str = '4f60';
// Outputs 6211
echo utf8_str_to_unicode($utf8_str) . "\n";
// Outputs the Chinese character 你
echo unicode_to_utf8($unicode_str);
The conversions above are tested with Chinese characters (more broadly, with non-ASCII characters), because an ASCII character is the same before and after conversion and does not need all this work.
In addition, these two functions are only a simple implementation. They support converting a single character only (one complete three-byte UTF-8 character or one complete Unicode code). Once you understand how they work, you can extend them as much as you like...
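As one possible direction for such an extension, here is a rough sketch that handles one to four bytes and whole strings. The names codepoint_to_utf8 and utf8_to_codepoints are hypothetical, the code stops at four bytes (U+10FFFF), and it assumes the input is well-formed UTF-8 rather than validating it.

```php
<?php
// Hypothetical extension: encode any code up to U+10FFFF as UTF-8.
function codepoint_to_utf8(int $code): string
{
    if ($code < 0x80) {
        return chr($code); // ASCII, one byte
    }
    if ($code < 0x800) {
        return chr(0xC0 | ($code >> 6))
             . chr(0x80 | ($code & 0x3F));
    }
    if ($code < 0x10000) {
        return chr(0xE0 | ($code >> 12))
             . chr(0x80 | (($code >> 6) & 0x3F))
             . chr(0x80 | ($code & 0x3F));
    }
    return chr(0xF0 | ($code >> 18))
         . chr(0x80 | (($code >> 12) & 0x3F))
         . chr(0x80 | (($code >> 6) & 0x3F))
         . chr(0x80 | ($code & 0x3F));
}

// Hypothetical extension: decode a whole UTF-8 string into a list of codes.
// Assumes well-formed input; continuation bytes are not validated.
function utf8_to_codepoints(string $str): array
{
    $codes = [];
    $len = strlen($str);
    for ($i = 0; $i < $len; ) {
        $byte = ord($str[$i]);
        if ($byte < 0x80) {
            $code = $byte;        $extra = 0; // 0xxxxxxx
        } elseif (($byte & 0xE0) === 0xC0) {
            $code = $byte & 0x1F; $extra = 1; // 110xxxxx
        } elseif (($byte & 0xF0) === 0xE0) {
            $code = $byte & 0x0F; $extra = 2; // 1110xxxx
        } else {
            $code = $byte & 0x07; $extra = 3; // 11110xxx
        }
        // Each continuation byte contributes its low six bits.
        for ($j = 1; $j <= $extra; $j++) {
            $code = ($code << 6) | (ord($str[$i + $j]) & 0x3F);
        }
        $codes[] = $code;
        $i += $extra + 1;
    }
    return $codes;
}
```

For example, codepoint_to_utf8(0x4F60) returns the bytes E4 BD A0, and utf8_to_codepoints() on those bytes returns an array containing 0x4F60. Where the mbstring extension is available (PHP 7.2 or later), the built-in mb_chr() and mb_ord() cover similar ground for single characters.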