Converting between Unicode and UTF-8 encoding in PHP

Last Update:2016-06-20 Source: Internet

Author: User


I recently needed to convert Unicode encodings and went looking through the PHP library functions, but I did not find a function that encodes and decodes the Unicode value of a string directly. No matter: if it cannot be found, implement it yourself.

The difference between Unicode and UTF-8 encoding

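Unicode is a character set that assigns every character a number (its code point), and UTF-8 is one of the encodings that store those numbers as bytes. "Unicode" as an encoding usually means the fixed-width two-byte form (UCS-2, which is also how UTF-16 stores BMP characters), whereas UTF-8 is variable-length. A Chinese character therefore takes two bytes in the two-byte form and three bytes in UTF-8, one byte more.
As originally designed, UTF-8 encoded characters could theoretically be up to 6 bytes long; the current standard (RFC 3629) limits them to 4 bytes. Characters in the 16-bit BMP (Basic Multilingual Plane) use at most 3 bytes. Here's a look at the UTF-8 code table:

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx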

The x positions are filled with the binary representation of the character's code number; the further right an x is, the less significant the bit. The shortest multibyte sequence that can hold the code number is the one to use. Note that in a multibyte sequence, the number of leading 1 bits in the first byte equals the number of bytes in the whole sequence. The first row starts with 0 for compatibility with ASCII and takes one byte; the second row is a two-byte sequence; the third row is three bytes, which is where Chinese characters fall; and so on. (Personally, I think of it simply as: the number of leading 1s is the number of bytes.)
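The leading-bit rule can be sketched as a small lookup on the first byte. This is only an illustration: the function name utf8_sequence_length is hypothetical, and it covers the 1 to 4 byte forms that current UTF-8 allows.

```php
<?php
// Hypothetical helper (not from the original article): returns how many bytes
// the UTF-8 sequence starting with lead byte $lead occupies, or 0 if the byte
// cannot start a sequence.
function utf8_sequence_length(int $lead): int
{
    if (($lead & 0x80) === 0x00) { return 1; } // 0xxxxxxx, plain ASCII
    if (($lead & 0xE0) === 0xC0) { return 2; } // 110xxxxx
    if (($lead & 0xF0) === 0xE0) { return 3; } // 1110xxxx
    if (($lead & 0xF8) === 0xF0) { return 4; } // 11110xxx
    return 0; // 10xxxxxx continuation byte, or a lead byte outside current UTF-8
}

echo utf8_sequence_length(ord('A')), "\n";  // ASCII letter: prints 1
echo utf8_sequence_length(0xE4), "\n";      // first byte of U+4F60: prints 3
```

Each test masks off the prefix bits and compares them with the expected pattern, which is the same as counting the leading 1s.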

How is Unicode converted to UTF-8?

To convert Unicode to UTF-8 you first need to know how they differ. If a character's code is less than 0x80 (128), it is an ASCII character, takes one byte, and can be used without conversion, because UTF-8 is compatible with ASCII. Now take the character "你" (you), whose Unicode code is \u4f60: convert it to binary, 100111101100000, and then apply the UTF-8 format. Take bits from the Unicode binary value from low to high, six at a time, place each group after the prefix that its byte position requires, and pad any byte that has fewer than 8 bits with 0s.

unicode: 100111101100000               4f60
utf-8:   11100100,10111101,10100000    e4bda0

The above shows the conversion from Unicode to UTF-8 quite directly. Knowing the UTF-8 format, you can of course also go the other way: pull the bits out of their positions in the format, and the result is the Unicode code. Both directions can be done with bit shifts. In the "你" example the value is at least 0x800 and less than 0x10000, so it is stored in three bytes. For the first byte, shift the value right by 12 bits and OR (|) it with the three-byte lead format 11100000 (0xE0). For the second byte, shift the value right by 6 bits, which leaves the bits of the first and second groups; AND (&) with 111111 (0x3F) to keep only the second group, then OR (|) with 10000000 (0x80). The third byte needs no shift: AND (&) the value with 111111 (0x3F) to keep the last six bits, then OR (|) with 10000000 (0x80).
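As a worked check of these three steps, here is a minimal sketch that encodes U+4F60 byte by byte. The variable names are illustrative only and are not taken from the article's functions.

```php
<?php
// Worked example for the code point U+4F60 (three-byte range 0x800..0xFFFF).
$code = 0x4F60;

$byte1 = 0xE0 | ($code >> 12);          // top 4 bits placed under 1110xxxx
$byte2 = 0x80 | (($code >> 6) & 0x3F);  // middle 6 bits placed under 10xxxxxx
$byte3 = 0x80 | ($code & 0x3F);         // low 6 bits placed under 10xxxxxx

// Show each byte in binary, then the whole sequence in hexadecimal.
printf("%08b %08b %08b\n", $byte1, $byte2, $byte3); // 11100100 10111101 10100000
$utf8 = chr($byte1) . chr($byte2) . chr($byte3);
echo bin2hex($utf8), "\n";                          // e4bda0
```

The binary line matches the utf-8 row shown earlier, and the hexadecimal line matches e4bda0.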

How is UTF-8 converted back to Unicode?

Converting UTF-8 back to Unicode is also done with shifts: take the binary digits out of their positions in the UTF-8 format. In the example above, "你" is three bytes, so each byte is processed in turn, from high to low. In UTF-8, "你" is 11100100,10111101,10100000. Starting from the high end, the first byte 11100100 carries the bits "0100". Extracting them is simple: AND (&) the byte with 11111 (0x1F). (A three-byte lead byte has only four payload bits, so 1111 (0x0F) is the exact mask; 0x1F gives the same result here because the fifth bit of a 1110xxxx byte is always 0.) Because the character has three bytes and each later byte contributes six bits, these highest bits belong 12 bits up, so shift the result left by 12 bits, giving 0100,000000,000000. Next, extract "111101" from the second byte by ANDing 10111101 with 111111 (0x3F), shift the result left by 6 bits, and OR (|) it with the value from the first byte; the result is 0100,111101,000000. Finally, AND the last byte with 111111 (0x3F) and OR (|) it with the previous result to get 0100,111101,100000.
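The same walk-through can be sketched in a few lines. The variable names are illustrative, and the first byte is masked with 0x0F, the four payload bits of a three-byte lead byte, which yields the same value as 0x1F for valid input.

```php
<?php
// Worked example: decode the three UTF-8 bytes e4 bd a0 back to a code point.
$bytes = "\xE4\xBD\xA0";

$high = (ord($bytes[0]) & 0x0F) << 12; // 0100 000000 000000
$mid  = (ord($bytes[1]) & 0x3F) << 6;  //      111101 000000
$low  =  ord($bytes[2]) & 0x3F;        //             100000

$codePoint = $high | $mid | $low;
printf("%015b\n", $codePoint);  // 100111101100000
echo dechex($codePoint), "\n";  // 4f60
```

Each intermediate value corresponds to one of the partial results given in the paragraph above.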

PHP Code implementation

/**
 * Convert a UTF-8 character to its Unicode code
 * @param  string $utf8_str A single three-byte UTF-8 character
 * @return string           The Unicode code as a hexadecimal string
 */
function utf8_str_to_unicode($utf8_str) {
    $unicode = 0;
    $unicode = (ord($utf8_str[0]) & 0x1F) << 12;
    $unicode |= (ord($utf8_str[1]) & 0x3F) << 6;
    $unicode |= (ord($utf8_str[2]) & 0x3F);
    return dechex($unicode);
}

/**
 * Convert a Unicode code to a UTF-8 character
 * @param  string $unicode_str The Unicode code as a hexadecimal string (0x800 to 0xFFFF)
 * @return string              The UTF-8 character
 */
function unicode_to_utf8($unicode_str) {
    $utf8_str = '';
    $code = intval(hexdec($unicode_str));
    // Note: the code must be converted to an integer so that the bitwise operations are correct
    $ord_1 = decbin(0xe0 | ($code >> 12));
    $ord_2 = decbin(0x80 | (($code >> 6) & 0x3f));
    $ord_3 = decbin(0x80 | ($code & 0x3f));
    $utf8_str = chr(bindec($ord_1)) . chr(bindec($ord_2)) . chr(bindec($ord_3));
    return $utf8_str;
}

Test it.

$utf8_str = '我';
// This is the Unicode code of the Chinese character "你" (you)
$unicode_str = '4f60';
// Outputs 6211
echo utf8_str_to_unicode($utf8_str) . "\n";
// Outputs the Chinese character "你"
echo unicode_to_utf8($unicode_str);

These conversions were tested with Chinese characters (more broadly, non-ASCII characters); ASCII characters are identical in both forms, so they need no such effort.
Also, these two functions are deliberately simple: each converts only a single character (one complete UTF-8 character or one complete Unicode code, and as written only codes in the three-byte range 0x800 to 0xFFFF). If you have followed the explanation, you can extend them as far as you need.
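One possible extension, sketched under assumptions: the function names codepoint_to_utf8 and utf8_to_codepoints are hypothetical, the input is assumed to be well-formed UTF-8, and no validation (overlong forms, surrogates, truncated input) is attempted. It applies the same shift-and-mask steps to sequences of 1 to 4 bytes and to whole strings.

```php
<?php
// Hypothetical helper: encode one code point (an integer) as UTF-8.
function codepoint_to_utf8(int $code): string
{
    if ($code < 0x80) {
        return chr($code);                        // 0xxxxxxx
    }
    if ($code < 0x800) {
        return chr(0xC0 | ($code >> 6))           // 110xxxxx
             . chr(0x80 | ($code & 0x3F));
    }
    if ($code < 0x10000) {
        return chr(0xE0 | ($code >> 12))          // 1110xxxx
             . chr(0x80 | (($code >> 6) & 0x3F))
             . chr(0x80 | ($code & 0x3F));
    }
    return chr(0xF0 | ($code >> 18))              // 11110xxx
         . chr(0x80 | (($code >> 12) & 0x3F))
         . chr(0x80 | (($code >> 6) & 0x3F))
         . chr(0x80 | ($code & 0x3F));
}

// Hypothetical helper: decode a well-formed UTF-8 string into code points.
function utf8_to_codepoints(string $utf8): array
{
    $points = [];
    $len = strlen($utf8);
    for ($i = 0; $i < $len; $i += $n) {
        $lead = ord($utf8[$i]);
        // Pick the sequence length and the payload bits of the lead byte.
        if ($lead < 0x80)     { $n = 1; $code = $lead; }
        elseif ($lead < 0xE0) { $n = 2; $code = $lead & 0x1F; }
        elseif ($lead < 0xF0) { $n = 3; $code = $lead & 0x0F; }
        else                  { $n = 4; $code = $lead & 0x07; }
        // Append six payload bits from each continuation byte.
        for ($j = 1; $j < $n; $j++) {
            $code = ($code << 6) | (ord($utf8[$i + $j]) & 0x3F);
        }
        $points[] = $code;
    }
    return $points;
}

// "A", U+4F60 and U+1F600 written as raw UTF-8 bytes.
$text = "A\xE4\xBD\xA0\xF0\x9F\x98\x80";
$points = utf8_to_codepoints($text);
echo implode(' ', array_map('dechex', $points)), "\n"; // 41 4f60 1f600
echo implode('', array_map('codepoint_to_utf8', $points)) === $text
    ? "round trip ok" : "mismatch", "\n";
```

Working on integers rather than hexadecimal strings keeps the two helpers symmetric; dechex and hexdec can be applied at the edges when a textual code such as 4f60 is needed.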



This article is an English version of an article originally published in Chinese on aliyun.com.
