How PHP implements Unicode and UTF-8 code conversion

Last Update:2017-01-19 Source: Internet

Author: User

Tags chr ord

Recently I needed to convert between Unicode codes and UTF-8, so I checked the PHP library functions, but I did not find one that encodes and decodes strings this way. Since I could not find one, I implemented it myself.
The difference between Unicode and UTF-8 encoding

Unicode is a character set, and UTF-8 is one of its encodings. "Unicode encoding" here means the fixed two-byte form (UCS-2), whereas UTF-8 is variable-length. A Chinese character therefore takes two bytes in the two-byte form and three bytes in UTF-8, one byte more.
As originally designed, a UTF-8 encoded character can theoretically be up to 6 bytes long (the current standard, RFC 3629, limits it to 4 bytes), whereas 16-bit BMP (Basic Multilingual Plane) characters use at most 3 bytes. Here is the UTF-8 code table:

U-00000000 - U-0000007F: 0xxxxxxx
U-00000080 - U-000007FF: 110xxxxx 10xxxxxx
U-00000800 - U-0000FFFF: 1110xxxx 10xxxxxx 10xxxxxx
U-00010000 - U-001FFFFF: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
U-00200000 - U-03FFFFFF: 111110xx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx
U-04000000 - U-7FFFFFFF: 1111110x 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx 10xxxxxx

The x positions are filled with the bits of the character's code number in binary, and the rightmost x holds the least significant bit. The shortest multi-byte sequence that can hold the code number is the one that is used. Note that in a multi-byte sequence, the number of leading 1 bits in the first byte equals the number of bytes in the whole sequence. The first row starts with 0 for compatibility with ASCII and is one byte; the second row is a two-byte sequence; the third row is three bytes, which covers Chinese characters; and so on. (Personally, I think of it simply as: the number of leading 1s is the number of bytes.)
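The table and the leading-1 rule above can be sketched in PHP as follows. This is only an illustration under a few assumptions: the function names `utf8_bytes_for_code_point` and `utf8_length_from_lead_byte` are hypothetical, PHP 7 or later is assumed for the type declarations, and the first helper stops at 4 bytes (U+10FFFF) while the second follows the table up to 6.

```php
<?php
// Hypothetical helper: how many UTF-8 bytes a code point needs,
// following the ranges in the table (limited here to U+10FFFF, 4 bytes).
function utf8_bytes_for_code_point(int $code): int {
    if ($code < 0 || $code > 0x10FFFF) {
        throw new InvalidArgumentException('code point out of range');
    }
    if ($code < 0x80) {
        return 1; // 0xxxxxxx
    }
    if ($code < 0x800) {
        return 2; // 110xxxxx 10xxxxxx
    }
    if ($code < 0x10000) {
        return 3; // 1110xxxx 10xxxxxx 10xxxxxx
    }
    return 4;     // 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
}

// Hypothetical helper: sequence length read from the first byte by
// counting its leading 1 bits. Returns 0 if the byte cannot start a sequence.
function utf8_length_from_lead_byte(int $byte): int {
    $ones = 0;
    for ($mask = 0x80; $mask > 0 && ($byte & $mask); $mask >>= 1) {
        $ones++;
    }
    if ($ones === 0) {
        return 1; // ASCII byte
    }
    if ($ones === 1 || $ones > 6) {
        return 0; // continuation byte (10xxxxxx) or not in the table
    }
    return $ones;
}

echo utf8_bytes_for_code_point(0x4F60), "\n";  // 3
echo utf8_length_from_lead_byte(0xE4), "\n";   // 3
```

Both calls agree for the character used later in this article: its code 0x4F60 falls in the three-byte row, and its first UTF-8 byte 0xE4 (11100100) starts with three 1 bits.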

How does Unicode convert to UTF-8?

To convert Unicode to UTF-8, you first need to know how they differ. If a character's code is less than 0x80 (128), it is an ASCII character, occupies one byte, and needs no conversion, because UTF-8 is compatible with ASCII. Take the character "你" (you), whose Unicode code is \u4f60: write it in binary, 100111101100000, and then apply the UTF-8 layout. Take the bits from the low end to the high end, 6 bits at a time, put each group after the prefix that the format requires, and pad with 0s on the left where a byte has fewer than 8 bits. The result looks like this:

Unicode: 100111101100000 (4f60)

UTF-8: 11100100 10111101 10100000 (e4bda0)

This makes the conversion from Unicode to UTF-8 easy to see. Once you know the UTF-8 format, you can also invert it: pull the bits out of their positions in each byte and reassemble them into the Unicode value. Both directions can be done with bit shifts. For "你", the value is at least 0x800 and less than 0x10000, so it is stored in three bytes. For the first byte, shift the value right by 12 bits and OR (|) the result with 11100000 (0xE0), the three-byte prefix. For the second byte, shift the value right by 6 bits, AND (&) it with 111111 (0x3F) to drop the bits that already went into the first byte, and then OR (|) it with 10000000 (0x80). The third byte needs no shift: AND (&) the value with 111111 (0x3F) to keep the last six bits, then OR (|) it with 10000000 (0x80).
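The shifts and masks described above extend naturally to the other rows of the table. The following is a minimal sketch rather than the article's own function: the name `code_point_to_utf8` is hypothetical, it takes the code as an integer instead of a hex string, it assumes PHP 7 or later, and it stops at 4 bytes (U+10FFFF).

```php
<?php
// Hypothetical helper: encode one Unicode code point (an integer) as UTF-8.
function code_point_to_utf8(int $code): string {
    if ($code < 0 || $code > 0x10FFFF) {
        throw new InvalidArgumentException('code point out of range');
    }
    if ($code < 0x80) {
        // ASCII: a single byte, no conversion needed.
        return chr($code);
    }
    if ($code < 0x800) {
        // Two bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($code >> 6))
             . chr(0x80 | ($code & 0x3F));
    }
    if ($code < 0x10000) {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($code >> 12))
             . chr(0x80 | (($code >> 6) & 0x3F))
             . chr(0x80 | ($code & 0x3F));
    }
    // Four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($code >> 18))
         . chr(0x80 | (($code >> 12) & 0x3F))
         . chr(0x80 | (($code >> 6) & 0x3F))
         . chr(0x80 | ($code & 0x3F));
}

echo bin2hex(code_point_to_utf8(0x4F60)), "\n"; // e4bda0
```

The three-byte branch is exactly the procedure walked through for "你"; the other branches only change the prefix and the number of 6-bit groups.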

How does UTF-8 convert back to Unicode?

Converting UTF-8 back to Unicode also relies on shifts: extract the payload bits from the positions that the UTF-8 format defines. In the example above, "你" is three bytes, so each byte is processed in turn, from high to low. In UTF-8, "你" is 11100100 10111101 10100000. From the first byte, 11100100, take out "0100" by ANDing (&) it with 00001111 (0x0F); ANDing with 11111 (0x1F) gives the same result, because the bit right after the 1110 prefix is always 0. Since there are three bytes and each following byte contributes six bits, these bits belong 12 positions higher, so shift the result left by 12 bits, which gives 0100,000000,000000. From the second byte, 10111101, take out "111101" by ANDing (&) it with 111111 (0x3F), shift it left by 6 bits, and OR (|) it with the result from the first byte, which gives 0100,111101,000000. Finally, AND (&) the last byte with 111111 (0x3F) and OR (|) it with the previous result to get 0100,111101,100000.
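The same extraction can be written so that it handles one- to four-byte sequences. This is a hedged sketch, not the article's function: the name `utf8_to_code_point` is hypothetical, PHP 7 or later is assumed, it returns an integer rather than a hex string, and it does not reject overlong encodings.

```php
<?php
// Hypothetical helper: decode the first UTF-8 character in $bytes
// and return its Unicode code point as an integer.
function utf8_to_code_point(string $bytes): int {
    if ($bytes === '') {
        throw new InvalidArgumentException('empty input');
    }
    $lead = ord($bytes[0]);
    if ($lead < 0x80) {
        return $lead;                        // 0xxxxxxx
    }
    if (($lead & 0xE0) === 0xC0) {
        $extra = 1; $code = $lead & 0x1F;    // 110xxxxx
    } elseif (($lead & 0xF0) === 0xE0) {
        $extra = 2; $code = $lead & 0x0F;    // 1110xxxx
    } elseif (($lead & 0xF8) === 0xF0) {
        $extra = 3; $code = $lead & 0x07;    // 11110xxx
    } else {
        throw new InvalidArgumentException('invalid lead byte');
    }
    if (strlen($bytes) < $extra + 1) {
        throw new InvalidArgumentException('truncated sequence');
    }
    for ($i = 1; $i <= $extra; $i++) {
        $byte = ord($bytes[$i]);
        if (($byte & 0xC0) !== 0x80) {
            throw new InvalidArgumentException('invalid continuation byte');
        }
        // Make room for six more bits, then append the payload of this byte.
        $code = ($code << 6) | ($byte & 0x3F);
    }
    return $code;
}

echo dechex(utf8_to_code_point("\xE4\xBD\xA0")), "\n"; // 4f60
```

Shifting the accumulated value left by 6 before each continuation byte is equivalent to the fixed shifts of 12 and 6 described above for the three-byte case.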

PHP Code implementation:

/**
 * Convert a single three-byte UTF-8 character to its Unicode code
 * @param string $utf8_str UTF-8 character
 * @return string Unicode code as a hexadecimal string
 */
function utf8_str_to_unicode($utf8_str) {
  // Payload bits of the first byte go to the top (shifted left by 12)
  $unicode = (ord($utf8_str[0]) & 0x0F) << 12;
  // Six payload bits of the second byte go in the middle
  $unicode |= (ord($utf8_str[1]) & 0x3F) << 6;
  // Six payload bits of the third byte are the lowest bits
  $unicode |= (ord($utf8_str[2]) & 0x3F);
  return dechex($unicode);
}

/**
 * Convert a Unicode code to a three-byte UTF-8 character
 * @param string $unicode_str Unicode code as a hexadecimal string
 * @return string UTF-8 character
 */
function unicode_to_utf8($unicode_str) {
  $utf8_str = '';
  $code = intval(hexdec($unicode_str));
  // Note: $code must be an integer so that the bitwise operations work correctly
  $ord_1 = decbin(0xe0 | ($code >> 12));
  $ord_2 = decbin(0x80 | (($code >> 6) & 0x3f));
  $ord_3 = decbin(0x80 | ($code & 0x3f));
  $utf8_str = chr(bindec($ord_1)) . chr(bindec($ord_2)) . chr(bindec($ord_3));
  return $utf8_str;
}

Test it.

$utf8_str = '我';

// This is the Unicode code of the Chinese character "你" (you)
$unicode_str = '4f60';

// Outputs 6211
echo utf8_str_to_unicode($utf8_str) . "<br/>";

// Outputs the Chinese character "你"
echo unicode_to_utf8($unicode_str);

These conversions were tested with Chinese characters (non-ASCII) and only support a single character at a time, that is, one complete three-byte UTF-8 character or one complete Unicode code. I hope this helps you learn.
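One way to get past the single-character limit is to walk a string byte by byte and decode each sequence in turn. The sketch below is an assumption about how that could look, not part of the original code: the function name `utf8_string_to_code_points` is hypothetical, PHP 7 or later is assumed, sequences are limited to 4 bytes, and continuation bytes are not validated.

```php
<?php
// Hypothetical helper: return the Unicode code (hex string) of every
// character in a UTF-8 string, handling 1- to 4-byte sequences.
function utf8_string_to_code_points(string $text): array {
    $points = [];
    $length = strlen($text);
    $i = 0;
    while ($i < $length) {
        $lead = ord($text[$i]);
        if ($lead < 0x80) {
            $code = $lead;        $extra = 0;
        } elseif (($lead & 0xE0) === 0xC0) {
            $code = $lead & 0x1F; $extra = 1;
        } elseif (($lead & 0xF0) === 0xE0) {
            $code = $lead & 0x0F; $extra = 2;
        } elseif (($lead & 0xF8) === 0xF0) {
            $code = $lead & 0x07; $extra = 3;
        } else {
            throw new InvalidArgumentException("invalid lead byte at offset $i");
        }
        if ($i + $extra >= $length) {
            throw new InvalidArgumentException("truncated sequence at offset $i");
        }
        // Append six payload bits from each continuation byte.
        for ($k = 1; $k <= $extra; $k++) {
            $code = ($code << 6) | (ord($text[$i + $k]) & 0x3F);
        }
        $points[] = sprintf('%04x', $code);
        $i += $extra + 1;
    }
    return $points;
}

// "你我A" written as explicit UTF-8 bytes
echo implode(' ', utf8_string_to_code_points("\xE4\xBD\xA0\xE6\x88\x91A")), "\n";
// 4f60 6211 0041
```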

This article is an English version of an article originally written in Chinese on aliyun.com.
