PHP achieves Unicode and Utf-8 mutual conversion

Last Update:2017-05-14 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

PHP achieves Unicode and Utf-8 mutual conversion I. coding principles and implementation

Unicode encoding is the basis for UTF-8 and gb series encoding (gb2312, gbk, and gb18030) Conversion. although we can also directly create a UTF-8-to-these encoding table, but few people will do this, because the variable encoding of UTF-8 is uncertain, so the general use of unicode and gb encoding in contrast to the table, unicode (UCS-2) is actually the basic encoding of UTF-8, UTF-8 is only one of its implementations. there is a corresponding relationship between the two:

Unicode symbol range | UTF-8 encoding method

U00000 0000-u0000 007F | 0 xxxxxxx

U0000 0080-u0000 07FF | 110 xxxxx 10 xxxxxx

U0000 0800-u0000 FFFF | 1110 xxxx 10 xxxxxx 10 xxxxxx

Because currently UTF-8 characters are in the UCS-2, it is unnecessary to consider the situation of 4-6 bytes encoding, similarly, in reverse conversion, if UTF-8 characters of more than four bytes exist, they can be directly considered as garbled characters or converted to the unicode entity form ("& # long int ), then it is handed over to the browser or related parsing programs for processing. The following is an algorithm for converting unicode into UTF-8 encoding using php:

* The $ c parameter is an int-type value encoded with unicode characters. if data is read in binary format, it is usually converted using hexdec (bin2hex ($ bin_unichar) in php.

Function uni2utf8 ($ c)

{

If ($ c <0x80)

{

$ Utf8char = chr ($ c );

}

Else if ($ c <0x800)

{

$ Utf8char = chr (0xC0 | $ c> 0x06). chr (0x80 | $ c & 0x3F );

}

Else if ($ c <0x10000)

{

$ Utf8char = chr (0xE0 | $ c> 0x0C ). chr (0x80 | $ c> 0x06 & 0x3F ). chr (0x80 | $ c & 0x3F );

}

// Because the UCS-2 only has two bytes, the subsequent situation is not possible, here only describes the unicode HTML entity encoding usage.

Else

{

$ Utf8char = "& # {$ c };";

}

Return $ utf8char;

}

In the current environment range, we can consider the UTF-8 character set = unicode (UCS-2), but theoretically, the main character set of the inclusion relationship is as follows:

UTF-8> unicode (UCS-2)> gb18030> gbk> gb2312

Therefore, if the encoding is correct:

Gb2312 => gbk => gb18030 => unicode (UCS-2) => UTF-8

Such a transformation process is basically lossless, but instead

UTF-8 => unicode (UCS-2) => gb18030 => gbk => gb2312

Such a transformation process is likely to have unrecognized characters. Therefore, if UTF-8 encoding is used, do not perform reverse conversion and encoding as easily as possible.

2. another way to convert Unicode to UTF-8 with PHP:

Function unescape ($ str ){

$ Str = rawurldecode ($ str );

Preg_match_all ("/(? : % U. {4}) | & # x. {4}; | & # \ d +; |. +/U ", $ str, $ r );

$ Ar = $ r [0];

// Print_r ($ ar );

Foreach ($ ar as $ k => $ v ){

If (substr ($ v, 0, 2) = "% u "){

$ Ar [$ k] = iconv ("UCS-2BE", "UTF-8", pack ("H4", substr ($ v,-4 )));

}

Elseif (substr ($ v, 0, 3) = "& # x "){

$ Ar [$ k] = iconv ("UCS-2BE", "UTF-8", pack ("H4", substr ($ v, 3,-1 )));

}

Elseif (substr ($ v, 0, 2) = "&#"){

$ Ar [$ k] = iconv ("UCS-2BE", "UTF-8", pack ("n", substr ($ v, 2,-1 )));

}

Return join ("", $ ar );

}

The UCS-2 encoding method on a Linux server is inconsistent with Winodws, and the following are the unspoken rules for the two platforms UCS-2 encoding:

1. the UCS-2 is not equal to the UTF-16. Each byte in the UTF-16 uses ASCII character range encoding, while the UCS-2 can encode each byte beyond the ASCII character range. UCS-2 and UTF-16 take up to two bytes for each character, but their encoding is different.

2. for UCS-2, the default is UCS-2LE in windows. The unicode of the UCS-2LE is generated with MultibyteToWidechar (or A2W. Windows Notepad can save text as a UCS-2BE, which is equivalent to a layer conversion.

3. for UCS-2, the default is UCS-2BE in linux. Iconv (specifies the UCS-2) is used to convert the unicode of the UCS-2BE. If you convert a UCS-2 from a windows platform, you need to specify a UCS-2LE.

4. In view of windows, linux and other platforms on the UCS-2 of different understanding (UCS-2LE, UCS-2BE ). MS advocates unicode has a bootstrap sign (UCS-2LE FFFE, UCS-2BE FEFF) to indicate that the following characters are unicode and identify big-endian or little-endian. Therefore, data from the windows platform has this prefix, so you don't need to worry.

5. linux Encoding output, such as output from a file and output from printf, requires proper encoding matching on the console (if the encoding does not match, it is generally related to the encoding during compilation of the program ), the conversion input in the console needs to view the current system code. For example, the current console encoding is UTF-8, then the UTF-8 encoding can be correctly displayed, GBK can not; similarly, the current encoding is GBK, you can display GBK encoding, later, the system should be more intelligent to handle more transformations. However, through putty and other terminals, you still need to set the terminal encoding conversion to relieve garbled characters.

3. provide a complete set of UNICODE encoding and decoding for Chinese characters in PHP for your reference:

// Encode the content in UNICODE

Function unicode_encode ($ name)

{

$ Name = iconv ('utf-8', 'ucs-2', $ name );

$ Len = strlen ($ name );

$ Str = '';

For ($ I = 0; $ I <$ len-1; $ I = $ I + 2)

{

$ C = $ name [$ I];

$ C2 = $ name [$ I + 1];

If (ord ($ c)> 0)

{

// Two-byte text

$ Str. = '\ U'. base_convert (ord ($ c), 10, 16). base_convert (ord ($ c2), 10, 16 );

}

Else

{

$ Str. = $ c2;

}

Return $ str;

}

// Decodes the UNICODE encoded content

Function unicode_decode ($ name)

{

// Convert the Unicode encoding to the UTF-8 encoding that can be viewed

$ Pattern = '/([\ w] +) | (\ u ([\ w] {4})/I ';

Preg_match_all ($ pattern, $ name, $ matches );

If (! Empty ($ matches ))

{

$ Name = '';

For ($ j = 0; $ j <count ($ matches [0]); $ j ++)

{

$ Str = $ matches [0] [$ j];

If (strpos ($ str, '\ u') = 0)

{

$ Code = base_convert (substr ($ str, 2, 2), 16, 10 );

$ Code2 = base_convert (substr ($ str, 4), 16, 10 );

$ C = chr ($ code). chr ($ code2 );

$ C = iconv ('ucs-2', 'utf-8', $ c );

$ Name. = $ c;

}

Else

{

$ Name. = $ str;

}

Return $ name;

}

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

PHP achieves Unicode and Utf-8 mutual conversion

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support

PHP achieves Unicode and Utf-8 mutual conversion

Contact Us

What's Trending

Top 10 Tags

Top 10 Keywords

Trending Topic

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support