Convert Unicode to UTF-8 with PHP

Last Update:2017-05-13 Source: Internet

Author: User


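This article shows how to convert Unicode escape sequences to UTF-8 in PHP, for readers who are learning PHP. The function below decodes JavaScript-style %uXXXX escapes and numeric HTML entities (&#xXXXX; and &#NNNN;):

    function unescape($str) {
        $str = rawurldecode($str);
        // Split the input into %uXXXX escapes, hex entities, decimal entities,
        // and single characters. U makes .+ lazy (one character per match);
        // s lets . match newlines so that they are not dropped.
        preg_match_all('/(?:%u.{4})|&#x.{4};|&#\d+;|.+/Us', $str, $r);
        $ar = $r[0];
        // print_r($ar);
        foreach ($ar as $k => $v) {
            if (substr($v, 0, 2) == "%u") {
                // %uXXXX: the last four characters are the hex code unit
                $ar[$k] = iconv("UCS-2BE", "UTF-8", pack("H4", substr($v, -4)));
            } elseif (substr($v, 0, 3) == "&#x") {
                // &#xXXXX;: hex digits between "&#x" and ";"
                $ar[$k] = iconv("UCS-2BE", "UTF-8", pack("H4", substr($v, 3, -1)));
            } elseif (substr($v, 0, 2) == "&#") {
                // &#NNNN;: decimal digits, packed as a 16-bit big-endian integer
                $ar[$k] = iconv("UCS-2BE", "UTF-8", pack("n", substr($v, 2, -1)));
            }
        }
        return join("", $ar);
    }

    echo unescape("Purple Star Blue ");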
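As a rough cross-check of what the function above is meant to produce, the following sketch decodes the same two kinds of input by other means. It is not the article's method: it swaps in PHP's built-in html_entity_decode for the entity forms, and decode_percent_u is a hypothetical helper name introduced here for the %uXXXX form. It assumes the iconv extension is available.

```php
<?php
// Numeric HTML entities (hex and decimal) are handled by a built-in.
$entities = '&#x4f60;&#22909;';
echo html_entity_decode($entities, ENT_QUOTES | ENT_HTML5, 'UTF-8'), "\n"; // 你好

// %uXXXX escapes have no built-in decoder. This hypothetical helper replaces
// each escape with the UTF-8 form of the big-endian code unit it names.
function decode_percent_u(string $s): string
{
    return preg_replace_callback(
        '/%u([0-9a-fA-F]{4})/',
        function (array $m): string {
            // pack('H4', ...) turns four hex digits into two raw bytes.
            return iconv('UCS-2BE', 'UTF-8', pack('H4', $m[1]));
        },
        $s
    );
}

echo decode_percent_u('%u4f60%u597d'), "\n"; // 你好
```

Text without escapes passes through decode_percent_u unchanged, because only the %uXXXX matches are replaced.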

Some users reported that Chinese text submitted through the form system came out garbled. Testing showed that the problem was in the iconv conversion:
    iconv('ucs-2', 'gbk', 'Chinese')
(here 'Chinese' stands for the submitted text). A Google search showed the cause: plain UCS-2 is interpreted with a different byte order on the Linux server than on Windows.
So I changed the call to iconv('ucs-2be', 'gbk', 'Chinese'), and the Chinese text displayed correctly.
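The effect of the byte order can be seen in a minimal sketch like the one below. It assumes the iconv extension is available and that the underlying iconv library knows the UCS-2LE, UCS-2BE, and GBK charset names (true for glibc, possibly not for other builds). The sample character is chosen here for illustration.

```php
<?php
$text = "\u{4E2D}"; // 中 (U+4E2D) as UTF-8

$be = iconv('UTF-8', 'UCS-2BE', $text); // big-endian: bytes 4E 2D
$le = iconv('UTF-8', 'UCS-2LE', $text); // little-endian: bytes 2D 4E

echo bin2hex($be), "\n"; // 4e2d
echo bin2hex($le), "\n"; // 2d4e

// Decoding with the matching byte order restores the character...
$ok = iconv('UCS-2LE', 'UTF-8', $le);
// ...while the wrong byte order reads the bytes as a different code point (U+2D4E).
$wrong = iconv('UCS-2BE', 'UTF-8', $le);

var_dump($ok === $text);    // bool(true)
var_dump($wrong === $text); // bool(false)

// The conversion from the article, with the byte order named explicitly.
echo bin2hex(iconv('UCS-2BE', 'GBK', $be)), "\n"; // d6d0
```

Naming the byte order in the charset string makes the result independent of whatever the local iconv assumes for plain UCS-2.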

The following are the unwritten rules for UCS-2 encoding on the two platforms:

1. UCS-2 is not the same as UTF-16. UCS-2 is a fixed-width encoding that uses exactly two bytes per character and covers only the Basic Multilingual Plane (U+0000 to U+FFFF). UTF-16 is variable-width: it encodes those characters with the same two bytes as UCS-2, but represents characters above U+FFFF as a four-byte surrogate pair.

2. On Windows, UCS-2 defaults to UCS-2LE. The Unicode strings produced by MultiByteToWideChar (or A2W) are UCS-2LE. Windows Notepad can also save text as UCS-2BE, which amounts to an extra byte-order conversion.

3. On Linux, the usual convention for UCS-2 data is UCS-2BE. What iconv assumes when given plain UCS-2, however, depends on the iconv implementation and the platform, so name the byte order explicitly: UCS-2BE for big-endian data, and UCS-2LE for UCS-2 data that comes from a Windows platform.

4. Because Windows, Linux, and other platforms interpret UCS-2 differently (UCS-2LE versus UCS-2BE), Microsoft recommends starting Unicode text with a byte order mark (BOM): the bytes FF FE for UCS-2LE and FE FF for UCS-2BE. The BOM indicates that the data that follows is Unicode and whether it is big-endian or little-endian. Data from the Windows platform that carries this prefix therefore identifies its own byte order.

5. Output on Linux, whether read from a file or written with printf, displays correctly only if it matches the console's encoding (a mismatch is usually related to the encoding used when the program was compiled), and for input typed at the console you need to check the current system encoding. For example, if the console encoding is UTF-8, UTF-8 text displays correctly and GBK text does not; likewise, a GBK console displays GBK text. Systems may handle more of these conversions automatically in the future, but with PuTTY and similar terminals you still need to set the terminal's character encoding to avoid garbled characters.
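Points 1 and 4 above can be sketched in a few lines. This is an illustration under stated assumptions, not a definitive implementation: it assumes the iconv extension with UCS-2BE, UCS-2LE, and UTF-16BE support, and ucs2_to_utf8 is a hypothetical helper name introduced here.

```php
<?php
// Point 1: a BMP character is the same two bytes in UCS-2 and UTF-16...
$bmp = "\u{4F60}"; // 你
echo bin2hex(iconv('UTF-8', 'UCS-2BE', $bmp)), "\n";  // 4f60
echo bin2hex(iconv('UTF-8', 'UTF-16BE', $bmp)), "\n"; // 4f60

// ...but a character above U+FFFF needs a four-byte surrogate pair in UTF-16.
$emoji = "\u{1F600}";
echo bin2hex(iconv('UTF-8', 'UTF-16BE', $emoji)), "\n"; // d83dde00

// Point 4: hypothetical helper that uses a leading BOM, if present, to pick
// the byte order, and falls back to $default when there is none.
function ucs2_to_utf8(string $data, string $default = 'UCS-2BE')
{
    $bom = substr($data, 0, 2);
    if ($bom === "\xFF\xFE") {
        return iconv('UCS-2LE', 'UTF-8', substr($data, 2)); // little-endian BOM
    }
    if ($bom === "\xFE\xFF") {
        return iconv('UCS-2BE', 'UTF-8', substr($data, 2)); // big-endian BOM
    }
    return iconv($default, 'UTF-8', $data); // no BOM: caller's choice
}

echo ucs2_to_utf8("\xFF\xFE\x2D\x4E"), "\n"; // 中
echo ucs2_to_utf8("\xFE\xFF\x4E\x2D"), "\n"; // 中
echo ucs2_to_utf8("\x4E\x2D"), "\n";         // 中 (no BOM, default UCS-2BE)
```

The helper strips the BOM before converting, so the mark does not end up in the UTF-8 output.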

Implementation of UNICODE encoding and decoding for Chinese characters in PHP

    // Encode the content as \uXXXX Unicode escapes
    function unicode_encode($name) {
        // Name the byte order explicitly so the result does not depend on the platform
        $name = iconv('UTF-8', 'UCS-2BE', $name);
        $len = strlen($name);
        $str = '';
        for ($i = 0; $i < $len - 1; $i = $i + 2) {
            $c = ord($name[$i]);      // high byte
            $c2 = ord($name[$i + 1]); // low byte
            if ($c > 0 || $c2 > 0x7f) {
                // Non-ASCII character: zero-pad each byte to two hex digits
                $str .= sprintf('\u%02x%02x', $c, $c2);
            } else {
                // Plain ASCII is copied through unchanged
                $str .= chr($c2);
            }
        }
        return $str;
    }

    $name = 'My,你大爷的';
    $unicode_name = unicode_encode($name);
    echo $unicode_name;

    // Decode the \uXXXX escapes back into readable UTF-8
    function unicode_decode($name) {
        // Replace only the escapes, so punctuation and other text are preserved
        return preg_replace_callback(
            '/\\\\u([0-9a-fA-F]{4})/',
            function ($m) {
                return iconv('UCS-2BE', 'UTF-8', pack('H4', $m[1]));
            },
            $name
        );
    }

    echo 'My,\u4f60\u5927\u7237\u7684 -> ' . unicode_decode($unicode_name);
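For comparison, PHP's JSON functions produce and consume the same \uXXXX notation. The sketch below is an alternative technique, not the article's implementation, and assumes the json extension is enabled. Note that json_encode also escapes quotes, backslashes, and forward slashes, so its output matches the hand-written encoder only for text without those characters.

```php
<?php
$name = "My,\u{4F60}\u{5927}\u{7237}\u{7684}"; // My,你大爷的

// json_encode escapes non-ASCII characters as \uXXXX by default.
// substr(..., 1, -1) removes the surrounding double quotes of the JSON string.
$escaped = substr(json_encode($name), 1, -1);
echo $escaped, "\n"; // My,\u4f60\u5927\u7237\u7684

// json_decode turns the escapes back into UTF-8 once the quotes are restored.
$decoded = json_decode('"' . $escaped . '"');
echo $decoded, "\n"; // My,你大爷的
```

Unlike a UCS-2 based encoder, this route also handles characters above U+FFFF, which JSON writes as a pair of \uXXXX surrogate escapes.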

