Algorithms for encoding Unicode as UTF-8 are easy to find all over the internet, but most of them are copied from one another without any understanding of why they work. Besides giving the simplest PHP function for transcoding Unicode to UTF-8, this article looks more closely at the relationship between the two encodings. Much of the older material online is redundant and outdated: UTF-8 started out as a variable-length encoding of up to six bytes, but in practice it has settled into a stable phase in which nearly all characters in use fall within Unicode UCS-2.
Unicode is the basis for converting between UTF-8 and the GB family of encodings (GB2312, GBK, GB18030). A direct lookup table from UTF-8 to these encodings is possible, but very few people build one, because UTF-8's variable-length byte sequences make awkward table keys. Lookup tables are therefore generally built between Unicode and the GB codes. Unicode (UCS-2) is effectively the underlying code of UTF-8, and UTF-8 is just one way of serializing it. The two correspond as follows:
Unicode code point range | UTF-8 encoding
U+0000 - U+007F | 0xxxxxxx
U+0080 - U+07FF | 110xxxxx 10xxxxxx
U+0800 - U+FFFF | 1110xxxx 10xxxxxx 10xxxxxx
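To make the table concrete, here is a small sketch that prints how the bits of a code point are spread over the byte templates. The helper name utf8_bit_layout is made up for this illustration and is not part of the original article; it only covers the three rows of the table.

```php
<?php
// Hypothetical helper (an assumption for illustration): return the UTF-8
// bit pattern of a code point, following the three rows of the table.
function utf8_bit_layout(int $c): string
{
    if ($c < 0x80) {
        // One byte: 0xxxxxxx
        return sprintf('0%07b', $c);
    }
    if ($c < 0x800) {
        // Two bytes: top 5 bits, then low 6 bits
        return sprintf('110%05b 10%06b', $c >> 6, $c & 0x3F);
    }
    if ($c < 0x10000) {
        // Three bytes: top 4 bits, middle 6 bits, low 6 bits
        return sprintf('1110%04b 10%06b 10%06b', $c >> 12, ($c >> 6) & 0x3F, $c & 0x3F);
    }
    return 'outside the table';
}

echo utf8_bit_layout(0x41), "\n";   // 01000001
echo utf8_bit_layout(0xE9), "\n";   // 11000011 10101001
echo utf8_bit_layout(0x4E2D), "\n"; // 11100100 10111000 10101101
```

For U+4E2D the 16 bits 0100 111000 101101 are split 4 + 6 + 6, which gives the bytes E4 B8 AD.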
Because the characters that UTF-8 text uses today are almost all within UCS-2, the four- to six-byte forms do not need to be considered here. Likewise, in the reverse conversion, a UTF-8 sequence of four or more bytes can simply be treated as garbled and ignored, or turned into a Unicode HTML entity (the "&#N;" form, where N is the integer code point) and left to the browser or other parser to handle. The PHP algorithm for converting a Unicode code point to UTF-8 is as follows:
/*
 * The parameter $c is the Unicode code point as an int. If the data was read
 * in binary form, it is typically converted in PHP with
 * hexdec(bin2hex($bin_unichar)).
 */
function uni2utf8($c)
{
    if ($c < 0x80)
    {
        // One byte: 0xxxxxxx
        $utf8char = chr($c);
    }
    else if ($c < 0x800)
    {
        // Two bytes: 110xxxxx 10xxxxxx
        $utf8char = chr(0xC0 | ($c >> 6)) . chr(0x80 | ($c & 0x3F));
    }
    else if ($c < 0x10000)
    {
        // Three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        $utf8char = chr(0xE0 | ($c >> 12)) . chr(0x80 | (($c >> 6) & 0x3F)) . chr(0x80 | ($c & 0x3F));
    }
    // UCS-2 is only two bytes wide, so the case below cannot occur for UCS-2
    // input; it is here only to show the Unicode HTML entity form.
    else
    {
        $utf8char = "&#{$c};";
    }
    return $utf8char;
}
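The reverse conversion described earlier can be sketched in the same spirit. This is a minimal illustration, not the author's code: the function name utf8_to_codepoints is hypothetical, sequences of four or more bytes are simply skipped as "garbled", and continuation bytes are not strictly validated.

```php
<?php
// Hypothetical counterpart to the encoder (name and behaviour are assumptions):
// decode a UTF-8 string into an array of UCS-2 code points.
function utf8_to_codepoints(string $s): array
{
    $out = [];
    $len = strlen($s);
    for ($i = 0; $i < $len; ) {
        $b = ord($s[$i]);
        if ($b < 0x80) {
            // 0xxxxxxx
            $out[] = $b;
            $i += 1;
        } elseif (($b & 0xE0) === 0xC0 && $i + 1 < $len) {
            // 110xxxxx 10xxxxxx
            $out[] = (($b & 0x1F) << 6) | (ord($s[$i + 1]) & 0x3F);
            $i += 2;
        } elseif (($b & 0xF0) === 0xE0 && $i + 2 < $len) {
            // 1110xxxx 10xxxxxx 10xxxxxx
            $out[] = (($b & 0x0F) << 12)
                   | ((ord($s[$i + 1]) & 0x3F) << 6)
                   | (ord($s[$i + 2]) & 0x3F);
            $i += 3;
        } else {
            // Lead byte of a longer sequence, a stray continuation byte, or a
            // truncated sequence: skip it and any continuation bytes after it.
            $i += 1;
            while ($i < $len && (ord($s[$i]) & 0xC0) === 0x80) {
                $i++;
            }
        }
    }
    return $out;
}

// "A", U+00E9 and U+4E2D, written as raw UTF-8 bytes.
print_r(utf8_to_codepoints("A\xC3\xA9\xE4\xB8\xAD"));
```

Skipping longer sequences follows the "ignore" option from the text; emitting an HTML entity instead would require decoding the four-byte form as well.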
In the current environment the UTF-8 character set can be treated as equal to Unicode (UCS-2), but in theory the main character sets nest as follows:
UTF-8 > Unicode (UCS-2) > GB18030 > GBK > GB2312
So, provided the code is correct, a conversion chain of the form
GB2312 => GBK => GB18030 => Unicode (UCS-2) =>
is essentially lossless. In the opposite direction, however, a chain of the form
UTF-8 => Unicode (UCS-2) => GB18030 => GBK =>
is likely to hit characters that the target encoding cannot represent. So if you are going to use UTF-8 as your encoding system, try not to convert back to the older encodings casually.
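One way to check whether a reverse conversion loses anything is a round trip. The sketch below rests on several assumptions that are not in the original article: PHP's mbstring extension is enabled, 'CP936' is used as mbstring's name for GBK, and the helper name survives_round_trip is made up.

```php
<?php
// Hypothetical helper (an assumption for illustration): convert a UTF-8 string
// to a legacy encoding and back, and report whether it came back unchanged.
// Requires the mbstring extension.
function survives_round_trip(string $utf8, string $legacy): bool
{
    $encoded = mb_convert_encoding($utf8, $legacy, 'UTF-8');
    $decoded = mb_convert_encoding($encoded, 'UTF-8', $legacy);
    return $decoded === $utf8;
}

// U+4E2D U+6587 are common Chinese characters covered by GBK.
var_dump(survives_round_trip("\u{4E2D}\u{6587}", 'CP936'));

// U+0905 is a Devanagari letter, which GBK has no code for.
var_dump(survives_round_trip("\u{0905}", 'CP936'));
```

The first string should come back intact, while the second should not, because mbstring replaces characters the target encoding cannot represent with a substitute character.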