Some time ago I found a function that converts GB-encoded text to UTF-8, using a GB-to-Unicode mapping table (gb2312.txt), in order to output Chinese characters with GD. It later turned out that the output was garbled whenever the content contained Western characters. I then found a modified version of the code that solves the problem. The two functions are compared and analyzed below.
First, this is a Unicode to UTF-8 encoding conversion function, which is unchanged before and after the modification:
function U2utf8($c)
{
    // Each UTF-8 byte is appended as its decimal value (for example "228"),
    // not as a raw byte; the caller converts the values back with chr().
    $str = "";
    if ($c < 0x80) {
        $str .= $c;
    }
    else if ($c < 0x800) {
        $str .= (0xC0 | ($c >> 6));
        $str .= (0x80 | ($c & 0x3F));
    }
    else if ($c < 0x10000) {
        $str .= (0xE0 | ($c >> 12));
        $str .= (0x80 | (($c >> 6) & 0x3F));
        $str .= (0x80 | ($c & 0x3F));
    }
    else if ($c < 0x200000) {
        $str .= (0xF0 | ($c >> 18));
        $str .= (0x80 | (($c >> 12) & 0x3F));
        $str .= (0x80 | (($c >> 6) & 0x3F));
        $str .= (0x80 | ($c & 0x3F));
    }
    return $str;
}
This follows the UTF-8 encoding rules exactly: depending on which Unicode range the code point falls in, the function applies different shifts and bitwise AND/OR operations to produce the UTF-8 bytes. Note that each byte is appended as its decimal value rather than as a raw byte, so the caller still has to turn those values back into bytes. The rules themselves are described at http://www.utf8.org/.
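The same range checks can be sketched in a form that returns the raw bytes directly. This is only an illustration, not the article's function: the name `codepointToUtf8` is hypothetical, it emits bytes with chr() instead of decimal values, and it rejects anything above U+10FFFF (an assumption based on the Unicode range, where the listing above uses 0x200000).

```php
<?php
// Hypothetical helper (not from the original listing): encode one Unicode
// code point as a UTF-8 byte string.
function codepointToUtf8(int $cp): string
{
    if ($cp < 0 || $cp >= 0x110000) {
        throw new InvalidArgumentException("Code point out of range: $cp");
    }
    if ($cp < 0x80) {
        // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(codepointToUtf8(0x4E2D)), "\n"; // prints e4b8ad
```

A Western character such as U+0041 comes out as one byte, while a Chinese character such as U+4E2D comes out as three, which is the difference the rest of the analysis depends on.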
This is the GB-to-UTF-8 conversion function before the modification; it calls the U2utf8 function above.
function Gb2utf8($gb) /* program written by sadly, www.phpx.com */
{
    if (!trim($gb))
        return $gb;
    $filename = "gb2312.txt";
    $tmp = file($filename);
    $codetable = array();
    // Map each GB code (minus 0x8080) to its Unicode value, both kept as hex.
    foreach ($tmp as $value)
        $codetable[hexdec(substr($value, 0, 6))] = substr($value, 7, 6);
    $utf8 = "";
    while ($gb)
    {
        if (ord(substr($gb, 0, 1)) > 127)
        {
            // Chinese character: consume two GB bytes.
            $char = substr($gb, 0, 2);
            $gb = substr($gb, 2, strlen($gb));
            $utf8 .= U2utf8(hexdec($codetable[hexdec(bin2hex($char)) - 0x8080]));
        }
        else
        {
            $gb = substr($gb, 1, strlen($gb));
            $utf8 .= U2utf8(substr($gb, 0, 1));
        }
    }
    // Reassemble the output in fixed groups of three.
    $ret = "";
    for ($i = 0; $i < strlen($utf8); $i += 3)
        $ret .= chr(substr($utf8, $i, 3));
    return $ret;
}
Inside the while loop, each Chinese character is converted to Unicode via the mapping table and then to UTF-8 by the U2utf8 function. However, the while loop is followed by a for loop that reassembles the output in fixed groups of three (per the rules described at http://www.utf8.org/, the UTF-8 encoding of each Chinese character is three bytes), with no allowance for Western characters, whose UTF-8 encoding is a single byte. As a result, whenever the content starts with a Western character, or has Western characters mixed in among the Chinese ones, the "take every three" step falls out of alignment and the output is garbled.
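The misalignment can be shown with a small sketch. It assumes, as the U2utf8 listing suggests, that every UTF-8 byte is stored as decimal digits; the function name `joinEveryThree` and the sample buffers are hypothetical, chosen only for illustration. Every byte of a three-byte UTF-8 sequence is 128 or higher, so it always takes exactly three digits, whereas a Western character does not.

```php
<?php
// Illustration only: rebuild bytes by reading the buffer three characters
// at a time, regardless of what kind of character produced them.
function joinEveryThree(string $digits): string
{
    $out = '';
    for ($i = 0; $i < strlen($digits); $i += 3) {
        $out .= chr((int) substr($digits, $i, 3));
    }
    return $out;
}

// Bytes E4 B8 AD (the UTF-8 form of U+4E2D) written as decimal digits.
$chinese = '228184173';

// Pure Chinese input: the groups line up with the bytes.
$good = joinEveryThree($chinese);
var_dump($good === "\xE4\xB8\xAD");   // bool(true)

// A single Western character in front shifts every following group.
$bad = joinEveryThree('A' . $chinese);
var_dump($bad === "A\xE4\xB8\xAD");   // bool(false)
```

In the second call the groups become "A22", "818", "417" and "3", none of which corresponds to an intended byte.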
Here is the modified function:
function Gb2utf8($gb) /* program written by sadly, modified by Agun */
{
    if (!trim($gb))
        return $gb;
    $filename = "gb2312.txt";
    $tmp = file($filename);
    $codetable = array();
    // Map each GB code (minus 0x8080) to its Unicode value, both kept as hex.
    foreach ($tmp as $value)
        $codetable[hexdec(substr($value, 0, 6))] = substr($value, 7, 6);
The modified function performs all three steps (GB to Unicode, Unicode to UTF-8, and assembling the bytes of one UTF-8 character) inside a single loop. In particular, the assembly step now sits inside the conditional branch that decides whether a character is Western or Chinese, so it takes one byte for a Western character and three bytes for a Chinese one. The results are therefore correct.
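The per-character approach described above might look like the following minimal sketch. It is not the author's listing: the names `bmpToUtf8`, `parseGbTable` and `gbToUtf8` are hypothetical, the table lines are assumed to have the form `0x5650<TAB>0x4E2D` (which is what the substr offsets in the listings imply), and unmapped codes are replaced with `?` as an arbitrary choice.

```php
<?php
// Hypothetical helper: encode a code point below 0x10000 as UTF-8 bytes
// (assumes every GB2312 mapping falls in that range).
function bmpToUtf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xE0 | ($cp >> 12))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Hypothetical helper: build [GB code - 0x8080 => Unicode code point] from
// mapping lines assumed to look like "0x5650\t0x4E2D\t# comment".
function parseGbTable(array $lines): array
{
    $table = [];
    foreach ($lines as $line) {
        if (strncmp($line, '0x', 2) !== 0) {
            continue; // skip comments and blank lines
        }
        $table[hexdec(substr($line, 2, 4))] = hexdec(substr($line, 9, 4));
    }
    return $table;
}

// Convert GB-encoded text to UTF-8, deciding per character how many
// bytes to consume and how many to emit.
function gbToUtf8(string $gb, array $table): string
{
    $out = '';
    $len = strlen($gb);
    $i = 0;
    while ($i < $len) {
        $byte = ord($gb[$i]);
        if ($byte > 127 && $i + 1 < $len) {
            // Chinese character: two GB bytes in, table lookup, UTF-8 out.
            $key = (($byte << 8) | ord($gb[$i + 1])) - 0x8080;
            $out .= isset($table[$key]) ? bmpToUtf8($table[$key]) : '?';
            $i += 2;
        } else {
            // Western character (or a stray final byte): copy it unchanged.
            $out .= $gb[$i];
            $i += 1;
        }
    }
    return $out;
}

$table = parseGbTable(["0x5650\t0x4E2D\t# sample entry"]);
echo bin2hex(gbToUtf8("A\xD6\xD0B", $table)), "\n"; // prints 41e4b8ad42
```

With a real mapping file, the lines could be read with `file('gb2312.txt', FILE_IGNORE_NEW_LINES)` and passed to `parseGbTable`. Because each branch emits its own bytes immediately, no later regrouping step is needed and mixed Western and Chinese text stays aligned.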