Early on, I found a lookup table (gb2312.txt) for converting GB to UTF-8 and used it to output Chinese characters in GD. When the content to be output contains Western (single-byte) characters, the output may be garbled. Later, I found a modified version of the code that solved the problem. The two functions are compared below.
First, here is the Unicode to UTF-8 conversion function. This part is the same before and after the modification:
function u2utf8($c)
{
    // Each byte value is appended as decimal text (for example "228"),
    // not as a raw byte; the caller turns the numbers into bytes with chr().
    $str = "";
    if ($c < 0x80) {
        // one byte: 0xxxxxxx
        $str .= $c;
    }
    else if ($c < 0x800) {
        // two bytes: 110xxxxx 10xxxxxx
        $str .= (0xC0 | $c >> 6);
        $str .= (0x80 | $c & 0x3F);
    }
    else if ($c < 0x10000) {
        // three bytes: 1110xxxx 10xxxxxx 10xxxxxx
        $str .= (0xE0 | $c >> 12);
        $str .= (0x80 | $c >> 6 & 0x3F);
        $str .= (0x80 | $c & 0x3F);
    }
    else if ($c < 0x200000) {
        // four bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        $str .= (0xF0 | $c >> 18);
        $str .= (0x80 | $c >> 12 & 0x3F);
        $str .= (0x80 | $c >> 6 & 0x3F);
        $str .= (0x80 | $c & 0x3F);
    }
    return $str;
}
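This function follows the UTF-8 encoding rules exactly: it checks which Unicode range the character falls in, then applies the corresponding shifts and bitwise AND operations to produce the UTF-8 encoding. Note that it appends each byte value as a decimal number rather than as a raw byte, so the caller still has to turn those numbers into bytes. For details about the encoding rules, refer to the instructions on http://www.utf8.org.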
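As an illustration of the same range checks, here is a minimal sketch that returns real UTF-8 bytes directly by wrapping each value in chr(). The name codepoint_to_utf8 is hypothetical and not part of the original code, and the sketch does not validate its input (surrogates and values above 0x10FFFF are not rejected).

```php
<?php
// Hypothetical helper (not from the original code): encode one Unicode
// code point as raw UTF-8 bytes instead of decimal digit text.
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0x80) {
        // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(codepoint_to_utf8(0x41)), "\n";   // 41
echo bin2hex(codepoint_to_utf8(0xE9)), "\n";   // c3a9
echo bin2hex(codepoint_to_utf8(0x4E2D)), "\n"; // e4b8ad
```

The explicit parentheses around the shifts and masks are only for readability; PHP's operator precedence gives the same result without them.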
Below is the original GB to UTF-8 conversion function, which calls the u2utf8 function above.
function gb2utf8($gb) /* Program written by sadly www.phpx.com */
{
    if (!trim($gb))
        return $gb;
    $filename = "gb2312.txt";
    $tmp = file($filename);
    $codetable = array();
    // Build the GB => Unicode table. The offsets assume lines of the form
    // "0x2121 0x3000 ...": GB code in columns 0-5, Unicode value in columns 7-12.
    foreach ($tmp as $value)
        $codetable[hexdec(substr($value, 0, 6))] = substr($value, 7, 6);
    $utf8 = "";
    while ($gb)
    {
        if (ord(substr($gb, 0, 1)) > 127)
        {
            // Two-byte GB character: look up its Unicode value in the table.
            $pair = substr($gb, 0, 2);
            $gb = substr($gb, 2, strlen($gb));
            $utf8 .= u2utf8(hexdec($codetable[hexdec(bin2hex($pair)) - 0x8080]));
        }
        else
        {
            // Single-byte (Western) character.
            // Note: the byte is removed from $gb before it is read.
            $gb = substr($gb, 1, strlen($gb));
            $utf8 .= u2utf8(substr($gb, 0, 1));
        }
    }
    // Cut the accumulated string every three characters; each group becomes one byte.
    $ret = "";
    for ($i = 0; $i < strlen($utf8); $i += 3)
        $ret .= chr(substr($utf8, $i, 3));
    return $ret;
}
In the while loop, the function converts Chinese characters to Unicode one by one using the lookup table, then passes each one to u2utf8. After the while loop ends, however, a separate for loop cuts the whole accumulated string into fixed groups of three characters and passes each group to chr(). This works for pure Chinese text: a Chinese character takes three bytes in UTF-8 (see the rules on http://www.utf8.org/), and u2utf8 writes each of those bytes as a three-digit decimal number. Western characters are not taken into account: a Western character is a single byte in UTF-8 and does not fit the three-character grouping. Therefore, if the content contains Western characters, whether at the beginning or interspersed among the Chinese characters, the fixed "every three" cut falls in the wrong places after the conversion, which produces garbled output.
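The misalignment can be reproduced in isolation. The sketch below does not call the original functions; it assumes the Chinese character U+4E2D arrives as the decimal text of its three UTF-8 bytes (0xE4, 0xB8, 0xAD) and that a Western letter reaches the same string as a single character. The variable name $chinese is for illustration only.

```php
<?php
// u2utf8-style output for U+4E2D: the bytes 0xE4 0xB8 0xAD as decimal text.
$chinese = "228" . "184" . "173";

// Cutting every three characters recovers one byte value per group.
print_r(str_split($chinese, 3));       // 228, 184, 173

// A single Western character in front shifts every later group.
print_r(str_split("A" . $chinese, 3)); // A22, 818, 417, 3
```

In the second case none of the groups is a byte value that u2utf8 produced, so the result of passing them to chr() cannot be the intended text.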
The modified function is as follows:
function gb2utf8($gb) /* Program written by sadly modified by agun */
{
    if (!trim($gb))
        return $gb;
    $filename = "gb2312.txt";
    $tmp = file($filename);
    $codetable = array();
    // Build the GB => Unicode table. The offsets assume lines of the form
    // "0x2121 0x3000 ...": GB code in columns 0-5, Unicode value in columns 7-12.
    foreach ($tmp as $value)
        $codetable[hexdec(substr($value, 0, 6))] = substr($value, 7, 6);
    $ret = "";
    $utf8 = "";
    while ($gb)
    {
        if (ord(substr($gb, 0, 1)) > 127)
        {
            // Two-byte GB character: convert it, then immediately turn its
            // three-digit groups into bytes.
            $pair = substr($gb, 0, 2);
            $gb = substr($gb, 2, strlen($gb));
            $utf8 = u2utf8(hexdec($codetable[hexdec(bin2hex($pair)) - 0x8080]));
            for ($i = 0; $i < strlen($utf8); $i += 3)
                $ret .= chr(substr($utf8, $i, 3));
        }
        else
        {
            // Single-byte (Western) character: copy it through unchanged.
            $ret .= substr($gb, 0, 1);
            $gb = substr($gb, 1, strlen($gb));
        }
    }
    return $ret;
}
The modified function performs all three steps (GB to Unicode, Unicode to UTF-8, and assembling the pieces into a UTF-8 character) inside a single loop. In particular, the assembly step now sits inside the branch that tests whether the character is Western or Chinese, so that branch determines whether one byte is copied or three bytes are assembled. The result is correct!
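The single-loop structure can be sketched without the gb2312.txt file. The example below is an illustration under stated assumptions, not the author's code: DEMO_TABLE is a hypothetical two-entry table keyed by the raw two-byte GB code (so no 0x8080 offset is subtracted), demo_gb_to_utf8 is a made-up name, and only code points in the three-byte UTF-8 range are handled.

```php
<?php
// Hypothetical two-entry table standing in for gb2312.txt:
// raw two-byte GB2312 code => Unicode code point.
const DEMO_TABLE = [0xD6D0 => 0x4E2D, 0xCEC4 => 0x6587];

// Sketch only: handles single bytes and table entries that need three UTF-8 bytes.
function demo_gb_to_utf8(string $gb): string
{
    $out = "";
    $len = strlen($gb);
    for ($i = 0; $i < $len; ) {
        $byte = ord($gb[$i]);
        if ($byte > 127 && $i + 1 < $len) {
            // Two-byte GB character: look up the code point, emit three bytes.
            $code = ($byte << 8) | ord($gb[$i + 1]);
            $cp = DEMO_TABLE[$code] ?? 0xFFFD; // unknown codes become U+FFFD
            $out .= chr(0xE0 | ($cp >> 12))
                  . chr(0x80 | (($cp >> 6) & 0x3F))
                  . chr(0x80 | ($cp & 0x3F));
            $i += 2;
        } else {
            // Single-byte character (or a stray trailing byte): copy it unchanged.
            $out .= $gb[$i];
            $i += 1;
        }
    }
    return $out;
}

echo bin2hex(demo_gb_to_utf8("a\xD6\xD0b\xCE\xC4")), "\n"; // 61e4b8ad62e69687
```

Because each character is finished inside its own branch, a Western character between two Chinese characters no longer disturbs the bytes on either side.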