Early on, I found a function that converts GB to UTF-8 using a lookup table (gb2312.txt), and used it to output Chinese characters in GD. When the content to be output contains Western characters, the output may be garbled. Later I found a modified version of the code. The two versions of the function are compared below.
First, here is the Unicode to UTF-8 conversion function. This part is the same before and after the modification:
function u2utf8($c)
{
    // Each UTF-8 byte is appended as a decimal number, not as a raw byte;
    // gb2utf8() later turns the digits into bytes with chr().
    $str = "";
    if ($c < 0x80) {
        // 1 byte: 0xxxxxxx
        $str .= $c;
    }
    else if ($c < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        $str .= (0xC0 | ($c >> 6));
        $str .= (0x80 | ($c & 0x3F));
    }
    else if ($c < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        $str .= (0xE0 | ($c >> 12));
        $str .= (0x80 | (($c >> 6) & 0x3F));
        $str .= (0x80 | ($c & 0x3F));
    }
    else if ($c < 0x200000) {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        $str .= (0xF0 | ($c >> 18));
        $str .= (0x80 | (($c >> 12) & 0x3F));
        $str .= (0x80 | (($c >> 6) & 0x3F));
        $str .= (0x80 | ($c & 0x3F));
    }
    return $str;
}
This follows the UTF-8 encoding rules directly: the function determines which Unicode range the code point falls in, then applies the corresponding shifts and bitwise AND operations to produce the UTF-8 encoding. For details about these rules, refer to the instructions on http://www.utf8.org.
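As an illustration of those rules, here is a minimal sketch that returns raw bytes instead of decimal digits. It is not the function from this article: the name codepoint_to_utf8 is hypothetical, and using chr() to build the bytes directly is an assumption made for the example.

```php
<?php
// Hypothetical helper: encode one Unicode code point as raw UTF-8 bytes.
function codepoint_to_utf8(int $cp): string
{
    if ($cp < 0x80) {
        // 1 byte: 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return chr(0xF0 | ($cp >> 18))
         . chr(0x80 | (($cp >> 12) & 0x3F))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

echo bin2hex(codepoint_to_utf8(0x41)), "\n";   // 41     (one byte)
echo bin2hex(codepoint_to_utf8(0xE9)), "\n";   // c3a9   (two bytes)
echo bin2hex(codepoint_to_utf8(0x4E2D)), "\n"; // e4b8ad (three bytes)
```

The Chinese character U+4E2D comes out as three bytes, while the ASCII letter stays a single byte. That difference in length is what matters in the comparison that follows.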
Below is the earlier GB to UTF-8 conversion function, which calls the u2utf8 function above.
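function gb2utf8($gb) /* program written by sadly www.phpx.com */
{
    if (!trim($gb))
        return $gb;
    // Load the lookup table: GB2312 code => Unicode code point (hex string).
    // The offsets assume each line looks like "0x2121 0x3000 ...".
    $filename = "gb2312.txt";
    $tmp = file($filename);
    $codetable = array();
    foreach ($tmp as $value)   // each() no longer exists in PHP 8
        $codetable[hexdec(substr($value, 0, 6))] = substr($value, 7, 6);
    $utf8 = "";
    while (strlen($gb) > 0)
    {
        if (ord(substr($gb, 0, 1)) > 127)
        {
            // Two-byte GB character: look it up, then convert to UTF-8.
            // ($this cannot be used as a variable name, so it is $char here.)
            $char = substr($gb, 0, 2);
            $gb = substr($gb, 2, strlen($gb));
            $utf8 .= u2utf8(hexdec($codetable[hexdec(bin2hex($char)) - 0x8080]));
        }
        else
        {
            // Single-byte (Western) character.
            $gb = substr($gb, 1, strlen($gb));
            $utf8 .= u2utf8(substr($gb, 0, 1));
        }
    }
    // The whole result is regrouped three at a time, Western characters included.
    $ret = "";
    for ($i = 0; $i < strlen($utf8); $i += 3)
        $ret .= chr((int) substr($utf8, $i, 3));
    return $ret;
}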
Inside the while loop, the function converts the Chinese characters to Unicode one by one using the lookup table, then converts each to UTF-8 through u2utf8. After the while loop finishes, however, a separate for loop cuts the entire result into groups of three and turns each group into one byte with chr(). That grouping only lines up for Chinese characters: per the rules at http://www.utf8.org/, each Chinese character is three bytes in UTF-8, and u2utf8 writes every one of those bytes as a three-digit decimal value. Western characters are not taken into account; their UTF-8 encoding is a single byte, and they do not contribute a three-digit piece. So whenever the content contains Western characters, whether at the start or interspersed among the Chinese characters, the "every three" split falls in the wrong places after conversion, leading to garbled characters.
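The misalignment can be reproduced in isolation. The sketch below is only an illustration: regroup_digits is a hypothetical helper that mimics the final for loop, and the second input assumes that a Western character ends up in the intermediate string as a single character rather than as three digits.

```php
<?php
// Hypothetical helper mimicking the final loop of the earlier gb2utf8():
// cut the string into groups of three and turn each group into one byte.
function regroup_digits(string $digits): string
{
    $out = '';
    foreach (str_split($digits, 3) as $group) {
        // The cast turns a non-numeric group into 0; % 256 keeps chr() in range.
        $out .= chr(((int) $group) % 256);
    }
    return $out;
}

// U+4E2D alone: its UTF-8 bytes 0xE4 0xB8 0xAD written as decimal digits.
$chineseOnly = '228184173';
echo bin2hex(regroup_digits($chineseOnly)), "\n"; // e4b8ad

// Assumed mixed case: one Western character in front shifts every group.
$mixed = 'A' . $chineseOnly;
echo bin2hex(regroup_digits($mixed)), "\n";       // no longer contains e4b8ad
```

With Chinese text only, the groups are "228", "184", "173" and the bytes are recovered. With the extra character in front, the groups become "A22", "818", "417", "3", none of which is a byte of the original character.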
The modified function is as follows:
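function gb2utf8($gb) /* program written by sadly, modified by agun */
{
    if (!trim($gb))
        return $gb;
    // Load the lookup table: GB2312 code => Unicode code point (hex string).
    // The offsets assume each line looks like "0x2121 0x3000 ...".
    $filename = "gb2312.txt";
    $tmp = file($filename);
    $codetable = array();
    foreach ($tmp as $value)   // each() no longer exists in PHP 8
        $codetable[hexdec(substr($value, 0, 6))] = substr($value, 7, 6);
    $ret = "";
    $utf8 = "";
    while (strlen($gb) > 0)
    {
        if (ord(substr($gb, 0, 1)) > 127)
        {
            // Two-byte GB character: look it up, convert it, and assemble
            // its bytes right away, three decimal digits per byte.
            // ($this cannot be used as a variable name, so it is $char here.)
            $char = substr($gb, 0, 2);
            $gb = substr($gb, 2, strlen($gb));
            $utf8 = u2utf8(hexdec($codetable[hexdec(bin2hex($char)) - 0x8080]));
            for ($i = 0; $i < strlen($utf8); $i += 3)
                $ret .= chr((int) substr($utf8, $i, 3));
        }
        else
        {
            // Western character: copy the single byte unchanged.
            $ret .= substr($gb, 0, 1);
            $gb = substr($gb, 1, strlen($gb));
        }
    }
    return $ret;
}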
The modified function performs all three steps (GB to Unicode, Unicode to UTF-8, and assembling the bytes of each UTF-8 character) inside a single loop. In particular, the assembly step now sits inside the conditional branch that decides whether the character is Western or Chinese, so that branch determines whether to take one byte or three. The result is correct!
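The single-loop idea can be tried without gb2312.txt. The following is a sketch under stated assumptions, not the author's code: DEMO_TABLE is a hypothetical two-entry stand-in for the lookup table, bmp_to_utf8 and gb_to_utf8_sketch are hypothetical names, and the bytes are built directly with chr() rather than through decimal digits.

```php
<?php
// Hypothetical stand-in for gb2312.txt: GB2312 code (two bytes minus 0x8080)
// => Unicode code point. Only two entries, for illustration.
const DEMO_TABLE = [
    0x5650 => 0x4E2D, // GB bytes D6 D0
    0x4E44 => 0x6587, // GB bytes CE C4
];

// Hypothetical helper: encode a code point below 0x10000 as raw UTF-8 bytes.
function bmp_to_utf8(int $cp): string
{
    if ($cp < 0x80) {
        return chr($cp);
    }
    if ($cp < 0x800) {
        return chr(0xC0 | ($cp >> 6)) . chr(0x80 | ($cp & 0x3F));
    }
    return chr(0xE0 | ($cp >> 12))
         . chr(0x80 | (($cp >> 6) & 0x3F))
         . chr(0x80 | ($cp & 0x3F));
}

// Decide per character whether to copy one byte or emit a converted sequence.
function gb_to_utf8_sketch(string $gb): string
{
    $out = '';
    $len = strlen($gb);
    $i = 0;
    while ($i < $len) {
        $byte = ord($gb[$i]);
        if ($byte > 127 && $i + 1 < $len) {
            // Two-byte GB character: map it, then emit its UTF-8 bytes at once.
            $code = (($byte << 8) | ord($gb[$i + 1])) - 0x8080;
            $out .= bmp_to_utf8(DEMO_TABLE[$code] ?? 0x3F); // "?" if unmapped
            $i += 2;
        } else {
            // Western (ASCII) character: copy the single byte unchanged.
            $out .= $gb[$i];
            $i += 1;
        }
    }
    return $out;
}

// "A", a Chinese character, "b", another Chinese character, as GB bytes.
$input = "A" . "\xD6\xD0" . "b" . "\xCE\xC4";
echo bin2hex(gb_to_utf8_sketch($input)), "\n"; // 41e4b8ad62e69687
```

Because each character is finished inside its own branch, the Western bytes 41 and 62 pass through untouched between the two three-byte sequences.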