When writing PHP code, you often need to convert between Chinese encodings, such as GB2312 <=> Unicode or GB2312 <=> Big5. If PHP was compiled with mbstring, the multibyte string functions can handle part of this conversion. However, many virtual hosts do not support mbstring, or compiling and configuring it is too much trouble, so a lot of PHP code cannot use this family of functions.
Recently, while looking for a solution, I found a good project: PHP News Reader, a web-based news reader for NNTP (RFC 977) that supports reading, posting, deleting, and replying to news articles. The project implements conversion between GB2312, Big5, and Unicode (UTF-8), which is the part I care about.
Use a CVS client (the command line on Linux; TortoiseCVS is recommended on Windows) to check out the project code:
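Whether mbstring is available can be checked at run time before relying on it. The following is a minimal sketch and not part of the project discussed here: the helper name convert_if_possible is hypothetical, and it assumes that the standard mb_convert_encoding and iconv functions are the only built-in converters worth trying.

```php
<?php
// Hypothetical helper: convert $text from $from to $to with whichever
// extension is loaded, or return null when neither is available.
function convert_if_possible( $text, $to, $from ) {
    if( function_exists( 'mb_convert_encoding' ) ) {
        return mb_convert_encoding( $text, $to, $from );
    }
    if( function_exists( 'iconv' ) ) {
        $out = iconv( $from, $to, $text );
        return $out === false ? null : $out;
    }
    return null;  // caller has to fall back to table-based conversion
}

// "\xE4\xB8\xAD" is one Chinese character encoded as UTF-8.
$gb = convert_if_possible( "\xE4\xB8\xAD", 'GB2312', 'UTF-8' );
echo $gb === null ? "no converter available\n" : bin2hex( $gb ) . "\n";
```

A null result is the case the rest of this article addresses: neither extension is present, so the conversion has to be done with lookup tables.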
# cvs -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/pnews login
Logging in to :pserver:anonymous@cvs.sourceforge.net:2401/cvsroot/pnews
CVS password: (Press Enter)
# cvs -z3 -d:pserver:anonymous@cvs.sourceforge.net:/cvsroot/pnews co pnews
cvs server: Updating pnews
…
Look in the pnews/language directory, which contains the following files:
big5-gb.tab
big5-unicode.tab
gb-big5.tab
gb-unicode.tab
unicode-big5.tab
unicode-gb.tab
These are the code tables used for character conversion. Next, look at the pnews/language.inc.php file, which contains several functions for encoding conversion:
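// Big5 => GB
function b2g( $instr ) {
    // Records in the table are 3 bytes apart; the first 2 bytes are the GB code.
    $fp = fopen( 'language/big5-gb.tab', 'r' );
    $len = strlen($instr);
    for( $i = 0 ; $i < $len ; $i++ ) {
        $h = ord($instr[$i]);
        if( $h >= 160 ) {
            $l = ord($instr[$i+1]);
            if( $h == 161 && $l == 64 )
                $gb = '  ';  // two spaces, because both $gb[0] and $gb[1] are used below
            else {
                fseek( $fp, (($h-160)*255+$l-1)*3 );
                $gb = fread( $fp, 2 );
            }
            // Overwrite the Big5 byte pair in place with the GB byte pair.
            $instr[$i] = $gb[0];
            $instr[$i+1] = $gb[1];
            $i++;
        }
    }
    fclose($fp);
    return $instr;
}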
// GB => BIG5
function g2b( $instr ) {
    // Records in the table are 3 bytes apart; the first 2 bytes are the Big5 code.
    $fp = fopen( 'language/gb-big5.tab', 'r' );
    $len = strlen($instr);
    for( $i = 0 ; $i < $len ; $i++ ) {
        $h = ord($instr[$i]);
        if( $h > 160 && $h < 248 ) {
            $l = ord($instr[$i+1]);
            if( $l > 160 && $l < 255 ) {
                fseek( $fp, (($h-161)*94+$l-161)*3 );
                $bg = fread( $fp, 2 );
            }
            else
                $bg = '  ';  // two spaces stand in for an invalid trail byte
            // Overwrite the GB byte pair in place with the Big5 byte pair.
            $instr[$i] = $bg[0];
            $instr[$i+1] = $bg[1];
            $i++;
        }
    }
    fclose($fp);
    return $instr;
}
// Big5 => Unicode(UTF-8)
function b2u( $instr ) {
    $fp = fopen( 'language/big5-unicode.tab', 'r' );
    $len = strlen($instr);
    $outstr = array();  // output bytes, joined on return (a '' here fails on PHP 7.1+)
    for( $i = $x = 0 ; $i < $len ; $i++ ) {
        $h = ord($instr[$i]);
        if( $h >= 160 ) {
            $l = ord($instr[$i+1]);
            if( $h == 161 && $l == 64 )
                $uni = "\x30\x00";  // Big5 0xA140 is the ideographic space, U+3000
            else {
                fseek( $fp, ($h-160)*510+($l-1)*2 );
                $uni = fread( $fp, 2 );
            }
            // The table stores each code point as 2 bytes, high byte first.
            $codenum = ord($uni[0])*256 + ord($uni[1]);
            if( $codenum < 0x800 ) {
                $outstr[$x++] = chr( 192 + ($codenum >> 6) );
                $outstr[$x++] = chr( 128 + $codenum % 64 );
                #printf("[%02X%02X]<br>\n", ord($outstr[$x-2]), ord($outstr[$x-1]) );
            }
            else {
                $outstr[$x++] = chr( 224 + ($codenum >> 12) );
                $codenum %= 4096;
                $outstr[$x++] = chr( 128 + ($codenum >> 6) );
                $outstr[$x++] = chr( 128 + ($codenum % 64) );
                #printf("[%02X%02X%02X]<br>\n", ord($outstr[$x-3]), ord($outstr[$x-2]), ord($outstr[$x-1]) );
            }
            $i++;
        }
        else
            $outstr[$x++] = $instr[$i];
    }
    fclose($fp);
    if( $instr != '' )
        return join( '', $outstr);
}
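The byte arithmetic at the end of b2u is the ordinary UTF-8 encoding of a 16-bit code point. As an isolated illustration, the same step might be written with shifts and masks. The function name codepoint_to_utf8 is hypothetical and does not appear in language.inc.php.

```php
<?php
// Hypothetical helper: encode a code point below 0x10000 as UTF-8.
// Unlike b2u, it also emits the 1-byte form for values below 0x80.
// Surrogates and code points above 0xFFFF are not handled.
function codepoint_to_utf8( $cp ) {
    if( $cp < 0x80 ) {
        return chr( $cp );
    }
    if( $cp < 0x800 ) {
        // 110xxxxx 10xxxxxx
        return chr( 0xC0 | ($cp >> 6) )
             . chr( 0x80 | ($cp & 0x3F) );
    }
    // 1110xxxx 10xxxxxx 10xxxxxx
    return chr( 0xE0 | ($cp >> 12) )
         . chr( 0x80 | (($cp >> 6) & 0x3F) )
         . chr( 0x80 | ($cp & 0x3F) );
}

echo bin2hex( codepoint_to_utf8( 0x4E2D ) ), "\n";  // prints e4b8ad
```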
// Unicode(UTF-8) => BIG5
function u2b( $instr ) {
    $fp = fopen( 'language/unicode-big5.tab', 'r' );
    $len = strlen($instr);
    $outstr = array();  // output bytes, joined on return (a '' here fails on PHP 7.1+)
    for( $i = $x = 0 ; $i < $len ; $i++ ) {
        $b1 = ord($instr[$i]);
        if( $b1 < 0x80 ) {
            $outstr[$x++] = chr($b1);
            #printf( "[%02X]", $b1);
        }
        elseif( $b1 >= 224 ) { # 3 bytes UTF-8
            $b1 -= 224;
            $b2 = ord($instr[$i+1]) - 128;
            $b3 = ord($instr[$i+2]) - 128;
            $i += 2;
            $uc = $b1 * 4096 + $b2 * 64 + $b3 ;
            fseek( $fp, $uc * 2 );
            $bg = fread( $fp, 2 );
            $outstr[$x++] = $bg[0];
            $outstr[$x++] = $bg[1];
            #printf( "[%02X%02X]", ord($bg[0]), ord($bg[1]));
        }
        elseif( $b1 >= 192 ) { # 2 bytes UTF-8
            #printf( "[%02X%02X]", $b1, ord($instr[$i+1]) );
            $b1 -= 192;
            $b2 = ord($instr[$i+1]) - 128;  // second byte of the pair, not the first
            $i++;
            $uc = $b1 * 64 + $b2 ;
            fseek( $fp, $uc * 2 );
            $bg = fread( $fp, 2 );
            $outstr[$x++] = $bg[0];
            $outstr[$x++] = $bg[1];
            #printf( "[%02X%02X]", ord($bg[0]), ord($bg[1]));
        }
    }
    fclose($fp);
    if( $instr != '' ) {
        #echo '##' . $instr . " becomes " . join( '', $outstr) . "<br>\n";
        return join( '', $outstr);
    }
}
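The two multi-byte branches in u2b recover a code point from a 2- or 3-byte UTF-8 sequence and then use it, multiplied by 2, as the offset into unicode-big5.tab. A sketch of the decoding step alone follows. The name utf8_to_codepoint is hypothetical, and well-formed input is assumed.

```php
<?php
// Hypothetical helper: decode one UTF-8 sequence of up to 3 bytes that
// starts at offset $i. Returns array( code point, bytes consumed ).
// Assumes well-formed input; 4-byte sequences are not handled.
function utf8_to_codepoint( $s, $i ) {
    $b1 = ord( $s[$i] );
    if( $b1 < 0x80 ) {
        return array( $b1, 1 );
    }
    if( $b1 >= 0xE0 ) {
        // 1110xxxx 10xxxxxx 10xxxxxx
        $b2 = ord( $s[$i+1] ) & 0x3F;
        $b3 = ord( $s[$i+2] ) & 0x3F;
        return array( (($b1 & 0x0F) << 12) | ($b2 << 6) | $b3, 3 );
    }
    // 110xxxxx 10xxxxxx
    $b2 = ord( $s[$i+1] ) & 0x3F;
    return array( (($b1 & 0x1F) << 6) | $b2, 2 );
}

list( $cp, $n ) = utf8_to_codepoint( "\xE4\xB8\xAD", 0 );
printf( "U+%04X (%d bytes)\n", $cp, $n );  // prints U+4E2D (3 bytes)
```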
// GB => Unicode(UTF-8)
function g2u( $instr ) {
    $fp = fopen( 'language/gb-unicode.tab', 'r' );
    $len = strlen($instr);
    $outstr = array();  // output bytes, joined on return (a '' here fails on PHP 7.1+)
    for( $i = $x = 0 ; $i < $len ; $i++ ) {
        $h = ord($instr[$i]);
        if( $h > 160 ) {
            $l = ord($instr[$i+1]);
            fseek( $fp, ($h-161)*188+($l-161)*2 );
            $uni = fread( $fp, 2 );
            // The table stores each code point as 2 bytes, high byte first.
            $codenum = ord($uni[0])*256 + ord($uni[1]);
            if( $codenum < 0x800 ) {
                $outstr[$x++] = chr( 192 + ($codenum >> 6) );
                $outstr[$x++] = chr( 128 + $codenum % 64 );
                #printf("[%02X%02X]<br>\n", ord($outstr[$x-2]), ord($outstr[$x-1]) );
            }
            else {
                $outstr[$x++] = chr( 224 + ($codenum >> 12) );
                $codenum %= 4096;
                $outstr[$x++] = chr( 128 + ($codenum >> 6) );
                $outstr[$x++] = chr( 128 + ($codenum % 64) );
                #printf("[%02X%02X%02X]<br>\n", ord($outstr[$x-3]), ord($outstr[$x-2]), ord($outstr[$x-1]) );
            }
            $i++;
        }
        else
            $outstr[$x++] = $instr[$i];
    }
    fclose($fp);
    if( $instr != '' )
        return join( '', $outstr);
}
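The seek arithmetic in g2u implies that gb-unicode.tab holds 94 two-byte entries per lead byte, with both bytes counted from 0xA1. A sketch of that offset calculation on its own follows. The name gb_unicode_offset is hypothetical, and the layout is inferred from the code above rather than from any documentation of the table format.

```php
<?php
// Hypothetical helper: byte offset of a GB2312 character's entry in
// gb-unicode.tab, assuming 94 entries per row and 2 bytes per entry.
function gb_unicode_offset( $h, $l ) {
    $row = $h - 0xA1;  // lead byte selects the row
    $col = $l - 0xA1;  // trail byte selects the column
    return ( $row * 94 + $col ) * 2;
}

echo gb_unicode_offset( 0xB0, 0xA1 ), "\n";  // prints 2820
```

This is the same value as ($h-161)*188+($l-161)*2 in g2u, since 188 is 94 entries times 2 bytes.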
// Unicode(UTF-8) => GB
function u2g( $instr ) {
    $fp = fopen( 'language/unicode-gb.tab', 'r' );
    $len = strlen($instr);
    $outstr = array();  // output bytes, joined on return (a '' here fails on PHP 7.1+)
    for( $i = $x = 0 ; $i < $len ; $i++ ) {
        $b1 = ord($instr[$i]);
        if( $b1 < 0x80 ) {
            $outstr[$x++] = chr($b1);
            #printf( "[%02X]", $b1);
        }
        elseif( $b1 >= 224 ) { # 3 bytes UTF-8
            $b1 -= 224;
            $b2 = ord($instr[$i+1]) - 128;
            $b3 = ord($instr[$i+2]) - 128;
            $i += 2;
            $uc = $b1 * 4096 + $b2 * 64 + $b3 ;
            fseek( $fp, $uc * 2 );
            $gb = fread( $fp, 2 );
            $outstr[$x++] = $gb[0];
            $outstr[$x++] = $gb[1];
            #printf( "[%02X%02X]", ord($gb[0]), ord($gb[1]));
        }
        elseif( $b1 >= 192 ) { # 2 bytes UTF-8
            #printf( "[%02X%02X]", $b1, ord($instr[$i+1]) );
            $b1 -= 192;
            $b2 = ord($instr[$i+1]) - 128;  // second byte of the pair, not the first
            $i++;
            $uc = $b1 * 64 + $b2 ;
            fseek( $fp, $uc * 2 );
            $gb = fread( $fp, 2 );
            $outstr[$x++] = $gb[0];
            $outstr[$x++] = $gb[1];
            #printf( "[%02X%02X]", ord($gb[0]), ord($gb[1]));
        }
    }
    fclose($fp);
    if( $instr != '' ) {
        #echo '##' . $instr . " becomes " . join( '', $outstr) . "<br>\n";
        return join( '', $outstr);
    }
}
To do the conversion in your own PHP files, you only need the .tab code table files and the corresponding conversion functions; change the file path passed to fopen in each function to the correct path.
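One way to avoid editing the path in every function is to build it from a single base directory. The following is a minimal sketch, assuming the .tab files sit in a language directory next to the script. The constant TABLE_DIR and the function open_table are hypothetical names, not part of pnews.

```php
<?php
// Hypothetical: directory that holds the .tab files, relative to this file.
define( 'TABLE_DIR', dirname( __FILE__ ) . '/language' );

// Hypothetical helper: open a code table read-only in binary mode,
// or return false if the file does not exist.
function open_table( $name, $dir = TABLE_DIR ) {
    $path = $dir . '/' . $name;
    if( !is_file( $path ) ) {
        return false;
    }
    return fopen( $path, 'rb' );
}

$fp = open_table( 'gb-unicode.tab' );
if( $fp === false ) {
    echo "code table not found\n";
} else {
    fclose( $fp );
}
```

With such a helper, the first line of g2u could become $fp = open_table( 'gb-unicode.tab' ); and only TABLE_DIR would need to change when the tables move.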