Dynamic conversion of GB encoding to UTF-8 encoding with PHP

Source: Internet
Author: User
Tags foreach explode fread mysql query ord strlen trim mysql database

Using the IP2ADDR function to read the IP database files directly is the most efficient, and SQL query is the most efficient one, compared with the MySQL database to store the IP data. However, the IP database file QQWry.dat is GB2312 encoded. Now I need UTF-8 coded geographic results. If you use the MySQL method, you can convert the data to the UTF-8 code when it is stored in the database. But QQWry.dat file can not be modified, only the output of the IP2ADDR function can be dynamically converted.

There are at least four ways to dynamically convert GB->UTF-8 encoding:

Iconv extension Conversion with PHP

mb_string extension Conversion with PHP

Swap table to store in MySQL database

Swap table to store in a text file

The first two methods are required for the server to be set up (the appropriate extensions are compiled and installed) to be used. My virtual host does not have these two extensions, I have to consider the latter two methods. The first two methods are not evaluated in this article.

The evaluation procedure is as follows (func_ip.php see "Assessment of IP address-> Geographical transformation"):

<?php
Require_once ("func_ip.php");
function U2utf8 ($c) {
$str = "";
if ($c < 0x80) {
$str. = $c;
} elseif ($c < 0x800) {
$str. = Chr (0xc0 | $c >> 6);
$str. = Chr (0x80 | $c & 0x3F);
} elseif ($c < 0x10000) {
$str. = Chr (0xe0 | $c >> 12);
$str. = Chr (0x80 | $c >> 6 & 0x3F);
$str. = Chr (0x80 | $c & 0x3F);
} elseif ($c < 0x200000) {
$str. = Chr (0xF0 | $c >> 18);
$str. = Chr (0x80 | $c >> & 0x3F);
$str. = Chr (0x80 | $c >> 6 & 0x3F);
$str. = Chr (0x80 | $c & 0x3F);
}
return $str;
}
function Gb2utf8_sql ($strGB) {
if (!trim ($strGB)) return $strGB;
$strRet = "";
$intLen = strlen ($strGB);
for ($i = 0; $i < $intLen; $i + +) {
if (Ord ($strGB {$i}) > 127) {
$strCurr = substr ($strGB, $i, 2);
$intGB = Hexdec (Bin2Hex ($strCurr))-0x8080;
$STRSQL = "Select Code_unicode from Nnstats_gb_unicode
WHERE CODE_GB = ". $intGB." LIMIT 1 "
;
$resResult = mysql_query ($STRSQL);
if ($arrCode = mysql_fetch_array ($resResult)) $strRet. = U2utf8 ($arrCode ["Code_unicode"]);
else $strRet. = "???";
$i + +;
} else {
$strRet. = $strGB {$i};
}
}
return $strRet;
}
function Gb2utf8_file ($strGB) {
if (!trim ($strGB)) return $strGB;
$arrLines = File ("Gb_unicode.txt");
foreach ($arrLines as $strLine) {
$arrCodeTable [Hexdec (substr ($strLine, 0, 6)] = Hexdec (substr ($strLine, 7, 6));
}
$strRet = "";
$intLen = strlen ($strGB);
for ($i = 0; $i < $intLen; $i + +) {
if (Ord ($strGB {$i}) > 127) {
$strCurr = substr ($strGB, $i, 2);
$intGB = Hexdec (Bin2Hex ($strCurr))-0x8080;
if ($arrCodeTable [$intGB]) $strRet. = U2utf8 ($arrCodeTable [$intGB]);
else $strRet. = "???";
$i + +;
} else {
$strRet. = $strGB {$i};
}
}
return $strRet;
}
function Encodeip ($strDotquadIp) {
$arrIpSep = Explode ('. ', $strDotquadIp);
if (count ($ARRIPSEP)!= 4) return 0;
$intIp = 0;
foreach ($arrIpSep as $k => $v) $intIp + + (int) $v * POW (256, 3-$k);
return $intIp;
return sprintf ('%02x%02x%02x%02x ', $arrIpSep [0], $arrIpSep [1], $ARRIPSEP [2], $ARRIPSEP [3]);
}
function Getmicrotime () {
List ($msec, $sec) = Explode ("", Microtime ());
Return (double) $msec + (double) $sec);
}
for ($i = 0; $i < $i + +) {//randomly generate 100 IP addresses
$strIp = Mt_rand (0, 255). ".". Mt_rand (0, 255). ".". Mt_rand (0, 255). ".". Mt_rand (0, 255);
$arrAddr [$i] = Ip2addr (Encodeip ($strIp));
}
$resConn = mysql_connect ("localhost", "netnest", "netnest");
mysql_select_db ("test");
Encoding conversion for evaluating MySQL queries
$dblTimeStart = Getmicrotime ();
for ($i = 0; $i < $i + +) {
$strUTF 8Region = Gb2utf8_sql ($arrAddr [$i] ["region"]);
$strUTF 8Address = Gb2utf8_sql ($arrAddr [$i] ["address"]);
}
$dblTimeDuration = Getmicrotime ()-$dblTimeStart;
Evaluation end and output results
Echo $dblTimeDuration; echo "\ r \ n";
Encoding conversion for evaluating text file queries
$dblTimeStart = Getmicrotime ();
for ($i = 0; $i < $i + +) {
$strUTF 8Region = Gb2utf8_file ($arrAddr [$i] ["region"]);
$strUTF 8Address = Gb2utf8_file ($arrAddr [$i] ["address"]);
}
$dblTimeDuration = Getmicrotime ()-$dblTimeStart;
Evaluation end and output results
Echo $dblTimeDuration; echo "\ r \ n";
?>

Evaluate two results (accurate to 3 decimal places, in seconds):

MySQL Query conversion: 0.112
Text Query conversions: 10.590

MySQL Query conversion: 0.099
Text Query conversions: 10.623

This is the MySQL method is far ahead of the file query method. But it's not a rush to use the MySQL method, because the text file method is so time-consuming, mainly because it reads the entire gb_unicode.txt into memory every time it transforms, and Gb_unicode.txt is a text file, in the following format:

0x2121 0x3000 # ideographic Space
0x2122 0x3001 # ideographic COMMA
0x2123 0x3002 # ideographic Full STOP
0x2124 0X30FB # Katakana Middle DOT
0x2125 0X02C9 # MODIFIER Letter Macron (Mandarin Chinese-a-tone)
......
0x552a 0x6458 # <CJK>
0x552b 0x658b # <CJK>
0X552C 0x5b85 # <CJK>
0x552d 0x7a84 # <CJK>
......
0x777b 0x9f37 # <CJK>
0x777c 0X9F3D # <CJK>
0x777d 0x9f3e # <CJK>
0x777e 0x9f44 # <CJK>

The text file is inefficient, so consider converting the text file into a binary file and then using the binary method to find the file without having to read the entire file into memory. File format is: File first 2 bytes, store the number of records, followed by a record into the file, each record 4 bytes, the first 2 bytes corresponding to GB code, followed by 2 bytes of Unicode code. The conversion process is as follows:

<?php
$arrLines = File ("Gb_unicode.txt");
foreach ($arrLines as $strLine) {
$arrCodeTable [Hexdec (substr ($strLine, 0, 6)] = Hexdec (substr ($strLine, 7, 6));
}
Ksort ($arrCodeTable);
$intCount = count ($arrCodeTable);
$strCount = chr ($intCount% 256). Chr (Floor ($intCount/256));
$fileGBU = fopen ("Gbu.dat", "WB");
Fwrite ($fileGBU, $strCount);
foreach ($arrCodeTable as $k => $v) {
$strData = chr ($k% 256). Chr (Floor ($K/256)). Chr ($v% 256). Chr (Floor ($V/256));
Fwrite ($fileGBU, $strData);
}
Fclose ($fileGBU);
?>

After executing the program, we get the binary Gb->unicode table Gbu.dat, and the data records are sorted according to the GB code, which is convenient for binary method to find. The function of transcoding using Gbu.dat is as follows:

function Gb2utf8_file1 ($strGB) {
if (!trim ($strGB)) return $strGB;
$fileGBU = fopen ("Gbu.dat", "RB");
$strBuf = Fread ($fileGBU, 2);
$intCount = Ord ($strBuf {0}) + 256 * ORD ($strBuf {1});
$strRet = "";
$intLen = strlen ($strGB);
for ($i = 0; $i < $intLen; $i + +) {
if (Ord ($strGB {$i}) > 127) {
$strCurr = substr ($strGB, $i, 2);
$intGB = Hexdec (Bin2Hex ($strCurr))-0x8080;
$intStart = 1;
$intEnd = $intCount;
while ($intStart < $intEnd-1) {//binary method lookup
$intMid = Floor (($intStart + $intEnd)/2);
$intOffset = 2 + 4 * ($intMid-1);
Fseek ($fileGBU, $intOffset);
$strBuf = Fread ($fileGBU, 2);
$intCode = Ord ($strBuf {0}) + 256 * ORD ($strBuf {1});
if ($intGB = = $intCode) {
$intStart = $intMid;
Break
}
if ($intGB > $intCode) $intStart = $intMid;
else $intEnd = $intMid;
}
$intOffset = 2 + 4 * ($intStart-1);
Fseek ($fileGBU, $intOffset);
$strBuf = Fread ($fileGBU, 2);
$intCode = Ord ($strBuf {0}) + 256 * ORD ($strBuf {1});
if ($intGB = = $intCode) {
$strBuf = Fread ($fileGBU, 2);
$intCodeU = Ord ($strBuf {0}) + 256 * ORD ($strBuf {1});
$strRet. = U2utf8 ($intCodeU);
} else {
$strRet. = "???";
}
$i + +;
} else {
$strRet. = $strGB {$i};
}
}
return $strRet;
}

Add it to the original evaluation procedure, three methods to evaluate the data at the same time 2 times (accurate to 3 decimal places, units: seconds):

MySQL method: 0.125
Text File Method: 10.873
Binary file binary method: 0.106

MySQL method: 0.102
Text File Method: 10.677
Binary file binary method: 0.092

Visible binary file binary method also has a slight advantage over the MySQL method. But the above assessment is to the short geographical location of the transfer code, if the longer text transcoding? I found 5 Blog RSS 2.0 files, are GB2312 encoding. Measuring the time spent on 5 file encodings by three methods, and 2 measured data as follows (accurate to 3 decimal places, in seconds):

MySQL method: 7.206
Text File Method: 0.772
Binary file binary method: 5.022

MySQL method: 7.440
Text File Method: 0.766
Binary file binary method: 5.055

Visible to long text is the best method of text file, because the Transfer Code table read into memory, transcoding can be very efficient. In this case, we can also try to improve the text file method to read: Transcoding table from the binary file Gbu.dat into memory, rather than text files. The evaluation data are as follows (accuracy and unit ibid.):

Read from a text file: 0.766
Read from binary file table: 0.831

Read from a text file: 0.774
Read from binary file table: 0.833

Indicates that this improvement has failed, and reading the transcoding table from a text file is more efficient.

Summary: PHP to the GB encoding to UTF-8 encoding dynamic conversion, if the text of each conversion is very small, it is appropriate to use binary files combined with the binary method conversion; If the text of each conversion is large, it is appropriate to store the transcoding table with a text file and read it into memory at once before conversion.



Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.