Evaluation of several methods for dynamic conversion of GB encoding to UTF-8 in PHP

Source: Internet
Author: User

The article "IP address -> geographic location conversion evaluation" found that reading the IP database file directly with the ip2addr function is the most efficient approach, more efficient than storing the IP data in MySQL and querying it with SQL. However, the IP database file QQWry.dat is GB2312 encoded, and I now need the geographic location results in UTF-8. With the MySQL approach, the data could be converted to UTF-8 once, when it is loaded into the database. The QQWry.dat file cannot be modified, though, so the output of the ip2addr function has to be converted dynamically.

There are at least four ways to convert GB -> UTF-8 dynamically:

Convert with PHP's iconv extension

Convert with PHP's mbstring extension

Convert with a code table stored in a MySQL database

Convert with a code table stored in a text file

The first two methods require the server to be set up for them (the corresponding extension compiled and installed). My virtual host has neither extension, so I can only consider the last two methods. The first two methods are not evaluated in this article.
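For reference, on a server where the extensions are available, each of the first two methods comes down to a single call. The following is a minimal sketch, not part of the benchmark: the wrapper name gbToUtf8Ext is hypothetical, and it returns null when neither extension is loaded.

```php
<?php
// Hypothetical wrapper: convert a GB2312 string to UTF-8 with whichever
// extension is available; returns null when neither can be used.
function gbToUtf8Ext($strGB) {
    if (function_exists('iconv')) {
        // iconv(from_encoding, to_encoding, string)
        $result = @iconv('GB2312', 'UTF-8', $strGB);
        if ($result !== false) {
            return $result;
        }
    }
    if (function_exists('mb_convert_encoding')) {
        // mb_convert_encoding(string, to_encoding, from_encoding)
        return mb_convert_encoding($strGB, 'UTF-8', 'GB2312');
    }
    return null;
}

// "\xD6\xD0" is the GB2312 byte pair for U+4E2D; with either extension
// loaded the result is the UTF-8 sequence "\xE4\xB8\xAD".
$utf8 = gbToUtf8Ext("\xD6\xD0");
```

Note that the two functions take the source and target encodings in opposite order.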

The benchmark program is as follows (for func_ip.php, see the article "IP address -> geographic location conversion evaluation"):

<?php
require_once("func_ip.php");

// Encode a Unicode code point as a UTF-8 byte string.
function u2utf8($c) {
    $c = (int) $c;
    $str = "";
    if ($c < 0x80) {
        $str .= chr($c); // single byte (ASCII range)
    } elseif ($c < 0x800) {
        $str .= chr(0xC0 | ($c >> 6));
        $str .= chr(0x80 | ($c & 0x3F));
    } elseif ($c < 0x10000) {
        $str .= chr(0xE0 | ($c >> 12));
        $str .= chr(0x80 | (($c >> 6) & 0x3F));
        $str .= chr(0x80 | ($c & 0x3F));
    } elseif ($c < 0x200000) {
        $str .= chr(0xF0 | ($c >> 18));
        $str .= chr(0x80 | (($c >> 12) & 0x3F));
        $str .= chr(0x80 | (($c >> 6) & 0x3F));
        $str .= chr(0x80 | ($c & 0x3F));
    }
    return $str;
}

// Convert a GB2312 string to UTF-8, looking up each double-byte character
// in a MySQL table. Uses the legacy ext/mysql API (removed in PHP 7.0),
// as in the original benchmark.
function GB2UTF8_SQL($strGB) {
    if (!trim($strGB)) return $strGB;
    $strRet = "";
    $intLen = strlen($strGB);
    for ($i = 0; $i < $intLen; $i++) {
        if (ord($strGB[$i]) > 127) {
            // Double-byte character: strip the high bit of both bytes to get the table key
            $strCurr = substr($strGB, $i, 2);
            $intGB = hexdec(bin2hex($strCurr)) - 0x8080;
            $strSql = "SELECT code_unicode FROM nnstats_gb_unicode
                WHERE code_gb = " . $intGB . " LIMIT 1";
            $resResult = mysql_query($strSql);
            if ($arrCode = mysql_fetch_array($resResult)) $strRet .= u2utf8($arrCode["code_unicode"]);
            else $strRet .= "??";
            $i++; // skip the second byte of the pair
        } else {
            $strRet .= $strGB[$i];
        }
    }
    return $strRet;
}

// Convert a GB2312 string to UTF-8 using the text table gb_unicode.txt.
// The whole table is read and parsed on every call.
function GB2UTF8_FILE($strGB) {
    if (!trim($strGB)) return $strGB;
    $arrLines = file("gb_unicode.txt");
    $arrCodeTable = array();
    foreach ($arrLines as $strLine) {
        // Line layout: "0xGGGG 0xUUUU # name"; skip the "0x" prefixes
        $arrCodeTable[hexdec(substr($strLine, 2, 4))] = hexdec(substr($strLine, 9, 4));
    }
    $strRet = "";
    $intLen = strlen($strGB);
    for ($i = 0; $i < $intLen; $i++) {
        if (ord($strGB[$i]) > 127) {
            $strCurr = substr($strGB, $i, 2);
            $intGB = hexdec(bin2hex($strCurr)) - 0x8080;
            if (isset($arrCodeTable[$intGB])) $strRet .= u2utf8($arrCodeTable[$intGB]);
            else $strRet .= "??";
            $i++; // skip the second byte of the pair
        } else {
            $strRet .= $strGB[$i];
        }
    }
    return $strRet;
}

// Convert a dotted-quad IP address to its integer value (0 if malformed).
function EncodeIp($strDotquadIp) {
    $arrIpSep = explode('.', $strDotquadIp);
    if (count($arrIpSep) != 4) return 0;
    $intIp = 0;
    foreach ($arrIpSep as $k => $v) $intIp += (int) $v * pow(256, 3 - $k);
    return $intIp;
    // return sprintf('%02x%02x%02x%02x', $arrIpSep[0], $arrIpSep[1], $arrIpSep[2], $arrIpSep[3]);
}

// Current time in seconds, with the microsecond fraction.
function GetMicroTime() {
    list($msec, $sec) = explode(" ", microtime());
    return ((double) $msec + (double) $sec);
}

for ($i = 0; $i < 100; $i++) { // randomly generate 100 IP addresses
    $strIp = mt_rand(0, 255) . "." . mt_rand(0, 255) . "." . mt_rand(0, 255) . "." . mt_rand(0, 255);
    $arrAddr[$i] = ip2addr(EncodeIp($strIp));
}
$resConn = mysql_connect("localhost", "netnest", "netnest");
mysql_select_db("test");

// Benchmark the conversion that uses MySQL lookups
$dblTimeStart = GetMicroTime();
for ($i = 0; $i < 100; $i++) {
    $strUTF8Region = GB2UTF8_SQL($arrAddr[$i]["region"]);
    $strUTF8Address = GB2UTF8_SQL($arrAddr[$i]["address"]);
}
$dblTimeDuration = GetMicroTime() - $dblTimeStart;
// Benchmark finished; print the result
echo $dblTimeDuration; echo "\r\n";

// Benchmark the conversion that uses text file lookups
$dblTimeStart = GetMicroTime();
for ($i = 0; $i < 100; $i++) {
    $strUTF8Region = GB2UTF8_FILE($arrAddr[$i]["region"]);
    $strUTF8Address = GB2UTF8_FILE($arrAddr[$i]["address"]);
}
$dblTimeDuration = GetMicroTime() - $dblTimeStart;
// Benchmark finished; print the result
echo $dblTimeDuration; echo "\r\n";
?>
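The GetMicroTime helper dates from before microtime(true), which returns the current time as a float directly. As a sketch only, the timing loops could be wrapped in a small helper; the name timeCalls is hypothetical and does not appear in the original program.

```php
<?php
// Hypothetical helper: call $fn once for every element of $inputs and
// return the elapsed wall-clock time in seconds.
function timeCalls($fn, array $inputs) {
    $start = microtime(true);
    foreach ($inputs as $input) {
        call_user_func($fn, $input);
    }
    return microtime(true) - $start;
}

// Stand-in workload; in the benchmark $fn would be 'GB2UTF8_SQL' or
// 'GB2UTF8_FILE' and $inputs the region and address strings.
$elapsed = timeCalls('strtoupper', array('region', 'address'));
printf("%.3f\n", $elapsed);
```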

Results of two runs (to 3 decimal places, in seconds):

MySQL lookup conversion: 0.112
Text file lookup conversion: 10.590

MySQL lookup conversion: 0.099
Text file lookup conversion: 10.623

In this test the MySQL lookup is far ahead of the text file lookup. Note that GB2UTF8_FILE reads and parses the whole table on every call. The table gb_unicode.txt is a plain text file laid out as follows:

0x2121 0x3000 # IDEOGRAPHIC SPACE
0x2122 0x3001 # IDEOGRAPHIC COMMA
0x2123 0x3002 # IDEOGRAPHIC FULL STOP
0x2124 0x30FB # KATAKANA MIDDLE DOT
0x2125 0x02C9 # MODIFIER LETTER MACRON (Mandarin Chinese first tone)
......
0x552A 0x6458 #
0x552B 0x658B #
0x552C 0x5B85 #
0x552D 0x7A84 #
......
0x777B 0x9F37 #
0x777C 0x9F3D #
0x777D 0x9F3E #
0x777E 0x9F44 #
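Each line maps a table key (the two-byte GB code with the high bit of both bytes cleared) to a Unicode code point. The sketch below shows one way to parse a line and to compute the key for a byte pair; the function names parseTableLine and gbKey are hypothetical, and the regular expression assumes the layout shown above.

```php
<?php
// Parse one table line, e.g. "0x2121 0x3000 # IDEOGRAPHIC SPACE", into
// array(tableKey, unicodeCodePoint). Returns null for other lines.
function parseTableLine($strLine) {
    if (!preg_match('/^0x([0-9A-Fa-f]{4})\s+0x([0-9A-Fa-f]{4})/', $strLine, $m)) {
        return null;
    }
    return array(hexdec($m[1]), hexdec($m[2]));
}

// Turn a two-byte GB2312 sequence (both bytes have the high bit set)
// into the table key by subtracting 0x8080.
function gbKey($strPair) {
    return hexdec(bin2hex($strPair)) - 0x8080;
}

list($key, $codePoint) = parseTableLine("0x2121 0x3000 # IDEOGRAPHIC SPACE");
printf("0x%04X -> U+%04X\n", $key, $codePoint); // 0x2121 -> U+3000
printf("0x%04X\n", gbKey("\xA1\xA1"));          // 0x2121
```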

The text file is inefficient to load, so the next step is to convert it to a binary file and look codes up with a binary search (half-interval search), without reading the whole file into memory. The file format is as follows: a 2-byte header stores the number of records, followed by the records one after another. Each record is 4 bytes: the first 2 bytes hold the GB code and the last 2 bytes hold the Unicode code point, each stored low byte first. The conversion program is as follows:

<?php
$arrLines = file("gb_unicode.txt");
$arrCodeTable = array();
foreach ($arrLines as $strLine) {
    // Line layout: "0xGGGG 0xUUUU # name"; skip the "0x" prefixes
    $arrCodeTable[hexdec(substr($strLine, 2, 4))] = hexdec(substr($strLine, 9, 4));
}
ksort($arrCodeTable); // sort by GB code so the file can be binary-searched
$intCount = count($arrCodeTable);
// 2-byte header: record count, low byte first
$strCount = chr($intCount % 256) . chr($intCount >> 8);
$fileGBU = fopen("gbu.dat", "wb");
fwrite($fileGBU, $strCount);
foreach ($arrCodeTable as $k => $v) {
    // 4-byte record: GB code then Unicode code point, each low byte first
    $strData = chr($k % 256) . chr($k >> 8) . chr($v % 256) . chr($v >> 8);
    fwrite($fileGBU, $strData);
}
fclose($fileGBU);
?>
Running the program produces gbu.dat, the binary GB -> Unicode table, with the records sorted by GB code so that they can be searched. The function that transcodes using gbu.dat is as follows:

function GB2UTF8_FILE1($strGB) {
    if (!trim($strGB)) return $strGB;
    $fileGBU = fopen("gbu.dat", "rb");
    $strBuf = fread($fileGBU, 2);
    $intCount = ord($strBuf[0]) + 256 * ord($strBuf[1]); // number of records
    $strRet = "";
    $intLen = strlen($strGB);
    for ($i = 0; $i < $intLen; $i++) {
        if (ord($strGB[$i]) > 127) {
            $strCurr = substr($strGB, $i, 2);
            $intGB = hexdec(bin2hex($strCurr)) - 0x8080;
            // Records are numbered from 1; $intEnd is one past the last
            // record so that the last record can also be found
            $intStart = 1;
            $intEnd = $intCount + 1;
            while ($intStart < $intEnd - 1) { // binary search
                $intMid = (int) floor(($intStart + $intEnd) / 2);
                $intOffset = 2 + 4 * ($intMid - 1);
                fseek($fileGBU, $intOffset);
                $strBuf = fread($fileGBU, 2);
                $intCode = ord($strBuf[0]) + 256 * ord($strBuf[1]);
                if ($intGB == $intCode) {
                    $intStart = $intMid;
                    break;
                }
                if ($intGB > $intCode) $intStart = $intMid;
                else $intEnd = $intMid;
            }
            // Re-read the candidate record and check that it really matches
            $intOffset = 2 + 4 * ($intStart - 1);
            fseek($fileGBU, $intOffset);
            $strBuf = fread($fileGBU, 2);
            $intCode = ord($strBuf[0]) + 256 * ord($strBuf[1]);
            if ($intGB == $intCode) {
                $strBuf = fread($fileGBU, 2);
                $intCodeU = ord($strBuf[0]) + 256 * ord($strBuf[1]);
                $strRet .= u2utf8($intCodeU);
            } else {
                $strRet .= "??";
            }
            $i++; // skip the second byte of the pair
        } else {
            $strRet .= $strGB[$i];
        }
    }
    fclose($fileGBU);
    return $strRet;
}
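The same record layout and search can be written more compactly with pack and unpack, whose format code v is an unsigned 16-bit little-endian integer and so matches the low-byte-first order used by gbu.dat. The sketch below works on an in-memory string rather than a file so that it is self-contained; the function names buildTable and lookupPacked are hypothetical.

```php
<?php
// Build the gbu.dat byte layout in memory: a 2-byte record count followed
// by 4-byte records (GB key, Unicode code point), all little-endian.
function buildTable(array $arrCodeTable) {
    ksort($arrCodeTable);
    $strData = pack('v', count($arrCodeTable));
    foreach ($arrCodeTable as $gb => $unicode) {
        $strData .= pack('vv', $gb, $unicode);
    }
    return $strData;
}

// Binary search over the packed records; returns the Unicode code point
// for $intGB, or null when the key is absent.
function lookupPacked($strData, $intGB) {
    $header = unpack('vcount', substr($strData, 0, 2));
    $lo = 0;
    $hi = $header['count'] - 1;
    while ($lo <= $hi) {
        $mid = ($lo + $hi) >> 1;
        $rec = unpack('vgb/vunicode', substr($strData, 2 + 4 * $mid, 4));
        if ($rec['gb'] === $intGB) return $rec['unicode'];
        if ($rec['gb'] < $intGB) $lo = $mid + 1;
        else $hi = $mid - 1;
    }
    return null;
}

$table = buildTable(array(0x2122 => 0x3001, 0x2121 => 0x3000, 0x2123 => 0x3002));
var_dump(lookupPacked($table, 0x2123)); // int(12290), i.e. 0x3002
var_dump(lookupPacked($table, 0x2124)); // NULL
```

Using inclusive bounds with `$lo <= $hi` checks every record, including the first and last, before reporting a miss.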
Adding GB2UTF8_FILE1 to the original benchmark and measuring all three methods in two runs gives the following data (to 3 decimal places, in seconds):

MySQL method: 0.125
Text file method: 10.873
Binary file with binary search: 0.106

MySQL method: 0.102
Text file method: 10.677
Binary file with binary search: 0.092

The binary file with binary search is slightly faster than the MySQL method. However, the benchmark above transcodes short geographic location strings. What about transcoding long text? I found the RSS 2.0 files of five blogs, all GB2312 encoded, and transcoded the five files with each of the three methods. The measurements are as follows (to 3 decimal places, in seconds):

MySQL method: 7.206
Text file method: 0.772
Binary file with binary search: 5.022

MySQL method: 7.440
Text file method: 0.766
Binary file with binary search: 5.055

For long text the text file method is the fastest: once the transcoding table has been read into memory, each lookup is very cheap. That suggests one possible improvement to the text file method: read the transcoding table into memory from the binary file gbu.dat instead of from the text file. The measurements are as follows (same precision and unit as above):

Table read from the text file: 0.766
Table read from the binary file: 0.831

Table read from the text file: 0.774
Table read from the binary file: 0.833

This improvement failed. Reading the transcoding table from the text file is more efficient.

Summary: for dynamic conversion of GB encoding to UTF-8 in PHP, if each piece of text to convert is short, the binary file with binary search is the better fit; if each piece of text is long, it is better to store the transcoding table in a text file and read the whole table into memory once before converting.
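GB2UTF8_FILE re-reads gb_unicode.txt on every call, which probably explains much of its poor result on the many short strings of the first benchmark. One way to load the table only once per request is to keep it in a static variable. This is a sketch under that assumption and was not part of the measurements above; the function name loadCodeTable is hypothetical.

```php
<?php
// Hypothetical loader: parse the text table once per request and keep it
// in a static variable, so later conversions reuse the same array.
function loadCodeTable($strPath) {
    static $cache = array();
    if (!isset($cache[$strPath])) {
        $table = array();
        foreach (file($strPath, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $strLine) {
            // Expected line layout: "0xGGGG 0xUUUU # name"
            if (preg_match('/^0x([0-9A-Fa-f]{4})\s+0x([0-9A-Fa-f]{4})/', $strLine, $m)) {
                $table[hexdec($m[1])] = hexdec($m[2]);
            }
        }
        $cache[$strPath] = $table;
    }
    return $cache[$strPath];
}

// Usage with a two-line stand-in for gb_unicode.txt.
$path = tempnam(sys_get_temp_dir(), 'gbu');
file_put_contents($path, "0x2121 0x3000 # IDEOGRAPHIC SPACE\n0x2122 0x3001 # IDEOGRAPHIC COMMA\n");
$first = loadCodeTable($path);  // reads and parses the file
unlink($path);
$second = loadCodeTable($path); // served from the cache; the file is not needed
```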

 
