An evaluation of several methods for dynamically converting GB encoding to UTF-8 in PHP


Last Update:2017-05-13 Source: Internet

Author: User


The article "IP address -> geographic location conversion evaluation" found that reading the IP database file directly with the ip2addr function is the most efficient approach, faster than storing the IP data in a MySQL database and looking it up with SQL queries. However, the IP database file QQWry.dat is GB2312 encoded, and I now need the geographic location results in UTF-8. With the MySQL approach, the data could be converted to UTF-8 once and for all when it is loaded into the database. The QQWry.dat file cannot be modified, though, so the output of the ip2addr function has to be converted dynamically.

There are at least four ways to convert GB -> UTF-8 dynamically:

Conversion using the PHP iconv extension

Conversion using the PHP mbstring extension

Conversion using a conversion table stored in a MySQL database

Conversion using a conversion table stored in a text file

The first two methods can be used only if the corresponding extensions are compiled and installed on the server. My virtual host has neither extension, so I have to consider the last two methods. The first two methods are not evaluated in this article.
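For reference, where one of these extensions is available, the conversion is a single call. The following is a minimal sketch and not part of the evaluation: the wrapper name gbToUtf8Ext is hypothetical, and it assumes the input is GB2312 (EUC-CN) encoded.

```php
<?php
// Hypothetical wrapper (not from the article): convert a GB2312 string to
// UTF-8 with whichever extension is installed, or return false if neither is.
function gbToUtf8Ext($strGB)
{
    if (function_exists('iconv')) {
        // iconv(from_encoding, to_encoding, string)
        return iconv('GB2312', 'UTF-8', $strGB);
    }
    if (function_exists('mb_convert_encoding')) {
        // mb_convert_encoding(string, to_encoding, from_encoding)
        return mb_convert_encoding($strGB, 'UTF-8', 'EUC-CN');
    }
    return false; // neither extension is available
}

// "\xD6\xD0\xB9\xFA" is the GB2312 byte sequence for U+4E2D U+56FD.
echo bin2hex(gbToUtf8Ext("\xD6\xD0\xB9\xFA")), "\n";
```

Checking function_exists first lets the same script fall back to a table-based conversion on hosts that lack both extensions.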

The evaluation program is as follows (for func_ip.php, see the article "IP address -> geographic location conversion evaluation"):

<?php
require_once("func_ip.php");

// Encode a Unicode code point as a UTF-8 byte string.
function u2utf8($c) {
    $c = (int)$c;
    $str = "";
    if ($c < 0x80) {
        $str .= chr($c); // single byte (the original appended the number itself)
    } elseif ($c < 0x800) {
        $str .= chr(0xC0 | ($c >> 6));
        $str .= chr(0x80 | ($c & 0x3F));
    } elseif ($c < 0x10000) {
        $str .= chr(0xE0 | ($c >> 12));
        $str .= chr(0x80 | (($c >> 6) & 0x3F));
        $str .= chr(0x80 | ($c & 0x3F));
    } elseif ($c < 0x200000) {
        $str .= chr(0xF0 | ($c >> 18));
        $str .= chr(0x80 | (($c >> 12) & 0x3F));
        $str .= chr(0x80 | (($c >> 6) & 0x3F));
        $str .= chr(0x80 | ($c & 0x3F));
    }
    return $str;
}

// Convert a GB2312 string to UTF-8 using the conversion table stored in MySQL.
// Note: the mysql_* functions were removed in PHP 7; newer versions need mysqli or PDO.
function GB2UTF8_SQL($strGB) {
    if (!trim($strGB)) return $strGB;
    $strRet = "";
    $intLen = strlen($strGB);
    for ($i = 0; $i < $intLen; $i++) {
        if (ord($strGB[$i]) > 127) {
            // Two-byte GB character: subtract 0x8080 to get the table key.
            $strCurr = substr($strGB, $i, 2);
            $intGB = hexdec(bin2hex($strCurr)) - 0x8080;
            $strSql = "SELECT code_unicode FROM nnstats_gb_unicode
                WHERE code_gb = " . $intGB . " LIMIT 1";
            $resResult = mysql_query($strSql);
            if ($arrCode = mysql_fetch_array($resResult)) $strRet .= u2utf8($arrCode["code_unicode"]);
            else $strRet .= "??";
            $i++; // skip the second byte of the GB character
        } else {
            $strRet .= $strGB[$i];
        }
    }
    return $strRet;
}

// Convert a GB2312 string to UTF-8 using the conversion table stored in a text file.
// The whole table is read and parsed again on every call.
function GB2UTF8_FILE($strGB) {
    if (!trim($strGB)) return $strGB;
    $arrCodeTable = array();
    $arrLines = file("gb_unicode.txt");
    foreach ($arrLines as $strLine) {
        $arrCodeTable[hexdec(substr($strLine, 0, 6))] = hexdec(substr($strLine, 7, 6));
    }
    $strRet = "";
    $intLen = strlen($strGB);
    for ($i = 0; $i < $intLen; $i++) {
        if (ord($strGB[$i]) > 127) {
            $strCurr = substr($strGB, $i, 2);
            $intGB = hexdec(bin2hex($strCurr)) - 0x8080;
            if (isset($arrCodeTable[$intGB])) $strRet .= u2utf8($arrCodeTable[$intGB]);
            else $strRet .= "??";
            $i++; // skip the second byte of the GB character
        } else {
            $strRet .= $strGB[$i];
        }
    }
    return $strRet;
}

// Convert a dotted-quad IP address to its integer value.
function EncodeIp($strDotquadIp) {
    $arrIpSep = explode(".", $strDotquadIp);
    if (count($arrIpSep) != 4) return 0;
    $intIp = 0;
    foreach ($arrIpSep as $k => $v) $intIp += (int)$v * pow(256, 3 - $k);
    return $intIp;
    // return sprintf("%02x%02x%02x%02x", $arrIpSep[0], $arrIpSep[1], $arrIpSep[2], $arrIpSep[3]);
}

// Return the current Unix time in seconds, including the microsecond fraction.
function GetMicroTime() {
    list($msec, $sec) = explode(" ", microtime());
    return (double)$msec + (double)$sec;
}

for ($i = 0; $i < 100; $i++) { // generate 100 random IP addresses
    $strIp = mt_rand(0, 255) . "." . mt_rand(0, 255) . "." . mt_rand(0, 255) . "." . mt_rand(0, 255);
    $arrAddr[$i] = ip2addr(EncodeIp($strIp));
}
$resConn = mysql_connect("localhost", "netnest", "netnest");
mysql_select_db("test");

// Evaluate encoding conversion using MySQL queries
$dblTimeStart = GetMicroTime();
for ($i = 0; $i < 100; $i++) {
    $strUTF8Region = GB2UTF8_SQL($arrAddr[$i]["region"]);
    $strUTF8Address = GB2UTF8_SQL($arrAddr[$i]["address"]);
}
$dblTimeDuration = GetMicroTime() - $dblTimeStart;
// Evaluation finished; output the result
echo $dblTimeDuration; echo "\n";

// Evaluate encoding conversion using the text file
$dblTimeStart = GetMicroTime();
for ($i = 0; $i < 100; $i++) {
    $strUTF8Region = GB2UTF8_FILE($arrAddr[$i]["region"]);
    $strUTF8Address = GB2UTF8_FILE($arrAddr[$i]["address"]);
}
$dblTimeDuration = GetMicroTime() - $dblTimeStart;
// Evaluation finished; output the result
echo $dblTimeDuration; echo "\n";
?>
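To make the lookup key in the listing concrete, the sketch below works through one character by hand. It is an illustration under stated assumptions rather than part of the evaluation: the helper names gbTableKey and utf8ThreeBytes are hypothetical, and the example assumes the table maps GB code 0x5650 to U+4E2D, as standard GB2312 does.

```php
<?php
// Hypothetical helper: derive the conversion-table key from one 2-byte
// GB2312 (EUC-CN) character by subtracting 0x80 from each byte.
function gbTableKey($strChar)
{
    return ((ord($strChar[0]) << 8) | ord($strChar[1])) - 0x8080;
}

// Hypothetical helper: UTF-8 encoding for code points U+0800..U+FFFF,
// i.e. the three-byte pattern 1110xxxx 10xxxxxx 10xxxxxx.
function utf8ThreeBytes($c)
{
    return chr(0xE0 | ($c >> 12))
         . chr(0x80 | (($c >> 6) & 0x3F))
         . chr(0x80 | ($c & 0x3F));
}

$intKey = gbTableKey("\xD6\xD0");            // bytes D6 D0 in the GB string
printf("0x%04X\n", $intKey);                 // prints 0x5650
echo bin2hex(utf8ThreeBytes(0x4E2D)), "\n";  // prints e4b8ad
```

The first value is the key looked up in the conversion table; the second is the byte sequence that ends up in the UTF-8 output once the table has returned the Unicode code point.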

Results of two evaluation runs (to three decimal places, in seconds):

MySQL query conversion: 0.112
Text file query conversion: 10.590

MySQL query conversion: 0.099
Text file query conversion: 10.623

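The MySQL method is far faster than the text file method. One likely contributor is that GB2UTF8_FILE re-reads and re-parses the whole of gb_unicode.txt on every call. The conversion table gb_unicode.txt is a text file in the following format: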

0x2121 0x3000 # IDEOGRAPHIC SPACE
0x2122 0x3001 # IDEOGRAPHIC COMMA
0x2123 0x3002 # IDEOGRAPHIC FULL STOP
0x2124 0x30FB # KATAKANA MIDDLE DOT
0x2125 0x02C9 # modifier letter macron (Mandarin Chinese first tone)
......
0x552A 0x6458 #
0x552B 0x658B #
0x552C 0x5B85 #
0x552D 0x7A84 #
......
0x777B 0x9F37 #
0x777C 0x9F3D #
0x777D 0x9F3E #
0x777E 0x9F44 #
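Each line of the table carries a GB code, a Unicode code point, and an optional comment. As a minimal sketch of reading that format, the code below parses lines with a regular expression instead of fixed substr offsets and skips lines that do not match, such as the "......" placeholders. The helper name parseTableLine is hypothetical, and the sample lines stand in for the contents of gb_unicode.txt.

```php
<?php
// Hypothetical helper: parse one "0xGGGG 0xUUUU # comment" line into
// array(GB code, Unicode code point), or null if the line does not match.
function parseTableLine($strLine)
{
    if (preg_match('/^0x([0-9A-Fa-f]{4})\s+0x([0-9A-Fa-f]{4})/', $strLine, $m)) {
        return array(hexdec($m[1]), hexdec($m[2]));
    }
    return null;
}

// Sample lines in the format shown above (assumed stand-in for the file).
$arrLines = array(
    "0x2121 0x3000 # IDEOGRAPHIC SPACE",
    "0x2122 0x3001 # IDEOGRAPHIC COMMA",
    "......",
    "0x777E 0x9F44 #",
);

// Build the lookup table, keyed by GB code.
$arrCodeTable = array();
foreach ($arrLines as $strLine) {
    $arrPair = parseTableLine($strLine);
    if ($arrPair !== null) {
        $arrCodeTable[$arrPair[0]] = $arrPair[1];
    }
}

printf("0x%04X\n", $arrCodeTable[0x2122]); // prints 0x3001
echo count($arrCodeTable), "\n";           // prints 3
```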

A text file is inefficient to query. A better approach is to convert the text file to a binary file and then use binary search (the half-interval method) to look up entries directly in the file, without reading the whole file into memory. The file format is: a 2-byte header that stores the number of records, followed by the records one after another. Each record is 4 bytes: the first 2 bytes hold the GB code and the last 2 bytes hold the Unicode code. The conversion program is as follows:

<?php
// Read the text table into an array keyed by GB code.
$arrCodeTable = array();
$arrLines = file("gb_unicode.txt");
foreach ($arrLines as $strLine) {
    $arrCodeTable[hexdec(substr($strLine, 0, 6))] = hexdec(substr($strLine, 7, 6));
}
ksort($arrCodeTable); // sort by GB code so the file can be binary-searched
$intCount = count($arrCodeTable);
// 2-byte header: number of records, low byte first
$strCount = chr($intCount % 256) . chr(floor($intCount / 256));
$fileGBU = fopen("gbu.dat", "wb");
fwrite($fileGBU, $strCount);
foreach ($arrCodeTable as $k => $v) {
    // 4-byte record: GB code, then Unicode code, each low byte first
    $strData = chr($k % 256) . chr(floor($k / 256)) . chr($v % 256) . chr(floor($v / 256));
    fwrite($fileGBU, $strData);
}
fclose($fileGBU);
?>
After the program runs, the binary GB -> Unicode conversion table gbu.dat is produced, with its records sorted by GB code for easy searching. The function that transcodes using gbu.dat is as follows:

function GB2UTF8_FILE1($strGB) {
    if (!trim($strGB)) return $strGB;
    $fileGBU = fopen("gbu.dat", "rb");
    $strBuf = fread($fileGBU, 2);
    $intCount = ord($strBuf[0]) + 256 * ord($strBuf[1]);
    $strRet = "";
    $intLen = strlen($strGB);
    for ($i = 0; $i < $intLen; $i++) {
        if (ord($strGB[$i]) > 127) {
            $strCurr = substr($strGB, $i, 2);
            $intGB = hexdec(bin2hex($strCurr)) - 0x8080;
            $intStart = 1;
            $intEnd = $intCount;
            while ($intStart < $intEnd - 1) { // binary (half-interval) search
                $intMid = floor($intStart + $intEnd

...
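The listing above is cut off in the source. The following is a separate sketch, not a reconstruction of the author's function, of how a single GB code could be looked up by binary search in a file with the layout described earlier: a 2-byte record count, then 4-byte records sorted by GB code, all values low byte first. The function name lookupGbu and the in-memory demo stream are assumptions made for illustration.

```php
<?php
// Hypothetical helper: find $intGB in an open stream laid out as
//   [count: 2 bytes LE] [GB code: 2 bytes LE, Unicode: 2 bytes LE] * count
// with records sorted by GB code. Returns the Unicode code point or null.
function lookupGbu($fileGBU, $intGB)
{
    fseek($fileGBU, 0);
    $arrHeader = unpack('vcount', fread($fileGBU, 2));
    $intLow = 0;
    $intHigh = $arrHeader['count'] - 1;
    while ($intLow <= $intHigh) {
        $intMid = intdiv($intLow + $intHigh, 2);
        // Record $intMid starts after the 2-byte header.
        fseek($fileGBU, 2 + 4 * $intMid);
        $arrRec = unpack('vgb/vuni', fread($fileGBU, 4));
        if ($arrRec['gb'] === $intGB) {
            return $arrRec['uni'];
        }
        if ($arrRec['gb'] < $intGB) {
            $intLow = $intMid + 1;  // target lies in the upper half
        } else {
            $intHigh = $intMid - 1; // target lies in the lower half
        }
    }
    return null; // GB code not present in the table
}

// Demo on a two-record table held in memory instead of gbu.dat.
$fileDemo = fopen('php://memory', 'w+b');
fwrite($fileDemo, pack('v', 2) . pack('v2', 0x2121, 0x3000) . pack('v2', 0x5650, 0x4E2D));
printf("0x%04X\n", lookupGbu($fileDemo, 0x5650)); // prints 0x4E2D
fclose($fileDemo);
```

Each lookup reads only the header plus about log2(count) four-byte records, so the table never has to be loaded into memory as a whole.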
