Introduction: This article describes how to implement full-text retrieval for UTF-8 Chinese websites on top of a MySQL database, and includes the PHP source code used.
Many websites today offer full-text search, letting visitors find specific material by entering keywords or phrases. On PHP + MySQL sites, the common practice is to search with a SELECT ... LIKE statement, which is both imprecise and slow. For example, a LIKE query against a text field in a table with tens of thousands of records may take about 10 seconds, which is a poor experience for visitors. How can full-text search be performed quickly over large volumes of data? MySQL provides a full-text index feature: add a FULLTEXT index to the field, then search with a SELECT ... MATCH ... AGAINST statement.
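As a rough illustration of the difference, the sketch below places the two query styles side by side as PHP strings. The table articles and its columns id and body are hypothetical names chosen for this example, not taken from the sites described here. On MySQL versions before 5.6, a FULLTEXT index requires a MyISAM table.

```php
<?php
// Hypothetical table and column names (articles, id, body), for illustration only.

// One-time setup: add a FULLTEXT index to the text column.
$addIndexSql = 'ALTER TABLE articles ADD FULLTEXT INDEX ft_body (body)';

// LIKE with a leading wildcard cannot use an ordinary index, so every row is scanned.
$likeSql = "SELECT id FROM articles WHERE body LIKE CONCAT('%', :kw, '%')";

// MATCH ... AGAINST looks the words up in the FULLTEXT index instead.
$fulltextSql = 'SELECT id FROM articles WHERE MATCH (body) AGAINST (:kw)';
```

Both queries take the keyword as a bound parameter (:kw), so either string can be passed to a prepared statement unchanged.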
Touchus, the Global Yellow Pages & Business Directory (www.touchus.org), an English-only site we developed, uses this MySQL feature; the average full-text retrieval time over more than 10 million records is less than 0.5 seconds. However, when developing the Chinese version of Touchus, www.city39.cn, we ran into a new problem. In English text, words are separated by spaces, so FULLTEXT works well. Chinese and other East Asian text is not so simple: because Chinese words are not clearly separated from one another, MySQL does not support full-text retrieval of Chinese text.
How can MySQL be made to support full-text retrieval in Chinese? We hit on the idea of segmenting the Chinese text into words and then encoding each word as ASCII characters. This establishes a fixed correspondence between Chinese words and ASCII strings, and full-text search can then operate on the ASCII strings. Does that amount to a full-text index for Chinese? Our tests showed that it does. The following describes how we implemented it on the city Yellow Pages site:
1. Create a separate index table. For example, create a members_index table for the members table.
User information table (members): user_id, user_name, user_introduction
Full-text index table (members_index): user_id, index_intro
Add a FULLTEXT index to the index_intro field of the members_index table.
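The index table from step 1 could be created along the following lines. This is a sketch under assumptions: the column types, the index name ft_index_intro, and the MyISAM engine (needed for FULLTEXT before MySQL 5.6) are guesses for illustration, since the original schema is not given. The helper name members_index_ddl is likewise hypothetical.

```php
<?php
// Assumed schema for the index table; column types and the index name are illustrative.
function members_index_ddl(): string
{
    return <<<SQL
CREATE TABLE members_index (
    user_id     INT UNSIGNED NOT NULL PRIMARY KEY,
    index_intro TEXT NOT NULL,
    FULLTEXT KEY ft_index_intro (index_intro)
) ENGINE=MyISAM DEFAULT CHARSET=utf8
SQL;
}

// With an open PDO connection in $pdo (not created here), the table would be created with:
// $pdo->exec(members_index_ddl());
```

Keeping the encoded text in its own table leaves the members table untouched, so the readable introduction and its searchable form can be updated independently.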
2. Perform Chinese word segmentation on the user_introduction field of the user information table (members).
For Chinese word segmentation, see the Simple Chinese Word Segmentation system (SCWS), http://www.ftphp.com/scws/. On the city Yellow Pages site, we used the PHP extension of SCWS to perform the segmentation. The extension is easy to install and works after a simple build and configuration. In our PHP code, the following function segments the text and joins the resulting words with spaces.
// Chinese word segmentation function
function str_fc($str) {
    $so = scws_new();
    $so->set_charset('utf8');
    // If set_dict and set_rule are not called, the dictionary and rule files
    // at the paths configured in php.ini are loaded automatically.
    $so->send_text($str);
    $mystr = '';
    // get_result() returns one batch of words at a time until the text is exhausted.
    while ($tmp = $so->get_result()) {
        foreach ($tmp as $ss) {
            $s = trim($ss['word']);
            if ($s !== '') {
                $mystr .= $s . ' ';
            }
        }
    }
    $so->close();
    return $mystr;
}
The function returns the segmented words joined by spaces.
3. Encode the segmentation result. Several encodings could be used, such as base64, urlencode, or converting the Chinese characters to pinyin; for GB2312 text, even the GB2312 location code would work. Weighing storage space and convenience, we chose PHP's urlencode. Before encoding, duplicate words can be removed to save storage space. After encoding, the % signs must be removed from the result: urlencode writes each non-alphanumeric byte as %XX, so it generates many % characters, and % is a wildcard in MySQL and is not treated as part of a word by the full-text parser. The following PHP code performs the encoding.
$data = str_fc($data);                      // Chinese word segmentation
$data = array_filter(explode(' ', $data));  // remove empty elements
$data = array_unique($data);                // remove duplicate words
// urlencode each word and strip the % signs
$data_code = '';
foreach ($data as $ss) {
    if (strlen($ss) > 1) {
        $data_code .= str_replace('%', '', urlencode($ss)) . ' ';
    }
}
$data_code is the encoded result. Store it in the user information full-text index table (members_index) under the corresponding user_id.
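To make the encoding step concrete, the sketch below wraps it in a small function and runs it on a sample string that is already segmented, so it does not need the SCWS extension. The name encode_terms is hypothetical and not from the original site code.

```php
<?php
/**
 * Encode space-separated, already-segmented words for storage in index_intro.
 * encode_terms() is a hypothetical helper name used for this illustration.
 */
function encode_terms(string $segmented): string
{
    // Split on spaces, drop empty pieces, then drop duplicate words.
    $words = array_unique(array_filter(explode(' ', $segmented), 'strlen'));

    $encoded = [];
    foreach ($words as $word) {
        // Skip single-byte tokens, matching the strlen() > 1 check in the article's code.
        if (strlen($word) > 1) {
            // urlencode() yields %XX per byte; removing '%' leaves plain hex.
            $encoded[] = str_replace('%', '', urlencode($word));
        }
    }
    return implode(' ', $encoded);
}

// "中文" appears twice in the input but is encoded only once.
echo encode_terms('中文 全文 中文'), "\n";
// prints "E4B8ADE69687 E585A8E69687"
```

Each Chinese character occupies three bytes in UTF-8 and so becomes six hexadecimal characters. Even a one-character word is therefore longer than the default minimum full-text word length of 4 (ft_min_word_len) used by MyISAM, and will be indexed.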
4. At search time, apply the same segmentation and encoding to the keywords the user enters, then run the full-text search with a MySQL SELECT ... MATCH ... AGAINST statement. The retrieved user_id values are used to fetch the original rows from the user information table (members) for display, so the encoded text never needs to be decoded or reassembled.
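The search step might look roughly like the sketch below. It assumes the keywords have already been segmented into an array of words and that $pdo is an open PDO connection to the MySQL database. The function names encode_keyword, build_search, and search_user_ids are hypothetical.

```php
<?php
// Same encoding as the indexing step: urlencode() with '%' removed.
function encode_keyword(string $word): string
{
    return str_replace('%', '', urlencode($word));
}

/**
 * Build the full-text query for a list of already-segmented keywords.
 * Returns [sql, params]; build_search() is a hypothetical helper name.
 */
function build_search(array $words): array
{
    $encoded = implode(' ', array_map('encode_keyword', $words));
    $sql = 'SELECT user_id FROM members_index'
         . ' WHERE MATCH (index_intro) AGAINST (:q)';
    return [$sql, [':q' => $encoded]];
}

// Assumes $pdo is an open PDO connection to the MySQL database.
function search_user_ids(PDO $pdo, array $words): array
{
    [$sql, $params] = build_search($words);
    $stmt = $pdo->prepare($sql);
    $stmt->execute($params);
    // Return only the user_id column of each matching row.
    return $stmt->fetchAll(PDO::FETCH_COLUMN);
}
```

The returned user_id values can then be used in an ordinary query against the members table to load the rows for display. In the default natural-language mode a row matches if it contains any of the encoded words; requiring all of them would need boolean mode, which this sketch does not use.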
This MySQL UTF-8 full-text search method is currently running well on our two Chinese websites, the city Yellow Pages network (www.city39.cn) and the Enterprise Supply and Demand Information Network (www.myglobalmarket.cn), with an average retrieval time of less than 0.5 seconds.