Sphinx Chinese Getting Started Guide (from Sphinx Chinese station)


wuhuiming <blvming in gmail.com>
When reprinting, please credit the source and the author.
Last modified: January 23, 2010
    • 1. Introduction
    • 1.1. What is Sphinx?
    • 1.2. Features of Sphinx
    • 1.3. Chinese word segmentation in Sphinx
    • 2. Installation and Configuration Example
    • 2.1. Installing on GNU/Linux/Unix systems
      • 2.1.1. Sphinx installation
      • 2.1.2. SFC installation (see the separate article)
      • 2.1.3. Coreseek installation (see the separate article)
    • 2.2. Installing on Windows
    • 3. Configuration Example
    • 4. Application
    • 4.1. Testing on the CLI
    • 4.2. Using API calls
    • 5. Appendix
1. Introduction

1.1. What is Sphinx?

Sphinx is a full-text search engine developed by the Russian developer Andrew Aksyonoff. It is intended to give other applications fast, low-footprint, highly relevant full-text search. Sphinx integrates easily with SQL databases and scripting languages. The current release supports MySQL and PostgreSQL as data sources, and can also read XML data in a specific format from standard input. By modifying the source code, users can add new data sources of their own (for example, native support for other types of DBMS).

1.2. Features of Sphinx
    • High-speed indexing (peak performance of up to 10 MB/s on current CPUs);
    • High-performance search (on 2–4 GB of text, the average response time per query is under 0.1 seconds);
    • Handles large volumes of data (known to handle more than a few gigabytes of text; a single-CPU system can process documents numbering in the millions);
    • Excellent relevance ranking, using a composite method based on phrase proximity and statistical (BM25) ranking;
    • Distributed search;
    • Phrase search;
    • Document excerpt (summary) generation;
    • Search available as a storage engine for MySQL;
    • Boolean, phrase, word-proximity, and many other query modes;
    • Multiple full-text fields per document (32 at most);
    • Multiple additional attributes per document (for example, grouping information and timestamps);
    • Word breaking support.
1.3. Chinese word segmentation in Sphinx

Full-text search for Chinese differs from full-text search for English and other Latin-script languages. The latter split words on spaces and other special characters, whereas Chinese text has to be segmented into words according to meaning. Most databases, MySQL among them, currently do not support Chinese full-text search. As a result, several Chinese full-text search plugins for MySQL have appeared in China; one of the better ones is hightman's Chinese word segmentation. Sphinx likewise needs an add-on before it can provide full-text search for Chinese. The add-ons I know of are Coreseek and SFC.

    • Coreseek is currently the most widely used Chinese full-text search solution built on Sphinx. It provides libmmseg, a Chinese word segmenter designed for Sphinx, and ships binary releases for several systems: RPM and DEB packages, and binary packages for Windows. Coreseek has also contributed the following to Sphinx:
      • GBK-encoded data source support
      • A Chinese word segmenter that uses Chih-Hao Tsai's MMSEG algorithm
      • A Chinese user manual (this manual is a great convenience for Sphinx users in China, especially those who are not comfortable with English)
    • SFC (sphinx-for-chinese) is another Chinese word segmenter, provided by the netizen Happy Brother. The dictionary it uses is xdict. According to its introduction, tests on a Linux platform show that the current version indexes at roughly half the speed of indexing UTF-8 English text, that is, about half of the officially claimed speed (most of the time is spent on segmentation). sphinx-for-chinese-0.9.10-dev-r2006.tar.gz is now available and tracks the latest Sphinx version (Sphinx 0.9.10). This version adds sql_attr_string, which I have tested. It is easy to install and configure. Happy Brother has made another contribution to word segmentation: php-mmseg, a PHP extension for Chinese word segmentation.
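
To make the idea of dictionary-based segmentation concrete, here is a minimal Python sketch of forward maximum matching. It is deliberately simpler than the MMSEG algorithm that libmmseg uses, and it is not code from Coreseek or SFC; the function name and the toy dictionary are hypothetical and exist only for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation.

    At each position, take the longest dictionary word that starts there;
    if none matches, fall back to a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        match = text[i]  # fallback: one character
        # Try the longest candidate first, down to two characters.
        for size in range(min(max_len, len(text) - i), 1, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                match = candidate
                break
        tokens.append(match)
        i += len(match)
    return tokens


# Hypothetical toy dictionary, for illustration only.
toy_dictionary = {"北京", "大学", "北京大学", "欢迎"}
print(forward_max_match("北京大学欢迎你", toy_dictionary))
```

A real segmenter needs a large dictionary and rules for resolving ambiguous splits, which is where most of the indexing time mentioned above goes.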

Here, I would like to pay the greatest tribute to the above two authors.

    • Also, if you are not interested in Chinese word segmentation, or you only need something like SQL's LIKE, for example SELECT * FROM product WHERE prodname LIKE '%phone%', Sphinx will not let you down. This is perhaps the simple approach to Chinese described on the official website: index each character directly. The search speed is good too ^_^.
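
The single-character approach can be pictured with a short Python sketch. It is only a rough model of the idea, not Sphinx's tokenizer: every CJK character becomes its own token, Latin letters and digits are grouped into lowercase words, and a Chinese query then behaves like a phrase search over adjacent characters. The function names are hypothetical, and the CJK test covers only the U+4E00..U+9FBF block.

```python
def is_cjk(ch):
    """True for characters in the CJK block U+4E00..U+9FBF."""
    return 0x4E00 <= ord(ch) <= 0x9FBF


def unigram_tokens(text):
    """Split CJK text into single characters and other text into lowercase words."""
    tokens, word = [], []
    for ch in text:
        if is_cjk(ch):
            if word:
                tokens.append("".join(word))
                word = []
            tokens.append(ch)
        elif ch.isalnum():
            word.append(ch.lower())
        elif word:
            # Any other character acts as a separator.
            tokens.append("".join(word))
            word = []
    if word:
        tokens.append("".join(word))
    return tokens


def contains_phrase(doc_tokens, query_tokens):
    """True if query_tokens occur in doc_tokens as one adjacent run."""
    n = len(query_tokens)
    return n > 0 and any(
        doc_tokens[i:i + n] == query_tokens
        for i in range(len(doc_tokens) - n + 1)
    )


print(unigram_tokens("Sphinx 北京 search"))
print(contains_phrase(unigram_tokens("我爱北京天安门"), unigram_tokens("北京")))
```

Because no dictionary is involved, this is simple and fast to set up, at the cost of matching character sequences rather than real words.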

This article tests the three Chinese search approaches above and documents them; that is probably the main focus of this guide.

2. Installation and Configuration Example

2.1. Installing on GNU/Linux/Unix systems

There are two ways to use Sphinx with MySQL:
① Use API calls, querying through the API functions or methods provided for PHP, Java, and other languages. The advantages are that MySQL does not need to be recompiled, the server processes are loosely coupled, and programs can call the search flexibly and conveniently. The disadvantage is that if you already have a search program, parts of it have to be modified. Recommended for programmers.
② Use plugin mode (SphinxSE): compile Sphinx into MySQL as a plugin and search with special SQL statements. Its strengths are that it combines conveniently on the SQL side and can return data directly to the client without a second query (see the note below). Only the corresponding SQL in the program needs to change, but this is inconvenient for programs built on a framework, for example one that uses an ORM. MySQL must also be recompiled, and MySQL 5.1 or later is required for pluggable storage engine support. System administrators may prefer this method.
Note on the second query: up to and including the release version sphinx-0.9.9, Sphinx returns only the IDs of the matching records, not the row data, so the rows have to be queried from the database again using those IDs. Sphinx 0.9.10 can store this text data itself; I tried it and found its performance and storage unsatisfactory, but it is not yet an official release.
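
The two-query pattern described in the note can be sketched as follows. This is a minimal illustration, not the author's code: an in-memory SQLite table stands in for the MySQL table, the list of IDs stands in for what searchd would return, and the function name is hypothetical.

```python
import sqlite3


def fetch_rows_in_order(conn, ids):
    """Second query: load the rows for the IDs returned by the search step,
    keeping the order in which the search engine ranked them."""
    if not ids:
        return []
    placeholders = ",".join("?" for _ in ids)
    rows = conn.execute(
        f"SELECT id, title FROM sphinx_article WHERE id IN ({placeholders})",
        ids,
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # IN (...) does not guarantee order, so re-sort by the search ranking.
    return [by_id[i] for i in ids if i in by_id]


# Stand-in database; a real setup would connect to MySQL instead.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sphinx_article (id INTEGER PRIMARY KEY, title TEXT)")
conn.executemany(
    "INSERT INTO sphinx_article VALUES (?, ?)",
    [(17, "third"), (76, "first"), (85, "second")],
)

matched_ids = [76, 85, 17]  # assumed output of the search step, best match first
print(fetch_rows_in_order(conn, matched_ids))
```

Re-sorting in application code matters because the database returns the rows in its own order, not in relevance order.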

This article uses the first method.

Installing on a *nix system requires the following software.

Software environment:

    • Operating system: CentOS 5.2
    • Database: mysql-5.0.77-3.el5 and mysql-devel (to use the SphinxSE storage engine plugin, use MySQL 5.1 or later)
    • Build tools: gcc, gcc-c++, autoconf, automake
    • Sphinx: sphinx-0.9.9 (latest stable version)

Installation:

    • # yum install -y mysql mysql-devel
    • # yum install -y automake autoconf
    • # cd /usr/local/src/
    • # wget http://www.sphinxsearch.com/downloads/sphinx-0.9.9.tar.gz
    • # tar zxvf sphinx-0.9.9.tar.gz
    • # cd sphinx-0.9.9
    • # ./configure --prefix=/usr/local/sphinx    # Note: Sphinx supports MySQL by default here
    • # make && make install    # warnings can be ignored

After installation, check whether /usr/local/sphinx contains the three directories bin, etc, and var. If it does, the installation succeeded.
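
The directory check can also be scripted. The following is a small Python sketch under the assumption that the install prefix is /usr/local/sphinx as configured above; the function name is hypothetical.

```python
from pathlib import Path


def missing_install_dirs(prefix, expected=("bin", "etc", "var")):
    """Return the expected subdirectories that do not exist under prefix."""
    root = Path(prefix)
    return [name for name in expected if not (root / name).is_dir()]


missing = missing_install_dirs("/usr/local/sphinx")
if missing:
    print("missing directories:", missing)
else:
    print("installation looks complete")
```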

2.1.2. SFC installation (see the separate article)
2.1.3. Coreseek installation (see the separate article)

3. Configuration Example

3.1. Data source

Here we use a MySQL data source. The details are as follows:

MySQL server: 192.168.1.10

MySQL database: test

MySQL table: test.sphinx_article

mysql> desc sphinx_article;
+-----------+---------------------+------+-----+---------+----------------+
| Field     | Type                | Null | Key | Default | Extra          |
+-----------+---------------------+------+-----+---------+----------------+
| id        | int(11) unsigned    | NO   | PRI | NULL    | auto_increment |
| title     | varchar(255)        | NO   |     |         |                |
| cat_id    | tinyint(3) unsigned | NO   | MUL |         |                |
| member_id | int(11) unsigned    | NO   | MUL |         |                |
| content   | longtext            | NO   |     |         |                |
| created   | int(11)             | NO   | MUL |         |                |
+-----------+---------------------+------+-----+---------+----------------+
6 rows in set (0.00 sec)
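
The configuration in section 3.2 selects these six columns with sql_query and declares three of them as attributes. As a rough mental model (a sketch, not Sphinx's own code): the first selected column is the document ID, columns declared with sql_attr_* become attributes, and the remaining columns become full-text fields. The function name below is hypothetical.

```python
def classify_columns(select_columns, attribute_columns):
    """Split the columns of a Sphinx sql_query into ID, full-text fields, and attributes."""
    doc_id = select_columns[0]  # the first column must be the document ID
    rest = select_columns[1:]
    fields = [c for c in rest if c not in attribute_columns]
    attrs = [c for c in rest if c in attribute_columns]
    return doc_id, fields, attrs


columns = ["id", "title", "cat_id", "member_id", "content", "created"]
attributes = {"cat_id", "member_id", "created"}  # declared via sql_attr_uint / sql_attr_timestamp
print(classify_columns(columns, attributes))
```

This matches the API result shown at the end of this guide, which reports title and content as fields and cat_id, member_id, and created as attributes.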

3.2. Configuration file
    • # cd /usr/local/sphinx/etc    # enter the Sphinx configuration directory
    • # cp sphinx.conf.dist sphinx.conf    # create the Sphinx configuration file
    • # vim sphinx.conf    # edit sphinx.conf

The configuration file used in this example:

##### Index source ###########
source article_src
{
type = mysql    # data source type
sql_host = 192.168.1.10    # MySQL host
sql_user = root    # MySQL user name
sql_pass = pwd    # MySQL password
sql_db = test    # MySQL database name
sql_port = 3306    # MySQL port
# Connection encoding used for retrieval. Pay particular attention to this: many problems
# with Chinese search arise because the database encoding is GBK or another non-UTF-8 encoding.
sql_query_pre = SET NAMES utf8
# SQL statement that fetches the data to index
sql_query = SELECT id, title, cat_id, member_id, content, created FROM sphinx_article

##### The attributes below are used for filtering or conditional queries #####

sql_attr_uint = cat_id    # unsigned integer attribute
sql_attr_uint = member_id
sql_attr_timestamp = created    # Unix timestamp attribute

# Used when testing from the command line (CLI)
sql_query_info = SELECT * FROM sphinx_article WHERE id=$id

}

##### Index #####

index article
{
source = article_src    # index source declared above
path = /usr/local/sphinx/var/data/article    # directory and base file name of the index files
docinfo = extern    # how document attribute information is stored
mlock = 0    # memory locking of cached data
morphology = none    # morphology (has no effect on Chinese)
min_word_len = 1    # minimum length of an indexed word
charset_type = utf-8    # data encoding

##### Character table. Note: with this approach, Sphinx splits Chinese text into
##### single characters and indexes each one. To use real Chinese word segmentation,
##### you must use a separate segmenter such as Coreseek or SFC.
charset_table = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z, \
A..Z->a..z, a..z, U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6, \
U+00E0..U+00F6, U+00D8..U+00DE->U+00F8..U+00FE, U+00F8..U+00FE, U+0100->U+0101, U+0101, \
U+0102->U+0103, U+0103, U+0104->U+0105, U+0105, U+0106->U+0107, U+0107, U+0108->U+0109, \
U+0109, U+010A->U+010B, U+010B, U+010C->U+010D, U+010D, U+010E->U+010F, U+010F, \
U+0110->U+0111, U+0111, U+0112->U+0113, U+0113, U+0114->U+0115, U+0115, \
U+0116->U+0117, U+0117, U+0118->U+0119, U+0119, U+011A->U+011B, U+011B, U+011C->U+011D, \
U+011D, U+011E->U+011F, U+011F, U+0130->U+0131, U+0131, U+0132->U+0133, U+0133, \
U+0134->U+0135, U+0135, U+0136->U+0137, U+0137, U+0139->U+013A, U+013A, U+013B->U+013C, \
U+013C, U+013D->U+013E, U+013E, U+013F->U+0140, U+0140, U+0141->U+0142, U+0142, \
U+0143->U+0144, U+0144, U+0145->U+0146, U+0146, U+0147->U+0148, U+0148, U+014A->U+014B, \
U+014B, U+014C->U+014D, U+014D, U+014E->U+014F, U+014F, U+0150->U+0151, U+0151, \
U+0152->U+0153, U+0153, U+0154->U+0155, U+0155, U+0156->U+0157, U+0157, U+0158->U+0159, \
U+0159, U+015A->U+015B, U+015B, U+015C->U+015D, U+015D, U+015E->U+015F, U+015F, \
U+0160->U+0161, U+0161, U+0162->U+0163, U+0163, U+0164->U+0165, U+0165, U+0166->U+0167, \
U+0167, U+0168->U+0169, U+0169, U+016A->U+016B, U+016B, U+016C->U+016D, U+016D, \
U+016E->U+016F, U+016F, U+0170->U+0171, U+0171, U+0172->U+0173, U+0173, U+0174->U+0175, \
U+0175, U+0176->U+0177, U+0177, U+0178->U+00FF, U+00FF, U+0179->U+017A, U+017A, \
U+017B->U+017C, U+017C, U+017D->U+017E, U+017E, U+0410..U+042F->U+0430..U+044F, \
U+0430..U+044F, U+05D0..U+05EA, U+0531..U+0556->U+0561..U+0586, U+0561..U+0587, \
U+0621..U+063A, U+01B9, U+01BF, U+0640..U+064A, U+0660..U+0669, U+066E, U+066F, \
U+0671..U+06D3, U+06F0..U+06FF, U+0904..U+0939, U+0958..U+095F, U+0960..U+0963, \
U+0966..U+096F, U+097B..U+097F, U+0985..U+09B9, U+09CE, U+09DC..U+09E3, U+09E6..U+09EF, \
U+0A05..U+0A39, U+0A59..U+0A5E, U+0A66..U+0A6F, U+0A85..U+0AB9, U+0AE0..U+0AE3, \
U+0AE6..U+0AEF, U+0B05..U+0B39, U+0B5C..U+0B61, U+0B66..U+0B6F, U+0B71, U+0B85..U+0BB9, \
U+0BE6..U+0BF2, U+0C05..U+0C39, U+0C66..U+0C6F, U+0C85..U+0CB9, U+0CDE..U+0CE3, \
U+0CE6..U+0CEF, U+0D05..U+0D39, U+0D60, U+0D61, U+0D66..U+0D6F, U+0D85..U+0DC6, \
U+1900..U+1938, U+1946..U+194F, U+A800..U+A805, U+A807..U+A822, U+0386->U+03B1, \
U+03AC->U+03B1, U+0388->U+03B5, U+03AD->U+03B5, U+0389->U+03B7, U+03AE->U+03B7, \
U+038A->U+03B9, U+0390->U+03B9, U+03AA->U+03B9, U+03AF->U+03B9, U+03CA->U+03B9, \
U+038C->U+03BF, U+03CC->U+03BF, U+038E->U+03C5, U+03AB->U+03C5, U+03B0->U+03C5, \
U+03CB->U+03C5, U+03CD->U+03C5, U+038F->U+03C9, U+03CE->U+03C9, U+03C2->U+03C3, \
U+0391..U+03A1->U+03B1..U+03C1, U+03A3..U+03A9->U+03C3..U+03C9, U+03B1..U+03C1, \
U+03C3..U+03C9, U+0E01..U+0E2E, U+0E30..U+0E3A, U+0E40..U+0E45, U+0E47, U+0E50..U+0E59, \
U+A000..U+A48F, U+4E00..U+9FBF, U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF, \
U+2F800..U+2FA1F, U+2E80..U+2EFF, U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF, \
U+3040..U+309F, U+30A0..U+30FF, U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF, \
U+3130..U+318F, U+A000..U+A48F, U+A490..U+A4CF
min_prefix_len = 0    # minimum prefix length
min_infix_len = 1    # minimum infix length
ngram_len = 1    # n-gram length used to cut non-alphabetic data

# With this option enabled, every Chinese and English word is split, and indexing becomes slower
#ngram_chars = U+4E00..U+9FBF, U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF, \
#U+2F800..U+2FA1F, U+2E80..U+2EFF, U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF, \
#U+3040..U+309F, U+30A0..U+30FF, U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF, \
#U+3130..U+318F, U+A000..U+A48F, U+A490..U+A4CF

}

######### Indexer configuration #####
indexer
{
mem_limit = 256M    # memory limit
}

############ Sphinx service process ########
searchd
{
# Listening port. As of this version, port 9312 is officially registered with IANA; earlier versions defaulted to 3312.
#listen = 9312

# Service process log. When Sphinx misbehaves, useful information can usually be found here; problems with index rotation (rotate) can generally be diagnosed here as well.
log = /usr/local/sphinx/var/log/searchd.log
# Client query log. Author's note: to gather statistics on search keywords, analyze this log file.
query_log = /usr/local/sphinx/var/log/query.log
read_timeout = 5    # request read timeout, in seconds
max_children = 30    # maximum number of searchd processes that can run at the same time
pid_file = /usr/local/sphinx/var/log/searchd.pid    # process ID file
max_matches = 1000    # maximum number of matches returned for a query
seamless_rotate = 1    # enable seamless rotation; usually needed for incremental indexing
}
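
To show what the charset_table entries in the index section mean, here is a small Python sketch that expands a fragment of such a table into a character-folding map. It is an illustration under simplifying assumptions, not Sphinx's parser: it handles single characters, U+XXXX codes, ranges, and "->" mappings only, and it ignores other syntax that Sphinx accepts. All function names are hypothetical.

```python
def parse_char(token):
    """Parse one character given literally ('a') or as a code point ('U+FF10')."""
    token = token.strip()
    if token.upper().startswith("U+"):
        return int(token[2:], 16)
    if len(token) != 1:
        raise ValueError(f"unsupported token: {token!r}")
    return ord(token)


def parse_range(text):
    """Parse 'X' or 'X..Y' into an inclusive (low, high) code point pair."""
    start, sep, end = text.partition("..")
    low = parse_char(start)
    high = parse_char(end) if sep else low
    return low, high


def expand_charset_table(table):
    """Map every accepted code point to the code point it is folded to."""
    mapping = {}
    for entry in table.split(","):
        entry = entry.strip()
        if not entry:
            continue
        source, arrow, target = entry.partition("->")
        src_low, src_high = parse_range(source)
        if arrow:
            dst_low, dst_high = parse_range(target)
            if dst_high - dst_low != src_high - src_low:
                raise ValueError(f"range sizes differ in {entry!r}")
        else:
            dst_low = src_low  # no arrow: the character maps to itself
        for offset in range(src_high - src_low + 1):
            mapping[src_low + offset] = dst_low + offset
    return mapping


def fold(text, mapping):
    """Apply the table: fold known characters, turn everything else into a separator."""
    return "".join(chr(mapping[ord(ch)]) if ord(ch) in mapping else " " for ch in text)


fragment = "U+FF10..U+FF19->0..9, 0..9, A..Z->a..z, a..z, U+4E00..U+9FBF"
table = expand_charset_table(fragment)
print(fold("ABC-１２ 北京", table))
```

Characters that do not appear in the table are treated as separators, which is why the CJK ranges have to be listed explicitly before Chinese text can be indexed at all.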

3.3. Creating the index files

# bin/indexer -c etc/sphinx.conf article    ### build the index files (run from /usr/local/sphinx)
Sphinx 0.9.9-release (r2117)
Copyright (c) 2001-2009, Andrew Aksyonoff

using config file 'etc/sphinx.conf'...
indexing index 'article'...
collected docs, 0.2 MB
sorted 0.4 Mhits, 99.6% done
total docs, 210559 bytes
total 3.585 sec, 58723 bytes/sec, 278.89 docs/sec
total 2 reads, 0.031 sec, 1428.8 kb/call avg, 15.6 msec/call avg
total 11 writes, 0.032 sec, 671.6 kb/call avg, 2.9 msec/call avg

Output like the above means the index was built successfully. If indexing fails, fix the configuration file according to the error message, or ask here; I will reply as soon as I see the question.

4. Application

4.1. Testing on the CLI

In the previous step we built the index; now we test it. There are two ways to test: from the command line (CLI) and through API calls.

Command-line testing uses the search command that ships with Sphinx.

###### Search the article index for the keyword "Beijing" ########
# bin/search -c etc/sphinx.conf Beijing
Sphinx 0.9.9-release (r2117)
Copyright (c) 2001-2009, Andrew Aksyonoff

using config file 'etc/sphinx.conf'...
index 'article': query 'Beijing ': returned 995 matches of 995 total in 0.008 sec

displaying matches:
1. document=76, weight=2, cat_id=1, member_id=2, created=Sat Jan 23 19:05:09 2010
id=76
title=??????????
cat_id=1
member_id=2
content=????????????????????????????????
created=1264244709
2. document=85, weight=2, cat_id=1, member_id=2, created=Sat Jan 23 19:05:09 2010
id=85
title=????????????
cat_id=1
member_id=2
content=???????????????????????????????????????????????????????????
created=1264244709
...... omitted here ....
document=17, weight=1, cat_id=1, member_id=2, created=Sat Jan 23 19:05:09 2010
id=17
title=????????????
cat_id=1
member_id=2
content=??????????????????????????????????????????????????????????
created=1264244709

words:
1. 'Beijing': 995 documents, 999 hits

At this point we can see that all the records matching "Beijing" have been retrieved.

Note: I am using the PuTTY client with its character encoding set to UTF-8, which is a prerequisite for this test.

4.2. Using API calls

In this example I test with the PHP API. Before testing, start the Sphinx service process and open port 9312 in the CentOS firewall.

# bin/searchd -c etc/sphinx.conf &    ### run Sphinx in the background
[1] 5759
Sphinx 0.9.9-release (r2117)
Copyright (c) 2001-2009, Andrew Aksyonoff

using config file 'etc/sphinx.conf'...
listening on all interfaces, port=9312

[1]+  Done                    bin/searchd -c etc/sphinx.conf
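
Before calling the API from another machine, it can help to confirm that the searchd port is actually reachable through the firewall. The following Python sketch only tests whether a TCP connection can be opened; it does not speak the Sphinx protocol. The function name is hypothetical, and the host in the usage comment is the address used by the PHP test code in this section.

```python
import socket


def port_is_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


# Example usage (adjust the host to your own searchd server):
#     port_is_open("192.168.1.150", 9312)
```

If this returns False while searchd is running, the firewall rule for port 9312 is the first thing to check.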

PHP test code:

<?php
header('Content-Type: text/html; charset=utf-8');
?>
<form name="form1" method="get" action="">
<label>
<input style="width:400px;" type="text" name="keyword">
</label>
<label>
<input type="submit" name="Submit" value="Sphinx Search">
</label>
</form>

<?php
// Read the keyword from the query string; it is absent on the first page load.
$keyword = isset($_GET['keyword']) ? $_GET['keyword'] : '';
if (trim($keyword) == '') {
    die('Please enter keywords');
}
else {
    // Escape the keyword before echoing it back into the HTML page.
    echo 'Keyword is: ' . htmlspecialchars($keyword);
}

require "sphinxapi.php";
$cl = new SphinxClient();
$cl->SetServer('192.168.1.150', 9312); // note the host here
# $cl->SetMatchMode(SPH_MATCH_EXTENDED); // use multi-field mode
dump($cl);
$index = "article";
$res = $cl->Query($keyword, $index);
if ($res === false) {
    // Query() returns false on failure; report the reason.
    die('Query failed: ' . $cl->GetLastError());
}
dump($res);

// Print a variable in a readable form.
function dump($var)
{
    echo '<pre>';
    var_dump($var);
    echo '</pre>';
}
?>

Searching for "Beijing" and dumping the result gives the following:

array(10) {
  ["error"]=> string(0) ""
  ["warning"]=> string(0) ""
  ["status"]=> int(0)
  ["fields"]=> array(2) {
    [0]=> string(5) "title"
    [1]=> string(7) "content"
  }
  ["attrs"]=> array(3) {
    ["cat_id"]=> int(1)
    ["member_id"]=> int(1)
    ["created"]=> int(2)
  }
  ["matches"]=> array() {
    [76]=> array(2) {
      ["weight"]=> string(1) "2"
      ["attrs"]=> array(3) {
        ["cat_id"]=> string(1) "1"
        ["member_id"]=> string(1) "2"
        ["created"]=> string(10) "1264244709"
      }
    }
    ..... omitted here .....
    [17]=> array(2) {
      ["weight"]=> string(1) "1"
      ["attrs"]=> array(3) {
        ["cat_id"]=> string(1) "1"
        ["member_id"]=> string(1) "2"
        ["created"]=> string(10) "1264244709"
      }
    }
  }
  ["total"]=> string(3) "995"
  ["total_found"]=> string(3) "995"
  ["time"]=> string(5) "0.008"
  ["words"]=> array(1) {
    ["Beijing"]=> array(2) {
      ["docs"]=> string(3) "995"
      ["hits"]=> string(3) "999"
    }
  }
}

PHP can now retrieve the search results!
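
A result of this shape is usually reduced to a ranked list of document IDs before the rows are loaded from the database. The sketch below shows that step in Python on a hand-written dictionary that mirrors part of the dump above; it is an illustration only, and the function name is hypothetical.

```python
def ranked_ids(result):
    """Return the matched document IDs, highest weight first.

    Weights arrive as strings in the dump, so they are converted to int.
    IDs with equal weight keep the order in which they were returned.
    """
    matches = result.get("matches") or {}
    return sorted(
        matches,
        key=lambda doc_id: int(matches[doc_id]["weight"]),
        reverse=True,
    )


# Hand-written sample mirroring part of the dump above.
sample = {
    "total": "995",
    "matches": {
        17: {"weight": "1", "attrs": {"cat_id": "1", "member_id": "2"}},
        76: {"weight": "2", "attrs": {"cat_id": "1", "member_id": "2"}},
    },
}
print(ranked_ids(sample), "of", int(sample["total"]), "matches in total")
```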
