Sphinx and Coreseek


Sphinx


1. Download Sphinx
http://sphinxsearch.com/

2. Compile and install
tar -zxvf sphinx.tar.gz
./configure --prefix=/usr/local/sphinx --with-mysql=/usr/local/mysql
make && make install
Sphinx ships three important commands (in its bin directory):
indexer -- builds indexes. searchd -- the search daemon. search -- the command-line search tool.

3. Prepare the data
Import example.sql:
mysql -u test < /usr/local/sphinx/etc/example.sql

4. Configure sphinx.conf
cp sphinx.conf.dist sphinx.conf

sphinx.conf contains several sections:
Main data source:
source main {
}
Incremental (delta) data source:
source delta : main {
}

Primary index:
index main {
}
Incremental (delta) index:
index delta : main {
}

Distributed index:
index dist1 {
}
Real-time index:
index rt {
}

Indexer settings:
indexer {
}
Service process:
searchd {
}

Common settings:
common {
}
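To make these sections concrete, here is a minimal sketch of a working sphinx.conf (the table name post, its id/title/content columns, the test database, and the paths are assumptions carried over from the examples elsewhere in this article; adjust them to your setup):

```conf
source main
{
    type      = mysql
    sql_host  = localhost
    sql_user  = test
    sql_pass  =
    sql_db    = test
    sql_query = SELECT id, title, content FROM post
}

index main
{
    source = main
    path   = /usr/local/sphinx/var/data/main
}

indexer
{
    mem_limit = 128M
}

searchd
{
    listen   = 9312
    log      = /usr/local/sphinx/var/log/searchd.log
    pid_file = /usr/local/sphinx/var/log/searchd.pid
}
```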


Create an index
Once the configuration file is complete and the data is in place, build the indexes with indexer:
-c        specify the configuration file
--all     rebuild all indexes
--rotate  rotate indexes, used to rebuild without stopping the service
--merge   merge indexes
/usr/local/sphinx/bin/indexer -c /usr/local/sphinx/etc/sphinx.conf --all
If you get an error like:
/usr/local/sphinx/bin/indexer: error while loading shared libraries: libmysqlclient.so.16: cannot open shared object file: No such file or directory

find the library's path with locate libmysqlclient (or find / -name libmysqlclient.so.16) and copy it into the library path:
cp /usr/local/mysql/lib/mysql/libmysqlclient.so.16 /usr/lib/libmysqlclient.so.16

Test
Query the index with the search command:
/usr/local/sphinx/bin/search -c /usr/local/sphinx/etc/sphinx.conf test



Coreseek


Installation:
Download from http://www.coreseek.cn, unzip, and cd into the mmseg directory:
./configure --prefix=/usr/local/mmseg
If configure reports "cannot find input file: src/Makefile.in", run automake.
About automake: http://blog.csdn.net/fb408487792/article/details/45391171
Run /usr/local/mmseg/bin/mmseg; if it prints its usage information, the build succeeded.

cd into the csft directory:
./configure --prefix=/usr/local/coreseek --with-mysql=/usr/local/mysql --with-mmseg=/usr/local/mmseg --with-mmseg-includes=/usr/local/mmseg/include/mmseg/ --with-mmseg-libs=/usr/local/mmseg/lib/
make && make install

Configure the Sphinx configuration file with Chinese word segmentation
The configuration is the same as in the steps above, but there are a few things to note with Coreseek.
Note: the Coreseek configuration file is csft.conf, not sphinx.conf.
cd /usr/local/coreseek/etc
cp sphinx.conf.dist csft.conf
vim csft.conf
Everything else stays the same; the one place to modify:
index test1
{
#stopwords = /data/stopwords.txt
#wordforms = /data/wordforms.txt
#exceptions = /data/exceptions.txt
#charset_type = sbcs
Add the following two lines, which enable Chinese word segmentation:
charset_type = zh_cn.utf-8
charset_dictpath = /usr/local/mmseg/etc/ # the directory where mmseg is installed
}

Create the index
/usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/csft.conf --all
Test again, this time searching Chinese text:
/usr/local/coreseek/bin/search [-a] -c /usr/local/coreseek/etc/csft.conf 'configuration'
Note: if you name the Coreseek configuration file csft.conf, then indexer, search, and searchd do not need the -c /usr/local/coreseek/etc/csft.conf option, because that file is looked up by default.


Using Sphinx from PHP


Sphinx can be integrated into a PHP program in two ways:
1. the Sphinx PHP module (PECL extension)
2. the sphinxapi class


Steps to use Sphinx:
1. have the data ready
2. create the Sphinx configuration file
3. build the index
4. start the searchd service process, which listens on port 9312
5. connect to the Sphinx service from a PHP client program

First, start the Sphinx service
To use Sphinx from a program, the Sphinx service must be running.
Start the daemon with searchd:
-c        specify the configuration file
--stop    stop the service
--pidfile explicitly specify a PID file
-p        specify the port
/usr/local/coreseek/bin/searchd -c /usr/local/coreseek/etc/csft.conf
Note: the service started here is searchd, not search.
Sphinx listens on port 9312 by default.

Second, connect to Sphinx from PHP
(1) Load the Sphinx module into PHP

wget http://pecl.php.net/get/sphinx-1.1.0.tgz
tar zxf sphinx-1.1.0.tgz
cd /www/soft/sphinx-1.1.0
/usr/local/php/bin/phpize
./configure --with-php-config=/usr/local/webserver/php/bin/php-config
If you are prompted:
checking for libsphinxclient headers in default path... not found
configure: error: cannot find libsphinxclient headers
libsphinxclient lives in csft/api/libsphinxclient:
cd libsphinxclient/
./configure
make && make install
After installing libsphinxclient, continue installing the Sphinx extension:
cd /www/soft/sphinx-1.1.0
/usr/local/php/bin/phpize
./configure --with-php-config=/usr/local/php/bin/php-config
make && make install
cd /usr/local/php/lib/php/extensions/no-debug-non-zts-20060613/
You should see sphinx.so. Then:
vi /usr/local/webserver/php/etc/php.ini
add extension = sphinx.so
/usr/local/apache2/bin/apachectl restart
Test the Sphinx module at http://127.0.0.1/phpinfo.php

(2) Connect to Sphinx using the API class
Find the sphinxapi.php file in the unpacked Coreseek package and copy it into your program directory.

include 'sphinxapi.php';
$sphinx = new SphinxClient();
// connect: the first parameter is the Sphinx server address, the second the port searchd listens on
$sphinx->SetServer("localhost", 9312);
// run the query: the first parameter is the keyword, the second the index name(s);
// separate multiple index names with commas, or use * for all indexes, including the delta index
$result = $sphinx->Query($keyword, "*");
print_r($result);

You get an array structure similar to the following:
[matches] => Array (            // matched results
    [1] => Array (
        [weight] => 4
        [attrs] => Array (
            [group_id]   => 1
            [date_added] => 1319127367
        )
    )
)
// a multidimensional array: each key is the id of a matching document,
// [weight] is the weight, [attrs] are the attributes specified in the configuration file
[total] => 2        // number of matching documents retrieved by this query on the server
[total_found] => 2  // total number of matching documents in the index
[time] => 0.009     // time taken by this query
[words] => Array (
    [test] => Array (
        [docs] => 2   // how many documents (content fields) contain the word
        [hits] => 61  // total number of occurrences
    )
)
To highlight keywords while looping over the results, use the buildExcerpts function. PHP manual syntax:
public array SphinxClient::buildExcerpts(array $docs, string $index, string $words [, array $opts])
It returns an array and takes four parameters: the result set queried from the database, the index name, the keywords to highlight, and the formatting options.

while ($row = $result->fetch_row()) {
    // formatting options for the excerpt: highlight font settings
    $opts = array(
        // string inserted before a matched keyword, default <b>
        "before_match"    => "<span style='font-weight:bold;color:red'>",
        // string inserted after a matched keyword, default </b>
        "after_match"     => "</span>",
        // separator inserted between excerpt chunks
        "chunk_separator" => " ... ",
    );
    $res = $sphinx->BuildExcerpts($row, "index", $keyword, $opts);
    echo "<font size=4>" . $res[0] . "</font></br>"; // title
    echo "<font size=2>" . $res[1] . "</font></br>"; // abstract
    echo $res[2] . "</p>";                           // add time
}

Third, matching modes

Set the matching mode with SetMatchMode.
Prototype: function SetMatchMode($mode)
SPH_MATCH_ALL       match all query terms (default mode)
SPH_MATCH_ANY       match any one of the query terms
SPH_MATCH_PHRASE    treat the whole query as a phrase, requiring a complete match in order
SPH_MATCH_BOOLEAN   treat the query as a boolean expression
SPH_MATCH_EXTENDED  treat the query as an expression in Sphinx's internal query language
SPH_MATCH_FULLSCAN  use a full scan, ignoring the query terms
SPH_MATCH_EXTENDED2 similar to SPH_MATCH_EXTENDED, with support for scoring and weighting

Fourth, Sphinx real-time indexing (old approach)
The "primary index + incremental (delta) index" idea:
1. Create a counter
A simple implementation is to add a counter table to the database that records the document ID splitting the document set into two parts, and to update the table each time the primary index is rebuilt.
First, create the counter table in MySQL:
CREATE TABLE sph_counter (counter_id INTEGER PRIMARY KEY NOT NULL, max_doc_id INTEGER NOT NULL);

2. Modify the configuration file
We need a primary data source, an inherited data source, a primary index, and an inherited index. (The inherited index is the delta index.)
vi /usr/local/coreseek/etc/csft.conf
Main data source -- add an sql_query_pre line and change sql_query as follows:
source main
{
    # record the current maximum document id before indexing
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM post
    sql_query     = SELECT id, title, content FROM post WHERE id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1)
}
Inherited data source:
source delta : main
{
    sql_query_pre = SET NAMES utf8
    sql_query     = SELECT id, title, content FROM post WHERE id > (SELECT max_doc_id FROM sph_counter WHERE counter_id = 1)
}
Primary index (best named after its data source):
index main
{
    source = main
    path   = /usr/local/coreseek/var/data/main
}
Inherited index (this is the delta index):
index delta : main
{
    source = delta
    path   = /usr/local/coreseek/var/data/delta
}

Note: the field lists selected by the primary and delta sources must match. If, for example, the delta source's sql_query selects only id, title, content while the primary source selects id, title, date, content, combining the indexes reports an attribute-count mismatch:
delta: sql_query = SELECT id, title, content FROM post
main:  sql_query = SELECT id, title, date, content FROM post

3. Test the incremental index + primary index
To test whether the incremental index works, insert a row into the database table and check whether it can be retrieved; at this point the search should come back empty. Then rebuild just the incremental index:
/usr/local/coreseek/bin/indexer -c /usr/local/coreseek/etc/csft.conf delta
and check whether the new record is indexed.
Retrieving with the /usr/local/coreseek/bin/search tool should now show 0 results from the primary index and a hit from the delta index -- provided, of course, that the search term exists only in the newly inserted data.

4. Update the index on a schedule

We need two scripts and scheduled tasks.
Create one script each for the primary and incremental indexes:
main.sh and delta.sh

delta.sh rebuilds the incremental index:
#!/bin/bash
# delta.sh
/usr/local/coreseek/bin/indexer delta >> /usr/local/coreseek/var/log/delta.log

main.sh rebuilds the primary index, folding in what the delta covered:
#!/bin/bash
# main.sh
/usr/local/coreseek/bin/indexer main >> /usr/local/coreseek/var/log/merge.log
Finally, run the scripts automatically, so that the incremental index is rebuilt every 10 minutes and the primary index only at 2:30 a.m.
With the scripts written, set up the scheduled tasks:
crontab -e
*/10 * * * * /usr/local/coreseek/etc/delta.sh
30 2 * * * /usr/local/coreseek/etc/main.sh
Make the scripts executable:
chmod a+x delta.sh
chmod a+x main.sh
Check the log files to verify the runs.

Real-time index (new approach)
According to the official documentation, it is now ready for production use.
It can be queried and updated over the MySQL protocol using SphinxQL:
it looks like a MySQL server, but it supports full-text search, and newly inserted or updated data is automatically indexed into the real-time index.
It also has shortcomings: frequent updates can cause memory growth;
if in-memory data has not yet been written to disk, an interruption loses it;
only a subset of SQL statements is supported;
and with a huge amount of data it becomes slow, stalling for a few seconds while in-memory data is flushed to disk.
index rt
{
    type         = rt
    rt_mem_limit = 512M
    path         = @CONFDIR@/data/rt
    rt_field     = title
    rt_field     = content
    rt_attr_uint = gid
}
searchd
{
    listen      = 9306:mysql41  # port on which searchd speaks the MySQL protocol
    max_matches = 3000          # queries over the MySQL protocol return at most 3000 rows, even with LIMIT
}
A real-time index needs no data source (source) in its configuration; the program feeds it data over the mysql41 protocol.
Connect for testing with:
mysql -P 9306 -h 127.0.0.1
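As a sketch of what SphinxQL usage looks like against the rt index above (the column names title, content, and gid come from the rt_field/rt_attr_uint lines in the configuration; the document id and values here are made up for illustration):

```sql
-- after connecting with: mysql -P 9306 -h 127.0.0.1
INSERT INTO rt (id, title, content, gid) VALUES (1, 'hello', 'hello world', 100);
SELECT * FROM rt WHERE MATCH('hello');
-- REPLACE updates an existing document in place
REPLACE INTO rt (id, title, content, gid) VALUES (1, 'hello again', 'updated body', 100);
DELETE FROM rt WHERE id = 1;
```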


Fifth, distributed indexing
Distribution improves query latency and increases throughput on multi-server, multi-CPU, or multi-core setups; it is critical for search applications over very large data sets (on the order of a billion records and terabytes of text).
The distributed idea: horizontally partition the data (HP, horizontal partitioning), then process it in parallel.
When searchd receives a query against a distributed index, it does the following:
1. connects to the remote agents
2. issues the query
3. queries the local indexes
4. receives the search results from the remote agents
5. merges all results and removes duplicates
6. returns the merged result to the client
index dist
{
    type  = distributed
    local = chunk1
    agent = localhost:9312:chunk2      # local
    agent = 192.168.100.2:9312:chunk3  # remote
    agent = 192.168.100.3:9312:chunk4  # remote
}


Sixth, a full PHP code example
$keyword = $_POST['word'];
$sphinx = new SphinxClient();
$sphinx->SetServer("localhost", 9312);
$sphinx->SetMatchMode(SPH_MATCH_ANY);
//$sphinx->SetLimits(0, 0);
$result = $sphinx->Query("$keyword", "*");
//echo "<pre>"; print_r($result); echo "</pre>";
$ids = join(",", array_keys($result['matches']));
mysql_connect("localhost", "root", "root");
mysql_select_db("test");
mysql_query("SET NAMES utf8");
$sql = "SELECT * FROM post WHERE id IN ({$ids})";
$rst = mysql_query($sql);
$opts = array(
    "before_match" => "<button style='font-weight:bold;color:#F00'>",
    "after_match"  => "</button>",
);
while ($row = mysql_fetch_assoc($rst)) {
    $rst2 = $sphinx->BuildExcerpts($row, "main", $keyword, $opts);
    echo "{$rst2[0]}<br>";
    echo "{$rst2[1]}<br>";
    echo "{$rst2[2]}<br>";
}

For more information about Sphinx, see the manual: http://sphinxsearch.com/docs/current.html
For more information about Coreseek, see the manual: http://www.coreseek.cn/docs/coreseek_4.1-sphinx_2.0.1-beta.html

