Sphinx Installation and API Learning Notes


Sphinx Installation

There are two ways to use Sphinx with MySQL:

1. Call the API from PHP, Java, etc., and run queries through the client library's functions or methods. The advantage is that MySQL does not need to be recompiled, the search daemon stays loosely coupled from the database server, and the application can call it flexibly and conveniently; the disadvantage is that existing search code in the program has to be modified. Recommended for programmers.

2. Use the plugin approach (SphinxSE): compile Sphinx as a MySQL storage engine plugin and run searches with specific SQL statements. Its strength is that everything can be combined on the SQL side and data is returned directly to the client without a second query; only the relevant SQL needs to change in the application. However, this is inconvenient for programs built on a framework such as an ORM, MySQL must be recompiled, and plugin storage engines require MySQL 5.1 or later. A hedged sketch of this style of query follows.
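For reference only, a minimal hedged sketch of the SphinxSE style from PHP (the SphinxSE table sphinx_posts, the index name, the connection credentials and the query options are assumptions for illustration, not part of the original notes):

// Assumes a SphinxSE table was created in MySQL beforehand, roughly like:
//   CREATE TABLE sphinx_posts (
//     id BIGINT UNSIGNED NOT NULL,
//     weight INT NOT NULL,
//     query VARCHAR(3072) NOT NULL,
//     INDEX(query)
//   ) ENGINE=SPHINX CONNECTION="sphinx://localhost:3312/test1";
$pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8', 'user', 'password');
$stmt = $pdo->prepare('SELECT id, weight FROM sphinx_posts WHERE query = ?');
$stmt->execute(['hello world;mode=any;limit=10']);   // search terms plus SphinxSE options
foreach ($stmt as $row) {
    echo $row['id'], ' ', $row['weight'], PHP_EOL;   // id is the matching document ID
}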

The installation described here targets the first approach, calling Sphinx through the API. Install Sphinx as follows:

The code is as follows:

# Download the latest stable release
wget http://www.sphinxsearch.com/downloads/sphinx-0.9.9.tar.gz
tar xzvf sphinx-0.9.9.tar.gz
cd sphinx-0.9.9
./configure --prefix=/usr/local/sphinx/ --with-mysql --enable-id64
make
make install

Note: Installed this way, Sphinx does not support Chinese word segmentation.


Sphinx Chinese Word Segmentation

Chinese full-text search differs from English and other Latin-script languages: the latter split words on spaces and similar separator characters, while Chinese must be segmented based on semantics. There are two main Chinese word-segmentation plugins:

1. Coreseek

Coreseek is currently the most widely used Chinese full-text search based on Sphinx. It provides libmmseg, a Chinese word-segmentation package designed for Sphinx, and is developed on top of Sphinx itself.

2. SFC (Sphinx-for-chinese)

SFC (sphinx-for-chinese) is another Chinese word-segmentation plugin, contributed by a community developer. The dictionary it uses is xdict.

This section introduces how to install Coreseek.

Coreseek (Sphinx for Chinese Search) Installation

1. Install or upgrade autoconf

Coreseek requires autoconf 2.64 or later, so autoconf must be upgraded first or the build will fail. Download autoconf-2.64.tar.bz2 from http://download.chinaunix.net/download.php?id=29328&resourceid=648 and install it as follows:

The code is as follows:

tar -jxvf autoconf-2.64.tar.bz2
cd autoconf-2.64
./configure
make
make install

2. Download Coreseek

Newer Coreseek releases bundle the dictionary and the Sphinx source in a single package, so only the Coreseek package needs to be downloaded.

wget http://www.wapm.cn/uploads/csft/3.2/coreseek-3.2.14.tar.gz

3. Install mmseg (the word segmenter used by Coreseek)

The code is as follows:

tar xzvf coreseek-3.2.14.tar.gz
cd coreseek-3.2.14    # enter the extracted package directory
cd mmseg-3.2.14
./bootstrap    # warnings in the output can be ignored; errors must be resolved
./configure --prefix=/usr/local/mmseg3
make && make install
cd ..

4. Install Coreseek (Sphinx)

The code is as follows:

cd csft-3.2.14
sh buildconf.sh    # warnings in the output can be ignored; errors must be resolved
./configure --prefix=/usr/local/coreseek --without-unixodbc --with-mmseg --with-mmseg-includes=/usr/local/mmseg3/include/mmseg/ --with-mmseg-libs=/usr/local/mmseg3/lib/ --with-mysql
make && make install
cd ..

5. Test mmseg word segmentation and Coreseek search

Note: the locale should be set to zh_CN.UTF-8 to make sure Chinese displays correctly; my system locale is en_US.UTF-8 and that also works.

The code is as follows:

cd testpack
cat var/test/test.xml    # Chinese should display correctly at this point
/usr/local/mmseg3/bin/mmseg -d /usr/local/mmseg3/etc var/test/test.xml
/usr/local/coreseek/bin/indexer -c etc/csft.conf --all
/usr/local/coreseek/bin/search -c etc/csft.conf 网络搜索    # search for "network search"

At this point it should correctly return results like:

words:
1. '网络': 1 documents, 1 hits
2. '搜索': 2 documents, 5 hits

6. Generate the mmseg dictionary and configuration file

The new version generates these automatically.


Sphinx API usage and important settings, collected here for reference so they are not forgotten.

The code is as follows:
$cl = new SphinxClient();

// Default host is localhost; searchd in this install listens on port 3312
$cl->SetServer("localhost", 3312);

// Optionally set a weight for each full-text field. The weights follow the order of the fields
// defined in sql_query (this call is expected to change in later Sphinx versions); weights can
// also be set by field name with SetFieldWeights().
$cl->SetWeights(array(100, 1));
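As a hedged illustration of per-field weighting by name (the field names title and content are placeholders, not from the original notes):

// Give matches in the title field 100x the weight of matches in the content field
$cl->SetFieldWeights(array("title" => 100, "content" => 1));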


// The query matching mode. The available modes are:
// SPH_MATCH_ALL      - match all query words (AND); the default mode
// SPH_MATCH_ANY      - match any of the query words (OR)
// SPH_MATCH_PHRASE   - treat the whole query as a phrase that must match completely and in order
// SPH_MATCH_BOOLEAN  - treat the query as a boolean expression
// SPH_MATCH_EXTENDED - treat the query as an expression in Sphinx's internal query language
// There is also a special "full scan" mode, activated automatically when both conditions hold:
// 1. the query string is empty (length 0);
// 2. the docinfo storage mode is extern.
// In full-scan mode all indexed documents are considered matches. Such matches can still be
// filtered, sorted and grouped, but no real full-text search is performed. This mode can be used
// to unify full-text and non-full-text search code, or to take load off SQL servers (sometimes a
// Sphinx scan is faster than the equivalent MySQL query).
$cl->SetMatchMode(SPH_MATCH_ALL);    // pass the constant, not a string
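As a hedged sketch of the extended mode listed above (the field name title is an assumption for illustration):

// Extended query syntax: restrict matching to one full-text field
$cl->SetMatchMode(SPH_MATCH_EXTENDED);
$res = $cl->Query('@title 网络搜索', '*');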

// Search only documents with forum_id equal to 1, 3 or 7. Passing true as the third argument,
// $cl->SetFilter("forum_id", array(1, 3, 7), true), excludes those values instead
// (forum_id != 1, != 3 and != 7).
$cl->SetFilter("forum_id", array(1, 3, 7));
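A related call that can be combined with the above is a range filter; a minimal hedged sketch (the bounds are placeholders, and post_date is the timestamp attribute used elsewhere in these notes):

// Keep only documents whose post_date attribute (a Unix timestamp) lies in the given range
$cl->SetFilterRange("post_date", 1262275200, 1293811199);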

// Grouping functions for SetGroupBy():
// SPH_GROUPBY_DAY   - extract year, month and day from the timestamp, in YYYYMMDD format
// SPH_GROUPBY_WEEK  - extract the year and the first day of the containing week (counted from
//                     the start of the year) from the timestamp, in YYYYNNN format
// SPH_GROUPBY_MONTH - extract year and month from the timestamp, in YYYYMM format
// SPH_GROUPBY_YEAR  - extract the year from the timestamp, in YYYY format
// SPH_GROUPBY_ATTR  - group by attribute value
// The final search result contains one best match per group. The grouping function value and the
// per-group match count are returned as the "virtual" attributes @group and @count respectively.
// The group-sort clause controls how groups are ordered: SPH_SORT_RELEVANCE ignores any additional
// parameters and always sorts by relevance, while the remaining sort modes require an extra clause
// whose syntax depends on the mode.
$cl->SetGroupBy("username", SPH_GROUPBY_ATTR, $groupsort);

$cl->SetGroupDistinct($distinct);
/*
$cl->SetGroupBy("category", SPH_GROUPBY_ATTR, "@count desc");
$cl->SetGroupDistinct("vendor");
is equivalent to:
SELECT id, weight, all-attributes,
    COUNT(DISTINCT vendor) AS @distinct,
    COUNT(*) AS @count
FROM products
GROUP BY category
ORDER BY @count DESC
*/
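A further hedged illustration of time-based grouping (assuming the index carries the post_date timestamp attribute used elsewhere in these notes):

// Group matches by day, extracted from post_date; "@group desc" orders the newest groups first
$cl->SetGroupBy("post_date", SPH_GROUPBY_DAY, "@group desc");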



Sort modes for SetSortMode():
SPH_SORT_RELEVANCE - sort by relevance in descending order (best matches first)
SPH_SORT_ATTR_DESC - sort by an attribute in descending order (larger attribute values first)
SPH_SORT_ATTR_ASC - sort by an attribute in ascending order (smaller attribute values first)
SPH_SORT_TIME_SEGMENTS - sort by time segment (last hour/day/week/month) in descending order, then by relevance in descending order
SPH_SORT_EXTENDED - sort by a combination of columns in ascending or descending order, SQL-style
SPH_SORT_EXPR - sort by an arithmetic expression

The code is as follows:
$cl->SetSortMode(SPH_SORT_EXTENDED, "post_date");
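A hedged sketch of the expression-based sort mode listed above (the integer attribute views is an assumption for illustration):

// Rank by a mix of the full-text weight and an attribute value
$cl->SetSortMode(SPH_SORT_EXPR, "@weight + views * 0.1");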

// Return $limit matches starting from offset 0. The third parameter (max_matches) controls how
// many matches the server keeps for this query; here it is at least 1000.
$cl->SetLimits(0, $limit, ($limit > 1000) ? $limit : 1000);
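For example, a hedged pagination sketch built on SetLimits() ($page and $per_page are illustrative variables, not from the original notes):

// Page numbers start at 1; fetch one page of matches
$page = 2;
$per_page = 20;
$cl->SetLimits(($page - 1) * $per_page, $per_page, 1000);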

 
Set the ranking (scoring) mode:
SPH_RANK_PROXIMITY_BM25 - the default mode; combines the phrase proximity score with the BM25 score.
SPH_RANK_BM25 - statistical ranking; uses only the BM25 score (as most other full-text engines do). Faster, but may reduce result quality for queries containing more than one word.
SPH_RANK_NONE - ranking disabled; the fastest mode. Effectively the same as a boolean search: every match is assigned a weight of 1.
SPH_RANK_WORDCOUNT - ranks by keyword occurrence count. The ranker counts keyword occurrences in each field, multiplies each count by the field weight, and sums the products as the final score.
SPH_RANK_PROXIMITY - added in 0.9.9-rc1; returns the raw phrase proximity as the result. Used internally to emulate SPH_MATCH_ALL queries.
SPH_RANK_MATCHANY - added in 0.9.9-rc1; returns the rank previously computed in SPH_MATCH_ANY mode. Used internally to emulate SPH_MATCH_ANY queries.
SPH_RANK_FIELDMASK - added in 0.9.9-rc2; returns a 32-bit mask in which bit N (counting from 0) is set when the N-th full-text field contains a keyword that satisfies the query.

The code is as follows:
$cl->SetRankingMode(SPH_RANK_PROXIMITY_BM25);    // pass the constant, not a string
 
// PHP-specific. Controls the format of the returned match set (array or hash). $arrayresult should
// be boolean. If it is false (the default), matches are returned as a PHP hash keyed by document ID,
// with the remaining information (weight, attributes) as the value. If it is true, matches are
// returned as a plain array that includes all information about each match, including the document ID.
$cl->SetArrayResult(true);
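A brief hedged sketch of how the two result shapes are typically consumed (assuming $res holds a successful result of the Query() call described below):

// With SetArrayResult(true): matches form a plain list and the ID sits inside each element
foreach ($res['matches'] as $match) {
    $ids[] = $match['id'];
}
// With SetArrayResult(false), the default: matches are keyed by document ID
foreach ($res['matches'] as $docId => $match) {
    $ids[] = $docId;
}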



Query() connects to the searchd server, runs the given query with the client's current settings, and fetches and returns the result set. $query is the query string and $index is one or more index names. On a general error it returns false and sets the GetLastError() message; on success it returns the result set of the search. In addition, $comment is added to the front of the search entry in the query log, which is useful for debugging; the comment is currently limited to 128 characters. The default value of $index is "*", which means querying all local indexes. Characters allowed in index names are Latin letters (a-z), digits (0-9), the minus sign (-) and the underscore (_); any other character is treated as a separator. Therefore, the following calls are all valid and all search the same two indexes:

The code is as follows:
$res = $cl->Query($query, $index);
/*
$cl->Query("test query", "main delta");
$cl->Query("test query", "main;delta");
$cl->Query("test query", "main, delta");
*/
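Putting the pieces together, a hedged end-to-end sketch of the API approach described at the top of these notes (the index name posts, the MySQL table posts, the column title and the connection details are assumptions for illustration):

require_once 'sphinxapi.php';            // the PHP client shipped with Sphinx/Coreseek

$cl = new SphinxClient();
$cl->SetServer('localhost', 3312);
$cl->SetMatchMode(SPH_MATCH_ALL);
$cl->SetLimits(0, 20, 1000);

$res = $cl->Query('网络搜索', 'posts');  // index name "posts" is a placeholder
if ($res === false) {
    die('Search failed: ' . $cl->GetLastError());
}

if (!empty($res['matches'])) {
    // Sphinx returns only document IDs and attributes; fetch the full rows from MySQL by ID
    $ids = implode(',', array_keys($res['matches']));
    $pdo = new PDO('mysql:host=localhost;dbname=test;charset=utf8', 'user', 'password');
    foreach ($pdo->query("SELECT * FROM posts WHERE id IN ($ids)") as $row) {
        echo $row['title'], PHP_EOL;     // column name "title" is also a placeholder
    }
}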
