Sphinx installation process and working experience with PHP

1. What is Sphinx

Sphinx is a high-performance full-text search engine developed by Andrew Aksyonoff of Russia. It is released under a dual license: the GPL and a commercial license.

Full-text search is an information retrieval technique that treats all of the text in a document as the search target. The text that is matched may be an article's title, its author, its summary, or its body. It is often used for fuzzy queries over news, forum posts, and similar content.

2. Features of Sphinx
    • High-speed indexing (close to 10 MB/s on a modern CPU);
    • High-speed search (average query time under 0.1 seconds on 2-4 GB of text);
    • High scalability (up to 100 GB of text and 100M documents on a single CPU);
    • Good relevance ranking;
    • Document summary (excerpt) generation;
    • Searching from within MySQL through a pluggable storage engine;
    • Boolean, phrase, and synonym queries;
    • Support for single-byte encodings, UTF-8, and so on.

3. Download Coreseek

PS: Coreseek is developed on top of Sphinx. Sphinx supports only MySQL; Coreseek supports more data sources and also supports Chinese word segmentation.

http://www.coreseek.cn/opensource/mmseg/

4. Install the operating system's base development libraries and the MySQL dependencies, to support MySQL and XML data sources

  yum install make gcc g++ gcc-c++ libtool autoconf automake imake mysql-devel libxml2-devel expat-devel

(PS: I tested this on CentOS 5.6. Different operating system versions need different libraries installed.)

5. Install (compile and install)

5.1 Unpack

tar -zxvf coreseek-3.2.14.tar.gz

cd coreseek-3.2.14

5.2 Installing MMSEG

cd mmseg-3.2.14

./bootstrap # warnings in the output can be ignored, but errors must be resolved

./configure --prefix=/usr/local/mmseg

make && make install

cd ..

5.3 Installing Coreseek

cd csft-3.2.14

sh buildconf.sh # warnings in the output can be ignored, but errors must be resolved

./configure --prefix=/usr/local/coreseek --without-unixodbc --with-mmseg --with-mmseg-includes=/usr/local/mmseg/include/mmseg/ --with-mmseg-libs=/usr/local/mmseg/lib/ --with-mysql

make && make install

6. Configure and test

cp /usr/local/coreseek/etc/sphinx-min.conf.dist /etc/csft.conf

/usr/local/coreseek/bin/indexer -c /etc/csft.conf

7. Configure csft.conf

vim /etc/csft.conf

# Data source configuration.
# The source is named "database name_table name"; each data table needs its own source block.
source sphinx_t0
{
    # Database type
    type = mysql
    sql_host = localhost
    sql_user = root
    sql_pass = 123123
    # Database to use
    sql_db = sphinx
    # Optional, default is 3306
    sql_port = 3306
    # MySQL socket
    sql_sock = /tmp/mysql.sock

    # SQL statement that reads the data from the database.
    # Where possible, do not use WHERE or GROUP BY here; leave the filtering and
    # grouping to Sphinx, which handles conditional filtering and GROUP BY more efficiently.
    # Note: the SELECT must include a unique primary key, the fields to be searched
    # in full text (there can be several), the fields to output, and the fields
    # you want to filter on.
    # Example: start from the SQL statement you expect to run, then derive the Sphinx settings from it:
    #   select * from t0 where description like '%Guangzhou%' or name like '%s%'
    #   => select id, description, name, age from t0
    sql_query = SELECT id, name, age, description, group_id, date_added FROM t0

    # Fields declared with sql_attr_* are attributes (search conditions) only;
    # filter on them with SphinxClient::SetFilter().
    # Fields that are not declared automatically become full-text fields;
    # search them with SphinxClient::Query("search string").
    # The first column of sql_query (id) must be an integer; it is used by the
    # system and does not need sql_attr_uint.
    sql_attr_uint = age
    sql_attr_uint = group_id
    # Different field types use different attribute names, for example sql_attr_timestamp for timestamps
    sql_attr_timestamp = date_added

    # Used by command-line queries to read the original data from the database
    # sql_query_info = SELECT * FROM documents WHERE id=$id

    # SQL commands executed before sql_query; there can be more than one.
    # This one sets the character encoding used for the SQL.
    sql_query_pre = SET NAMES utf8
}

# Index; every source needs an index.
# The index name is usually the same as the source name.
index sphinx_t0
{
    # Associated source
    source = sphinx_t0
    # Path where the index files are stored, one per index
    path = /usr/local/coreseek/var/data/sphinx_t0
    docinfo = extern
    # Directory from which the word segmenter reads its dictionary; required when
    # word segmentation is enabled. With libmmseg as the segmenter, make sure the
    # dictionary file uni.lib is in this directory.
    charset_dictpath = /usr/local/mmseg/etc/
    # Character encoding
    charset_type = zh_cn.utf-8
}

# Indexer settings, applied to all indexes
indexer
{
    # Memory limit
    mem_limit = 512M
}

# Sphinx daemon (searchd) configuration
searchd
{
    # Port
    port = 9312
    log = /usr/local/coreseek/var/log/searchd.log
    query_log = /usr/local/coreseek/var/log/query.log
    # Timeout
    read_timeout = 5
    # Maximum number of connections
    max_children = 30
    # PID file path (under the /usr/local/coreseek prefix used at install time)
    pid_file = /usr/local/coreseek/var/log/searchd.pid
    # Maximum number of matches: however many rows match, only this many are returned
    max_matches = 1000
    seamless_rotate = 1
    preopen_indexes = 0
    unlink_old = 1
}

Build the index:

/usr/local/coreseek/bin/indexer -c /etc/csft.conf sphinx_t0

Start the daemon:

/usr/local/coreseek/bin/searchd -c /etc/csft.conf

At this point the basic Sphinx installation is complete. The next step is to use it together with PHP.

8. Use with PHP

  

<?php
header('Content-type: text/html; charset=utf-8'); // output is encoded as UTF-8
include 'sphinxapi.php'; // load the Sphinx API (download it online)
$list = array();
if (!empty($_POST)) {
    $sc = new SphinxClient(); // instantiate the API
    // Set the server: the first parameter is the Sphinx server address,
    // the second is the port Sphinx listens on.
    $sc->SetServer('192.168.1.108', 9312);
    // Run the query: the first parameter is the search keyword, the second is the
    // index name (as defined in the configuration file). Multiple index names are
    // separated by commas, or use * for all indexes.
    $res = $sc->Query($_POST['key'], 'sphinx_t0');
    // print_r($sc);
    print_r($res);
    exit;
}
?>
<form action="" method="post">
<input type="text" name="key"/>
<input type="submit" value="Submit"/>
</form>

Note: If the output is blank, call print_r($sc) to see the error code.

Error code 10060: the server firewall has not opened port 9312. You can simply stop the firewall with service iptables stop.

Error code 10061: the index may not have been built successfully. Build the dictionary properly, then rebuild the index.

9. Updating the dictionary and the dictionary structure

9.1 Download the thesaurus you need from the official Sogou input method website

9.2 Use the Deep Blue thesaurus converter (hosted on Google and GitHub) to convert the downloaded thesaurus to a TXT file

9.3 Use PHP code to convert that file into a TXT file that follows the dictionary rules (Note: the newly generated words_new.txt has no entries in common with the original unigram.txt, so you can append it to unigram.txt with the command cat /usr/local/mmseg/etc/words_new.txt >> /usr/local/mmseg/etc/unigram.txt)

<?php
ini_set('display_errors', 'On');
error_reporting(E_ALL);
date_default_timezone_set('Asia/Shanghai');
set_time_limit(0);
$buffer = ini_get('output_buffering');
if ($buffer) {
    ob_end_flush();
}
// Note: the files must be UTF-8 encoded
echo 'Processing new thesaurus...' . PHP_EOL; // PHP_EOL is "\r\n" on Windows and "\n" on Unix/Linux
flush();
$filename = "words.txt";
$handle = fopen($filename, "r");
$content = fread($handle, filesize($filename));
fclose($handle);
$content = trim($content);
$arr1 = explode("\r\n", $content); // assumes words.txt uses Windows (CRLF) line endings
// print_r($arr1); exit;
$arr1 = array_flip(array_flip($arr1)); // flip keys and values twice to drop duplicate values
foreach ($arr1 as $key => $value) {
    $value = dealChinese($value);
    if (!empty($value)) {
        $arr1[$key] = $value;
    } else {
        unset($arr1[$key]);
    }
}
// print_r($arr1); exit;
echo 'Processing the original thesaurus...' . PHP_EOL;
flush();
$filename2 = "unigram.txt";
$handle2 = fopen($filename2, "r");
$content2 = fread($handle2, filesize($filename2));
fclose($handle2);
$content2 = dealChinese($content2, "\r\n");
$arr2 = explode("\r\n", $content2);
echo 'Deleting identical terms...' . PHP_EOL;
flush();
$array_diff = array_diff($arr1, $arr2);
echo 'Formatting the thesaurus...' . PHP_EOL;
flush();
$words = '';
foreach ($array_diff as $k => $word) {
    $words .= $word . "\t1" . PHP_EOL . "x:1" . PHP_EOL;
}
// echo $words;
file_put_contents('words_new.txt', $words, FILE_APPEND); // write the file
echo 'done!';

function dealChinese($str, $join = '')
{
    preg_match_all('/[\x{4e00}-\x{9fff}]+/u', $str, $matches); // match all runs of Chinese characters
    $str = join($join, $matches[0]); // rejoin the matches
    return $str;
}
?>
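The entry format that the script above writes (a "word<TAB>1" line followed by an "x:1" line) may be easier to see in a smaller sketch. The shell function below is a hypothetical helper, not part of mmseg or Coreseek: it prints the words from a plain word list that are not yet in an existing dictionary, in that two-line format. Unlike the PHP script, it does not strip non-Chinese characters, and the name new_entries is only illustrative.

```shell
#!/bin/sh
# Hypothetical helper (illustrative name): print the words from NEW_LIST that are
# not already in DICT, as two-line dictionary entries ("word<TAB>1", then "x:1").
new_entries() {
    new_list=$1   # plain word list, one word per line
    dict=$2       # existing dictionary in unigram.txt format
    awk -F '\t' '
        # Tolerate Windows (CRLF) line endings in either file.
        { sub(/\r$/, "") }
        # First file: remember every word already in the dictionary.
        FILENAME == ARGV[1] { if ($1 !~ /^x:/) seen[$1] = 1; next }
        # Second file: emit each new, non-empty word exactly once.
        $1 != "" && !($1 in seen) { seen[$1] = 1; print $1 "\t1"; print "x:1" }
    ' "$dict" "$new_list"
}
```

For example, new_entries words.txt unigram.txt > words_new.txt would produce a file that can then be appended to unigram.txt as described in 9.3.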

9.4 Rebuild the Chinese word segmentation dictionary

cd /usr/local/mmseg/bin

./mmseg -u ../etc/unigram.txt # produces a file named unigram.txt.uni

cd ..

cd etc

mv unigram.txt.uni uni.lib # rename the file to uni.lib; the dictionary is now built

10. Rebuild the index and restart the service

/usr/local/coreseek/bin/searchd -c /etc/csft.conf --stop # stop the searchd service

/usr/local/coreseek/bin/indexer -c /etc/csft.conf --all --rotate # rebuild all indexes

/usr/local/coreseek/bin/searchd -c /etc/csft.conf # start the searchd service

Note: If a segmentation fault occurs while the index is being built, check the following four cases

Case 1: the path to uni.lib is wrong, or the file cannot be found.

Case 2: unigram.txt is too large; it should generally not exceed 200,000 entries.

Case 3: unigram.txt was edited with an editor such as Notepad, which left the dictionary file in the wrong format (preferably open it with Notepad++).

Case 4: the layout of unigram.txt is inconsistent, or the file ends with too many line breaks (in testing, only one empty line at the end was tolerated).
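Cases 2 to 4 can be checked before running the indexer. The function below is a minimal sketch, not a tool shipped with mmseg or Coreseek. The name check_unigram, its messages, and the default limit of 200,000 entries are assumptions taken from the notes above. It also assumes that a carriage return in the file is the kind of format damage meant in case 3.

```shell
#!/bin/sh
# Hypothetical helper (not part of mmseg or Coreseek): checks a unigram.txt-style
# dictionary for the file-level problems described in the cases above.
check_unigram() {
    file=$1
    max_entries=${2:-200000}   # rule of thumb from case 2, not a documented hard limit
    status=0

    if [ ! -f "$file" ]; then
        echo "not found: $file"
        return 1
    fi

    # Each entry is a "word<TAB>frequency" line followed by an "x:1" line,
    # so counting the lines that contain a tab counts the entries.
    tab=$(printf '\t')
    entries=$(grep -c "$tab" "$file" || true)
    if [ "$entries" -gt "$max_entries" ]; then
        echo "too many entries: $entries (limit $max_entries)"
        status=1
    fi

    # Carriage returns usually mean the file was saved with Windows line endings.
    cr=$(printf '\r')
    if grep -q "$cr" "$file"; then
        echo "contains carriage returns (Windows line endings)"
        status=1
    fi

    # Count the blank lines at the very end of the file; more than one is suspicious.
    trailing=$(awk 'NF == 0 { n++; next } { n = 0 } END { print n + 0 }' "$file")
    if [ "$trailing" -gt 1 ]; then
        echo "too many trailing blank lines: $trailing"
        status=1
    fi

    return $status
}
```

A typical call would be check_unigram /usr/local/mmseg/etc/unigram.txt. Case 1 is a plain existence check on uni.lib in the directory named by charset_dictpath.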

That completes the setup.
