Sphinx full-text search engine

Official website and documentation: http://sphinxsearch.com/docs/

http://www.sphinxsearch.org/

Python SDK: http://pypi.python.org/pypi/sphinxapi

Incremental index reference: http://blog.csdn.net/jianglei421/article/details/5431946

Zhang Peng, a fellow alumnus, has written about his use of Sphinx on his blog: http://blog.s135.com/post/360/

Theory:

1. Sphinx speaks the MySQL protocol. Besides the APIs wrapped by the official Sphinx SDKs, you can use an ordinary mysql client to connect to searchd and query it with the SphinxQL language.
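
As a sketch of that, here is what a SphinxQL query might look like driven from Python. The helper names are mine, `pymysql` is an assumed client library (any MySQL driver works, since searchd speaks the wire protocol), and the `spider` index name and port 9306 come from the config later in this article:

```python
def build_sphinxql(index, term):
    # Escape characters that are special in Sphinx extended query syntax,
    # plus the quote that would break the SQL string literal.
    for ch in '\\()|-!@~"&/^$=':
        term = term.replace(ch, '\\' + ch)
    term = term.replace("'", "\\'")
    return "SELECT * FROM %s WHERE MATCH('%s')" % (index, term)

def sphinxql_search(term, host='127.0.0.1', port=9306, index='spider'):
    # Hypothetical usage: requires searchd with "listen = 9306:mysql41".
    import pymysql  # assumption: any MySQL client library will do
    conn = pymysql.connect(host=host, port=port)
    try:
        with conn.cursor() as cur:
            cur.execute(build_sphinxql(index, term))
            return cur.fetchall()
    finally:
        conn.close()
```

The same statement works interactively via `mysql -h 0 -P 9306`.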

2. Recompile MySQL with the Sphinx storage engine (SphinxSE). You can then create a SphinxSE table that communicates with searchd at query time and returns the search results directly.

3. Command-line tools: indexer / searchd / search / ...

4. Notes on building an incremental (delta) index: 1> the old delta index file is overwritten by the next delta build, so before that happens the old delta must be merged into the full index (or into a designated intermediate index, which is itself merged into the full index periodically); 2> each delta build must record a mark of how far it got, and the next build starts from that mark, and so on. If no mark is recorded, each delta build has more and more data to process, gets slower and slower, and the delta index eventually loses its whole point, which is timeliness.
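
The bookkeeping in point 4 can be sketched as follows. All names here are hypothetical (the `documents` table, its columns, and the helper class are illustrative, not Sphinx API):

```python
class DeltaMark(object):
    """Remembers where the last delta build stopped (point 4.2).
    In practice the mark would be persisted, e.g. in a counter table."""

    def __init__(self, start=0):
        self.max_indexed_id = start

    def delta_query(self, new_max_id):
        # Rows in (mark, new_max_id] belong to the next delta index.
        return ("SELECT id, title, body FROM documents "
                "WHERE id > %d AND id <= %d" % (self.max_indexed_id, new_max_id))

    def advance(self, new_max_id):
        # Called only after the delta was built AND merged into the full
        # index (point 4.1), so the next build does not redo the same rows.
        self.max_indexed_id = new_max_id
```

Without `advance()`, every delta build would reprocess everything since the original mark, which is exactly the slowdown described above.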

5. Chinese word segmentation: with the configuration below (taken from http://www.sphinxsearch.org/sphinx-tutorial), Sphinx can index Chinese character by character; if you want segmentation based on semantics, you need to install a Chinese word-segmentation plugin (for example sfc).

ngram_len = 1    # index non-letter (CJK) data as 1-grams
ngram_chars = U+4E00..U+9FBF, U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF, \
    U+2F800..U+2FA1F, U+2E80..U+2EFF, U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF, \
    U+3040..U+309F, U+30A0..U+30FF, U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF, \
    U+3130..U+318F, U+A000..U+A48F, U+A490..U+A4CF
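
To illustrate what `ngram_len = 1` means, here is a toy tokenizer that mimics the behavior: every CJK codepoint becomes its own token, while latin/digit runs stay whole words. This is a sketch of the idea, not Sphinx's actual tokenizer, and it only checks a few of the ranges listed in `ngram_chars`:

```python
def is_cjk(ch):
    # A small subset of the ngram_chars ranges above, for illustration.
    cp = ord(ch)
    return (0x4E00 <= cp <= 0x9FBF or 0x3400 <= cp <= 0x4DBF or
            0x3040 <= cp <= 0x30FF or 0xAC00 <= cp <= 0xD7AF)

def unigram_tokenize(text):
    tokens, word = [], []
    for ch in text:
        if is_cjk(ch):
            if word:
                tokens.append(''.join(word))
                word = []
            tokens.append(ch)       # each CJK char is a standalone token
        elif ch.isalnum():
            word.append(ch)         # latin/digit runs remain whole words
        else:
            if word:
                tokens.append(''.join(word))
                word = []
    if word:
        tokens.append(''.join(word))
    return tokens
```

So a mixed string like "sphinx中国search" indexes as the tokens sphinx / 中 / 国 / search, which is why single-character queries match without a real segmenter.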

6. Indexing JSON strings: today a business requirement came up to build a full-text index over JSON strings stored in a data table. There are two cases: JSON whose non-ASCII content is stored as UTF-8 bytes, and JSON whose non-ASCII content is stored as \u-escaped Unicode. The former is no different from our usual handling; for the latter, the search term must itself be converted to the \u-escaped form before searching.
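
For the second case, the conversion of the search term can be done with the standard json module; a minimal sketch (the helper name is mine):

```python
import json

def escape_term_for_escaped_json(term):
    # ensure_ascii=True turns non-ASCII characters into \uXXXX escapes;
    # strip the surrounding double quotes that dumps() adds.
    return json.dumps(term, ensure_ascii=True)[1:-1]

# e.g. escape_term_for_escaped_json("中国") yields the literal text \u4e2d\u56fd,
# which is what actually appears inside the stored \u-escaped JSON.
```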

What follows is my first hands-on attempt, as a newbie:

1. Installation (the Sphinx rpm package does not support a custom install path via --prefix: "error: package sphinx is not relocatable")

[dongsong@bogon ~]$ sudo rpm -i sphinx-2.0.4-1.rhel6.x86_64.rpm
Sphinx installed!
Now create a full-text index, start the search daemon, and you're all set.
To manage indexes:
    editor /etc/sphinx/sphinx.conf
To rebuild all disk indexes:
    sudo -u sphinx indexer --all --rotate
To start/stop search daemon:
    service searchd start/stop
To query search daemon using MySQL client:
    mysql -h 0 -P 9306
    mysql> SELECT * FROM test1 WHERE MATCH('test');
See the manual at /usr/share/doc/sphinx-2.0.4 for details.
For commercial support please contact Sphinx Technologies Inc at
http://sphinxsearch.com/contacts.html

You can symlink the local HTML manual into the Apache document root so the help pages can be browsed locally:

[dongsong@bogon python_study]$ sudo ln -sf /usr/share/doc/sphinx-2.0.4/sphinx.html ./sphinx.html
http://172.26.16.100/sphinx.html

2. Use

[root@bogon sphinx]# indexer --config /etc/sphinx/sphinx.conf spider
Sphinx 2.0.4-id64-release (r3135)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/etc/sphinx/sphinx.conf'...
indexing index 'spider'...
WARNING: attribute 'id' not found - IGNORING
WARNING: Attribute count is 0: switching to none docinfo
collected 20011 docs, 115.0 MB
sorted 5.4 Mhits, 100.0% done
total 20011 docs, 115049820 bytes
total 33.003 sec, 3486001 bytes/sec, 606.33 docs/sec
total 2 reads, 0.123 sec, 25973.1 kb/call avg, 61.9 msec/call avg
total 188 writes, 2.964 sec, 585.8 kb/call avg, 15.7 msec/call avg
[root@bogon sphinx]# search -c /etc/sphinx/sphinx.conf China
Sphinx 2.0.4-id64-release (r3135)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/etc/sphinx/sphinx.conf'...
index 'spider': query 'China': returned 150 matches of 150 total in 0.000 sec

displaying matches:
1. document=719806, weight=2654
2. document=1397236, weight=2654
3. document=3733569, weight=1729
4. document=13384, weight=1722
5. document=3563788, weight=1705
6. document=3742995, weight=1705
7. document=17777, weight=1698
8. document=3741757, weight=1698
9. document=3888109, weight=1698
10. document=2472909, weight=1689
11. document=3741705, weight=1689
12. document=2145250, weight=1676
13. document=2600863, weight=1676
14. document=3561074, weight=1676
15. document=3737639, weight=1676
16. document=3746591, weight=1676
17. document=3805049, weight=1676
18. document=1822, weight=1654
19. document=7755, weight=1654
20. document=13399, weight=1654

words:
1. 'China': 150 documents, 237 hits

3. Data can be found with the search command-line tool, but searching through the API returns an error (API search requires the searchd daemon; searchd reported no error while starting, yet nothing was listening on the configured port and no searchd process actually existed).

[dongsong@bogon api]$ vpython test.py -h localhost -p 9312 -i spider China
query failed: connection to localhost;9312 failed ([Errno 111] Connection refused)

Check /etc/sphinx/sphinx.conf for the location of searchd's log file:

searchd
{
        listen                  = 9312
        listen                  = 9306:mysql41
        log                     = /var/log/sphinx/searchd.log
        query_log               = /var/log/sphinx/query.log
        read_timeout            = 5
        max_children            = 30
        pid_file                = /var/run/sphinx/searchd.pid
        max_matches             = 1000
        seamless_rotate         = 1
        preopen_indexes         = 1
        unlink_old              = 1
        workers                 = threads # for RT to work
        binlog_path             = /var/data
}

Opening the log file /var/log/sphinx/searchd.log reveals the root cause of the problem:

[Fri Jun 15 10:28:44.583 2012] [ 7889] listening on all interfaces, port=9312
[Fri Jun 15 10:28:44.583 2012] [ 7889] listening on all interfaces, port=9306
[Fri Jun 15 10:28:44.585 2012] [ 7889] FATAL: failed to open '/var/data/binlog.lock': 2 'No such file or directory'
[Fri Jun 15 10:28:44.585 2012] [ 7888] Child process 7889 has been forked
[Fri Jun 15 10:28:44.585 2012] [ 7888] Child process 7889 has been finished, exit code 1. Watchdog finishes also. Good bye!
[Fri Jun 15 10:29:09.968 2012] [ 7905] Child process 7906 has been forked
[Fri Jun 15 10:29:09.970 2012] [ 7906] listening on all interfaces, port=9312
[Fri Jun 15 10:29:09.970 2012] [ 7906] listening on all interfaces, port=9306
[Fri Jun 15 10:29:09.987 2012] [ 7906] FATAL: failed to open '/var/data/binlog.lock': 2 'No such file or directory'
[Fri Jun 15 10:29:09.993 2012] [ 7905] Child process 7906 has been finished, exit code 1. Watchdog finishes also. Good bye!

Comment out the binlog_path setting, and the API search works:

[dongsong@bogon api]$ vpython test.py -h localhost -p 9312 -i spider China
Query 'China' retrieved 3 of 3 matches in 0.005 sec
Query stats:
        'China' found 4 times in 3 documents
Matches:
1. doc_id=5, weight=100
2. doc_id=80, weight=100
3. doc_id=2012, weight=100

4. For retrieving Chinese data, nothing can be found unless the following is set in the index section of the conf:

charset_table = U+FF10..U+FF19->0..9, 0..9, U+FF41..U+FF5A->a..z, U+FF21..U+FF3A->a..z, \
    A..Z->a..z, a..z, U+0149, U+017F, U+0138, U+00DF, U+00FF, U+00C0..U+00D6->U+00E0..U+00F6, \
    U+00E0..U+00F6, U+00D8..U+00DE->U+00F8..U+00FE, U+00F8..U+00FE, U+0100->U+0101, U+0101, \
    U+0102->U+0103, U+0103, U+0104->U+0105, U+0105, U+0106->U+0107, U+0107, U+0108->U+0109, \
    U+0109, U+010A->U+010B, U+010B, U+010C->U+010D, U+010D, U+010E->U+010F, U+010F, \
    U+0110->U+0111, U+0111, U+0112->U+0113, U+0113, U+0114->U+0115, U+0115, U+0116->U+0117, \
    U+0117, U+0118->U+0119, U+0119, U+011A->U+011B, U+011B, U+011C->U+011D, U+011D, \
    U+011E->U+011F, U+011F, U+0130->U+0131, U+0131, U+0132->U+0133, U+0133, U+0134->U+0135, \
    U+0135, U+0136->U+0137, U+0137, U+0139->U+013A, U+013A, U+013B->U+013C, U+013C, \
    U+013D->U+013E, U+013E, U+013F->U+0140, U+0140, U+0141->U+0142, U+0142, U+0143->U+0144, \
    U+0144, U+0145->U+0146, U+0146, U+0147->U+0148, U+0148, U+014A->U+014B, U+014B, \
    U+014C->U+014D, U+014D, U+014E->U+014F, U+014F, U+0150->U+0151, U+0151, U+0152->U+0153, \
    U+0153, U+0154->U+0155, U+0155, U+0156->U+0157, U+0157, U+0158->U+0159, U+0159, \
    U+015A->U+015B, U+015B, U+015C->U+015D, U+015D, U+015E->U+015F, U+015F, U+0160->U+0161, \
    U+0161, U+0162->U+0163, U+0163, U+0164->U+0165, U+0165, U+0166->U+0167, U+0167, \
    U+0168->U+0169, U+0169, U+016A->U+016B, U+016B, U+016C->U+016D, U+016D, U+016E->U+016F, \
    U+016F, U+0170->U+0171, U+0171, U+0172->U+0173, U+0173, U+0174->U+0175, U+0175, \
    U+0176->U+0177, U+0177, U+0178->U+00FF, U+00FF, U+0179->U+017A, U+017A, U+017B->U+017C, \
    U+017C, U+017D->U+017E, U+017E, U+0410..U+042F->U+0430..U+044F, U+0430..U+044F, \
    U+05D0..U+05EA, U+0531..U+0556->U+0561..U+0586, U+0561..U+0587, U+0621..U+063A, U+01B9, \
    U+01BF, U+0640..U+064A, U+0660..U+0669, U+066E, U+066F, U+0671..U+06D3, U+06F0..U+06FF, \
    U+0904..U+0939, U+0958..U+095F, U+0960..U+0963, U+0966..U+096F, U+097B..U+097F, \
    U+0985..U+09B9, U+09CE, U+09DC..U+09E3, U+09E6..U+09EF, U+0A05..U+0A39, U+0A59..U+0A5E, \
    U+0A66..U+0A6F, U+0A85..U+0AB9, U+0AE0..U+0AE3, U+0AE6..U+0AEF, U+0B05..U+0B39, \
    U+0B5C..U+0B61, U+0B66..U+0B6F, U+0B71, U+0B85..U+0BB9, U+0BE6..U+0BF2, U+0C05..U+0C39, \
    U+0C66..U+0C6F, U+0C85..U+0CB9, U+0CDE..U+0CE3, U+0CE6..U+0CEF, U+0D05..U+0D39, U+0D60, \
    U+0D61, U+0D66..U+0D6F, U+0D85..U+0DC6, U+1900..U+1938, U+1946..U+194F, U+A800..U+A805, \
    U+A807..U+A822, U+0386->U+03B1, U+03AC->U+03B1, U+0388->U+03B5, U+03AD->U+03B5, \
    U+0389->U+03B7, U+03AE->U+03B7, U+038A->U+03B9, U+0390->U+03B9, U+03AA->U+03B9, \
    U+03AF->U+03B9, U+03CA->U+03B9, U+038C->U+03BF, U+03CC->U+03BF, U+038E->U+03C5, \
    U+03AB->U+03C5, U+03B0->U+03C5, U+03CB->U+03C5, U+03CD->U+03C5, U+038F->U+03C9, \
    U+03CE->U+03C9, U+03C2->U+03C3, U+0391..U+03A1->U+03B1..U+03C1, \
    U+03A3..U+03A9->U+03C3..U+03C9, U+03B1..U+03C1, U+03C3..U+03C9, U+0E01..U+0E2E, \
    U+0E30..U+0E3A, U+0E40..U+0E45, U+0E47, U+0E50..U+0E59, U+A000..U+A48F, U+4E00..U+9FBF, \
    U+3400..U+4DBF, U+20000..U+2A6DF, U+F900..U+FAFF, U+2F800..U+2FA1F, U+2E80..U+2EFF, \
    U+2F00..U+2FDF, U+3100..U+312F, U+31A0..U+31BF, U+3040..U+309F, U+30A0..U+30FF, \
    U+31F0..U+31FF, U+AC00..U+D7AF, U+1100..U+11FF, U+3130..U+318F, U+A000..U+A48F, \
    U+A490..U+A4CF

And of course charset_type has to be set as well:

charset_type            = utf-8

5. Process incremental data

[root@bogon sphinx]# indexer --config /etc/sphinx/sphinx.conf spiderinc --rotate
Sphinx 2.0.4-id64-release (r3135)
Copyright (c) 2001-2012, Andrew Aksyonoff
Copyright (c) 2008-2012, Sphinx Technologies Inc (http://sphinxsearch.com)

using config file '/etc/sphinx/sphinx.conf'...
indexing index 'spiderinc'...
WARNING: attribute 'id' not found - IGNORING
WARNING: Attribute count is 0: switching to none docinfo
collected 17 docs, 0.1 MB
sorted 0.0 Mhits, 100.0% done
total 17 docs, 87216 bytes
total 0.060 sec, 1444643 bytes/sec, 281.58 docs/sec
total 2 reads, 0.000 sec, 23.4 kb/call avg, 0.0 msec/call avg
total 6 writes, 0.008 sec, 16.9 kb/call avg, 1.4 msec/call avg
rotating indices: succesfully sent SIGHUP to searchd (pid=10459).

6. For an index that is serving live queries (one already loaded by searchd), pass --rotate when calling indexer to rebuild it without interrupting service (a new index is built and then swapped in to replace the old one); without --rotate, the rebuild fails.

However, if indexer --rotate is used the very first time an index is built, it fails (rotation needs an older version of the index to replace).
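
Those two rules can be captured in a small helper when driving indexer from a script, e.g. a cron job. The function name and the first_build flag are my own invention, not part of Sphinx:

```python
def indexer_cmd(index, conf='/etc/sphinx/sphinx.conf', first_build=False):
    """Build the indexer command line: --rotate only makes sense when a
    previous version of the index exists for searchd to swap out."""
    cmd = ['indexer', '--config', conf, index]
    if not first_build:
        cmd.append('--rotate')  # hot-swap without interrupting searchd
    return cmd

# e.g. subprocess.call(indexer_cmd('spiderinc')) for the periodic delta rebuild
```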

7. Sample Code

# query
cl = SphinxClient()
cl.SetServer(host, port)
cl.SetWeights([100, 1])
cl.SetMatchMode(mode)
if filtervals:
    cl.SetFilter(filtercol, filtervals)
if groupby:
    cl.SetGroupBy(groupby, SPH_GROUPBY_ATTR, groupsort)
if sortby:
    cl.SetSortMode(SPH_SORT_ATTR_DESC, sortby)
#cl.SetLimits(offset, limit, limit + offset)  # for efficiency, enable this line to return as soon as enough matches are hit  --markbyxds
cl.SetLimits(offset, limit)
cl.SetConnectTimeout(60.0)
res = cl.Query(query, index)
if not res:
    return HttpResponse(json.dumps({'page': page, 'count': count, 'total': 0, 'datas': []}))

# fetch the actual data from the database
ids = [match['id'] for match in res['matches']]
rawdatas = RawData.objects.filter(id__in=ids).order_by('-create_time')
response = {'page': page, 'count': count, 'total': res['total'], 'datas': []}
response['datas'] = construct_response_data(rawdatas)

# build highlighted excerpts of the content
'''
cl.SetConnectTimeout(5.0)
bodydatas = [tmpdata['data'] for tmpdata in response['datas']]
try:
    excerpts = cl.BuildExcerpts(bodydatas, 'spider', query,
                                {'before_match': '<span style="color: red;">',
                                 'after_match': '</span>',
                                 'query_mode': mode})
except Exception, e:
    import pdb; pdb.set_trace()
    pass
listindex = 0
for excerpt in excerpts:
    response['datas'][listindex]['data'] = excerpt.decode('utf-8')
    listindex += 1
'''
cl.SetConnectTimeout(0.1)
for listindex in range(len(response['datas'])):
    tmpdata = response['datas'][listindex]['data']
    for retry in range(3):
        try:
            # with multiple indexes ('spider; spiderinc') this returns an empty list  --markbyxds
            excerpt = cl.BuildExcerpts([tmpdata], 'spider', query,
                                       {'before_match': '<span style="color: red;">',
                                        'after_match': '</span>',
                                        'query_mode': mode})
        except Exception, e:
            logging.error("%s: %s" % (type(e), str(e)))
            excerpt = None
            break
        else:
            if excerpt != None:
                response['datas'][listindex]['data'] = excerpt[0].decode('utf-8')
                break
            else:
                logging.warning('BuildExcerpts returned None (timeout is too small), retrying...')
                continue
    if excerpt == None:
        snippetlen = 1024
        response['datas'][listindex]['data'] = tmpdata[0:snippetlen]
        if len(tmpdata) > snippetlen:
            response['datas'][listindex]['data'] += '...'

# JSON string
jsonstr = json.dumps(response, ensure_ascii=False)
if isinstance(jsonstr, unicode):
    jsonstr = jsonstr.encode('utf-8')
return HttpResponse(jsonstr)
