How to install and use Sphinx-for-chinese under Windows

Source: Internet
Author: User

This guide uses Sphinx-for-chinese-2.2.1-dev-r4311-win32 as the example. It is the latest version I could find, released on 2013-11-09.
Download: http://sphinxsearchcn.github.io/


After downloading, extract the archive.

Copy the bin directory and all of its files to an installation directory of your choice, for example on drive D or E. I put it in a folder named sphinx-for-chinese on drive D (any English name will do).
Then create an etc folder in the sphinx-for-chinese directory, copy sphinx.conf.in into it, and rename the copy to sphinx.conf. My D:\sphinx-for-chinese directory now contains the bin and etc folders.


Next, install Sphinx-for-chinese as a Windows service so that it starts automatically when the system boots.
1. Open the Windows command prompt (cmd), found under Start menu > All Programs > Accessories, and remember to run it as administrator (right-click the icon and choose Run as administrator).


Enter the following command in the cmd window, adjusting the paths if your installation directory differs from mine.
D:\sphinx-for-chinese\bin\searchd --install --config D:\sphinx-for-chinese\etc\sphinx.conf --servicename SPHINX-CN

Note the --servicename SPHINX-CN part: the name after --servicename can be any English name you like.


The command to delete the service is: D:\sphinx-for-chinese\bin\searchd --delete --servicename SPHINX-CN
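
The install and delete commands differ only in their flags, so they can be generated from one place. The following Python sketch only builds the argument lists and executes nothing. The helper names and the default paths are assumptions taken from this walkthrough, not part of Sphinx-for-chinese.

```python
from pathlib import PureWindowsPath

# Assumed install location from this walkthrough; change it to match yours.
BASE = PureWindowsPath(r"D:\sphinx-for-chinese")


def install_service_cmd(service_name="SPHINX-CN", base=BASE):
    """Build the searchd command that registers the Windows service."""
    return [
        str(base / "bin" / "searchd"),
        "--install",
        "--config", str(base / "etc" / "sphinx.conf"),
        "--servicename", service_name,
    ]


def delete_service_cmd(service_name="SPHINX-CN", base=BASE):
    """Build the searchd command that removes the Windows service."""
    return [str(base / "bin" / "searchd"), "--delete", "--servicename", service_name]


if __name__ == "__main__":
    # Only print the commands here. To run one, pass the list to
    # subprocess.run(...) from an administrator prompt.
    print(" ".join(install_service_cmd()))
    print(" ".join(delete_service_cmd()))
```

Keeping the arguments in a list rather than a single string avoids quoting problems if the installation path ever contains spaces.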

2. Now check whether the service was installed successfully. On the desktop, right-click the "My Computer" or "Computer" icon and choose Manage from the menu that pops up. The Server Manager window opens.


In Server Manager, expand Configuration and click Services. A long list of services appears on the right.


Look in the list for a service named SPHINX-CN. If it is there, the installation succeeded, although that does not necessarily mean the service is running.

In my case the status column is empty, so the service is probably not running. Double-click the entry to open its properties.

The startup type is Automatic, but the service status is Stopped. Click the "Start" button to see whether it starts. If it does, congratulations. At this point mine would not start, because some configuration is still required.

In the sphinx-for-chinese directory, create a data folder and a log folder, which will be needed shortly. The names can be anything (English or pinyin), but the paths entered below must match them.
In the cmd window, run the following command and read its output to find out why the service cannot start.
D:\sphinx-for-chinese\bin\searchd --config D:\sphinx-for-chinese\etc\sphinx.conf

The output shows that a file or directory was not found while creating the PID file. Find @CONFDIR@/log/searchd.pid in sphinx.conf and change the line to
pid_file = D:/sphinx-for-chinese/log/searchd.pid
Then run the same command again (press the up arrow in the cmd window to recall the previous command).
D:\sphinx-for-chinese\bin\searchd --config D:\sphinx-for-chinese\etc\sphinx.conf

A new error appears. As before, search sphinx.conf for @CONFDIR@/log/searchd.log and change it.

Before: log = @CONFDIR@/log/searchd.log
After: log = D:/sphinx-for-chinese/log/searchd.log

Save the file and run the cmd command again. Repeat this fix-and-rerun cycle until searchd starts normally. The next change follows the same pattern, so it is listed without further comment.


Before: query_log = @CONFDIR@/log/query.log
After: query_log = D:/sphinx-for-chinese/log/query.log

Run the cmd command again. This time the output is different from before.

It looks normal, apart from a complaint that there is no index, which is dealt with later. First open Windows Task Manager and check whether the searchd.exe process is present.

searchd.exe is already running, so end the process in Task Manager and try starting the service instead.

In Server Manager, find SPHINX-CN and double-click it. In the dialog that pops up, click the "Start" button. The service should now start successfully.


The initial installation has now succeeded. The next step is to configure sphinx.conf.

Open sphinx.conf, replace every occurrence of @CONFDIR@/data/ with D:/sphinx-for-chinese/data/, and save the file.
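
The placeholder edits made by hand above (pid_file, log, query_log, and the data paths) can also be done in one pass. This is a minimal sketch, assuming the template token is the literal string @CONFDIR@ and the install directory is the one used in this walkthrough. The function name is hypothetical.

```python
def fill_placeholders(conf_text, install_dir="D:/sphinx-for-chinese", token="@CONFDIR@"):
    """Replace every occurrence of the template token with the install directory."""
    # The token is an assumption about sphinx.conf.in; pass another one if yours differs.
    return conf_text.replace(token, install_dir.rstrip("/"))


if __name__ == "__main__":
    # A few representative lines, not the full template.
    template = (
        "log = @CONFDIR@/log/searchd.log\n"
        "query_log = @CONFDIR@/log/query.log\n"
        "pid_file = @CONFDIR@/log/searchd.pid\n"
        "path = @CONFDIR@/data/test1\n"
    )
    print(fill_placeholders(template), end="")
```

The log and data folders still have to exist before searchd runs, since the rewrite only changes the paths in the text.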


Then stop the SPHINX-CN service and run searchd from cmd again.
D:\sphinx-for-chinese\bin\searchd --config D:\sphinx-for-chinese\etc\sphinx.conf
Check whether any error messages appear.

The only remaining complaint is that nothing has been indexed yet. End the searchd.exe process in Task Manager, then set up the data source in sphinx.conf. My own database table is used as the example here; change the values to match yours.

source src1: src1 can be renamed, but every other place that refers to src1 must then use the same name.
Configure the database information:
type = mysql
sql_host = localhost
sql_user = Test
sql_pass = 123456
sql_db = www.panshy.com
sql_port = 3306 # optional, default is 3306


Add the following settings yourself if they are not already in the configuration file.

Find sql_query_pre = SET NAMES utf8 and remove the leading # to uncomment it.
sql_query_info_pre = SET NAMES utf8
sql_query_info = SELECT * FROM www_panshy_com_ecms_pansharticle WHERE id=$id
sql_query = SELECT id, newstime AS date_added, title, newstext, titleurl, id AS msgid, classid, userid, username, username AS softtype, username AS filesize, 1983 AS dbtype FROM www_panshy_com_ecms_pansharticle
# The first column of sql_query (id) must be an integer
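
Sphinx expects the first column returned by sql_query to be a unique positive integer document ID. Before indexing, it can be worth checking a sample of rows against that rule. The sketch below works on rows already fetched as tuples. The function name and the sample rows are hypothetical, and no database access is shown.

```python
def check_document_ids(rows):
    """Return a list of problems found in the first column of each row."""
    problems = []
    seen = set()
    for row in rows:
        doc_id = row[0]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(doc_id, bool) or not isinstance(doc_id, int) or doc_id <= 0:
            problems.append(f"not a positive integer: {doc_id!r}")
        elif doc_id in seen:
            problems.append(f"duplicate id: {doc_id}")
        else:
            seen.add(doc_id)
    return problems


if __name__ == "__main__":
    # Hypothetical sample; real rows would come from the sql_query result.
    sample = [(1, "first"), (2, "second"), (2, "again"), ("3", "text id")]
    for problem in check_document_ids(sample):
        print(problem)
```

An empty result means the sampled rows satisfy the ID rule. It says nothing about rows that were not sampled.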

After saving the changes, run the following command in the cmd window to build the indexes.
D:\sphinx-for-chinese\bin\indexer.exe --config D:\sphinx-for-chinese\etc\sphinx.conf --all
If everything is normal, the indexer completes without errors.


Index files are generated under the data directory.



Finally, run D:\sphinx-for-chinese\bin\searchd --config D:\sphinx-for-chinese\etc\sphinx.conf in cmd again to confirm that searchd now starts normally.


The basic installation and configuration of Sphinx-for-chinese is now complete. The next step is to integrate Chinese word segmentation.
Download xdict_1.1.tar.gz (the download is available from the original link).

Extract it into the D:\sphinx-for-chinese\etc directory to get the file xdict_1.1.txt.


In the cmd window, run the following command to convert it.
D:\sphinx-for-chinese\bin\mkdict D:\sphinx-for-chinese\etc\xdict_1.1.txt D:\sphinx-for-chinese\etc\xdict

This produces an xdict file.



Modify the index section of sphinx.conf:
Find charset_type = sbcs and remove the line or comment it out.
Add the following two settings:
charset_type = utf-8
chinese_dictionary = D:/sphinx-for-chinese/etc/xdict
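
The same edit can be scripted. The following Python sketch comments out an active charset_type = sbcs line and adds the two settings right after it, keeping the line's indentation. The function name is hypothetical and the dictionary path is the one assumed in this walkthrough. If the file has no active charset_type = sbcs line, the text is returned unchanged and the two settings must be added by hand.

```python
import re


def enable_chinese(conf_text, dict_path="D:/sphinx-for-chinese/etc/xdict"):
    """Comment out 'charset_type = sbcs' and add the UTF-8 and dictionary settings."""
    # Match only an uncommented setting that fills the whole line.
    pattern = re.compile(r"^([ \t]*)charset_type[ \t]*=[ \t]*sbcs[ \t]*$", re.MULTILINE)

    def replace(match):
        indent = match.group(1)
        return (
            f"{indent}# charset_type = sbcs\n"
            f"{indent}charset_type = utf-8\n"
            f"{indent}chinese_dictionary = {dict_path}"
        )

    return pattern.sub(replace, conf_text)


if __name__ == "__main__":
    # A shortened, hypothetical index section for demonstration.
    section = "index test1\n{\n\tcharset_type = sbcs\n}\n"
    print(enable_chinese(section), end="")
```

The pattern assumes the text was read with normalized line endings, which is what Python does by default when a file is opened in text mode.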





This completes the Chinese support configuration.

If an error about index rt occurs, delete the index rt section from the configuration file.

Otherwise, Sphinx-for-chinese is used in the same way as the original English version of Sphinx; refer to the user manual on the official Sphinx website.

Original link: http://www.panshsoft.net/thread-3-1-1.html

http://www.panshy.com/articles/201608/dev-2752.html


The following is quoted from the official Sphinx-for-chinese documentation:

http://sphinxsearchcn.github.io/

3. Some precautions

Sphinx-for-chinese supports only UTF-8 encoding, so convert the data when the data source outputs it. With MySQL you generally need to add a "SET NAMES utf8" statement. Two points apply when using xmlpipe: first, use CDATA sections in the XML wherever possible so that special characters do not interfere with XML parsing; second, enable the xmlpipe_fixup_utf8=1 option in the Sphinx configuration to avoid, as far as possible, parsing errors caused by invalid UTF-8 strings.
To check whether Chinese word segmentation is enabled, use the search command, as in the following example (the query, a Chinese phrase in the original, is shown in translation):

./search -c ./etc/sphinx.conf share the wonderful side
Sphinx-for-chinese 2.1.0-dev (r3006)
Copyright (c) 2008-2011, sphinx-search.com

using config file './etc/sphinx.conf'...
index 'test1': query 'share the wonderful side': returned 0 matches of 0 total in 0.000 sec

words:
1. 'share': 6 documents, 7 hits
2. 'side': ... documents, ... hits
3. ''s': 5344 documents, 178743 hits
4. 'wonderful': 5 documents, 6 hits

Each Chinese word is listed under words, which shows that Chinese word segmentation was enabled successfully.

If the text is garbled, check whether the data source is encoded in UTF-8 and whether the strings passed to the API from your program are UTF-8. For a command-line test, check that the terminal environment is UTF-8. The Windows command-line environment uses GBK, so when testing from the Windows command line, pay attention to the encoding of the input data.
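
When input arrives in GBK, as it would from a GBK console, it has to be re-encoded before it reaches Sphinx. A minimal Python sketch of that conversion follows. The function name is hypothetical.

```python
def gbk_to_utf8(raw_bytes):
    """Decode GBK-encoded bytes and re-encode them as UTF-8."""
    return raw_bytes.decode("gbk").encode("utf-8")


if __name__ == "__main__":
    typed = "中文搜索".encode("gbk")   # bytes as a GBK console would produce them
    converted = gbk_to_utf8(typed)
    # The byte lengths differ because the two encodings use different widths.
    print(len(typed), len(converted))
```

Bytes that are not valid GBK raise UnicodeDecodeError, which is usually preferable to passing garbled text on to the index silently.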

If the data source is not MySQL but Oracle, plain text, or something else, it can be indexed through xmlpipe. Use a language that is quick to develop in, such as PHP, Python, Ruby, or Lua (C, C++, and others work too), to read the data source and output XML in the required format for Sphinx to read. In 99.9% of cases Sphinx can index any data this way, with no additional low-level processing required.
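
As an illustration of the xmlpipe approach, here is a minimal Python sketch that emits an xmlpipe2 document stream with the text wrapped in CDATA sections, as recommended in the precautions above. The field names title and content, the function names, and the sample row are assumptions for the example; a real script would read its rows from Oracle, text files, or another source.

```python
from xml.sax.saxutils import quoteattr


def cdata(text):
    """Wrap text in a CDATA section, splitting any ']]>' so the section stays valid."""
    return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>"


def build_docset(rows):
    """Build an xmlpipe2 document stream from (id, title, content) rows."""
    parts = [
        '<?xml version="1.0" encoding="utf-8"?>',
        "<sphinx:docset>",
        "<sphinx:schema>",
        '<sphinx:field name="title"/>',
        '<sphinx:field name="content"/>',
        "</sphinx:schema>",
    ]
    for doc_id, title, content in rows:
        # int() rejects ids that are not whole numbers.
        parts.append(f"<sphinx:document id={quoteattr(str(int(doc_id)))}>")
        parts.append(f"<title>{cdata(title)}</title>")
        parts.append(f"<content>{cdata(content)}</content>")
        parts.append("</sphinx:document>")
    parts.append("</sphinx:docset>")
    return "\n".join(parts) + "\n"


if __name__ == "__main__":
    # Hypothetical row for demonstration.
    rows = [(1, "Hello", "plain text with <tags> & symbols")]
    print(build_docset(rows), end="")
```

A source of type xmlpipe2 would then point its xmlpipe_command at a script like this one. Whatever the console encoding, the bytes the script writes must be UTF-8.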

4. Chinese search optimization

Full-text retrieval of Chinese generally requires Chinese word segmentation. Segmentation, commonly called tokenizing, splits a piece of text into tokens, and each token is then indexed in an inverted index. Unlike Latin-script languages such as English, where words are separated by spaces, Chinese has no explicit word delimiter, so an algorithm has to segment the characters into words, and the accuracy of that segmentation affects the quality of Chinese search. For example, if "研究生命起源" ("study the origin of life") is split into "研究生" (postgraduate), "命" (fate), and "起源" (origin), then a search for "研究" (study) finds nothing, and neither does a search for "生命" (life). If it is split into "研究", "生命", and "起源", both "研究" and "生命" are searchable. Similarly, if "上海市" (Shanghai City) is split into "上海" (Shanghai) and "市" (City), a search for "上海市" finds nothing.
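
The "study the origin of life" ambiguity described above (研究生命起源 in Chinese) can be reproduced with forward maximum matching, one common dictionary-based segmentation algorithm. This is an illustrative sketch with toy dictionaries. It is not claimed to be the algorithm Sphinx-for-chinese itself uses.

```python
def forward_max_match(text, dictionary):
    """Greedy dictionary segmentation: always take the longest word that matches."""
    max_len = max([len(word) for word in dictionary] + [1])
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, falling back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


if __name__ == "__main__":
    text = "研究生命起源"
    # Toy dictionaries: the only difference is the word for "postgraduate".
    with_postgraduate = {"研究", "研究生", "生命", "起源"}
    without_postgraduate = {"研究", "生命", "起源"}
    print(forward_max_match(text, with_postgraduate))
    print(forward_max_match(text, without_postgraduate))
```

With the first dictionary the greedy match consumes the three-character word first and leaves a stray single character behind. With the second it yields the three intended words. This is why tuning the dictionary changes what is searchable.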

To improve search results, you can generally:

Improve the accuracy of word segmentation. This depends mainly on the segmentation algorithm, but the dictionary-based algorithms in common use today do not differ much in accuracy (certainly not by an order of magnitude). The dictionary can also be tuned: for a pharmaceutical site, for example, you can add a medical vocabulary to the dictionary. Optimizing the dictionary for a specific industry in this way also improves segmentation. For vocabulary sources, the Sogou cell thesaurus is a common reference.

Handle synonyms and near-synonyms. This covers pairs of words with the same or nearly the same meaning, such as two different words for "restaurant" or two names for "Shanghai". Some segmentation algorithms used for indexing emit multiple overlapping tokens, for example splitting "上海市" (Shanghai City) into "上海市", "上海", and "市", so that both "上海市" and "上海" can be found. However, this increases the number of tokens and the size of the index, which hurts search efficiency, and it is very hard to control the granularity of such multi-way splitting by algorithm alone. Another practice is to build a semantic thesaurus into the full-text search engine, which makes the search program more complex and harder to upgrade and tune. The recommended approach is to handle synonyms and near-synonyms at the periphery of the search: process and rewrite the user's search input, making use of Sphinx's search syntax. In practice, you can compile a thesaurus of synonyms and near-synonyms, store it in an in-memory database, and expose it as a daemon or web service that preprocesses the user's search input. This is cheap and quick to develop, highly modular, easy to adjust, and easy to upgrade.
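
The query-rewriting approach recommended above can be sketched in a few lines of Python. Each term that has known synonyms is replaced by an OR group using the | operator of Sphinx's extended query syntax. The function name and the synonym groups are hypothetical, and the sketch assumes the input has already been split into whitespace-separated terms.

```python
def expand_query(query, synonym_groups):
    """Rewrite each term that has synonyms as an OR group in extended query syntax."""
    lookup = {}
    for group in synonym_groups:
        for word in group:
            lookup[word] = group
    expanded = []
    for term in query.split():
        group = lookup.get(term)
        if group:
            expanded.append("(" + " | ".join(group) + ")")
        else:
            expanded.append(term)
    return " ".join(expanded)


if __name__ == "__main__":
    # Hypothetical thesaurus; a real one would live in an in-memory store.
    groups = [["restaurant", "eatery"]]
    print(expand_query("cheap restaurant", groups))
```

The rewritten string only has the intended effect when the search runs in a match mode that understands the extended syntax. A production version would also need to escape operator characters that appear in user input.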
5. Search performance optimization and building a high-availability, fault-tolerant cluster

When the index data or the traffic grows too large, you can:

Partition the index data, much like splitting tables in a database. The data is partitioned horizontally or vertically, the application is adjusted accordingly, and different search requests are routed to different indexes. The idea is to keep each individual index as small as possible, which reduces the amount of data scanned per request and improves response time.

Use a sensible update strategy. Depending on the update frequency, a two-layer main+delta scheme or a three-layer main+today+delta scheme reduces the update load and speeds up index updates. The right choice usually requires case-by-case analysis.

Use distributed processing: split the data horizontally and spread it across multiple machines. See http://sphinxsearch.com/docs/2.0.2/distributed.html.

Add high availability and fault tolerance. One method is replication: one machine acts as the master, which is responsible for index updates and does not accept external requests, while several other machines run Sphinx instances as slaves that accept external requests. The master pushes updated index data to the slaves via inotify and rsync, and external requests are distributed across the slaves by a load-balancing algorithm, which also provides fault tolerance. HAProxy and VRRP can also be used to build a highly available, fault-tolerant cluster. Another method is to use Sphinx's own distributed processing combined with heartbeat or VRRP for fault tolerance.
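
The main+delta update strategy listed above comes down to remembering the highest document ID covered by the last full rebuild and letting the delta index cover only newer IDs. The Python sketch below simulates that bookkeeping in memory. The class and method names are hypothetical; in a real setup the counter is typically kept in a database table and the two sources select rows on either side of it.

```python
class MainDeltaPlan:
    """Track which document ids belong to the main index and which to delta."""

    def __init__(self):
        self.max_main_id = 0  # highest id included in the last main rebuild

    def rebuild_main(self, all_ids):
        """Full (infrequent) rebuild: main now covers every existing id."""
        self.max_main_id = max(all_ids, default=0)
        return sorted(all_ids)

    def rebuild_delta(self, all_ids):
        """Cheap (frequent) rebuild: delta covers only ids added since then."""
        return sorted(i for i in all_ids if i > self.max_main_id)


if __name__ == "__main__":
    plan = MainDeltaPlan()
    plan.rebuild_main([1, 2, 3])             # nightly full rebuild
    print(plan.rebuild_delta([1, 2, 3, 4, 5]))  # later, only the new rows
```

This sketch only accounts for newly added rows. Rows that are edited or deleted after the main rebuild need separate handling.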
