Sphinx incremental indexing for near real-time updates

I. Setting up the Sphinx incremental index
The data in the database is large and new rows keep being added, and you want those new rows to be searchable. Rebuilding the entire index every time is very resource-intensive, while the bulk of the data rarely needs updating: for example, the original data may be millions of rows and the new data only a few thousand. The "main index + delta index" pattern makes near real-time updates possible in this situation.

The basic idea of this pattern is to set up two data sources and two indexes: a main index for the data that does not change, and a delta (incremental) index for the newly added data. The main index can be rebuilt infrequently (for example, once a day at midnight), while the delta index can be rebuilt very frequently (every few minutes or so). When a user searches, both indexes are queried at the same time.

A simple way to implement the main + delta pattern is to add a counter table to the database. Each time the main index is rebuilt, this table records the largest document ID of the indexed table; the delta index then only indexes the rows whose ID is greater than that value, and the counter is updated again at the next rebuild of the main index.

Test setup: the default sphinx.conf is used as the example configuration, and the database tables come from the bundled example.sql.

1. First create a counter table in MySQL (the tables to be indexed come from example.sql)

CREATE TABLE sph_counter (counter_id INTEGER PRIMARY KEY NOT NULL, max_doc_id INTEGER NOT NULL);
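For reference, the documents table assumed by the queries below is the one created by the stock example.sql, roughly as follows (check your own example.sql for the exact definition):

CREATE TABLE documents (
    id         INTEGER PRIMARY KEY NOT NULL AUTO_INCREMENT,
    group_id   INTEGER NOT NULL,
    group_id2  INTEGER NOT NULL,
    date_added DATETIME NOT NULL,
    title      VARCHAR(255) NOT NULL,
    content    TEXT NOT NULL
);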

2. Modify sphinx.conf

source main_src
{
    type     = mysql

    sql_host = localhost
    sql_user = yourusername
    sql_pass = yourpassword
    sql_db   = test        # the database you are using
    sql_port = 3306        # port used, default is 3306

    sql_query_pre = SET NAMES utf8
    sql_query_pre = SET SESSION query_cache_type=OFF
    # the following statement updates max_doc_id in the sph_counter table
    sql_query_pre = REPLACE INTO sph_counter SELECT 1, MAX(id) FROM documents

    sql_query = SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content \
                FROM documents \
                WHERE id <= (SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
}

Note: the number of sql_query_pre statements in delta_src must correspond to main_src, otherwise some results may not be found.

source delta_src : main_src
{
    sql_ranged_throttle = 100

    sql_query_pre = SET NAMES utf8
    sql_query_pre = SET SESSION query_cache_type=OFF

    sql_query = SELECT id, group_id, UNIX_TIMESTAMP(date_added) AS date_added, title, content \
                FROM documents \
                WHERE id > (SELECT max_doc_id FROM sph_counter WHERE counter_id=1)
}

index main    # main index
{
    source = main_src
    path   = /path/to/main
    # example: /usr/local/sphinx/var/data/main

    charset_type       = utf-8                          # must be set to support Chinese
    chinese_dictionary = /usr/local/sphinx/etc/xdict    # other options can keep their defaults
}

The delta index can inherit all the settings of the main index; only the source and path need to be changed, as follows:

index delta : main    # incremental index
{
    source = delta_src
    path   = /path/to/delta
    # example: /usr/local/sphinx/var/data/delta
}

The other configuration options can keep their defaults. If you have set up a distributed index for searching, change the corresponding index names there as well.
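For reference, a minimal distributed index that searches both the main and delta indexes could look like the following sketch (the index name dist is just an illustration, not part of the original configuration):

index dist
{
    type  = distributed
    local = main
    local = delta
}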

3. Rebuild the indexes:
If searchd is running, stop it first, then build all indexes defined in the sphinx.conf configuration file, and finally start the service again:

/usr/local/sphinx/bin/searchd --stop
/usr/local/sphinx/bin/indexer -c /usr/local/sphinx/etc/sphinx.conf --all
/usr/local/sphinx/bin/searchd -c /usr/local/sphinx/etc/sphinx.conf

P.S. /usr/local/sphinx/bin/indexer -c /usr/local/sphinx/etc/sphinx.conf --all --rotate

With --rotate there is no need to stop searchd before indexing, and no need to restart it afterwards.

To test whether the incremental index works, insert a row into the database table and check whether it can be retrieved; at this point the search should still come back empty. Then rebuild the delta index on its own:
/usr/local/sphinx/bin/indexer -c /usr/local/sphinx/etc/sphinx.conf delta
and check whether the new record has been indexed. If it has, searching with the /usr/local/sphinx/bin/search tool should return 0 results from the main index and the matching results from the delta index. The precondition, of course, is that the search term only occurs in the data that was inserted later.
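A minimal test run might look like this (table and column names follow example.sql, and the keyword is just an illustration):

# insert a test row containing a keyword that does not occur in the existing data
mysql -u yourusername -p test -e "INSERT INTO documents (group_id, group_id2, date_added, title, content) VALUES (1, 1, NOW(), 'delta test', 'uniquekeyword123')"

# rebuild only the delta index (add --rotate if searchd is running)
/usr/local/sphinx/bin/indexer -c /usr/local/sphinx/etc/sphinx.conf delta --rotate

# the keyword should match in delta but not in main
/usr/local/sphinx/bin/search -c /usr/local/sphinx/etc/sphinx.conf --index main  uniquekeyword123
/usr/local/sphinx/bin/search -c /usr/local/sphinx/etc/sphinx.conf --index delta uniquekeyword123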

The next question is how to merge the incremental index into the main index.

4. Index merging
Merging two existing indexes is sometimes more efficient than re-indexing all the data. However, when indexes are merged, both indexes are read into memory once and the merged result is written to disk once; merging a 100 GB index with a 1 GB index therefore causes roughly 202 GB of I/O.
Command prototype: indexer --merge DSTINDEX SRCINDEX [--rotate] merges SRCINDEX into DSTINDEX, so only DSTINDEX changes. If both indexes are serving queries, the --rotate parameter is required. For example, to merge delta into main:
indexer --merge main delta
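With the paths used elsewhere in this article, and with both indexes serving queries, the full command would look roughly like this:

/usr/local/sphinx/bin/indexer --merge main delta --rotate -c /usr/local/sphinx/etc/sphinx.conf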

5. Automatic index update
This requires a couple of shell scripts.
Create two scripts: Build_main_index.sh and build_delta_index.sh.

build_main_index.sh:
#!/bin/sh
# stop a running searchd
/usr/local/sphinx/bin/searchd -c /usr/local/sphinx/etc/mersphinx.conf --stop >> /usr/local/sphinx/var/log/sphinx/searchd.log
# build the main index
/usr/local/sphinx/bin/indexer -c /usr/local/sphinx/etc/mersphinx.conf main >> /usr/local/sphinx/var/log/sphinx/mainindex.log
# start the searchd daemon
/usr/local/sphinx/bin/searchd >> /usr/local/sphinx/var/log/sphinx/searchd.log

build_delta_index.sh:

#!/bin/sh
# stop the sphinx service, redirecting output to the log
/usr/local/sphinx/bin/searchd --stop >> /usr/local/sphinx/var/log/sphinx/searchd.log
# rebuild the delta index, redirecting output
/usr/local/sphinx/bin/indexer -c /usr/local/sphinx/etc/sphinx.conf delta >> /usr/local/sphinx/var/log/sphinx/deltaindex.log
# merge delta into main
/usr/local/sphinx/bin/indexer --merge main delta -c /usr/local/sphinx/etc/sphinx.conf >> /usr/local/sphinx/var/log/sphinx/deltaindex.log
# start the service
/usr/local/sphinx/bin/searchd >> /usr/local/sphinx/var/log/sphinx/searchd.log

After the scripts are written, make them executable with chmod +x so they can be run:
chmod +x build_main_index.sh
chmod +x build_delta_index.sh

Finally, the scripts need to run automatically: the delta index is rebuilt every 30 minutes, and the main index is rebuilt only once a day, at 2:30 in the morning.

This is done with the crontab command (see the crontab man page and the crontab file format for reference).
Run crontab -e to edit the crontab file (it is an empty file if crontab has not been used before) and add the following two lines:
*/30 * * * * /bin/sh /usr/local/sphinx/etc/build_delta_index.sh > /dev/null 2>&1
30 2 * * * /bin/sh /usr/local/sphinx/etc/build_main_index.sh > /dev/null 2>&1

The first line runs the build_delta_index.sh script under /usr/local/sphinx/etc/ every 30 minutes and redirects its output. The second line runs build_main_index.sh under the same directory every day at 2:30 and likewise redirects its output. The meaning of the five leading fields and of the output redirection is described in the crontab documentation.

After saving, restart the cron service:

service crond stop
service crond start
or
/etc/init.d/crond start

So far, if the scripts have no problems, build_delta_index.sh will run every 30 minutes and build_main_index.sh will run at 2:30.

To verify this, check whether new entries appear in the files the scripts redirect their output to; you can also look at searchd.log under /usr/local/sphinx/var/log, which records every index rebuild.
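For example, a quick way to watch the logs as the cron jobs fire (paths taken from the scripts above):

tail -f /usr/local/sphinx/var/log/sphinx/deltaindex.log /usr/local/sphinx/var/log/sphinx/searchd.log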

Summary
1. Index merging: as explained above, merging two indexes reads both in and then writes the result back to disk, so the I/O cost is large. In the PHP API, the $index argument of Query($query, $index) can name several indexes, e.g. Query($query, "main;delta"), so it is not strictly necessary to merge the two indexes, or at least the merge can be run much less often; see the sketch after this list.
2. Another option, not tried here, is to store the incremental index in shared memory (/dev/shm) to improve indexing performance and reduce system load.
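As a minimal sketch of the multi-index query mentioned in point 1, assuming the standard sphinxapi.php client that ships with Sphinx and the default searchd port (adjust host, port and keyword to your setup):

<?php
require_once 'sphinxapi.php';          // PHP client bundled with the Sphinx sources

$cl = new SphinxClient();
$cl->SetServer('localhost', 9312);     // host/port that searchd listens on

// query main and delta together instead of merging them
$result = $cl->Query('your keyword', 'main;delta');

if ($result === false) {
    echo 'Query failed: ' . $cl->GetLastError() . "\n";
} else {
    echo 'Total found: ' . $result['total_found'] . "\n";
}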
About the PHP API: how to search successfully from a PHP page.
First, searchd must be running on the server.
Then, modify test.php accordingly.
When running it, the connection failed with errno = 13 (permission denied). Searching English-language pages showed that this is caused by SELinux, and no better solution was found than turning enforcement off with the setenforce command:
setenforce 1 sets SELinux to enforcing mode
setenforce 0 sets SELinux to permissive mode (the one needed here)
