[Introduction and Principle Exploration of Search Engines] Sphinx


what/What is Sphinx

Sphinx is a full-text search engine.

Characteristics
    • Excellent indexing and performance
    • Easy integration of SQL and XML data sources, searchable through the SphinxAPI, SphinxQL, or SphinxSE interfaces
    • Easy to scale with distributed search
    • High-speed indexing (peak speed of up to 15 MB/s on modern CPUs)
    • High-performance searching (on 1.2 GB of text across 1 million documents, it can serve 150-250 queries per second)
why/Why use Sphinx

An encountered usage scenario

We ran into a requirement like this: users should be able to search articles by both title and body, but the titles and the bodies are stored in different databases, which moreover live in different data centers.

Available options

A. Implement a cross-database LIKE query directly in the database

Pros: simple to implement

Cons: inefficient, and it incurs heavy network overhead

B. Use Sphinx together with Chinese word segmentation as the search engine

Pros: efficient and highly scalable

Cons: it is not responsible for storing the data

With the Sphinx approach, the data is loaded once when the index is built and kept in memory afterwards, so user searches are served directly from the Sphinx server without touching the databases. Sphinx also lacks the disk I/O weaknesses that accompany MySQL, so performance is better.

Other typical usage scenarios

1. Fast, efficient, scalable core full-text search

    • When the data volume is large, it searches faster than MyISAM or InnoDB.
    • It can build indexes over mixed data drawn from multiple source tables, not just the fields of a single table.
    • It can integrate search results from multiple indexes.
    • Full-text searches can be constrained by additional conditions on attributes.

2. Efficient use of WHERE clauses and LIMIT

When a SELECT query has multiple WHERE conditions on columns whose indexes have poor selectivity, or no index at all, Sphinx can still index the keywords. The difference is that in MySQL the internal engine decides whether to use an index or a full scan, whereas Sphinx lets you choose the access method yourself. Because Sphinx keeps attribute data in RAM, it performs very few I/O operations. MySQL, by contrast, suffers what amounts to semi-random disk I/O: it reads rows one at a time into the sort buffer, sorts them, and then discards most of them. Sphinx therefore uses less memory and disk I/O.
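For illustration, here is a minimal sketch using the Python sphinxapi client that ships with the Sphinx/Coreseek sources; the index name (test_index) and attribute (add_time) follow the sample configuration later in this article, and the timestamp bounds are made up:

    import sphinxapi  # the Python API bundled with the Sphinx/Coreseek sources

    client = sphinxapi.SphinxClient()
    client.SetServer('127.0.0.1', 9312)                         # searchd's API port
    client.SetFilterRange('add_time', 1500000000, 1600000000)   # the WHERE analogue: filter on an attribute range
    client.SetLimits(0, 20)                                     # the LIMIT analogue: offset 0, at most 20 matches
    result = client.Query('keyword', 'test_index')              # full-text match plus the filters above
    if result:
        for match in result['matches']:
            print(match['id'], match['attrs'])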

3. Optimize GROUP BY query

Sorting and grouping in Sphinx are done in a fixed amount of memory, which is somewhat more efficient than a MySQL query over a comparable dataset that fits entirely in RAM.
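As a sketch (same assumed names as above), grouping is requested through the API and performed inside searchd's fixed memory; each returned match then carries the virtual attributes @groupby and @count:

    import sphinxapi

    client = sphinxapi.SphinxClient()
    client.SetServer('127.0.0.1', 9312)
    # group matches by day, derived from the add_time timestamp attribute,
    # ordering the groups by the number of matches in each
    client.SetGroupBy('add_time', sphinxapi.SPH_GROUPBY_DAY, '@count desc')
    result = client.Query('keyword', 'test_index')
    if result:
        for match in result['matches']:
            print(match['attrs']['@groupby'], match['attrs']['@count'])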

4. Produce result sets in parallel

Sphinx lets you produce several result sets from the same underlying data simultaneously, again using a fixed amount of memory. The traditional SQL approach either runs two queries or creates a temporary table for each result set. Sphinx instead uses a multi-query mechanism: rather than issuing queries one after another, several queries are packed into a batch and submitted in a single request.
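A sketch of this batched interface (assumed names as before): AddQuery queues a query, snapshotting the current settings, and RunQueries submits the whole batch in one network round trip:

    import sphinxapi

    client = sphinxapi.SphinxClient()
    client.SetServer('127.0.0.1', 9312)
    client.AddQuery('keyword', 'test_index')   # result set 1: a plain match
    client.SetGroupBy('add_time', sphinxapi.SPH_GROUPBY_DAY)
    client.AddQuery('keyword', 'test_index')   # result set 2: the same data, grouped by day
    results = client.RunQueries()              # one request, two result sets
    if results:
        for r in results:
            print(r['total_found'], 'matches')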

5. Scale up and scale out

    • Scale up: more CPUs/cores, faster disk I/O
    • Scale out: more machines, i.e. distributed Sphinx

6. Aggregating sharded data

Sphinx is well suited to aggregating data that is distributed across different physical MySQL servers.
An example: consider a 1 TB table holding a billion articles, sharded across 10 MySQL servers by user ID. A query scoped to a single user is of course fast, but a feature like a paginated archive showing the articles published by all of a user's friends has to visit several MySQL servers, which is slow. With Sphinx, you only need to create a few instances, map out the frequently accessed article attributes in each shard, and paginated queries become possible, with roughly three lines of configuration; see the sketch below.
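As a sketch of what this looks like in sphinx.conf (the host and index names here are invented for illustration), a distributed index simply lists local and remote shard indexes, and searchd fans each query out to them and merges the results:

    index dist_articles
    {
        type   = distributed
        local  = articles_shard0                          # a shard on this machine
        agent  = shard1.example.com:9312:articles_shard1  # remote searchd instances
        agent  = shard2.example.com:9312:articles_shard2
    }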

The working principle of Sphinx is introduced next. There are already plenty of articles online about installation, which I will not repeat; instead we go on to explain Sphinx's configuration files and get Sphinx up and running.

how/How to use Sphinx

Sphinx workflow flowchart

(figure: Sphinx workflow: Database -> Indexer -> searchd -> APP)

Flowchart explanation

Database: the data source from which Sphinx builds its index. Because Sphinx is independent of any particular storage engine or database, the data source can be MySQL, PostgreSQL, XML, and so on.

Indexer: the indexing program. It fetches data from the data source and builds a full-text index over it. Indexer can be run periodically to keep the indexes up to date, as required.

searchd: talks directly to the client program and uses the indexes built by the indexer program to answer search queries quickly.

APP: the client program. It receives search strings from user input, sends queries to the searchd program, and displays the returned results.

How Sphinx works

Sphinx's entire workflow is: pull the data out of the database, let indexer segment it, build one or more indexes from the resulting tokens, and hand them over to the searchd program. The client can then search through the API.
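A minimal client-side sketch of this flow, assuming indexer has already built test_index and searchd is running (Python sphinxapi; names follow the sample configuration below):

    import sphinxapi

    client = sphinxapi.SphinxClient()
    client.SetServer('127.0.0.1', 9312)   # talk to searchd, not to MySQL
    result = client.Query('search words', 'test_index')
    if not result:
        print('query failed:', client.GetLastError())
    else:
        print(result['total_found'], 'documents matched')
        for match in result['matches']:
            # attrs carries the sql_attr_* values configured for the index
            print(match['id'], match['weight'], match['attrs'])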

Having introduced how Sphinx works and how it is configured, we continue with how the indexer program actually builds an index in Sphinx.

Sphinx reads data from the database according to the configuration file and passes it to the indexer program; indexer then reads the records one by one and indexes each record according to the word-segmentation algorithm, which can be unigram segmentation or mmseg segmentation. The following sections first describe the data structures and algorithms indexer uses when building the index.

Data Source Configuration

Let's take a look at a sample configuration file for a data source:

    source test
    {
        type                = mysql

        sql_host            = 127.0.0.1
        sql_user            = root
        sql_pass            = root
        sql_db              = test
        sql_port            = 3306  # optional, default is 3306

        sql_query_pre       = SET NAMES utf8
        sql_query           = SELECT id, name, add_time FROM tbl_test

        sql_attr_timestamp  = add_time

        sql_query_info_pre  = SET NAMES utf8
        sql_query_info      = SELECT * FROM tbl_test WHERE id=$id
    }

In this configuration:

source is followed by the name of the data source, which the index definition refers to later;

type: the data source type, which can be mysql, pgsql (PostgreSQL), xmlpipe2, and so on;

sql_host, sql_user, sql_pass, sql_db, and sql_port are the connection and authentication settings for the database;

sql_query_pre: a statement executed before the main query, used here to set the encoding;

sql_query: the core statement of the data source configuration; Sphinx uses it to pull the data out of the database;

sql_attr_*: indexed attributes, i.e. additional values attached to each document that can be used for filtering and sorting at search time. Once attributes are set, Sphinx returns them with the matches when the search API is called;

sql_query_info_pre: sets the encoding for command-line queries; if results appear garbled when debugging on the command line, set this;

sql_query_info: sets the record information printed during command-line debugging.

Index configuration

A sample index definition:

    index test_index
    {
        source            = test
        path              = /usr/local/coreseek/var/data/test
        docinfo           = extern
        charset_dictpath  = /usr/local/mmseg3/etc/
        charset_type      = zh_cn.utf-8
        ngram_len         = 1
        ngram_chars       = U+3000..U+2FA1F
    }

In this configuration:

index is followed by the index name, here test_index;

source: the name of the data source defined above;

path: the basename for the index files, which indexer uses as the prefix of the generated file names. For example, the attribute storage ends up in /usr/local/sphinx/data/test1.spa, and so on;

docinfo: the storage mode for the indexed documents' attribute values;

charset_dictpath: the directory holding the dictionary used for Chinese word segmentation; a uni.lib dictionary file must exist in that directory;

charset_type: the data encoding type;

ngram_len: the segmentation length;

ngram_chars: the character set recognized by the unigram segmentation mode.

Core Chinese word-segmentation configuration

Unigram segmentation:

    charset_type = utf-8
    ngram_len    = 1
    ngram_chars  = U+3000..U+2FA1F

mmseg segmentation:

    charset_type     = utf-8
    charset_dictpath = /usr/local/mmseg3/etc/
    ngram_len        = 0
Running the sample

Database data (screenshot: sample records in the tbl_test table)

Building the index with the indexer program (screenshot: indexer output)

Querying (screenshot: command-line query results)

As you can see, the add_time attribute set in the configuration file is returned with the matches (marked 1 in the screenshot), and the record details defined by sql_query_info are shown as well (marked 2).

Sphinx's configuration is not very flexible; here each part has been configured following the workflow, and more advanced options can be consulted in the documentation as you use it.

Inverted index

An inverted index is a data structure used in full-text search that stores, for each word, a mapping to its locations in a document or a set of documents. It is the most commonly used data structure in document retrieval systems.

Inverted index: a specific storage form that implements the word-document matrix; through an inverted index, the list of documents containing a given word can be obtained quickly.

A traditional (forward) index maps index ID -> document content, while an inverted index maps word (obtained by segmenting the document content) -> index ID. You can understand it by analogy with forward and reverse proxies: a forward proxy forwards internal requests to the outside, while a reverse proxy forwards external requests to the inside. In that sense, "transposed index" would arguably be the more fitting name.

The inverted index consists mainly of two parts: the word dictionary and the inverted file.

The word dictionary is an important part of the inverted index. It maintains information about every word that appears in the document collection, and for each word it records the position of its posting list in the inverted file. To serve a search, the engine looks the user's query words up in the word dictionary to obtain the corresponding posting lists, which then form the basis for subsequent ranking.

A large document collection may contain hundreds of thousands or even millions of distinct words, and how quickly a given word can be located directly affects search response time, so efficient data structures are needed for building and probing the word dictionary; commonly used structures include hash tables with chained lists and tree-shaped dictionary structures.
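As a toy illustration (this is not Sphinx's actual on-disk format), the two parts map naturally onto a dictionary whose keys act as the word dictionary and whose values are the posting lists:

    from collections import defaultdict

    def build_inverted_index(docs):
        # docs: {doc_id: [word, ...]} -- documents that have already been segmented
        index = defaultdict(set)
        for doc_id, words in docs.items():
            for word in words:
                index[word].add(doc_id)
        # the keys form the word dictionary; each value is that word's posting list
        return {word: sorted(ids) for word, ids in index.items()}

    docs = {1: ['full', 'text', 'search'], 2: ['text', 'index']}
    index = build_inverted_index(docs)
    print(index['text'])   # [1, 2]: every document containing 'text'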

Inverted Index Basics
    • Document: general-purpose search engines process Internet web pages, but the concept of a document is broader: it denotes any storage object that exists in text form. Compared with a web page it covers many more forms; files in formats such as Word, PDF, HTML, or XML can all be called documents, and so can an e-mail, a text message, or a microblog post. In the rest of this article, "document" is often used to stand for textual information.
    • Document collection: a set of documents is called a document collection. A huge number of Internet pages, or a large number of e-mails, are concrete examples of document collections.
    • Document ID: inside the search engine, each document in the collection is assigned a unique internal number, which serves as its unique identifier and makes internal processing easier; this internal number is called the document ID, written DocID below.
    • Word ID: similarly, the search engine represents each word by a unique internal number, which serves as that word's unique identifier.

The indexer program segments the fetched records according to the configured word-segmentation algorithm and then stores the result using an inverted index as the data structure.

Word-segmentation algorithms: unigram segmentation

Core configuration for unigram segmentation

    charset_type = zh_cn.utf-8
    ngram_len    = 1
    ngram_chars  = U+4E00..U+9FBF

ngram_len is the segmentation length.

ngram_chars identifies the character set to be split under the unigram segmentation mode.

The segmentation algorithm Sphinx supports natively is unigram segmentation, which indexes every single character of each record. Its advantage is high coverage: every record is guaranteed to be searchable. Its disadvantage is that it generates a large index file and consumes a lot of resources when the index is updated. So unless there is a special need, or the data set is particularly small, unigram segmentation is not recommended.
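A sketch of what unigram segmentation amounts to; the character range mirrors the ngram_chars setting above:

    def unigram_tokens(text):
        # every CJK character in the configured range becomes its own token,
        # which guarantees coverage but inflates the index
        return [ch for ch in text if '\u4e00' <= ch <= '\u9fbf']

    print(unigram_tokens('全文搜索'))   # ['全', '文', '搜', '索']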

On the basis of Sphinx, Chinese developers created Coreseek to support Chinese word segmentation. The only difference between Coreseek and Sphinx is that Coreseek additionally supports the mmseg segmentation algorithm for Chinese.

mmseg segmentation

The MMSEG segmentation algorithm is based on a statistical model, so its rules are derived from corpus analysis and mathematical induction. Because Chinese characters have no clear word boundaries, a large number of boundary ambiguities arise; moreover, in Chinese, words and phrases are themselves hard to define. So besides statistics and mathematical induction, the algorithm also has to resolve ambiguity.

In mmseg segmentation there is a concept called a chunk.

A chunk is one candidate way of segmenting a piece of text. It consists of an array of words together with four attributes.

For example, "研究生命" has the two candidate segmentations "研究/生命" (research / life) and "研究生/命" (graduate student / fate); these are two chunks.

A chunk has four attributes: length, average word length (length / word count), variance of the word lengths, and the degree of freedom of its single-character words (the sum of the logarithms of their word frequencies).

Segmentation produces several candidate chunks, so filtering rules are applied to resolve the ambiguity and arrive at the final segmentation.

Ambiguity resolution rules:

1. Maximum Matching

Prefer the chunk that matches words of maximum length. For example, for "国际化" (internationalization) there are the segmentations "国际/化" and "国际化"; the latter is chosen.

2. Largest average word length

Prefer the chunk with the largest average word length. For example, "南京市长江大桥" (Nanjing Yangtze River Bridge) can be segmented as "南京市/长江大桥" or "南京/市长/江大桥"; the former's average word length is 7/2 = 3.5, the latter's is 7/3 ≈ 2.3, so the former segmentation is chosen.

3. Smallest variance of word lengths

Prefer the chunk whose word lengths vary the least. For example, "研究生命科学" can be segmented as "研究生/命/科学" or "研究/生命/科学"; both have the same average word length, so filtering continues: the former's word lengths deviate from the mean (standard deviation about 0.82) while the latter's deviation is 0, so the latter segmentation is chosen.

4. Largest degree of freedom of single-character words

Prefer the chunk whose single-character words occur most frequently (the largest sum of log word frequencies). For example, "主要是因为" can be segmented as "主要/是/因为" or "主/要是/因为"; their lengths and variances are identical, but "是" occurs far more often as a standalone word than "主", so the former segmentation is chosen.

If, after all four filtering rules, more than one chunk still remains, the algorithm can do nothing further; you would have to write your own extension to break the tie.
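Here is a sketch of the four filters; the word frequencies are invented for illustration (real mmseg reads them from its dictionary):

    import math

    FREQ = {'是': 12000, '主': 300}   # hypothetical standalone-word frequencies

    def length(chunk):   return sum(len(w) for w in chunk)
    def avg_len(chunk):  return length(chunk) / len(chunk)
    def variance(chunk):
        m = avg_len(chunk)
        return sum((len(w) - m) ** 2 for w in chunk) / len(chunk)
    def freedom(chunk):  # sum of log frequencies of the single-character words
        return sum(math.log(FREQ.get(w, 1)) for w in chunk if len(w) == 1)

    def disambiguate(chunks):
        # rules 1, 2 and 4 keep the maximum score; rule 3 keeps the minimum variance
        for score, keep_max in ((length, True), (avg_len, True),
                                (variance, False), (freedom, True)):
            best = max(score(c) for c in chunks) if keep_max else min(score(c) for c in chunks)
            chunks = [c for c in chunks if score(c) == best]
            if len(chunks) == 1:
                break
        return chunks   # if several chunks survive all four rules, mmseg gives up

    print(disambiguate([['主要', '是', '因为'], ['主', '要是', '因为']]))
    # [['主要', '是', '因为']], selected by rule 4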

Finally

Of course, some will say that a database index could also do what a Sphinx index does. True, but the data structures differ, and the biggest difference is that Sphinx behaves like a single-table database with no support for relational queries at all. Moreover, an index mainly serves the search feature rather than acting as the primary data source. So your database may be in third normal form while the index is completely denormalized and contains mostly just the data that needs to be searched.
On the other hand, most databases suffer from internal fragmentation, forcing them to perform too many semi-random I/O operations on large requests. That is, in a database, the query goes to the index and the index points to the data; if fragmentation has scattered that data across the disk, the query takes a long time.

Summary

Through practice on a project, I found that using Sphinx mainly comes down to the configuration file: once you understand the configuration, the basic usage is easy to pick up. To dig deeper, for example into how it works internally, you have to read more material. I have not used the advanced features yet and will share them later. Finally, if you want to extend Sphinx and build more powerful custom features, you can read the source code and write extensions. Sphinx also has drawbacks: if you need to guarantee high-quality search, you must manually maintain the dictionary all the time; if you cannot keep the dictionary updated, you might consider a plugin such as Baidu's search instead. Adding machine learning on top would make it even better.

This is an original article. My writing skill and learning are limited; if anything in it is wrong, please do let me know.

If this article helped you, please click "recommend"; writing is not easy.
