[Search engine] Sphinx: introduction and principle exploration
What/definition of Sphinx
Sphinx is a full-text search engine.
Features
- Excellent indexing and search performance
- Easy to integrate SQL and XML data sources, and use the SphinxAPI, SphinxQL, or SphinxSE search interface
- Easy to scale through distributed search
- High-speed index creation (on contemporary CPUs, peak indexing speed can reach 10-15 MB/second)
- High-performance search (on an index of roughly 1 million documents, about 150-250 queries per second)
Why/Why use Sphinx
Consider a requirement like this: users can search an article by its title and its content, but the title and the content are stored in separate databases, and those databases sit in different data centers.
Possible approaches
A. perform cross-database LIKE queries directly in the database
Advantage: simple operation
Disadvantage: low efficiency, resulting in large network overhead
B. Use the Sphinx search engine combined with Chinese word segmentation
Advantage: high efficiency and high scalability
Disadvantage: not responsible for data storage
Index the data with the Sphinx search engine: the data is loaded once and held in memory, so searches only need to hit the Sphinx server. Moreover, Sphinx does not suffer from the disk I/O weakness that accompanies MySQL, so its performance is better.
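For a rough feel of option B, here is a minimal sketch in Python of querying searchd over SphinxQL, which speaks the MySQL wire protocol. Port 9306 is the usual SphinxQL default; the index name articles_index and the keywords are made-up placeholders.

# Minimal sketch: full-text search against searchd via SphinxQL instead of a
# cross-database LIKE. Assumes searchd listens on the default SphinxQL port
# 9306; "articles_index" is a hypothetical index covering title and content.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9306, user="")
try:
    with conn.cursor() as cur:
        # MATCH() runs against Sphinx's full-text index, not against MySQL.
        cur.execute("SELECT id FROM articles_index WHERE MATCH('search keywords') LIMIT 20")
        doc_ids = [row[0] for row in cur.fetchall()]
        print(doc_ids)  # fetch full rows from MySQL by primary key if needed
finally:
    conn.close()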
Other typical scenarios
1. Fast, efficient, scalable, core full-text retrieval
- When the data volume is large, it is faster than MyISAM and InnoDB.
- You can create an index for the hybrid data of multiple source tables. It is not limited to fields in a single table.
- It can integrate search results from multiple indexes.
- Full-text search can be optimized based on additional conditions on the attribute.
2. Use the WHERE clause and LIMIT clause efficiently
When a SELECT query uses several WHERE conditions, index selectivity may be poor or the fields may not be indexed at all, and performance suffers. Sphinx can index the keywords. The difference is that in MySQL the internal engine decides whether to use an index or a full scan, while Sphinx lets you choose the access method. Because Sphinx keeps the data in RAM, it performs few I/O operations. MySQL, by contrast, performs semi-random disk I/O, reads records row by row into the sort buffer, sorts them, and then discards most of the rows. As a result, Sphinx uses less memory and less disk I/O.
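As an illustration (not the author's code), attribute filtering and paging might look like this through the classic sphinxapi.py client that ships with Sphinx. The attribute name category_id is hypothetical, and 9312 is the usual searchd API port.

# Sketch of pushing WHERE/LIMIT-style work onto Sphinx attributes.
# Assumes the classic sphinxapi.py client; "category_id" is hypothetical.
from sphinxapi import SphinxClient, SPH_MATCH_EXTENDED2

cl = SphinxClient()
cl.SetServer("127.0.0.1", 9312)           # searchd API port
cl.SetMatchMode(SPH_MATCH_EXTENDED2)
cl.SetFilter("category_id", [3, 7])       # attribute filter, i.e. the WHERE part
cl.SetLimits(0, 20)                       # offset/limit, i.e. the LIMIT part
result = cl.Query("keyword", "test_index")
if result:
    for match in result["matches"]:
        print(match["id"], match["attrs"])
else:
    print(cl.GetLastError())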
3. Optimize group by queries
In Sphinx, sorting and grouping use a fixed amount of memory, which is slightly more efficient than a MySQL query even when the whole dataset fits in RAM.
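For example, a grouped query can be expressed directly in SphinxQL. This is a sketch only: category_id is a hypothetical attribute (declared, say, via sql_attr_uint), and the index name matches the configuration shown later.

# Sketch: GROUP BY handled inside Sphinx with a fixed memory budget.
import pymysql

conn = pymysql.connect(host="127.0.0.1", port=9306, user="")
with conn.cursor() as cur:
    cur.execute(
        "SELECT category_id, COUNT(*) FROM test_index "
        "WHERE MATCH('keyword') GROUP BY category_id LIMIT 10"
    )
    for category_id, count in cur.fetchall():
        print(category_id, count)
conn.close()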
4. Generate result sets in parallel
Sphinx can generate several result sets from the same data at the same time, still using a fixed amount of memory. By comparison, the traditional SQL approach either runs two separate queries or creates a temporary table for each result set. Sphinx does this with its multi-query mechanism: instead of issuing queries one by one, several queries are collected into a batch and submitted in a single request.
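A hedged sketch with the classic sphinxapi.py client: two result sets over the same keyword, one plain and one grouped by a hypothetical category_id attribute, returned from a single round trip.

# Sketch of Sphinx's multi-query (batched) mechanism.
from sphinxapi import SphinxClient, SPH_GROUPBY_ATTR

cl = SphinxClient()
cl.SetServer("127.0.0.1", 9312)

# Query 1: plain match, first page of results.
cl.SetLimits(0, 20)
cl.AddQuery("keyword", "test_index")

# Query 2: same match, grouped by the (hypothetical) category_id attribute.
cl.SetGroupBy("category_id", SPH_GROUPBY_ATTR)
cl.AddQuery("keyword", "test_index")

results = cl.RunQueries()      # one network round trip, several result sets
if results:
    for res in results:
        print(res.get("total_found", 0), len(res.get("matches", [])))
else:
    print(cl.GetLastError())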
5. Scale up and scale out
- Scaling up: add CPUs/cores and expand disk I/O
- Scaling out: multiple machines, that is, distributed Sphinx
6. Aggregate partition data
It is applicable when data is distributed across different physical MySQL servers.
Example: there is a 1 TB table with 1 billion articles, partitioned across 10 MySQL servers by user ID. A query for a single user is certainly fast, but if you need a paginated archive that shows the articles published by all of a user's friends, you have to query several MySQL servers at the same time, which is slow. With Sphinx, you only need to create a few instances, map the frequently accessed article attributes from each table, and then run paginated queries, about three lines of configuration in total.
This article focuses on how Sphinx works. There are plenty of articles online about installing Sphinx, so installation is not repeated here. Next, the Sphinx configuration file is explained so that Sphinx can be put to work.
How/How to use Sphinx
Sphinx workflow
Flowchart
Database: the data source, that is, where the data indexed by Sphinx comes from. Because Sphinx is independent of storage engines and databases, the data source can be MySQL, PostgreSQL, XML, and so on.
Indexer: The indexing program that obtains data from the data source and generates a full-text index of the data. You can regularly run Indexer to update indexes as needed.
Searchd: Searchd directly communicates with the client program, and uses indexes built by the Indexer program to quickly process search queries.
APP: client program. Receives the search string from the user input, sends the query to the Searchd program, and displays the returned results.
How Sphinx works
The overall workflow of Sphinx is: the Indexer program extracts data from the database, performs word segmentation on the data, builds one or more indexes from the segmented words, and passes them to the searchd program. The client can then search by calling the API.
Having looked at the working principle and the configuration of Sphinx, we continue with how the Indexer program, which is responsible for indexing, actually builds an index.
After Sphinx reads the data from the database according to the configuration file, it hands the data to the Indexer program. Indexer then reads the records one by one and indexes each record using the configured word segmentation algorithm, which can be unigram (one-character) segmentation or mmseg segmentation. The following sections describe the data structures and algorithms Indexer uses for indexing.
Data source configuration
Let's take a look at the sample configuration file of a Data source:
source test
{
    type = mysql

    sql_host = 127.0.0.1
    sql_user = root
    sql_pass = root
    sql_db   = test
    sql_port = 3306    # optional, default is 3306

    sql_query_pre = SET NAMES utf8
    sql_query     = SELECT id, name, add_time FROM tbl_test

    sql_attr_timestamp = add_time

    sql_query_info_pre = SET NAMES utf8
    sql_query_info     = SELECT * FROM tbl_test WHERE id=$id
}
Where
source is followed by the name of the data source, which is referenced later when defining the index;
type: the data source type, which can be MySQL, PostgreSQL, Oracle, and so on;
sql_host, sql_user, sql_pass, sql_db, sql_port: the connection and authentication information for the database;
sql_query_pre: sets the encoding used for the query;
sql_query: the core statement of the data source configuration; Sphinx uses this statement to pull data from the database;
sql_attr_*: index attributes, extra information (values) attached to each document that can be used for filtering and sorting during a search; once attributes are set, Sphinx returns them when you search through the Sphinx API;
sql_query_info_pre: sets the encoding for command-line queries; if you see garbled characters while debugging on the command line, set this option;
sql_query_info: sets the information returned for command-line queries.
Index Configuration
index test_index
{
    source           = test
    path             = /usr/local/coreseek/var/data/test
    docinfo          = extern
    charset_dictpath = /usr/local/mmseg3/etc/
    charset_type     = zh_cn.utf-8
    ngram_len        = 1
    ngram_chars      = U+3000..U+2FA1F
}
Where
test_index, which follows the index keyword, is the index name;
source: the name of the data source;
path: the base name of the index files; the indexer program uses this path as a prefix when generating index file names, for example the attribute set will be stored in /usr/local/sphinx/data/test1.spa, and so on;
docinfo: the storage mode for the indexed documents' attribute values;
charset_dictpath: the directory containing the dictionary file that enables Chinese word segmentation; a uni.lib dictionary file must exist in this directory;
charset_type: the data encoding type;
ngram_len: the segmentation length;
ngram_chars: the set of characters recognized in unigram segmentation mode.
Chinese word segmentation core configuration
Unigram segmentation
charset_type = utf8
ngram_len    = 1
ngram_chars  = U+3000..U+2FA1F
Mmseg Word Segmentation
charset_type     = utf8
charset_dictpath = /usr/local/mmseg3/etc/
ngram_len        = 0
Running example
Database Data
Use the indexer program for Indexing
Query
As you can see, the add_time attribute configured above is returned (position 1 in the screenshot), and the information returned by sql_query_info is shown at position 2.
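For reference, a rough sketch of what such a query might look like through the bundled Python client (sphinxapi.py). Because add_time was declared with sql_attr_timestamp, it comes back as an attribute of every match and can even drive the sort order; host, port, and keyword below are placeholders.

# Sketch: querying test_index and reading back the add_time attribute.
# Assumes the classic sphinxapi.py client and the default API port 9312.
from sphinxapi import SphinxClient, SPH_SORT_ATTR_DESC

cl = SphinxClient()
cl.SetServer("127.0.0.1", 9312)
cl.SetSortMode(SPH_SORT_ATTR_DESC, "add_time")   # newest records first
result = cl.Query("keyword", "test_index")
if result:
    for match in result["matches"]:
        print(match["id"], match["attrs"]["add_time"])
else:
    print(cl.GetLastError())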
Sphinx configuration is not particularly flexible; here the configuration of each part has simply been given following the workflow. For more advanced options, consult the documentation as needed.
Inverted index
An inverted index is a data structure used in full-text search to store the mapping from a word to the document, or set of documents, in which it appears. It is the most common data structure in document retrieval systems.
Inverted index: an inverted index is a storage form that implements the "word-document matrix"; with an inverted index, the list of documents containing a given word can be obtained quickly.
A traditional (forward) index maps index ID -> document content, while an inverted index maps document content (after word segmentation) -> index ID. It can be understood by analogy with forward and reverse proxies: a forward proxy forwards internal requests to the outside, and a reverse proxy forwards external requests to the inside. In that sense, "transposed index" is arguably the more accurate name.
An inverted index mainly consists of two parts: the word dictionary and the inverted file.
The word dictionary is a very important part of the inverted index. It maintains information about every word that appears in the document collection and records where each word's inverted list is located in the inverted file. When serving a search, the user's query words are looked up in the word dictionary to obtain their inverted lists, which then form the basis for subsequent ranking.
A large document collection may contain hundreds of thousands or even millions of distinct words, and how quickly a word can be located directly affects search response time, so an efficient data structure is needed for building and searching the word dictionary. Common choices are hash tables combined with linked lists, and tree-structured dictionaries.
Basic inverted index knowledge
- Document: a document covers more forms than plain text; Word, PDF, HTML, XML and other file formats, as well as an email, a text message, or a microblog post, can all be called documents. In what follows, "document" refers to a piece of text information.
- Document collection: a set of documents is called a document collection; for example, a large number of web pages or a large number of emails form a document collection.
- Document ID: inside a search engine, each document in the collection is assigned a unique internal number, which serves as its unique identifier; this internal number is called the document ID (docid).
- Word ID: similarly, a search engine represents each word with a unique number, the word ID, which serves as the word's unique identifier.
The Indexer program splits the fetched records with the configured word segmentation algorithm and then stores the result in an inverted index.
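To make the idea concrete, here is a toy inverted index in Python. It uses naive whitespace "segmentation" and made-up documents, purely for illustration; real systems also store positions, frequencies, and a compressed on-disk layout.

# Toy illustration of an inverted index: a dictionary mapping each term to the
# list of document IDs that contain it.
from collections import defaultdict

docs = {
    1: "sphinx is a full text search engine",
    2: "mysql is a relational database",
    3: "sphinx indexes data pulled from mysql",
}

inverted = defaultdict(list)          # word dictionary -> posting lists
for doc_id, text in docs.items():
    for term in set(text.split()):    # naive whitespace "segmentation"
        inverted[term].append(doc_id)

# A query only has to look the word up and read its posting list.
print(sorted(inverted["sphinx"]))     # -> [1, 3]
print(sorted(inverted["mysql"]))      # -> [2, 3]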
Unigram (one-character) segmentation algorithm
Core configuration for unigram segmentation
charset_type = zh_cn.utf-8
ngram_len    = 1
ngram_chars  = U+4E00..U+9FBF
ngram_len is the segmentation length.
ngram_chars specifies the character set recognized in unigram segmentation mode.
Native Sphinx supports the unigram segmentation algorithm, which cuts each character of a record and indexes it individually. The advantage of such an index is high coverage: every record is guaranteed to be searchable. The disadvantage is that the index file becomes large and updating the index consumes a lot of resources. Therefore, if there is no special requirement and the amount of data is not large, unigram segmentation is not recommended.
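A sketch of what ngram_len = 1 effectively does: every CJK character inside the ngram_chars range becomes its own token. The example text is arbitrary.

# Toy illustration of unigram segmentation: each character in the configured
# ngram_chars range (here U+4E00..U+9FBF) is emitted as a separate token.
def unigram_tokens(text, lo="\u4e00", hi="\u9fbf"):
    return [ch for ch in text if lo <= ch <= hi]

print(unigram_tokens("研究生命起源"))
# ['研', '究', '生', '命', '起', '源'] -- every record stays searchable,
# but the index grows quickly because every character is indexed.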
Coreseek, which supports Chinese word segmentation, is developed on top of Sphinx. The only difference between Coreseek and Sphinx is that Coreseek additionally supports the mmseg segmentation algorithm for Chinese word segmentation.
Mmseg Word Segmentation
The mmseg segmentation algorithm is based on a statistical model, so its rules also come from analysis of and mathematical induction over a corpus. Because Chinese characters have no clear word boundaries, a large number of character-boundary ambiguities arise; moreover, words and phrases in Chinese are hard to define. So besides statistics and induction, the algorithm also has to resolve ambiguity.
In mmseg word segmentation, there is a concept called chunk.
A chunk is one way of segmenting a piece of text: it consists of an array of words and carries four properties.
For example, "研究生命" can be segmented as "研究/生命" (research / life) or "研究生/命" (graduate student / life); each of these segmentations is a chunk.
A chunk has four properties: total length, average word length (total length / number of words), variance of the word lengths, and the degree of freedom of its single-character words (the sum of the log frequencies of those single-character words).
Segmentation usually yields several candidate chunks, so filtering rules are applied to resolve the ambiguity and arrive at the final segmentation.
Ambiguity Resolution rules:
1. Maximum matching
Prefer the longest match. For example, "国际化" can be segmented as "国际/化" or kept whole as the single word "国际化"; the latter is selected.
2. Maximum mean word length
Take the chunk with the largest average word length. For example, "南京市长江大桥" (Nanjing Yangtze River Bridge) has two segmentations: "南京市/长江大桥" and "南京/市长/江大桥". The average word length of the former is 7/2 = 3.5 and that of the latter is 7/3 ≈ 2.3, so the former segmentation is selected.
3. Smallest variance of word lengths
Take the chunk whose word lengths have the smallest variance. For example, "研究生命科学" can be segmented as "研究生/命/科学" (graduate student / life / science) or "研究/生命/科学" (research / life / science). Both have the same average word length of 2, so filtering continues: the word lengths of the former (3, 1, 2) vary, while those of the latter are all 2, giving a variance of 0, so the latter segmentation is selected.
4. Maximum degree of freedom of single-character words
Take the chunk whose single-character words have the largest sum of log frequencies. For example, "主要是因为" can be segmented as "主要/是/因为" or "主/要是/因为"; the two have the same word lengths and variance, but the single character "是" occurs far more frequently than "主", so the former segmentation is selected.
If more than one chunk still remains after these four filtering rules, the algorithm can do no more; you would have to write an extension yourself to finish the job.
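The following Python sketch computes the four chunk properties and applies the rules in order for the "研究生命科学" example above. It is illustrative only; the word frequencies are invented.

# Rough sketch of mmseg chunk filtering. Each chunk is a candidate segmentation;
# the four rules are applied one after another until a single chunk survives.
import math

FREQ = {"是": 5000, "主": 200, "命": 150}   # made-up single-character frequencies

def properties(chunk):
    lengths = [len(w) for w in chunk]
    total = sum(lengths)
    avg = total / len(lengths)
    variance = sum((l - avg) ** 2 for l in lengths) / len(lengths)
    freedom = sum(math.log(FREQ.get(w, 1)) for w in chunk if len(w) == 1)
    return total, avg, variance, freedom

def mmseg_pick(chunks):
    # Rule 1: maximum total length; Rule 2: largest average word length;
    # Rule 3: smallest variance of word lengths; Rule 4: largest freedom.
    for key in (lambda p: p[0], lambda p: p[1], lambda p: -p[2], lambda p: p[3]):
        best = max(key(properties(c)) for c in chunks)
        chunks = [c for c in chunks if key(properties(c)) == best]
        if len(chunks) == 1:
            return chunks[0]
    return chunks  # still ambiguous: a real system would need an extension here

candidates = [["研究生", "命", "科学"], ["研究", "生命", "科学"]]
print(mmseg_pick(candidates))   # -> ['研究', '生命', '科学'] (smallest variance wins)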
Finally
Of course, some will say that a database index can achieve what a Sphinx index does. Beyond the different data structures, the biggest difference is that a Sphinx index behaves like a single-table database with no support for relational queries. Moreover, the index mainly serves the search function rather than acting as the primary data source. So while your database may satisfy the third normal form, the index is completely denormalized and mostly contains only the data that needs to be searched.
In addition, most databases suffer from internal fragmentation, which forces them into too much semi-random I/O on large requests: with a database index, the query goes to the index and the index points to the data, and if fragmentation has scattered that data across the disk, the query takes a long time.
Summary
Working through a project, I found that the key to Sphinx lies mainly in the configuration file. Once you know how to configure it, the basic usage is easy to pick up. To study it in depth, for example how it works internally, you have to read more material. I have not used the advanced features yet and will share them later. Finally, if you want to extend Sphinx and build more powerful features, you can read the source code directly and write extensions. Sphinx also has drawbacks: to guarantee high-quality search, the word dictionary has to be maintained manually. If the dictionary cannot be updated frequently, you can consider plug-ins such as Baidu search instead; adding machine learning would make it even better.
This is an original article and my writing skill is limited; if anything in it is wrong, please let me know.