Use Xapian to build your own search engine: Search

Last Update:2018-12-05 Source: Internet

Author: User

Tags split words

Document directory

Accuracy and recall rate
Performance
Boolean search
Probabilistic IR and relevance
Queryparser
Query
Practice

If you have followed the previous articles and looked at Omega, you should by now be able to create a database and add documents to it. With data in place, the next step is to retrieve it. In an IR system (not just Xapian), retrieval methods are varied, ranking methods are varied, and results can be presented in a user-friendly way; this is the biggest advantage over relational databases. Because there is a lot of material, retrieval, ranking, and result handling are covered in separate articles. This article describes retrieval.

Because IR systems have so many advantages, end users expect a lot from them, yet nothing is perfect. IR systems are therefore judged by three criteria: recall, precision (called accuracy in this article's headings), and query efficiency. The three criteria pull against each other, so they can only be traded off, never all maximized at once. Different users care about different criteria and weigh them differently. For a web search engine, query efficiency comes first, and precision and recall are considered after that. The meaning of precision is fairly clear from its name, but what is recall?

Accuracy and recall rate

Accuracy is sometimes also called precision. Consider a database of 500 documents, 50 of which are relevant to a query. The system retrieves 75 documents, but only 45 of them are relevant.

Recall rate R = 45/50 = 90%

Precision P = 45/75 = 60%

In this example the search is effective in that recall is 90%. However, the results are noisy: 30 of the 75 retrieved documents (40%) are irrelevant. In general it is difficult to achieve high recall without sacrificing precision. For a retrieval system the two tend to move in opposite directions: when recall is high, precision is low, and when precision is high, recall is low. A search engine can return more and more results in order to find more relevant ones, which improves recall, but this also brings in more irrelevant results, which lowers precision. Because no search engine can collect every web page, recall is hard to calculate for the web. For this reason recall and precision are usually not quoted in isolation; instead one is held at a fixed value while the other is examined, for example how precision changes when recall is 60%. Between the two, web search engines tend to favor precision, because end users want the data they are looking for rather than a pile of merely plausible results.
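The recall and precision figures from the 500-document example can be reproduced with two small helper functions. This is a minimal sketch; the function names are made up for illustration and are not part of Xapian.

```cpp
#include <cassert>

// Recall: the fraction of all relevant documents that were retrieved.
double recall(int relevantRetrieved, int relevantTotal) {
    assert(relevantTotal > 0);
    return static_cast<double>(relevantRetrieved) / relevantTotal;
}

// Precision: the fraction of retrieved documents that are relevant.
double precision(int relevantRetrieved, int retrievedTotal) {
    assert(retrievedTotal > 0);
    return static_cast<double>(relevantRetrieved) / retrievedTotal;
}
```

With the numbers above, recall(45, 50) gives 0.9 and precision(45, 75) gives 0.6.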

For a traditional library information retrieval system the situation is very different. Books and articles have good keyword indexes and clearly defined structured data such as title, author, abstract, body, and date of entry. The document collection is relatively stable and relatively small, and an end user may know only one or two words of a book's title. If recall is low, that user may come away empty-handed.

From this we can see that the relative importance of precision and recall as evaluation criteria depends on the application. Both are directly influenced by the word segmentation strategy. For Chinese word segmentation, which is our main concern, there are generally the following strategies:

First, segmentation by single character, the default. This strategy is the easiest to implement. For example, the sentence "我们在吃饭呢" ("we are eating") is split character by character into [我], [们], [在], [吃], [饭], and [呢]. This strategy produces the fewest distinct terms, because only a few thousand Chinese characters are in common use, but as the amount of indexed data grows, the index files grow faster than with the strategies below. Recall is very high, but precision is very low, and performance is generally the worst.

Second, binary segmentation, in which every two adjacent characters in a sentence form a term. Taking "我们在吃饭呢" again, binary segmentation yields [我们], [们在], [在吃], [吃饭], and [饭呢]. This is better than the first strategy: precision improves and recall is not much lower (in truth neither is high; the result is a middling compromise).

Third, segmentation by word meaning. This strategy relies on a dictionary; common examples are forward and reverse maximum matching. For "我们在吃饭呢", forward maximum matching might yield [我们], [在吃], [饭], and [呢], while reverse maximum matching might yield [我们], [在], [吃饭], and [呢]. Provided the dictionary is large enough, dictionary-based segmentation gives quite good results.

Fourth, segmentation based on statistical probability. Using a probability model, this strategy estimates the probability of the next word from the words already seen. As a rough example, in the sentence "我们在吃饭呢", given that the word [我们] has already been found, the model derives the probability that [吃饭] follows. Models used in practice are of course much more complex, such as the well-known hidden Markov model.

In practical Chinese word segmentation, the dictionary-based and statistical strategies are generally combined to resolve ambiguity and improve precision.
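The first three strategies can be sketched in a few lines of C++, using the Chinese sentence 我们在吃饭呢 (roughly "we are eating") as input. This is only an illustration: the function names, the tiny dictionary passed by the caller, and the simplified UTF-8 handling (no validation) are all assumptions, and real segmenters are far more elaborate.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Single-character segmentation: one term per UTF-8 code point.
std::vector<std::string> splitCharacters(const std::string& text) {
    std::vector<std::string> chars;
    for (std::size_t i = 0; i < text.size();) {
        unsigned char lead = static_cast<unsigned char>(text[i]);
        std::size_t len = 1;                 // ASCII or stray byte
        if (lead >= 0xF0) len = 4;
        else if (lead >= 0xE0) len = 3;      // most Chinese characters
        else if (lead >= 0xC0) len = 2;
        chars.push_back(text.substr(i, len));
        i += len;
    }
    return chars;
}

// Binary segmentation: every two adjacent characters form one term.
std::vector<std::string> bigrams(const std::string& text) {
    std::vector<std::string> chars = splitCharacters(text);
    std::vector<std::string> terms;
    for (std::size_t i = 0; i + 1 < chars.size(); ++i) {
        terms.push_back(chars[i] + chars[i + 1]);
    }
    return terms;
}

// Forward maximum matching: at each position take the longest dictionary
// word of at most maxLen characters, else fall back to one character.
std::vector<std::string> forwardMaxMatch(const std::string& text,
                                         const std::set<std::string>& dict,
                                         std::size_t maxLen) {
    assert(maxLen >= 1);
    std::vector<std::string> chars = splitCharacters(text);
    std::vector<std::string> terms;
    std::size_t i = 0;
    while (i < chars.size()) {
        std::size_t take = 1;
        for (std::size_t len = std::min(maxLen, chars.size() - i); len > 1; --len) {
            std::string candidate;
            for (std::size_t k = 0; k < len; ++k) candidate += chars[i + k];
            if (dict.count(candidate) != 0) { take = len; break; }
        }
        std::string term;
        for (std::size_t k = 0; k < take; ++k) term += chars[i + k];
        terms.push_back(term);
        i += take;
    }
    return terms;
}
```

With a hypothetical dictionary containing 我们, 在吃, and 吃饭 and maxLen set to 2, forwardMaxMatch splits the sentence into 我们, 在吃, 饭, 呢. The greedy left-to-right scan commits to 在吃 before it ever sees 吃饭, which is the kind of ambiguity that statistical methods are used to resolve.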

Performance

As mentioned above, single-character segmentation may give the worst query performance. At first glance this strategy simply has low precision and high recall, so why is its performance poor? To make the explanation easier, assume that 20,000 articles need to be stored and indexed. If the articles consist entirely of Chinese characters, of which about 4,000 to 5,000 are in common use, then on average each character would index only 4 to 5 articles. Unfortunately, many characters occur very frequently. Take the characters of "我们在吃饭呢" ("we are eating") used above: [我], [们], [在], [吃], [饭], and [呢] can each be expected to appear in at least 70% to 80% of articles.

A common storage layout keeps the index and the data (the article content) separate, and stores the index in some kind of tree (red-black tree, AVL tree, or B-tree). Besides pointers to its parent and child nodes, each node usually stores the IDs of the documents it indexes (docid in Xapian), through which the article content can be found quickly. In Xapian a docid is a 32-bit unsigned integer, which occupies 4 bytes. If the character [我] ("I") appears in 50% of the 20,000 articles, the node for [我] occupies at least 4 * 10,000 bytes, almost 40 KB! If permanent storage were as fast as memory, this layout would not be a problem. But permanent data currently lives on hard disks or tape drives, which use a head on a mechanical arm to read and write the medium. To reduce the number of head movements, these devices read data one page at a time, with each page typically holding 2^11 to 2^14 bytes. Even so, access to a hard disk or tape drive is about 5 orders of magnitude slower than access to main memory. This is why I/O is the most expensive operation, and why we keep saying that "the database is the bottleneck".

Clearly, if by the reasoning above the node for [我] ("I") spans more than 10 disk pages, that is crazy. If articles are split into multi-character words by a word segmentation technique, the number of documents indexed under each term drops sharply. As mentioned above, most IR and database systems store their indexes as B*-trees. The B*-tree is a data structure with excellent disk I/O behavior: each node is generally the same size as a disk page, each node can store N keys, and each node has N + 1 children. In a B*-tree of height 2, at most 2 nodes need to be read to reach the target node; in other words, the mechanical arm that controls the head moves only twice. This is where a good data structure shows its worth.
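The arithmetic behind this argument can be checked with a short sketch. The helper names are invented for illustration, and the 4096-byte page size and 1000 keys per node used in the usage note are assumptions chosen to make the numbers concrete, not values taken from Xapian.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to store one docid for every document that contains a term.
std::uint64_t postingListBytes(std::uint64_t matchingDocs, std::uint64_t bytesPerDocid) {
    return matchingDocs * bytesPerDocid;
}

// Number of disk pages needed to hold that many bytes (rounded up).
std::uint64_t pagesNeeded(std::uint64_t bytes, std::uint64_t pageSize) {
    assert(pageSize > 0);
    return (bytes + pageSize - 1) / pageSize;
}

// Maximum number of keys in a tree where every node holds keysPerNode keys
// and has keysPerNode + 1 children. The root is at height 0.
std::uint64_t treeMaxKeys(std::uint64_t keysPerNode, unsigned height) {
    std::uint64_t nodesAtLevel = 1;  // the root
    std::uint64_t totalNodes = 0;
    for (unsigned level = 0; level <= height; ++level) {
        totalNodes += nodesAtLevel;
        nodesAtLevel *= keysPerNode + 1;
    }
    return totalNodes * keysPerNode;
}
```

For 10,000 matching documents and 4-byte docids, postingListBytes returns 40,000, which pagesNeeded turns into 10 pages of 4096 bytes. By contrast, treeMaxKeys(1000, 2) shows that a tree of height 2 with 1000 keys per node can address just over one billion keys, which is why so few page reads are needed to reach any one of them.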

Of course, this discussion considers only hard disks and tape drives; real architectures do better. With only 20,000 articles and enough main memory, you could even read all the articles into memory once and avoid disk I/O on reads altogether, at the awkward cost of very slow writes. Data files in today's database and IR systems usually run to several GB, so avoiding frequent disk reads and writes is the first priority when improving performance.

Do not conclude, however, that an IR system is simply better than a relational database. One weakness of an IR system is that inserts, updates, and deletes are relatively slow, because each has to pass through several layers of processing. The primary job of an IR system is therefore retrieval and storage.

Boolean search

Although the IR system can segment words for us, sometimes we want to "help" it understand what we are searching for. For example, in the Baidu or Google search box we might enter "我们 吃饭" (two words separated by a space) to find articles about both "我们" (we) and "吃饭" (eating), rather than simply entering "我们吃饭". The two inputs give completely different results, because "我们吃饭" is treated as a single term in Google's IR system.

Input such as "我们 吃饭" is in fact a Boolean search. In Xapian, multiple terms are connected with AND, OR, or AND_NOT. For example:

T1 indexes documents 1 2 3 5 8

T2 indexes documents 2 3 6

So:

T1 AND T2 retrieves 2 3

T1 OR T2 retrieves 1 2 3 5 6 8

T1 AND_NOT T2 retrieves 1 5 8

T2 AND_NOT T1 retrieves 6

In many systems these documents are not ranked by relevance. In Xapian, however, a Boolean query can be used to retrieve a set of documents, which is then ranked probabilistically.
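The T1/T2 example maps directly onto the set algorithms of the C++ standard library when each term's document list is kept sorted. The sketch below is only an illustration of the Boolean semantics; the type and function names are invented here and say nothing about how Xapian stores or merges its posting lists.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

using Postings = std::vector<int>;  // docids in ascending order

// The std::set_* algorithms require sorted input.
static void requireSorted(const Postings& a, const Postings& b) {
    assert(std::is_sorted(a.begin(), a.end()));
    assert(std::is_sorted(b.begin(), b.end()));
}

// a AND b: documents indexed by both terms.
Postings opAnd(const Postings& a, const Postings& b) {
    requireSorted(a, b);
    Postings out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
    return out;
}

// a OR b: documents indexed by either term.
Postings opOr(const Postings& a, const Postings& b) {
    requireSorted(a, b);
    Postings out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
    return out;
}

// a AND_NOT b: documents indexed by a but not by b.
Postings opAndNot(const Postings& a, const Postings& b) {
    requireSorted(a, b);
    Postings out;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
    return out;
}

// a XOR b: documents indexed by exactly one of the two terms.
Postings opXor(const Postings& a, const Postings& b) {
    requireSorted(a, b);
    Postings out;
    std::set_symmetric_difference(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
    return out;
}
```

Calling these with T1 = {1, 2, 3, 5, 8} and T2 = {2, 3, 6} reproduces the four results listed above; opXor additionally gives {1, 5, 6, 8}.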

Probabilistic IR and relevance

Boolean search is the most commonly used kind, but it has never been able to carry an IR system on its own, because the results of a Boolean search are not made more user-friendly by any mechanism, so users must understand the IR system thoroughly to use it effectively. Despite this, IR systems that offer only pure Boolean retrieval still survive quite well.

Relevance is the core concept of the probabilistic model; a set of documents can be ranked by relevance. In essence, a document is relevant if it is one the user wants, and irrelevant otherwise. Ideally, every retrieved document is relevant and every document not retrieved is irrelevant. This is a black-and-white notion. Retrieval is rarely perfect, however, so the two sets do not coincide, and relevance is expressed instead as a degree of association between the query and the document. The theory behind this is quite involved.

Xapian's default weighting scheme is BM25Weight. It computes a weight from a fixed formula whose inputs include statistics such as term frequency and document length. The higher the weight, the higher the relevance. If you do not want BM25Weight, you can use BoolWeight, under which every element has a weight of 0. Ranking is described further in later articles.
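To give a feel for what such a fixed formula looks like, here is a sketch of the commonly cited textbook form of Okapi BM25 for a single query term. It is not Xapian's implementation: Xapian's BM25Weight has its own parameters and normalization, and the function name and the default values k1 = 1.2 and b = 0.75 below are conventional textbook choices assumed for illustration.

```cpp
#include <cassert>
#include <cmath>

// Textbook BM25 contribution of one query term to one document's score.
//   tf            occurrences of the term in the document
//   docLength     length of the document in terms
//   avgDocLength  average document length in the collection
//   docCount      number of documents in the collection
//   docsWithTerm  number of documents that contain the term
double bm25TermWeight(double tf, double docLength, double avgDocLength,
                      double docCount, double docsWithTerm,
                      double k1 = 1.2, double b = 0.75) {
    assert(avgDocLength > 0 && docCount >= docsWithTerm);
    // Rare terms get a larger inverse document frequency.
    double idf = std::log((docCount - docsWithTerm + 0.5) / (docsWithTerm + 0.5) + 1.0);
    // Long documents are penalized relative to the average length.
    double lengthNorm = k1 * (1.0 - b + b * docLength / avgDocLength);
    return idf * tf * (k1 + 1.0) / (tf + lengthNorm);
}
```

The weight grows with term frequency but saturates, is larger for terms that occur in few documents, and shrinks for documents that are longer than average. A document's score for a query is the sum of these weights over the query terms.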

Combined search

By default, Xapian lets you use an arbitrarily complex Boolean query expression to narrow the search, and then ranks the results probabilistically (some Boolean systems restrict query expressions to a particular form).

There are two combinations of Boolean search and probabilistic search:

Use a Boolean search to obtain a subset of all documents, and then use probabilistic search in this subset.
Perform probabilistic search first, and then use Boolean search to filter query results.

The results of the two methods differ slightly. For example, suppose a database contains documents in two languages, English and French. The word "grand" exists in both languages (with a similar meaning), but it is more common in French. If you use the first method and restrict the search to the English subset with a Boolean search first, this word receives a higher weight.

The first method is more accurate but less efficient to execute, so Xapian specifically optimizes the second. This does not mean that Xapian really runs the probabilistic search first and the Boolean search afterwards; in fact it performs both at the same time. This allows several internal optimizations. For example, if the result can already be determined from the probabilistic search, Xapian abandons a Boolean AND operation that is still in progress. These optimizations have been measured to improve performance several times over, and the gain is larger still for queries with many terms.

Queryparser

In an IR system, end users enter input in a format the system defines; this input is called a "query". The IR system passes the query to the query engine, which is itself part of the IR system. The query engine parses the query, performs matching, ranks the result set, and returns the results to the end user.

In Xapian, the Xapian::Query class plays the role of the query. There are two ways to build one. The first is to have the Xapian::QueryParser class parse a query string. The second is to create several Xapian::Query objects representing different subexpressions and then combine them as needed.

The following is the syntax supported by Xapian::QueryParser. It is similar to that of other IR systems.

AND

expression AND expression matches documents that match both expressions (the intersection).

OR

expression OR expression matches documents that match either expression (the union).

NOT

expression NOT expression matches documents that match only the expression on the left.

If the FLAG_PURE_NOT flag is set, NOT expression on its own matches documents that do not match the expression.

XOR

expression XOR expression matches documents that match exactly one of the two expressions, but not documents that match both.

Combined expressions

Parentheses can be used to group Boolean operators and control their precedence, for example: (one OR two) AND three.

+ and -

In a group of terms marked with the + or - operators, only documents that match all of the + terms and none of the - terms are returned. Terms marked with neither + nor - do not restrict the match, but they contribute to the ranking of the documents.
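The matching rule for + and - can be written down as a small predicate over the set of terms in a document. This is a sketch of the semantics only, under the assumption that at least one + term is present; the names are invented for illustration, and counting unmarked terms is just a crude stand-in for the real probabilistic ranking.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

using TermSet = std::set<std::string>;

// True if the document contains every "+" term and no "-" term.
// Unmarked terms never affect whether a document matches.
bool matchesPlusMinus(const TermSet& docTerms,
                      const std::vector<std::string>& plusTerms,
                      const std::vector<std::string>& minusTerms) {
    assert(!plusTerms.empty());  // this sketch only covers queries with a + term
    for (const std::string& t : plusTerms)
        if (docTerms.count(t) == 0) return false;
    for (const std::string& t : minusTerms)
        if (docTerms.count(t) != 0) return false;
    return true;
}

// Number of unmarked terms present in the document; matching documents
// with more of them would be ranked higher.
int unmarkedHits(const TermSet& docTerms, const std::vector<std::string>& normalTerms) {
    int hits = 0;
    for (const std::string& t : normalTerms)
        if (docTerms.count(t) != 0) ++hits;
    return hits;
}
```

For a query with +canned +fish -japan, a document containing canned, fish, and export matches, while one that also contains japan, or one that lacks canned, does not.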

NEAR

one NEAR two NEAR three matches documents in which the three terms occur within 10 words of one another. Where do word positions come from? The article "Use xapian to build your own search engine: Document, term, and value" introduced the add_posting method of the Document class, which adds terms together with their positions.

The default distance for NEAR is 10; it can be changed with NEAR/n, for example one NEAR/6 two.

ADJ

ADJ is like NEAR, but the terms must appear in the same order as in the query. one ADJ two ADJ three therefore matches documents in which one, two, and three appear in that order within 10 words of one another.
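The difference between NEAR and ADJ can be illustrated for two terms, given the word positions at which each term occurs in a document. This is a toy model: it treats "within N words" as a position difference of at most N, which is an assumption made here for simplicity, and Xapian's exact window rule may differ. The function names are invented for illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <vector>

using Positions = std::vector<int>;  // word positions of one term in one document

// NEAR sketch: some occurrence of a and some occurrence of b are at most
// `window` positions apart, in either order.
bool nearMatch(const Positions& a, const Positions& b, int window = 10) {
    assert(window > 0);
    for (int pa : a)
        for (int pb : b)
            if (pa != pb && std::abs(pa - pb) <= window) return true;
    return false;
}

// ADJ sketch: as NEAR, but a must occur before b.
bool adjMatch(const Positions& a, const Positions& b, int window = 10) {
    assert(window > 0);
    for (int pa : a)
        for (int pb : b)
            if (pb > pa && pb - pa <= window) return true;
    return false;
}
```

If "one" occurs at position 9 and "two" at position 3, nearMatch succeeds because order does not matter, while adjMatch fails because "two" comes first.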

Phrase search

A phrase enclosed in double quotes matches documents containing that exact phrase. Phrase matching is also applied to things such as file names and email addresses.

Using field names

If the terms in the database were indexed with a prefix, you can use the add_prefix method of QueryParser to set up a prefix mapping. For example, queryparser.add_prefix("subject", "S") maps subject to S. If a term is stored as "Stitle", the user can then search for it with subject:title. You may recall that Google supports similar syntax: entering "site:www.wlstock.com" in the Google search box retrieves only pages from www.wlstock.com. This feature gives Xapian the equivalent of fields in Lucene.

Range search

In Xapian, range search is provided by the Xapian::ValueRangeProcessor class, which is available from Xapian 1.0.0 onwards. As the name suggests, it can search ranges of values only, not ranges of terms.

Xapian::ValueRangeProcessor is an abstract base class, so in practice you use one of its subclasses. Xapian provides three ready-made subclasses: StringValueRangeProcessor, DateValueRangeProcessor, and NumberValueRangeProcessor. You can also derive your own subclass from Xapian::ValueRangeProcessor.

When a Xapian::ValueRangeProcessor is invoked, it is passed the start and end of the range. If the subclass cannot interpret the range it is given, it returns Xapian::BAD_VALUENO.

The following uses StringValueRangeProcessor as an example. Suppose the database stores the author name in value number 4 (values are identified by number; for details see "Use xapian to build your own search engine: Document, term, and value"). The query expression can then be written as mars asimov..bradbury, but that alone is not enough; you also need to create a StringValueRangeProcessor:

    Xapian::QueryParser qp;
    Xapian::StringValueRangeProcessor author_proc(4);
    qp.add_valuerangeprocessor(&author_proc);

The parsed query uses the OP_VALUE_RANGE operator, so the description of the query generated by QueryParser is:

    Xapian::Query(mars:(pos=1) FILTER (VALUE_RANGE 4 asimov bradbury))

The subexpression (VALUE_RANGE 4 asimov bradbury) matches only documents whose value number 4 is >= asimov and <= bradbury (using string comparison).
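The string comparison behind such a range is ordinary byte-wise ordering, inclusive at both ends. The sketch below shows that test in isolation; the function name is invented for illustration and the code does not use Xapian.

```cpp
#include <cassert>
#include <string>

// True if value lies in the closed range [begin, end] under plain
// byte-wise string comparison.
bool inValueRange(const std::string& value,
                  const std::string& begin,
                  const std::string& end) {
    assert(begin <= end);
    return begin <= value && value <= end;
}
```

With the range asimov..bradbury, the values asimov, bester, and bradbury fall inside the range, while adams and clarke fall outside. Because the comparison is byte-wise, it is case-sensitive, so stored values and range ends need to use the same capitalization.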

Value range search is not complex. For more information, see http://www.xapian.org/docs/valueranges.html.

Synonyms

QueryParser also supports synonym (alias) search with the syntax ~term. How to add synonyms is described later.

Wildcards

QueryParser supports wildcards in the form of a trailing "*", so "wildc*" matches "wildcard", "wildcarded", "wildcards", "wildcat", and "wildcats". This feature is disabled by default. To enable it, include Xapian::QueryParser::FLAG_WILDCARD in the flags passed to Xapian::QueryParser::parse_query(query_string, flags). The parser also needs to know which database to expand wildcards against, which is set with QueryParser's set_database method.
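Conceptually, a trailing wildcard is expanded into every indexed term that starts with the given prefix. Over a sorted term list this is a binary search followed by a short scan. The sketch below illustrates the idea with the standard library; the function name and the in-memory term list are assumptions for illustration, not Xapian's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Return every term in a sorted term list that starts with `prefix`.
std::vector<std::string> expandWildcard(const std::vector<std::string>& sortedTerms,
                                        const std::string& prefix) {
    assert(std::is_sorted(sortedTerms.begin(), sortedTerms.end()));
    std::vector<std::string> matches;
    // The first term that is not less than the prefix is the first candidate.
    auto it = std::lower_bound(sortedTerms.begin(), sortedTerms.end(), prefix);
    // All terms sharing the prefix are adjacent in sorted order.
    for (; it != sortedTerms.end() && it->compare(0, prefix.size(), prefix) == 0; ++it)
        matches.push_back(*it);
    return matches;
}
```

Given a term list that contains wild, wildcard, wildcarded, wildcards, wildcat, wildcats, and wilderness, expanding the prefix wildc returns the five terms from wildcard to wildcats. The expanded terms would then be combined as alternatives in the final query.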

Query

If you do not want to use string query expressions, you can combine multiple queries with the following operators:

OP_AND	Equivalent to the AND supported by QueryParser
OP_OR	Equivalent to the OR supported by QueryParser
OP_AND_NOT	Equivalent to the NOT (AND NOT) supported by QueryParser
OP_XOR	Equivalent to the XOR supported by QueryParser
OP_AND_MAYBE	Returns only the documents matching the subexpression on the left, but uses the weights from both sides for documents that match both.
OP_FILTER	Similar to AND, but only the expression on the left contributes to the weight calculation.
OP_NEAR	Equivalent to the NEAR supported by QueryParser
OP_PHRASE	Equivalent to the phrase search supported by QueryParser
OP_VALUE_RANGE	Equivalent to the range search supported by QueryParser
OP_SCALE_WEIGHT	Scales the weight of the subexpression. If the factor is 0, the subexpression becomes a pure Boolean query.
OP_ELITE_SET	Similar to OP_OR, but sometimes performs better than OP_OR. A detailed explanation is at http://trac.xapian.org/wiki/FAQ/EliteSet
OP_VALUE_GE	Matches documents whose value is greater than or equal to the given value.
OP_VALUE_LE	Matches documents whose value is less than or equal to the given value.

How to create a query that contains only one term

You can use the single-argument constructor: Xapian::Query query(term);

You can also use the constructor that takes several parameters:

    Xapian::Query(const string & tname_,
                  Xapian::termcount wqf_ = 1,
                  Xapian::termpos term_pos_ = 0)

Here wqf stands for Within Query Frequency and lets you specify the weight of this term within the query. If the whole query contains only one term, this parameter is of little use, but it shows its value in combined queries, because the result set becomes more relevant to this term.

term_pos is the position of the term within the query. It too is of little use when the whole query contains only one term, and is generally used in phrase searches.

Combining multiple queries

The query operators listed above are used to combine queries, through another constructor of Xapian::Query:

    Xapian::Query(Xapian::Query::op op_,
                  const Xapian::Query & left,
                  const Xapian::Query & right)

Probabilistic query

A typical probabilistic query simply connects terms with Xapian::Query::OP_OR. For example:

    Xapian::Query query("regulation");
    query = Xapian::Query(Xapian::Query::OP_OR, query, Xapian::Query("import"));
    query = Xapian::Query(Xapian::Query::OP_OR, query, Xapian::Query("export"));
    query = Xapian::Query(Xapian::Query::OP_OR, query, Xapian::Query("canned"));
    query = Xapian::Query(Xapian::Query::OP_OR, query, Xapian::Query("fish"));

This style is rather verbose, however. The following style is cleaner:

    std::vector<std::string> terms;
    terms.push_back("regulation");
    terms.push_back("import");
    terms.push_back("export");
    terms.push_back("canned");
    terms.push_back("fish");
    // Combine all the terms with OR in a single call.
    Xapian::Query query(Xapian::Query::OP_OR, terms.begin(), terms.end());

Boolean query

Suppose there is a Boolean query expression like this:

('EEC' - 'France') and ('1989' or '1991' or '1992') and 'Corporate Law'

This could be built up as bquery like this:

    // 'EEC' but not 'France'
    Xapian::Query bquery1(Xapian::Query::OP_AND_NOT, Xapian::Query("EEC"), Xapian::Query("France"));
    // Any one of the three years
    Xapian::Query bquery2("1989");
    bquery2 = Xapian::Query(Xapian::Query::OP_OR, bquery2, Xapian::Query("1991"));
    bquery2 = Xapian::Query(Xapian::Query::OP_OR, bquery2, Xapian::Query("1992"));
    Xapian::Query bquery3("Corporate Law");
    // All three parts must match
    Xapian::Query bquery(Xapian::Query::OP_AND, bquery1, Xapian::Query(Xapian::Query::OP_AND, bquery2, bquery3));

You can also attach the bquery object created above to a probabilistic query as a Boolean filter on the result set:

    query = Xapian::Query(Xapian::Query::OP_FILTER, query, bquery);

+ and - operators

For example, the query expression regulation import export +canned +fish -japan

is converted to a query as follows:

    std::vector<std::string> plus_terms;
    std::vector<std::string> minus_terms;
    std::vector<std::string> normal_terms;
    plus_terms.push_back("canned");
    plus_terms.push_back("fish");
    minus_terms.push_back("japan");
    normal_terms.push_back("regulation");
    normal_terms.push_back("import");
    normal_terms.push_back("export");
    // All + terms are required; unmarked terms only add weight.
    Xapian::Query query(Xapian::Query::OP_AND_MAYBE,
                        Xapian::Query(Xapian::Query::OP_AND, plus_terms.begin(), plus_terms.end()),
                        Xapian::Query(Xapian::Query::OP_OR, normal_terms.begin(), normal_terms.end()));
    // Exclude documents that contain any - term.
    query = Xapian::Query(Xapian::Query::OP_AND_NOT,
                          query,
                          Xapian::Query(Xapian::Query::OP_OR, minus_terms.begin(), minus_terms.end()));

Practice

After a query object has been created with the QueryParser class or the Query class, you only need to instantiate an Enquire object to use it. Example:

    Xapian::Database db("Index");
    Xapian::Enquire enquire(db);
    enquire.set_query(query);
    // Fetch the first 10 matches, best first.
    Xapian::MSet matches = enquire.get_mset(0, 10);

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
