Build Your Own Search Engine with Xapian: Search

If you have followed the previous articles and taken a look at Omega, you should now be able to create a database and add documents to it. With data in place, the next step is of course finding it again. In an IR system (not just Xapian), retrieval methods are varied, ranking methods are varied, and results are presented in a user-friendly way; this is its biggest advantage over a relational database. Because there is a lot to cover, retrieval, ranking, and result presentation are treated separately. This article covers retrieval.

Because an IR system has so many advantages, end users expect a lot of it, yet nothing in the world is perfect. An IR system is therefore judged by three criteria: recall, precision, and query efficiency. The three pull against each other; they can only be traded off, not all maximized at once, and different users will naturally care about different criteria. For a web search engine, query efficiency comes first, with precision and recall considered after that. The meaning of precision is fairly clear from the name, but what is recall?
Precision and recall
Precision is sometimes loosely called accuracy. As an example, suppose a database contains 500 documents, 50 of which are relevant to a query. The system retrieves 75 documents, but only 45 of them are relevant.

Recall rate R = 45/50 = 90%

Precision P = 45/75 = 60%

In this example, retrieval is effective in that recall is 90%. However, the results are noisy: 40% of the retrieved documents are irrelevant. In general, it is hard to achieve high recall without sacrificing precision, so a retrieval system cannot have the best of both: when recall is high, precision tends to be low, and when precision is high, recall tends to be low. A search engine can return more and more results in order to find more relevant ones, which improves recall, but doing so also brings in more irrelevant results, which lowers precision. Because no search engine can collect every web page, recall is also hard to measure. For that reason, recall and precision are usually not quoted in isolation; instead, one is reported at a fixed value of the other, for example the precision achieved at 60% recall. Between the two, web search engines tend to favor precision, because end users mostly want the data they are looking for rather than a pile of merely plausible results.
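The two ratios from the example can be computed with a few lines of C++. This is only a minimal sketch; the struct and function names are made up for illustration and are not part of Xapian.

```cpp
#include <cstddef>

// Counts describing the outcome of one query (names are illustrative).
struct RetrievalCounts {
    std::size_t relevant_in_collection;  // relevant documents in the database (50 in the example)
    std::size_t retrieved;               // documents the system returned (75)
    std::size_t relevant_retrieved;      // returned documents that are relevant (45)
};

// Recall: fraction of all relevant documents that were actually retrieved.
double recall(const RetrievalCounts& c) {
    if (c.relevant_in_collection == 0) return 0.0;
    return static_cast<double>(c.relevant_retrieved) / c.relevant_in_collection;
}

// Precision: fraction of the retrieved documents that are relevant.
double precision(const RetrievalCounts& c) {
    if (c.retrieved == 0) return 0.0;
    return static_cast<double>(c.relevant_retrieved) / c.retrieved;
}
```

With the counts {50, 75, 45}, recall() gives 0.9 and precision() gives 0.6, matching the figures above.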

For a traditional library information retrieval system, the situation is very different. Books and articles have good keyword indexes and clearly defined structured data such as title, author, abstract, body, and date recorded. The document set is relatively stable and relatively small, and an end user may know only one or two words of a book's title. If recall is low, the user may never find the book.

By now it should be clear that the relative importance of precision and recall as evaluation criteria depends on the application, and that both are directly affected by the word segmentation strategy. For Chinese word segmentation, which concerns us most, the strategies generally fall into the following categories:

First, single-character segmentation, the default. This strategy is the easiest to implement. For example, the sentence "我们在吃饭呢" ("We are eating") is split character by character into [我], [们], [在], [吃], [饭], and [呢]. This produces the fewest distinct terms, because only a few thousand Chinese characters are in common use, but each term indexes far more data, so the index files grow faster than with the strategies below. Recall is very high, but precision is very low, and performance is generally the worst.

Second, bigram segmentation, in which every two adjacent characters in a sentence are treated as a word. For the same sentence, bigram segmentation yields [我们], [们在], [在吃], [吃饭], and [饭呢]. This is better than the first strategy: precision improves and recall does not drop much (in truth neither is high; it is a middle-of-the-road compromise).

Third, segmentation by word meaning. This approach uses a dictionary; common examples are forward and reverse maximum matching. For the same sentence, forward maximum matching might yield [我们], [在吃], [饭], and [呢], while reverse maximum matching might yield [我们], [在], [吃饭], and [呢]. As long as the dictionary is large and can be searched efficiently, dictionary-based segmentation gives quite good results.

Fourth, segmentation based on statistical probability. Using a probability model, this approach derives the probability of the next word from the words already seen. Taking the same sentence as a (perhaps not ideal) example: given that the word [我们] has occurred, the model derives the probability that [吃饭] follows. Models used in practice are of course far more complex, such as the well-known hidden Markov model.

In real Chinese word segmentation applications, dictionary-based and statistical segmentation are generally combined to resolve ambiguity and improve precision.
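To make the first three strategies concrete, here is a rough sketch in C++ applied to the Chinese sentence 我们在吃饭呢 ("we are eating"). It assumes the input is valid UTF-8, and the function names, the tiny dictionary, and the maximum word length are all made up for illustration; a real segmenter is far more elaborate.

```cpp
#include <algorithm>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Single-character segmentation: one string per UTF-8 code point
// (assumes the input is valid UTF-8).
std::vector<std::string> split_chars(const std::string& text) {
    std::vector<std::string> chars;
    for (std::size_t i = 0; i < text.size();) {
        unsigned char lead = static_cast<unsigned char>(text[i]);
        std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        chars.push_back(text.substr(i, len));
        i += len;
    }
    return chars;
}

// Bigram segmentation: every pair of adjacent characters becomes a term.
std::vector<std::string> bigrams(const std::vector<std::string>& chars) {
    std::vector<std::string> terms;
    for (std::size_t i = 0; i + 1 < chars.size(); ++i)
        terms.push_back(chars[i] + chars[i + 1]);
    return terms;
}

// Forward maximum matching: at each position take the longest dictionary
// word of at most max_len characters; fall back to a single character.
std::vector<std::string> forward_max_match(const std::vector<std::string>& chars,
                                           const std::set<std::string>& dict,
                                           std::size_t max_len) {
    std::vector<std::string> terms;
    for (std::size_t i = 0; i < chars.size();) {
        std::size_t take = 1;
        std::string best = chars[i];
        for (std::size_t len = std::min(max_len, chars.size() - i); len > 1; --len) {
            std::string candidate;
            for (std::size_t j = 0; j < len; ++j) candidate += chars[i + j];
            if (dict.count(candidate) != 0) {
                best = candidate;
                take = len;
                break;
            }
        }
        terms.push_back(best);
        i += take;
    }
    return terms;
}
```

For the sample sentence, split_chars returns six characters and bigrams returns five terms. With a dictionary containing 我们, 在吃, and 吃饭 and max_len set to 2, forward_max_match returns 我们, 在吃, 饭, 呢; the result depends entirely on what the dictionary contains.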
Performance
As mentioned above, character-by-character segmentation may have the worst query performance. At first glance, this method simply has low precision and high recall, so why should its performance be poor? For the sake of explanation, assume that 20,000 articles need to be stored and indexed. If the articles consist entirely of Chinese characters, and about 4,000 to 5,000 characters are in common use, then each character indexes only 4 to 5 articles on average. Unfortunately, many characters occur very frequently: each of the six characters from the example above, [我], [们], [在], [吃], [饭], and [呢], probably appears in at least 70% to 80% of articles.

A common storage approach is to store the index and the data (that is, the article content) separately, with the index held in some kind of tree (red-black tree, AVL tree, or B-tree). Besides pointers to its parent and child nodes, each node usually stores the IDs of the documents it indexes (docid in Xapian); through this ID the article content can be found quickly. In Xapian, a docid is a 32-bit unsigned integer, occupying 4 bytes. If the character [我] ("I") appears in 50% of the 20,000 articles, then its node occupies at least 4 × 10,000 bytes, almost 40 KB! If permanent storage were ever as fast as memory, this would not be a problem. But permanent data currently lives on hard disks or tape drives, where a head controlled by a mechanical arm reads and writes the medium. To reduce the number of head movements, these devices are designed to read data a page at a time, with each page holding a power-of-two number of bytes, typically a few kilobytes. Even with this careful design, hard disk and tape access is about 5 orders of magnitude slower than main memory. This is why I/O is the most expensive operation, and why we say "the database is the bottleneck" every day.
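The back-of-the-envelope estimate above can be written down as a small calculation. This is just a sketch: the function names are invented, and the 4096-byte page size used in the note below is an assumption for illustration, not a figure from any particular disk.

```cpp
#include <cstdint>

// Bytes needed for one term's list of docids, if the term occurs in
// percent_containing percent of total_docs documents.
std::uint64_t posting_list_bytes(std::uint64_t total_docs,
                                 std::uint64_t percent_containing,
                                 std::uint64_t bytes_per_docid) {
    return total_docs * percent_containing / 100 * bytes_per_docid;
}

// Number of disk pages that must be read to fetch that many bytes (rounded up).
std::uint64_t pages_needed(std::uint64_t bytes, std::uint64_t page_size) {
    return (bytes + page_size - 1) / page_size;
}
```

posting_list_bytes(20000, 50, 4) gives 40,000 bytes, and with an assumed page size of 4096 bytes pages_needed reports 10 pages for this single term.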

Clearly, by the reasoning above, the node for [我] would span on the order of 10 disk pages, which is crazy. If articles are instead split into multi-character words by word segmentation, the number of documents indexed by each term drops considerably. As mentioned above, the indexes of most IR and database systems are stored as B*-trees. A B*-tree is a data structure with excellent disk I/O behavior: the size of each node generally equals the size of a disk page, each node stores N keys, and each node has N + 1 children. In a B*-tree of height 2, at most 2 nodes need to be read to reach the target node, which means the mechanical arm controlling the head moves only twice. This is where the advantage of a good data structure shows.
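To get a feel for how much such a shallow tree can hold, the sketch below counts the keys in a completely full tree where every node has N keys and N + 1 children. The function name and the choice of N = 1000 in the note are illustrative assumptions; real B-tree variants have fill-factor rules that this ignores.

```cpp
#include <cstdint>

// Maximum number of keys in a tree of the given height (the root is at
// height 0) when every node holds keys_per_node keys and has
// keys_per_node + 1 children.
std::uint64_t max_keys(std::uint64_t keys_per_node, unsigned height) {
    std::uint64_t nodes_on_level = 1;
    std::uint64_t total_keys = 0;
    for (unsigned level = 0; level <= height; ++level) {
        total_keys += nodes_on_level * keys_per_node;
        nodes_on_level *= keys_per_node + 1;
    }
    return total_keys;
}
```

With 1000 keys per node, max_keys(1000, 2) is 1,003,003,000, so a tree only two levels below the root can index about a billion keys.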

Of course, this discussion considers only the hard disk or tape drive; real architectures do better. With only 20,000 articles and enough main memory, all of them could even be read into memory once to avoid disk I/O entirely, although this brings the awkward problem of very slow writes. Data files in today's databases and IR systems often run to several GB, so avoiding frequent disk reads and writes is the first priority when improving performance.

Do not conclude, however, that an IR system is simply better than a relational database. One weakness of an IR system is that insertions, modifications, and deletions are relatively slow, because they have to pass through several layers of processing. The primary job of an IR system is therefore retrieval, with storage coming second.
Boolean search
Although the IR system can segment words for us, sometimes we want to "help" it understand what we are looking for. For example, in the search bar of Baidu or Google we may type "我们 吃饭" (two words separated by a space) to find articles about both "我们" (we) and "吃饭" (eating), rather than simply typing "我们吃饭". The two inputs give completely different results, because "我们吃饭" as a whole has already become a term in Google's IR system.

Input like "我们 吃饭" is in fact a Boolean search. In Xapian, multiple terms are joined with AND, OR, or AND_NOT. For example:

t1 indexes documents 1 2 3 5 8
t2 indexes documents 2 3 6

Then:

t1 AND t2 retrieves 2 3
t1 OR t2 retrieves 1 2 3 5 6 8
t1 AND_NOT t2 retrieves 1 5 8
t2 AND_NOT t1 retrieves 6
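These results are ordinary set operations on sorted lists of document IDs. The following sketch reproduces them with the C++ standard library; it only illustrates the semantics and says nothing about how Xapian implements the operators internally. The type and function names are made up.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

typedef std::vector<unsigned> Postings;  // document IDs in ascending order

// Documents indexed by both terms.
Postings boolean_and(const Postings& a, const Postings& b) {
    Postings out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));
    return out;
}

// Documents indexed by either term.
Postings boolean_or(const Postings& a, const Postings& b) {
    Postings out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(),
                   std::back_inserter(out));
    return out;
}

// Documents indexed by the first term but not by the second.
Postings boolean_and_not(const Postings& a, const Postings& b) {
    Postings out;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(),
                        std::back_inserter(out));
    return out;
}
```

Calling these with the two lists 1 2 3 5 8 and 2 3 6 gives exactly the four results listed above.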

In many systems, the documents retrieved this way are not ranked by relevance. In Xapian, however, a Boolean query can select a set of documents, which is then ranked probabilistically.
Probabilistic IR and relevance
Boolean search is the most widely used kind of search, but it can no longer carry an IR system on its own, because its results are not made more user-friendly by any mechanism; users must understand the IR system thoroughly to use it effectively. Even so, pure Boolean IR systems are still alive and well.

Relevance is the core concept of the probabilistic model: a set of documents can be ranked by relevance. In essence, a document is relevant if it is what the user wants, and irrelevant otherwise. Under ideal conditions, every retrieved document is relevant and every document not retrieved is irrelevant, a purely black-and-white notion. Retrieval is rarely perfect, though, so the two sets never match exactly, and relevance is therefore expressed as a degree of association between the query and a document. The theory behind this is quite complicated.

Xapian's default weighting scheme is BM25Weight. It computes a weight from a fixed formula over statistics such as term frequency and per-document statistics; the higher the weight, the higher the relevance. If you do not want to use BM25Weight, you can use BoolWeight, under which every weight is 0. Ranking is discussed further in later articles.
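To give an idea of what such a formula looks like, here is the textbook form of the Okapi BM25 weight for a single term. It is only a sketch: Xapian's BM25Weight has its own parameters and may differ in detail, and the function name and the default values k1 = 1.2 and b = 0.75 are common textbook choices, not values taken from Xapian.

```cpp
#include <cmath>

// Textbook BM25 weight of one term in one document.
//   tf          : occurrences of the term in the document
//   doc_len     : length of the document
//   avg_doc_len : average document length in the collection
//   n_docs      : number of documents in the collection
//   doc_freq    : number of documents containing the term
double bm25_term_weight(double tf, double doc_len, double avg_doc_len,
                        double n_docs, double doc_freq,
                        double k1 = 1.2, double b = 0.75) {
    // Rare terms get a larger inverse document frequency.
    double idf = std::log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0);
    // Term frequency saturates, and long documents are penalized.
    double length_norm = 1.0 - b + b * doc_len / avg_doc_len;
    double tf_part = tf * (k1 + 1.0) / (tf + k1 * length_norm);
    return idf * tf_part;
}
```

The weight grows when the term occurs more often in the document or in fewer documents of the collection, and shrinks when the document is longer than average.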
Combined search
By default, Xapian lets you use an arbitrarily complex Boolean query expression to narrow the search scope and then ranks the results probabilistically (some Boolean systems restrict query expressions to a particular form).

There are two combinations of Boolean search and probabilistic search:

Use a Boolean search to obtain a subset of all documents, and then use probabilistic search in this subset.
Perform probabilistic search first, and then use Boolean search to filter query results.

The two methods give slightly different results. For example, suppose a database contains documents in two languages, English and French. The word "grand" exists in both languages (with similar meanings) but is more common in French. With the first method, where a Boolean search first restricts the search to the English subset, the word receives a higher weight, because it is rarer within that subset than in the database as a whole.
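The effect can be illustrated with a simple inverse document frequency, log(N / n), where N is the number of documents considered and n the number that contain the term. This is a deliberately simplified stand-in for a real weighting formula, and the document counts in the note below are invented for the example.

```cpp
#include <cmath>

// Simplified inverse document frequency: the rarer the term among the
// documents considered, the larger the value.
double idf(double n_docs, double docs_with_term) {
    return std::log(n_docs / docs_with_term);
}
```

Suppose, hypothetically, that 20 of 1,000 English documents and 300 of 1,000 French documents contain "grand". Over the whole database the value is idf(2000, 320), roughly 1.8, while within the English subset it is idf(1000, 20), roughly 3.9, so the word weighs more when the subset is selected first.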

The first method is more accurate, but it is inefficient, so Xapian deliberately implements and optimizes the second. This does not mean that Xapian really performs the probabilistic search first and the Boolean search afterwards; in fact it performs both at the same time, which allows several internal optimizations. For example, once the probabilistic search has determined that no further documents can make it into the result set, Xapian abandons the Boolean AND operation in progress. These optimizations are estimated to improve performance several-fold, and even more for queries with many terms.
QueryParser
In an IR system, end users enter input in a format agreed with the system; this input is called a "query". The IR system passes the input to the query processor, which is also part of the IR system. It parses the query, matches documents, ranks the result set, and returns the results to the end user.

In Xapian, the Query class plays the role of the "query". There are two ways to build one: let the QueryParser class parse a query string, or create several Query objects representing different subexpressions and combine them as needed.

The following is the syntax supported by Xapian::QueryParser. It is similar to the syntax of other IR systems.

AND

expression AND expression matches the intersection of the documents matched by the two expressions.

OR

expression OR expression matches the union of the documents matched by the two expressions.

NOT

expression NOT expression matches the documents that match only the left-hand expression.

If the FLAG_PURE_NOT flag is set, NOT expression (with no left-hand side) matches the documents that do not match the expression.

XOR

expression XOR expression matches the documents that match exactly one of the two expressions, not both.

Bracketed expressions

Parentheses can be used around Boolean operators to control precedence, for example: (one OR two) AND three.

+ and -

A group of terms, some of them marked with + or -, matches documents that contain all of the + terms and none of the - terms. Terms not marked with + or - contribute to the document ranking.
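The matching rule for + and - can be sketched without Xapian at all, treating a document as a plain set of terms. The names below are invented, and counting the unmarked terms is only a crude stand-in for their real contribution to the ranking.

```cpp
#include <cstddef>
#include <set>
#include <string>
#include <vector>

typedef std::set<std::string> TermSet;

// True when the document contains every "+" term and none of the "-" terms.
bool matches(const TermSet& doc,
             const std::vector<std::string>& plus,
             const std::vector<std::string>& minus) {
    for (std::size_t i = 0; i < plus.size(); ++i)
        if (doc.count(plus[i]) == 0) return false;
    for (std::size_t i = 0; i < minus.size(); ++i)
        if (doc.count(minus[i]) != 0) return false;
    return true;
}

// Number of unmarked terms present in the document; these do not decide
// whether the document matches, they only influence its rank.
std::size_t unmarked_hits(const TermSet& doc,
                          const std::vector<std::string>& unmarked) {
    std::size_t hits = 0;
    for (std::size_t i = 0; i < unmarked.size(); ++i)
        if (doc.count(unmarked[i]) != 0) ++hits;
    return hits;
}
```

For a query with +canned +fish -japan, a document containing canned, fish, and import matches; adding japan to it, or removing fish, makes it fail.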

NEAR

one NEAR two NEAR three matches documents in which the three words occur within 10 positions of each other. Where do word positions come from? The article "Build Your Own Search Engine with Xapian: Document, Term, and Value" described how to use the add_posting method of the Document class to add terms with position information.

The default NEAR distance is 10; it can be changed with NEAR/n, for example one NEAR/6 two.

ADJ

ADJ is like NEAR, but the terms must also appear in the same order as in the query. So one ADJ two ADJ three matches documents in which one, two, and three appear in that order within 10 positions.
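The difference between the two operators can be shown for a pair of terms whose positions in a document are known. This is a simplified sketch: the function names are invented, and the exact boundary rule (whether a distance of exactly the window size counts) is an assumption here and may not match Xapian's.

```cpp
#include <algorithm>
#include <vector>

typedef std::vector<long> Positions;  // word positions in ascending order

// NEAR: some occurrence of a and some occurrence of b lie at most
// `window` positions apart, in either order.
bool near_match(const Positions& a, const Positions& b, long window) {
    for (Positions::const_iterator pa = a.begin(); pa != a.end(); ++pa) {
        // First occurrence of b that is not too far to the left of *pa.
        Positions::const_iterator pb = std::lower_bound(b.begin(), b.end(), *pa - window);
        if (pb != b.end() && *pb <= *pa + window) return true;
    }
    return false;
}

// ADJ: as NEAR, but the occurrence of a must come before that of b.
bool adj_match(const Positions& a, const Positions& b, long window) {
    for (Positions::const_iterator pa = a.begin(); pa != a.end(); ++pa) {
        // First occurrence of b strictly after *pa.
        Positions::const_iterator pb = std::upper_bound(b.begin(), b.end(), *pa);
        if (pb != b.end() && *pb - *pa <= window) return true;
    }
    return false;
}
```

If "one" occurs at positions 3 and 40 and "two" at positions 10 and 100, then with a window of 10 both near_match and adj_match succeed for (one, two), while adj_match fails for (two, one) because "two" never precedes "one" closely enough.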

Phrase search

A phrase is enclosed in double quotes. Phrase search is also useful for matching things such as file names or email addresses.

Field names

If terms in the database were indexed with a prefix, you can use QueryParser's add_prefix method to set up the prefix mapping. For example, queryparser.add_prefix("subject", "S") maps subject to S. If a term's value is "Stitle", it can then be retrieved with an expression such as subject:title. You may recall that Google supports similar syntax: entering "site:www.wlstock.com stock" in the Google search bar retrieves only the pages about stocks on www.wlstock.com. This feature is the counterpart of fields in Lucene.

Range search

In Xapian, range search is provided by the Xapian::ValueRangeProcessor class, available since Xapian 1.0.0. As the name suggests, it can only search ranges of values, not ranges of terms.

Xapian::ValueRangeProcessor is an abstract base class, so in practice you use one of its subclasses. Xapian provides three ready-made subclasses: StringValueRangeProcessor, DateValueRangeProcessor, and NumberValueRangeProcessor. You can also derive your own subclass from Xapian::ValueRangeProcessor.

When a Xapian::ValueRangeProcessor is used, it is passed the start and end of the range. If the subclass cannot make sense of the range it was given, it returns Xapian::BAD_VALUENO.

Take StringValueRangeProcessor as an example. If the database stores the author name in value number 4 (values are identified by number; see "Build Your Own Search Engine with Xapian: Document, Term, and Value" for details), the query expression can be written as mars asimov..bradbury. That alone is not enough, though: you also need to create a StringValueRangeProcessor.

    Xapian::QueryParser qp;
    Xapian::StringValueRangeProcessor author_proc(4);
    qp.add_valuerangeprocessor(&author_proc);

When QueryParser parses the query expression, it uses the OP_VALUE_RANGE operator, so the description of the resulting query is:

    Xapian::Query(mars:(pos=1) FILTER (VALUE_RANGE 4 asimov bradbury))

The subexpression (VALUE_RANGE 4 asimov bradbury) matches only documents whose value number 4 is >= asimov and <= bradbury (using string comparison).
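"String comparison" here means plain byte-wise ordering, which has some consequences worth seeing. The helper below is a hypothetical illustration using std::string, not Xapian code.

```cpp
#include <string>

// True when value lies in the inclusive range [begin, end] under
// byte-wise string comparison.
bool in_value_range(const std::string& value,
                    const std::string& begin,
                    const std::string& end) {
    return value >= begin && value <= end;
}
```

With the range asimov to bradbury, the values asimov, banks, and bradbury are inside, while clarke is outside. Note that Asimov with a capital A is also outside, because uppercase letters sort before lowercase ones in byte order; values need to be stored in a consistent form for range search to behave as expected.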

Value range search is not complex. For more information, see http://www.xapian.org/docs/valueranges.html.

Synonyms

QueryParser also supports synonym search, using the syntax ~term. How to add synonyms is described later.

Wildcards

QueryParser supports wildcards in the form of a trailing "*", so "wildc*" matches "wildcard", "wildcarded", "wildcards", "wildcat", and "wildcats". This feature is disabled by default. To enable it, pass Xapian::QueryParser::FLAG_WILDCARD in the flags argument of Xapian::QueryParser::parse_query(query_string, flags), and tell the QueryParser which database to expand wildcards from with set_database.
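Conceptually, a trailing wildcard is expanded by collecting every indexed term that starts with the given prefix. Over a sorted term list this is a short loop, as in the sketch below; the function name is made up, and this is not how Xapian itself walks its term list.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Return all terms in a sorted term list that start with the given prefix.
std::vector<std::string> expand_wildcard(const std::vector<std::string>& sorted_terms,
                                         const std::string& prefix) {
    std::vector<std::string> matches;
    // Terms sharing a prefix are contiguous in sorted order, so jump to the
    // first candidate and stop at the first term that no longer matches.
    std::vector<std::string>::const_iterator it =
        std::lower_bound(sorted_terms.begin(), sorted_terms.end(), prefix);
    for (; it != sorted_terms.end() && it->compare(0, prefix.size(), prefix) == 0; ++it)
        matches.push_back(*it);
    return matches;
}
```

Given the sorted terms wild, wildcard, wildcarded, wildcards, wildcat, wildcats, and wilder, expanding the prefix wildc returns the five terms from wildcard to wildcats.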
Query
If you do not want to use a query string, you can combine Query objects with the following operators:

OP_AND
Equivalent to AND in QueryParser.

OP_OR
Equivalent to OR in QueryParser.

OP_AND_NOT
Equivalent to AND_NOT in QueryParser.

OP_XOR
Equivalent to XOR in QueryParser.

OP_AND_MAYBE
Returns only the documents that match the left-hand subexpression, but documents that match both sides receive weight from both.

OP_FILTER
Works like AND, but only the left-hand subexpression contributes to the weight calculation.

OP_NEAR
Equivalent to NEAR in QueryParser.

OP_PHRASE
Equivalent to phrase search in QueryParser.

OP_VALUE_RANGE
Equivalent to range search in QueryParser.

OP_SCALE_WEIGHT
Scales the weight of a subexpression. If the factor is 0, the subexpression becomes a pure Boolean query.

OP_ELITE_SET
Works like OP_OR, but sometimes performs better than OP_OR. A detailed explanation is available at http://trac.xapian.org/wiki/FAQ/EliteSet

OP_VALUE_GE
Matches documents whose value is greater than or equal to the given value.

OP_VALUE_LE
Matches documents whose value is less than or equal to the given value.
How to create a query containing a single term

You can use the single-argument constructor:

    Xapian::Query query(term);

You can also use the constructor with more parameters:

    Xapian::Query(const std::string &tname_,
                  Xapian::termcount wqf_ = 1,
                  Xapian::termpos term_pos_ = 0)

Here wqf stands for "within query frequency" and specifies the weight of the term within the query. If the whole query contains only one term, this parameter is of little use; in a combined query it shows its power, because the result set becomes more relevant to this term.

term_pos is the position of the term in the query. It, too, is of little use when the query contains only one term, so it is generally used in phrase searches.

Combining multiple queries

The query operators listed above are used to combine queries, through another constructor of Xapian::Query:

    Xapian::Query(Xapian::Query::op op_,
                  const Xapian::Query &left,
                  const Xapian::Query &right)

Probabilistic query

A typical probabilistic query simply joins terms with Xapian::Query::OP_OR. For example:

    Xapian::Query query("regulation");
    query = Xapian::Query(Xapian::Query::OP_OR, query, Xapian::Query("import"));
    query = Xapian::Query(Xapian::Query::OP_OR, query, Xapian::Query("export"));
    query = Xapian::Query(Xapian::Query::OP_OR, query, Xapian::Query("canned"));
    query = Xapian::Query(Xapian::Query::OP_OR, query, Xapian::Query("fish"));

This style is rather bloated, however. The following is tidier:

    vector<string> terms;
    terms.push_back("regulation");
    terms.push_back("import");
    terms.push_back("export");
    terms.push_back("canned");
    terms.push_back("fish");
    Xapian::Query query(Xapian::Query::OP_OR, terms.begin(), terms.end());
Boolean query

Suppose we have this Boolean query expression:

    ('EEC' - 'France') and ('1989' or '1991' or '1992') and 'Corporate Law'

It could be built up as bquery like this:

    Xapian::Query bquery1(Xapian::Query::OP_AND_NOT, "EEC", "France");
    Xapian::Query bquery2("1989");
    bquery2 = Xapian::Query(Xapian::Query::OP_OR, bquery2, "1991");
    bquery2 = Xapian::Query(Xapian::Query::OP_OR, bquery2, "1992");
    Xapian::Query bquery3("Corporate Law");
    Xapian::Query bquery(Xapian::Query::OP_AND, bquery1,
                         Xapian::Query(Xapian::Query::OP_AND, bquery2, bquery3));

You can also attach the bquery object built above to a probabilistic query as a Boolean filter on the result set:

    query = Xapian::Query(Xapian::Query::OP_FILTER, query, bquery);

+ and - operators

Take, for example, the query expression regulation import export +canned +fish -japan

It is converted to a Query as follows:

    vector<string> plus_terms;
    vector<string> minus_terms;
    vector<string> normal_terms;

    plus_terms.push_back("canned");
    plus_terms.push_back("fish");

    minus_terms.push_back("japan");

    normal_terms.push_back("regulation");
    normal_terms.push_back("import");
    normal_terms.push_back("export");

    // Documents must contain all + terms; unmarked terms only add weight.
    Xapian::Query query(Xapian::Query::OP_AND_MAYBE,
                        Xapian::Query(Xapian::Query::OP_AND, plus_terms.begin(), plus_terms.end()),
                        Xapian::Query(Xapian::Query::OP_OR, normal_terms.begin(), normal_terms.end()));

    // Exclude documents containing any - term.
    query = Xapian::Query(Xapian::Query::OP_AND_NOT,
                          query,
                          Xapian::Query(Xapian::Query::OP_OR, minus_terms.begin(), minus_terms.end()));
Practice
Once a query object has been created with the QueryParser class or the Query class, you only need to instantiate an Enquire object to use it. For example:

    Xapian::Database db("Index");
    Xapian::Enquire enquire(db);
    enquire.set_query(query);

Of course, there is more to do: fetching the result set, sorting it, or expanding the query. We will continue with these in the next article.
