Original: The most comprehensive and in-depth interpretation of the BM25 model and an in-depth explanation of Lucene ranking (Shankiang)

Source: Internet
Author: User
Tags: idf

Optimizing vertical search results involves both controlling which results are returned and optimizing their ranking, and ranking is the most critical part. In this article we will walk through the evolution of ranking models for vertical search and derive the BM25 ranking model. We will then show how to modify Lucene's scoring source code, and a follow-up post will dig into learning to rank, which is currently a hot topic in vertical search. The article is structured as follows:

1. A brief introduction to the VSM model;

2. An introduction to Lucene's default scoring formula;

3. An introduction to the Binary Independence Model (BIM) among probabilistic retrieval models;

4. An introduction to BM25;

5. An introduction to the Edismax parser and to the scoring formula source code;

6. Modifying the scoring source code;

7. Learning to rank: ① why learning to rank is needed, ② an introduction to learning-to-rank algorithms, ③ a reading of the original English paper on the ListNet algorithm, ④ implementing learning to rank.

Part 1: The VSM model

VSM stands for Vector Space Model; it is mainly used to compute the similarity between documents. To compute document similarity, important features must first be extracted, and feature extraction usually relies on the most common method: the TF-IDF algorithm. The method is very simple but very practical. Given an article, run it through a Chinese word segmenter (the author considers HanLP, the open-source package from the OpenNLP community, to be the best at present), turn the document into a term vector (after removing stop words), compute TF-IDF for each term, and sort in descending order; the terms ranked at the top are the important features. TF-IDF itself is not discussed here because it is so simple. For a query q, word segmentation produces a query vector t = [t1, t2, ...], and each t is given a weight. Suppose the query retrieves n documents in total; each document is turned into a vector over the same terms t, and the TF-IDF of each t is computed in its document. The cosine similarity between the query vector and each document vector is then computed, and the scores are sorted in descending order. A small code sketch of this pipeline follows.
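To make the VSM pipeline above concrete, here is a minimal, self-contained sketch (my illustration, not code from the original post; class and method names are invented) that computes TF-IDF vectors for already-segmented documents and ranks them by cosine similarity against the query:

import java.util.*;

// Minimal TF-IDF + cosine-similarity ranking sketch (illustrative names, not the post's code).
public class VsmDemo {

    // idf(t) = log(N / df(t)), computed over the whole collection.
    static Map<String, Double> idf(List<List<String>> docs) {
        Map<String, Double> df = new HashMap<>();
        for (List<String> d : docs)
            for (String t : new HashSet<>(d)) df.merge(t, 1.0, Double::sum);
        Map<String, Double> idf = new HashMap<>();
        for (Map.Entry<String, Double> e : df.entrySet())
            idf.put(e.getKey(), Math.log(docs.size() / e.getValue()));
        return idf;
    }

    // TF-IDF vector of one token list.
    static Map<String, Double> tfidf(List<String> tokens, Map<String, Double> idf) {
        Map<String, Double> tf = new HashMap<>();
        for (String t : tokens) tf.merge(t, 1.0, Double::sum);
        Map<String, Double> v = new HashMap<>();
        for (Map.Entry<String, Double> e : tf.entrySet())
            v.put(e.getKey(), e.getValue() * idf.getOrDefault(e.getKey(), 0.0));
        return v;
    }

    static double cosine(Map<String, Double> a, Map<String, Double> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Double> e : a.entrySet()) {
            dot += e.getValue() * b.getOrDefault(e.getKey(), 0.0);
            na += e.getValue() * e.getValue();
        }
        for (double x : b.values()) nb += x * x;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        // Documents are assumed to be already segmented and stop-word filtered.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("lucene", "scoring", "vsm"),
                Arrays.asList("bm25", "probabilistic", "model"),
                Arrays.asList("lucene", "bm25", "ranking"));
        List<String> query = Arrays.asList("lucene", "bm25");

        Map<String, Double> idfMap = idf(docs);
        Map<String, Double> qv = tfidf(query, idfMap);
        for (List<String> d : docs)
            System.out.println(d + " -> " + cosine(qv, tfidf(d, idfMap)));
    }
}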

The essence of VSM is to compute the similarity between the query and the document content; relevance is not taken into account. But when a user enters a query, what they most want is a document that is highly relevant to their need, not merely one that happens to contain the query words, and a document containing the query words is not necessarily relevant to it. That is why a probabilistic model needs to be introduced. The essence of the BIM and BM25 models discussed next is to compute the similarity between the query and the user's need, which is why BM25 performs well. Lucene's default scoring at the bottom layer extends the VSM. The second part below introduces Lucene's default scoring formula.

Part 2: Lucene's default scoring formula

The Lucene scoring system/mechanism (Lucene scoring) is a core part of Lucene's reputation. It hides many complicated details from the user, which makes Lucene easy to use. Personally, however, I think that if you want to adjust scoring (or construct your own ranking) for your own application, a thorough understanding of Lucene's scoring mechanism is very important.

Lucene scoring combines the Vector Space Model with the Boolean model of information retrieval.

First look at Lucene's scoring formula (described in the Similarity class); a small numeric sketch of how the factors combine follows the factor list below.

score(q,d) = coord(q,d) · queryNorm(q) · Σ ( tf(t in d) · idf(t)² · t.getBoost() · norm(t,d) )
(the sum runs over every term t in the query q)

where:

    1. tf(t in d) is the term-frequency factor: the frequency with which term t appears in document d. The default implementation is:
      tf(t in d) = sqrt(frequency)
    2. idf(t) is the inverse-document-frequency factor; the document frequency docFreq is the number of documents in which term t appears. The smaller docFreq is, the higher idf(t) is (the rarer the term, the more valuable it is), but for a given query the value is the same for all documents. Default implementation:
      idf(t) = 1 + log( numDocs / (docFreq + 1) )
    3. About idf(t), note the following: suppose a word appears in n documents and the document collection contains N documents in total. idf(t) originates from information theory: the probability of this word appearing in a document is n/N, so the information carried by its appearance is -log(n/N). This is analogous to information entropy (-p(x)·log p(x)); in data mining, filter-based feature selection uses mutual information, which is essentially a computation of information gain, as do decision trees. Transforming -log(n/N) into log(N/n) and adding smoothing to avoid zeros yields the formula above (just as naive Bayes needs Laplace smoothing).
    4. coord(q,d) is a score factor based on how many of the query terms appear in the document. The more query terms a document contains, the better it matches the query. The default is the fraction of query terms that appear in the document.
    5. queryNorm(q) normalizes the query so that scores from different queries can be compared. This factor does not affect the ranking of documents, because it applies equally to all documents. Default value:
      queryNorm(q) = queryNorm(sumOfSquaredWeights) = 1 / sqrt(sumOfSquaredWeights)

      The sum of squared weights of the query terms (sumOfSquaredWeights) is computed by the Weight class. For example, BooleanQuery computes it as:

      sumOfSquaredWeights = q.getBoost()² · Σ ( idf(t) · t.getBoost() )²
      (the sum runs over every term t in the query q)
    6. t.getBoost() is the boost of term t specified at query time (for example: java^1.2), or set programmatically via setBoost().
    7. norm(t,d) compresses several boost and length factors computed at index time:
        • Document boost - document-level weighting, set with doc.setBoost() before indexing
        • Field boost - field-level weighting, set with field.setBoost() before indexing
        • lengthNorm(field) - computed from the number of tokens in the field; the shorter the field, the higher the value. It is computed by Similarity.lengthNorm at index time.
      All of these factors are multiplied to produce the norm value, and if a document has several fields with the same name, their boosts are multiplied together:
      norm(t,d) = doc.getBoost() · lengthNorm(field) · Π f.getBoost()
      (the product runs over every field f in d named as t's field)

      At index time the norm value is compressed (encoded) into a single byte and stored in the index. At search time it is decompressed (decoded) back into a float; the encode/decode is provided by Similarity. The official documentation notes that, due to precision loss, the process is not reversible, for example decode(encode(0.89)) = 0.75.
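As a quick numeric illustration of how factors 1-5 above combine (my sketch, with made-up statistics; Lucene computes all of these values internally):

// Toy evaluation of Lucene's default (TF-IDF) formula for a single-term query.
// All numbers are invented for illustration.
public class DefaultScoreSketch {
    public static void main(String[] args) {
        long numDocs = 1_000_000;   // documents in the index
        long docFreq = 1_000;       // documents containing the term
        int freq = 4;               // occurrences of the term in this document
        float boost = 1.0f;         // t.getBoost()
        float norm = 0.25f;         // decoded norm(t,d): doc boost * lengthNorm * field boost
        float coord = 1.0f;         // 1/1 query terms matched

        double tf = Math.sqrt(freq);                                  // tf(t in d) = sqrt(frequency)
        double idf = 1 + Math.log((double) numDocs / (docFreq + 1));  // idf(t)
        double sumOfSquaredWeights = Math.pow(idf * boost, 2);        // one-term BooleanQuery
        double queryNorm = 1 / Math.sqrt(sumOfSquaredWeights);

        double score = coord * queryNorm * (tf * idf * idf * boost * norm);
        System.out.println("score(q,d) = " + score);
    }
}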

Overall, this scoring formula still computes a similarity between the query and the document content. Also, lengthNorm(field) = 1/sqrt(numTerms), i.e. the longer the indexed field of a document, the lower the score. This is obviously questionable and needs improvement. Moreover, the formula only considers the tf in the document vector and ignores the tf in the query vector t; and a longer document will generally have a higher tf, positively correlated with its length, which is unfair to short documents when tf is computed. To score with this formula the document vector needs to be normalized, and how lengthNorm should handle that is the question. An analogy: in badminton, the racket has an area that gives the best hit and return, known as the "sweet spot". When handling document-vector length we can likewise specify a "sweet zone", for example a min/max range, and set lengthNorm to 1 for lengths inside that range (a sketch of this idea follows below). Given the shortcomings above, we need to improve the ranking model so that results are more relevant to the query and the user's need, which brings in probabilistic models and leads to the third part below:
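A minimal sketch of the "sweet zone" idea, assuming Lucene 4.x's DefaultSimilarity and its lengthNorm(FieldInvertState) hook; the MIN/MAX bounds are arbitrary, and Lucene also ships a SweetSpotSimilarity that implements a more elaborate version of the same behaviour:

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

// Sketch: treat field lengths inside [MIN, MAX] as a "sweet zone" with no length penalty.
public class SweetZoneSimilarity extends DefaultSimilarity {
    private static final int MIN = 5;    // arbitrary lower bound of the sweet zone
    private static final int MAX = 200;  // arbitrary upper bound of the sweet zone

    @Override
    public float lengthNorm(FieldInvertState state) {
        int numTerms = state.getLength();  // token count of the field
        if (numTerms >= MIN && numTerms <= MAX) {
            return state.getBoost();       // inside the sweet zone: no length penalty
        }
        return state.getBoost() * (float) (1.0 / Math.sqrt(numTerms)); // default behaviour otherwise
    }
}

Such a Similarity would have to be set both at index time (IndexWriterConfig.setSimilarity) and at search time (IndexSearcher.setSimilarity), and existing norms only change after reindexing.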

Part 3: The Binary Independence Model (BIM) in probabilistic retrieval

The probabilistic retrieval model is derived from the probability ranking principle, so understanding this principle is very important for understanding probabilistic retrieval models. The idea of the probability ranking principle is that, given a query, the returned documents should be sorted by the probability of their relevance to the query and the user's need; it is a way of modeling relevance to the user's need. Think of it as follows: the documents retrieved for a query can be divided into two classes, relevant documents and non-relevant documents, which can be treated as a naive-Bayes-style generative classification. If the probability that a document is relevant is greater than the probability that it is non-relevant, it is a relevant document; otherwise it is non-relevant. So a probabilistic model is introduced: P(R|D) is the probability that document D is relevant, and P(NR|D) is the probability that it is non-relevant. If P(R|D) > P(NR|D), the document is relevant to the query and is what the user wants. Following this idea, how do we compute these probabilities? If you are familiar with naive Bayes it is easy: P(R|D) = P(D|R)·P(R)/P(D) and P(NR|D) = P(D|NR)·P(NR)/P(D). Deciding relevance with the probabilistic model then means checking whether P(R|D) > P(NR|D), i.e. P(D|R)·P(R)/P(D) > P(D|NR)·P(NR)/P(D) <=> P(D|R)·P(R) > P(D|NR)·P(NR) <=> P(D|R)/P(D|NR) > P(NR)/P(R). For search we do not actually need to classify; it is enough to compute P(D|R)/P(D|NR) for each document and sort in descending order. To estimate it, the Binary Independence Model (BIM) is introduced, which makes the following assumptions:

① Binary assumption: when a document is modeled as a vector, each feature is assumed to follow a Bernoulli distribution, taking the value 0 or 1 (naive Bayes assumes a Bernoulli distribution for both the features and the class label, while logistic regression assumes it for the class label). In text processing this means we only care whether a feature appears in the document or not, regardless of term frequency.

② Term-independence assumption: the terms that make up the features are assumed to be mutually independent, with no correlations between them. In machine learning, joint or conditional likelihood estimation likewise assumes the data are i.i.d. In reality the independence assumption is quite unreasonable; for example, "Jobs", "iPad" and "Apple" are clearly correlated.

With the assumptions above, the probabilities can be computed. For example, take a document D and a query vector consisting of 5 terms whose presence pattern in D is [1,0,1,0,1]. Then P(D|R) = p1·(1-p2)·p3·(1-p4)·p5, where pi is the probability that feature i appears in a relevant document; the second and fourth features do not appear in D, hence the factors (1-p2) and (1-p4). This is the likelihood under the relevant class; as a generative model we also need the non-relevant class. Using si for the probability that feature i appears in a non-relevant document, P(D|NR) = s1·(1-s2)·s3·(1-s4)·s5.

Taking the ratio of the two and grouping factors gives

P(D|R) / P(D|NR) = Π(di=1) (pi/si) · Π(di=0) ((1-pi)/(1-si))

where the first product runs over the features that appear in D and the second over the features that do not appear in D. A further transformation gives:

P(D|R) / P(D|NR) = Π(di=1) [ pi·(1-si) / (si·(1-pi)) ] · Π(all i) [ (1-pi)/(1-si) ]

In this form, the first product involves only the features appearing in the document, while the second product runs over all the features and is a global quantity: for a given query it is the same for every document and has no effect on the ranking, so it can be dropped. The ranking value is therefore Π(di=1) [ pi·(1-si) / (si·(1-pi)) ]. For computational convenience we take the logarithm: Σ(di=1) log[ pi·(1-si) / (si·(1-pi)) ]. Estimating the probabilities then gives, for each feature, the weight log[ ((ri+0.5)/(R-ri+0.5)) / ((ni-ri+0.5)/(N-ni-R+ri+0.5)) ], where N is the total number of documents in the collection, R is the number of relevant documents (so N-R is the number of non-relevant documents), ni is the number of documents containing feature di, and ri is the number of relevant documents among them (the 0.5 terms are smoothing). So, when a query q comes in and documents are returned, we only sum these weights over the features that actually occur in the document, in the same spirit as prediction in naive Bayes. Under certain conditions this formula reduces to the IDF model. It is the foundation of the BM25 model; a small computational sketch follows, and then part four.
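As a small illustration (mine, not the post's), the per-feature BIM weight with 0.5 smoothing can be computed directly from the four counts:

// BIM term weight with 0.5 smoothing: log( ((r+0.5)/(R-r+0.5)) / ((n-r+0.5)/(N-n-R+r+0.5)) )
public class BimWeight {
    static double weight(long N, long R, long n, long r) {
        double relevant = (r + 0.5) / (R - r + 0.5);                 // odds of the feature in relevant docs
        double nonRelevant = (n - r + 0.5) / (N - n - R + r + 0.5);  // odds of the feature in non-relevant docs
        return Math.log(relevant / nonRelevant);
    }

    public static void main(String[] args) {
        // With no relevance information (R = r = 0) the weight degenerates to an IDF-like value.
        System.out.println(weight(1_000_000, 0, 1_000, 0)); // = log((N-n+0.5)/(n+0.5))
    }
}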

Part 4: The BM25 model

The BIM model is derived under the binary independence assumption and only considers whether a feature is present, ignoring the tf factor. Taking the tf factor into account on top of it makes the model more complete, which is what the BM25 model does; it adds term-frequency weights for both the document and the query vector, plus a set of empirical tuning factors. The formula is as follows:

score(q,d) = Σ(ti in q) [ log( ((ri+0.5)/(R-ri+0.5)) / ((ni-ri+0.5)/(N-ni-R+ri+0.5)) ) · ((k1+1)·fi)/(K+fi) · ((k2+1)·qfi)/(k2+qfi) ]

where K = k1·((1-b) + b·dl/avdl), dl is the length of document D and avdl is the average document length in the collection.

The first factor is the formula derived from the BIM model; since we do not know in advance which documents are relevant and which are not, we set ri and R to 0, and the first factor degenerates into log((N-ni+0.5)/(ni+0.5)) — which is essentially IDF. Quite remarkable! In the remaining factors, fi is the tf weight of the feature in document D and qfi is its tf weight in the query vector. Typical values are k1 = 1.2, b = 0.75, k2 = 200; when the query is short, qfi usually takes the value 1. Analyzing the formula, when k1 = 0 the second factor has no effect, i.e. the tf weight in the document is ignored, and when k2 = 0 the third factor likewise has no effect. So k1 and k2 act as penalty factors for the tf weights of the feature in the document and in the query vector respectively. Altogether, BM25 takes four factors into account: the IDF factor, the document length factor, the document term-frequency factor and the query term-frequency factor. With the four parts above, most readers should have a deep understanding of Lucene's scoring formula. Next we move on to reading and modifying the source code, mainly so that the ranking can be customized for specific business scenarios. Part five follows, after a short BM25 sketch below.
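Here is a minimal BM25 sketch (my illustration; the collection statistics and parameter values are invented) that follows the formula above with R = ri = 0:

// Minimal BM25 scorer for one document, assuming no relevance information (R = r_i = 0).
public class Bm25Sketch {
    static final double K1 = 1.2, B = 0.75, K2 = 200;

    /**
     * @param N    total number of documents in the collection
     * @param n    number of documents containing each query term
     * @param f    frequency of each query term in the document
     * @param qf   frequency of each query term in the query
     * @param dl   length of the document
     * @param avdl average document length in the collection
     */
    static double score(long N, long[] n, int[] f, int[] qf, double dl, double avdl) {
        double K = K1 * ((1 - B) + B * dl / avdl);
        double score = 0;
        for (int i = 0; i < n.length; i++) {
            double idf = Math.log((N - n[i] + 0.5) / (n[i] + 0.5));   // degenerated BIM factor
            double docTf = ((K1 + 1) * f[i]) / (K + f[i]);            // document tf factor
            double queryTf = ((K2 + 1) * qf[i]) / (K2 + qf[i]);       // query tf factor
            score += idf * docTf * queryTf;
        }
        return score;
    }

    public static void main(String[] args) {
        // Two-term query against a 150-term document in a 1M-document collection (made-up numbers).
        System.out.println(score(1_000_000, new long[]{1_000, 50_000},
                new int[]{3, 1}, new int[]{1, 1}, 150, 120));
    }
}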

Part 5: The Edismax parser

This query parser is introduced because a particular business scenario requires it. Lucene's source has two core packages, org.apache.lucene.index and org.apache.lucene.search. The first uses the store, util and document sub-packages; the second interacts with the queryparser, analysis and message sub-packages. For querying, the most important piece is the QueryParser. When the user enters a query string and the Lucene query service is called, the QueryParser class is invoked. The first step is to call the Analyzer (tokenizer) to produce a query vector t[t1, t2, ..., tn]; this is the lexical analysis. Then comes the syntactic analysis, which builds the query syntax, i.e. query nodes that form a query tree; t1 and t2 are combined with a Boolean logical relation into a BooleanQuery. This is how Lucene understands the query syntax. To deepen the understanding, look at the code first:

package com.txq.lucene.queryParser;

import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.apache.lucene.queryparser.classic.ParseException;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.util.Version;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.PhraseQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;
import org.apache.lucene.search.BooleanClause.Occur;
import org.apache.lucene.index.Term;
import org.wltea.analyzer.lucene.IKAnalyzer; // IK Analyzer (package name assumed; not in the original imports)

/**
 * Customized query parser producing a BooleanQuery
 * @author Xueqiang Tong
 *
 */
public class BlankAndQueryParser extends QueryParser {

    Analyzer analyzer = new IKAnalyzer(false); // unused here; kept from the original

    public BlankAndQueryParser(Version matchVersion, String field, Analyzer analyzer) {
        super(matchVersion, field, analyzer);
    }

    protected Query getFieldQuery(String field, String queryText, int slop) throws ParseException {
        try {
            TokenStream ts = this.getAnalyzer().tokenStream(field, new StringReader(queryText));
            OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
            CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
            TypeAttribute type = ts.addAttribute(TypeAttribute.class);
            ts.reset();
            // Collect copies of the token texts (not the shared attribute instance).
            ArrayList<String> v = new ArrayList<String>();
            while (ts.incrementToken()) {
                System.out.println(offset.startOffset() + " - " + offset.endOffset()
                        + " : " + term.toString() + " | " + type.type());
                if (term.toString() == null) {
                    break;
                }
                v.add(term.toString());
            }
            ts.end();
            ts.close();
            if (v.size() == 0) {
                return null;
            } else if (v.size() == 1) {
                return new TermQuery(new Term(field, v.get(0)));
            } else {
                PhraseQuery q = new PhraseQuery();
                BooleanQuery b = new BooleanQuery();
                q.setBoost(2048.0f);
                b.setBoost(0.001f);
                for (int i = 0; i < v.size(); i++) {
                    String t = v.get(i);
                    q.add(new Term(field, t));
                    TermQuery tmp = new TermQuery(new Term(field, t));
                    tmp.setBoost(0.01f);
                    b.add(tmp, Occur.MUST);
                }
                // The phrase query is built but, as in the original, only the BooleanQuery is returned.
                return b;
            }
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }

    protected Query getFieldQuery(String field, String queryText) throws ParseException {
        return getFieldQuery(field, queryText, 0);
    }
}
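A minimal usage sketch for the parser above (my illustration; the field name, query text and Version constant are assumptions about the surrounding setup):

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.util.Version;
import org.wltea.analyzer.lucene.IKAnalyzer; // assumed IK Analyzer package

public class ParserUsageDemo {
    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new IKAnalyzer(false);
        BlankAndQueryParser parser = new BlankAndQueryParser(Version.LUCENE_4_9, "content", analyzer);
        Query query = parser.parse("lucene ranking");
        System.out.println(query); // hand the Query to an IndexSearcher, e.g. searcher.search(query, 10)
    }
}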
The code above shows the basic steps of a QueryParser. It needs a tokenizer; I used the tokenizer I wrote last year based on the reverse maximum matching algorithm, adapted from the IK segmenter. From the code you can basically see how a BooleanQuery works; the array-intersection algorithm I wrote last year is an abstraction of the Boolean AND query logic. Phrase queries are the most precise: the inverted index entries store the position information of the tokens, which is what supports the phrase query feature.

Now back to the topic of this fifth part. Suppose there is a ranking requirement from a business scenario: an e-commerce platform with heavy traffic and a high click volume needs a custom ranking that better fits the company's needs. The requirement is to combine three factors into the final score: how long the merchant has been on the platform, whether the merchant is a VIP, and the product CTR (click-through rate).

To accomplish this ranking task, let us first organize the approach. This requirement clearly cannot be satisfied by Lucene's existing scoring formula: it is an externally defined business ranking rule that has nothing to do with the underlying relevance ranking. Only Solr's Edismax parser can satisfy it. So the plan is: first understand how Edismax is used (for example, whether the rule can be implemented as an externally defined linear function); then understand the internal workings of the parser (read its source) to see how it relates to the underlying score (it cannot be unrelated, so we need to dig into the source and see whether the underlying Lucene scoring code has to be modified).

Working along these lines and reading the source shows that the best option is that the final score is the product of the externally supplied scoring function and the underlying score, whereas the Dismax parser adds them. With the Dismax parser the addition does not make the business rule stand out, so the Edismax parser is the better choice. In theory, whether the bottom layer uses the VSM model or the BM25 model, the relevance score will interfere with the business rule: a merchant may be a VIP with a high click-through rate while its underlying relevance score is very low, and after multiplication the final score would be misleading. So the underlying score needs to be forced to 1 to eliminate that influence (see the sketch below). With that in place, a scoring function that puts high click-through rates up front (by giving the CTR a higher weight) becomes reasonable. To verify this reasoning, you can define the scoring function without modifying the underlying score and look at the result: the ranking comes out chaotic. Based on this analysis, we need to start from Lucene's basic scoring source code and then move on to the Edismax source.

When modifying the Lucene scoring source, it is best not to rely on the JD-GUI decompiler; the first time I tried, only part of the resulting code was correct. Build a project with Maven so that the packages are downloaded directly, including lucene-core-4.9.0-sources.jar, modify the source package, then recompile and repackage it, replacing the original jar.
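One way to approximate "forcing the underlying score to 1", assuming Lucene 4.x's DefaultSimilarity (my sketch, not the patch the author describes making directly in the Lucene sources): make every tunable factor of the default formula neutral, so the relevance score stops depending on tf, idf, document length and query normalization.

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

// Sketch: a Similarity whose per-document factors are all neutral, so the relevance score
// no longer varies with tf, idf, length or query norm and an external multiplicative
// Edismax boost function can dominate the ordering.
public class ConstantScoreSimilarity extends DefaultSimilarity {
    @Override public float tf(float freq) { return 1.0f; }
    @Override public float idf(long docFreq, long numDocs) { return 1.0f; }
    @Override public float lengthNorm(FieldInvertState state) { return 1.0f; }
    @Override public float coord(int overlap, int maxOverlap) { return 1.0f; }
    @Override public float queryNorm(float sumOfSquaredWeights) { return 1.0f; }
}

This is only an approximation of the idea: with a multi-term query the score still reflects how many query terms matched, which is why the author patches the Lucene scoring source itself. Such a class would be registered e.g. via Solr's <similarity> element or IndexSearcher.setSimilarity(...), and the norm change only takes effect after reindexing.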
This is a tedious project. It also ties into the learning-to-rank work described in the following post: BM25 information is needed when building document features, and the Lucene source is needed to build the training and prediction systems. Due to time constraints I will continue tomorrow; it is too late tonight...

