To figure out how Elasticsearch calculates search relevance behind the scenes, I decided to run my own experiments.
This blog post is a good reference:
http://blog.csdn.net/dm_vincent/article/details/42099063
It is itself a translation of the official documentation:
https://www.elastic.co/guide/en/elasticsearch/guide/current/scoring-theory.html
Let's put it to the test.
When you search for documents, a combination of the following underlying algorithms is applied. The names sound intimidating, and they are indeed fancy, but they basically boil down to a few calculations from statistics.
1. Boolean model (bool model)
If you search for the phrase "hunterplus java" (a terms query will do), a bool model is applied first: it checks whether each document contains one or more of the search terms, and only documents containing at least one keyword enter the next round of competitive ranking. The bool model keeps the computation fast and effective. What, why exclude documents up front? Because if a document doesn't contain even one of the keywords, there is nothing left to rank!
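As a minimal illustration of the idea (a Python sketch, not Elasticsearch's actual implementation; the third document is made up here so there is something to filter out):

```python
# A toy corpus; the first two documents mirror the examples used below.
docs = [
    {"name": "charlie", "description": "hunterplus java"},
    {"name": "charles", "description": "hunterplus web"},
    {"name": "dave",    "description": "ruby rails"},
]

query_terms = {"hunterplus", "java"}

# Bool model: a document survives only if it contains at least one query term.
candidates = [d for d in docs if query_terms & set(d["description"].split())]

print(candidates)  # the "ruby rails" document never enters the scoring round
```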
2. Term frequency / inverse document frequency (TF/IDF)
Suppose we now have the following two documents:
{"name": "charlie", "description": "hunterplus java"}
{"name": "charles", "description": "hunterplus web"}
The three calculations below use these two simple documents as examples.
Term frequency (TF)
Term frequency, abbreviated TF, is computed over the documents left after the bool model: count the term's occurrences in the field and take the square root of that frequency.
The formula is TF = sqrt(freq)
where freq is the number of times the term appears in its document's field
For example, the term hunterplus appears with frequency 1 in the description field of the first document, so
TF = sqrt(1) = 1
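In code the calculation is trivial (a sketch; the tokenization here is just whitespace splitting, which is an assumption for illustration):

```python
import math

def tf(term: str, field_value: str) -> float:
    """Term frequency: the square root of the term's occurrence count in the field."""
    freq = field_value.split().count(term)
    return math.sqrt(freq)

print(tf("hunterplus", "hunterplus java"))  # sqrt(1) = 1.0
```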
Inverse document frequency (IDF)
Inverse document frequency, abbreviated IDF, measures how often a term occurs across all documents in the index. Words like "and" and "or" in English, or common function words in Chinese, show up constantly; if such a term is used as a keyword its weight should be very low, because something that appears everywhere has no discriminating power.
The formula is IDF = 1 + log(total / (freq + 1))
where total is the number of documents (after filtering) and freq is the number of those documents the search term appears in
The +1 keeps the denominator from being zero; the leading 1 + is a shift that doesn't change the shape of the function
Again using the example above: hunterplus appears in both documents, so
IDF = 1 + log(2/3) ≈ 0.82 (taking the log base 10)
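The same calculation in code (log base 10 here to reproduce the 0.82 above; Lucene itself, as far as I know, uses the natural log, which would give roughly 0.59 instead, and is one reason explain output won't match these numbers exactly):

```python
import math

def idf(total_docs: int, doc_freq: int) -> float:
    """Inverse document frequency per the formula above (log base 10 here
    to match the worked example; Lucene uses the natural log)."""
    return 1 + math.log10(total_docs / (doc_freq + 1))

print(idf(2, 2))  # 1 + log10(2/3) ≈ 0.82
```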
3. Field-length norm (field-length norm)
The field-length norm gives shorter fields a higher weight; the longer the field, the lower its relative weight.
The formula is norm = 1 / sqrt(numTerms + 1)
where numTerms is the number of terms in the document field that contains the keyword
Again taking hunterplus as the example, the description field "hunterplus java" contains 2 terms, so
norm = 1 / sqrt(2 + 1) ≈ 0.58
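A sketch of the norm, together with how I assume the three factors multiply into a single term weight (roughly Lucene's tf × idf × norm; the real scoring function has more pieces than this):

```python
import math

def field_norm(num_terms: int) -> float:
    """Field-length norm per the formula above: shorter fields weigh more."""
    return 1 / math.sqrt(num_terms + 1)

tf_score   = math.sqrt(1)           # hunterplus occurs once         -> 1.0
idf_score  = 1 + math.log10(2 / 3)  # appears in 2 of 2 documents    -> ~0.82
norm_score = field_norm(2)          # "hunterplus java" has 2 terms  -> ~0.58

# Multiplying the three factors gives a single weight for the term in this field.
print(tf_score * idf_score * norm_score)  # ~0.48
```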
Verify
All three factors described in sections 1, 2 and 3 are calculated at index time and take up a certain amount of storage space.
We can verify them by adding the explain parameter and looking at the concrete calculation results:
curl 'localhost:9200/candidate/_search?pretty=true&explain=true' -d '{"query": {"term": {"description": "java"}}}'
Unfortunately, the results you see won't match the theoretical values. Embarrassing. Presumably ES's internal implementation is not as simple as the theory and includes other calculations as well.
Still, the basic idea is the same: compute statistics over the terms to analyze the degree of relevance.
The above describes the relevance calculation for a single term. If there are multiple terms, the per-term relevance scores form a multi-dimensional vector, and the distance between vectors is then calculated; Elasticsearch uses the cosine distance.
For example, searching the two terms hunterplus and java above matches both documents, but the first doc yields a larger relevance vector, say (5, 2), and the other (0, 2). The vector pointing closest to the query's own vector is the best match, so the cosines for (5, 2) and (0, 2) against the query vector are computed and the documents are sorted by them.
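A sketch of the cosine calculation with those made-up vectors (treating the query itself as the vector (1, 1) over the two terms, which is an assumption for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

query = (1, 1)   # weights for (hunterplus, java) in the query, assumed equal
doc1  = (5, 2)
doc2  = (0, 2)

# The document whose vector is closest in direction to the query ranks first.
print(cosine_similarity(query, doc1))  # ~0.92
print(cosine_similarity(query, doc2))  # ~0.71
```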