I've recently been wanting to learn Lucene. Running the demo earlier felt almost magical: what is the principle behind it, and especially, how does it find the most similar, best-matching results first? Jumping straight to that question, many sources mention the VSM (vector space model), the model by which search results can be ranked and filtered. I can't prove it rigorously yet, but by intuition alone this seems to be how it should work.
1. Take a look at TF/IDF
Let's first look at a concept called TF/IDF, which is typically used to compute the weight of a search keyword in a document or in a whole query phrase. A few days ago I read Wu Jun's "The Beauty of Mathematics" series of articles; the TF/IDF concept can be traced back to relative entropy in information theory, which some literature calls "cross entropy". In English it is the Kullback-Leibler divergence, named after Solomon Kullback and Richard Leibler. Relative entropy measures the similarity of two positive functions; for two identical functions, the relative entropy is zero. In natural language processing, relative entropy can be used to measure whether two common words are (grammatically and semantically) synonymous, or whether the contents of two articles are similar, and so on. From relative entropy we can derive one of the most important concepts in information retrieval: term frequency-inverse document frequency (TF/IDF).
There are two main factors that affect the importance of a word in a document:
Term frequency (TF): how many times the term appears in this document. The larger the TF, the more important the term is.
Document frequency (DF): how many documents contain the term. The larger the DF, the less important the term is.
Is that easy to understand? The more times a word (term) appears in a document, the more important that term is to the document. The word "search", for example, appearing many times in this document, would suggest that the document is mainly about searching. In an English document, though, the word "this" appears even more often; does that make it more important? No, and that is what the second factor adjusts for: the more documents contain a term, the more common that term is, so it is too ordinary to differentiate between documents and is therefore less important.
Now that the idea is clear, let's look at the formula:

w(t, d) = tf(t, d) * log(N / df(t))

where:
w(t, d): the weight of term t in document d
tf(t, d): the frequency of term t in document d
N: the total number of documents
df(t): the number of documents that contain term t
Some simpler models (the term count model) ignore the total number of documents altogether: the weight includes no global parameter and is simply the count of the term's occurrences in the document.
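To make the calculation concrete, here is a minimal sketch in Java. The TfIdf class and its weight method are hypothetical names of my own; this is not Lucene's actual Similarity implementation, which adds length normalization and other factors.

```java
public class TfIdf {
    /**
     * Classic TF/IDF weight: w(t, d) = tf(t, d) * log(N / df(t)).
     *
     * @param tf how many times the term occurs in this document
     * @param df how many documents contain the term
     * @param n  total number of documents in the collection
     */
    static double weight(int tf, int df, int n) {
        if (tf == 0 || df == 0) {
            return 0.0; // term absent, or never seen in the collection
        }
        return tf * Math.log((double) n / df);
    }

    public static void main(String[] args) {
        // A common word ("this") appearing in almost every document
        // gets a low weight despite a high term frequency...
        System.out.println(weight(10, 9_500, 10_000)); // ~0.51
        // ...while a rarer word ("search") with the same tf scores much higher.
        System.out.println(weight(10, 100, 10_000));   // ~46.05
    }
}
```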
2. Enter the VSM
We think of a document as a series of words (terms), each word carrying a weight (term weight); different terms influence the document's relevance score according to their weights in that document.
So we treat the weights (term weights) of all the terms in a document as a vector:
Document = {term1, term2, ..., termN}
Document Vector = {weight1, weight2, ..., weightN}
Similarly, we treat the query string as a short document and represent it as a vector as well:
Query = {term1, term2, ..., termN}
Query Vector = {weight1, weight2, ..., weightN}
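As a sketch of how such a sparse vector might be built (reusing the hypothetical TfIdf.weight helper from section 1; illustrative only, not Lucene's internals):

```java
import java.util.HashMap;
import java.util.Map;

public class Vectors {
    /**
     * Build a sparse term-weight vector for one document.
     *
     * @param termCounts term -> tf within this document
     * @param df         term -> number of documents containing the term
     * @param totalDocs  total number of documents in the collection
     */
    static Map<String, Double> toVector(Map<String, Integer> termCounts,
                                        Map<String, Integer> df,
                                        int totalDocs) {
        Map<String, Double> vector = new HashMap<>();
        termCounts.forEach((term, tf) ->
            vector.put(term, TfIdf.weight(tf, df.getOrDefault(term, 0), totalDocs)));
        return vector;
    }
}
```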
We place all the matched document vectors, together with the query vector, into an N-dimensional space in which each term is one dimension.
We take the view that the smaller the angle between two vectors, the more related they are. So we use the cosine of the angle as the relevance score: the smaller the angle, the larger the cosine, the higher the score, and the greater the relevance.
As long as we compare the cosines of the angles (say α and θ for two different documents), the larger the cosine, the greater the similarity. The formula is as follows:

sim(dj, q) = cos(θ) = (dj · q) / (|dj| * |q|)
           = Σi (wi,j * wi,q) / (sqrt(Σi wi,j²) * sqrt(Σi wi,q²))

The calculation can be derived as follows in two dimensions. Let d1 = (x1, y1) and q = (x2, y2), and let A and B be the angles each vector makes with the x-axis, so the angle between them is α = A − B. Using the identity cos(A − B) = cos(A)cos(B) + sin(A)sin(B):

cos(α) = (x1 / sqrt(x1² + y1²)) * (x2 / sqrt(x2² + y2²)) + (y1 / sqrt(x1² + y1²)) * (y2 / sqrt(x2² + y2²))

Collecting terms, the numerator is the inner product of d1 and q and the denominator is the product of their lengths, which is exactly the sim(dj, q) formula above.
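Here is a minimal sketch of this cosine scoring in Java: a plain dot product over sparse term-weight maps. The CosineSimilarity class is my own illustration, not Lucene's actual scoring code.

```java
import java.util.Map;

public class CosineSimilarity {
    /**
     * sim(d, q) = (d . q) / (|d| * |q|), where each vector is a
     * sparse map from term to TF/IDF weight.
     */
    static double similarity(Map<String, Double> d, Map<String, Double> q) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : q.entrySet()) {
            // Terms missing from the document contribute 0 to the dot product.
            dot += e.getValue() * d.getOrDefault(e.getKey(), 0.0);
        }
        double normD = norm(d), normQ = norm(q);
        return (normD == 0 || normQ == 0) ? 0.0 : dot / (normD * normQ);
    }

    /** Euclidean length of a sparse vector. */
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }
}
```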
For example, suppose the query contains 11 terms and three documents are retrieved, with the respective term weights shown in the following table:
      T1     T2     T3     T4     T5     T6     T7     T8     T9     T10    T11
D1    0      0      0.477  0      0.477  0.176  0      0      0      0.176  0
D2    0      0.176  0      0.477  0      0      0      0      0.954  0      0.176
D3    0      0.176  0      0      0      0.176  0      0      0      0.176  0.176
Q     0      0      0      0      0      0.176  0      0      0.477  0      0.176
Thus, computing the cosine similarity between each of the three documents and the query (working from the table above) gives roughly:

sim(d1, q) ≈ 0.08
sim(d2, q) ≈ 0.82
sim(d3, q) ≈ 0.33

So document two has the highest relevance and is returned first, followed by document three, and finally document one.
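Plugging the table's weights into the CosineSimilarity sketch above reproduces this ranking (zero entries are simply omitted from the sparse maps):

```java
import java.util.Map;

public class RankingDemo {
    public static void main(String[] args) {
        Map<String, Double> q  = Map.of("t6", 0.176, "t9", 0.477, "t11", 0.176);
        Map<String, Double> d1 = Map.of("t3", 0.477, "t5", 0.477, "t6", 0.176, "t10", 0.176);
        Map<String, Double> d2 = Map.of("t2", 0.176, "t4", 0.477, "t9", 0.954, "t11", 0.176);
        Map<String, Double> d3 = Map.of("t2", 0.176, "t6", 0.176, "t10", 0.176, "t11", 0.176);

        System.out.println(CosineSimilarity.similarity(d1, q)); // ~0.08
        System.out.println(CosineSimilarity.similarity(d2, q)); // ~0.82
        System.out.println(CosineSimilarity.similarity(d3, q)); // ~0.33
    }
}
```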
References:
1) TF/IDF: http://en.wikipedia.org/wiki/TF-IDF
2) VSM: http://en.wikipedia.org/wiki/Vector_space_model
3) How to determine query relevance: http://www.google.com.hk/ggblog/googlechinablog/2006/06/blog-post_3066.html
4) Online resources: http://forfuture1978.javaeye.com