I've recently been wanting to learn Lucene. Running the demo earlier felt almost magical: what is the principle behind it, and especially, how does it find the most similar, best-matching results first? Jumping straight to that question, many sources mention the VSM (vector space model), the model by which search results can be ranked and filtered. I can't prove it rigorously yet, but by intuition alone this seems to be how it should work.
1. Take a look at TF/IDF
Let's first look at a concept called TF/IDF, which is typically used to compute the weight of a search keyword in a document or in a whole query phrase. A few days ago I read Wu Jun's "The Beauty of Mathematics" series of articles; the TF/IDF concept can be traced back to relative entropy in information theory, which some literature calls "cross entropy". In English it is the Kullback-Leibler divergence, named after Solomon Kullback and Richard Leibler. Relative entropy measures the similarity of two positive functions; for two identical functions, the relative entropy is zero. In natural language processing, relative entropy can be used to measure whether two common words are (grammatically and semantically) synonymous, or whether the contents of two articles are similar, and so on. From relative entropy we can derive one of the most important concepts in information retrieval: term frequency-inverse document frequency (TF/IDF).
There are two main factors that affect the importance of a word in a document:
Term frequency (TF): how many times the term appears in this document. The larger the TF, the more important the term is.
Document frequency (DF): how many documents contain the term. The larger the DF, the less important the term is.
Is that easy to understand? The more times a word (term) appears in a document, the more important that term is to the document. The word "search", for example, appearing many times in this document, would suggest that the document is mainly about searching. In an English document, though, the word "this" appears even more often; does that make it more important? No, and that is what the second factor adjusts for: the more documents contain a term, the more common that term is, so it is too ordinary to differentiate between documents and is therefore less important.
Now that the idea is clear, let's look at the formula:

w(t, d) = tf(t, d) * log(N / df(t))

where:
w(t, d): the weight of term t in document d
tf(t, d): the frequency of term t in document d
N: the total number of documents
df(t): the number of documents that contain term t
Some simpler models (the term count model) ignore the total number of documents altogether: the weight includes no global parameter and is simply the count of the term's occurrences in the document.
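To make the calculation concrete, here is a minimal sketch in Java. The TfIdf class and its weight method are hypothetical names of my own; this is not Lucene's actual Similarity implementation, which adds length normalization and other factors.

```java
public class TfIdf {
    /**
     * Classic TF/IDF weight: w(t, d) = tf(t, d) * log(N / df(t)).
     *
     * @param tf how many times the term occurs in this document
     * @param df how many documents contain the term
     * @param n  total number of documents in the collection
     */
    static double weight(int tf, int df, int n) {
        if (tf == 0 || df == 0) {
            return 0.0; // term absent, or never seen in the collection
        }
        return tf * Math.log((double) n / df);
    }

    public static void main(String[] args) {
        // A common word ("this") appearing in almost every document
        // gets a low weight despite a high term frequency...
        System.out.println(weight(10, 9_500, 10_000)); // ~0.51
        // ...while a rarer word ("search") with the same tf scores much higher.
        System.out.println(weight(10, 100, 10_000));   // ~46.05
    }
}
```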
2. Enter the VSM
We think of a document as a series of words (terms), each word carrying a weight (term weight); different terms influence the document's relevance score according to their weights in that document.
So we treat the weights (term weights) of all the terms in a document as a vector:
Document = {term1, term2, ..., termN}
Document Vector = {weight1, weight2, ..., weightN}
Similarly, we treat the query string as a short document and represent it as a vector as well:
Query = {term1, term2, ..., termN}
Query Vector = {weight1, weight2, ..., weightN}
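As a sketch of how such a sparse vector might be built (reusing the hypothetical TfIdf.weight helper from section 1; illustrative only, not Lucene's internals):

```java
import java.util.HashMap;
import java.util.Map;

public class Vectors {
    /**
     * Build a sparse term-weight vector for one document.
     *
     * @param termCounts term -> tf within this document
     * @param df         term -> number of documents containing the term
     * @param totalDocs  total number of documents in the collection
     */
    static Map<String, Double> toVector(Map<String, Integer> termCounts,
                                        Map<String, Integer> df,
                                        int totalDocs) {
        Map<String, Double> vector = new HashMap<>();
        termCounts.forEach((term, tf) ->
            vector.put(term, TfIdf.weight(tf, df.getOrDefault(term, 0), totalDocs)));
        return vector;
    }
}
```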
We place all the matched document vectors, together with the query vector, into an N-dimensional space in which each term is one dimension.
We take the view that the smaller the angle between two vectors, the more related they are. So we use the cosine of the angle as the relevance score: the smaller the angle, the larger the cosine, the higher the score, and the greater the relevance.
As long as we compare the cosines of the angles (say α and θ for two different documents), the larger the cosine, the greater the similarity. The formula is as follows:

sim(dj, q) = cos(θ) = (dj · q) / (|dj| * |q|)
           = Σi (wi,j * wi,q) / (sqrt(Σi wi,j²) * sqrt(Σi wi,q²))

The calculation can be derived as follows in two dimensions. Let d1 = (x1, y1) and q = (x2, y2), and let A and B be the angles each vector makes with the x-axis, so the angle between them is α = A − B. Using the identity cos(A − B) = cos(A)cos(B) + sin(A)sin(B):

cos(α) = (x1 / sqrt(x1² + y1²)) * (x2 / sqrt(x2² + y2²)) + (y1 / sqrt(x1² + y1²)) * (y2 / sqrt(x2² + y2²))

Collecting terms, the numerator is the inner product of d1 and q and the denominator is the product of their lengths, which is exactly the sim(dj, q) formula above.
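Here is a minimal sketch of this cosine scoring in Java: a plain dot product over sparse term-weight maps. The CosineSimilarity class is my own illustration, not Lucene's actual scoring code.

```java
import java.util.Map;

public class CosineSimilarity {
    /**
     * sim(d, q) = (d . q) / (|d| * |q|), where each vector is a
     * sparse map from term to TF/IDF weight.
     */
    static double similarity(Map<String, Double> d, Map<String, Double> q) {
        double dot = 0.0;
        for (Map.Entry<String, Double> e : q.entrySet()) {
            // Terms missing from the document contribute 0 to the dot product.
            dot += e.getValue() * d.getOrDefault(e.getKey(), 0.0);
        }
        double normD = norm(d), normQ = norm(q);
        return (normD == 0 || normQ == 0) ? 0.0 : dot / (normD * normQ);
    }

    /** Euclidean length of a sparse vector. */
    static double norm(Map<String, Double> v) {
        double sum = 0.0;
        for (double w : v.values()) {
            sum += w * w;
        }
        return Math.sqrt(sum);
    }
}
```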
For example, suppose the query contains 11 terms and three documents are retrieved, with the respective term weights shown in the following table:
      T1     T2     T3     T4     T5     T6     T7     T8     T9     T10    T11
D1    0      0      0.477  0      0.477  0.176  0      0      0      0.176  0
D2    0      0.176  0      0.477  0      0      0      0      0.954  0      0.176
D3    0      0.176  0      0      0      0.176  0      0      0      0.176  0.176
Q     0      0      0      0      0      0.176  0      0      0.477  0      0.176
Thus, computing the cosine similarity between each of the three documents and the query (working from the table above) gives roughly:

sim(d1, q) ≈ 0.08
sim(d2, q) ≈ 0.82
sim(d3, q) ≈ 0.33

So document two has the highest relevance and is returned first, followed by document three, and finally document one.
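Plugging the table's weights into the CosineSimilarity sketch above reproduces this ranking (zero entries are simply omitted from the sparse maps):

```java
import java.util.Map;

public class RankingDemo {
    public static void main(String[] args) {
        Map<String, Double> q  = Map.of("t6", 0.176, "t9", 0.477, "t11", 0.176);
        Map<String, Double> d1 = Map.of("t3", 0.477, "t5", 0.477, "t6", 0.176, "t10", 0.176);
        Map<String, Double> d2 = Map.of("t2", 0.176, "t4", 0.477, "t9", 0.954, "t11", 0.176);
        Map<String, Double> d3 = Map.of("t2", 0.176, "t6", 0.176, "t10", 0.176, "t11", 0.176);

        System.out.println(CosineSimilarity.similarity(d1, q)); // ~0.08
        System.out.println(CosineSimilarity.similarity(d2, q)); // ~0.82
        System.out.println(CosineSimilarity.similarity(d3, q)); // ~0.33
    }
}
```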
References:
1) TF/IDF: http://en.wikipedia.org/wiki/TF-IDF
2) VSM: http://en.wikipedia.org/wiki/Vector_space_model
3) How to determine query relevance: http://www.google.com.hk/ggblog/googlechinablog/2006/06/blog-post_3066.html
4) Online resources: http://forfuture1978.javaeye.com