A simple introduction to the vector space model (VSM) for computing document similarity

Source: Internet
Author: User

C# implementation:

http://blog.csdn.net/Felomeng/archive/2009/03/25/4023990.aspx

The vector space model (VSM) is one of the most commonly used similarity models and is widely applied in natural language processing. This article introduces the principle behind computing the similarity between documents.

Suppose there are 10 words, W1, W2, ..., W10, and three articles, D1, D2 and D3. The word-frequency table (made up, for ease of presentation) is as follows:

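Words (columns): W1, W2, W3, W4, W5, W6, W7, W8, W9, W10

D1: 1, 2, 5, 7, 9
D2: 3, 4, 6, 8
D3: 10, 11, 12, 13, 14, 15

(Only the non-empty cells of each row are listed, in column order.)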

The common vector space formula is shown in the following diagram:

Suppose we compute the similarity between D1 and D2. Then a_i and b_i denote the frequencies of the i-th word in D1 and D2 respectively. Taking cosine as an example:
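For reference, the cosine measure in its standard form can be written out as below. Here n, the number of words (dimensions), is notation introduced for this sketch; a_i and b_i are the frequencies of word i in D1 and D2.

```latex
\[
\cos(D_1, D_2) =
  \frac{\sum_{i=1}^{n} a_i b_i}
       {\sqrt{\sum_{i=1}^{n} a_i^{2}}\;\sqrt{\sum_{i=1}^{n} b_i^{2}}}
\]
```

The numerator is the inner product of the two frequency vectors, and the denominator is the product of their lengths, so documents with non-negative frequencies score between 0 and 1.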

(The arithmetic is left to the reader; what each number represents is easy to see from the table above.)
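The cosine calculation can be sketched in a few lines of Python. This is a minimal illustration, not the linked C# implementation; the function name `cosine_similarity` and the two short vectors are made up for the example and are not the data from the table.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length frequency vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here: an empty document matches nothing
    return dot / (norm_a * norm_b)

# Hypothetical frequency vectors over four words (not the table's data).
doc_a = [1, 2, 0, 2]
doc_b = [2, 0, 1, 2]
print(cosine_similarity(doc_a, doc_b))  # inner product 6, lengths 3 and 3
```

Both vectors must be laid out over the same word list, with 0 in the positions of words a document does not contain.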

Why is it called a vector space model? We can think of each word as a dimension and the frequency of the word as its value along that dimension, so each article becomes a vector. The words of an article and their frequencies thus define a point in a multi-dimensional space (one dimension per word), and the similarity of two documents is the closeness of their two vectors. If an article has only two dimensions, the vectors can be drawn in a plane rectangular coordinate system; the reader can picture this by imagining two articles that each contain only two words.
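To make the "one word, one dimension" idea concrete, the sketch below turns two texts into frequency vectors over a shared vocabulary. The helper name `to_vectors`, the whitespace tokenization, and the two tiny "articles" are assumptions for illustration only.

```python
from collections import Counter

def to_vectors(text_a, text_b):
    """Map two texts onto one shared vocabulary, one dimension per word."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    vocabulary = sorted(set(counts_a) | set(counts_b))
    # Counter returns 0 for words a text does not contain.
    vec_a = [counts_a[w] for w in vocabulary]
    vec_b = [counts_b[w] for w in vocabulary]
    return vocabulary, vec_a, vec_b

# Two made-up "articles" that use only two distinct words, so each one
# is a point in the plane.
vocabulary, vec_a, vec_b = to_vectors("apple apple pear", "apple pear pear pear")
print(vocabulary)  # ['apple', 'pear']
print(vec_a)       # [2, 1]
print(vec_b)       # [1, 3]
```

Plotting (2, 1) and (1, 3) in a plane gives the two-dimensional picture described above; the angle between the two arrows is what the cosine measures.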

We can see that the formula above is computationally expensive, especially when a document contains many terms. So how can the computation be made more efficient? We can use dimensionality reduction. Once the principle of the vector space model is understood, the concept is not hard to grasp: dimensionality reduction simply means reducing the number of dimensions, which for document similarity means reducing the number of words. The words most commonly removed are function words and stop words (such as "this" and so on). In fact, dimensionality reduction can in many cases improve not only efficiency but also accuracy. This is not hard to see; take the following two sentences (perhaps not the most apt example, forgive me): This is my meal. That's your meal.

If "this", "that", "your", "my", "is" and "the" are removed as function words, the similarity is 100%. If none are removed, the similarity may be only 60%. Yet the topic of the two sentences is the same.
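The effect of removing function words can be checked with a short sketch. The `FUNCTION_WORDS` set, the helper names, and the whitespace tokenization are assumptions made for this example; with this particular English tokenization the unfiltered score comes out at 0.5, in the same range as the rough figure quoted above.

```python
import math
from collections import Counter

# Illustrative function-word list for this example only (an assumption).
FUNCTION_WORDS = {"this", "that", "is", "my", "your", "the"}

def cosine(counts_a, counts_b):
    """Cosine similarity of two word-count mappings."""
    dot = sum(counts_a[w] * counts_b[w] for w in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def similarity(text_a, text_b, drop=frozenset()):
    """Compare two texts after optionally dropping the words in `drop`."""
    words_a = [w for w in text_a.lower().split() if w not in drop]
    words_b = [w for w in text_b.lower().split() if w not in drop]
    return cosine(Counter(words_a), Counter(words_b))

# "That's" is written out as "that is" to keep tokenization trivial.
s1 = "this is my meal"
s2 = "that is your meal"
print(similarity(s1, s2))                       # 0.5
print(similarity(s1, s2, drop=FUNCTION_WORDS))  # 1.0
```

After filtering, both sentences reduce to the single word "meal", so their vectors coincide and the cosine is exactly 1.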

The inverse document frequency (IDF) method uses how widely each word occurs across the whole corpus to adjust the weight of that word in a single document. It can be understood as multiplying the frequency of a word in an article by a weight derived from its corpus-wide frequency (the rarer the word is across the corpus, the larger the weight) and then substituting the result into the formula. Because similarity is a relative value, it is enough to make sure that its value falls between 0 and 1.

This is the simple vector space model; practical applications generally use improved versions of it.
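A minimal sketch of this weighting is given below. It assumes the common variant idf = log(N / df), where N is the number of documents and df is the number of documents containing the word; the article does not say which variant it has in mind, and the function name `tf_idf_vectors` and the three-sentence corpus are made up for illustration.

```python
import math
from collections import Counter

def tf_idf_vectors(documents):
    """Weight raw term counts by idf = log(N / df), one common variant."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    n_docs = len(counts)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for c in counts for word in c)
    vocabulary = sorted(doc_freq)
    idf = {w: math.log(n_docs / doc_freq[w]) for w in vocabulary}
    vectors = [[c[w] * idf[w] for w in vocabulary] for c in counts]
    return vocabulary, vectors

# Made-up three-document corpus.
corpus = ["my meal is hot", "your meal is cold", "my tea is hot"]
vocabulary, vectors = tf_idf_vectors(corpus)
for word, weight in zip(vocabulary, vectors[0]):
    print(word, round(weight, 3))
```

In this corpus "is" appears in every document, so its weight is log(3/3) = 0 and it drops out of the comparison much as a stop word would. Since every weight is non-negative, the cosine of two weighted vectors still falls between 0 and 1.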
