C# implementation (original source): http://blog.csdn.net/Felomeng/archive/2009/03/25/4023990.aspx
The vector space model (VSM) is the most common model for computing similarity and is widely used in natural language processing. This article introduces the principle of similarity calculation between documents based on it.
Suppose there are 10 words, W1, W2, ..., W10, and three documents, D1, D2 and D3. The term-frequency table (fabricated, for ease of presentation) is as follows:
|    | W1 | W2 | W3 | W4 | W5 | W6 | W7 | W8 | W9 | W10 |
|----|----|----|----|----|----|----|----|----|----|-----|
| D1 | 1  | 2  |    | 5  |    | 7  |    | 9  |    |     |
| D2 |    | 3  |    | 4  |    | 6  | 8  |    |    |     |
| D3 | 10 |    | 11 |    | 12 |    |    | 13 | 14 | 15  |
The most common vector space similarity measure is the cosine of the angle between the two term-frequency vectors:

sim(D1, D2) = cos θ = (Σᵢ aᵢ·bᵢ) / (√(Σᵢ aᵢ²) × √(Σᵢ bᵢ²))

Suppose we compute the similarity between D1 and D2; then aᵢ and bᵢ are the frequencies of word Wᵢ in D1 and D2 respectively (readers can work through the numbers themselves; what each number represents is easy to see from the table above).
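As a check, the cosine formula can be applied to D1 and D2 from the table with a short Python sketch (a blank cell is treated as frequency 0):

```python
import math

# Term-frequency vectors for D1 and D2, read off the table (blank cells = 0).
d1 = [1, 2, 0, 5, 0, 7, 0, 9, 0, 0]
d2 = [0, 3, 0, 4, 0, 6, 8, 0, 0, 0]

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine(d1, d2), 4))  # 68 / (sqrt(160) * sqrt(125)) ≈ 0.4808
```

Only the three words shared by both documents (W2, W4, W6) contribute to the numerator; the denominator normalizes away the documents' lengths.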
Why is it called a vector space model? We can think of each word as a dimension and the word's frequency as its value along that dimension (the value has a direction, hence a vector). The words of a document and their frequencies thus form a vector in an n-dimensional space, and the similarity of two documents is the closeness of their two vectors. If the documents contained only two distinct words, the vectors could be drawn in a plane rectangular coordinate system; the reader can picture two two-word documents this way to build intuition.
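To make the two-dimensional picture concrete, here is a small sketch with a made-up two-word vocabulary (the words and counts are hypothetical, not from the table): each document is a point in the plane, and the cosine of the angle between the two vectors is their similarity.

```python
import math

# Hypothetical 2-word vocabulary: each document is a vector in the plane.
a = (3, 1)  # document A: word 1 appears 3 times, word 2 once
b = (1, 3)  # document B: word 1 appears once, word 2 three times

dot = a[0] * b[0] + a[1] * b[1]           # 3*1 + 1*3 = 6
norm_a = math.hypot(*a)                   # sqrt(10)
norm_b = math.hypot(*b)                   # sqrt(10)
cos_sim = dot / (norm_a * norm_b)         # 6 / 10 = 0.6
angle = math.degrees(math.acos(cos_sim))  # angle between the two document vectors

print(cos_sim, round(angle, 2))  # 0.6, ≈ 53.13 degrees
```

Identical documents would point in the same direction (angle 0, cosine 1); documents with no words in common would be perpendicular (cosine 0).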
The formula above is computationally expensive, especially when the number of distinct terms in the documents is large. So how can we improve efficiency? One approach is dimensionality reduction. Once the principle of the vector space model is understood, the concept is not hard to grasp: reducing dimensions means reducing the number of words considered. The words most commonly removed are function words and stop words (such as "this" and similar function words). In many cases this strategy not only improves efficiency but also improves accuracy. This is not hard to see from the following two sentences (perhaps not a perfect example, forgive me): "This is my meal." "That's your meal."
If "this", "that", "my", "your" and "is" are stripped out as function words, the two sentences both reduce to "meal" and their similarity becomes 100%. If nothing is removed, the similarity may be only around 60%, even though the two sentences have the same topic.
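A minimal sketch of this stop-word effect (the token lists are written out by hand, with "That's" normalized to "that is" for simplicity, and binary present/absent word vectors are used):

```python
import math

# Hand-tokenized sentences ("That's" normalized to "that is" for simplicity).
s1 = ["this", "is", "my", "meal"]
s2 = ["that", "is", "your", "meal"]

# A toy stop-word list; real systems use much larger, curated lists.
stop_words = {"this", "that", "is", "my", "your", "the"}

def cosine_binary(t1, t2):
    """Cosine similarity over binary (present/absent) word vectors."""
    v1, v2 = set(t1), set(t2)
    if not v1 or not v2:
        return 0.0
    return len(v1 & v2) / (math.sqrt(len(v1)) * math.sqrt(len(v2)))

before = cosine_binary(s1, s2)  # shared words {"is", "meal"}: 2 / (2*2) = 0.5
after = cosine_binary([w for w in s1 if w not in stop_words],
                      [w for w in s2 if w not in stop_words])  # both reduce to {"meal"}: 1.0
print(before, after)
```

The exact numbers depend on how the sentences are tokenized and weighted; the point is that dropping function words raises the measured similarity of topically identical sentences.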
The inverse document frequency (IDF) method uses corpus-wide statistics to adjust the weight of each word: a word's frequency within a document is multiplied by a factor that shrinks as the word becomes more common across the whole corpus (its inverse document frequency), and the weighted values are then substituted into the formula. Because the similarity is a relative value, it is enough to make sure the result still falls between 0 and 1.
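A common way to realize this idea is TF-IDF weighting. The sketch below uses the three documents from the table and one of several standard IDF formulas, idf = ln(N / df), where df is the number of documents containing the word; each term frequency is multiplied by the term's IDF before the cosine is computed:

```python
import math

# Term-frequency vectors from the table (blank cells = 0).
docs = {
    "D1": [1, 2, 0, 5, 0, 7, 0, 9, 0, 0],
    "D2": [0, 3, 0, 4, 0, 6, 8, 0, 0, 0],
    "D3": [10, 0, 11, 0, 12, 0, 0, 13, 14, 15],
}
n_docs, n_terms = len(docs), 10

# df[i] = number of documents containing word Wi; idf = ln(N / df).
df = [sum(1 for v in docs.values() if v[i] > 0) for i in range(n_terms)]
idf = [math.log(n_docs / df[i]) if df[i] else 0.0 for i in range(n_terms)]

def tfidf(v):
    """Scale each term frequency by that term's inverse document frequency."""
    return [tf * w for tf, w in zip(v, idf)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

sim = cosine(tfidf(docs["D1"]), tfidf(docs["D2"]))
print(round(sim, 4))
```

With this weighting the D1-D2 similarity drops below the raw-frequency 0.4808, because W7, which appears only in D2, receives a larger weight and pulls the two vectors apart.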
This is the simple vector space model; practical applications use improved variants of it.