C# implementation (original source): http://blog.csdn.net/Felomeng/archive/2009/03/25/4023990.aspx
The vector space model (VSM) is the most common model for computing similarity and is widely used in natural language processing. This article introduces the principle of similarity calculation between documents based on it.
Suppose there are 10 words, W1, W2, ..., W10, and three documents, D1, D2 and D3. The term-frequency table (fabricated, for ease of presentation) is as follows:
|    | W1 | W2 | W3 | W4 | W5 | W6 | W7 | W8 | W9 | W10 |
|----|----|----|----|----|----|----|----|----|----|-----|
| D1 | 1  | 2  |    | 5  |    | 7  |    | 9  |    |     |
| D2 |    | 3  |    | 4  |    | 6  | 8  |    |    |     |
| D3 | 10 |    | 11 |    | 12 |    |    | 13 | 14 | 15  |
The most common vector space similarity measure is the cosine of the angle between the two term-frequency vectors:

sim(D1, D2) = cos θ = (Σᵢ aᵢ·bᵢ) / (√(Σᵢ aᵢ²) × √(Σᵢ bᵢ²))

Suppose we compute the similarity between D1 and D2; then aᵢ and bᵢ are the frequencies of word Wᵢ in D1 and D2 respectively (readers can work through the numbers themselves; what each number represents is easy to see from the table above).
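As a check, the cosine formula can be applied to D1 and D2 from the table with a short Python sketch (a blank cell is treated as frequency 0):

```python
import math

# Term-frequency vectors for D1 and D2, read off the table (blank cells = 0).
d1 = [1, 2, 0, 5, 0, 7, 0, 9, 0, 0]
d2 = [0, 3, 0, 4, 0, 6, 8, 0, 0, 0]

def cosine(a, b):
    """Cosine similarity: dot(a, b) / (|a| * |b|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(round(cosine(d1, d2), 4))  # 68 / (sqrt(160) * sqrt(125)) ≈ 0.4808
```

Only the three words shared by both documents (W2, W4, W6) contribute to the numerator; the denominator normalizes away the documents' lengths.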
Why is it called a vector space model? We can think of each word as a dimension and the word's frequency as its value along that dimension (the value has a direction, hence a vector). The words of a document and their frequencies thus form a vector in an n-dimensional space, and the similarity of two documents is the closeness of their two vectors. If the documents contained only two distinct words, the vectors could be drawn in a plane rectangular coordinate system; the reader can picture two two-word documents this way to build intuition.
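To make the two-dimensional picture concrete, here is a small sketch with a made-up two-word vocabulary (the words and counts are hypothetical, not from the table): each document is a point in the plane, and the cosine of the angle between the two vectors is their similarity.

```python
import math

# Hypothetical 2-word vocabulary: each document is a vector in the plane.
a = (3, 1)  # document A: word 1 appears 3 times, word 2 once
b = (1, 3)  # document B: word 1 appears once, word 2 three times

dot = a[0] * b[0] + a[1] * b[1]           # 3*1 + 1*3 = 6
norm_a = math.hypot(*a)                   # sqrt(10)
norm_b = math.hypot(*b)                   # sqrt(10)
cos_sim = dot / (norm_a * norm_b)         # 6 / 10 = 0.6
angle = math.degrees(math.acos(cos_sim))  # angle between the two document vectors

print(cos_sim, round(angle, 2))  # 0.6, ≈ 53.13 degrees
```

Identical documents would point in the same direction (angle 0, cosine 1); documents with no words in common would be perpendicular (cosine 0).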
The formula above is computationally expensive, especially when the number of distinct terms in the documents is large. So how can we improve efficiency? One approach is dimensionality reduction. Once the principle of the vector space model is understood, the concept is not hard to grasp: reducing dimensions means reducing the number of words considered. The words most commonly removed are function words and stop words (such as "this" and similar function words). In many cases this strategy not only improves efficiency but also improves accuracy. This is not hard to see from the following two sentences (perhaps not a perfect example, forgive me): "This is my meal." "That's your meal."
If "this", "that", "my", "your" and "is" are stripped out as function words, the two sentences both reduce to "meal" and their similarity becomes 100%. If nothing is removed, the similarity may be only around 60%, even though the two sentences have the same topic.
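A minimal sketch of this stop-word effect (the token lists are written out by hand, with "That's" normalized to "that is" for simplicity, and binary present/absent word vectors are used):

```python
import math

# Hand-tokenized sentences ("That's" normalized to "that is" for simplicity).
s1 = ["this", "is", "my", "meal"]
s2 = ["that", "is", "your", "meal"]

# A toy stop-word list; real systems use much larger, curated lists.
stop_words = {"this", "that", "is", "my", "your", "the"}

def cosine_binary(t1, t2):
    """Cosine similarity over binary (present/absent) word vectors."""
    v1, v2 = set(t1), set(t2)
    if not v1 or not v2:
        return 0.0
    return len(v1 & v2) / (math.sqrt(len(v1)) * math.sqrt(len(v2)))

before = cosine_binary(s1, s2)  # shared words {"is", "meal"}: 2 / (2*2) = 0.5
after = cosine_binary([w for w in s1 if w not in stop_words],
                      [w for w in s2 if w not in stop_words])  # both reduce to {"meal"}: 1.0
print(before, after)
```

The exact numbers depend on how the sentences are tokenized and weighted; the point is that dropping function words raises the measured similarity of topically identical sentences.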
The inverse document frequency (IDF) method uses corpus-wide statistics to adjust the weight of each word: a word's frequency within a document is multiplied by a factor that shrinks as the word becomes more common across the whole corpus (its inverse document frequency), and the weighted values are then substituted into the formula. Because the similarity is a relative value, it is enough to make sure the result still falls between 0 and 1.
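A common way to realize this idea is TF-IDF weighting. The sketch below uses the three documents from the table and one of several standard IDF formulas, idf = ln(N / df), where df is the number of documents containing the word; each term frequency is multiplied by the term's IDF before the cosine is computed:

```python
import math

# Term-frequency vectors from the table (blank cells = 0).
docs = {
    "D1": [1, 2, 0, 5, 0, 7, 0, 9, 0, 0],
    "D2": [0, 3, 0, 4, 0, 6, 8, 0, 0, 0],
    "D3": [10, 0, 11, 0, 12, 0, 0, 13, 14, 15],
}
n_docs, n_terms = len(docs), 10

# df[i] = number of documents containing word Wi; idf = ln(N / df).
df = [sum(1 for v in docs.values() if v[i] > 0) for i in range(n_terms)]
idf = [math.log(n_docs / df[i]) if df[i] else 0.0 for i in range(n_terms)]

def tfidf(v):
    """Scale each term frequency by that term's inverse document frequency."""
    return [tf * w for tf, w in zip(v, idf)]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

sim = cosine(tfidf(docs["D1"]), tfidf(docs["D2"]))
print(round(sim, 4))
```

With this weighting the D1-D2 similarity drops below the raw-frequency 0.4808, because W7, which appears only in D2, receives a larger weight and pulls the two vectors apart.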
This is the simple vector space model; practical applications use improved variants of it.