(1) A document is expressed as a set of index terms.
(2) Index terms are usually words, but can also be other linguistic units (e.g., phrases).
(3) Each term (keyword) can be regarded as one dimension of the vector space.
For example:
Document D1: I like playing DotA and I also like playing LoL.
Document D2: I love eating fruit and also like taking photos.
Here, for simplicity, we use the word-segmentation results directly as the index terms:
D1 segmented: I, like, play, DotA, also, I, like, play, LoL
D2 segmented: I, love, eat, fruit, also, like, photo
Taking the union of the segmented terms gives the index vocabulary:
I, like, play, DotA, also, LoL, love, eat, fruit, photo
Index-term selection is far more involved than plain word segmentation; it is simplified here just to introduce the VSM. I will discuss index-term selection in another blog post.
Different index terms matter differently to a document, and this is captured by weights. Weights are usually computed with TF*IDF; that is not covered here, and we simply use the raw term frequency as the weight.
Document D1: (<I,2>, <like,2>, <play,2>, <DotA,1>, <also,1>, <LoL,1>, <love,0>, <eat,0>, <fruit,0>, <photo,0>)
Document D2: (<I,1>, <like,1>, <play,0>, <DotA,0>, <also,1>, <LoL,0>, <love,1>, <eat,1>, <fruit,1>, <photo,1>)
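The term-frequency vectors above can be reproduced with a short sketch (the English token glosses are my assumption, and duplicate entries for the same term are merged into a single count):

```python
from collections import Counter

# Token lists for the two example documents (English glosses of the
# original segmentation; "photo" stands for "take pictures").
d1 = ["I", "like", "play", "DotA", "also", "I", "like", "play", "LoL"]
d2 = ["I", "love", "eat", "fruit", "also", "like", "photo"]

# The index vocabulary is the union of all terms, in first-seen order.
vocab = list(dict.fromkeys(d1 + d2))

def tf_vector(tokens, vocab):
    """Raw term-frequency vector over the shared vocabulary."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

v1 = tf_vector(d1, vocab)
v2 = tf_vector(d2, vocab)
print(vocab)
print(v1)  # [2, 2, 2, 1, 1, 1, 0, 0, 0, 0]
print(v2)  # [1, 1, 0, 0, 1, 0, 1, 1, 1, 1]
```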
This gives us what is called the document-term matrix (doc-term matrix):
A matrix A(m×n) can be built from m documents and n index terms, where each row represents a document and each column represents an index term.
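As a sketch, the two example documents can be stacked into such a matrix (assuming NumPy; rows are documents, columns are the ten index terms):

```python
import numpy as np

# Rows = documents (D1, D2); columns = index terms
# (I, like, play, DotA, also, LoL, love, eat, fruit, photo).
A = np.array([
    [2, 2, 2, 1, 1, 1, 0, 0, 0, 0],  # D1
    [1, 1, 0, 0, 1, 0, 1, 1, 1, 1],  # D2
])
print(A.shape)  # (2, 10): m = 2 documents, n = 10 terms
```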
Given query Q: I like to play ball ---> (<I,1>, <like,1>, <play,1>, <DotA,0>, <also,0>, <LoL,0>, <love,0>, <eat,0>, <fruit,0>, <photo,0>, <ball,1>)
Document D1: (<I,2>, <like,2>, <play,2>, <DotA,1>, <also,1>, <LoL,1>, <love,0>, <eat,0>, <fruit,0>, <photo,0>, <ball,0>)
Document D2: (<I,1>, <like,1>, <play,0>, <DotA,0>, <also,1>, <LoL,0>, <love,1>, <eat,1>, <fruit,1>, <photo,1>, <ball,0>)
Calculation of similarity:
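The standard measure is the cosine of the angle between a document vector d and the query vector q (this formula is implied but not written out above):

```
cos(d, q) = (d · q) / (|d| |q|)
          = Σ_i d_i q_i / ( sqrt(Σ_i d_i²) · sqrt(Σ_i q_i²) )
```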
The cosine lies in the range [-1, 1]; the closer the value is to 1, the closer the angle between the two vectors is to 0 and the more consistent their directions, so the higher the corresponding similarity. This is the so-called "cosine similarity".
Computing the cosine for each document shows that document D1 is more similar to Q.
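A minimal sketch of that comparison, using the term-frequency vectors over the 11-term vocabulary listed above (I, like, play, DotA, also, LoL, love, eat, fruit, photo, ball):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# TF vectors from the example above.
q  = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
d1 = [2, 2, 2, 1, 1, 1, 0, 0, 0, 0, 0]
d2 = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0]

print(round(cosine(d1, q), 3))  # 0.775 -> D1 is closer to Q
print(round(cosine(d2, q), 3))  # 0.378
```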
The concept of similarity is widely used, for example in matching, recommendation, and clustering.