Vector Space Model (VSM) for Calculating Text Similarity

Source: Internet
Author: User

1. Definition

The vector space model is an algebraic model that represents text as a vector of index terms. Its prototype system was SMART.

The definition is simple: both a document D and a query Q are represented as vectors.

Each query and each document is converted into a vector of terms and their weights, which can be regarded as a point in the vector space. The similarity between the query and each document is then obtained by computing the distance (or angle) between their vectors.

The vector space model assumes that terms are independent of one another.

2. Model Building

Building a vector space model comes down to three key steps:

1. Index term selection

2. Term weighting: calculating the weight of each term in each document

3. Similarity computation between the query and each document

These steps are easier to understand through an example, so the following sections show how to build a vector space model and how to use it to calculate text similarity.

2.1 Index Terms

Index terms:

(1) A document is represented as a set of multiple terms

(2) Terms are usually words, but other linguistic units can also be used

(3) Keywords can be regarded as one kind of index term

For example:

Document D1: I like playing DotA and I also like to play LoL.

Document D2: I love eating fruit and also like taking pictures.

Here we simply use the word segmentation results as index terms. Note that the verb in "playing DotA" appears as the token "hit" (a literal rendering of the original verb) and is a separate term from "play" in "play LoL".

D1 tokens: I, like, hit, DotA, also, play, LoL

D2 tokens: I, love, eat, fruit, also, like, take pictures

Merging the tokens of both documents gives the index:

I, like, hit, DotA, also, play, LoL, love, eat, fruit, take pictures
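The merging step can be sketched in a few lines of Python. This is only an illustration: the token lists are typed in by hand (a real system would get them from a word segmenter), the second verb "play" is included so that the index has the eleven entries used by the weight vectors later on, and the names `build_index`, `d1_tokens`, and `d2_tokens` are hypothetical.

```python
def build_index(token_lists):
    """Collect the distinct tokens of all documents in first-seen order."""
    index = []
    seen = set()
    for tokens in token_lists:
        for tok in tokens:
            if tok not in seen:
                seen.add(tok)
                index.append(tok)
    return index

# Hand-written segmentation results (assumed, repeated tokens kept).
d1_tokens = ["I", "like", "hit", "DotA", "I", "also", "like", "play", "LoL"]
d2_tokens = ["I", "love", "eat", "fruit", "also", "like", "take pictures"]

index_terms = build_index([d1_tokens, d2_tokens])
print(index_terms)
```

Keeping first-seen order is a design choice made here so that the index lines up with the order of the example; any fixed order works as long as every vector uses the same one.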

Index selection involves far more than simple word segmentation. It is simplified here only to introduce the VSM, and I will discuss index selection in another blog post.

2.2 Weight

Different index terms contribute differently, and weights are used to distinguish them. TF*IDF is the usual choice for calculating weights. It is not covered in detail here, and the example simply uses term frequency as the weight.
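For reference, one common form of the TF*IDF weight is shown below. Several variants exist, and the symbols are assumptions introduced here rather than notation from this article: tf_{t,d} is the number of times term t occurs in document d, N is the number of documents, and df_t is the number of documents that contain t.

```latex
w_{t,d} = \mathrm{tf}_{t,d} \times \log \frac{N}{\mathrm{df}_t}
```

The example in this article uses only the first factor, that is, w_{t,d} = tf_{t,d}.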

Document D1: (<I,2>, <like,2>, <hit,1>, <DotA,1>, <also,1>, <play,1>, <LoL,1>, <love,0>, <eat,0>, <fruit,0>, <take pictures,0>)

Document D2: (<I,1>, <like,1>, <hit,0>, <DotA,0>, <also,1>, <play,0>, <LoL,0>, <love,1>, <eat,1>, <fruit,1>, <take pictures,1>)
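The term-frequency weights above can be reproduced with a short Python sketch. It assumes hand-written token lists and labels the index terms "hit" (the first verb) and "take pictures" (the photo term) as in the segmentation results. The function name `tf_vector` is hypothetical.

```python
from collections import Counter

# Index terms in the order used by the example (assumed labels).
index_terms = ["I", "like", "hit", "DotA", "also", "play", "LoL",
               "love", "eat", "fruit", "take pictures"]

def tf_vector(tokens, index_terms):
    """Return the raw term frequency of each index term in `tokens`."""
    counts = Counter(tokens)  # missing terms count as 0
    return [counts[term] for term in index_terms]

d1_tokens = ["I", "like", "hit", "DotA", "I", "also", "like", "play", "LoL"]
d2_tokens = ["I", "love", "eat", "fruit", "also", "like", "take pictures"]

d1 = tf_vector(d1_tokens, index_terms)
d2 = tf_vector(d2_tokens, index_terms)
print(d1)
print(d2)
```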

2.3 Building a vector space model

The model is built as what we call the document-term matrix (doc-term matrix).

The documents and index terms form a matrix A of size m*n, with m index terms as rows and n documents as columns: each column represents a document and each row represents an index term.
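As a rough illustration, the matrix for the two example documents can be assembled from their term-frequency vectors. The sketch assumes the layout described in the text (rows are index terms, columns are documents) and uses plain Python lists. The variable names are hypothetical.

```python
# Term-frequency vectors of the two example documents (11 index terms each).
d1 = [2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0]
d2 = [1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1]

doc_vectors = [d1, d2]

# Transpose: each row becomes one index term, each column one document.
matrix = [list(row) for row in zip(*doc_vectors)]

for row in matrix:
    print(row)
```

Here `matrix[0]` holds the weights of the first index term ("I") in D1 and D2.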

Given the query Q: "I like to play ball", the index is extended with the new term "ball", and the query and both documents are represented as follows (the verb in "play ball" is counted under the index term "hit"):

Query Q: (<I,1>, <like,1>, <hit,1>, <DotA,0>, <also,0>, <play,0>, <LoL,0>, <love,0>, <eat,0>, <fruit,0>, <take pictures,0>, <ball,1>)

Document D1: (<I,2>, <like,2>, <hit,1>, <DotA,1>, <also,1>, <play,1>, <LoL,1>, <love,0>, <eat,0>, <fruit,0>, <take pictures,0>, <ball,0>)

Document D2: (<I,1>, <like,1>, <hit,0>, <DotA,0>, <also,1>, <play,0>, <LoL,0>, <love,1>, <eat,1>, <fruit,1>, <take pictures,1>, <ball,0>)

Calculation of similarity:
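The standard definition of the cosine between a query vector Q and a document vector D is given below. The component symbols q_i and d_i (the weights of the i-th index term in Q and D) are notation introduced here.

```latex
\cos(Q, D) = \frac{Q \cdot D}{\lVert Q \rVert \, \lVert D \rVert}
           = \frac{\sum_{i} q_i d_i}{\sqrt{\sum_{i} q_i^{2}} \, \sqrt{\sum_{i} d_i^{2}}}
```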

The cosine lies in the range [-1, 1] (for non-negative weights such as term frequencies, in [0, 1]). The closer the value is to 1, the closer the angle between the two vectors is to 0 and the more consistent their directions, so the higher the corresponding similarity. This measure is called "cosine similarity".

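With the vectors above, cos(Q, D1) = 5 / (2 * sqrt(13)) ≈ 0.69 and cos(Q, D2) = 2 / (2 * sqrt(7)) ≈ 0.38, so document D1 is more similar to Q.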
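The comparison can be checked with a minimal Python sketch of cosine similarity. The vectors are copied from the twelve-term example (index order: the eleven document terms followed by "ball"). The function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made for this illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for all-zero vectors
    return dot / (norm_a * norm_b)

q  = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
d1 = [2, 2, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0]
d2 = [1, 1, 0, 0, 1, 0, 0, 1, 1, 1, 1, 0]

sim_d1 = cosine_similarity(q, d1)  # 5 / (2 * sqrt(13))
sim_d2 = cosine_similarity(q, d2)  # 2 / (2 * sqrt(7))
print(round(sim_d1, 3), round(sim_d2, 3))
```

Ranking documents for a query then amounts to sorting them by this score in descending order.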

Similarity is a widely used concept, for example in matching, recommendation, and clustering.
