(1) A document is expressed as a set of index terms.
(2) Index terms are usually words, but can also be other linguistic units (e.g., phrases).
(3) Each term (keyword) can be regarded as one dimension of the vector space.
For example:
Document D1: I like playing DotA and I also like playing LoL.
Document D2: I love eating fruit and also like taking photos.
Here, for simplicity, we use the word-segmentation results directly as the index terms:
D1 segmented: I, like, play, DotA, also, I, like, play, LoL
D2 segmented: I, love, eat, fruit, also, like, photo
Taking the union of the segmented terms gives the index vocabulary:
I, like, play, DotA, also, LoL, love, eat, fruit, photo
Index-term selection is far more involved than plain word segmentation; it is simplified here just to introduce the VSM. I will discuss index-term selection in another blog post.
Different index terms matter differently to a document, and this is captured by weights. Weights are usually computed with TF*IDF; that is not covered here, and we simply use the raw term frequency as the weight.
Document D1: (<I,2>, <like,2>, <play,2>, <DotA,1>, <also,1>, <LoL,1>, <love,0>, <eat,0>, <fruit,0>, <photo,0>)
Document D2: (<I,1>, <like,1>, <play,0>, <DotA,0>, <also,1>, <LoL,0>, <love,1>, <eat,1>, <fruit,1>, <photo,1>)
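The term-frequency vectors above can be reproduced with a short sketch (the English token glosses are my assumption, and duplicate entries for the same term are merged into a single count):

```python
from collections import Counter

# Token lists for the two example documents (English glosses of the
# original segmentation; "photo" stands for "take pictures").
d1 = ["I", "like", "play", "DotA", "also", "I", "like", "play", "LoL"]
d2 = ["I", "love", "eat", "fruit", "also", "like", "photo"]

# The index vocabulary is the union of all terms, in first-seen order.
vocab = list(dict.fromkeys(d1 + d2))

def tf_vector(tokens, vocab):
    """Raw term-frequency vector over the shared vocabulary."""
    counts = Counter(tokens)
    return [counts[t] for t in vocab]

v1 = tf_vector(d1, vocab)
v2 = tf_vector(d2, vocab)
print(vocab)
print(v1)  # [2, 2, 2, 1, 1, 1, 0, 0, 0, 0]
print(v2)  # [1, 1, 0, 0, 1, 0, 1, 1, 1, 1]
```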
This gives us what is called the document-term matrix (doc-term matrix):
A matrix A(m×n) can be built from m documents and n index terms, where each row represents a document and each column represents an index term.
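As a sketch, the two example documents can be stacked into such a matrix (assuming NumPy; rows are documents, columns are the ten index terms):

```python
import numpy as np

# Rows = documents (D1, D2); columns = index terms
# (I, like, play, DotA, also, LoL, love, eat, fruit, photo).
A = np.array([
    [2, 2, 2, 1, 1, 1, 0, 0, 0, 0],  # D1
    [1, 1, 0, 0, 1, 0, 1, 1, 1, 1],  # D2
])
print(A.shape)  # (2, 10): m = 2 documents, n = 10 terms
```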
Given query Q: I like to play ball ---> (<I,1>, <like,1>, <play,1>, <DotA,0>, <also,0>, <LoL,0>, <love,0>, <eat,0>, <fruit,0>, <photo,0>, <ball,1>)
Document D1: (<I,2>, <like,2>, <play,2>, <DotA,1>, <also,1>, <LoL,1>, <love,0>, <eat,0>, <fruit,0>, <photo,0>, <ball,0>)
Document D2: (<I,1>, <like,1>, <play,0>, <DotA,0>, <also,1>, <LoL,0>, <love,1>, <eat,1>, <fruit,1>, <photo,1>, <ball,0>)
Calculation of similarity:
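The standard measure is the cosine of the angle between a document vector d and the query vector q (this formula is implied but not written out above):

```
cos(d, q) = (d · q) / (|d| |q|)
          = Σ_i d_i q_i / ( sqrt(Σ_i d_i²) · sqrt(Σ_i q_i²) )
```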
The cosine lies in the range [-1, 1]; the closer the value is to 1, the closer the angle between the two vectors is to 0 and the more consistent their directions, so the higher the corresponding similarity. This is the so-called "cosine similarity".
Computing the cosine for each document shows that document D1 is more similar to Q.
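A minimal sketch of that comparison, using the term-frequency vectors over the 11-term vocabulary listed above (I, like, play, DotA, also, LoL, love, eat, fruit, photo, ball):

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# TF vectors from the example above.
q  = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1]
d1 = [2, 2, 2, 1, 1, 1, 0, 0, 0, 0, 0]
d2 = [1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 0]

print(round(cosine(d1, q), 3))  # 0.775 -> D1 is closer to Q
print(round(cosine(d2, q), 3))  # 0.378
```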
The concept of similarity is widely used, for example in matching, recommendation, and clustering.