Vector space model (VSM)

Last Update:2014-08-06 Source: Internet

Author: User

The vector space model (or term vector model) is an algebraic model for representing text documents as vectors of identifiers, such as index terms. It is used in information filtering, information retrieval, indexing, and relevance ranking. The SMART information retrieval system was the first to use this model.

Documents and queries are represented as vectors: $d_j = (w_{1,j}, w_{2,j}, \ldots, w_{N,j})$ and $q = (w_{1,q}, w_{2,q}, \ldots, w_{N,q})$.

Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, known as (term) weights, have been developed. One of the best-known schemes is TF-IDF weighting (see the example below). The definition of a term depends on the application. Typically, terms are single words, keywords, or longer phrases. If words are chosen as the terms, the dimensionality of the vector is the number of words in the vocabulary (the number of distinct words occurring in the corpus). Vector operations can then be used to compare documents with queries.
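As a minimal sketch of the representation described above, the following builds a vocabulary from a small document set and maps each document to a vector with one dimension per term. The whitespace tokenizer, the function names, and the sample sentences are assumptions made for illustration; they are not part of the original model description.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_vocabulary(documents):
    """Return the sorted list of distinct terms across all documents."""
    return sorted({term for doc in documents for term in tokenize(doc)})

def to_count_vector(text, vocabulary):
    """Map a text to a vector with one dimension per vocabulary term."""
    counts = Counter(tokenize(text))
    # A term that does not occur in the text gets the value 0.
    return [counts[term] for term in vocabulary]

# Hypothetical sample documents.
docs = ["the cat sat on the mat", "the dog sat"]
vocab = build_vocabulary(docs)
vectors = [to_count_vector(d, vocab) for d in docs]
```

Here each word is a term, so the vector length equals the vocabulary size. A real system would usually add steps such as stop-word removal or stemming before counting.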

Relevance rankings of documents in a keyword search can be calculated, following the assumptions of document similarity theory, by comparing the deviation of angles between each document vector and the original query vector, where the query is represented as the same kind of vector as the documents. In practice, it is easier to calculate the cosine of the angle between the vectors than the angle itself:

$$\cos\theta = \frac{d_2 \cdot q}{\|d_2\| \, \|q\|}$$

Here $d_2 \cdot q$ is the dot product of the document vector ($d_2$ in the figure) and the query vector ($q$ in the figure), $\|d_2\|$ is the norm of vector $d_2$, and $\|q\|$ is the norm of vector $q$. The norm of a vector is calculated using the following formula:

$$\|q\| = \sqrt{\sum_{i=1}^{N} q_i^2}$$

Because all vectors in this model are element-wise non-negative, a cosine of zero means that the query vector and the document vector are orthogonal, that is, they do not match (in other words, no query term occurs in the document). For more information, see cosine similarity.
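The cosine computation described above can be sketched in a few lines. The function names are illustrative, and returning 0.0 for an all-zero vector is an assumption of this sketch (the cosine is mathematically undefined in that case).

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    """Euclidean norm: square root of the sum of squared components."""
    return math.sqrt(dot(v, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v."""
    denom = norm(u) * norm(v)
    # Assumption: treat a zero vector as matching nothing.
    return dot(u, v) / denom if denom else 0.0
```

With non-negative term weights the result lies between 0 (no shared terms) and 1 (vectors pointing in the same direction).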

Example: TF-IDF weight

In the classic vector space model proposed by Salton, Wong, and Yang, the weight of a term in a document vector is the product of a local parameter and a global parameter. This is known as the TF-IDF (term frequency-inverse document frequency) model. The weight vector for document $d$ is $v_d = [w_{1,d}, w_{2,d}, \ldots, w_{N,d}]^T$, where

$$w_{t,d} = \mathrm{tf}_{t,d} \cdot \log\frac{|D|}{|\{d' \in D : t \in d'\}|}$$

and

$\mathrm{tf}_{t,d}$ is the frequency of term $t$ in document $d$ (a local parameter).
$\log\frac{|D|}{|\{d' \in D : t \in d'\}|}$ is the inverse document frequency (a global parameter). $|D|$ is the total number of documents in the document set; $|\{d' \in D : t \in d'\}|$ is the number of documents containing the term $t$.
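The weighting described above (term frequency times the logarithm of total documents over documents containing the term) might be computed as follows. This is a sketch under stated assumptions: the natural logarithm is used because the text does not fix a base, documents arrive already tokenized, and the function name `tfidf_vectors` is hypothetical.

```python
import math
from collections import Counter

def tfidf_vectors(tokenized_docs):
    """Return (vocabulary, vectors) with w[t][d] = tf * log(|D| / df)."""
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(t for doc in tokenized_docs for t in set(doc))
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # local parameter: raw term counts
        vectors.append(
            [tf[t] * math.log(n_docs / doc_freq[t]) for t in vocabulary]
        )
    return vocabulary, vectors

# Hypothetical two-document corpus.
corpus = [["apple", "banana", "apple"], ["banana", "cherry"]]
vocab, weights = tfidf_vectors(corpus)
```

Note that a term occurring in every document ("banana" here) gets weight zero, since log(1) = 0; that is the global parameter discounting terms that do not discriminate between documents.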

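The cosine similarity between document $d_j$ and query $q$ is calculated using the following formula:

$$\cos(d_j, q) = \frac{d_j \cdot q}{\|d_j\| \, \|q\|} = \frac{\sum_{i=1}^{N} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{N} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{N} w_{i,q}^2}}$$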

In a simpler term count model, the term weights do not include the global parameter; each weight is simply the number of times the term occurs in the document: $w_{t,d} = \mathrm{tf}_{t,d}$.
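To illustrate the term count model, the sketch below ranks a handful of documents against a query using raw counts as weights and cosine similarity as the score. The sample texts, the whitespace tokenization, and the helper names are assumptions for illustration only.

```python
import math
from collections import Counter

def count_vector(tokens, vocabulary):
    """Term count weights: w = number of occurrences of each term."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    denom = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / denom if denom else 0.0

def rank(query, documents):
    """Return (score, document index) pairs, best match first."""
    tokenized = [d.lower().split() for d in documents]
    vocabulary = sorted({t for doc in tokenized for t in doc})
    # Query terms outside the document vocabulary are ignored here.
    q_vec = count_vector(query.lower().split(), vocabulary)
    scores = [
        (cosine(count_vector(doc, vocabulary), q_vec), i)
        for i, doc in enumerate(tokenized)
    ]
    return sorted(scores, key=lambda s: (-s[0], s[1]))

# Hypothetical documents and query.
documents = ["new york times", "new york post", "los angeles times"]
ranking = rank("new new times", documents)
```

The first document shares both query terms, the second only "new", and the third only "times", so they are ranked in that order.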

Advantages

Compared with the standard Boolean model, the vector space model has the following advantages:

Simple model based on linear algebra
Term weights are not binary
Allows computing a continuous degree of similarity between documents and queries
Allows ranking documents according to their possible relevance
Allows partial matching
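The contrast with the Boolean model can be made concrete with a small sketch. A strict Boolean AND query returns only documents containing every query term, whereas a cosine score over term counts still gives graded credit for partial matches. The sparse `Counter` representation, the function names, and the sample texts are assumptions of this example.

```python
import math
from collections import Counter

def boolean_and_match(query_terms, doc_terms):
    """Boolean AND: true only if every query term occurs in the document."""
    return set(query_terms) <= set(doc_terms)

def cosine_counts(query_terms, doc_terms):
    """Cosine similarity over sparse term count vectors."""
    q, d = Counter(query_terms), Counter(doc_terms)
    dot = sum(q[t] * d[t] for t in q)
    denom = (math.sqrt(sum(c * c for c in q.values()))
             * math.sqrt(sum(c * c for c in d.values())))
    return dot / denom if denom else 0.0

# Hypothetical query and documents.
query = "cheap fast laptop".split()
docs = ["cheap laptop deals", "fast cars", "garden tools"]
boolean_hits = [boolean_and_match(query, d.split()) for d in docs]
vsm_scores = [cosine_counts(query, d.split()) for d in docs]
```

No document satisfies the Boolean AND query, yet the vector scores still order the documents by how much of the query they cover, with the unrelated document scoring zero.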

Limitations

The vector space model has the following limitations:

Long documents are poorly represented because their similarity values are poor (a small scalar product and a high dimensionality).
Search keywords must exactly match the terms in the document; a substring of a word can produce a "false positive" match.
Poor semantic sensitivity: documents with similar context but different term vocabulary are not associated, resulting in a "false negative" match.
The order in which terms appear in the document is lost in the vector representation.
Terms are assumed to be statistically independent.
Weighting is intuitive but not very formal.
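Two of the limitations listed above, the loss of term order and the lack of semantic sensitivity, are easy to reproduce. The sketch below uses a plain bag-of-words representation; the sample sentences and the helper name are assumptions for illustration.

```python
from collections import Counter

def bag_of_words(text):
    """Term counts only; the position of each term is discarded."""
    return Counter(text.lower().split())

# Term order is lost: opposite meanings, identical representation.
same_bag = bag_of_words("dog bites man") == bag_of_words("man bites dog")

# Synonyms share no dimensions, so the dot product (and cosine) is zero.
shared_terms = set(bag_of_words("buy a car")) & set(
    bag_of_words("purchase an automobile")
)
```

The first pair of sentences is indistinguishable to the model, and the second pair has no terms in common, so the model reports no similarity despite the near-identical meaning.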

However, many of these limitations can be overcome by integrating other tools, including mathematical techniques such as singular value decomposition and lexical databases such as WordNet.

Models based on and extending the vector space model

Models based on and extending the vector space model include:

Generalized vector space model
(Enhanced) topic-based vector space model
Latent semantic analysis
Latent semantic indexing
DSIR model
Term discrimination
Rocchio classification

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.