Vector space model (VSM)

Source: Internet
Author: User

The vector space model (also called the term vector model) is an algebraic model for representing text documents as vectors of identifiers, such as index terms. It is used in information filtering, information retrieval, indexing, and relevance ranking. The SMART system was the first information retrieval system to use this model.



Documents and queries are represented as vectors:

$$d_j = (w_{1,j}, w_{2,j}, \ldots, w_{N,j})$$

$$q = (w_{1,q}, w_{2,q}, \ldots, w_{N,q})$$

Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several methods have been developed for computing these values, which are known as (term) weights. One of the best-known schemes is tf-idf weighting (see the example below). The definition of a term depends on the application. Typically, terms are single words, keywords, or longer phrases. If words are chosen as the terms, the dimensionality of the vector is the number of words in the vocabulary (the number of distinct words occurring in the corpus). Vector operations can then be used to compare documents with queries.
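
As a minimal sketch of this representation, the snippet below uses single words as terms and raw counts as weights. The whitespace tokenizer, the helper names (`tokenize`, `build_vocabulary`, `to_vector`), and the sample sentences are assumptions made for illustration only.

```python
from collections import Counter

def tokenize(text):
    """Lowercase the text and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_vocabulary(texts):
    """Return the sorted list of distinct words; each word is one vector dimension."""
    return sorted({word for text in texts for word in tokenize(text)})

def to_vector(text, vocabulary):
    """Map a text to a term-count vector over the given vocabulary."""
    counts = Counter(tokenize(text))
    # Terms absent from the text get a weight of zero.
    return [counts[term] for term in vocabulary]

documents = ["the cat sat on the mat", "the dog chased the cat"]
vocabulary = build_vocabulary(documents)
vectors = [to_vector(doc, vocabulary) for doc in documents]

# The query is mapped into the same space as the documents.
query_vector = to_vector("cat mat", vocabulary)
```

Because the query is mapped over the same vocabulary, it has the same dimensionality as every document vector and can be compared with them directly.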

Relevance rankings of documents in a keyword search can be calculated, under the assumptions of document similarity theory, by comparing the deviation of angles between each document vector and the original query vector, where the query is represented as the same kind of vector as the documents. In practice, it is easier to calculate the cosine of the angle between the vectors than the angle itself:

$$\cos\theta = \frac{d_2 \cdot q}{\lVert d_2 \rVert \, \lVert q \rVert}$$

Here $d_2 \cdot q$ is the dot product of the document vector ($d_2$ in the figure) and the query vector ($q$ in the figure), $\lVert d_2 \rVert$ is the norm of vector $d_2$, and $\lVert q \rVert$ is the norm of vector $q$. The norm of a vector is calculated as:

$$\lVert q \rVert = \sqrt{\sum_{i=1}^{N} q_i^2}$$

Because all vectors in this model are element-wise non-negative, a cosine of zero means that the query vector and the document vector are orthogonal, that is, they have no match (in other words, no query term appears in the document). For more information, see cosine similarity.
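
The cosine computation described above can be sketched in a few lines. The function names and the sample vectors `d2`, `q`, and `unrelated` are illustrative assumptions; returning 0.0 for an all-zero vector is a convention chosen here, not something the model prescribes.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    """Euclidean norm: the square root of the sum of squared components."""
    return math.sqrt(dot(v, v))

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    denominator = norm(u) * norm(v)
    if denominator == 0:
        return 0.0
    return dot(u, v) / denominator

d2 = [1, 0, 2, 0]         # hypothetical document vector
q = [1, 0, 1, 0]          # hypothetical query vector
unrelated = [0, 3, 0, 1]  # shares no non-zero dimension with q

similar_score = cosine_similarity(d2, q)            # 3 / sqrt(10), about 0.95
orthogonal_score = cosine_similarity(q, unrelated)  # 0.0: no shared terms
```

With non-negative weights the score always falls between 0 and 1, so it can be used directly as a ranking key.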

Example: TF-IDF weight

In the classic vector space model proposed by Salton, Wong, and Yang, the weight of a term in a document vector is the product of a local parameter and a global parameter. This is known as the tf-idf (term frequency-inverse document frequency) model. The weight vector for document $d$ is $\mathbf{v}_d = [w_{1,d}, w_{2,d}, \ldots, w_{N,d}]^T$, where

$$w_{t,d} = \mathrm{tf}_{t,d} \cdot \log \frac{|D|}{|\{d' \in D : t \in d'\}|}$$

  • $\mathrm{tf}_{t,d}$ is the frequency of term $t$ in document $d$ (a local parameter)
  • $\log \frac{|D|}{|\{d' \in D : t \in d'\}|}$ is the inverse document frequency (a global parameter). $|D|$ is the total number of documents in the document set; $|\{d' \in D : t \in d'\}|$ is the number of documents containing the term $t$.
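
A minimal sketch of this weighting follows: term frequency multiplied by the logarithm of the total number of documents divided by the number of documents containing the term. The natural logarithm, the function name `tf_idf_vectors`, and the three-document corpus are assumptions for illustration; real systems often use smoothed or normalized variants.

```python
import math
from collections import Counter

def tf_idf_vectors(tokenized_docs):
    """Return (vocabulary, vectors) where each weight is tf * log(|D| / df)."""
    n_docs = len(tokenized_docs)
    vocabulary = sorted({term for doc in tokenized_docs for term in doc})
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for doc in tokenized_docs for term in set(doc))
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # local parameter: raw count of each term in this document
        vectors.append([tf[term] * math.log(n_docs / df[term]) for term in vocabulary])
    return vocabulary, vectors

docs = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "the bird sang".split(),
]
vocabulary, vectors = tf_idf_vectors(docs)

# Weights of the first document, keyed by term for readability.
weights = dict(zip(vocabulary, vectors[0]))
```

In this corpus "the" occurs in every document, so its inverse document frequency is log(3/3) = 0 and its weight vanishes, while "mat", which occurs in only one document, receives the largest weight per occurrence.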

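The cosine similarity between document $d_j$ and query $q$ is calculated as:

$$\cos(d_j, q) = \frac{d_j \cdot q}{\lVert d_j \rVert \, \lVert q \rVert} = \frac{\sum_{i=1}^{N} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{N} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{N} w_{i,q}^2}}$$

where $w_{i,j}$ and $w_{i,q}$ are the weights of term $i$ in document $d_j$ and in query $q$, respectively.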

In a simpler term count model, the term weights do not include the global parameter. Each weight is simply the number of times the term occurs in the document: $w_{t,d} = \mathrm{tf}_{t,d}$.

Advantages

Compared with the standard Boolean model, the vector space model has the following advantages:

  1. It is a simple model based on linear algebra.
  2. Term weights are not binary.
  3. It allows computing a continuous degree of similarity between documents and queries.
  4. It allows ranking documents according to their possible relevance.
  5. It allows partial matching.
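
The contrast with the Boolean model can be sketched as follows. The snippet assumes a strict AND interpretation of the Boolean query and raw term counts as weights; the document collection, the query, and the function names are hypothetical.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors (Counters)."""
    dot = sum(u[term] * v[term] for term in u)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def boolean_and(query, docs):
    """Boolean model (AND): names of documents that contain every query term."""
    terms = set(query.split())
    return [name for name, text in docs.items() if terms <= set(text.split())]

def rank(query, docs):
    """Vector space model: (name, score) pairs sorted by descending cosine."""
    q = Counter(query.split())
    scored = [(name, cosine(q, Counter(text.split()))) for name, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

docs = {
    "a": "vector space model for retrieval",
    "b": "boolean retrieval model",
    "c": "cooking with garlic",
}
query = "vector retrieval ranking"

boolean_hits = boolean_and(query, docs)  # empty: no document contains "ranking"
ranking = rank(query, docs)              # "a" first, then "b", then "c" with score 0.0
```

No document contains all three query terms, so the Boolean AND query returns nothing, whereas the vector space model still orders the documents by how much of the query they match.
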

Limitations

The vector space model has the following limitations:

  1. Long documents are poorly represented because their similarity values are poor (a small inner product and a high dimensionality).
  2. Search terms must exactly match the terms that appear in the document; matching on word substrings can produce false positive matches.
  3. Semantic sensitivity is poor: documents with a similar context but different vocabulary are not associated, resulting in false negative matches.
  4. The order in which terms appear in the document is lost in the vector representation.
  5. Terms are assumed to be statistically independent.
  6. Weighting is intuitive but not very formal.
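
Limitations 2, 3, and 4 are easy to reproduce with a toy bag-of-words representation. The sentences and helper names below are illustrative assumptions, and `substring_match` is a deliberately naive matcher included only to show how substring matching goes wrong.

```python
from collections import Counter

def term_vector(text):
    """Bag-of-words representation: term counts, with word order discarded."""
    return Counter(text.lower().split())

def shares_terms(a, b):
    """True if the two term vectors have a non-zero dot product."""
    return any(term in b for term in a)

def substring_match(query, text):
    """Naive matcher that accepts any substring occurrence of the query."""
    return query in text

# Limitation 4: two sentences with opposite meanings map to the same vector.
same_vector = term_vector("dog bites man") == term_vector("man bites dog")

# Limitation 3: a synonym shares no term with the document (false negative).
car_doc = term_vector("used car for sale")
false_negative = not shares_terms(term_vector("automobile"), car_doc)

# Limitation 2: "cat" is not a term of the text, yet a substring test matches
# it inside "catalog" (false positive).
text = "the catalog is online"
exact_match = "cat" in term_vector(text)
false_positive = substring_match("cat", text)
```

Here `same_vector`, `false_negative`, and `false_positive` all evaluate to `True`, while `exact_match` is `False`.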

However, many of these limitations can be overcome by integrating various tools, including mathematical techniques such as singular value decomposition and lexical databases such as WordNet.

Models based on and extending the vector space model

Models based on and extending the vector space model include:

  • Generalized vector space model
  • (Enhanced) topic-based vector space model
  • Latent semantic analysis
  • Latent semantic indexing
  • DSIR model
  • Term discrimination
  • Rocchio classification


