The vector space model (or term vector model) is an algebraic model for representing text documents as vectors of identifiers, such as index terms. It is used in information filtering, information retrieval, indexing, and relevance ranking. The SMART information retrieval system was the first to use this model.
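Documents and queries are represented as vectors:
$$\mathbf{d_j} = (w_{1,j}, w_{2,j}, \ldots, w_{N,j}), \qquad \mathbf{q} = (w_{1,q}, w_{2,q}, \ldots, w_{N,q})$$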
Each dimension corresponds to a separate term. If a term occurs in the document, its value in the vector is non-zero. Several different ways of computing these values, known as (term) weights, have been developed. One of the best-known schemes is tf-idf weighting (see the example below). The definition of a term depends on the application. Typically a term is a single word, a keyword, or a longer phrase. If words are chosen as the terms, the dimensionality of the vector is the number of words in the vocabulary. Vector operations can then be used to compare documents with queries.
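As a minimal sketch of this representation, the following Python builds a vocabulary from a small collection and maps each text to a vector with one component per term. The helper names (`tokenize`, `build_vocabulary`, `to_vector`), the whitespace tokenizer, and the use of raw counts as weights are illustrative assumptions, not part of the model's definition.

```python
from collections import Counter

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def build_vocabulary(texts):
    """Sorted list of distinct terms; its length is the vector dimension."""
    return sorted({term for text in texts for term in tokenize(text)})

def to_vector(text, vocabulary):
    """Map a text to raw term counts, one component per vocabulary term."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]

documents = ["the cat sat on the mat", "the dog sat"]
vocabulary = build_vocabulary(documents)
vectors = [to_vector(doc, vocabulary) for doc in documents]

# The query is mapped into the same space as the documents.
query_vector = to_vector("cat mat", vocabulary)
```

A term that does not occur in a text gets a zero component, so most components are zero once the vocabulary grows large.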
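Using the assumptions of document similarity theory, relevance rankings of documents in a keyword search can be calculated by comparing the deviation of the angle between each document vector and the original query vector, where the query is represented as the same kind of vector as the documents. In practice, it is easier to calculate the cosine of the angle between the vectors than the angle itself:
$$\cos\theta = \frac{\mathbf{d_2} \cdot \mathbf{q}}{\lVert \mathbf{d_2} \rVert \, \lVert \mathbf{q} \rVert}$$
Here $\mathbf{d_2} \cdot \mathbf{q}$ is the dot product of the document vector ($\mathbf{d_2}$ in the figure) and the query vector ($\mathbf{q}$ in the figure), $\lVert \mathbf{d_2} \rVert$ is the norm of vector $\mathbf{d_2}$, and $\lVert \mathbf{q} \rVert$ is the norm of vector $\mathbf{q}$. The norm of a vector is calculated using the following formula:
$$\lVert \mathbf{q} \rVert = \sqrt{\sum_{i=1}^{N} q_i^2}$$
where $q_i$ is the $i$-th component of $\mathbf{q}$ and $N$ is the number of dimensions.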
Because all vectors considered by this model are element-wise non-negative, a cosine of zero means that the query vector and the document vector are orthogonal, that is, they do not match (in other words, the query term does not occur in the document). For more information, see cosine similarity.
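The cosine computation can be sketched in a few lines of Python. The function names and the sample vectors are hypothetical, and returning 0.0 for a zero vector is a convention assumed here, since the cosine is undefined in that case.

```python
import math

def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def norm(v):
    """Euclidean norm: square root of the sum of squared components."""
    return math.sqrt(dot(v, v))

def cosine(u, v):
    """Cosine of the angle between u and v; 0.0 if either is a zero vector."""
    denom = norm(u) * norm(v)
    return dot(u, v) / denom if denom else 0.0

d2 = [1, 0, 1, 1]         # document sharing two terms with the query
q = [1, 0, 1, 0]          # query vector
unrelated = [0, 1, 0, 0]  # document sharing no terms with the query

similar_score = cosine(d2, q)            # 2 / sqrt(6), roughly 0.816
orthogonal_score = cosine(unrelated, q)  # 0.0: the vectors are orthogonal
```

With non-negative components the result always lies between 0 and 1, so a higher value indicates a closer match.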
Example: TF-IDF weight
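In the classic vector space model proposed by Salton, Wong, and Yang, the weight of a term in a document vector is the product of a local parameter and a global parameter. This is the well-known tf-idf (term frequency-inverse document frequency) model. The weight vector for document $d$ is $\mathbf{v}_d = [w_{1,d}, w_{2,d}, \ldots, w_{N,d}]^T$, where
$$w_{t,d} = \mathrm{tf}_{t,d} \cdot \log \frac{|D|}{|\{d' \in D : t \in d'\}|}$$
and
- $\mathrm{tf}_{t,d}$ is the frequency of term $t$ in document $d$ (a local parameter)
- $\log \frac{|D|}{|\{d' \in D : t \in d'\}|}$ is the inverse document frequency (a global parameter). $|D|$ is the total number of documents in the document set; $|\{d' \in D : t \in d'\}|$ is the number of documents containing the term $t$.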
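A rough sketch of this weighting in Python follows: each weight is the term's frequency in the document multiplied by the logarithm of the total number of documents divided by the number of documents containing the term. The function name `tf_idf_vectors`, the pre-tokenized input, and the choice of the natural logarithm are assumptions made for illustration; the text above does not fix a logarithm base.

```python
import math
from collections import Counter

def tf_idf_vectors(tokenized_docs):
    """Return (vocabulary, vectors) with tf-idf weights for each document."""
    vocabulary = sorted({t for doc in tokenized_docs for t in doc})
    n_docs = len(tokenized_docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(t for doc in tokenized_docs for t in set(doc))
    vectors = []
    for doc in tokenized_docs:
        tf = Counter(doc)  # local parameter: term counts in this document
        vectors.append(
            [tf[t] * math.log(n_docs / df[t]) for t in vocabulary]
        )
    return vocabulary, vectors

docs = [["apple", "banana", "apple"], ["banana", "cherry"], ["banana"]]
vocabulary, vectors = tf_idf_vectors(docs)
```

In this toy collection "banana" occurs in every document, so its inverse document frequency is log(1) = 0 and it contributes nothing to any vector, while "apple" in the first document gets the weight 2 · log(3).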
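The cosine similarity between document $d_j$ and query $q$ is calculated using the following formula:
$$\cos(d_j, q) = \frac{\mathbf{d_j} \cdot \mathbf{q}}{\lVert \mathbf{d_j} \rVert \, \lVert \mathbf{q} \rVert} = \frac{\sum_{i=1}^{N} w_{i,j} \, w_{i,q}}{\sqrt{\sum_{i=1}^{N} w_{i,j}^2} \, \sqrt{\sum_{i=1}^{N} w_{i,q}^2}}$$
where $w_{i,j}$ and $w_{i,q}$ are the weights of term $i$ in document $d_j$ and in query $q$.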
In a simpler term count model, the term weights do not include the global parameter; each weight is simply the number of times the term occurs: $w_{t,d} = \mathrm{tf}_{t,d}$.
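Putting the pieces together, the sketch below ranks a few documents against a query by cosine similarity, using plain term counts as weights. The function names, the whitespace tokenizer, and the sample texts are hypothetical; query terms that are missing from the document vocabulary are simply ignored in this version.

```python
import math
from collections import Counter

def count_vector(tokens, vocabulary):
    """Term-count weights: one component per vocabulary term."""
    counts = Counter(tokens)
    return [counts[t] for t in vocabulary]

def cosine(u, v):
    """Cosine similarity; 0.0 when either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    denom = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / denom if denom else 0.0

def rank(documents, query):
    """Return (score, document index) pairs, best match first."""
    tokenized = [doc.lower().split() for doc in documents]
    vocabulary = sorted({t for doc in tokenized for t in doc})
    q_vec = count_vector(query.lower().split(), vocabulary)
    scored = [
        (cosine(count_vector(doc, vocabulary), q_vec), i)
        for i, doc in enumerate(tokenized)
    ]
    # Sort by descending score; ties keep the original document order.
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

documents = ["red apple pie", "green apple", "blue sky"]
ranking = rank(documents, "apple pie")
```

Swapping `count_vector` for a tf-idf weighting would change the scores but not the ranking procedure itself.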
Advantages
Compared with the standard Boolean model, the vector space model has the following advantages:
- It is a simple model based on linear algebra.
- Term weights are not binary.
- It allows a continuous degree of similarity between documents and queries to be computed.
- It allows documents to be ranked according to their possible relevance.
- It allows partial matching.
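The difference from Boolean retrieval can be illustrated with a small Python comparison. This is only a sketch: `boolean_and` stands in for a strict conjunctive Boolean query, `cosine_counts` uses raw term counts as weights, and the sample documents are made up.

```python
import math
from collections import Counter

def boolean_and(doc, query):
    """Strict Boolean AND: every query term must occur in the document."""
    return set(query) <= set(doc)

def cosine_counts(doc, query):
    """Cosine similarity over term-count vectors, computed sparsely."""
    d, q = Counter(doc), Counter(query)
    dot = sum(d[t] * q[t] for t in q)
    denom = math.sqrt(sum(c * c for c in d.values())) * math.sqrt(
        sum(c * c for c in q.values())
    )
    return dot / denom if denom else 0.0

query = ["vector", "space", "model"]
docs = {
    "full": ["vector", "space", "model"],
    "partial": ["vector", "space", "retrieval"],
    "none": ["boolean", "logic"],
}

boolean_hits = {name: boolean_and(doc, query) for name, doc in docs.items()}
scores = {name: cosine_counts(doc, query) for name, doc in docs.items()}
```

The Boolean query accepts only the "full" document, whereas the cosine scores place "partial" (about 0.67) between "full" (about 1.0) and "none" (0.0), which is what makes ranking possible.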
Limitations
The vector space model has the following limitations:
- Long documents are poorly represented because their similarity values are poor (a small inner product and a high dimensionality).
- Search terms must exactly match the terms that appear in the document; matching on word substrings can lead to "false positive" matches.
- Semantic sensitivity is poor; documents with a similar context but a different vocabulary are not associated, resulting in "false negative" matches.
- The order in which terms appear in the document is lost in the vector representation.
- Terms are assumed to be statistically independent.
- The weighting is intuitive rather than formally derived.
However, most of these limitations can be overcome by integrating various tools, including mathematical techniques such as singular value decomposition and lexical databases such as WordNet.
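Some of the limitations listed above are easy to reproduce with a bag-of-words representation. The Python snippet below is a small illustration with made-up sentences; the `bag` helper is a hypothetical name, and whitespace tokenization is an assumption.

```python
from collections import Counter

def bag(text):
    """Bag-of-words representation: term counts, word order discarded."""
    return Counter(text.lower().split())

# Word order is lost: both sentences map to the same vector.
order_lost = bag("dog bites man") == bag("man bites dog")

# Different vocabulary, similar meaning: no shared terms, so the dot
# product (and hence the cosine) is zero, a "false negative" match.
shared_terms = set(bag("car repair")) & set(bag("automobile maintenance"))

# Substring matching finds "cat" inside "concatenate" (a false positive),
# while matching on whole tokens does not.
substring_hit = "cat" in "concatenate strings"
token_hit = "cat" in "concatenate strings".split()
```

Here `order_lost` and `substring_hit` are true, `shared_terms` is empty, and `token_hit` is false.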
Models based on and extending the vector space model
Models based on and extending the vector space model include:
- Generalized Vector Space Model
- (Enhanced) topic-based vector space model
- Latent semantic analysis
- Latent semantic indexing
- DSIR Model
- Term discrimination
- Rocchio Classification