Latent semantic analysis (LSA) uses a vector semantic space to analyze the relationships between documents and words.
Basic hypothesis: if two words appear together in the same documents many times, then the two words are semantically similar to each other.
LSA builds a matrix from a large body of text in which each row represents a word and each column represents a document; the matrix elements can be raw frequencies or TF-IDF weights. Singular value decomposition (SVD) is then applied to reduce the dimensionality of the matrix and obtain an approximation of the original matrix. In this reduced space, the similarity of two words can be measured by the cosine of the angle between their vectors.
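As a concrete illustration of this pipeline, here is a minimal scikit-learn sketch; the toy corpus `docs` and the choice of two latent dimensions are made up for the example.

```python
# A minimal LSA pipeline (illustrative toy corpus, k = 2 latent dimensions).
# Note: TfidfVectorizer produces a documents-by-terms matrix, i.e. the
# transpose of the terms-by-documents matrix described above.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "the cat sat on the mat",
    "a dog chased the cat",
    "stock markets fell sharply today",
    "investors sold stock as markets dropped",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)     # documents x terms, TF-IDF weights

svd = TruncatedSVD(n_components=2)     # truncated SVD = the dimensionality reduction step
doc_vectors = svd.fit_transform(X)     # documents mapped into the latent semantic space
term_vectors = svd.components_.T       # terms mapped into the latent semantic space

# Cosine similarity between documents in the reduced space.
print(cosine_similarity(doc_vectors))
```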
Reasons for dimensionality reduction:
- The original matrix is too large; the new matrix is an approximation of the original matrix.
- The original matrix is noisy; dimensionality reduction is also a denoising process.
- The original matrix is too sparse.
- Dimensionality reduction can partly resolve synonymy and polysemy.
Derivation:
A document collection can be represented as a matrix $X$, with one row per word and one column per document.
The dot product of two word (row) vectors represents the similarity of those two words across the document collection. The matrix $X X^T$ contains the results of all pairwise dot products of the word vectors; likewise, $X^T X$ contains all pairwise dot products of the document vectors.
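A tiny numpy illustration of this point, using a made-up 3-word by 4-document count matrix:

```python
import numpy as np

# Toy terms-by-documents count matrix X (3 words, 4 documents; numbers are made up).
X = np.array([
    [2, 0, 1, 0],   # word 0
    [1, 0, 2, 0],   # word 1
    [0, 3, 0, 1],   # word 2
])

# Entry (i, p) of X @ X.T is the dot product of word i's and word p's
# document vectors, i.e. a co-occurrence-based word-word similarity.
print(X @ X.T)

# Entry (j, q) of X.T @ X is, likewise, a document-document similarity.
print(X.T @ X)
```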
The dimensionality reduction is in fact carried out by singular value decomposition: the matrix $X$ can be decomposed into the product of two orthogonal matrices $U$ and $V$ and a diagonal matrix $\Sigma$, i.e. $X = U \Sigma V^T$.
Therefore, the word-word and document-document correlation matrices can be expressed as:

$$X X^T = U \Sigma V^T \, V \Sigma^T U^T = U \Sigma \Sigma^T U^T$$
$$X^T X = V \Sigma^T U^T \, U \Sigma V^T = V \Sigma^T \Sigma V^T$$

Because $\Sigma \Sigma^T$ is a diagonal matrix, $U$ must be the matrix of eigenvectors of $X X^T$, and in the same way $V$ is the matrix of eigenvectors of $X^T X$. The eigenvalues corresponding to these eigenvectors are the diagonal elements of $\Sigma \Sigma^T$, i.e. the squared singular values. In summary, the decomposition looks like this:

$$X = U \Sigma V^T$$

The diagonal entries $\sigma_i$ of $\Sigma$ are called singular values, and the columns $u_i$ of $U$ and $v_i$ of $V$ are called the left and right singular vectors. From the decomposition it can be seen that row $i$ of the original matrix $X$ is related only to row $i$ of $U$; we call that row $t_i^T$, the vector of word $i$. In the same vein, column $j$ of the original matrix $X$ is related only to row $j$ of $V$ (column $j$ of $V^T$), which we call $d_j$, the vector of document $j$. Note that $t_i$ and $d_j$ are not eigenvectors, but they are determined by all of the singular values of the matrix.

When we choose the $k$ largest singular values and multiply them with their corresponding vectors from $U$ and $V$, we obtain the rank-$k$ approximation of $X$:

$$X_k = U_k \Sigma_k V_k^T$$

Among all rank-$k$ matrices, $X_k$ has the minimum error with respect to $X$ (the Frobenius norm of the residual matrix $X - X_k$). But what really matters is that word vectors and document vectors can now be mapped into a semantic space. Multiplying the low-dimensional word vector $\hat{t}_i$ (row $i$ of $U_k$) by the matrix $\Sigma_k$ containing the $k$ singular values realizes the transformation from the high-dimensional space to the low-dimensional space, and can be understood as a low-dimensional approximation of the high-dimensional word vector. Similarly, the document vector $\hat{d}_j$ (row $j$ of $V_k$) undergoes the same kind of transformation from the high-dimensional space to the low-dimensional space. This transformation is summed up in the formula $X_k = U_k \Sigma_k V_k^T$. With this transformation, you can do the following things (a worked numpy sketch follows the list below):
- Determine how similar two documents $j$ and $q$ are in the low-dimensional space by comparing the vectors $\Sigma_k \hat{d}_j$ and $\Sigma_k \hat{d}_q$ (for example, with the cosine of the angle between them).
- Judge the similarity of two words $i$ and $p$ by comparing the vectors $\Sigma_k \hat{t}_i$ and $\Sigma_k \hat{t}_p$.
- With these similarities, words and documents can be clustered.
- Given a query string, compute its similarity to the existing documents in the semantic space.
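The worked sketch referenced above: a toy terms-by-documents matrix, its SVD, the rank-$k$ approximation, and the similarity comparisons from the list. The matrix entries and the choice $k = 2$ are made up for illustration.

```python
import numpy as np

# Toy terms-by-documents matrix X (made-up counts), as in the derivation above.
X = np.array([
    [2.0, 0.0, 1.0, 0.0],
    [1.0, 0.0, 2.0, 0.0],
    [0.0, 3.0, 0.0, 1.0],
    [0.0, 1.0, 0.0, 2.0],
])

U, s, Vt = np.linalg.svd(X, full_matrices=False)   # X = U diag(s) Vt

k = 2                                              # keep the k largest singular values
U_k, S_k, Vt_k = U[:, :k], np.diag(s[:k]), Vt[:k, :]
X_k = U_k @ S_k @ Vt_k                             # best rank-k approximation of X

# Low-dimensional representations: row i of U_k @ S_k is Sigma_k * t_hat_i,
# and row j of (S_k @ Vt_k).T is Sigma_k * d_hat_j.
word_vecs = U_k @ S_k
doc_vecs = (S_k @ Vt_k).T

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print("Frobenius norm of the residual X - X_k:", np.linalg.norm(X - X_k))
print("similarity of words 0 and 1:", cosine(word_vecs[0], word_vecs[1]))
print("similarity of docs 0 and 2:", cosine(doc_vecs[0], doc_vecs[2]))
```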
To compare the similarity between a query string and the existing documents, both the documents and the query string must be mapped into the semantic space. An original document is mapped with the formula $\hat{d}_j = \Sigma_k^{-1} U_k^T d_j$; the inverse of the diagonal matrix $\Sigma_k$ is easily obtained by taking the reciprocals of its non-zero diagonal elements. Similarly, for a query string, first build the vector $q$ of its word counts, then map it into the semantic space with the same formula, $\hat{q} = \Sigma_k^{-1} U_k^T q$, and finally compare it with the document vectors.
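Continuing the numpy sketch above (reusing the toy `U_k`, `S_k`, `doc_vecs`, and the `cosine` helper defined there), the query mapping might look like this; the query vector is again made up:

```python
# Fold a query's term-count vector into the semantic space: q_hat = Sigma_k^{-1} U_k^T q.
q = np.array([1.0, 1.0, 0.0, 0.0])        # toy counts of the 4 vocabulary words in the query
q_hat = np.linalg.inv(S_k) @ U_k.T @ q    # inverse of a diagonal matrix: reciprocals of its non-zero diagonal

# Rank documents by cosine similarity to the query in the low-dimensional space
# (comparing Sigma_k * q_hat with Sigma_k * d_hat_j, i.e. the rows of doc_vecs).
scores = [cosine(S_k @ q_hat, doc_vecs[j]) for j in range(doc_vecs.shape[0])]
print(scores)
```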
The low-dimensional semantic space can be used in the following areas:
- In the low-dimensional semantic space, documents can be compared with one another, which can be used for document clustering and document classification (a small clustering sketch follows this list).
- By training on a corpus of translated documents, similar documents can be found across languages, which can be used for cross-language retrieval.
- Relationships between words can be discovered, which can be used for synonym and polysemy detection.
- By mapping a query into the semantic space, information retrieval can be performed.
- The relevance of words can be judged from a semantic point of view, which can be used in multiple-choice question answering models.
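As mentioned in the first bullet, here is a clustering sketch; it reuses `doc_vectors` from the scikit-learn example near the top of this note, and the number of clusters is a toy choice.

```python
from sklearn.cluster import KMeans

# Cluster documents by their LSA vectors; n_clusters=2 is an illustrative guess.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(doc_vectors)
print(labels)   # cluster id assigned to each document
```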
Some of the disadvantages of LSA are as follows:
- The dimensions of the newly generated matrix are hard to interpret. For example, in the transformation
  {(car), (truck), (flower)} → {(1.3452 * car + 0.2828 * truck), (flower)}
  the component (1.3452 * car + 0.2828 * truck) can be interpreted as "vehicle". At the same time, a transformation such as
  {(car), (bottle), (flower)} → {(1.3452 * car + 0.2828 * bottle), (flower)}
  can also occur and has no natural interpretation. The reason for these hard-to-explain results is that SVD is only a mathematical transformation and its dimensions need not correspond to any concept in reality.
- LSA cannot capture polysemy (one word with several meanings). In the original word-document matrix each word corresponds to a single row, so it is treated as having only one meaning in every document; for example, "chair" in "the chair of the board" and in "the chair maker" within the same article will be considered the same. In the semantic space, the vector of a polysemous word is the average of its multiple meanings; correspondingly, if one of those meanings occurs particularly often, the semantic vector is tilted toward it.
- LSA shares the drawback of the bag-of-words model: it ignores the order of words within an article or a sentence.
- The probabilistic model underlying LSA assumes that documents and words follow a joint Gaussian distribution, whereas observed data follow a Poisson distribution. An improved algorithm, pLSA, therefore uses a multinomial distribution and performs better than LSA.
Excerpt from: http://blog.csdn.net/roger__wong/article/details/41175967 (Latent Semantic Analysis LSA)