Looking at Latent Semantic Indexing (LSI) from Singular Value Decomposition (SVD)


1. Introduction to SVD

SVD (Singular Value Decomposition) is a matrix decomposition method. The theorem is as follows:

Theorem: If A is an m × n complex matrix, there exist an m-order unitary matrix U and an n-order unitary matrix V such that:

A = U * S * V', where S = diag(σ1, σ2, ..., σr), σi > 0 (i = 1, ..., r), and r = rank(A).

Matrix A is our original feature matrix. In text mining, A is a matrix with t (term) rows and d (document) columns: each column is an article, each row is a word, and each cell holds the number of times that word appears in that article. U is then a t × r matrix, S is an r × r diagonal matrix, and V' is an r × d matrix, where r is the rank of A. The columns of U and V are the singular vectors of A, and the diagonal entries of S are the singular values of A. The orthonormal eigenvectors of AA' form U, and the orthonormal eigenvectors of A'A form V; the eigenvalues of both (they are the same) are the diagonal entries of SS', i.e. the squared singular values. (What is rank? What is an eigenvalue? How is this decomposition actually computed? See the links at the end of the article.)

Note that this formula is an equality: the left-hand side equals the right-hand side exactly. In other words, we have only rewritten the original matrix A in another form, without losing any information, much like 24 = 2*3*4. Therefore, if LSI used SVD directly, not only would r be out of our control, but r would most likely be too large to achieve any dimensionality reduction; the operation would not reduce the dimension at all, yet it would still consume a lot of computing time. As a matrix decomposition method, SVD is of course not used only in LSI. MATLAB has a built-in SVD function that can be called directly: [U, S, V] = svd(A)
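
As a quick illustration of the point above (in Python with NumPy rather than MATLAB; the tiny term-document matrix is made up for this sketch), the full SVD reconstructs A exactly:

import numpy as np

# A tiny made-up term-document matrix: rows are terms, columns are documents,
# and each cell is the number of times the term occurs in the document.
A = np.array([
    [2, 0, 1, 0],
    [1, 3, 0, 0],
    [0, 1, 0, 2],
    [0, 0, 4, 1],
    [1, 0, 0, 3],
], dtype=float)

# NumPy's counterpart of MATLAB's [U, S, V] = svd(A); note that Vt here is V'.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# The equality A = U * S * V' holds exactly: nothing is lost by the decomposition.
print(np.allclose(A, U @ np.diag(s) @ Vt))   # True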

 

2. How LSI uses SVD

LSI makes a slight change to SVD: sort the r diagonal elements of S, keep only the largest k values (k < r), and set the remaining r − k values to zero. It can then be proved that the right-hand side of the equation is the best approximation of the left-hand side in the least-squares sense. In effect, this process ranks the features of the dataset (characterized in SVD by the singular values) by importance; dimensionality reduction then consists of discarding the unimportant feature vectors, and the space spanned by the remaining feature vectors is the reduced space.
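
A minimal sketch of this truncation step, again in Python with NumPy (the matrix size and k are arbitrary, chosen only for illustration):

import numpy as np

def rank_k_approximation(A, k):
    # Best rank-k approximation of A in the least-squares (Frobenius norm) sense.
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    s_k = np.zeros_like(s)
    s_k[:k] = s[:k]                        # keep the k largest singular values, zero the rest
    return U @ np.diag(s_k) @ Vt

A = np.random.rand(50, 20)                 # stand-in term-document matrix
A_k = rank_k_approximation(A, 2)
print(np.linalg.matrix_rank(A_k))          # 2: the data now lives in a low-dimensional subspace
print(np.linalg.norm(A - A_k))             # the deviation grows as k shrinks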

Here we arrive at the most important insight: LSI achieves dimensionality reduction by dropping unimportant feature vectors, and because those feature vectors are computed through matrix operations, LSI not only loses information during dimensionality reduction but also changes it. The reduced dataset is only an approximation of, not equivalent to, the original dataset. The more the dimensionality is reduced, the greater the deviation from the original information.

 

3. Applicability of LSI

1) Feature dimensionality reduction

In essence, LSI maps each feature into a lower-dimensional subspace, so dimensionality reduction is its home field. Another hard-working tiller of that field is TFIDF. TFIDF scores the importance of each word with a simple formula (term frequency multiplied by inverse document frequency), keeps the k most important words, and discards the rest; here information is only lost, never changed. TFIDF is far more efficient to run than LSI, but LSI gives better results than TFIDF (at least in academia).
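
A minimal sketch of TFIDF-style term selection, assuming scikit-learn is installed (the tiny corpus and k are made up for illustration):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "svd decomposes the term document matrix",
    "lsi keeps only the largest singular values",
    "tfidf scores words by frequency and rarity",
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)              # documents x terms, tf-idf weighted

# Rank terms by their largest tf-idf weight in the corpus and keep the top k;
# the other words are simply dropped, so information is lost but not changed.
k = 5
scores = X.toarray().max(axis=0)
terms = vectorizer.get_feature_names_out()
top_k = [terms[i] for i in np.argsort(scores)[::-1][:k]]
print(top_k)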

However, note that whichever of these dimensionality-reduction methods you choose, some information deviation will occur, and it will affect the accuracy of subsequent classification or clustering. Dimensionality reduction is about greatly improving running efficiency and saving memory at an acceptable cost. If you do not need to reduce the dimensionality, then don't: for example, if you only have a few thousand documents to process, there is really no need.

 

2) Word relevance calculation

Transforming the result of LSI yields the correlation between different words (a real number between 0 and 1). Words with high correlation often have similar meanings. However, do not be misled by the name "latent semantics": the so-called latent semantics is only similarity in a statistical sense. If you want synonyms, a synonym dictionary is the reliable tool. The "synonyms" produced by LSI are not necessarily synonyms (they may not even share a part of speech); they are simply words that tend to appear in similar contexts (such as "Warcraft" and "Dota"). In practice, LSI is not used much for word-relevance calculation: on the one hand, there are plenty of ready-made synonym dictionaries; on the other, more people trust the results of supervised learning (classification).
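
A minimal sketch of how such a correlation can be computed, under the assumption that each word is represented by a row of U_k * S_k and compared by cosine similarity (the matrix and k are stand-ins):

import numpy as np

A = np.random.rand(100, 30)                  # stand-in term-document count matrix
k = 5

U, s, Vt = np.linalg.svd(A, full_matrices=False)
term_vectors = U[:, :k] * s[:k]              # each row: one word in the k-dimensional latent space

def word_similarity(i, j):
    # Cosine similarity between words i and j in the latent space.
    vi, vj = term_vectors[i], term_vectors[j]
    return float(vi @ vj / (np.linalg.norm(vi) * np.linalg.norm(vj)))

print(word_similarity(0, 1))                 # high values mean the words occur in similar contexts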

 

3) Clustering

I have not seen LSI used directly for clustering, but later variants in this family, pLSI and LDA, are indeed used for clustering, and LDA clustering makes some sense (because it assumes a joint probability distribution over latent topics). LSI itself is not well suited to clustering: in essence, LSI looks for feature subspaces, while clustering methods look for groupings of instances. Although the output of LSI can look like a clustering result, it does not mean what clustering means. An obvious example: on an unevenly distributed sample set (say, 1000 news articles and 10 literature articles), LSI/pLSI tends to produce a relatively even split (500 articles in class A and 600 articles in class B), which is not a good clustering at all. Compared with the traditional k-means method, the LSI family of algorithms not only introduces information deviation (loss and change) but also cannot handle unevenly distributed sample sets.

For LSI/pLSI, what really gets clustered is not the documents but the words. A variant usage, therefore, is that when k is set large enough, LSI/pLSI can output groups of words that fall into different subspaces, and the words within a group usually have close semantic relationships. In effect, this usage is just dimensionality reduction applied to word-relevance calculation.
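
A minimal sketch of this variant usage, assigning each word to the latent dimension on which its scaled loading is largest (the matrix, k, and the grouping rule are assumptions made only for illustration):

import numpy as np

A = np.random.rand(100, 30)                  # stand-in term-document matrix
k = 10

U, s, Vt = np.linalg.svd(A, full_matrices=False)
loadings = np.abs(U[:, :k] * s[:k])          # magnitude of each word's loading per latent dimension
dominant_dim = loadings.argmax(axis=1)       # one latent dimension per word

for d in range(k):
    members = np.flatnonzero(dominant_dim == d)
    print("dimension", d, ":", members[:10])  # words sharing a dimension tend to be semantically close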

 

Note:

Some information in this article comes from Baidu Encyclopedia. If you are interested in the mathematics, refer to the following topics: singular value decomposition, matrix eigenvalues, and determinants.

Tip: the determinant is the source of all of this. If you cannot make sense of the formula for the determinant, start from there.
