Singular Value Decomposition and Its Applications (PCA & LSA)


Much of the underlying mathematics is omitted here; readers who feel weak in that area should first read Chapter 18 of "Introduction to Information Retrieval". The main background needed is: eigenvalues and eigenvectors of a square matrix, diagonalization of a square matrix, the Singular Value Decomposition of a general matrix, and low-rank matrix approximation. This post focuses on two applications of Singular Value Decomposition: PCA (principal component analysis) and LSA (latent semantic analysis).

PCA:

For background on PCA, see http://blog.csdn.net/lu597203933/article/details/42544547. Here we mainly explain how to derive PCA from the perspective of SVD.

PCA is used to find the principal axes along which the data varies. As is well known, a principal axis direction is the eigenvector corresponding to the largest eigenvalue of the covariance matrix of the data, after the samples have been z-score normalized (so that each feature has mean 0 and variance 1). Now recall the definition of SVD:

Assume A is an m × n matrix, where m is the number of samples and n is the number of features, and let A = U·D·Vᵀ be its Singular Value Decomposition. Then U is an m × m square matrix whose columns are the orthonormal eigenvectors of A·Aᵀ; V is an n × n matrix whose columns are the orthonormal eigenvectors of Aᵀ·A; and D is a diagonal m × n matrix whose entries are the arithmetic square roots of the eigenvalues of A·Aᵀ (equivalently, of Aᵀ·A), arranged in descending order.
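In symbols, the decomposition just described is (a compact restatement; many texts write Σ for D):

$$
A_{m \times n} = U_{m \times m} \, D_{m \times n} \, V_{n \times n}^{T},
\qquad
(A A^{T})\, u_i = \sigma_i^{2}\, u_i,
\qquad
(A^{T} A)\, v_i = \sigma_i^{2}\, v_i,
\qquad
\sigma_1 \ge \sigma_2 \ge \cdots \ge 0 .
$$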

Therefore, after the data are normalized, Aᵀ·A is, up to a constant factor of 1/(m − 1), the covariance matrix of the n features, so the top-k columns of V are the first k principal axes of the PCA dimensionality reduction. Write them as [u1, u2, ..., uk], where each ui is an n-dimensional vector. For a sample x(i) (n-dimensional), [x(i)ᵀ·u1, x(i)ᵀ·u2, ..., x(i)ᵀ·uk] is its representation after reduction to k dimensions.
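The procedure can be sketched in a few lines of numpy (a minimal sketch; the toy data and variable names are illustrative, not from the original post):

```python
import numpy as np

# Toy data: m = 6 samples, n = 3 features (illustrative values only).
A = np.array([[2.5, 2.4, 0.5],
              [0.5, 0.7, 2.1],
              [2.2, 2.9, 0.3],
              [1.9, 2.2, 0.8],
              [3.1, 3.0, 0.1],
              [2.3, 2.7, 0.6]])

# z-score normalization: each feature gets mean 0 and variance 1.
Z = (A - A.mean(axis=0)) / A.std(axis=0)

# SVD of the normalized data. Columns of V (rows of Vt) are the
# orthonormal eigenvectors of Z.T @ Z, i.e. of the covariance matrix
# up to the factor 1/(m - 1); singular values come out in descending order.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)

k = 2                   # number of principal axes to keep
W = Vt[:k].T            # n x k matrix [u1, ..., uk] of principal axes
X_reduced = Z @ W       # each row is [x(i)^T u1, ..., x(i)^T uk]
print(X_reduced.shape)  # (6, 2)
```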

LSA:

LSA (latent semantic analysis) is also known as LSI (latent semantic indexing). It is mainly used to address the problems of synonymy (many words, one meaning) and polysemy (one word, many meanings) in the vector space model. A few basic concepts first:

Vector space model: the vector space model is the most commonly used retrieval method in information retrieval. All documents in the document set D and all queries are represented as word-based vectors whose feature values are the TF-IDF weights of the corresponding words. The similarity between a query q and each document di is then measured by the cosine distance between their vectors, which yields the documents most relevant to the given query.
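As a minimal sketch of the retrieval step (the vectors here are hypothetical TF-IDF weights, introduced only for illustration):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two vectors; defined as 0 if either is all-zero.
    na, nb = np.linalg.norm(a), np.linalg.norm(b)
    return 0.0 if na == 0 or nb == 0 else float(a @ b / (na * nb))

# Hypothetical TF-IDF vectors for a query q and two documents d1, d2.
q  = np.array([0.8, 0.0, 0.5])
d1 = np.array([0.7, 0.1, 0.4])
d2 = np.array([0.0, 0.9, 0.0])
print(cosine(q, d1), cosine(q, d2))  # d1 ranks above d2
```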

TF-IDF: term frequency (TF) is the frequency with which a given word appears in a document. This number is the raw term count normalized to prevent a bias toward long documents (a word will tend to have a higher raw count in a long document than in a short one, whether or not it is important). For word i in document j, its importance can be expressed as:


$$
\mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_{k} n_{k,j}}
$$

where n_{i,j} is the number of times word i appears in document j, and the denominator is the total count of all words in document j.

Inverse document frequency (IDF) is a measure of a word's general importance. The IDF of a particular word is obtained by dividing the total number of documents by the number of documents containing the word, and then taking the logarithm of the quotient:

$$
\mathrm{idf}_{i} = \log \frac{|D|}{\left|\{\, j : t_i \in d_j \,\}\right|}
$$

Where

· |D|: total number of documents in the corpus

· |{j : t_i ∈ d_j}|: the number of documents containing word t_i. If the word does not appear in the corpus this count is zero, so 1 + |{j : t_i ∈ d_j}| is commonly used as the denominator instead.

The TF-IDF weight is then the product of the two: tfidf_{i,j} = tf_{i,j} × idf_i.

A high term frequency within a particular document, combined with a low document frequency for that word across the whole collection, produces a high TF-IDF weight. TF-IDF therefore tends to filter out common words and retain important ones.
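Putting the two formulas together on a toy corpus (the corpus, and the choice of adding 1 to the IDF denominator, are assumptions made for illustration):

```python
import math
from collections import Counter

docs = [
    "the cat sat on the mat".split(),
    "the dog sat on the log".split(),
    "cats and dogs".split(),
]

vocab = sorted({w for d in docs for w in d})
N = len(docs)  # |D|, total number of documents

def tf(word, doc):
    # n_{i,j} / sum_k n_{k,j}: count of the word over the document length.
    counts = Counter(doc)
    return counts[word] / len(doc)

def idf(word):
    # log(|D| / (1 + number of documents containing the word));
    # the +1 avoids a zero denominator for out-of-corpus words.
    df = sum(1 for d in docs if word in d)
    return math.log(N / (1 + df))

tfidf = [[tf(w, d) * idf(w) for w in vocab] for d in docs]
for d, row in zip(docs, tfidf):
    print(" ".join(d), "->", [round(x, 3) for x in row])
# Common words like "the" (present in 2 of 3 documents) get weight ~0.
```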

Boolean model: a document can also be represented as a binary vector, in which the feature value for a word is 1 if the word appears in the document and 0 otherwise. This is called the Boolean model.

Now to the main point: behind the endlessly varied combinations of words lies a latent semantic structure space (also called the concept space). LSA estimates this structure and uses it to remove noise.

For example, take two documents x(1) and x(2), one containing the word "study" and the other the word "learn". Computing their similarity with the cosine measure gives 0; but if we project both onto a vector in a suitable direction, for example the line y = x, they become positively correlated, as in the sketch below.
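A tiny numeric check of this example (the two-axis encoding, one axis per word, is my own illustration):

```python
import numpy as np

# Axis 0 = count of "study", axis 1 = count of "learn".
x1 = np.array([1.0, 0.0])   # document containing only "study"
x2 = np.array([0.0, 1.0])   # document containing only "learn"

cos = x1 @ x2 / (np.linalg.norm(x1) * np.linalg.norm(x2))
print(cos)  # 0.0 -- the raw word space sees no similarity

# Project both onto the direction y = x, i.e. the unit vector (1, 1)/sqrt(2).
d = np.array([1.0, 1.0]) / np.sqrt(2)
p1, p2 = (x1 @ d) * d, (x2 @ d) * d
cos_proj = p1 @ p2 / (np.linalg.norm(p1) * np.linalg.norm(p2))
print(cos_proj)  # 1.0 -- after projection they are positively correlated
```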


At present, the concept space is estimated using the statistical method of Singular Value Decomposition. I am not entirely clear why SVD is the right estimator here; unlike with PCA, I cannot derive it mathematically, so I will only describe how to compute it. (If you understand this more deeply, please leave a comment. Thank you!)

Suppose we have a term-document matrix A with 1,000,000 terms and 500,000 documents, so the original matrix is 1,000,000 × 500,000 (an entry is 1 if the word appears in the document and 0 otherwise). Performing SVD gives U (1,000,000 × 1,000,000), D (1,000,000 × 500,000), and V (500,000 × 500,000), with at most 500,000 singular values. The idea of LSA is to keep only the singular values of a few hundred topics (say 100) and use the product U(1,000,000 × 100) × D(100 × 100) × V(100 × 500,000) to represent the latent semantic structure space. The term-document matrix is still 1,000,000 × 500,000, but it now incorporates the semantic information of the data and alleviates the synonymy problem, so we can recompute the similarity between documents. The specific steps of the algorithm are described below.
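At toy scale, the computation might look like this (a sketch: the small 0/1 matrix stands in for the 1,000,000 × 500,000 one, and k plays the role of the 100 topics):

```python
import numpy as np

# Toy term-document matrix: rows = terms, columns = documents,
# entry 1 if the term occurs in the document, else 0.
A = np.array([[1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 0, 0, 1]], dtype=float)

U, s, Vt = np.linalg.svd(A, full_matrices=False)

k = 2  # number of latent topics to keep
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]  # same shape as A

# Document similarities recomputed in the smoothed space: compare columns.
def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

print(cosine(A[:, 0], A[:, 1]), cosine(A_k[:, 0], A_k[:, 1]))
```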


How should we interpret this? First consider A·Aᵀ: the entry in row i, column j is the number of documents in which term i and term j appear together. Likewise, the entry in row i, column j of Aᵀ·A is the number of words that documents i and j have in common. What, then, do their eigenvector matrices U and V mean?


We can regard each of the 100 retained dimensions as a latent topic (or concept). Then the entry in row i, column j of U can be read as the importance of word i in topic j; the entry in row i, column j of V indicates the importance of topic i in document j; and the i-th singular value in D represents the importance of topic i. A concrete numerical example is given below.



The corresponding Singular Value Decomposition is (only the top-3 singular values are kept here):


The new data space is then represented as follows. For example, one entry of Anew is computed as 0.15 = 3.19 × 0.35 × 0.27 + (−2.61) × 0.32 × (−0.04) + 2.0 × 0.41 × (…), i.e. the sum over topics i = 1, 2, ..., k of (the importance of topic i in the document) × (the importance of topic i) × (the importance of the term in topic i).

Here we discard the less important later topics (if all topics were kept, we would recover the original matrix exactly). This is a way of removing noise, that is, irrelevant information such as misused words or the occasional appearance of unrelated terms, so that the semantic structure gradually emerges.

Note also that the number of non-zero singular values equals the rank of the matrix, which can be seen as the number of linearly independent vectors in it, i.e. the number of coordinate axes it spans. The rank of the original matrix is 9, and the rank of the truncated matrix is 3: through LSA, the original 9-dimensional space of the data becomes a 3-dimensional space. Compared with the traditional vector space, the latent semantic space has fewer dimensions and clearer semantic relationships. As in PCA, the dimensionality reduction finds the principal axes along which the data vary, so the representation of a document in the high-dimensional vector space model can be projected into the low-dimensional latent semantic space.
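Both the entry-wise sum and the rank claim can be checked numerically (a sketch on a random 0/1 matrix, not the worked example above):

```python
import numpy as np

# Random 0/1 term-document matrix, purely for illustration.
A = np.random.default_rng(0).integers(0, 2, size=(9, 6)).astype(float)

U, s, Vt = np.linalg.svd(A)
# The number of non-zero singular values equals the rank of the matrix.
print(np.sum(s > 1e-10), np.linalg.matrix_rank(A))

k = 3
# Each entry of the rank-k approximation is the sum over topics t of
# (importance of term i in topic t) * (importance of topic t)
# * (importance of topic t in document j).
A_k = sum(s[t] * np.outer(U[:, t], Vt[t, :]) for t in range(k))
print(np.linalg.matrix_rank(A_k))  # 3, provided A itself has rank >= 3
```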

References:

1: Book: Introduction to Information Retrieval

2: The Beauty of Mathematics: http://blog.csdn.net/abcjavaser/article/details/8131087

3: The Powerful Matrix Singular Value Decomposition (SVD) and Its Applications (part 5 of a series)

4: http://www.cnblogs.com/kemaswill/archive/2013/04/17/3022100.html

5: http://read.pudn.com/downloads126/sourcecode/graph/texture_mapping/536657/pLSA/pLSA.pdf
