Latent Semantic Analysis (LSA) Model Learning notes

Source: Internet
Author: User

Latent semantic analysis, also translated as implicit semantic analysis, is usually abbreviated as LSA. Its relatives, pLSA and LDA, will be covered later. All of these are classic models in NLP. These notes summarize three aspects of the model: where LSA can be applied, the theory behind LSA, and the advantages and disadvantages of LSA.


1. Application of LSA


LSA can reduce the dimensionality of samples represented in the vector space model (VSM) and can uncover latent semantic dimensions in text.

In the VSM, a document is represented as a multi-dimensional vector whose components are weights of feature words, such as their frequency (or probability) of occurrence. The advantage of this method is that it converts a text into a numerical vector, on which similarity calculation, clustering, classification, and so on can then be performed.

However, the VSM cannot handle polysemy (one word with several meanings) or synonymy (several words with one meaning). For example, in the VSM, "quilt" and "futon" are two completely separate dimensions even though they mean nearly the same thing, while "notebook" (a paper notebook or a laptop?) occupies a single dimension regardless of which sense is meant. The VSM therefore cannot capture the latent semantics of the text.

Therefore, the LSA model can be used to mine the semantic information in text and alleviate the problems of polysemy and synonymy.
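The two failure modes can be illustrated with a minimal sketch. The four toy documents, the helper names `vsm_vector` and `cosine`, and the use of raw term counts as weights are all assumptions made for illustration, not part of the original notes.

```python
import math
from collections import Counter

def vsm_vector(text, vocabulary):
    """Represent a text as a term-count vector over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts[term] for term in vocabulary]

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical toy documents.
docs = ["soft quilt", "thick futon", "notebook paper", "notebook battery"]
vocabulary = sorted({word for doc in docs for word in doc.split()})
vec_a, vec_b, vec_c, vec_d = (vsm_vector(doc, vocabulary) for doc in docs)

# Synonymy: "quilt" and "futon" are separate dimensions, so these
# two documents share no nonzero component.
synonym_similarity = cosine(vec_a, vec_b)

# Polysemy: both senses of "notebook" fall on one dimension, so these
# two documents overlap even though they are about different things.
polysemy_similarity = cosine(vec_c, vec_d)

print(synonym_similarity, polysemy_similarity)
```

In this sketch the synonym pair gets no similarity at all, while the two "notebook" documents look partly similar purely because they share a surface form.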


2. LSA's theoretical analysis


The theory of the LSA model is based on singular value decomposition (SVD), which is very common in data mining. Another well-known dimensionality reduction algorithm built on SVD is principal component analysis (PCA). In fact, I think PCA and LSA are very similar in some ways, except that LSA is explicitly applied in the context of NLP.

Step 1: In the VSM, a text is represented as a vector, and a collection of texts is represented as a matrix C. Each column of C is a text, and each row is a term.
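As a rough sketch of Step 1, the matrix C can be assembled with NumPy as below. The four toy documents and the choice of raw term counts as entries are assumptions for illustration; in practice other weights, such as tf-idf, are often used instead.

```python
import numpy as np

# Hypothetical toy corpus; each string is one document.
docs = [
    "quilt bed sleep",
    "futon bed sleep",
    "laptop notebook battery",
    "notebook paper pen",
]

# One row per distinct term, in sorted order.
terms = sorted({word for doc in docs for word in doc.split()})
term_index = {term: i for i, term in enumerate(terms)}

# C[i, j] counts how often term i occurs in document j.
C = np.zeros((len(terms), len(docs)))
for j, doc in enumerate(docs):
    for word in doc.split():
        C[term_index[word], j] += 1

print(terms)
print(C)
```

Each column of `C` is then the VSM vector of one document, and each row describes how one term is distributed across the documents.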

Step 2: We compute the SVD of the matrix C as follows:

$$C = U \Sigma V^{T}$$

As anyone who has studied matrix theory will know, the middle factor Sigma is a diagonal matrix made up of the singular values of C (the square roots of the eigenvalues of C^T C). Assuming that C has r nonzero singular values, we sort them from largest to smallest, keep the first k, and set the remaining r-k to zero, which gives Sigma_k.
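Step 2 can be sketched with `numpy.linalg.svd`, which returns the diagonal of Sigma as a one-dimensional array already sorted from largest to smallest. The 3x4 toy matrix and the choice k = 2 are assumptions for illustration.

```python
import numpy as np

# Hypothetical toy term-document matrix: rows are terms, columns are documents.
C = np.array([
    [1, 0, 1, 0],  # quilt
    [0, 1, 0, 1],  # futon
    [0, 0, 1, 1],  # bed
], dtype=float)

# Reduced SVD: U is 3x3, s holds the diagonal of Sigma, Vt is 3x4.
U, s, Vt = np.linalg.svd(C, full_matrices=False)
Sigma = np.diag(s)

# Multiplying the three factors back together recovers C.
reconstructed = U @ Sigma @ Vt

# Keep the k largest values on the diagonal and zero out the rest.
k = 2
s_k = s.copy()
s_k[k:] = 0.0
Sigma_k = np.diag(s_k)

print(s)
print(Sigma_k)
```

Note that the values in `s` are the square roots of the eigenvalues of `C @ C.T`, which is why the decomposition is sometimes described loosely in terms of eigenvalues.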

Step 3: We compute an approximation of C as follows:

$$C_k = U \Sigma_k V^{T}$$

Since Sigma_k has at most k nonzero values, the rank of C_k does not exceed k.
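The rank bound in Step 3 can be checked numerically. In this sketch the helper name `rank_k_approximation`, the random 6x5 input matrix, and the tolerance passed to `matrix_rank` are assumptions chosen for illustration.

```python
import numpy as np

def rank_k_approximation(C, k):
    """Rebuild C from its SVD after zeroing all but the k largest singular values."""
    U, s, Vt = np.linalg.svd(C, full_matrices=False)
    s_k = s.copy()
    s_k[k:] = 0.0
    return U @ np.diag(s_k) @ Vt

# Hypothetical dense input matrix with a fixed seed for repeatability.
rng = np.random.default_rng(0)
C = rng.random((6, 5))

k = 2
C_k = rank_k_approximation(C, k)

# An explicit tolerance keeps floating-point noise from being
# counted as additional rank.
rank_C_k = np.linalg.matrix_rank(C_k, tol=1e-10)
approximation_error = np.linalg.norm(C - C_k)

print(rank_C_k, approximation_error)
```

`C_k` has the same shape as `C`; only its rank is reduced, so the dimensionality reduction lies in how few directions are needed to describe it.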


C_k is the new matrix extracted by the LSA model: a low-rank approximation of C that describes the documents in a new low-dimensional latent semantic space. In this space, each singular value acts as the weight of one "semantic" dimension. We simply set the less important weights to zero and keep only the most important dimensions, which gives a better representation of the documents.
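A small sketch of what the latent space buys: below, documents are compared first by their raw columns of C and then by their k-dimensional coordinates, taken here as the columns of Sigma_k V^T restricted to the first k rows. The toy matrix, the choice k = 2, and this particular coordinate convention are assumptions for illustration.

```python
import numpy as np

# Hypothetical toy term-document matrix: rows are terms, columns are documents.
C = np.array([
    # d0 d1 d2 d3 d4 d5
    [1, 0, 1, 0, 0, 0],  # quilt
    [0, 1, 0, 1, 0, 0],  # futon
    [0, 0, 1, 1, 0, 0],  # bed
    [0, 0, 0, 0, 1, 1],  # laptop
    [0, 0, 0, 0, 1, 1],  # battery
], dtype=float)

def cosine(u, v):
    """Cosine similarity of two vectors (0.0 if either has zero length)."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom > 0 else 0.0

U, s, Vt = np.linalg.svd(C, full_matrices=False)

# Coordinates of every document in the k-dimensional latent space,
# one column per document.
k = 2
docs_k = np.diag(s[:k]) @ Vt[:k, :]

# d0 contains only "quilt" and d1 only "futon": no shared term.
raw_similarity = cosine(C[:, 0], C[:, 1])
latent_similarity = cosine(docs_k[:, 0], docs_k[:, 1])

# d4 is about laptops and should stay unrelated to d0.
unrelated_similarity = cosine(docs_k[:, 0], docs_k[:, 4])

print(raw_similarity, latent_similarity, unrelated_similarity)
```

Because "quilt" and "futon" both co-occur with "bed" in other documents, the two retained dimensions place d0 and d1 in the same direction, while the laptop documents remain on a separate one. This is a toy case built to make the effect visible; on real corpora the improvement is less clear-cut.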


3. Advantages and disadvantages of the LSA model


Advantages: The original text feature space can be reduced to a low-dimensional semantic space, which alleviates the problems of polysemy and synonymy.


Disadvantages: The SVD is particularly time-consuming to compute, and because a text feature matrix is generally very large, the cost becomes even higher.

Furthermore, LSA lacks a rigorous statistical foundation.




