Latent Semantic Analysis Note (LSA)


1 LSA Introduction

LSA (Latent Semantic Analysis), also known as LSI (Latent Semantic Indexing), is an indexing and retrieval method proposed by Scott Deerwester, Susan T. Dumais and others in 1990. Like the traditional vector space model, it uses vectors to represent words (terms) and documents, and determines the relationship between words and documents through the relationships between those vectors (such as the angle between them). LSA maps words and documents into a latent semantic space, thereby removing some of the "noise" in the original vector space and improving the accuracy of information retrieval.

2 Disadvantages of the traditional method

The traditional vector space model uses exact word matching, that is, an exact match between the words entered by the user and the words in the vector space. Because of polysemy and synonymy, the model cannot provide users with semantic-level retrieval. For example, if a user searches for "automobile", the traditional vector space model simply returns pages containing the word "automobile", while pages that contain the word "car", which the user may actually want, are not returned.

Here is an example from the original LSA paper [1]:

The example is a term-document matrix in which "x" means that the word appears in the corresponding document and an asterisk indicates that the word appears in the query. When the user enters the query "IDF in computer-based information look up", the user is looking for pages about IDF (inverse document frequency) in information retrieval. Documents 2 and 3 each contain two of the query words and are therefore returned, while document 1 contains none of the query words and is not returned. Looking more closely, however, the words access, retrieval, indexing and database in document 1 are very close in meaning to the query terms; in particular, retrieval and look up are synonyms. Clearly, from the user's point of view, document 1 is a relevant document and should be returned. Now look at document 2, "computer information theory": although it contains the query word information, document 2 has nothing to do with IDF or information retrieval; it is not a document the user needs and should not be returned. From this analysis we can see that, in this search, the relevant document 1 is not returned to the user, while the irrelevant document 2 is returned. This is how synonymy and polysemy reduce the retrieval accuracy of the traditional vector space model.

3 How LSA resolves these issues

The purpose of LSA (latent semantic analysis) is to find out what the words (terms) really mean in documents and queries, that is, the latent semantics, so as to solve the problem described in the previous section. Specifically, a large collection of documents is modeled with a space of reasonable dimension, and the words and documents are represented in that space. For example, with 2000 documents and 7000 index terms, LSA might use a vector space of dimension 100, map the documents and words into that space, and then carry out retrieval in that space. The process of mapping documents into this space consists of SVD (singular value decomposition) and dimensionality reduction. Dimensionality reduction is the most important step in LSA: by removing "noise" in the documents, i.e., irrelevant information (such as misused words or occasional occurrences of unrelated words), the semantic structure gradually emerges. Compared with the traditional vector space, the latent semantic space has fewer dimensions and the semantic relationships are more explicit.

4 SVD decomposition [2]

SVD decomposition is background knowledge for LSA; I have written it up as a separate article, which can be found here.

5 LSA technical details [1][3]

This section focuses on the theoretical details of LSA; code-level analysis and practice are discussed in the LSA Practice section below.

The LSA steps are as follows:

1. Analyze the collection of documents and establish the term-document matrix.

2. Perform singular value decomposition on the term-document matrix.

3. Reduce the dimensionality of the SVD-decomposed matrix, which is the low-order approximation referred to in the singular value decomposition section.

4. Use the reduced matrices to construct the latent semantic space, or to reconstruct an approximation of the term-document matrix.

The following example is taken from Introduction to Latent Semantic Analysis [3]; it walks through the complete LSA procedure, with my own comments added after the example:

Suppose the document collection is as follows:

The original term-document matrix is as follows:

The singular value decomposition is performed on it:

Then, in the decomposed matrices, only the two largest singular values in {S} are kept, along with the corresponding columns of the {W} and {P} matrices; note that {P} appears transposed in the formula.

After this step there are two ways to proceed. The approach of the paper Introduction to Latent Semantic Analysis is to multiply the three reduced matrices back together and reconstruct an approximation {X̂} of the {X} matrix, as follows:
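
In formula form, the rank-2 reconstruction just described is (a sketch of this step; the numeric matrices themselves are given in the cited paper):

\hat{X} \;=\; W_2 \, S_2 \, P_2^{T}

where S_2 keeps only the two largest singular values and W_2, P_2 keep the corresponding columns of {W} and {P}.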

Comparing the {X} matrix with the {X̂} matrix, we can observe:

The HUMAN–C2 entry in {X} is 0, because C2 does not contain the word human, but the HUMAN–C2 entry in {X̂} is 0.40, indicating that human and C2 are related to some degree. Why? Because C2, "A survey of user opinion of computer system response time", contains the word user, and human is a closely related word, so the HUMAN–C2 value is raised. In the same way you can analyze the other entries whose values changed in {X̂}.

The above analysis clearly shows the effect of LSA, namely the latent semantics revealed in {X̂}. Next we want to build the latent semantic space and retrieve information in that space. Here is an example of comparing two words:

The singular value decomposition has the form X = T S Dᵀ, where T is the term matrix, S is the singular value matrix, D is the document matrix, and Dᵀ denotes the transpose of D. The dot product of two row vectors of X reflects the degree to which the two words co-occur across the documents. For example, if t1 appears 10 times in d1, t2 appears 5 times in d1 and t3 appears 0 times in d1, then considering only the d1 dimension, t1·t2 = 50 and t1·t3 = 0; clearly t1 and t2 are more similar, while t1 and t3 are not. The matrix X Xᵀ therefore gives the similarity between every pair of words. Substituting the singular value decomposition:

X Xᵀ = T S² Tᵀ = (T S)(T S)ᵀ

The formula above shows that to obtain element (i, j) of X Xᵀ, we can take the dot product of rows i and j of the TS matrix. So the i-th row of the TS matrix can be regarded as the coordinates of term i, which are its coordinates in the latent semantic space. Similarly, from Xᵀ X = D S² Dᵀ, the rows of DS give the coordinates of the documents.
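
For completeness, this identity follows from the orthonormality of D (DᵀD = I):

X X^{T} \;=\; (T S D^{T})(T S D^{T})^{T} \;=\; T S (D^{T} D) S T^{T} \;=\; T S^{2} T^{T} \;=\; (T S)(T S)^{T}

and, symmetrically, X^{T} X = (D S)(D S)^{T}, which is why the rows of TS and DS can be read as term and document coordinates respectively.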

In this way we obtain the coordinates of all documents and words in the latent semantic space, and we can then judge the similarity of two objects by the angle between their vectors, just as in the traditional vector space model. The next topic is how to retrieve text, i.e., how to handle queries.

A user-entered query is called a pseudo-text, because like a text it is composed of multiple words. So the natural idea is to convert the pseudo-text into document coordinates, and then retrieve documents relevant to it by comparing the angle between the pseudo-document and each document in the space. Let Xq be the column vector representing the pseudo-text: each entry corresponds to an index word of the document collection, and its value is the number of occurrences of that index word in the pseudo-text. For example, if the document collection has the index words {t1, t2, t3} and the pseudo-text is "t1, t3, t2, t1", then Xq = {2, 1, 1}. Having obtained Xq, we use the formula

Dq = Xqᵀ T S⁻¹

to compute the document coordinates of the pseudo-text, where T and S are the matrices obtained from the singular value decomposition (X = T S Dᵀ), and S⁻¹ denotes the inverse of S.
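
One way to see where this formula comes from (a standard folding-in argument): since X = T S Dᵀ and the columns of T are orthonormal (TᵀT = I), the document coordinates satisfy

D \;=\; X^{T} \, T \, S^{-1}

i.e., each document's coordinates are obtained by projecting its term-count column through T S⁻¹; applying the same mapping to the pseudo-text column Xq gives

D_q \;=\; X_q^{T} \, T \, S^{-1}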

Once Dq has been computed, we can compare Dq with every document in the collection by calculating the cosine of the angle between the two vectors.

6 LSA Practice

This section focuses on the implementation of LSA. The programming language is C++, the environment is Linux with GCC, and the GNU Scientific Library (GSL) [5] is used. The code for this section can be found at http://code.google.com/p/lsa-lda/.

1. Create a term-document matrix

LSA is based on the vector space model, so you first need to create an M x N term-document matrix, where the rows represent the words and the columns represent the documents. Each matrix cell holds the TF*IDF value of the corresponding word in the corresponding document. The collection of documents to be retrieved is placed in the Corpus folder under the program root directory, one file per document.

First, you need to build the word list of the corpus; these words index the rows of the t-d matrix, and each word is assigned an ID.

[Code=cpp]
// createvectorspace.cc
// function: int CreateKeywordMap()
// iterate through each document in the corpus directory
while ((ent = readdir(currentDir)) != NULL)
{
    // skip the "." and ".." entries
    if ((strcmp(ent->d_name, ".") == 0) || (strcmp(ent->d_name, "..") == 0))
        continue;
    else
    {
        // read each file in the directory 'corpus'
        string filename = "./corpus/";
        filename += ent->d_name;
        ifstream in(filename.c_str());
        // check whether the file was opened successfully
        if (!in)
        {
            cout << "error, cannot open input file" << endl;
            return -1;
        }
        Parse();    // analyze the words in this file
        ......
[/code]

During this loop, each word is recognized and checked against a stop-word list. An English stop-word list can be found at ftp://ftp.cs.cornell.edu/pub/smart/english.stop.
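
The stoplist container used by Parse() below is not shown in the excerpts; the following is a minimal sketch of how the stop-word file might be loaded into it (the names here are illustrative and may differ from the actual repository code):

[Code=cpp]
// Load the SMART English stop list (e.g. the english.stop file linked above)
// into a std::set<string> so that Parse() can skip stop words.
#include <fstream>
#include <set>
#include <string>
using namespace std;

set<string> stoplist;

bool LoadStopList(const char *path)
{
    ifstream in(path);
    if (!in)
        return false;
    string word;
    while (in >> word)          // the stop list is one word per whitespace-separated token
        stoplist.insert(word);
    return true;
}
[/code]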

[Code=cpp]
// createvectorspace.cc
// function: void Parse()
// read one char at a time, recognize a word, and check whether it is in the stop list
void Parse(ifstream *in, int *wordIndex)
{
    string pendingWord;
    char ch;
    while (1)
    {
        ......
        if (!letter(ch))                        /* a complete word has been recognized */
        {
            if (!stoplist.count(pendingWord))   /* not a stop word */
            {
                /* if it does not yet exist in the word list, insert it with a new id */
                if (wordlist.find(pendingWord) == wordlist.end())
                {
                    wordlist.insert(make_pair(pendingWord, *wordIndex));
                    (*wordIndex)++;
                }
            }
        ......
[/code]

Next we need to process the words further. English words have inflected forms, such as singular/plural (book -> books) and past tense (like -> liked); although the forms differ, the meaning is essentially the same, so they must be reduced to a common form, i.e., the word stem. The standard algorithm for this is the Porter stemming algorithm [6].
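
The repository relies on the Porter stemmer [6] for this step. The fragment below is only a drastically simplified, hypothetical illustration of the idea of suffix stripping (lower-casing and removing a trivial plural suffix); it is not the Porter algorithm and not the code actually used:

[Code=cpp]
// Illustrative only: lower-case a word and strip a trivial plural suffix,
// e.g. "Books" -> "book", "boxes" -> "box". A real system should use the
// Porter stemming algorithm [6] instead of this naive rule.
#include <cctype>
#include <string>
using namespace std;

string NaiveNormalize(string word)
{
    for (size_t i = 0; i < word.size(); ++i)
        word[i] = (char)tolower((unsigned char)word[i]);

    if (word.size() > 3 && word.compare(word.size() - 2, 2, "es") == 0)
        word.erase(word.size() - 2);                 // "boxes" -> "box"
    else if (word.size() > 2 && word[word.size() - 1] == 's'
             && word[word.size() - 2] != 's')
        word.erase(word.size() - 1);                 // "books" -> "book"

    return word;
}
[/code]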

After getting the word list, we can construct the t-d matrix: read each document in turn, and whenever a word from the word list is encountered, add 1 to the corresponding matrix cell. Several GSL functions are used here; see the GSL manual [5].

[Code=cpp]
// createvectorspace.cc
// function: gsl_matrix* CreateMatrix()
gsl_matrix* CreateMatrix()
{
    ......
    // allocate space for the t-d matrix
    gsl_matrix *mtx = gsl_matrix_alloc(wordlist.size(), doclist.size());
    map<string, int>::const_iterator map_it = doclist.begin();
    // for each document
    while (map_it != doclist.end())
    {
        ......
        // if the current word exists in the word list
        if (wordlist.find(pendingWord) != wordlist.end())
        {
            // add 1 to the corresponding cell of the matrix
            gsl_matrix_set(mtx, wordlist[pendingWord], map_it->second,
                gsl_matrix_get(mtx, wordlist[pendingWord], map_it->second) + 1);
            wordcount[map_it->second] += 1;
        }
        ......
[/code]

The t-d matrix has now been created, but its cell values are raw counts of how often each word appears in each document, so the next step is to compute each word's TF*IDF value [7]. TF is the frequency of a word within a document, and IDF (inverse document frequency) reflects the fact that a word appearing in many documents is less useful for distinguishing between documents.
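
Concretely, the value stored in cell (i, j) is computed as follows (this is the standard tf-idf formulation and matches the CreateTfIdfMatrix() code below):

\mathrm{tfidf}(i,j) \;=\; \frac{n_{i,j}}{\sum_{k} n_{k,j}} \;\times\; \log\frac{N}{\mathrm{df}(i)}

where n_{i,j} is the number of occurrences of word i in document j, N is the total number of documents, and df(i) is the number of documents that contain word i.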

[Code=cpp]
// svd.cc
// function: gsl_matrix* CreateTfIdfMatrix()
gsl_matrix* CreateTfIdfMatrix()
{
    ......
    double termFrequence = gsl_matrix_get(mtx, i, j) / wordcount[j];
    double idf = log((double)doclist.size() / (double)GetDocumentFrequence(mtx, i));
    gsl_matrix_set(mtx, i, j, termFrequence * idf);
    ......
[/code]

This completes the t-d matrix creation.

2. SVD decomposition

SVD decomposition uses the gsl_linalg_SV_decomp function from the GSL library.

[Code=cpp]
// svd.cc
// function: void CountSvd(gsl_matrix *)
void CountSvd(gsl_matrix *mtx)
{
    // X = U S V^T, so first allocate the S and V matrices
    // (gsl_linalg_SV_decomp overwrites mtx with U in place)
    v_mtx = gsl_matrix_alloc(doclist.size(), doclist.size());    /* V is an n-by-n matrix */
    s_vct = gsl_vector_alloc(doclist.size());                    /* S is stored as an n-dimensional vector */
    gsl_vector *workspace = gsl_vector_alloc(doclist.size());    /* workspace for the GSL routine */
    gsl_linalg_SV_decomp(mtx, v_mtx, s_vct, workspace);
}
[/code]

3. Dimension reduction

Dimensionality reduction is implemented very simply in the program: the trailing singular values of the S matrix (since it is diagonal, the program stores it as a vector) are set to 0.

[Code=cpp]
// svd.cc
// function: void ReduceDim(int)
void ReduceDim(int keep)
{
    // zero out every singular value after the first 'keep'
    for (int i = keep; i < doclist.size(); i++)
        gsl_vector_set(s_vct, i, 0);
}
[/code]

4. Query

After the SVD decomposition is complete, we have obtained the latent semantic space. We can then accept the user's input, convert the pseudo-text into document coordinates, and find relevant documents by comparing the angles between vectors.

[Code=cpp]
void query(string query)
{
    // transform the query into the LSA space
    istringstream stream(query);
    string word;
    // create a GSL vector for Xq, the column vector representing the pseudo-text
    gsl_vector *q_vct = gsl_vector_alloc(wordlist.size());
    // create a GSL vector for Dq, the document vector of the pseudo-text
    gsl_vector *d_vct = gsl_vector_alloc(LSD);

    // first compute Xq
    while (stream >> word)
    {
        if (wordlist.count(word) != 0)      /* the word is in the word list */
            gsl_vector_set(q_vct, wordlist[word],
                gsl_vector_get(q_vct, wordlist[word]) + 1);
    }

    // Dq = Xq^T T S^-1
    // first compute Xq^T multiplied by T
    for (int i = 0; i < LSD; i++)
    {
        double sum = 0;
        for (int j = 0; j < wordlist.size(); j++)
            sum += gsl_vector_get(q_vct, j) * gsl_matrix_get(mtx, j, i);
        gsl_vector_set(d_vct, i, sum);
    }

    // then multiply (Xq^T T) by S^-1
    for (int k = 0; k < LSD; k++)
        gsl_vector_set(d_vct, k,
            gsl_vector_get(d_vct, k) * (1 / gsl_vector_get(s_vct, k)));

    // compare Dq with each document in the document collection
    for (int l = 0; l < doclist.size(); l++)
    {
        ......
        // compute the angle between the two vectors and keep the cosine value
        relation = Comparevector(d_vct, temp_d_vct, LSD);
    }
}
[/code]
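
Comparevector() itself is not listed above. The following is a minimal sketch of a cosine-similarity routine with the same call shape (the name and signature mirror the call in query(), but the actual implementation in the repository may differ):

[Code=cpp]
// Return the cosine of the angle between the first 'dim' components of two GSL vectors.
#include <math.h>
#include <gsl/gsl_vector.h>

double Comparevector(gsl_vector *a, gsl_vector *b, int dim)
{
    double dot = 0, norm_a = 0, norm_b = 0;
    for (int i = 0; i < dim; i++)
    {
        double x = gsl_vector_get(a, i);
        double y = gsl_vector_get(b, i);
        dot    += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if (norm_a == 0 || norm_b == 0)
        return 0;               /* treat an all-zero vector as unrelated */
    return dot / (sqrt(norm_a) * sqrt(norm_b));
}
[/code]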

5. Testing

Let's start with the set of documents discussed earlier.

Save c1~m4 as 9 separate files and place them in the Corpus folder.

Run the program with the input format lsa.out [query]

./lsa.out Human Computer Interaction

You can see that the document most relevant to the query is c3, followed by c1. The c1~c5 files are documents on the same subject, human-computer interaction, while the common theme of m1~m4 is computer graphics. The query "Human Computer Interaction" clearly describes human-computer interaction, and as the results show, the relevance of c1~c5 is consistently higher than that of the m1~m4 documents. Finally, note the documents c3 and c5: they contain none of the words in the query, yet their computed similarity is not 0; indeed the similarity of c3 is 0.999658. This is the effect of LSA's latent semantics.

The following is a table of the pairwise document comparison results (imported into Excel):

The rows 1~9 and columns a~i each represent the documents {c1,c2,c3,c4,c5,m1,m2,m3,m4}.

The relationships between the documents are displayed clearly.

First look at [1~5][a~e], i.e., rows 1 to 5 and columns a to e: because documents c1~c5 are on the same subject, the values in [1~5][a~e] are all greater than 0.9, while the values in [1~5][f~i] are all no more than 0.5, showing that the c1~c5 documents are unrelated to the topic of the m1~m4 documents.

Similarly, [6~9][f~i] can be analyzed.

The above discussion shows that latent semantic analysis is clearly effective for topic classification. If a similarity threshold is set for grouping, say 0.8, then the 9 documents above are automatically divided into {c1,c2,c3,c4,c5} and {m1,m2,m3,m4}.

In another test, I collected articles on 6 topics from the New York Times website, 5 articles per topic.

The results of the search for "What a great Day" are as follows:

The pseudo-text coordinates are (0.00402821, -0.0183549, 0.00361756), followed by the relevance of each document. If the retrieval threshold is set to 0.9, then the documents movie2, sport4 and art2 are returned.

7 Summary

LSA improves the accuracy of information retrieval by modeling a latent semantic space. Later, pLSA (probabilistic latent semantic analysis) and LDA (latent Dirichlet allocation) were proposed, carrying the idea of LSA into probabilistic statistical models.

However, LSA still does not solve the problem of polysemy; it only addresses synonymy. Because LSA represents each word as a single point in the latent semantic space, the multiple meanings of a word are collapsed into one point and are not distinguished.

8 References

[1] Deerwester, S., Dumais, S. T., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, 391-407.

[2] Christopher D. Manning, Prabhakar Raghavan and Hinrich Schütze, Introduction to Information Retrieval, Cambridge University Press, 2008.

[3] Thomas Landauer, P. W. Foltz, & D. Laham (1998). "Introduction to Latent Semantic Analysis". Discourse Processes 25: 259–284.

[4] Michael Berry, S. T. Dumais, G. W. O'Brien (1995). Using Linear Algebra for Intelligent Information Retrieval. Illustration of the application of LSA to document retrieval.

[5] http://www.gnu.org/software/gsl/manual/html_node/

[6] http://tartarus.org/~martin/PorterStemmer/

[7] http://en.wikipedia.org/wiki/TF_IDF

9 External Links

[1] http://code.google.com/p/lsa-lda/

Code for the program in this article and related LSA data

[2] http://en.wikipedia.org/wiki/Latent_semantic_analysis

The Wikipedia entry on LSA, giving an overview of LSA

[3] http://lsa.colorado.edu/

An LSA project at the University of Colorado that provides LSA-based term comparison, text comparison, etc.

[4] http://www.bluebit.gr/matrix-calculator/

On-line matrix calculation tool to calculate SVD

10 Further Reading

[1] Thomas Hofmann, Probabilistic Latent Semantic Indexing, Proceedings of the Twenty-Second Annual International SIGIR Conference on Research and Development in Information Retrieval (SIGIR-99), 1999.

[2] Blei, David M.; Ng, Andrew Y.; Jordan, Michael I. (January 2003). "Latent Dirichlet Allocation". Journal of Machine Learning Research 3: 993–1022. doi:10.1162/jmlr.2003.3.4-5.993.

Originally posted at http://blog.csdn.net/wangran51/article/details/7408406

