This tormented me for a week; I read about it on and off and thought about it on and off, and this morning it finally clicked. Maybe my English comprehension is just poor, because in the end it turned out to be quite simple.
Anyone who works on IR or NLP should be familiar with LSA. LSA uses SVD to reduce the dimensionality of the term-document matrix and then classifies documents based on their word distributions.
The disadvantage of LSA is that it lacks a solid statistical foundation, which runs against the current trend.
pLSA therefore replaces it with an explicit probabilistic model for document classification, word clustering, and so on.
First you need a word dictionary; assume it is set_word = {w1, w2, w3, ...}.
Next, several topics defined in advance, such as set_topic = {t1, t2, t3, ...}.
Last is a collection of documents with unknown categories, set_doc = {d1, d2, d3, ...}.
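The three sets above, plus the observed word counts, can be sketched with a hypothetical toy corpus (all names and counts here are made up for illustration):

```python
import numpy as np

# Hypothetical toy data: 4 words, 2 topics, 3 documents.
words = ["w1", "w2", "w3", "w4"]   # the word dictionary
topics = ["t1", "t2"]              # the predefined topics
docs = ["d1", "d2", "d3"]          # the unlabeled documents

# n[d][w] = how often word w occurs in document d.
# This count matrix is the ONLY thing we actually observe.
n = np.array([
    [4, 3, 0, 0],   # d1 mostly uses w1, w2
    [0, 1, 5, 2],   # d2 mostly uses w3, w4
    [2, 2, 2, 2],   # d3 mixes both
])
```

Everything pLSA estimates is fit against this one count matrix.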
Imagine how an author writes a document:
1. Decide on a topic to write about: P(t).
2. Select a series of words related to the current topic t: P(w|t).
3. Compose those words into a document: P(d|w).
This is the forward, generative direction.
But when we are handed a pile of documents, we have to reason in the reverse direction. These are the three steps of pLSI:
1. Select a document d from the document set with probability P(d).
2. The document expresses topic t with probability P(t|d).
3. The topic generates word w with probability P(w|t).
Of course, strictly speaking the last term should be P(w|t, d), and here lies the key assumption of pLSI: given the topic, a word is independent of the specific document it appears in. So P(w|t, d) = P(w|t).
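The independence assumption is what turns P(w|d) into a simple mixture over topics. A minimal sketch with hypothetical numbers (2 topics, 3 words, all values made up):

```python
import numpy as np

# Hypothetical parameters; each P(w|t) row sums to 1.
p_w_given_t = np.array([
    [0.7, 0.2, 0.1],   # P(w | t1)
    [0.1, 0.3, 0.6],   # P(w | t2)
])
p_t_given_d = np.array([0.4, 0.6])  # P(t | d) for one document d

# Because P(w | t, d) = P(w | t), the word distribution of a document
# is just a topic-weighted mixture: P(w|d) = Σ_t P(w|t) P(t|d).
p_w_given_d = p_t_given_d @ p_w_given_t
```

The result is itself a valid distribution over words, as a mixture of distributions must be.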
Since no labels are used anywhere, this is an unsupervised learning process.
P(d, w) = P(d) P(w|d)
P(w|d) = Σ_t P(w|t) P(t|d)   (summing over t ∈ set_topic)
Merging the two equations:
P(d, w) = P(d) Σ_t P(w|t) P(t|d) = Σ_t P(w|t) P(t|d) P(d)
By Bayes' rule, P(t|d) P(d) = P(t, d) = P(d|t) P(t), so:
P(d, w) = Σ_t P(w|t) P(d|t) P(t)   (summing over t ∈ set_topic)
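The asymmetric form P(d) Σ_t P(w|t) P(t|d) and the symmetric form Σ_t P(w|t) P(d|t) P(t) must give the same joint probability. A quick numeric check with hypothetical values for one (d, w) pair and 2 topics:

```python
import numpy as np

# Hypothetical numbers for one document d, one word w, 2 topics.
p_d = 0.25                            # P(d)
p_t_given_d = np.array([0.4, 0.6])    # P(t | d)
p_w_given_t = np.array([0.7, 0.1])    # P(w | t) for this particular w
p_t = np.array([0.5, 0.5])            # P(t)

# Asymmetric form: P(d, w) = P(d) Σ_t P(w|t) P(t|d)
joint_asym = p_d * np.sum(p_w_given_t * p_t_given_d)

# Bayes: P(t|d) P(d) = P(d|t) P(t)  =>  P(d|t) = P(t|d) P(d) / P(t)
p_d_given_t = p_t_given_d * p_d / p_t

# Symmetric form: P(d, w) = Σ_t P(w|t) P(d|t) P(t)
joint_sym = np.sum(p_w_given_t * p_d_given_t * p_t)
```

Both expressions reduce to the same number, which is just the Bayes substitution done term by term.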
The parameters to estimate are therefore P(w|t), P(d|t), and P(t), which are fit with EM.
E-step: P(t|d, w) = P(w|t) P(d|t) P(t) / Σ_t' P(w|t') P(d|t') P(t')
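The E-step is just that normalized product, evaluated for every (t, d, w) triple at once. A sketch with hypothetical current parameter values (2 topics, 3 documents, 4 words):

```python
import numpy as np

# Hypothetical current parameters (what some EM iteration might hold).
p_t = np.array([0.5, 0.5])                   # P(t)
p_d_given_t = np.full((2, 3), 1.0 / 3)       # P(d | t), uniform here
p_w_given_t = np.array([
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.2, 0.3, 0.4],
])                                           # P(w | t)

# joint[t, d, w] = P(w|t) P(d|t) P(t), via broadcasting.
joint = p_t[:, None, None] * p_d_given_t[:, :, None] * p_w_given_t[:, None, :]

# E-step: normalize over the topic axis to get P(t | d, w).
p_t_given_dw = joint / joint.sum(axis=0, keepdims=True)
```

For each fixed (d, w) the posterior over topics sums to 1, which is exactly what the denominator Σ_t' enforces.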
M-step:
P(w|t) = Σ_d n(d, w) P(t|d, w) / Σ_d Σ_w' n(d, w') P(t|d, w')   (numerator: sum over all d; denominator: sum over all d and all words w')
Similarly, P(d|t) = Σ_w n(d, w) P(t|d, w) / Σ_d' Σ_w n(d', w) P(t|d', w)   (numerator: sum over all w; denominator: sum over all documents and words)
P(t) = Σ_d Σ_w n(d, w) P(t|d, w) / Σ_d Σ_w n(d, w)   (all documents, all words; the original wrote this with z, but it is the same t as above)
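Putting the E-step and M-step together gives the full iteration. A self-contained sketch on a hypothetical count matrix (all names and numbers are made up; real pLSA would run until the log-likelihood stops improving):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical observed counts n[d, w]: 3 documents, 4 words.
n = np.array([
    [4, 3, 0, 0],
    [0, 1, 5, 2],
    [2, 2, 2, 2],
], dtype=float)
n_topics = 2

# Random normalized starting parameters.
p_t = np.full(n_topics, 1.0 / n_topics)                # P(t)
p_d_given_t = rng.random((n_topics, n.shape[0]))
p_d_given_t /= p_d_given_t.sum(axis=1, keepdims=True)  # P(d | t)
p_w_given_t = rng.random((n_topics, n.shape[1]))
p_w_given_t /= p_w_given_t.sum(axis=1, keepdims=True)  # P(w | t)

for _ in range(50):
    # E-step: P(t|d,w) ∝ P(w|t) P(d|t) P(t), normalized over t.
    joint = p_t[:, None, None] * p_d_given_t[:, :, None] * p_w_given_t[:, None, :]
    p_t_given_dw = joint / joint.sum(axis=0, keepdims=True)

    # M-step: weight each posterior by the observed count n(d, w).
    weighted = n[None, :, :] * p_t_given_dw            # shape (t, d, w)
    p_w_given_t = weighted.sum(axis=1)                 # Σ_d n(d,w) P(t|d,w)
    p_w_given_t /= p_w_given_t.sum(axis=1, keepdims=True)
    p_d_given_t = weighted.sum(axis=2)                 # Σ_w n(d,w) P(t|d,w)
    p_d_given_t /= p_d_given_t.sum(axis=1, keepdims=True)
    p_t = weighted.sum(axis=(1, 2)) / n.sum()          # Σ n(d,w) P(t|d,w) / Σ n(d,w)
```

After each iteration all three estimates remain properly normalized distributions, which is a cheap sanity check on the M-step formulas.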
Intuitively, pLSI outputs two matrices and one vector.
Matrices:
P(w|t), the distribution of words under each topic;
P(d|t), the distribution of documents under each topic.
Vector:
P(t), the probability of each topic.
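The P(w|t) matrix is also what makes the topics readable to a human: the most probable words under each topic serve as its label. A sketch with a hypothetical learned matrix (vocabulary and values made up):

```python
import numpy as np

words = ["football", "goal", "stock", "market"]   # hypothetical vocabulary

# Hypothetical learned P(w | t): one row per topic, rows sum to 1.
p_w_given_t = np.array([
    [0.45, 0.40, 0.10, 0.05],   # a "sports"-looking topic
    [0.05, 0.10, 0.42, 0.43],   # a "finance"-looking topic
])

# Top-2 words per topic, by descending probability.
top_words = [
    [words[i] for i in np.argsort(row)[::-1][:2]]
    for row in p_w_given_t
]
```

Reading `top_words` topic by topic is usually how people inspect whether the clustering came out sensible.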
Unfortunately, topics these days are too weird. I think pLSI only works well for mainstream material, where frequent words cluster nicely; for some obscure old articles it simply cannot work, because there is no data, no documents. Sigh, and carry on. I did this alongside my R&D work, and the pressure is great, great, great. I want to graduate.
If there is an error, please point it out. Thank you!