PLSI (Probabilistic Latent Semantic Indexing): word classification, document classification

Source: Internet
Author: User
This topic ground me down for a week. I read about it on and off and thought about it on and off, and this morning in the bathroom I finally figured it out. Maybe my understanding of English is just poor, because in the end I found it quite simple.

Anyone who works in IR or NLP should be familiar with LSA. LSA uses SVD to reduce the dimensionality of the term-document matrix, and then classifies documents based on word distribution.

The disadvantage of LSA is that it has no solid statistical foundation, which is at odds with the currently popular trend toward probabilistic methods.

Therefore, PLSA uses a probabilistic model for document classification, word clustering, and so on.

You need a word dictionary. Assume it is setword = {w1, w2, w3, ...}

You also need several topics (categories) defined in advance, settopic = {t1, t2, t3, ...}

Finally, you need a collection of documents whose categories are unknown, setdoc = {d1, d2, d3, ...}
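As a rough illustration of this setup, the sketch below builds the dictionary and the per-document word counts that the M-step formulas later rely on. The three tiny documents are invented for the example, and the variable names (docs, vocabulary, counts) are my own assumptions, not part of the original post.

```python
from collections import Counter

# Hypothetical toy corpus: each document is already split into words.
docs = [
    ["ball", "team", "ball"],
    ["bank", "loan", "bank"],
    ["team", "bank"],
]

# setword: the dictionary, sorted so that column order is reproducible.
vocabulary = sorted({word for doc in docs for word in doc})

# counts[d][w] = number of times word w occurs in document d.
counts = []
for doc in docs:
    word_counts = Counter(doc)
    counts.append([word_counts[word] for word in vocabulary])

print(vocabulary)
print(counts)
```

The number of topics is the only other input; it has to be chosen in advance, as described above.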

Imagine how an author goes about writing a document.

1. Decide on the topic to write about, P(t)

2. Select a series of words that are related to the current topic t, so P(w | t).

3. Use these words to form a document, P(d | w).

This is the forward way of thinking.
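The author's point of view can be sketched as a small sampling program. This is only a literal reading of the three steps, with one topic per document; the topic names, word lists and probabilities are made up for illustration.

```python
import random

# Hypothetical toy parameters, invented for illustration only.
p_topic = {"sports": 0.5, "finance": 0.5}            # P(t)
p_word_given_topic = {                                # P(w | t)
    "sports": {"ball": 0.6, "team": 0.3, "bank": 0.1},
    "finance": {"ball": 0.05, "team": 0.15, "bank": 0.8},
}

def sample(dist, rng):
    """Draw one key from a {key: probability} dictionary."""
    keys = list(dist)
    return rng.choices(keys, weights=[dist[k] for k in keys], k=1)[0]

def write_document(length, rng):
    """Step 1: pick a topic. Step 2: pick words for it. Step 3: join them."""
    topic = sample(p_topic, rng)
    words = [sample(p_word_given_topic[topic], rng) for _ in range(length)]
    return topic, words

rng = random.Random(0)
topic, words = write_document(8, rng)
print(topic, words)
```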

When we are handed a pile of documents, we need to think in the reverse direction. These are the three steps of PLSI.

1. The probability of selecting a document d from the document set, P(d)

2. The probability that this document is about topic t, P(t | d)

3. The probability that this topic produces word w in the document, P(w | t)

Of course, the last one should really have been P(w | t, d). This is where the assumption of PLSI comes in: given the topic, a word is independent of the specific document. So P(w | t, d) = P(w | t)

Since the topic of each document is never observed, this is an unsupervised classification process.

$$P(d, w) = P(d)\, P(w \mid d)$$

$$P(w \mid d) = \sum_{t \in T} P(w \mid t)\, P(t \mid d)$$

Merge the two equations.

$$P(d, w) = P(d) \sum_{t \in T} P(w \mid t)\, P(t \mid d) = \sum_{t \in T} P(w \mid t)\, P(t \mid d)\, P(d)$$

By the product rule:

$$P(t \mid d)\, P(d) = P(t, d) = P(d \mid t)\, P(t)$$

So:

$$P(d, w) = P(d) \sum_{t \in T} P(w \mid t)\, P(t \mid d) = \sum_{t \in T} P(w \mid t)\, P(d \mid t)\, P(t)$$
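A quick numerical check can make the derivation more convincing. The sketch below computes the joint distribution of documents and words in both forms and compares them. All numbers are hypothetical toy values (2 documents, 2 topics, 3 words) chosen only so that every distribution sums to 1.

```python
# Hypothetical toy distributions over 2 documents, 2 topics, 3 words.
p_d = [0.4, 0.6]                      # P(d)
p_t_given_d = [[0.9, 0.1],            # P(t | d), one row per document
               [0.2, 0.8]]
p_w_given_t = [[0.7, 0.2, 0.1],       # P(w | t), one row per topic
               [0.1, 0.3, 0.6]]
n_docs, n_topics, n_words = 2, 2, 3

# First form: P(d, w) = P(d) * sum_t P(w | t) P(t | d)
joint_a = [[p_d[d] * sum(p_w_given_t[t][w] * p_t_given_d[d][t]
                         for t in range(n_topics))
            for w in range(n_words)] for d in range(n_docs)]

# Product rule: P(t) = sum_d P(t | d) P(d), then P(d | t) = P(t | d) P(d) / P(t)
p_t = [sum(p_t_given_d[d][t] * p_d[d] for d in range(n_docs))
       for t in range(n_topics)]
p_d_given_t = [[p_t_given_d[d][t] * p_d[d] / p_t[t] for d in range(n_docs)]
               for t in range(n_topics)]

# Second form: P(d, w) = sum_t P(w | t) P(d | t) P(t)
joint_s = [[sum(p_w_given_t[t][w] * p_d_given_t[t][d] * p_t[t]
                for t in range(n_topics))
            for w in range(n_words)] for d in range(n_docs)]

max_gap = max(abs(joint_a[d][w] - joint_s[d][w])
              for d in range(n_docs) for w in range(n_words))
total = sum(sum(row) for row in joint_a)
print(max_gap, total)
```

The gap between the two forms should be zero up to floating-point rounding, and the joint distribution should sum to 1.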

What has to be estimated is P(w | t) and P(d | t), together with P(t). They are fitted with the EM algorithm.

E-step:

$$P(t \mid d, w) = \frac{P(w \mid t)\, P(d \mid t)\, P(t)}{\sum_{t'} P(w \mid t')\, P(d \mid t')\, P(t')}$$
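The E-step is just a normalization over topics for one (document, word) pair. Here is a minimal sketch; the function name e_step_posterior and the toy parameter values are assumptions made for this example.

```python
def e_step_posterior(p_w_given_t, p_d_given_t, p_t, d, w):
    """Return [P(t | d, w) for every topic t] for one (document, word) pair.

    p_w_given_t[t][w] = P(w | t), p_d_given_t[t][d] = P(d | t), p_t[t] = P(t).
    """
    scores = [p_w_given_t[t][w] * p_d_given_t[t][d] * p_t[t]
              for t in range(len(p_t))]
    total = sum(scores)          # the denominator: sum over all topics t'
    return [s / total for s in scores]

# Hypothetical toy parameters: 2 topics, 2 words, 2 documents.
p_t = [0.5, 0.5]
p_w_given_t = [[0.8, 0.2], [0.2, 0.8]]
p_d_given_t = [[0.5, 0.5], [0.5, 0.5]]

posterior = e_step_posterior(p_w_given_t, p_d_given_t, p_t, d=0, w=0)
print(posterior)
```

With these values the documents carry no information, so the posterior simply follows P(w | t): about 0.8 for the first topic and 0.2 for the second.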

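M-step, where n(d, w) is the number of times word w occurs in document d:

$$P(w \mid t) = \frac{\sum_{d} n(d, w)\, P(t \mid d, w)}{\sum_{d} \sum_{w'} n(d, w')\, P(t \mid d, w')}$$

The numerator sums over all documents d for the fixed word w; the denominator sums over all documents and all words.

Similarly, for a fixed document d:

$$P(d \mid t) = \frac{\sum_{w} n(d, w)\, P(t \mid d, w)}{\sum_{d'} \sum_{w} n(d', w)\, P(t \mid d', w)}$$

$$P(t) = \frac{\sum_{d} \sum_{w} n(d, w)\, P(t \mid d, w)}{\sum_{d} \sum_{w} n(d, w)}$$

Here the numerator sums over all documents and all words for the fixed topic t, and the denominator is the total word count of the whole collection.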
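Putting the E-step and M-step together, one possible pure-Python sketch of the whole loop looks like this. It is not a tuned implementation: the function name plsi_em, the random initialization, the fixed iteration count and the toy count matrix are all assumptions of mine, and there is no smoothing or convergence test.

```python
import random

def plsi_em(counts, n_topics, n_iters=50, seed=0):
    """counts[d][w] = n(d, w). Returns (p_w_given_t, p_d_given_t, p_t)."""
    rng = random.Random(seed)
    n_docs, n_words = len(counts), len(counts[0])

    def random_dist(size):
        raw = [rng.random() + 0.1 for _ in range(size)]
        norm = sum(raw)
        return [x / norm for x in raw]

    p_w_given_t = [random_dist(n_words) for _ in range(n_topics)]
    p_d_given_t = [random_dist(n_docs) for _ in range(n_topics)]
    p_t = [1.0 / n_topics] * n_topics
    total_count = sum(sum(row) for row in counts)

    for _ in range(n_iters):
        # Accumulators for sums of n(d, w) * P(t | d, w).
        acc_w = [[0.0] * n_words for _ in range(n_topics)]
        acc_d = [[0.0] * n_docs for _ in range(n_topics)]
        acc_t = [0.0] * n_topics
        for d in range(n_docs):
            for w in range(n_words):
                n_dw = counts[d][w]
                if n_dw == 0:
                    continue  # zero counts contribute nothing
                # E-step: P(t | d, w) up to normalization.
                scores = [p_w_given_t[t][w] * p_d_given_t[t][d] * p_t[t]
                          for t in range(n_topics)]
                norm = sum(scores)
                for t in range(n_topics):
                    weight = n_dw * scores[t] / norm
                    acc_w[t][w] += weight
                    acc_d[t][d] += weight
                    acc_t[t] += weight
        # M-step: renormalize the accumulated weights.
        for t in range(n_topics):
            denom = acc_t[t] or 1.0  # guard for a topic that lost all its mass
            p_w_given_t[t] = [x / denom for x in acc_w[t]]
            p_d_given_t[t] = [x / denom for x in acc_d[t]]
            p_t[t] = acc_t[t] / total_count
    return p_w_given_t, p_d_given_t, p_t

# Hypothetical counts: documents 0-1 use words 0-1, documents 2-3 use words 2-3.
counts = [[4, 3, 0, 0], [5, 2, 0, 0], [0, 0, 3, 4], [0, 0, 2, 5]]
p_w_given_t, p_d_given_t, p_t = plsi_em(counts, n_topics=2)
print(p_t)
```

Note that P(w | t) and P(d | t) share the same denominator for a given topic, which is why a single accumulator acc_t is enough.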

Intuitively, PLSI outputs two matrices and a vector.

Matrices:

P(w | t) gives the distribution of words under each topic.

P(d | t) gives the distribution of documents under each topic.

Vector:

P(t) gives the overall weight of each topic.
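To show how these outputs might be used, the sketch below assigns each document to its most probable topic through P(t | d), which is proportional to P(d | t) P(t), and lists the top words of each topic from P(w | t). The helper names and all the numbers are hypothetical; in practice the inputs would come from a fitted model.

```python
def doc_topic_posterior(p_d_given_t, p_t):
    """Return rows [P(t | d) for every t], one row per document."""
    n_topics, n_docs = len(p_t), len(p_d_given_t[0])
    rows = []
    for d in range(n_docs):
        scores = [p_d_given_t[t][d] * p_t[t] for t in range(n_topics)]
        total = sum(scores)
        rows.append([s / total for s in scores])
    return rows

def top_words(p_w_given_t, vocabulary, k):
    """Return the k most probable words for every topic."""
    result = []
    for row in p_w_given_t:
        order = sorted(range(len(row)), key=lambda w: row[w], reverse=True)
        result.append([vocabulary[w] for w in order[:k]])
    return result

# Hypothetical fitted values: 2 topics, 4 documents, 3 words.
p_t = [0.5, 0.5]
p_d_given_t = [[0.45, 0.45, 0.05, 0.05],
               [0.05, 0.05, 0.45, 0.45]]
p_w_given_t = [[0.5, 0.4, 0.1],
               [0.1, 0.2, 0.7]]
vocabulary = ["ball", "team", "bank"]

posteriors = doc_topic_posterior(p_d_given_t, p_t)
labels = [max(range(len(p_t)), key=lambda t: row[t]) for row in posteriors]
print(labels)
print(top_words(p_w_given_t, vocabulary, k=2))
```

Here the first two documents land in topic 0 and the last two in topic 1, which is the document classification; the top-word lists are the word clustering.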

Unfortunately, my current research topic is too bizarre for this. I think PLSI only suits mainstream text, where clustering of frequent words works well. For some ancient articles it cannot be done, because there is not enough data or documents. Sigh, and carry on. I am doing both research and development, and the pressure is great, great, and great. I want to graduate.

If there is an error, please point it out. Thank you.
