Probabilistic Language Models and Their Variants (1): pLSA and the EM Algorithm


This series of blog posts introduces common probabilistic language models and their variants, mainly summarizing pLSA, LDA, and LDA variant models together with their parameter inference methods. The preliminary plan is as follows:

First article: pLSA and EM algorithm

Second article: LDA and Gibbs Sampling

Third article: LDA variant models - Twitter-LDA, TimeUserLDA, ATM, Labeled-LDA, MaxEnt-LDA, etc.

Fourth article: A categorized summary of papers based on LDA variant models

Fifth article: A Java implementation of LDA Gibbs sampling


First article: pLSA and the EM algorithm

[A PDF version of this article can be downloaded here: pLSA and EM algorithm - yangliuy]

This article mainly introduces pLSA and the EM algorithm. It first presents the earlier LSA (Latent Semantic Analysis) method based on SVD, then introduces the probability-based pLSA model, whose parameters are learned with the EM algorithm. After that, it analyzes how the EM algorithm can be used to estimate the parameters of a simple mixture unigram language model and of the Gaussian mixture model (GMM), and finally summarizes the general form of the EM algorithm and the key points of applying it. The LDA model, which improves on pLSA by adding hyperparameter priors, and its Gibbs sampling parameter estimation method are introduced in the following article, LDA and Gibbs Sampling.


1 LSA and SVD

The purpose of LSA (Latent Semantic Analysis) is to discover the hidden semantic dimensions, that is, the "topics" or "concepts", in text. We know that in the vector space model (VSM) a document is represented as a multidimensional vector whose components are the occurrence weights of feature words. The advantage of this representation is that a query and a document can be mapped into the same space and their similarity computed with vector operations, and different words can be given different weights; it has been widely applied in text retrieval, classification and clustering. For example, the clustering algorithms in the articles "Java implementation of a newsgroup18828 text classifier based on the Bayesian algorithm and the KNN algorithm" and "Java implementation of newsgroup18828 text clustering based on the KMeans, MBSAS and DBSCAN algorithms" are mostly based on the vector space model. However, the vector space model cannot handle synonymy and polysemy: synonyms are represented as separate, independent dimensions, so the computed cosine similarity between vectors underestimates the similarity the user would expect, and when a term has multiple meanings it always corresponds to the same dimension, so the computed similarity overestimates the similarity the user would expect.


The introduction of LSA can mitigate these problems. Based on SVD, we can construct a low-rank approximation of the original term-document matrix. Concretely, the term-document matrix X is decomposed by SVD:




X = U \Sigma V^T

where the terms are the rows and the documents the columns, so X has T rows and D columns, and each matrix entry is the TF-IDF value of the corresponding term in the corresponding document; U and V are orthonormal matrices and \Sigma is the diagonal matrix of the r singular values (r is the rank of X). Then only the first k diagonal elements of \Sigma (the k largest singular values) are retained and the remaining r-k smallest singular values are set to 0, which gives \Sigma_k; finally the approximate decomposition matrix is computed:




X_k = U \Sigma_k V^T

X_k is the best approximation of X in the least-squares sense. Its rank does not exceed k because \Sigma_k contains at most k non-zero elements. Through this SVD approximation we map the original vectors into a low-dimensional latent semantic space, which plays the role of dimensionality reduction. Each singular value corresponds to the weight of one "semantic" dimension; setting the less important weights to 0 keeps only the information of the most important dimensions and removes some of the "noise", and thus yields a better representation of the documents. A Java implementation that applies SVD decomposition for dimensionality reduction to document clustering can be found in this article.
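To make the rank-k truncation above concrete, here is a minimal sketch of a truncated SVD of a tiny term-document matrix using Apache Commons Math; this library and the toy matrix values (and k = 2) are assumptions made for illustration and are not part of the original article.

import org.apache.commons.math3.linear.Array2DRowRealMatrix;
import org.apache.commons.math3.linear.RealMatrix;
import org.apache.commons.math3.linear.SingularValueDecomposition;

public class LsaSvdDemo {
    public static void main(String[] args) {
        // Toy term-document matrix X (T = 4 terms as rows, D = 3 documents as columns)
        double[][] x = {
            {1.0, 0.0, 1.0},
            {0.0, 1.0, 0.0},
            {1.0, 1.0, 0.0},
            {0.0, 0.0, 1.0}
        };
        RealMatrix X = new Array2DRowRealMatrix(x);

        // SVD: X = U * Sigma * V^T
        SingularValueDecomposition svd = new SingularValueDecomposition(X);
        RealMatrix U = svd.getU();
        RealMatrix S = svd.getS();
        RealMatrix VT = svd.getVT();

        // Keep only the k largest singular values, set the remaining ones to 0
        int k = 2;
        RealMatrix Sk = S.copy();
        for (int i = k; i < Sk.getRowDimension(); i++) {
            Sk.setEntry(i, i, 0.0);
        }

        // Rank-k approximation X_k = U * Sigma_k * V^T
        RealMatrix Xk = U.multiply(Sk).multiply(VT);
        System.out.println("Rank-" + k + " approximation:\n" + Xk);
    }
}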

An example of SVD dimensionality reduction given in IIR (Introduction to Information Retrieval) is as follows:


The left side shows the SVD decomposition of the original matrix; the right side shows the result of keeping only the 2 dimensions with the largest weights, i.e., the original matrix after being reduced to 2 dimensions.


2 pLSA

Although SVD-based LSA achieved some success, it lacks a rigorous mathematical and statistical foundation, and the SVD decomposition is time-consuming. At SIGIR '99, Hofmann proposed the pLSA model, which is based on probability and statistics, and its model parameters are learned with the EM algorithm. The probabilistic graphical model of pLSA is as follows:




where d denotes a document, z denotes a latent class or topic, and w is an observed word; P(d_i) denotes the probability that document d_i is selected, P(z_k|d_i) denotes the probability that topic z_k appears given document d_i, and P(w_j|z_k) denotes the probability that word w_j appears given topic z_k. Each topic follows a multinomial distribution over all terms, and each document follows a multinomial distribution over all topics. The generation process of the whole document collection is as follows:

(1) Select a document d_i with probability P(d_i);

(2) Select a topic z_k with probability P(z_k|d_i);

(3) Generate a word w_j with probability P(w_j|z_k).

The data we can observe are the pairs (d_i, w_j), while z_k is a latent variable. The joint distribution of (d_i, w_j) is




P(d_i, w_j) = P(d_i) P(w_j|d_i), \qquad P(w_j|d_i) = \sum_{k=1}^{K} P(w_j|z_k) P(z_k|d_i)

and the distributions P(w_j|z_k) and P(z_k|d_i) correspond to two groups of multinomial distributions whose parameters we need to estimate. The detailed derivation of estimating the pLSA parameters with the EM algorithm is given below.
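To make this generative story concrete, the following self-contained sketch samples synthetic (document, topic, word) triples following the three steps above. The hard-coded distributions P(d), P(z|d) and P(w|z) are toy values invented for illustration; none of the class or variable names come from the original article.

import java.util.Random;

public class PlsaGenerativeDemo {
    // Draw an index from a discrete (multinomial) distribution given by probs
    static int sample(double[] probs, Random rng) {
        double r = rng.nextDouble(), cum = 0.0;
        for (int i = 0; i < probs.length; i++) {
            cum += probs[i];
            if (r <= cum) return i;
        }
        return probs.length - 1;
    }

    public static void main(String[] args) {
        Random rng = new Random(0);
        double[] pD = {0.5, 0.3, 0.2};                           // P(d), 3 documents
        double[][] pZ_d = {{0.7, 0.3}, {0.2, 0.8}, {0.5, 0.5}};  // P(z|d), 2 topics
        double[][] pW_z = {{0.4, 0.3, 0.1, 0.1, 0.1},            // P(w|z), 5 words
                           {0.1, 0.1, 0.2, 0.3, 0.3}};

        for (int n = 0; n < 10; n++) {
            int d = sample(pD, rng);       // (1) select a document d_i with P(d_i)
            int z = sample(pZ_d[d], rng);  // (2) select a topic z_k with P(z_k|d_i)
            int w = sample(pW_z[z], rng);  // (3) generate a word w_j with P(w_j|z_k)
            System.out.println("d=" + d + "  z=" + z + "  w=" + w);
        }
    }
}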


3 Estimating the pLSA parameters with EM

(Note: this part mainly follows Thomas Hofmann, Unsupervised Learning by Probabilistic Latent Semantic Analysis.)

As described in the article "Parameter estimation for text language models: maximum likelihood estimation, MAP and Bayesian estimation", the commonly used parameter estimation methods are MLE, MAP, Bayesian estimation and so on. In pLSA, if we try to estimate the parameters directly with MLE, we obtain the log-likelihood function




\mathcal{L} = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log P(d_i, w_j) = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \log \Big[ P(d_i) \sum_{k=1}^{K} P(w_j|z_k) P(z_k|d_i) \Big]

where n(d_i, w_j) is the number of times term w_j appears in document d_i. Note that this is a function of P(w_j|z_k) and P(z_k|d_i), with N*K + M*K independent variables (here M denotes the total number of terms; the literature commonly uses V instead). If we take derivatives with respect to these variables directly, we will find that, because the variables appear inside the logarithm of a sum, solving the resulting system of equations is very difficult. Therefore we turn to the EM algorithm, which is designed for parameter estimation in probabilistic models that contain "latent variables" or "missing data".


The steps of the EM algorithm are:

(1) E-step: compute the posterior probability of the latent variables given the current estimates of the parameters;

(2) M-step: maximize the expectation of the complete-data log-likelihood, in which the posterior probabilities of the latent variables computed in the E-step are plugged in, to obtain new parameter values.

The two steps are iterated until convergence.
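As a hedged sketch of this two-step iteration (this is not the article's code; the concrete pLSA E-step and M-step appear in the implementation further below), a generic EM loop with a simple convergence check on the log-likelihood could look like this, where estep, mstep and logLikelihood are placeholder methods:

public abstract class EmSkeleton {
    protected abstract void estep();            // compute posterior of latent variables given current parameters
    protected abstract void mstep();            // re-estimate parameters from those posteriors
    protected abstract double logLikelihood();  // used only to monitor convergence

    public void run(int maxIters, double tol) {
        double prevL = Double.NEGATIVE_INFINITY;
        for (int it = 0; it < maxIters; it++) {
            estep();
            mstep();
            double L = logLikelihood();
            System.out.println("[" + it + "] log-likelihood = " + L);
            if (Math.abs(L - prevL) < tol) break;  // stop when the improvement becomes negligible
            prevL = L;
        }
    }
}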


Let us first explain what incomplete data and complete data are. As Zhai explains in a classic note on the EM algorithm, when the likelihood function of the original data is very complex, we can augment the data with some hidden variables to obtain the "complete data", whose likelihood function is simpler and easier to maximize; the original data then become the "incomplete data". We will see that by maximizing the expectation of the complete-data likelihood we can maximize the likelihood of the incomplete data, which gives a simpler way of computing the maximum of the likelihood function.


For our pLSA parameter estimation problem, in the E-step we simply use the Bayes formula to compute the posterior probability of the latent variable given the current parameter values:




P(z_k | d_i, w_j) = \frac{P(w_j|z_k) P(z_k|d_i)}{\sum_{l=1}^{K} P(w_j|z_l) P(z_l|d_i)}

In this step we assume that all P(z_k|d_i) and P(w_j|z_k) are known: they are randomly initialized at the beginning, and in later iterations they take the values obtained in the M-step of the previous round.


In the M-step we maximize the expectation of the complete-data log-likelihood. In pLSA the incomplete data are the observed pairs (d_i, w_j) and the latent variable is the topic z_k, so the complete data are the triples (d_i, w_j, z_k), and the expectation of the complete-data log-likelihood is




E[\mathcal{L}^c] = \sum_{i=1}^{N} \sum_{j=1}^{M} n(d_i, w_j) \sum_{k=1}^{K} P(z_k|d_i, w_j) \log \big[ P(w_j|z_k) P(z_k|d_i) \big]

Note that here P(z_k|d_i, w_j) is known: it takes the value estimated in the preceding E-step. Maximizing this expectation is again a constrained extremum problem of a multivariate function, which can be solved with the Lagrange multiplier method. The Lagrange multiplier method turns the conditional extremum problem into an unconditional one; in pLSA the objective function is the expectation above and the constraints are




\sum_{j=1}^{M} P(w_j|z_k) = 1, \qquad \sum_{k=1}^{K} P(z_k|d_i) = 1

So we can write the Lagrange function




H = E[\mathcal{L}^c] + \sum_{k=1}^{K} \tau_k \Big(1 - \sum_{j=1}^{M} P(w_j|z_k)\Big) + \sum_{i=1}^{N} \rho_i \Big(1 - \sum_{k=1}^{K} P(z_k|d_i)\Big)

This is a function of P(w_j|z_k) and P(z_k|d_i). Taking the partial derivatives with respect to each of them and setting them to zero, we obtain




\sum_{i=1}^{N} n(d_i, w_j) P(z_k|d_i, w_j) - \tau_k P(w_j|z_k) = 0, \qquad 1 \le j \le M, \; 1 \le k \le K

\sum_{j=1}^{M} n(d_i, w_j) P(z_k|d_i, w_j) - \rho_i P(z_k|d_i) = 0, \qquad 1 \le i \le N, \; 1 \le k \le K

Note that these are obtained by multiplying both sides of the stationarity conditions by P(w_j|z_k) and P(z_k|d_i) respectively and rearranging. Using these equations together with the two constraint equations above (four groups of equations in total), we can solve for the new parameter values estimated by maximizing the expectation in the M-step:




P(w_j|z_k) = \frac{\sum_{i=1}^{N} n(d_i, w_j) P(z_k|d_i, w_j)}{\sum_{j'=1}^{M} \sum_{i=1}^{N} n(d_i, w_{j'}) P(z_k|d_i, w_{j'})}

P(z_k|d_i) = \frac{\sum_{j=1}^{M} n(d_i, w_j) P(z_k|d_i, w_j)}{n(d_i)}, \qquad n(d_i) = \sum_{j=1}^{M} n(d_i, w_j)

The key to solving the system is to work out \tau_k and \rho_i first: summing the stationarity conditions over j (respectively over k) and using the fact that the probabilities sum to 1 gives the multipliers, after which the rest of the calculation is straightforward.

Then, using the updated parameter values, we go back to the E-step and compute the posterior probability of the latent variables under the new parameter estimates, and we keep iterating like this until the termination condition is met.

Note that we still use the MLE of the complete data in the M-step, so if we want to incorporate some prior knowledge into our model, we can use a MAP estimate in the M-step instead. This is just like the coin-tossing experiment in "Parameter estimation for text language models: maximum likelihood estimation, MAP and Bayesian estimation", where we added the prior that "a coin is generally uniform on both sides". The estimated parameter values then simply gain some pseudo-counts from the prior parameters in the numerator and denominator, and the other steps remain the same. For details you can refer to Mei Qiaozhu's notes.
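As an illustration (not part of the original article or of Andrew Polar's code below), a MAP-style M-step update for P(w|z) with symmetric pseudo-counts could look like the following sketch; here expCount[z][w] stands for the expected count \sum_i n(d_i, w_j) P(z_k|d_i, w_j) from the E-step, and the names SmoothedMstep, updatePwz and beta are invented for this example.

class SmoothedMstep {
    // Smoothed (MAP-style) re-estimation of P(w|z): a pseudo-count beta is added to every
    // expected count so that no probability collapses to zero; beta = 0 recovers plain MLE.
    static void updatePwz(double[][] expCount, double[][] pw_z, double beta) {
        int K = expCount.length, V = expCount[0].length;
        for (int z = 0; z < K; z++) {
            double norm = 0.0;
            for (int w = 0; w < V; w++) {
                pw_z[z][w] = expCount[z][w] + beta;  // numerator: expected count plus pseudo-count
                norm += pw_z[z][w];                  // denominator gains V * beta extra mass
            }
            for (int w = 0; w < V; w++) {
                pw_z[z][w] /= norm;
            }
        }
    }
}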

Implementing pLSA is not difficult; there are many implementations available on the Internet.

For example, this pLSA EM algorithm implementation: http://ezcodesample.com/plsaidiots/PLSAjava.txt

The main classes are as follows (author Andrew Polar)

// The code is taken from:
// http://code.google.com/p/mltool4j/source/browse/trunk/src/edu/thu/mltool4j/topicmodel/plsa
// I noticed some difference with the original Hofmann concept in the computation of P(z). It
// is always even and actually not involved, which makes this algorithm non-negative matrix
// factoring and not pLSA.
// Found and tested by Andrew Polar.
// My version can be found on semanticsearchart.com or ezcodesample.com

import java.io.File;
import java.util.ArrayList;

// Note: the helper classes DataSet, Data, Feature and Posting come from the original mltool4j project and are not reproduced here.
class ProbabilisticLSA {
    private DataSet dataset = null;
    private Posting[][] invertedIndex = null;
    private int M = -1;  // number of documents
    private int V = -1;  // number of words
    private int K = -1;  // number of topics

    public ProbabilisticLSA() {}

    public boolean doPLSA(String datafileName, int ntopics, int iters) {
        File datafile = new File(datafileName);
        if (datafile.exists()) {
            if ((this.dataset = new DataSet(datafile)) == null) {
                System.out.println("doPLSA, dataset == null");
                return false;
            }
            this.M = this.dataset.size();
            this.V = this.dataset.getFeatureNum();
            this.K = ntopics;
            this.buildInvertedIndex(this.dataset);  // build inverted index
            this.EM(iters);                         // run EM algorithm
            return true;
        } else {
            System.out.println("ProbabilisticLSA(String datafileName), datafile: " + datafileName + " doesn't exist");
            return false;
        }
    }

    // Build the inverted index for fast M-step calculation.
    // Format: invertedIndex[w][] is an unsorted list of postings (document, position) in which word w occurs.
    @SuppressWarnings("unchecked")
    private boolean buildInvertedIndex(DataSet ds) {
        ArrayList<Posting>[] list = new ArrayList[this.V];
        for (int k = 0; k < this.V; ++k) { list[k] = new ArrayList<Posting>(); }
        for (int m = 0; m < this.M; m++) {
            Data d = ds.getDataAt(m);
            for (int position = 0; position < d.size(); position++) {
                int w = d.getFeatureAt(position).dim;
                list[w].add(new Posting(m, position));  // add posting
            }
        }
        // convert to array
        this.invertedIndex = new Posting[this.V][];
        for (int w = 0; w < this.V; w++) { this.invertedIndex[w] = list[w].toArray(new Posting[0]); }
        return true;
    }

    private boolean EM(int iters) {
        double[] Pz = new double[this.K];                   // p(z), size: K
        double[][] Pd_z = new double[this.K][this.M];       // p(d|z), size: K x M
        double[][] Pw_z = new double[this.K][this.V];       // p(w|z), size: K x V
        double[][][] Pz_dw = new double[this.K][this.M][];  // p(z|d,w), size: K x M x doc.size()
        double L = -1;                                      // L: log-likelihood value

        this.init(Pz, Pd_z, Pw_z, Pz_dw);
        for (int it = 0; it < iters; it++) {
            // E-step
            if (!this.Estep(Pz, Pd_z, Pw_z, Pz_dw)) { System.out.println("EM, in E-step"); }
            // M-step
            if (!this.Mstep(Pz_dw, Pw_z, Pd_z, Pz)) { System.out.println("EM, in M-step"); }
            L = calcLoglikelihood(Pz, Pd_z, Pw_z);
            System.out.println("[" + it + "]" + "\tlikelihood: " + L);
        }

        // print result
        for (int m = 0; m < this.M; m++) {
            double norm = 0.0;
            for (int z = 0; z < this.K; z++) { norm += Pd_z[z][m]; }
            if (norm <= 0.0) norm = 1.0;
            for (int z = 0; z < this.K; z++) { System.out.format("%10.4f", Pd_z[z][m] / norm); }
            System.out.println();
        }
        return false;
    }

    private boolean init(double[] Pz, double[][] Pd_z, double[][] Pw_z, double[][][] Pz_dw) {
        // p(z): uniform initialization
        double zvalue = (double) 1 / (double) this.K;
        for (int z = 0; z < this.K; z++) { Pz[z] = zvalue; }
        // p(d|z): random initialization followed by normalization
        for (int z = 0; z < this.K; z++) {
            double norm = 0.0;
            for (int m = 0; m < this.M; m++) {
                Pd_z[z][m] = Math.random();
                norm += Pd_z[z][m];
            }
            for (int m = 0; m < this.M; m++) { Pd_z[z][m] /= norm; }
        }
        // p(w|z): random initialization followed by normalization
        for (int z = 0; z < this.K; z++) {
            double norm = 0.0;
            for (int w = 0; w < this.V; w++) {
                Pw_z[z][w] = Math.random();
                norm += Pw_z[z][w];
            }
            for (int w = 0; w < this.V; w++) { Pw_z[z][w] /= norm; }
        }
        // p(z|d,w): allocate one array per (document, position)
        for (int z = 0; z < this.K; z++) {
            for (int m = 0; m < this.M; m++) { Pz_dw[z][m] = new double[this.dataset.getDataAt(m).size()]; }
        }
        return false;
    }

    private boolean Estep(double[] Pz, double[][] Pd_z, double[][] Pw_z, double[][][] Pz_dw) {
        for (int m = 0; m < this.M; m++) {
            Data data = this.dataset.getDataAt(m);
            for (int position = 0; position < data.size(); position++) {
                // get the word (dimension) in the current position of document m
                int w = data.getFeatureAt(position).dim;
                double norm = 0.0;
                for (int z = 0; z < this.K; z++) {
                    double val = Pz[z] * Pd_z[z][m] * Pw_z[z][w];
                    Pz_dw[z][m][position] = val;
                    norm += val;
                }
                // normalization
                for (int z = 0; z < this.K; z++) { Pz_dw[z][m][position] /= norm; }
            }
        }
        return true;
    }

    private boolean Mstep(double[][][] Pz_dw, double[][] Pw_z, double[][] Pd_z, double[] Pz) {
        // p(w|z)
        for (int z = 0; z < this.K; z++) {
            double norm = 0.0;
            for (int w = 0; w < this.V; w++) {
                double sum = 0.0;
                Posting[] postings = this.invertedIndex[w];
                for (Posting posting : postings) {
                    int m = posting.docID;
                    int position = posting.pos;
                    double n = this.dataset.getDataAt(m).getFeatureAt(position).weight;
                    sum += n * Pz_dw[z][m][position];
                }
                Pw_z[z][w] = sum;
                norm += sum;
            }
            // normalization
            for (int w = 0; w < this.V; w++) { Pw_z[z][w] /= norm; }
        }
        // p(d|z)
        for (int z = 0; z < this.K; z++) {
            double norm = 0.0;
            for (int m = 0; m < this.M; m++) {
                double sum = 0.0;
                Data d = this.dataset.getDataAt(m);
                for (int position = 0; position < d.size(); position++) {
                    double n = d.getFeatureAt(position).weight;
                    sum += n * Pz_dw[z][m][position];
                }
                Pd_z[z][m] = sum;
                norm += sum;
            }
            // normalization
            for (int m = 0; m < this.M; m++) { Pd_z[z][m] /= norm; }
        }
        // this is definitely a bug: p(z) values are even, but they should not be even
        double norm = 0.0;
        for (int z = 0; z < this.K; z++) {
            double sum = 0.0;
            for (int m = 0; m < this.M; m++) { sum += Pd_z[z][m]; }
            Pz[z] = sum;
            norm += sum;
        }
        // normalization (Pz[z] can be printed here)
        for (int z = 0; z < this.K; z++) { Pz[z] /= norm; }
        return true;
    }

    private double calcLoglikelihood(double[] Pz, double[][] Pd_z, double[][] Pw_z) {
        double L = 0.0;
        for (int m = 0; m < this.M; m++) {
            Data d = this.dataset.getDataAt(m);
            for (int position = 0; position < d.size(); position++) {
                Feature f = d.getFeatureAt(position);
                int w = f.dim;
                double n = f.weight;
                double sum = 0.0;
                for (int z = 0; z < this.K; z++) { sum += Pz[z] * Pd_z[z][m] * Pw_z[z][w]; }
                L += n * Math.log10(sum);
            }
        }
        return L;
    }
}

public class pLSA {
    public static void main(String[] args) {
        ProbabilisticLSA plsa = new ProbabilisticLSA();
        // The file is not used, hard-coded data are used instead, but the file name should be valid;
        // just replace the name by something valid.
        plsa.doPLSA("C:\\users\\apolar\\workspace\\plsa\\src\\data.txt", 2, 60);
        System.out.println("end PLSA");
    }
}

4 Estimating the parameters of a simple mixture unigram language model with EM

In the parameter estimation of pLSA we used the EM algorithm. The EM algorithm is often used for parameter estimation problems in models that contain "missing data" or "latent variables". The two concepts are closely related: when our model has latent variables, we regard the raw data as "incomplete data" because the values of the latent variables cannot be observed; conversely, when our data are incomplete, we can model the "missing data" by introducing latent variables.


To deepen our understanding of the EM algorithm, let us look at how the EM algorithm can be used to estimate the parameters of a simple mixture unigram language model.
