Text topic model LDA (1): LDA basics
Text topic model LDA (2): The Gibbs sampling algorithm for solving LDA
Text topic model LDA (3): The variational inference EM algorithm for solving LDA
This article is the third part of the LDA topic model series. It is recommended to read Text topic model LDA (1): LDA basics before reading it, and since the EM algorithm is used here, readers unfamiliar with the EM algorithm should first review its main idea. The variational inference EM algorithm for LDA is the algorithm implemented by Spark MLlib and scikit-learn, so it is well worth understanding.
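Before diving into the derivation, here is a minimal sketch of running the variational-inference-based LDA implementation in scikit-learn; the toy corpus, the number of topics, and the parameter values are illustrative only, not a recommendation:

```python
# Minimal sketch: scikit-learn's LatentDirichletAllocation uses the variational
# inference approach described in this article. The toy corpus below is made up.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = [
    "apple banana fruit juice",
    "banana fruit smoothie apple",
    "football basketball sport match",
    "basketball match player sport",
]

# Bag-of-words counts: shape (M documents, V vocabulary words)
counts = CountVectorizer().fit_transform(docs)

lda = LatentDirichletAllocation(
    n_components=2,           # K topics
    learning_method="batch",  # batch variational EM
    max_iter=50,
    random_state=0,
)
doc_topic = lda.fit_transform(counts)  # approximate document-topic proportions
topic_word = lda.components_           # variational topic-word parameters (pseudo-counts)
print(doc_topic.round(3))
```

The `components_` attribute holds the (unnormalized) variational topic-word parameters, which play the role of the $\lambda$ parameters derived later in this article.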
1. The idea of solving LDA with the variational inference EM algorithm
First, the model diagram of LDA is reviewed as follows:
The variational inference EM algorithm obtains the document-topic distributions and topic-word distributions of the LDA model through variational inference combined with the EM algorithm. First note that, since the EM algorithm is used here, our model has the hidden variables $\theta, \beta, z$, and the model parameters are $\alpha, \eta$. To find the model parameters and the corresponding hidden variable distributions, the EM algorithm needs, in the E-step, the expectation based on the conditional probability distribution of the hidden variables $\theta, \beta, z$, and then maximizes this expectation in the M-step to obtain the updated model parameters $\alpha, \eta$.
The problem is that in the E-step of the EM algorithm, because $\theta, \beta, z$ are coupled, we cannot obtain the conditional probability distribution of the hidden variables $\theta, \beta, z$, and therefore cannot compute the corresponding expectation. This is where variational inference comes in. The so-called variational inference here means that, when the hidden variables are coupled, we assume that all hidden variables are generated by independent distributions, thereby removing the coupling between them. We then use the variational distribution formed by these independent distributions to approximate the conditional distribution of the hidden variables, so that the EM algorithm can proceed smoothly.
After several rounds of iterating the E-step and M-step, we obtain suitable approximate distributions for the hidden variables $\theta, \beta, z$ and suitable model parameters $\alpha, \eta$, and from them we get the LDA document-topic distributions and topic-word distributions that we need.
It can be seen that, to fully understand LDA's variational inference EM algorithm, we need to clarify both the variational inference performed in the E-step and the flow of the EM algorithm once the inference is in place.
2. LDA's variational inference approach
To use the EM algorithm, we need the conditional probability distribution of the hidden variables, which is: $$p(\theta,\beta, z | w, \alpha, \eta) = \frac{p(\theta,\beta, z, w| \alpha, \eta)}{p(w|\alpha, \eta)}$$
As mentioned above, because $\theta, \beta, z$ are coupled, this conditional probability cannot be obtained directly, yet without it the EM algorithm cannot be used. What can we do? We introduce variational inference, specifically variational inference based on the mean field assumption, which assumes that each hidden variable is generated by its own independent distribution, as shown below:
We assume that the hidden variable $\theta$ is generated by the independent distribution $\gamma$, the hidden variable $z$ is generated by the independent distribution $\phi$, and the hidden variable $\beta$ is generated by the independent distribution $\lambda$. This gives us the variational distribution $q$ of the hidden variables: $$\begin{align} q(\beta, z, \theta|\lambda,\phi, \gamma) & = \prod_{k=1}^K q(\beta_k|\lambda_k) \prod_{d=1}^M q(\theta_d, z_d|\gamma_d,\phi_d) \\ & = \prod_{k=1}^K q(\beta_k|\lambda_k) \prod_{d=1}^M \Big( q(\theta_d|\gamma_d) \prod_{n=1}^{N_d} q(z_{dn}| \phi_{dn}) \Big) \end{align}$$
Our goal is to use $q(\beta, z, \theta|\lambda,\phi, \gamma)$ to approximate $p(\theta,\beta, z | w, \alpha, \eta)$, which means the two distributions should be as similar as possible. In mathematical terms, we want the KL divergence between the two distributions to be as small as possible, namely: $$(\lambda^*,\phi^*, \gamma^*) = \underbrace{arg \;min}_{\lambda,\phi, \gamma} D\big(q(\beta, z, \theta|\lambda,\phi, \gamma) \,||\, p(\theta,\beta, z | w, \alpha, \eta)\big)$$
where $D(q||p)$ is the KL divergence (KL distance), which measures how different the distribution $q$ is from $p$. That is: $$D(q||p) = \sum\limits_{x}q(x) \log\frac{q(x)}{p(x)} = E_{q(x)}\big(\log q(x) - \log p(x)\big)$$
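As a quick numerical illustration of this definition, with two arbitrarily chosen discrete distributions:

```python
# Toy illustration of the KL divergence D(q||p) for two discrete distributions.
# The distributions q and p here are arbitrary examples.
import numpy as np

q = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])

kl_qp = np.sum(q * np.log(q / p))   # D(q||p) = sum_x q(x) log(q(x)/p(x))
kl_pq = np.sum(p * np.log(p / q))   # note: KL divergence is not symmetric
print(kl_qp, kl_pq)
```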
Our goal is to find suitable $\lambda^*,\phi^*, \gamma^*$, and then use $q(\beta, z, \theta|\lambda^*,\phi^*, \gamma^*)$ to approximate the conditional distribution of the hidden variables $p(\theta,\beta, z | w, \alpha, \eta)$, which in turn allows the EM iterations to proceed.
Such suitable $\lambda^*,\phi^*, \gamma^*$ are not easy to obtain directly, so what can we do? Let us look at the log likelihood of the document data, $\log p(w|\alpha,\eta)$, written below. To simplify notation, we write $E_q(x)$ for $E_{q(\beta, z, \theta|\lambda,\phi, \gamma)}(x)$, the expectation of $x$ under the variational distribution $q(\beta, z, \theta|\lambda,\phi, \gamma)$.
$$\begin{align} \log p(w|\alpha,\eta) & = \log \int\int \sum\limits_z p(\theta,\beta, z, w| \alpha, \eta)\, d\theta\, d\beta \\ & = \log \int\int \sum\limits_z \frac{p(\theta,\beta, z, w| \alpha, \eta)\, q(\beta, z, \theta|\lambda,\phi, \gamma)}{q(\beta, z, \theta|\lambda,\phi, \gamma)}\, d\theta\, d\beta \\ & = \log\; E_q \frac{p(\theta,\beta, z, w| \alpha, \eta)}{q(\beta, z, \theta|\lambda,\phi, \gamma)} \\ & \geq E_q\; \log\frac{p(\theta,\beta, z, w| \alpha, \eta)}{q(\beta, z, \theta|\lambda,\phi, \gamma)} \\ & = E_q\; \log p(\theta,\beta, z, w| \alpha, \eta) - E_q\; \log q(\beta, z, \theta|\lambda,\phi, \gamma) \end{align}$$
From (5) to (6) we used Jensen's inequality: $$f(E(x)) \geq E(f(x)), \;\; \text{when } f(x) \text{ is a concave function}$$
We usually write the expression in (7) as: $$L(\lambda,\phi, \gamma; \alpha, \eta) = E_q\; \log p(\theta,\beta, z, w| \alpha, \eta) - E_q\; \log q(\beta, z, \theta|\lambda,\phi, \gamma)$$
Since $L(\lambda,\phi, \gamma; \alpha, \eta)$ is a lower bound of our log likelihood (expression (6)), this $L$ is generally called the ELBO (Evidence Lower BOund). So what does this ELBO have to do with the KL divergence we need to minimize? Note that: $$\begin{align} D\big(q(\beta, z, \theta|\lambda,\phi, \gamma) \,||\, p(\theta,\beta, z | w, \alpha, \eta)\big) & = E_q\, \log q(\beta, z, \theta|\lambda,\phi, \gamma) - E_q\, \log p(\theta,\beta, z | w, \alpha, \eta) \\ & = E_q\, \log q(\beta, z, \theta|\lambda,\phi, \gamma) - E_q\, \log \frac{p(\theta,\beta, z, w| \alpha, \eta)}{p(w|\alpha, \eta)} \\ & = -L(\lambda,\phi, \gamma; \alpha, \eta) + \log p(w|\alpha,\eta) \end{align}$$
In equation (10), the log likelihood term does not depend on the variational parameters, so with respect to our KL divergence it can be regarded as a constant. Therefore, minimizing the KL divergence is equivalent to maximizing the ELBO, and our variational inference ultimately amounts to maximizing the ELBO. We now focus on maximizing the ELBO and finding the variational parameters $\lambda, \phi, \gamma$ at the maximum.
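As a sanity check on this relationship, the following toy computation (with an arbitrary discrete joint distribution and an arbitrary $q$, chosen only for illustration) verifies numerically that $\log p(w) = L + D(q||p(z|w))$:

```python
# Toy numerical check that log-evidence = ELBO + KL(q || posterior),
# so maximizing the ELBO is equivalent to minimizing the KL divergence.
# The joint distribution p(z, w) and the variational q(z) are arbitrary examples.
import numpy as np

# Joint p(z, w=observed) over a single discrete hidden variable z with 3 states
p_joint = np.array([0.10, 0.25, 0.15])        # p(z=k, w) for the observed w
log_evidence = np.log(p_joint.sum())          # log p(w)

posterior = p_joint / p_joint.sum()           # p(z | w)
q = np.array([0.2, 0.5, 0.3])                 # an arbitrary variational distribution

elbo = np.sum(q * (np.log(p_joint) - np.log(q)))    # E_q[log p(z,w) - log q(z)]
kl = np.sum(q * (np.log(q) - np.log(posterior)))    # D(q || p(z|w))

print(log_evidence, elbo + kl)                # the two values agree
```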
3. Maximizing the ELBO to solve for the variational parameters
To maximize the ELBO, we first rearrange the ELBO function as follows: $$\begin{align} L(\lambda,\phi, \gamma; \alpha, \eta) & = E_q[\log p(\beta|\eta)] + E_q[\log p(z|\theta)] + E_q[\log p(\theta|\alpha)] \\ & + E_q[\log p(w|z, \beta)] - E_q[\log q(\beta|\lambda)] \\ & - E_q[\log q(z|\phi)] - E_q[\log q(\theta|\gamma)] \end{align}$$
There are 7 terms in total, and each of them needs to be expanded separately. To save space, only the first term is expanded in detail here. Before expanding the first term, we need a property of the exponential family of distributions. The exponential family refers to probability distributions of the following form: $$p(x|\theta) = h(x) \exp\big(\eta(\theta) \cdot T(x) - A(\theta)\big)$$
where $A(\theta)$ is the normalization factor, which ensures that the probability distribution sums (or integrates) to 1. The main reason for introducing the exponential family is the following property: $$\frac{d}{d\, \eta(\theta)} A(\theta) = E_{p(x|\theta)}[T(x)]$$
The proof is not complicated and is omitted here. Our common distributions such as the Gamma distribution, the Beta distribution, and the Dirichlet distribution all belong to the exponential family. With this property, a large number of the expectations appearing in the ELBO can be computed by taking derivatives, a trick that greatly simplifies the computation.
Returning to our ELBO, the first term expands as follows: $$\begin{align} E_q[\log p(\beta|\eta)] & = E_q\Big[\log\Big(\prod_{k=1}^K\frac{\Gamma(\sum\limits_{i=1}^V\eta_i)}{\prod_{i=1}^V\Gamma(\eta_i)}\prod_{i=1}^V\beta_{ki}^{\eta_i-1}\Big)\Big] \\ & = K\log\Gamma(\sum\limits_{i=1}^V\eta_i) - K\sum\limits_{i=1}^V\log\Gamma(\eta_i) + \sum\limits_{k=1}^K E_q\Big[\sum\limits_{i=1}^V(\eta_i-1)\log\beta_{ki}\Big] \end{align}$$
The expectation in the third part of (15) can be converted into a derivative using the exponential family property mentioned above, i.e.: $$E_q[\log\beta_{ki}] = \big(\log\Gamma(\lambda_{ki}) - \log\Gamma(\sum\limits_{i^{'}=1}^V\lambda_{ki^{'}})\big)^{'} = \Psi(\lambda_{ki}) - \Psi(\sum\limits_{i^{'}=1}^V\lambda_{ki^{'}})$$
where: $$\Psi(x) = \frac{d}{d\, x}\log\Gamma(x) = \frac{\Gamma^{'}(x)}{\Gamma(x)}$$
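To see this identity in action, the following sketch (with arbitrary $\lambda$ values chosen only for illustration) compares the digamma expression with a Monte Carlo estimate of $E_q[\log\beta_{ki}]$ under Dirichlet draws:

```python
# Monte Carlo check of the exponential-family identity used above:
# for beta_k ~ Dirichlet(lambda_k), E[log beta_ki] = Psi(lambda_ki) - Psi(sum_i lambda_ki).
# The lambda values are arbitrary; only the identity itself matters here.
import numpy as np
from scipy.special import psi   # the digamma function

rng = np.random.default_rng(0)
lam = np.array([0.7, 2.0, 5.0, 1.3])        # variational Dirichlet parameters for one topic

samples = rng.dirichlet(lam, size=200000)   # draws of beta_k ~ Dirichlet(lam)
mc_estimate = np.log(samples).mean(axis=0)  # Monte Carlo estimate of E[log beta_ki]
closed_form = psi(lam) - psi(lam.sum())     # digamma formula

print(np.round(mc_estimate, 4))
print(np.round(closed_form, 4))             # the two agree up to sampling noise
```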
In the end, the expansion of the first term of the ELBO is: $$\begin{align} E_q[\log p(\beta|\eta)] & = K\log\Gamma(\sum\limits_{i=1}^V\eta_i) - K\sum\limits_{i=1}^V\log\Gamma(\eta_i) + \sum\limits_{k=1}^K\sum\limits_{i=1}^V(\eta_i-1)\big(\Psi(\lambda_{ki}) - \Psi(\sum\limits_{i^{'}=1}^V\lambda_{ki^{'}})\big) \end{align}$$
Applying similar methods to the other 6 terms, we can obtain the final expression of the ELBO in terms of the variational parameters $\lambda, \phi, \gamma$. The other 6 expansions are: $$\begin{align} E_q[\log p(z|\theta)] & = \sum\limits_{n=1}^N\sum\limits_{k=1}^K\phi_{nk}\big(\Psi(\gamma_{k}) - \Psi(\sum\limits_{k^{'}=1}^K\gamma_{k^{'}})\big) \end{align}$$
$$\begin{align} E_q[\log p(\theta|\alpha)] & = \log\Gamma(\sum\limits_{k=1}^K\alpha_k) - \sum\limits_{k=1}^K\log\Gamma(\alpha_k) + \sum\limits_{k=1}^K(\alpha_k-1)\big(\Psi(\gamma_{k}) - \Psi(\sum\limits_{k^{'}=1}^K\gamma_{k^{'}})\big) \end{align}$$
$$\begin{align} E_q[\log p(w|z, \beta)] & = \sum\limits_{n=1}^N\sum\limits_{k=1}^K\sum\limits_{i=1}^V\phi_{nk}w_n^i\big(\Psi(\lambda_{ki}) - \Psi(\sum\limits_{i^{'}=1}^V\lambda_{ki^{'}})\big) \end{align}$$
$$\begin{align} E_q[\log q(\beta|\lambda)] & = \sum\limits_{k=1}^K\big(\log\Gamma(\sum\limits_{i=1}^V\lambda_{ki}) - \sum\limits_{i=1}^V\log\Gamma(\lambda_{ki})\big) + \sum\limits_{k=1}^K\sum\limits_{i=1}^V(\lambda_{ki}-1)\big(\Psi(\lambda_{ki}) - \Psi(\sum\limits_{i^{'}=1}^V\lambda_{ki^{'}})\big) \end{align}$$
$$\begin{align} E_q[\log q(z|\phi)] & = \sum\limits_{n=1}^N\sum\limits_{k=1}^K\phi_{nk}\log\phi_{nk} \end{align}$$
$$\begin{align} E_q[\log q(\theta|\gamma)] & = \log\Gamma(\sum\limits_{k=1}^K\gamma_k) - \sum\limits_{k=1}^K\log\Gamma(\gamma_k) + \sum\limits_{k=1}^K(\gamma_k-1)\big(\Psi(\gamma_{k}) - \Psi(\sum\limits_{k^{'}=1}^K\gamma_{k^{'}})\big) \end{align}$$
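To make these digamma-based expressions concrete, here is a minimal sketch that evaluates two of the terms above, (17) and (21), for a single document; the values of `phi` and `gamma` are made up for illustration:

```python
# Sketch: evaluating two of the per-document ELBO terms, (17) and (21),
# for one document, given made-up variational parameters phi (N x K) and gamma (K).
import numpy as np
from scipy.special import psi

phi = np.array([[0.7, 0.3],
                [0.2, 0.8],
                [0.5, 0.5]])     # phi_{nk}: word-level topic responsibilities
gamma = np.array([2.4, 1.6])     # gamma_k: the document's Dirichlet parameter

# (17): E_q[log p(z|theta)] = sum_n sum_k phi_nk * (Psi(gamma_k) - Psi(sum_k' gamma_k'))
e_log_p_z = np.sum(phi * (psi(gamma) - psi(gamma.sum())))

# (21): E_q[log q(z|phi)] = sum_n sum_k phi_nk * log(phi_nk)
e_log_q_z = np.sum(phi * np.log(phi))

print(e_log_p_z, e_log_q_z)
```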
With the explicit expression of the ELBO in terms of the variational parameters $\lambda, \phi, \gamma$, we can iteratively update the variational parameters and the model parameters with the EM algorithm.
4. E-step of the EM algorithm: obtaining the optimal variational parameters
With the ELBO function derived from the variational inference above, we can now run the EM algorithm. Unlike the standard EM algorithm, the E-step here needs to compute the optimal variational parameters appearing in the expectations inside the ELBO. How do we find them? By taking the derivative of the ELBO with respect to each variational parameter $\lambda, \phi, \gamma$ and setting the partial derivative to zero, we obtain iterative update formulas; iterating these updates to convergence gives the optimal variational parameters.
The derivation yields the following update formulas for the variational parameters:
$$\begin{align} \phi_{nk} & \propto \exp\Big(\sum\limits_{i=1}^V w_n^i\big(\Psi(\lambda_{ki}) - \Psi(\sum\limits_{i^{'}=1}^V\lambda_{ki^{'}})\big) + \Psi(\gamma_{k}) - \Psi(\sum\limits_{k^{'}=1}^K\gamma_{k^{'}})\Big) \end{align}$$
where $w_n^i = 1$ if and only if the $n$-th word in the document is the $i$-th word in the vocabulary.
$$\begin{align} \gamma_k & = \alpha_k + \sum\limits_{n=1}^N\phi_{nk} \end{align}$$
$$\begin{align} \lambda_{ki} & = \eta_i + \sum\limits_{n=1}^N\phi_{nk}w_n^i \end{align}$$
Because the variational parameter $\lambda$ determines the distribution of $\beta$, which is shared across the whole corpus, the update actually sums over all documents:
$$\begin{align} \lambda_{ki} & = \eta_i + \sum\limits_{d=1}^M\sum\limits_{n=1}^{N_d}\phi_{dnk}w_{dn}^i \end{align}$$
In summary, the E-step updates the three variational parameters using (23), (24), and (26), and keeps iterating these updates until the three variational parameters converge. Once they converge, the next step is the M-step, which fixes the variational parameters and updates the model parameters $\alpha, \eta$.
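The following is a minimal sketch of this E-step coordinate ascent, built directly from (23), (24) and (26). It assumes each document is given as a 1-D array of vocabulary indices, `eta` is a scalar symmetric prior and `alpha` a length-$K$ vector; all names are illustrative, not a library API:

```python
# Sketch of the E-step updates (23), (24) and (26); a fixed number of sweeps is used
# here for simplicity, whereas real implementations iterate until convergence.
import numpy as np
from scipy.special import psi   # digamma

def lda_e_step(docs, lam, alpha, eta, n_sweeps=50):
    """docs: list of int arrays of word ids; lam: (K, V) variational topic-word parameters."""
    K, V = lam.shape
    M = len(docs)
    gamma = np.ones((M, K)) * (alpha + 1.0)               # simple initialization of gamma_dk
    phis = [np.full((len(d), K), 1.0 / K) for d in docs]  # phi_dnk, one (N_d, K) block per doc

    e_log_beta = psi(lam) - psi(lam.sum(axis=1, keepdims=True))   # E_q[log beta_ki]
    for _ in range(n_sweeps):
        for d, words in enumerate(docs):
            # (23): log phi_dnk = E_q[log beta_{k, w_dn}] + Psi(gamma_dk) - Psi(sum_k gamma_dk) + const
            log_phi = e_log_beta[:, words].T + (psi(gamma[d]) - psi(gamma[d].sum()))
            log_phi -= log_phi.max(axis=1, keepdims=True)  # subtract max for numerical stability
            phi = np.exp(log_phi)
            phi /= phi.sum(axis=1, keepdims=True)          # normalize each row of phi to sum to 1
            phis[d] = phi
            gamma[d] = alpha + phi.sum(axis=0)             # (24): gamma_dk = alpha_k + sum_n phi_dnk

    # (26): lambda_ki = eta_i + sum_d sum_n phi_dnk * w_dn^i
    new_lam = np.full((K, V), float(eta))
    for d, words in enumerate(docs):
        for k in range(K):
            np.add.at(new_lam[k], words, phis[d][:, k])
    return gamma, phis, new_lam
```

Inside the outer EM loop this would be called as `gamma, phis, lam = lda_e_step(docs, lam, alpha, eta)`; production code typically tracks the per-document change in `gamma` instead of running a fixed number of sweeps.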
5. M-step of the EM algorithm: updating the model parameters
Since the E-step has given us the current optimal variational parameters, in the M-step we can fix these variational parameters and maximize the ELBO to obtain the optimal model parameters $\alpha, \eta$. There are many ways to solve for the optimal $\alpha, \eta$, such as gradient descent or Newton's method. For LDA, Newton's method is generally used: we derive the expressions of the first and second derivatives of the ELBO with respect to $\alpha, \eta$, and then iteratively solve for the optimal $\alpha, \eta$ in the M-step.
For $\alpha$, the expressions of its first and second derivatives are: $$\nabla_{\alpha_k}L = M\big(\Psi(\sum\limits_{k^{'}=1}^K\alpha_{k^{'}}) - \Psi(\alpha_{k})\big) + \sum\limits_{d=1}^M\big(\Psi(\gamma_{dk}) - \Psi(\sum\limits_{k^{'}=1}^K\gamma_{dk^{'}})\big)$$
$$\nabla_{\alpha_k\alpha_j}L = M\big(\Psi^{'}(\sum\limits_{k^{'}=1}^K\alpha_{k^{'}}) - \delta(k,j)\Psi^{'}(\alpha_{k})\big)$$
where $\delta(k,j) = 1$ if and only if $k = j$, and $\delta(k,j) = 0$ otherwise.
For $\eta$, the expressions of its first and second derivatives are: $$\nabla_{\eta_i}L = K\big(\Psi(\sum\limits_{i^{'}=1}^V\eta_{i^{'}}) - \Psi(\eta_{i})\big) + \sum\limits_{k=1}^K\big(\Psi(\lambda_{ki}) - \Psi(\sum\limits_{i^{'}=1}^V\lambda_{ki^{'}})\big)$$
$$\nabla_{\eta_i\eta_j}L = K\big(\Psi^{'}(\sum\limits_{i^{'}=1}^V\eta_{i^{'}}) - \delta(i,j)\Psi^{'}(\eta_{i})\big)$$
where $\delta(i,j) = 1$ if and only if $i = j$, and $\delta(i,j) = 0$ otherwise.
The final Newton iteration formulas are: $$\begin{align} \alpha_{k+1} = \alpha_k + \frac{\nabla_{\alpha_k}L}{\nabla_{\alpha_k\alpha_j}L} \end{align}$$
$$\begin{align} \eta_{i+1} = \eta_i + \frac{\nabla_{\eta_i}L}{\nabla_{\eta_i\eta_j}L} \end{align}$$
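For illustration, here is a minimal sketch of the M-step update for $\alpha$ built from the gradient above. It takes per-coordinate Newton-style steps using only the diagonal Hessian entries, which is a simplification; a full Newton step inverts the complete Hessian. `gamma` is assumed to be the $(M, K)$ matrix of document variational parameters produced by the E-step:

```python
# Sketch: per-coordinate Newton-style update for alpha using the gradient above and
# only the diagonal Hessian entries (a simplification of the full Newton step).
# gamma is assumed to be the (M, K) matrix of gamma_dk values from the E-step.
import numpy as np
from scipy.special import psi, polygamma   # digamma and trigamma

def update_alpha(alpha, gamma, n_iter=20):
    M, K = gamma.shape
    for _ in range(n_iter):
        # gradient: M*(Psi(sum_k alpha_k) - Psi(alpha_k)) + sum_d (Psi(gamma_dk) - Psi(sum_k gamma_dk))
        grad = M * (psi(alpha.sum()) - psi(alpha)) \
             + (psi(gamma) - psi(gamma.sum(axis=1, keepdims=True))).sum(axis=0)
        # diagonal Hessian entries: M*(Psi'(sum_k alpha_k) - Psi'(alpha_k)), negative near the maximum
        hess_diag = M * (polygamma(1, alpha.sum()) - polygamma(1, alpha))
        # Newton step alpha - grad/hess moves uphill since hess_diag < 0; clip to keep alpha > 0
        alpha = np.maximum(alpha - grad / hess_diag, 1e-6)
    return alpha
```

The update for $\eta$ follows the same pattern with $\lambda$ playing the role of $\gamma$ and $K$ playing the role of $M$.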
6. Summary of the LDA variational inference EM algorithm flow
The following summarizes the overall flow of the variational inference EM algorithm for LDA.
Input: the number of topics $K$, and $M$ documents with their corresponding words.
1) Initialize the $\alpha,\eta$ vector.
2) Start the EM algorithm iteration loop until it converges.
a) Initialize all $\phi, \gamma, \lambda$, then run the E-step iteration loop of LDA until $\lambda, \phi, \gamma$ converge.
(i) For d from 1 to M:
For n from 1 to $N_d$:
For k from 1 to K:
Update $\phi_{nk}$ as per (23)
Normalize $\phi_{nk}$ so that the entries of the vector sum to 1.
Update $\gamma_{k}$ as per (24).
(ii) For k from 1 to K:
For i from 1 to V:
Update $\lambda_{ki}$ as per (26).
(iii) If $\phi, \gamma, \lambda$ have all converged, exit step a); otherwise return to step (i).
b) The M-step iteration loop of LDA, until $\alpha, \eta$ converge:
(i) Update $\alpha, \eta$ with Newton's method according to (27) and (28), until convergence.
c) If all parameters have converged, the algorithm ends; otherwise return to step 2).
After the algorithm finishes, we obtain the model parameters $\alpha, \eta$, the approximate topic-word distributions we need (through $\lambda$), and the approximate topic distributions of the training documents (through $\theta$).
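Once the algorithm has converged, the distributions themselves are obtained by normalizing the variational parameters. A minimal sketch (the names `lam` and `gamma` refer to the converged $(K, V)$ and $(M, K)$ parameter matrices from the sketches above):

```python
# Sketch: recovering the approximate distributions from the converged variational parameters.
import numpy as np

def posterior_means(lam, gamma):
    # The mean of a Dirichlet(lambda_k) is lambda_ki / sum_i lambda_ki,
    # giving the approximate topic-word distributions beta_k ...
    topic_word = lam / lam.sum(axis=1, keepdims=True)
    # ... and likewise the approximate document-topic distributions theta_d from gamma_d.
    doc_topic = gamma / gamma.sum(axis=1, keepdims=True)
    return topic_word, doc_topic
```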
(Reprinting is welcome; please indicate the source. Comments and discussion are also welcome: [email protected])