LDA source code analysis (MATLAB)


LDA stands for latent Dirichlet allocation; see Wikipedia for background. Here we walk through an LDA implementation in MATLAB.
Original code author: Daichi Mochihashi
Source code: http://download.csdn.net/detail/nuptboyzhb/5305145
I. Running the LDA source code in MATLAB
1. Environment configuration
Switch MATLAB's working directory to the directory containing the code.
2. Call the main function
> [alpha, beta] = ldamain('train', 20); % training data file 'train', 20 topics
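
For orientation, ldamain is just a thin driver: it loads the training file into a collection of documents and calls lda. A minimal sketch, assuming the package's loader is named fmatrix (the actual driver code is not shown in this article and may differ):

% Illustrative sketch of the driver, not the verbatim ldamain.m.
function [alpha, beta] = ldamain(train, k, emmax, demmax)
if nargin < 4
  demmax = 20;
end
if nargin < 3
  emmax = 100;
end
d = fmatrix(train);   % assumption: fmatrix parses "id:count" lines into a cell array of documents
[alpha, beta] = lda(d, k, emmax, demmax);   % run variational Bayes LDA with k topics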
II. Format of the training data file 'train'
Each line of 'train' represents one document and consists of whitespace-separated <feature_id>:<count> pairs. For a document, the feature id identifies a word, and the count is the number of times that word occurs in the document.
For example, a line such as the following (illustrative) describes a document in which word 1 occurs twice, word 4 once, and word 9 three times:

1:2 4:1 9:3

Note: the train data format for LDA is similar to, but different from, the SVM training data format, which looks like:
<label> <index1>:<value1> <index2>:<value2> ...
That is, SVM training data carries a label, because SVM is supervised learning; LDA is unsupervised, so its training data has no label.
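
To make the format concrete, here is a minimal sketch of parsing one such line in MATLAB into a struct with fields .id and .cnt; the field names are an assumption made for illustration, chosen to match how lda.m passes documents around as d{i}:

% Illustrative: parse one "id:count" line into a document struct.
line  = '1:2 4:1 9:3';            % hypothetical document line
pairs = sscanf(line, '%d:%d');    % format is applied cyclically -> [1 2 4 1 9 3]'
doc.id  = pairs(1:2:end)';        % word ids
doc.cnt = pairs(2:2:end)';        % per-word counts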
III. Meaning of the variables in the code

n      : integer, the number of documents.
L      : integer, the number of words (vocabulary size).
beta   : two-dimensional array; rows are words, columns are topics, and each entry is the probability that a topic generates that word.
alpha  : array of parameters of the Dirichlet distribution.
k      : integer, the number of topics, specified by the user.
gamma  : one-dimensional array; in variational inference, the Dirichlet parameter of one document.
gammas : sufficient statistics, one row of the same form as gamma per document; used in the M-step to estimate alpha.
q      : two-dimensional array; rows are the words of one document, columns are topics, and each entry is the probability that a topic generates that word within the document.
betas  : sufficient statistics accumulated from q in the E-step; a two-dimensional array of the same form as beta, used in the M-step to estimate beta.
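
As a quick sanity check on these shapes, the following illustrative snippet mirrors the initialization in lda.m (the concrete sizes here are made up):

% Illustrative shapes only; the values of n, L, k are arbitrary.
n = 4; L = 10; k = 3;       % documents, vocabulary size, topics
beta   = rand(L, k);        % word-topic probabilities (columns normalized in lda.m)
alpha  = rand(1, k);        % Dirichlet parameter, one entry per topic
gammas = zeros(n, k);       % one variational gamma per document (row)
betas  = zeros(L, k);       % sufficient statistics accumulated in the E-step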

IV. LDA source code overview

Although the LDA source code contains many .m files, lda.m, vbem.m, and newton_alpha.m are the main ones.
V. Core code analysis

lda.m

function [alpha, beta] = lda(d, k, emmax, demmax)
% latent Dirichlet allocation, standard model.
% d      : data of documents
% k      : # of classes to assume
% emmax  : # of maximum VB-EM iterations (default 100)
% demmax : # of maximum VB-EM iterations for a document (default 20)
if nargin < 4
  demmax = 20;
  if nargin < 3
    emmax = 100;
  end
end
n = length(d);
L = features(d);
% initialize
beta = mnormalize(rand(L, k), 1);
alpha = normalize(fliplr(sort(rand(1, k))));
gammas = zeros(n, k);
ppl = 0;
pppl = ppl;
tic;
fprintf(1, 'number of documents      = %d\n', n);
fprintf(1, 'number of words          = %d\n', L);
fprintf(1, 'number of latent classes = %d\n', k);
for j = 1:emmax
  fprintf(1, 'iteration %d/%d..\t', j, emmax);
  % vb-estep: given alpha and beta, compute gammas and betas
  betas = zeros(L, k);
  for i = 1:n
    % process each document
    [gamma, q] = vbem(d{i}, beta, alpha, demmax);
    gammas(i, :) = gamma;                 % save the gamma of each document
    betas = accum_beta(betas, q, d{i});
  end
  % vb-mstep: maximize the likelihood of gammas; solve for alpha and beta
  alpha = newton_alpha(gammas);
  beta = mnormalize(betas, 1);
  % converged?
  ppl = lda_ppl(d, beta, gammas);
  fprintf(1, 'ppl = %g\t', ppl);
  if (j > 1) & converged(ppl, pppl, 1.0e-4)
    if (j < 5)
      fprintf(1, '\n');
      % too few iterations: try again!
      [alpha, beta] = lda(d, k, emmax, demmax);
      return;
    end
    fprintf(1, '\nconverged.\n');
    return;
  end
  pppl = ppl;
  % ETA
  elapsed = toc;
  fprintf(1, 'ETA:%s (%d sec/step)\r', ...
          rtime(elapsed * (emmax / j - 1)), round(elapsed / j));
end
fprintf(1, '\n');
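
The article does not reproduce vbem.m or newton_alpha.m. For orientation, here is a minimal sketch of the standard per-document variational update that vbem performs, reconstructed from how lda.m calls it ([gamma, q] = vbem(d{i}, beta, alpha, demmax)); it reuses the package helpers mnormalize and converged, and the document field names .id and .cnt are assumptions. The actual file may differ in detail. newton_alpha then fits the Dirichlet parameter alpha to the rows of gammas via Newton-Raphson iteration.

function [gamma, q] = vbem(d, beta, alpha0, emmax)
% Sketch of the per-document variational EM loop (reconstruction, not verbatim).
% d : document struct with fields .id (word ids) and .cnt (counts) -- assumption
l  = length(d.id);           % number of distinct words in this document
k  = size(beta, 2);          % number of topics
nt = l * ones(1, k) / k;     % expected topic counts, initialized uniformly
pnt = nt;
for j = 1:emmax
  % update responsibilities q(word, topic); psi is MATLAB's digamma function
  q = mnormalize(beta(d.id, :) .* repmat(exp(psi(alpha0 + nt)), l, 1), 2);
  % update expected topic counts from the responsibilities
  nt = d.cnt * q;
  if (j > 1) & converged(nt, pnt, 1.0e-2)
    break;
  end
  pnt = nt;
end
gamma = alpha0 + nt;         % variational Dirichlet posterior for this document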

This article may not be used for commercial purposes without permission.
