Python Implementation of the LDA Model


  • LDA (Latent Dirichlet Allocation) is a generative topic model for documents. I recently read a few papers on it and wanted to implement it in Python. There is plenty of material on the mathematical model itself; the "lda algorithm roaming guide" is a very detailed reference.
  • This post only describes the Python implementation of the algorithm's sampling step.
  • The code is open source: python-LDA


  • LDA model variable allocation and initialization
# Pseudocode
# Input:  article set (already word-segmented), K (number of topics)
# Output: an LDA model with one round of random topic assignment
begin
    # allocate the statistics
    p      probability vector                          dimension: K
    nw     word counts on each topic                   dimension: M * K  (M = total number of distinct words in the corpus)
    nwsum  total word count of each topic              dimension: K
    nd     topic counts within each article            dimension: V * K  (V = total number of articles)
    ndsum  total word count of each article            dimension: V
    z      topic assigned to each word                 dimension: V * (number of words in each article)
    theta  article -> topic probability distribution   dimension: V * K
    phi    topic -> word probability distribution      dimension: K * M

    # random initial assignment
    for x in number of articles:
        record ndsum[article id] = number of words in the article
        for y in number of words in the article:
            randomly assign a topic to the word
            word count of that topic (nw)            += 1
            count of that topic in this article (nd) += 1
            total word count of that topic (nwsum)   += 1
end
# Code snippets; for full details, refer to the GitHub project
class LDAModel(object):

    def __init__(self, dpre):
        self.dpre = dpre  # preprocessing result

        # model parameters:
        # number of topics K, number of iterations iter_times,
        # number of feature words per topic top_words_num,
        # hyperparameters alpha and beta
        self.K = K
        self.beta = beta
        self.alpha = alpha
        self.iter_times = iter_times
        self.top_words_num = top_words_num

        # file variables:
        # trainfile      segmented training file
        # wordidmapfile  word -> id mapping
        # thetafile      article-topic distribution
        # phifile        word-topic distribution
        # topNfile       top N words of each topic
        # tassginfile    final topic assignment of every word
        # paramfile      parameters used for model training
        self.wordidmapfile = wordidmapfile
        self.trainfile = trainfile
        self.thetafile = thetafile
        self.phifile = phifile
        self.topNfile = topNfile
        self.tassginfile = tassginfile
        self.paramfile = paramfile

        # p      temporary probability vector (double) used during sampling
        # nw     word counts on each topic
        # nwsum  total word count of each topic
        # nd     topic counts within each document
        # ndsum  total word count of each document
        self.p = np.zeros(self.K)
        self.nw = np.zeros((self.dpre.words_count, self.K), dtype="int")
        self.nwsum = np.zeros(self.K, dtype="int")
        self.nd = np.zeros((self.dpre.docs_count, self.K), dtype="int")
        self.ndsum = np.zeros(self.dpre.docs_count, dtype="int")
        # Z: M * doc.size(), the topic assigned to each word of each document
        self.Z = np.array([[0 for y in xrange(self.dpre.docs[x].length)]
                           for x in xrange(self.dpre.docs_count)])

        # random initial assignment
        for x in xrange(len(self.Z)):
            self.ndsum[x] = self.dpre.docs[x].length
            for y in xrange(self.dpre.docs[x].length):
                topic = random.randint(0, self.K - 1)
                self.Z[x][y] = topic
                self.nw[self.dpre.docs[x].words[y]][topic] += 1
                self.nd[x][topic] += 1
                self.nwsum[topic] += 1

        self.theta = np.array([[0.0 for y in xrange(self.K)]
                               for x in xrange(self.dpre.docs_count)])
        self.phi = np.array([[0.0 for y in xrange(self.dpre.words_count)]
                             for x in xrange(self.K)])
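The constructor assumes a preprocessing object dpre that exposes docs_count, words_count, and per-document words/length. The original project builds it from trainfile and wordidmapfile; purely as an illustration of the attributes the code touches, a minimal stand-in could look like this (the class names Document and DataPre are hypothetical, not the project's):

# Hypothetical sketch of the preprocessing container the constructor relies on;
# the real python-LDA project builds an equivalent object from its input files.
class Document(object):
    def __init__(self, word_ids):
        self.words = word_ids        # list of word ids in this document
        self.length = len(word_ids)  # number of words in this document

class DataPre(object):
    def __init__(self, docs, vocab_size):
        self.docs = docs               # list of Document objects
        self.docs_count = len(docs)    # total number of documents
        self.words_count = vocab_size  # vocabulary size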
  • Sampling process
# Pseudocode
# Input:  initialized lda_model, number of iterations iter_times,
#         hyperparameters alpha and beta, number of topics K
# Output: theta (article -> topic distribution probability),
#         phi (topic -> word distribution probability),
#         tassgin (topic assigned to each word of each article),
#         twords (top N high-frequency words of each topic)
begin
    for i in iterations:
        for m in number of articles:
            for v in words of article m:
                topic = Z[m][v]
                remove the word from the statistics: nw[v][topic], nwsum[topic], nd[m][topic] -= 1
                compute the probability vector p[]   # p[k]: probability that the word belongs to topic k
                for k in (1, number of topics - 1):
                    p[k] += p[k-1]                   # cumulative distribution
                randomly draw a new topic and record it; update the statistics:
                nw[v][new_topic], nwsum[new_topic], nd[m][new_topic] += 1
    # output the model once the iterations are complete
end
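For reference, the probability vector p[] built in the inner loop is the standard full conditional of collapsed Gibbs sampling for LDA. Writing W for the vocabulary size (words_count in the code below), and with all counts taken after the current word has been removed:

P(z_{m,n} = k \mid \mathbf{z}_{\neg(m,n)}, \mathbf{w}) \;\propto\; \frac{nw[w][k] + \beta}{nwsum[k] + W\beta} \cdot \frac{nd[m][k] + \alpha}{ndsum[m] + K\alpha}

The second denominator does not depend on k, so it cancels when p is normalized; the code keeps it anyway, which is harmless.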
# Code snippet
def sampling(self, i, j):
    # remove the current assignment of word j in document i from the statistics
    topic = self.Z[i][j]
    word = self.dpre.docs[i].words[j]
    self.nw[word][topic] -= 1
    self.nd[i][topic] -= 1
    self.nwsum[topic] -= 1
    self.ndsum[i] -= 1

    # full conditional probability of each topic for this word
    Vbeta = self.dpre.words_count * self.beta
    Kalpha = self.K * self.alpha
    self.p = (self.nw[word] + self.beta) / (self.nwsum + Vbeta) * \
             (self.nd[i] + self.alpha) / (self.ndsum[i] + Kalpha)

    # turn p into a cumulative distribution and draw a new topic
    for k in xrange(1, self.K):
        self.p[k] += self.p[k - 1]
    u = random.uniform(0, self.p[self.K - 1])
    for topic in xrange(self.K):
        if self.p[topic] > u:
            break

    # add the new assignment back into the statistics
    self.nw[word][topic] += 1
    self.nwsum[topic] += 1
    self.nd[i][topic] += 1
    self.ndsum[i] += 1
    return topic
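The snippet stops before theta and phi are computed. After the final sweep they are just the smoothed, normalized counts; a minimal sketch in the same style (the method names _theta and _phi are assumptions here, check the project for the actual ones):

def _theta(self):
    # article -> topic distribution: smoothed, normalized nd counts
    for m in xrange(self.dpre.docs_count):
        self.theta[m] = (self.nd[m] + self.alpha) / (self.ndsum[m] + self.K * self.alpha)

def _phi(self):
    # topic -> word distribution: smoothed, normalized nw counts
    for k in xrange(self.K):
        self.phi[k] = (self.nw.T[k] + self.beta) / (self.nwsum[k] + self.dpre.words_count * self.beta)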


This is the most basic implementation of the LDA model: the number of topics K and the hyperparameters must be set manually. A version that estimates them automatically is left for future study.
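To see the whole pipeline end to end without the project's file handling, here is a self-contained toy run of the same collapsed Gibbs sampler (Python 3 with NumPy; the corpus and every setting are made up purely for illustration):

# Toy demo of collapsed Gibbs sampling for LDA; all values are illustrative.
import numpy as np

docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 1, 4, 2]]  # word ids per document
V, K, alpha, beta, iter_times = 5, 2, 0.5, 0.1, 200
rng = np.random.default_rng(0)

nw = np.zeros((V, K), dtype=int)          # word-topic counts
nwsum = np.zeros(K, dtype=int)            # total words per topic
nd = np.zeros((len(docs), K), dtype=int)  # topic counts per document
Z = [[0] * len(d) for d in docs]          # topic of every word

# random initial assignment
for m, doc in enumerate(docs):
    for n, w in enumerate(doc):
        k = int(rng.integers(K))
        Z[m][n] = k
        nw[w, k] += 1; nd[m, k] += 1; nwsum[k] += 1

# Gibbs sweeps; the per-document denominator is constant in k,
# so it cancels when p is normalized and can be omitted
for _ in range(iter_times):
    for m, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = Z[m][n]
            nw[w, k] -= 1; nd[m, k] -= 1; nwsum[k] -= 1
            p = (nw[w] + beta) / (nwsum + V * beta) * (nd[m] + alpha)
            k = int(rng.choice(K, p=p / p.sum()))
            Z[m][n] = k
            nw[w, k] += 1; nd[m, k] += 1; nwsum[k] += 1

# final estimates: smoothed, normalized counts
theta = (nd + alpha) / (nd.sum(axis=1, keepdims=True) + K * alpha)
phi = (nw.T + beta) / (nwsum[:, None] + V * beta)
print(theta.round(2))
print(phi.round(2))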
