Python Implementation of the LDA Model


  • LDA (Latent Dirichlet Allocation) is a generative topic model for documents. I recently read a few papers on it and wanted to implement it in Python. There is plenty of material on the mathematical model itself; the "lda algorithm roaming guide" is a very detailed reference.
  • This post only describes the Python implementation of the algorithm's sampling step.
  • The code is open source: python-LDA


  • LDA model variable allocation and initialization
# Pseudocode
# Input:  article set (already word-segmented), K (number of topics)
# Output: an LDA model with one round of random topic assignment
begin
    # allocate the statistics
    p      probability vector                          dimension: K
    nw     word counts on each topic                   dimension: M * K  (M = total number of distinct words in the corpus)
    nwsum  total word count of each topic              dimension: K
    nd     topic counts within each article            dimension: V * K  (V = total number of articles)
    ndsum  total word count of each article            dimension: V
    z      topic assigned to each word                 dimension: V * (number of words in each article)
    theta  article -> topic probability distribution   dimension: V * K
    phi    topic -> word probability distribution      dimension: K * M

    # random initial assignment
    for x in number of articles:
        record ndsum[article id] = number of words in the article
        for y in number of words in the article:
            randomly assign a topic to the word
            word count of that topic (nw)            += 1
            count of that topic in this article (nd) += 1
            total word count of that topic (nwsum)   += 1
end
# Code snippets; for full details, refer to the GitHub project
class LDAModel(object):

    def __init__(self, dpre):
        self.dpre = dpre  # preprocessing result

        # model parameters:
        # number of topics K, number of iterations iter_times,
        # number of feature words per topic top_words_num,
        # hyperparameters alpha and beta
        self.K = K
        self.beta = beta
        self.alpha = alpha
        self.iter_times = iter_times
        self.top_words_num = top_words_num

        # file variables:
        # trainfile      segmented training file
        # wordidmapfile  word -> id mapping
        # thetafile      article-topic distribution
        # phifile        word-topic distribution
        # topNfile       top N words of each topic
        # tassginfile    final topic assignment of every word
        # paramfile      parameters used for model training
        self.wordidmapfile = wordidmapfile
        self.trainfile = trainfile
        self.thetafile = thetafile
        self.phifile = phifile
        self.topNfile = topNfile
        self.tassginfile = tassginfile
        self.paramfile = paramfile

        # p      temporary probability vector (double) used during sampling
        # nw     word counts on each topic
        # nwsum  total word count of each topic
        # nd     topic counts within each document
        # ndsum  total word count of each document
        self.p = np.zeros(self.K)
        self.nw = np.zeros((self.dpre.words_count, self.K), dtype="int")
        self.nwsum = np.zeros(self.K, dtype="int")
        self.nd = np.zeros((self.dpre.docs_count, self.K), dtype="int")
        self.ndsum = np.zeros(self.dpre.docs_count, dtype="int")
        # Z: M * doc.size(), the topic assigned to each word of each document
        self.Z = np.array([[0 for y in xrange(self.dpre.docs[x].length)]
                           for x in xrange(self.dpre.docs_count)])

        # random initial assignment
        for x in xrange(len(self.Z)):
            self.ndsum[x] = self.dpre.docs[x].length
            for y in xrange(self.dpre.docs[x].length):
                topic = random.randint(0, self.K - 1)
                self.Z[x][y] = topic
                self.nw[self.dpre.docs[x].words[y]][topic] += 1
                self.nd[x][topic] += 1
                self.nwsum[topic] += 1

        self.theta = np.array([[0.0 for y in xrange(self.K)]
                               for x in xrange(self.dpre.docs_count)])
        self.phi = np.array([[0.0 for y in xrange(self.dpre.words_count)]
                             for x in xrange(self.K)])
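The constructor assumes a preprocessing object dpre that exposes docs_count, words_count, and per-document words/length. The original project builds it from trainfile and wordidmapfile; purely as an illustration of the attributes the code touches, a minimal stand-in could look like this (the class names Document and DataPre are hypothetical, not the project's):

# Hypothetical sketch of the preprocessing container the constructor relies on;
# the real python-LDA project builds an equivalent object from its input files.
class Document(object):
    def __init__(self, word_ids):
        self.words = word_ids        # list of word ids in this document
        self.length = len(word_ids)  # number of words in this document

class DataPre(object):
    def __init__(self, docs, vocab_size):
        self.docs = docs               # list of Document objects
        self.docs_count = len(docs)    # total number of documents
        self.words_count = vocab_size  # vocabulary size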
  • Sampling process
# Pseudocode
# Input:  initialized lda_model, number of iterations iter_times,
#         hyperparameters alpha and beta, number of topics K
# Output: theta (article -> topic distribution probability),
#         phi (topic -> word distribution probability),
#         tassgin (topic assigned to each word of each article),
#         twords (top N high-frequency words of each topic)
begin
    for i in iterations:
        for m in number of articles:
            for v in words of article m:
                topic = Z[m][v]
                remove the word from the statistics: nw[v][topic], nwsum[topic], nd[m][topic] -= 1
                compute the probability vector p[]   # p[k]: probability that the word belongs to topic k
                for k in (1, number of topics - 1):
                    p[k] += p[k-1]                   # cumulative distribution
                randomly draw a new topic and record it; update the statistics:
                nw[v][new_topic], nwsum[new_topic], nd[m][new_topic] += 1
    # output the model once the iterations are complete
end
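For reference, the probability vector p[] built in the inner loop is the standard full conditional of collapsed Gibbs sampling for LDA. Writing W for the vocabulary size (words_count in the code below), and with all counts taken after the current word has been removed:

P(z_{m,n} = k \mid \mathbf{z}_{\neg(m,n)}, \mathbf{w}) \;\propto\; \frac{nw[w][k] + \beta}{nwsum[k] + W\beta} \cdot \frac{nd[m][k] + \alpha}{ndsum[m] + K\alpha}

The second denominator does not depend on k, so it cancels when p is normalized; the code keeps it anyway, which is harmless.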
# Code snippet
def sampling(self, i, j):
    # remove the current assignment of word j in document i from the statistics
    topic = self.Z[i][j]
    word = self.dpre.docs[i].words[j]
    self.nw[word][topic] -= 1
    self.nd[i][topic] -= 1
    self.nwsum[topic] -= 1
    self.ndsum[i] -= 1

    # full conditional probability of each topic for this word
    Vbeta = self.dpre.words_count * self.beta
    Kalpha = self.K * self.alpha
    self.p = (self.nw[word] + self.beta) / (self.nwsum + Vbeta) * \
             (self.nd[i] + self.alpha) / (self.ndsum[i] + Kalpha)

    # turn p into a cumulative distribution and draw a new topic
    for k in xrange(1, self.K):
        self.p[k] += self.p[k - 1]
    u = random.uniform(0, self.p[self.K - 1])
    for topic in xrange(self.K):
        if self.p[topic] > u:
            break

    # add the new assignment back into the statistics
    self.nw[word][topic] += 1
    self.nwsum[topic] += 1
    self.nd[i][topic] += 1
    self.ndsum[i] += 1
    return topic
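The snippet stops before theta and phi are computed. After the final sweep they are just the smoothed, normalized counts; a minimal sketch in the same style (the method names _theta and _phi are assumptions here, check the project for the actual ones):

def _theta(self):
    # article -> topic distribution: smoothed, normalized nd counts
    for m in xrange(self.dpre.docs_count):
        self.theta[m] = (self.nd[m] + self.alpha) / (self.ndsum[m] + self.K * self.alpha)

def _phi(self):
    # topic -> word distribution: smoothed, normalized nw counts
    for k in xrange(self.K):
        self.phi[k] = (self.nw.T[k] + self.beta) / (self.nwsum[k] + self.dpre.words_count * self.beta)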


This is the most basic implementation of the LDA model: the number of topics K and the hyperparameters must be set manually. A version that estimates them automatically is left for future study.
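To see the whole pipeline end to end without the project's file handling, here is a self-contained toy run of the same collapsed Gibbs sampler (Python 3 with NumPy; the corpus and every setting are made up purely for illustration):

# Toy demo of collapsed Gibbs sampling for LDA; all values are illustrative.
import numpy as np

docs = [[0, 1, 2, 1], [2, 3, 3, 4], [0, 1, 4, 2]]  # word ids per document
V, K, alpha, beta, iter_times = 5, 2, 0.5, 0.1, 200
rng = np.random.default_rng(0)

nw = np.zeros((V, K), dtype=int)          # word-topic counts
nwsum = np.zeros(K, dtype=int)            # total words per topic
nd = np.zeros((len(docs), K), dtype=int)  # topic counts per document
Z = [[0] * len(d) for d in docs]          # topic of every word

# random initial assignment
for m, doc in enumerate(docs):
    for n, w in enumerate(doc):
        k = int(rng.integers(K))
        Z[m][n] = k
        nw[w, k] += 1; nd[m, k] += 1; nwsum[k] += 1

# Gibbs sweeps; the per-document denominator is constant in k,
# so it cancels when p is normalized and can be omitted
for _ in range(iter_times):
    for m, doc in enumerate(docs):
        for n, w in enumerate(doc):
            k = Z[m][n]
            nw[w, k] -= 1; nd[m, k] -= 1; nwsum[k] -= 1
            p = (nw[w] + beta) / (nwsum + V * beta) * (nd[m] + alpha)
            k = int(rng.choice(K, p=p / p.sum()))
            Z[m][n] = k
            nw[w, k] += 1; nd[m, k] += 1; nwsum[k] += 1

# final estimates: smoothed, normalized counts
theta = (nd + alpha) / (nd.sum(axis=1, keepdims=True) + K * alpha)
phi = (nw.T + beta) / (nwsum[:, None] + V * beta)
print(theta.round(2))
print(phi.round(2))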
