Implementation of the LDA model in Python


    • LDA (Latent Dirichlet Allocation) is a generative topic model for documents. Having recently read up on it, I set out to implement it in Python. There is plenty of material on the underlying mathematics; a very detailed document I referred to previously is the "LDA algorithm roaming guide".
    • This post covers only the Python implementation of the Gibbs sampling algorithm.
    • The full implementation is available in the open-source project Python-lda.

    • Allocating and initializing the LDA model variables
## Pseudo code
# Input: document collection (after word segmentation), K (number of topics)
# Output: an LDA model with one random initial topic assignment
begin
    allocate the following statistics:
        p      sampling probability vector                    dimension: K
        nw     word-on-topic count distribution               dimension: M*K   (M is the total number of words in the collection)
        nwsum  total number of words assigned to each topic   dimension: K
        nd     topic count distribution within each document  dimension: V*K   (V is the total number of documents)
        ndsum  total number of words in each document         dimension: V
        Z      topic assigned to each word                    dimension: V * (number of words per document)
        theta  document-to-topic probability distribution     dimension: V*K
        phi    topic-to-word probability distribution         dimension: K*M

    # randomly assign an initial topic to every word
    for x in number of documents:
        record ndsum[document id] = number of words in this document
        for y in number of words in this document:
            randomly assign a topic to the word
            count of this word under this topic   +1
            count of this topic in this document  +1
            total word count of this topic        +1
end
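The pseudo code maps directly onto a handful of NumPy arrays. Below is a minimal, self-contained sketch of the random initialization, separate from the project's own classes; the toy documents, the words_count value and the variable names are assumptions made up only for this illustration:

import numpy as np
import random

# Toy inputs made up for this sketch: docs is a list of documents,
# each a list of integer word ids; words_count is the vocabulary size.
docs = [[0, 1, 2, 1], [2, 3, 0], [1, 3, 3, 2]]
words_count = 4
K = 2                                              # number of topics

nw = np.zeros((words_count, K), dtype="int")       # word-on-topic counts (M*K)
nwsum = np.zeros(K, dtype="int")                   # total words per topic (K)
nd = np.zeros((len(docs), K), dtype="int")         # per-document topic counts (V*K)
ndsum = np.zeros(len(docs), dtype="int")           # total words per document (V)
Z = [[0] * len(doc) for doc in docs]               # topic assigned to every word

for m, doc in enumerate(docs):
    ndsum[m] = len(doc)
    for n, w in enumerate(doc):
        topic = random.randint(0, K - 1)           # random initial topic
        Z[m][n] = topic
        nw[w][topic] += 1
        nd[m][topic] += 1
        nwsum[topic] += 1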
## Implementation code snippet; see the GitHub project for more detail
class LDAModel(object):

    def __init__(self, dpre):
        self.dpre = dpre  # preprocessing results
        #
        # model parameters
        # number of topics K, number of iterations iter_times,
        # number of feature words per topic top_words_num, hyperparameters alpha and beta
        #
        self.K = K
        self.beta = beta
        self.alpha = alpha
        self.iter_times = iter_times
        self.top_words_num = top_words_num
        #
        # file variables
        # segmented training corpus            trainfile
        # word-to-id mapping file              wordidmapfile
        # document-topic distribution file     thetafile
        # word-topic distribution file         phifile
        # top-N words per topic file           topNfile
        # final topic assignment result file   tassginfile
        # chosen training parameters file      paramfile
        #
        self.wordidmapfile = wordidmapfile
        self.trainfile = trainfile
        self.thetafile = thetafile
        self.phifile = phifile
        self.topNfile = topNfile
        self.tassginfile = tassginfile
        self.paramfile = paramfile
        #
        # p, double-typed probability vector, temporary variable for sampling
        # nw, distribution of each word over topics
        # nwsum, total number of words per topic
        # nd, number of words of each topic in each document
        # ndsum, total number of words in each document
        #
        self.p = np.zeros(self.K)
        self.nw = np.zeros((self.dpre.words_count, self.K), dtype="int")
        self.nwsum = np.zeros(self.K, dtype="int")
        self.nd = np.zeros((self.dpre.docs_count, self.K), dtype="int")
        self.ndsum = np.zeros(dpre.docs_count, dtype="int")
        # M*doc.size(), topic assignment of every word in every document
        self.Z = np.array([[0 for y in xrange(dpre.docs[x].length)] for x in xrange(dpre.docs_count)])

        # randomly assign an initial topic to every word
        for x in xrange(len(self.Z)):
            self.ndsum[x] = self.dpre.docs[x].length
            for y in xrange(self.dpre.docs[x].length):
                topic = random.randint(0, self.K - 1)
                self.Z[x][y] = topic
                self.nw[self.dpre.docs[x].words[y]][topic] += 1
                self.nd[x][topic] += 1
                self.nwsum[topic] += 1

        self.theta = np.array([[0.0 for y in xrange(self.K)] for x in xrange(self.dpre.docs_count)])
        self.phi = np.array([[0.0 for y in xrange(self.dpre.words_count)] for x in xrange(self.K)])
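The constructor above expects a preprocessing object dpre exposing docs_count, words_count and, for each document, length and words (lists of word ids). As a rough, hypothetical sketch of what such an object could look like (Document and DataPre here are made-up stand-ins for illustration, not the actual Python-lda preprocessing classes):

class Document(object):
    """Hypothetical stand-in for one preprocessed document."""
    def __init__(self, word_ids):
        self.words = word_ids          # list of integer word ids
        self.length = len(word_ids)

class DataPre(object):
    """Hypothetical stand-in for the dpre preprocessing result."""
    def __init__(self, tokenized_docs):
        vocab = {}
        self.docs = []
        for tokens in tokenized_docs:
            ids = [vocab.setdefault(t, len(vocab)) for t in tokens]
            self.docs.append(Document(ids))
        self.docs_count = len(self.docs)     # number of documents
        self.words_count = len(vocab)        # vocabulary size
        self.word2id = vocab

# Example usage with already-segmented documents:
dpre = DataPre([["topic", "model", "lda"], ["python", "lda", "gibbs"]])
print(dpre.docs_count, dpre.words_count)     # 2 5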
    • The sampling process
## Pseudo code
# Input: the initialized lda_model, number of iterations iter_times, hyperparameters alpha and beta, number of topics K
# Output: theta (document-to-topic distribution), phi (topic-to-word distribution),
#         tassgin (the topic assigned to each word of each document), twords (top-N high-frequency words per topic)
begin
    for i in number of iterations:
        for m in number of documents:
            for v in words of this document:
                take topic = Z[m][v]
                subtract 1 from the counts nw[v][topic], nwsum[topic], nd[m][topic]
                compute the probability vector p[]    # p[k] is the probability that this word belongs to topic k
                for k in (1, number of topics - 1):
                    p[k] += p[k-1]                    # cumulative probabilities for sampling
                draw a new_topic at random according to p[] and add 1 to the counts
                nw[v][new_topic], nwsum[new_topic], nd[m][new_topic]
    # after the iterations are complete
    output the model
end
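The probability vector p[] in the pseudo code is the standard collapsed Gibbs full conditional for LDA. In terms of the counters above, with the current word removed from all counts (which is why the code below decrements them first) and words_count denoting the vocabulary size, it reads:

    p(z_{m,v} = k \mid z_{\neg(m,v)}, w) \;\propto\; \frac{nw[w][k] + \beta}{nwsum[k] + \text{words\_count}\cdot\beta} \cdot \frac{nd[m][k] + \alpha}{ndsum[m] + K\alpha}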
## Code snippet
    def sampling(self, i, j):
        topic = self.Z[i][j]
        word = self.dpre.docs[i].words[j]
        # remove the current word from all counts
        self.nw[word][topic] -= 1
        self.nd[i][topic] -= 1
        self.nwsum[topic] -= 1
        self.ndsum[i] -= 1

        Vbeta = self.dpre.words_count * self.beta
        Kalpha = self.K * self.alpha
        # full conditional probability of each topic for this word
        self.p = (self.nw[word] + self.beta) / (self.nwsum + Vbeta) * \
                 (self.nd[i] + self.alpha) / (self.ndsum[i] + Kalpha)
        # cumulative probabilities, then draw the new topic
        for k in xrange(1, self.K):
            self.p[k] += self.p[k - 1]

        u = random.uniform(0, self.p[self.K - 1])
        for topic in xrange(self.K):
            if self.p[topic] > u:
                break

        # add the word back under its newly sampled topic
        self.nw[word][topic] += 1
        self.nwsum[topic] += 1
        self.nd[i][topic] += 1
        self.ndsum[i] += 1
        return topic
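To tie the pieces together, here is a rough sketch of a driver that runs sampling over every word for iter_times sweeps and then converts the counts into theta and phi. The method names est, _theta and _phi are assumptions made for this sketch; check the GitHub project for the actual training code.

    # Illustrative driver methods for the class above (names est/_theta/_phi are assumptions):
    def est(self):
        for x in range(self.iter_times):          # Gibbs sweeps over the whole corpus
            for i in range(self.dpre.docs_count):
                for j in range(self.dpre.docs[i].length):
                    self.Z[i][j] = self.sampling(i, j)
        self._theta()
        self._phi()

    def _theta(self):
        # document-to-topic distribution: (nd + alpha) normalised per document
        for i in range(self.dpre.docs_count):
            self.theta[i] = (self.nd[i] + self.alpha) / \
                            (self.ndsum[i] + self.K * self.alpha)

    def _phi(self):
        # topic-to-word distribution: (nw + beta) normalised per topic
        for k in range(self.K):
            self.phi[k] = (self.nw[:, k] + self.beta) / \
                          (self.nwsum[k] + self.dpre.words_count * self.beta)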

This is the most basic implementation of the LDA model: the number of topics K and the hyperparameters must be supplied manually. A version that estimates them automatically will be studied later.
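Until then the parameters have to be chosen by hand. A frequently quoted heuristic (Griffiths and Steyvers) is alpha = 50/K with a small symmetric beta around 0.01-0.1; the concrete values below are only an illustration, not defaults of the Python-lda project.

# Illustrative manual settings, not project defaults:
K = 10                    # number of topics
alpha = 50.0 / K          # common heuristic: alpha = 50 / K
beta = 0.1                # small symmetric beta, typically 0.01-0.1
iter_times = 1000         # number of Gibbs sweeps
top_words_num = 20        # feature words listed per topic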
