- LDA (Latent Dirichlet Allocation) is a generative topic model for document collections, and it is straightforward to implement in Python once the data is prepared. Its mathematical background has been written up in detail elsewhere; the previously referenced "LDA algorithm roaming guide" gives a very thorough treatment.
- This post covers only the Python implementation of the (Gibbs) sampling algorithm.
- The full implementation is open-sourced as the Python-lda project.
Allocation and initialization of the LDA model variables
## Pseudo code

```
Input:  the document collection (already word-segmented), K (the number of topics)
Output: an LDA model with one random topic assignment completed
begin
    allocate the count statistics:
        p      temporary probability vector                dimension: K
        nw     word-topic count matrix                     dimension: V*K  (V = vocabulary size)
        nwsum  total number of words in each topic         dimension: K
        nd     per-document word count for each topic      dimension: M*K  (M = number of documents)
        ndsum  total number of words in each document      dimension: M
        Z      topic assigned to each word                 dimension: M * (words per document)
        theta  document-topic probability distribution     dimension: M*K
        phi    topic-word probability distribution         dimension: K*V
    # randomly assign an initial topic
    for x in number of documents:
        record ndsum[x], the number of words in document x
        for y in number of words in document x:
            assign a random topic to this word
            count of this word under that topic (nw)          += 1
            count of that topic in this document (nd)         += 1
            total number of words under that topic (nwsum)    += 1
end
```
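The initialization step above can be sketched as a small, self-contained snippet. This is a Python 3 illustration, not the project's code: the toy corpus, topic count, and variable names below are made up for demonstration, while the project itself works on its own preprocessed document objects.

```python
import numpy as np

# Toy corpus: each document is a list of word ids (vocabulary size V = 5).
# These documents are invented for illustration only.
docs = [[0, 1, 2, 1], [2, 3, 4], [0, 0, 3, 4, 4]]
K = 2                                   # number of topics
V = 5                                   # vocabulary size
M = len(docs)                           # number of documents

rng = np.random.default_rng(42)

nw = np.zeros((V, K), dtype=int)        # word-topic counts
nwsum = np.zeros(K, dtype=int)          # total words per topic
nd = np.zeros((M, K), dtype=int)        # doc-topic counts
ndsum = np.zeros(M, dtype=int)          # total words per document
Z = []                                  # topic assignment of every word

for m, doc in enumerate(docs):
    ndsum[m] = len(doc)
    z_m = []
    for w in doc:
        topic = int(rng.integers(K))    # random initial topic
        z_m.append(topic)
        nw[w][topic] += 1
        nd[m][topic] += 1
        nwsum[topic] += 1
    Z.append(z_m)

# Sanity check: every count table accounts for every word exactly once.
assert nwsum.sum() == ndsum.sum() == nd.sum() == sum(len(d) for d in docs)
```

Whatever the random assignment turns out to be, the four count tables stay consistent with each other, which is exactly the invariant the sampler relies on later.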
## Implementation code snippet (see the GitHub project for full details)

```python
import random
import numpy as np


class LDAModel(object):

    def __init__(self, dpre):
        self.dpre = dpre                      # preprocessing results

        # Model parameters:
        #   number of topics K, number of iterations iter_times,
        #   number of feature words per topic top_words_num,
        #   hyperparameters alpha and beta
        self.K = K
        self.beta = beta
        self.alpha = alpha
        self.iter_times = iter_times
        self.top_words_num = top_words_num

        # File variables:
        #   word-segmented training file        trainfile
        #   word-to-id mapping file             wordidmapfile
        #   document-topic distribution file    thetafile
        #   topic-word distribution file        phifile
        #   top-N words per topic file          topNfile
        #   final topic-assignment file         tassginfile
        #   training-parameter file             paramfile
        self.wordidmapfile = wordidmapfile
        self.trainfile = trainfile
        self.thetafile = thetafile
        self.phifile = phifile
        self.topNfile = topNfile
        self.tassginfile = tassginfile
        self.paramfile = paramfile

        # p      temporary probability vector used during sampling (double)
        # nw     word-topic count matrix
        # nwsum  total number of words in each topic
        # nd     per-document word count for each topic
        # ndsum  total number of words in each document
        self.p = np.zeros(self.K)
        self.nw = np.zeros((self.dpre.words_count, self.K), dtype="int")
        self.nwsum = np.zeros(self.K, dtype="int")
        self.nd = np.zeros((self.dpre.docs_count, self.K), dtype="int")
        self.ndsum = np.zeros(dpre.docs_count, dtype="int")
        # Z[x][y]: topic assigned to word y of document x
        self.Z = np.array([[0 for y in xrange(dpre.docs[x].length)]
                           for x in xrange(dpre.docs_count)])

        # randomly assign an initial topic to every word
        for x in xrange(len(self.Z)):
            self.ndsum[x] = self.dpre.docs[x].length
            for y in xrange(self.dpre.docs[x].length):
                topic = random.randint(0, self.K - 1)
                self.Z[x][y] = topic
                self.nw[self.dpre.docs[x].words[y]][topic] += 1
                self.nd[x][topic] += 1
                self.nwsum[topic] += 1

        self.theta = np.array([[0.0 for y in xrange(self.K)]
                               for x in xrange(self.dpre.docs_count)])
        self.phi = np.array([[0.0 for y in xrange(self.dpre.words_count)]
                             for x in xrange(self.K)])
```
- The Gibbs sampling process
## Pseudo code

```
Input:  the initialized lda_model, number of iterations iter_times,
        hyperparameters alpha and beta, number of topics K
Output: theta (document-topic distribution), phi (topic-word distribution),
        tassgin (final topic assignment of every word in every document),
        twords (top-N high-frequency words of each topic)
begin
    for i in number of iterations:
        for m in number of documents:
            for v in words of document m:
                take topic = Z[m][v]
                decrement the counts nw[v][topic], nwsum[topic], nd[m][topic] by 1
                compute the probabilities p[]   # p[k]: probability that this word belongs to topic k
                for k in (1, number of topics - 1):
                    p[k] += p[k-1]              # build the cumulative distribution
                draw a new topic at random from p[], record it, and increment
                nw[v][new_topic], nwsum[new_topic], nd[m][new_topic] by 1
    # after the iterations complete, output the model
end
```
## Code snippet

```python
def sampling(self, i, j):
    # remove word j of document i from the count statistics
    topic = self.Z[i][j]
    word = self.dpre.docs[i].words[j]
    self.nw[word][topic] -= 1
    self.nd[i][topic] -= 1
    self.nwsum[topic] -= 1
    self.ndsum[i] -= 1

    # full conditional probability of each topic for this word
    Vbeta = self.dpre.words_count * self.beta
    Kalpha = self.K * self.alpha
    self.p = (self.nw[word] + self.beta) / (self.nwsum + Vbeta) * \
             (self.nd[i] + self.alpha) / (self.ndsum[i] + Kalpha)

    # turn p into a cumulative distribution and draw a new topic from it
    for k in xrange(1, self.K):
        self.p[k] += self.p[k - 1]
    u = random.uniform(0, self.p[self.K - 1])
    for topic in xrange(self.K):
        if self.p[topic] > u:
            break

    # add the word back under its new topic
    self.nw[word][topic] += 1
    self.nwsum[topic] += 1
    self.nd[i][topic] += 1
    self.ndsum[i] += 1
    return topic
```
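The loop that accumulates `p[k] += p[k-1]` and then draws a uniform number is inverse-CDF sampling from an unnormalized discrete distribution. A standalone Python 3 illustration of the same trick (the function name and the test weights here are assumptions for demonstration, not part of the project):

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_index(p, rng):
    """Draw an index from unnormalized probabilities p, the same way
    sampling() does with its in-place p[k] += p[k-1] loop."""
    cum = np.cumsum(p)              # cumulative (unnormalized) distribution
    u = rng.uniform(0, cum[-1])     # uniform draw over the total mass
    # first index whose cumulative mass exceeds u
    return int(np.searchsorted(cum, u, side="right"))

# Empirical check: with weights [1, 3], index 1 should come up ~75% of the time.
draws = [draw_index([1.0, 3.0], rng) for _ in range(10000)]
frac = sum(draws) / len(draws)
```

Because the draw uses the unnormalized cumulative mass, the probabilities never need to be explicitly normalized, which is exactly why `sampling()` can skip a division by `sum(p)`.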
This is the most basic implementation of the LDA model: the number of topics K and the hyperparameters must be supplied by hand. A version that estimates them automatically is left for later study.
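The output distributions `theta` and `phi` named in the pseudocode are filled in from the count tables after the final iteration. A hedged sketch of that smoothing step, using the same array shapes as the `__init__` above but with stand-in values (the project computes this in its own helper methods, which are not shown in this post):

```python
import numpy as np

# Assumed shapes, matching the count arrays defined earlier:
K, V, M = 2, 5, 3                        # topics, vocabulary size, documents
alpha, beta = 0.5, 0.1                   # hyperparameters
rng = np.random.default_rng(1)
nd = rng.integers(0, 10, size=(M, K))    # doc-topic counts (stand-in values)
nw = rng.integers(0, 10, size=(V, K))    # word-topic counts (stand-in values)
ndsum = nd.sum(axis=1)                   # words per document
nwsum = nw.sum(axis=0)                   # words per topic

# Dirichlet-smoothed estimates of the two output distributions:
theta = (nd + alpha) / (ndsum[:, None] + K * alpha)   # M x K, each row sums to 1
phi = (nw.T + beta) / (nwsum[:, None] + V * beta)     # K x V, each row sums to 1

assert np.allclose(theta.sum(axis=1), 1.0)
assert np.allclose(phi.sum(axis=1), 1.0)
```

The `alpha` and `beta` terms are the same smoothing that appears in the numerator and denominator of the conditional probability inside `sampling()`, so the estimates stay consistent with the sampler.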