- LDA (Latent Dirichlet Allocation) is a generative topic model for document collections, and it is straightforward to implement in Python once the data is prepared. Its mathematical background has been written up in detail elsewhere; the previously referenced "LDA algorithm roaming guide" gives a very thorough treatment.
- This post covers only the Python implementation of the (Gibbs) sampling algorithm.
- The full implementation is open-sourced as the Python-lda project.
Allocation and initialization of the LDA model variables
## Pseudo code

```
Input:  the document collection (already word-segmented), K (the number of topics)
Output: an LDA model with one random topic assignment completed
begin
    allocate the count statistics:
        p      temporary probability vector                dimension: K
        nw     word-topic count matrix                     dimension: V*K  (V = vocabulary size)
        nwsum  total number of words in each topic         dimension: K
        nd     per-document word count for each topic      dimension: M*K  (M = number of documents)
        ndsum  total number of words in each document      dimension: M
        Z      topic assigned to each word                 dimension: M * (words per document)
        theta  document-topic probability distribution     dimension: M*K
        phi    topic-word probability distribution         dimension: K*V
    # randomly assign an initial topic
    for x in number of documents:
        record ndsum[x], the number of words in document x
        for y in number of words in document x:
            assign a random topic to this word
            count of this word under that topic (nw)          += 1
            count of that topic in this document (nd)         += 1
            total number of words under that topic (nwsum)    += 1
end
```
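The initialization step above can be sketched as a small, self-contained snippet. This is a Python 3 illustration, not the project's code: the toy corpus, topic count, and variable names below are made up for demonstration, while the project itself works on its own preprocessed document objects.

```python
import numpy as np

# Toy corpus: each document is a list of word ids (vocabulary size V = 5).
# These documents are invented for illustration only.
docs = [[0, 1, 2, 1], [2, 3, 4], [0, 0, 3, 4, 4]]
K = 2                                   # number of topics
V = 5                                   # vocabulary size
M = len(docs)                           # number of documents

rng = np.random.default_rng(42)

nw = np.zeros((V, K), dtype=int)        # word-topic counts
nwsum = np.zeros(K, dtype=int)          # total words per topic
nd = np.zeros((M, K), dtype=int)        # doc-topic counts
ndsum = np.zeros(M, dtype=int)          # total words per document
Z = []                                  # topic assignment of every word

for m, doc in enumerate(docs):
    ndsum[m] = len(doc)
    z_m = []
    for w in doc:
        topic = int(rng.integers(K))    # random initial topic
        z_m.append(topic)
        nw[w][topic] += 1
        nd[m][topic] += 1
        nwsum[topic] += 1
    Z.append(z_m)

# Sanity check: every count table accounts for every word exactly once.
assert nwsum.sum() == ndsum.sum() == nd.sum() == sum(len(d) for d in docs)
```

Whatever the random assignment turns out to be, the four count tables stay consistent with each other, which is exactly the invariant the sampler relies on later.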
## Implementation code snippet (see the GitHub project for full details)

```python
import random
import numpy as np


class LDAModel(object):

    def __init__(self, dpre):
        self.dpre = dpre                      # preprocessing results

        # Model parameters:
        #   number of topics K, number of iterations iter_times,
        #   number of feature words per topic top_words_num,
        #   hyperparameters alpha and beta
        self.K = K
        self.beta = beta
        self.alpha = alpha
        self.iter_times = iter_times
        self.top_words_num = top_words_num

        # File variables:
        #   word-segmented training file        trainfile
        #   word-to-id mapping file             wordidmapfile
        #   document-topic distribution file    thetafile
        #   topic-word distribution file        phifile
        #   top-N words per topic file          topNfile
        #   final topic-assignment file         tassginfile
        #   training-parameter file             paramfile
        self.wordidmapfile = wordidmapfile
        self.trainfile = trainfile
        self.thetafile = thetafile
        self.phifile = phifile
        self.topNfile = topNfile
        self.tassginfile = tassginfile
        self.paramfile = paramfile

        # p      temporary probability vector used during sampling (double)
        # nw     word-topic count matrix
        # nwsum  total number of words in each topic
        # nd     per-document word count for each topic
        # ndsum  total number of words in each document
        self.p = np.zeros(self.K)
        self.nw = np.zeros((self.dpre.words_count, self.K), dtype="int")
        self.nwsum = np.zeros(self.K, dtype="int")
        self.nd = np.zeros((self.dpre.docs_count, self.K), dtype="int")
        self.ndsum = np.zeros(dpre.docs_count, dtype="int")
        # Z[x][y]: topic assigned to word y of document x
        self.Z = np.array([[0 for y in xrange(dpre.docs[x].length)]
                           for x in xrange(dpre.docs_count)])

        # randomly assign an initial topic to every word
        for x in xrange(len(self.Z)):
            self.ndsum[x] = self.dpre.docs[x].length
            for y in xrange(self.dpre.docs[x].length):
                topic = random.randint(0, self.K - 1)
                self.Z[x][y] = topic
                self.nw[self.dpre.docs[x].words[y]][topic] += 1
                self.nd[x][topic] += 1
                self.nwsum[topic] += 1

        self.theta = np.array([[0.0 for y in xrange(self.K)]
                               for x in xrange(self.dpre.docs_count)])
        self.phi = np.array([[0.0 for y in xrange(self.dpre.words_count)]
                             for x in xrange(self.K)])
```
- The Gibbs sampling process
## Pseudo code

```
Input:  the initialized lda_model, number of iterations iter_times,
        hyperparameters alpha and beta, number of topics K
Output: theta (document-topic distribution), phi (topic-word distribution),
        tassgin (final topic assignment of every word in every document),
        twords (top-N high-frequency words of each topic)
begin
    for i in number of iterations:
        for m in number of documents:
            for v in words of document m:
                take topic = Z[m][v]
                decrement the counts nw[v][topic], nwsum[topic], nd[m][topic] by 1
                compute the probabilities p[]   # p[k]: probability that this word belongs to topic k
                for k in (1, number of topics - 1):
                    p[k] += p[k-1]              # build the cumulative distribution
                draw a new topic at random from p[], record it, and increment
                nw[v][new_topic], nwsum[new_topic], nd[m][new_topic] by 1
    # after the iterations complete, output the model
end
```
## Code snippet

```python
def sampling(self, i, j):
    # remove word j of document i from the count statistics
    topic = self.Z[i][j]
    word = self.dpre.docs[i].words[j]
    self.nw[word][topic] -= 1
    self.nd[i][topic] -= 1
    self.nwsum[topic] -= 1
    self.ndsum[i] -= 1

    # full conditional probability of each topic for this word
    Vbeta = self.dpre.words_count * self.beta
    Kalpha = self.K * self.alpha
    self.p = (self.nw[word] + self.beta) / (self.nwsum + Vbeta) * \
             (self.nd[i] + self.alpha) / (self.ndsum[i] + Kalpha)

    # turn p into a cumulative distribution and draw a new topic from it
    for k in xrange(1, self.K):
        self.p[k] += self.p[k - 1]
    u = random.uniform(0, self.p[self.K - 1])
    for topic in xrange(self.K):
        if self.p[topic] > u:
            break

    # add the word back under its new topic
    self.nw[word][topic] += 1
    self.nwsum[topic] += 1
    self.nd[i][topic] += 1
    self.ndsum[i] += 1
    return topic
```
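The loop that accumulates `p[k] += p[k-1]` and then draws a uniform number is inverse-CDF sampling from an unnormalized discrete distribution. A standalone Python 3 illustration of the same trick (the function name and the test weights here are assumptions for demonstration, not part of the project):

```python
import numpy as np

rng = np.random.default_rng(0)

def draw_index(p, rng):
    """Draw an index from unnormalized probabilities p, the same way
    sampling() does with its in-place p[k] += p[k-1] loop."""
    cum = np.cumsum(p)              # cumulative (unnormalized) distribution
    u = rng.uniform(0, cum[-1])     # uniform draw over the total mass
    # first index whose cumulative mass exceeds u
    return int(np.searchsorted(cum, u, side="right"))

# Empirical check: with weights [1, 3], index 1 should come up ~75% of the time.
draws = [draw_index([1.0, 3.0], rng) for _ in range(10000)]
frac = sum(draws) / len(draws)
```

Because the draw uses the unnormalized cumulative mass, the probabilities never need to be explicitly normalized, which is exactly why `sampling()` can skip a division by `sum(p)`.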
This is the most basic implementation of the LDA model: the number of topics K and the hyperparameters must be supplied by hand. A version that estimates them automatically is left for later study.
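The output distributions `theta` and `phi` named in the pseudocode are filled in from the count tables after the final iteration. A hedged sketch of that smoothing step, using the same array shapes as the `__init__` above but with stand-in values (the project computes this in its own helper methods, which are not shown in this post):

```python
import numpy as np

# Assumed shapes, matching the count arrays defined earlier:
K, V, M = 2, 5, 3                        # topics, vocabulary size, documents
alpha, beta = 0.5, 0.1                   # hyperparameters
rng = np.random.default_rng(1)
nd = rng.integers(0, 10, size=(M, K))    # doc-topic counts (stand-in values)
nw = rng.integers(0, 10, size=(V, K))    # word-topic counts (stand-in values)
ndsum = nd.sum(axis=1)                   # words per document
nwsum = nw.sum(axis=0)                   # words per topic

# Dirichlet-smoothed estimates of the two output distributions:
theta = (nd + alpha) / (ndsum[:, None] + K * alpha)   # M x K, each row sums to 1
phi = (nw.T + beta) / (nwsum[:, None] + V * beta)     # K x V, each row sums to 1

assert np.allclose(theta.sum(axis=1), 1.0)
assert np.allclose(phi.sum(axis=1), 1.0)
```

The `alpha` and `beta` terms are the same smoothing that appears in the numerator and denominator of the conditional probability inside `sampling()`, so the estimates stay consistent with the sampler.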