1. LDA Overview
LDA (latent Dirichlet allocation) is a document topic generation model, also known as a three-layer Bayesian probabilistic model, with a three-layer structure of words, topics, and documents. It is a generative model: we assume that every word in an article is obtained by "choosing a topic with a certain probability, and then choosing a word from that topic with a certain probability". The document-to-topic distribution is multinomial, and the topic-to-word distribution is also multinomial.
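The generative story can be made concrete with a small simulation. The following is only a minimal sketch using the Python standard library; the toy vocabulary, the prior values 0.5, and the helper names (`sample_dirichlet`, `sample_index`, `generate_document`) are assumptions made for illustration and are not part of any LDA library.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def sample_index(probs, rng):
    """Draw one index from the categorical distribution given by probs."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def generate_document(n_words, topic_word, alpha, vocab, rng):
    """Generate one document: pick a topic per word, then a word from that topic."""
    doc_topic = sample_dirichlet(alpha, rng)      # document -> topic distribution
    topics, words = [], []
    for _ in range(n_words):
        z = sample_index(doc_topic, rng)          # choose a topic
        w = sample_index(topic_word[z], rng)      # choose a word from that topic
        topics.append(z)
        words.append(vocab[w])
    return doc_topic, topics, words

rng = random.Random(0)
vocab = ["bank", "loan", "river", "water"]        # toy vocabulary (hypothetical)
alpha = [0.5, 0.5]                                # prior over topics, 2 topics (assumed)
beta = [0.5] * len(vocab)                         # prior over words (assumed)
topic_word = [sample_dirichlet(beta, rng) for _ in alpha]  # topic -> word distributions
doc_topic, topics, words = generate_document(10, topic_word, alpha, vocab, rng)
print(words)
```

Fitting LDA to real data runs in the opposite direction: the words are observed, and the topic assignments and distributions are what must be inferred.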
LDA is an unsupervised machine learning technique that can be used to identify hidden topic information in a large document collection or corpus. It uses the bag-of-words approach, which treats each document as a word-frequency vector and ignores word order, turning text into numerical data that is easy to model. Each document is represented as a probability distribution over topics, and each topic as a probability distribution over words.
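To illustrate the bag-of-words representation, here is a minimal sketch that turns two toy sentences into word-frequency vectors. The whitespace tokenizer and the function name `bag_of_words` are assumptions for illustration; a real pipeline would normally use a proper tokenizer and stop-word handling.

```python
from collections import Counter

def bag_of_words(docs):
    """Return (vocab, matrix) where matrix[d][j] counts vocab[j] in docs[d]."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    matrix = []
    for toks in tokenized:
        counts = Counter(toks)                    # word -> frequency in this document
        matrix.append([counts[w] for w in vocab]) # missing words count as 0
    return vocab, matrix

docs = ["the bank approved the loan", "the river bank flooded"]
vocab, X = bag_of_words(docs)
print(vocab)
print(X)
```

Each row of `X` is one document and each column one vocabulary word, which is the shape of input a document-term matrix based LDA implementation expects.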
The derivation of the LDA model involves the multinomial distribution, the Dirichlet distribution, and Gibbs sampling. Specifically, they are used in the following ways:
(1) The topic distribution of each document and the word distribution of each topic are obtained by sampling from Dirichlet distributions.
(2) The topic of each word in the current document is obtained by sampling from the document's multinomial distribution over topics.
(3) Each word is generated by sampling from the chosen topic's multinomial distribution over words.
2. Topic generation for articles based on LDA
This article uses the lda library for Python to load a corpus and compute the topics of its articles.
The implementation code is as follows:
# -*- coding: utf-8 -*-
"""
Created on Sun Aug 20:51:15 2017

@author: Administrator
"""
import numpy as np
import lda
import lda.datasets

# 1. Import the data source
# Load the Reuters data through the API bundled with the lda library
titles = lda.datasets.load_reuters_titles()
for i in range(395):
    print(titles[i])

# 2. Solve P(word | topic) to get the distribution of words in each topic
X = lda.datasets.load_reuters()
vocab = lda.datasets.load_reuters_vocab()
titles = lda.datasets.load_reuters_titles()

# Set the number of topics to 20, show 8 words per topic,
# and run the model for 1500 iterations
model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
model.fit(X)
topic_word = model.topic_word_
n_top_words = 8
for i, topic_dist in enumerate(topic_word):
    # Sort words by probability and keep the n_top_words most probable ones
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words + 1):-1]
    # Print the words of each topic
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

# 3. Solve P(topic | document) to get the topic of each article
doc_topic = model.doc_topic_
# The count is missing in the original listing; 20 matches the
# "first 20 articles" excerpt described below
for i in range(20):
    # Print the most probable topic of each article
    print('{} (top topic: {})'.format(titles[i], doc_topic[i].argmax()))
The results of the run are shown in the following illustration:
From the figure above, the dataset contains 395 documents, the corpus contains 84010 words in total, and the number of topics is 20. The titles of some of the articles are shown in the following illustration:
The distribution of the words that each topic contains is shown in the following illustration:
The topic assigned to each article is shown in the following illustration (excerpt, the first 20 articles):
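As a side note on the listing in this section, its two selection steps (the most probable words of a topic, and the most probable topic of a document) can be checked on toy numbers without any library. This is a minimal sketch; the probabilities, the vocabulary, and the helper names `top_words` and `top_topic` are made up for illustration and are not output of the Reuters run.

```python
def top_words(topic_dist, vocab, n):
    """Return the n words with the highest probability in topic_dist."""
    order = sorted(range(len(topic_dist)), key=lambda j: topic_dist[j], reverse=True)
    return [vocab[j] for j in order[:n]]

def top_topic(doc_dist):
    """Return the index of the most probable topic for one document."""
    return max(range(len(doc_dist)), key=lambda k: doc_dist[k])

vocab = ["bank", "loan", "river", "water"]   # toy vocabulary (hypothetical)
topic_dist = [0.1, 0.6, 0.05, 0.25]          # toy P(word | topic)
doc_dist = [0.2, 0.7, 0.1]                   # toy P(topic | document)
print(top_words(topic_dist, vocab, 2))       # ['loan', 'water']
print(top_topic(doc_dist))                   # 1
```

`top_words` plays the role of the `np.argsort(...)[:-(n_top_words + 1):-1]` slice, and `top_topic` the role of `argmax()`.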