Getting Started with Natural Language Processing (6): LDA-Based Topic Generation for Articles

Source: Internet
Author: User
1. LDA Overview

LDA (latent Dirichlet allocation) is a document topic generation model, also known as a three-layer Bayesian probabilistic model, with a three-layer structure of words, topics, and documents. It is a generative model: we assume that every word in an article is obtained by "choosing a topic with a certain probability, and then choosing a word from that topic with a certain probability". The document-to-topic distribution is multinomial, and the topic-to-word distribution is also multinomial.
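The two-step generative process described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not part of the lda library: the vocabulary, the number of topics, the Dirichlet parameters (0.5 and 0.1), and the helper name generate_document are all made up for the example.

```python
import numpy as np

def generate_document(n_words, doc_topic, topic_word, vocab, rng):
    """Generate one document word by word (hypothetical helper)."""
    words = []
    for _ in range(n_words):
        # Step 1: choose a topic from the document's multinomial over topics.
        topic = rng.choice(len(doc_topic), p=doc_topic)
        # Step 2: choose a word from that topic's multinomial over words.
        word_id = rng.choice(len(vocab), p=topic_word[topic])
        words.append(vocab[word_id])
    return words

rng = np.random.default_rng(0)
vocab = ["bank", "loan", "river", "water", "match", "goal"]  # toy vocabulary (assumption)
n_topics = 3                                                  # assumption

# Both multinomials are themselves drawn from Dirichlet distributions,
# which is where the "Dirichlet" in the model's name comes from.
doc_topic = rng.dirichlet(np.full(n_topics, 0.5))                    # P(topic | document)
topic_word = rng.dirichlet(np.full(len(vocab), 0.1), size=n_topics)  # P(word | topic)

document = generate_document(10, doc_topic, topic_word, vocab, rng)
print(document)
```

Training an LDA model runs this process in reverse: given only the observed words, it estimates the two hidden distributions.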

LDA is an unsupervised machine learning technique that can be used to identify hidden topic information in a large document collection or corpus. It uses the bag-of-words approach, which ignores word order and treats each document as a word-frequency vector, turning text into numerical information that is easy to model. Each document is represented as a probability distribution over topics, and each topic is represented as a probability distribution over words.
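To make the bag-of-words representation concrete, here is a minimal sketch that turns two toy sentences into word-frequency vectors. The helper name bag_of_words and the sample sentences are assumptions for illustration; real pipelines usually add tokenization and stop-word removal.

```python
from collections import Counter

def bag_of_words(docs):
    """Return a sorted vocabulary and one frequency vector per document."""
    tokenized = [d.lower().split() for d in docs]
    vocab = sorted({w for tokens in tokenized for w in tokens})
    matrix = []
    for tokens in tokenized:
        counts = Counter(tokens)
        # One entry per vocabulary word; word order in the document is lost.
        matrix.append([counts.get(w, 0) for w in vocab])
    return vocab, matrix

docs = ["the cat sat on the mat", "the dog sat"]  # toy corpus (assumption)
vocab, matrix = bag_of_words(docs)
print(vocab)
for row in matrix:
    print(row)
```

Each row of the resulting matrix is one document, and each column is one vocabulary word. A document-term matrix of this shape is what a topic model is fitted on.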

The derivation of the LDA model involves the multinomial distribution, the Dirichlet distribution, and Gibbs sampling. Specifically, they are applied in the following ways:
(1) The topic distribution of each document and the word distribution of each topic are obtained by sampling from Dirichlet distributions.
(2) The topic of each word in the current document is obtained by sampling from the document's multinomial distribution over topics.
(3) The word itself is generated by sampling from that topic's multinomial distribution over words.

2. LDA-Based Topic Generation for Articles

This article uses the lda library for Python to load a corpus and compute the topics of its articles.

The implementation code looks like this:

# -*- coding: utf-8 -*-
"""
Created on Sun Aug 20:51:15 2017

@author: Administrator
"""
import numpy as np
import lda
import lda.datasets

# 1. Import the data source
# Load the Reuters data through the API bundled with the lda library
titles = lda.datasets.load_reuters_titles()
for i in range(395):
    print(titles[i])

# 2. Solve P(word | topic) to get the word distribution of each topic
X = lda.datasets.load_reuters()
vocab = lda.datasets.load_reuters_vocab()
titles = lda.datasets.load_reuters_titles()

# Use 20 topics, show 8 words per topic, and run 1500 iterations
model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
model.fit(X)
topic_word = model.topic_word_
n_top_words = 8
for i, topic_dist in enumerate(topic_word):
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words + 1):-1]
    # Print the most probable words of each topic
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

# 3. Solve P(topic | document) to get the topic of each article
doc_topic = model.doc_topic_
for i in range(20):
    # Print the most probable topic of each article
    print('{} (top topic: {})'.format(titles[i], doc_topic[i].argmax()))
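The slice [:-(n_top_words+1):-1] used in the code above is easy to misread, so here is a small sketch of what it does on toy data. The vocabulary and the probabilities are invented for illustration and do not come from the Reuters corpus.

```python
import numpy as np

vocab = np.array(["oil", "bank", "market", "church", "pope"])  # toy vocabulary (assumption)
topic_dist = np.array([0.10, 0.40, 0.30, 0.05, 0.15])          # toy P(word | topic)
n_top_words = 3

# argsort returns indices in ascending order of probability, so the most
# probable words end up at the END of the reordered vocabulary.
order = np.argsort(topic_dist)

# Walk backwards from the last element and stop before position
# -(n_top_words + 1): this yields the n_top_words most probable words,
# most probable first.
top_words = vocab[order][:-(n_top_words + 1):-1]
print(top_words)

# The per-document step works the same way in spirit: argmax picks the
# index of the largest entry in each row of P(topic | document).
doc_topic = np.array([[0.1, 0.7, 0.2],
                      [0.6, 0.3, 0.1]])  # toy values (assumption)
top_topics = doc_topic.argmax(axis=1)
print(top_topics)
```

An equivalent and perhaps more readable form of the slice is vocab[np.argsort(topic_dist)[::-1][:n_top_words]].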

The results of the run are shown in the following illustration:

As the figure above shows, the dataset contains 395 documents, the corpus contains 84,010 words in total, and the number of topics is 20. The titles of some of the articles are shown in the following illustration:

The word distribution of each topic is shown in the following illustration:

The topic of each article is shown in the following illustration (excerpt, the first 20 articles):
