Getting Started with Natural Language Processing (6): Generating Article Topics with LDA

Last Update:2018-07-24 Source: Internet

Author: User


1. LDA Overview

LDA (Latent Dirichlet Allocation) is a generative topic model for documents. It is also known as a three-layer Bayesian probabilistic model, with a three-tier structure of words, topics, and documents. As a generative model, it assumes that every word in an article is obtained by "choosing a topic with a certain probability, and then choosing a word from that topic with a certain probability". The document-to-topic distribution is multinomial, and the topic-to-word distribution is also multinomial.
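The generative story described above can be sketched in a few lines of NumPy. This is only a toy illustration, not the lda library's implementation: the prior `alpha`, the two-topic, four-word `topic_word` matrix, and the helper name `generate_document` are all hypothetical values chosen for the example.

```python
import numpy as np

def generate_document(alpha, topic_word, n_words, rng):
    """Generate one toy document following the LDA generative process."""
    # Draw the document's topic proportions from a Dirichlet prior.
    theta = rng.dirichlet(alpha)
    topics, words = [], []
    for _ in range(n_words):
        # Choose a topic according to the document's topic distribution.
        z = rng.choice(len(alpha), p=theta)
        # Choose a word id according to that topic's word distribution.
        w = rng.choice(topic_word.shape[1], p=topic_word[z])
        topics.append(int(z))
        words.append(int(w))
    return theta, topics, words

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5])                   # hypothetical prior over 2 topics
topic_word = np.array([[0.7, 0.2, 0.1, 0.0],   # hypothetical topic 0 over 4 word ids
                       [0.0, 0.1, 0.2, 0.7]])  # hypothetical topic 1 over 4 word ids
theta, topics, words = generate_document(alpha, topic_word, n_words=10, rng=rng)
print("topic proportions:", theta)
print("topic of each word:", topics)
print("word ids:", words)
```

Fitting an LDA model runs this process in reverse: given only the observed words, it estimates the topic proportions of each document and the word distribution of each topic.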

LDA is an unsupervised machine learning technique that can be used to identify hidden topic information in a large document collection or corpus. It uses the bag-of-words method, which treats each document as a word-frequency vector and so turns text into numerical information that is easy to model. Each document is represented as a probability distribution over topics, and each topic is represented as a probability distribution over words.
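As a rough illustration of the bag-of-words representation, the sketch below turns two short sentences into word-frequency vectors. The helper `bag_of_words` and the sample sentences are made up for this example, and tokenization is simply lowercasing and splitting on whitespace, which is far cruder than what a real corpus would need.

```python
from collections import Counter

def bag_of_words(documents):
    """Return a sorted vocabulary and one word-count vector per document."""
    # Build the vocabulary from every distinct token in the corpus.
    vocab = sorted({w for doc in documents for w in doc.lower().split()})
    matrix = []
    for doc in documents:
        counts = Counter(doc.lower().split())
        # One count per vocabulary word; word order in the document is discarded.
        matrix.append([counts.get(w, 0) for w in vocab])
    return vocab, matrix

docs = ["the cat sat on the mat", "the dog sat"]  # hypothetical mini corpus
vocab, X = bag_of_words(docs)
print(vocab)
for row in X:
    print(row)
```

A document-term matrix of this shape (documents as rows, vocabulary words as columns) is the kind of input a topic model is fitted on.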

The derivation of the LDA model involves the multinomial distribution, the Dirichlet distribution, and Gibbs sampling. Specifically, they are used in the following steps:
(1) The topic distribution of each document and the word distribution of each topic are obtained by sampling from Dirichlet distributions.
(2) The topic of each word in the current document is obtained by sampling from the document's multinomial distribution over topics.
(3) Each word is generated by sampling from the multinomial distribution over words of its topic.

2. The topic generation of the article based on LDA

This article uses the lda library for Python to load a corpus and compute the topics of its articles.

The implementation code looks like this:

# -*- coding: utf-8 -*-
"""
Created on Sun Aug 20:51:15 2017
@author: Administrator
"""
import numpy as np
import lda
import lda.datasets

# 1. Import the data source
# Load the Reuters data through the API bundled with the lda library
titles = lda.datasets.load_reuters_titles()
for i in range(395):
    print(titles[i])

# 2. Solve P(word | topic) to get the distribution of words in each topic
X = lda.datasets.load_reuters()
vocab = lda.datasets.load_reuters_vocab()
titles = lda.datasets.load_reuters_titles()

# Use 20 topics, show 8 words per topic, and run 1500 iterations
model = lda.LDA(n_topics=20, n_iter=1500, random_state=1)
model.fit(X)
topic_word = model.topic_word_
n_top_words = 8
for i, topic_dist in enumerate(topic_word):
    # Sort words by probability and keep the n_top_words most probable ones
    topic_words = np.array(vocab)[np.argsort(topic_dist)][:-(n_top_words + 1):-1]
    # Print the words of each topic
    print('Topic {}: {}'.format(i, ' '.join(topic_words)))

# 3. Solve P(topic | document) to get the topic of each article
doc_topic = model.doc_topic_
for i in range(20):
    # Print the most probable topic of each article
    print('{} (top topic: {})'.format(titles[i], doc_topic[i].argmax()))
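The two indexing idioms in the code, the reversed slice that selects each topic's top words and the `argmax` that selects each document's top topic, can be checked on a toy example. The vocabulary and probabilities below are hypothetical and are not taken from the Reuters data.

```python
import numpy as np

vocab = ("bank", "church", "market", "pope", "trade")   # hypothetical vocabulary
topic_dist = np.array([0.30, 0.05, 0.25, 0.02, 0.38])   # hypothetical P(word | topic)
n_top_words = 3

# argsort returns word indices from least to most probable.
order = np.argsort(topic_dist)
# Walking backwards from the end and stopping before position -(n_top_words + 1)
# keeps the n_top_words most probable words, most probable first.
top_words = np.array(vocab)[order][:-(n_top_words + 1):-1]
print(list(top_words))  # ['trade', 'bank', 'market']

# Each row is a hypothetical P(topic | document); argmax picks the top topic.
doc_topic = np.array([[0.1, 0.7, 0.2],
                      [0.6, 0.3, 0.1]])
top_topics = doc_topic.argmax(axis=1)
print(top_topics.tolist())  # [1, 0]
```

The same result could be obtained with `np.argsort(topic_dist)[::-1][:n_top_words]`, which some readers may find easier to follow than the negative-step slice.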

The results of running the code are shown in the following illustration:

From the figure above, the dataset contains 395 documents, the articles contain 84010 words in total, and the number of topics is 20. The titles of some of the articles are shown in the following illustration:

The distribution of the words that each topic contains is shown in the following illustration:

The topic assigned to each article is shown in the following illustration (excerpt, the first 20 articles):

