"Stove-refining AI" machine learning 042-NLP Theme modeling of text


(Python libraries and version numbers used in this article: Python 3.6, Numpy 1.14, Scikit-learn 0.19, matplotlib 2.2, NLTK 3.3)

Topic modeling is the process of using NLP to uncover the patterns hidden in a text document, so that the document can be analyzed by its subject matter. It works by identifying the most meaningful and most representative words in the document: find the keywords of a text document, and use those keywords to reveal the document's hidden topics and classify it.
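As a minimal illustration of this keyword idea, the following standard-library sketch (a toy example, not the pipeline used later in this article) ranks the most frequent non-stop-words in a document:

```python
import re
from collections import Counter

# A tiny hand-picked stop-word list just for this illustration;
# the pipeline below uses NLTK's full English stop-word list instead.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "is", "in", "it", "that"}

def top_keywords(text, n=3):
    """Return the n most frequent non-stop-word tokens in the text."""
    tokens = re.findall(r"\w+", text.lower())   # tokenize on word characters
    counts = Counter(t for t in tokens if t not in STOP_WORDS)
    return [word for word, _ in counts.most_common(n)]

doc = ("The system needs to encrypt the data, and the system "
       "needs to encrypt it in order to work.")
print(top_keywords(doc))  # -> ['system', 'needs', 'encrypt']
```

The repeated content words rise to the top once the stop words are filtered out, which is exactly the signal topic modeling exploits, only at a much larger scale and across many documents.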


1. Prepare the data set

The data set used in this section is stored in a txt file, so the text must be loaded from the txt file before it can be preprocessed. Because preprocessing involves several steps, a class is created here to handle both the loading and the preprocessing of the data, which keeps the code concise and reusable.

# Prepare the data set: build a class to load and preprocess the data
from nltk.tokenize import RegexpTokenizer
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
from gensim import models, corpora

class DataSet:

    def __init__(self, txt_file_path):
        self.__txt_file = txt_file_path

    def __load_txt(self):  # load the content from the txt document, line by line
        with open(self.__txt_file, 'r') as file:
            content = file.readlines()  # read all lines at once
        return [line[:-1] for line in content]  # remove the '\n' at the end of each line

    def __tokenize(self, lines_list):  # preprocessing step 1: tokenize each line of text
        # a regex tokenizer instead of word_tokenize: it excludes punctuation
        tokenizer = RegexpTokenizer(r'\w+')
        return [tokenizer.tokenize(line.lower()) for line in lines_list]

    def __remove_stops(self, lines_list):  # preprocessing step 2: remove stop words from each line
        # remove the stop words to avoid the noise they introduce, using a stop-word list
        stop_words_list = stopwords.words('english')  # get the English stop-word list
        return [[token for token in line if token not in stop_words_list]
                for line in lines_list]
        # Slightly tricky: each element of lines_list is itself a list, holding one line
        # of text made up of N tokens, so lines_list is two-dimensional and needs a
        # nested comprehension.

    def __word_stemm(self, lines_list):  # preprocessing step 3: stem each token
        stemmer = SnowballStemmer('english')
        return [[stemmer.stem(word) for word in line] for line in lines_list]

    def prepare(self):
        """External entry point for preparing the data set."""
        # first load the content from the txt file, then tokenize,
        # then remove stop words, then stem
        stemmed_words = self.__word_stemm(
            self.__remove_stops(self.__tokenize(self.__load_txt())))
        # the modeling below needs a dict-based word matrix, so first build the
        # dict with corpora, then build the word matrix
        dict_words = corpora.Dictionary(stemmed_words)
        matrix_words = [dict_words.doc2bow(text) for text in stemmed_words]
        return dict_words, matrix_words

    # the following functions are mainly used to test whether the functions above run normally
    def get_content(self):
        return self.__load_txt()

    def get_tokenize(self):
        return self.__tokenize(self.__load_txt())

    def get_remove_stops(self):
        return self.__remove_stops(self.__tokenize(self.__load_txt()))

    def get_word_stemm(self):
        return self.__word_stemm(self.__remove_stops(self.__tokenize(self.__load_txt())))
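For readers new to gensim, the following pure-Python sketch (an illustration of the idea, not gensim's actual implementation) mimics what corpora.Dictionary and doc2bow produce: every unique token gets an integer id, and each document becomes a list of (token_id, count) pairs:

```python
def build_dictionary(docs):
    """Assign an integer id to every unique token, roughly like corpora.Dictionary."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id

def doc2bow(token2id, doc):
    """Count each known token in one document, roughly like Dictionary.doc2bow:
    the result is a sorted list of (token_id, count) pairs."""
    counts = {}
    for token in doc:
        if token in token2id:
            tid = token2id[token]
            counts[tid] = counts.get(tid, 0) + 1
    return sorted(counts.items())

docs = [["need", "encrypt", "need"], ["train", "talent"]]
token2id = build_dictionary(docs)
print(doc2bow(token2id, docs[0]))  # -> [(0, 2), (1, 1)]: 'need' twice, 'encrypt' once
```

This sparse (id, count) representation is the "word matrix" that LdaModel consumes; gensim's real Dictionary additionally handles filtering, persistence, and unseen tokens.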

Does this class work properly, and does it give the results we expect? You can test it with the following code:

# Test whether the DataSet class above runs normally
dataset = DataSet("E:\PyProjects\DataSet\FireAI\data_topic_modeling.txt")
# test whether __load_txt() works
content = dataset.get_content()
print(len(content))
print(content[:3])
# test whether __tokenize() works
tokenized = dataset.get_tokenize()
print(tokenized)
# test whether __remove_stops() works
removed = dataset.get_remove_stops()
print(removed)
# test whether __word_stemm() works
stemmed = dataset.get_word_stemm()
print(stemmed)
# test whether prepare() works
_, prepared = dataset.prepare()
print(prepared)

The output is fairly long, so see the source code on my GitHub for the results.


2. Build the model and train the data set

We use the LDA (Latent Dirichlet Allocation) model for topic modeling, as follows:

# Get the data set
dataset = DataSet("E:\PyProjects\DataSet\FireAI\data_topic_modeling.txt")
dict_words, matrix_words = dataset.prepare()
# Build the model with LdaModel; here we assume the original document has two topics
lda_model = models.ldamodel.LdaModel(matrix_words, num_topics=2,
                                     id2word=dict_words, passes=25)

The code above builds the LdaModel and trains it. Note that LdaModel lives in the gensim module, which must be installed with "pip install gensim" before it can be used.

LdaModel calculates the importance of each word within every topic and builds an importance equation for each topic; it relies on these equations to predict the topics of a document.

The following code prints out these importance equations:

# Print the N most important words for each topic
print('Most important words to topics: ')
for item in lda_model.print_topics(num_topics=2, num_words=5):  # print only the 5 most important words
    print('Topic: {}, words: {}'.format(item[0], item[1]))

-------------------------------------output-----------------------------------------

Most important words to topics: 
Topic: 0, words: 0.075*"need" + 0.053*"order" + 0.032*"system" + 0.032*"encrypt" + 0.032*"work"
Topic: 1, words: 0.037*"younger" + 0.037*"develop" + 0.037*"promot" + 0.037*"talent" + 0.037*"train"

--------------------------------------------finished-------------------------------------
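Each topic that print_topics returns pairs a topic id with a plain string of weight*"word" terms joined by "+". If you need the words and weights programmatically, one way (a sketch that assumes gensim's usual topic-string format) is to parse the string with a regular expression:

```python
import re

def parse_topic(topic_string):
    """Split a gensim-style topic string such as '0.075*"need" + 0.053*"order"'
    into a list of (word, weight) pairs."""
    return [(word, float(weight))
            for weight, word in re.findall(r'([\d.]+)\*"(\w+)"', topic_string)]

topic = '0.075*"need" + 0.053*"order" + 0.032*"system"'
print(parse_topic(topic))  # -> [('need', 0.075), ('order', 0.053), ('system', 0.032)]
```

gensim also offers show_topic(topicid), which returns (word, probability) tuples directly, so parsing is only needed when all you have is the printed string.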

########################### Summary ###############################

1. In a typical machine learning project, most of the work we have to do ourselves concerns the data set. It is worth writing the data set processing steps into a dedicated class, as I did above for text preprocessing, with each function representing one preprocessing method. This keeps the code well organized and gives it a certain reusability.

2. Here we use the LdaModel from the gensim module for topic modeling. gensim is a very useful NLP toolkit that is widely used in text content analysis.

#################################################################


Note: the code for this section has been uploaded to (my GitHub); you are welcome to download it.

Resources:

1. Python Machine Learning Cookbook, Prateek Joshi; translated by Tao Junjie and Chen Xiaoli

"Stove-refining AI" machine learning 042-NLP Theme modeling of text
