Introduction to Natural Language Processing (8): TextRank

Source: Internet
Author: User
Tags: processing, text

TextRank is a widely used algorithm in natural language processing for extracting keywords and key phrases and for automatically generating text summaries. It is adapted from the PageRank algorithm and borrows many of its ideas. Processing text data mainly involves the following steps:

(1) First, split the original text into sentences. In each sentence, filter out the stop words (this is optional) and keep only words with the specified parts of speech. This yields a collection of sentences and a collection of words.

(2) Treat each word as a node in PageRank. Let the window size be k, and suppose a sentence consists of the words w1, w2, w3, ..., wn.

Then w1, w2, ..., wk; w2, w3, ..., wk+1; w3, w4, ..., wk+2; and so on are each a window. Any two words that appear in the same window are connected by an undirected, unweighted edge.

(3) Based on the graph formed by these nodes and edges, the importance of each node can be computed. The most important words can be used as keywords to distinguish text categories and topics.
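The three steps above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation used by any particular library: it tokenizes English text with a regular expression (Chinese text would need a word segmenter), skips part-of-speech filtering, and uses a tiny made-up stop-word list. The function names, the damping factor of 0.85, and the fixed iteration count are all choices made for this example.

```python
import re
from collections import defaultdict

# Hypothetical toy stop-word list; a real one would be far larger.
STOP_WORDS = {"the", "a", "is", "and", "of", "to", "it", "this"}


def split_sentences(text):
    """Step 1a: split text into sentences on '.', '!', '?' or newlines."""
    return [s.strip() for s in re.split(r"[.!?\n]+", text) if s.strip()]


def tokenize(sentence):
    """Step 1b: lowercase, split into words, and drop stop words."""
    words = re.findall(r"[a-z0-9']+", sentence.lower())
    return [w for w in words if w not in STOP_WORDS]


def build_graph(sentences, window=2):
    """Step 2: connect every pair of distinct words sharing a window.

    `sentences` is a list of word lists. Edges are undirected and
    unweighted, so each node simply stores a set of neighbours.
    """
    graph = defaultdict(set)
    for words in sentences:
        for i, word in enumerate(words):
            # Words at positions i+1 .. i+window-1 share a window with words[i].
            for other in words[i + 1:i + window]:
                if other != word:
                    graph[word].add(other)
                    graph[other].add(word)
    return graph


def pagerank(graph, damping=0.85, iterations=50):
    """Step 3: repeat the PageRank update a fixed number of times.

    score(v) = (1 - d) + d * sum(score(u) / degree(u) for neighbours u of v)
    """
    scores = {node: 1.0 for node in graph}
    for _ in range(iterations):
        new_scores = {}
        for node in graph:
            incoming = sum(scores[nb] / len(graph[nb]) for nb in graph[node])
            new_scores[node] = (1 - damping) + damping * incoming
        scores = new_scores
    return scores


def extract_keywords(text, window=2, top_n=5):
    """Run all three steps and return the top_n (word, score) pairs."""
    sentences = [tokenize(s) for s in split_sentences(text)]
    scores = pagerank(build_graph(sentences, window))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


if __name__ == "__main__":
    sample = "The phone is fast. The phone camera is great. Great camera and fast phone."
    for word, score in extract_keywords(sample, window=2, top_n=3):
        print("{}: {:.4f}".format(word, score))
```

In this sketch a word that never shares a window with another word (for example, the only word left in a one-word sentence) does not enter the graph and therefore receives no score. A larger window links words that are further apart, which makes the graph denser.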


A Python implementation based on review data for the Honor V10 phone is as follows:

# -*- coding: utf-8 -*-
"""
Created on Fri Feb 9 15:58:14 2018

@author: Zch
"""

import codecs
from textrank4zh import TextRank4Keyword, TextRank4Sentence

# Read the review text for the Honor V10 phone from the Huawei Honor flagship store on Tmall
text = codecs.open('d://Data/tmall/origin_tmall_review.txt', 'r', 'utf-8').read()

# Keyword and key phrase extraction
tr4w = TextRank4Keyword()

# window=2 means only adjacent words are linked in the word graph
tr4w.analyze(text=text, lower=True, window=2)

print('Keywords:')
for item in tr4w.get_keywords(word_min_len=1):
    # item.weight is the TextRank weight of the word
    print('"{}" weight: {:.6f}'.format(item.word, item.weight))

print('Key phrases:')
for phrase in tr4w.get_keyphrases(keywords_num=10, min_occur_num=5):
    print(phrase)

# Summary extraction
tr4s = TextRank4Sentence()
tr4s.analyze(text=text, lower=True, source='all_filters')

print()
print('Summary:')
for item in tr4s.get_key_sentences(num=3):
    # index is the position of the sentence in the text, weight is its weight
    print('Sentence {}, weight: {:.6f}, content: {}'.format(item.index, item.weight, item.sentence))

The output keywords are shown in the following figure:

The output key phrases are shown in the following figure:

The output summary is shown in the following figure:

The output above shows that most of the reviews of the Huawei Honor V10 are positive, and that the extracted results broadly reflect users' attitudes toward the phone.




