Using the TextRank algorithm to generate keywords and summaries for text


The TextRank algorithm is based on PageRank and is used to generate keywords and summaries for text. The paper is:

Mihalcea R, Tarau P. TextRank: Bringing Order into Texts. Proceedings of EMNLP 2004, Association for Computational Linguistics, 2004.

Let's start with PageRank.

PageRank

PageRank was originally used to rank the importance of web pages. The whole WWW can be regarded as a directed graph in which each node is a web page. If page A has a link to page B, there is a directed edge from A to B.

After the graph is built, iterate with the following formula:

S(V_i) = (1 - d) + d * Σ_{V_j ∈ In(V_i)} S(V_j) / |Out(V_j)|

S(V_i) is the importance (PR value) of web page i. d is the damping factor, typically set to 0.85. In(V_i) is the set of web pages that link to page i. Out(V_j) is the set of web pages that page j links to, and |Out(V_j)| is the number of elements in that set.

PageRank needs multiple iterations of the above formula to reach a result. Initially, the importance of each page can be set to 1. In the formula, the value on the left side of the equals sign is the PR value of page i after the iteration, while the PR values on the right side are all from before the iteration.

As an example, consider three web pages A, B, and C, where B and C each link to A (the original figure is omitted). Intuitively, page A is the most important. You can get the following table:

End \ Start    A    B    C
A              0    1    1
B              0    0    0
C              0    0    0

The rows represent the end nodes and the columns represent the start nodes. If there is a link from the start node to the end node, the corresponding value is 1.

According to the formula, each column needs to be normalized (each element divided by the sum of its column). The normalized result is:

End \ Start    A    B    C
A              0    1    1
B              0    0    0
C              0    0    0

(Columns B and C already sum to 1, and column A is all zeros, so the normalized table happens to be unchanged.)

The results above form the matrix M. We use MATLAB to iterate 100 times and look at the final importance of each page:

M = [0 1 1; 0 0 0; 0 0 0];
PR = [1; 1; 1];
for iter = 1:100
    PR = 0.15 + 0.85*M*PR;
    disp(iter);
    disp(PR);
end
Run results (intermediate iterations omitted):

...
100
    0.4050
    0.1500
    0.1500

The final PR value of A is 0.4050, and B and C are each 0.1500.

If the edges above are considered undirected (in effect, bidirectional), then:

M = [0 1 1; 0.5 0 0; 0.5 0 0];
PR = [1; 1; 1];
for iter = 1:100
    PR = 0.15 + 0.85*M*PR;
    disp(iter);
    disp(PR);
end
Run results (intermediate iterations omitted):

...
100
    1.4595
    0.7703
    0.7703

The relative importance of A, B, and C can still be distinguished.
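For readers without MATLAB, the same iteration can be sketched in Python with NumPy (this block is an addition for illustration; the matrix values follow the undirected example above):

```python
import numpy as np

# Column-normalized link matrix for the undirected example:
# B and C each link to A, and A links back to B and C.
M = np.array([[0.0, 1.0, 1.0],
              [0.5, 0.0, 0.0],
              [0.5, 0.0, 0.0]])
pr = np.ones(3)          # initial importance of each page

d = 0.85                 # damping factor
for _ in range(100):
    pr = (1 - d) + d * M.dot(pr)

print(pr)                # A clearly outranks B and C
```

After 100 iterations the values match the MATLAB run: roughly 1.4595 for A and 0.7703 for B and C.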


Extracting keywords using TextRank


Split the original text into sentences, filter out the stop words in each sentence (optional), and keep only words of specified parts of speech (optional). This gives a set of sentences and a set of words.

Each word acts as a node in PageRank. Set the window size to k, and suppose a sentence consists of the following words in order:

w1, w2, w3, w4, w5, ..., wn

Then w1, w2, ..., wk; w2, w3, ..., wk+1; w3, w4, ..., wk+2; and so on are each a window. An undirected, unweighted edge is added between the nodes corresponding to any two words that appear together in a window.

Based on this graph, the importance of each word node can be computed. The top-ranked words can be used as keywords.
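The keyword step can be sketched in plain Python (an illustrative addition, not the TextRank4ZH implementation; the word list in the example is hypothetical, standing in for a sentence after filtering):

```python
from itertools import combinations
from collections import defaultdict


def textrank_keywords(words, window=2, d=0.85, iters=100):
    """Score words by TextRank: co-occurrence graph plus PageRank iteration.

    `words` is the filtered word sequence of one or more sentences.
    Edges are undirected and unweighted, as described above.
    """
    # Build the co-occurrence graph: every pair inside a window is linked.
    neighbors = defaultdict(set)
    for i in range(len(words) - window + 1):
        for a, b in combinations(words[i:i + window], 2):
            if a != b:
                neighbors[a].add(b)
                neighbors[b].add(a)

    # PageRank iteration on the undirected graph.
    score = dict.fromkeys(neighbors, 1.0)
    for _ in range(iters):
        score = {
            w: (1 - d) + d * sum(score[u] / len(neighbors[u])
                                 for u in neighbors[w])
            for w in score
        }
    return sorted(score, key=score.get, reverse=True)


# Toy example: a hypothetical filtered word sequence.
print(textrank_keywords(['graph', 'node', 'edge', 'node', 'graph', 'rank'],
                        window=2))
```

Here "graph" and "node" each co-occur with two distinct words, so they rank above "edge" and "rank".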


Extracting key phrases using TextRank


Refer to "Extracting keywords using TextRank" above to extract several keywords. If several of these keywords are adjacent in the original text, they can be combined into a key phrase.

For example, in an article about support vector machines, you might find the three keywords support, vector, and machine; through key phrase extraction, you can obtain the phrase support vector machine.
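The merging step can be sketched as a single pass over the tokenized text (an illustrative addition; `merge_keyphrases` is a hypothetical helper, not part of TextRank4ZH):

```python
def merge_keyphrases(text_words, keywords):
    """Combine keywords that appear adjacently in the original text.

    `text_words` is the tokenized original text; `keywords` is the set
    produced by the keyword-extraction step.
    """
    phrases, current = set(), []
    for word in text_words:
        if word in keywords:
            current.append(word)      # extend the current run of keywords
        else:
            if len(current) > 1:      # a phrase needs at least two keywords
                phrases.add(' '.join(current))
            current = []
    if len(current) > 1:              # flush a run ending at the text's end
        phrases.add(' '.join(current))
    return phrases


tokens = ['a', 'support', 'vector', 'machine', 'learns', 'a', 'decision',
          'boundary']
print(merge_keyphrases(tokens, {'support', 'vector', 'machine'}))
# → {'support vector machine'}
```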

Extracting summaries using TextRank


Each sentence is treated as a node in the graph. If two sentences are similar, the corresponding nodes are connected by an undirected weighted edge, and the weight is the similarity.

The top-ranked sentences computed by the PageRank algorithm can be used as the summary.

The paper uses the following formula to compute the similarity between two sentences S_i and S_j:

Similarity(S_i, S_j) = |{w_k : w_k ∈ S_i and w_k ∈ S_j}| / (log|S_i| + log|S_j|)

The numerator is the number of words that appear in both sentences; |S_i| is the number of words in sentence S_i.
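The similarity measure translates directly into a few lines of Python (an illustrative addition; sentences are represented as word lists, and the example sentences are made up):

```python
import math


def sentence_similarity(s_i, s_j):
    """Sentence similarity from the TextRank paper: the number of shared
    words divided by log(|S_i|) + log(|S_j|)."""
    overlap = len(set(s_i) & set(s_j))
    denom = math.log(len(s_i)) + math.log(len(s_j))
    if denom == 0 or overlap == 0:    # guard one-word sentences / no overlap
        return 0.0
    return overlap / denom


a = ['textrank', 'ranks', 'sentences', 'in', 'a', 'graph']
b = ['the', 'graph', 'connects', 'similar', 'sentences']
print(round(sentence_similarity(a, b), 3))   # 2 shared words → 0.588
```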

Since the graph is weighted, the PageRank formula is modified slightly:

WS(V_i) = (1 - d) + d * Σ_{V_j ∈ In(V_i)} [ w_ji / Σ_{V_k ∈ Out(V_j)} w_jk ] * WS(V_j)

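Putting the pieces together, the summary step can be sketched as follows (an illustrative addition under the assumptions above: sentences as word lists, edge weights given by the paper's similarity, then the weighted PageRank update):

```python
import math


def rank_sentences(sentences, d=0.85, iters=100):
    """Rank sentences (word lists) with weighted PageRank over a
    similarity graph. Illustrative sketch only."""
    n = len(sentences)

    # Edge weights: shared-word count over log(len) + log(len).
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                overlap = len(set(sentences[i]) & set(sentences[j]))
                denom = (math.log(len(sentences[i]))
                         + math.log(len(sentences[j])))
                w[i][j] = overlap / denom if denom > 0 else 0.0

    # Weighted PageRank iteration.
    ws = [1.0] * n
    for _ in range(iters):
        ws = [
            (1 - d) + d * sum(
                w[j][i] / sum(w[j]) * ws[j]
                for j in range(n) if w[j][i] > 0
            )
            for i in range(n)
        ]
    # Indices sorted from most to least important sentence.
    return sorted(range(n), key=lambda i: ws[i], reverse=True)


docs = [
    ['textrank', 'extracts', 'summary', 'sentences'],
    ['sentences', 'are', 'ranked', 'in', 'a', 'graph'],
    ['totally', 'unrelated', 'words', 'here'],
]
print(rank_sentences(docs))      # the unrelated sentence ranks last
```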
Implementing TextRank


Because we wanted to test a variety of situations, I implemented TextRank4ZH, a TextRank library for Chinese text based on Python 2.7. It is available at:

https://github.com/someus/TextRank4ZH

Here is an example:

# -*- encoding:utf-8 -*-
import codecs
from textrank4zh import TextRank4Keyword, TextRank4Sentence

text = codecs.open('./text/01.txt', 'r', 'utf-8').read()

tr4w = TextRank4Keyword(stop_words_file='./stopword.data')  # load stop words
# use part-of-speech filtering, lowercase the text, window size 2
tr4w.train(text=text, speech_tag_filter=True, lower=True, window=2)

print 'Keywords:'
# 20 keywords, each with a minimum length of 1
print '/'.join(tr4w.get_keywords(20, word_min_len=1))

print 'Key phrases:'
# build phrases from the 20 keywords; a phrase must occur at least 2 times in the text
print '/'.join(tr4w.get_keyphrases(keywords_num=20, min_occur_num=2))

tr4s = TextRank4Sentence(stop_words_file='./stopword.data')
# use part-of-speech filtering, lowercase the text,
# and use words_all_filters to compute similarity between sentences
tr4s.train(text=text, speech_tag_filter=True, lower=True, source='all_filters')

print 'Summary:'
print '\n'.join(tr4s.get_key_sentences(num=3))  # the three most important sentences



The results of the run are as follows:

Keywords: media/Gao Yuanyuan/Weibo/Zhao Youting/thanks/Sheenah/show/reporter/new/Beijing/bo/display/join/gift/Zhang Jie/night/Dai/hotel/jacket

Key phrases: Weibo

Summary: Beijing, December 1 (Reporter Zhang Xi) -- On the night of the 30th, Gao Yuanyuan and Zhao Youting held a thank-you banquet in Beijing. Many stars appeared to join in, including the couple Zhang Jie (Weibo) and Sheenah (Weibo), He Jiong (Weibo), Kevin Tsai (Weibo), Tsui Hark, Zhang Kaili, Huang Xuan (Weibo) and others. Gao Yuanyuan, wearing a pink coat, looked shy at the large number of reporters present; Zhao Youting, wearing a cap, was very calm. The two trotted into the elevator and did not accept media interviews. Reporters learned that nearly a hundred guests attended the banquet, many of them Gao Yuanyuan's high school classmates.



In addition, the jieba segmentation library provides a TextRank-based keyword extraction tool, and SnowNLP also implements keyword extraction and summary generation.

