The TextRank algorithm is based on PageRank and is used to generate keywords and summaries for text. The paper is:
Mihalcea R, Tarau P. TextRank: Bringing Order into Texts[C]. Association for Computational Linguistics, 2004.
We start with PageRank.
PageRank
PageRank was first used to calculate the importance of web pages. The whole WWW can be regarded as a directed graph in which each node is a web page. If page A has a link to page B, there is a directed edge from page A to page B.
After the graph has been built, use the following formula:
$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)$$
S(V_i) is the importance (PR value) of page i. d is the damping factor, typically set to 0.85. In(V_i) is the set of pages that link to page i. Out(V_j) is the set of pages that page j links to. |Out(V_j)| is the number of elements in that set.
PageRank applies the formula above over multiple iterations to obtain the result. Initially, the importance of every page can be set to 1. The value on the left of the equals sign is the PR value of page i after an iteration; all PR values on the right of the equals sign are the values from before that iteration.
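The iteration described above can be sketched in Python 3 as follows. This is a minimal illustration rather than a reference implementation: the function name `pagerank`, the dictionary layout and the four-page graph are assumptions made for this example. Each pass computes all new scores from the previous pass's scores only, which is the point made about the left and right sides of the formula.

```python
def pagerank(out_links, d=0.85, iterations=100):
    """Iterate S(Vi) = (1 - d) + d * sum(S(Vj) / |Out(Vj)|) over pages j linking to i."""
    pages = list(out_links)
    # In(Vi): the pages that link to page i.
    in_links = {p: [q for q in pages if p in out_links[q]] for p in pages}
    scores = {p: 1.0 for p in pages}  # every page starts with importance 1
    for _ in range(iterations):
        # The right-hand side reads only the scores from the previous iteration.
        scores = {
            p: (1 - d) + d * sum(scores[q] / len(out_links[q]) for q in in_links[p])
            for p in pages
        }
    return scores


# Hypothetical four-page web: X links to Y and Z, Y to Z, Z to X, W to Z.
example = {"X": {"Y", "Z"}, "Y": {"Z"}, "Z": {"X"}, "W": {"Z"}}
ranks = pagerank(example)
print(sorted(ranks, key=ranks.get, reverse=True))
```

Page W has no incoming links, so its score settles at 1 - d, the minimum the formula allows.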
As an example, consider three pages A, B and C in which B and C each link to A. Intuitively, page A is the most important. The links can be written as the following table:
End \ Start | A | B | C
A | 0 | 1 | 1
B | 0 | 0 | 0
C | 0 | 0 | 0
The column headers are the start nodes and the row labels are the end nodes. If there is a link from the start node to the end node, the corresponding entry is 1.
According to the formula, each column needs to be normalized (each element divided by the sum of the elements in its column). The normalized result is:
End \ Start | A | B | C
A | 0 | 1 | 1
B | 0 | 0 | 0
C | 0 | 0 | 0
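The column normalization can be sketched in Python 3 as below. The helper name `column_normalize` is an assumption for this example, and leaving an all-zero column (a page with no outgoing links) as zeros is one possible choice that matches the table above, not the only way to handle such pages.

```python
def column_normalize(matrix):
    """Divide every entry by the sum of its column; all-zero columns stay zero."""
    size = len(matrix)
    col_sums = [sum(matrix[r][c] for r in range(size)) for c in range(size)]
    return [
        [matrix[r][c] / col_sums[c] if col_sums[c] else 0.0 for c in range(size)]
        for r in range(size)
    ]


# Link table from the example: rows are end nodes, columns are start nodes (A, B, C).
links = [
    [0, 1, 1],
    [0, 0, 0],
    [0, 0, 0],
]
M = column_normalize(links)
print(M)
```

In this example every non-empty column already sums to 1, so normalization leaves the table unchanged.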
The normalized table forms the matrix M. We use MATLAB to iterate 100 times and see the final importance of each page:
M = [0 1 1
     0 0 0
     0 0 0];
PR = [1; 1; 1];
for iter = 1:100
    PR = 0.15 + 0.85*M*PR;  % (1 - d) + d*M*PR with d = 0.85
    disp(iter);
    disp(PR);
end
Run results (earlier iterations omitted):
......
98
    0.4050
    0.1500
    0.1500
99
    0.4050
    0.1500
    0.1500
100
    0.4050
    0.1500
    0.1500
The final PR value of A is 0.4050, and the final PR values of B and C are both 0.1500.
If the edges above are treated as undirected (in effect, bidirectional), then:
M = [0   1 1
     0.5 0 0
     0.5 0 0];
PR = [1; 1; 1];
for iter = 1:100
    PR = 0.15 + 0.85*M*PR;  % (1 - d) + d*M*PR with d = 0.85
    disp(iter);
    disp(PR);
end
Run results (earlier iterations omitted):
......
98
    1.4595
    0.7703
    0.7703
99
    1.4595
    0.7703
    0.7703
100
    1.4595
    0.7703
    0.7703
The relative importance of A, B and C can still be judged: A remains the most important page.
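For readers without MATLAB, the two loops above can be reproduced with a short Python 3 sketch. The function name `iterate_pr` is an assumption for this example; the matrices are the directed and undirected versions of M used above, and plain lists are used so that no third-party library is needed.

```python
def iterate_pr(M, d=0.85, iterations=100):
    """Repeat PR = (1 - d) + d * M * PR, starting from a vector of ones."""
    n = len(M)
    pr = [1.0] * n
    for _ in range(iterations):
        pr = [(1 - d) + d * sum(M[i][j] * pr[j] for j in range(n)) for i in range(n)]
    return pr


directed = [[0, 1, 1], [0, 0, 0], [0, 0, 0]]
undirected = [[0, 1, 1], [0.5, 0, 0], [0.5, 0, 0]]

# Rounded to four decimals so the values can be compared with the MATLAB runs.
print([round(x, 4) for x in iterate_pr(directed)])
print([round(x, 4) for x in iterate_pr(undirected)])
```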
Extracting keywords using TextRank
Split the original text into sentences, filter out the stop words in each sentence (optional), and keep only words with the specified parts of speech (optional). This yields a set of sentences and a set of words.
Each word acts as a node in PageRank. Let the window size be k, and suppose a sentence consists of the following words in order:
w1, w2, w3, w4, w5, ..., wn
Then w1, w2, ..., wk; w2, w3, ..., wk+1; w3, w4, ..., wk+2; and so on are each a window. An undirected, unweighted edge connects the nodes of any two words that fall in the same window.
Based on the graph constructed above, the importance of each word node can be calculated. The most important words can then be used as keywords.
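The window-based graph construction can be sketched in Python 3 as below. The function name `build_word_graph`, the adjacency-set representation and the sample sentence are assumptions for illustration; this is not the code of any particular library. Sentences are assumed to be already tokenized and filtered.

```python
from itertools import combinations


def build_word_graph(sentences, window=2):
    """Return an undirected, unweighted graph as {word: set of neighbouring words}."""
    graph = {}
    for words in sentences:
        for w in words:
            graph.setdefault(w, set())
        # Slide a window of `window` consecutive words over the sentence.
        for start in range(max(len(words) - window + 1, 1)):
            for a, b in combinations(words[start:start + window], 2):
                if a != b:
                    graph[a].add(b)
                    graph[b].add(a)
    return graph


# Hypothetical filtered sentence w1..w5 with window size k = 3.
graph = build_word_graph([["w1", "w2", "w3", "w4", "w5"]], window=3)
for word in sorted(graph):
    print(word, sorted(graph[word]))
```

To score the words, each undirected edge can be treated as a pair of directed edges, as in the undirected PageRank example earlier.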
Extracting key phrases using TextRank
Use the method in "Extracting keywords using TextRank" to extract several keywords. If some of these keywords are adjacent to each other in the original text, they can be combined into a key phrase.
For example, in an article about support vector machines, the three keywords support, vector and machine may be found; key phrase extraction then yields support vector machine.
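A minimal Python 3 sketch of this merging step is shown below. The function name `extract_keyphrases` and the sample tokens are assumptions for this example. The `min_occur_num` parameter only borrows its name from the library call shown later and is not claimed to match that library's implementation. Phrases are joined with a space here, which suits English; for Chinese text an empty separator would be the natural choice.

```python
def extract_keyphrases(tokens, keywords, min_occur_num=1):
    """Join runs of adjacent keywords in `tokens` into candidate phrases."""
    keywords = set(keywords)
    counts = {}
    run = []
    for token in tokens + [None]:  # None is a sentinel that flushes the final run
        if token in keywords:
            run.append(token)
            continue
        if len(run) > 1:  # a phrase needs at least two adjacent keywords
            phrase = " ".join(run)
            counts[phrase] = counts.get(phrase, 0) + 1
        run = []
    # Keep only phrases that occur often enough in the text.
    return [p for p, n in counts.items() if n >= min_occur_num]


# Hypothetical tokenized text and keyword set, for illustration only.
tokens = "a support vector machine is a model and the support vector machine works".split()
phrases = extract_keyphrases(tokens, {"support", "vector", "machine"}, min_occur_num=2)
print(phrases)
```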
Extracting summaries using TextRank
Each sentence is treated as a node in the graph. If two sentences are similar, an undirected weighted edge is placed between the corresponding nodes, and its weight is the similarity.
The most important sentences, as calculated by the PageRank algorithm, can be used as the summary.
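The paper uses the following formula to calculate the similarity between two sentences S_i and S_j:
$$Similarity(S_i, S_j) = \frac{|\{w_k \mid w_k \in S_i \land w_k \in S_j\}|}{\log(|S_i|) + \log(|S_j|)}$$
The numerator is the number of words that appear in both sentences. |S_i| is the number of words in sentence i.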
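The similarity measure can be sketched in Python 3 as follows, assuming the denominator from the TextRank paper, log(|S_i|) + log(|S_j|). The function name `sentence_similarity` is an assumption for this example, as are two choices the text does not spell out: shared words are counted as distinct words, and degenerate cases (an empty sentence, or a zero denominator) return 0.

```python
import math


def sentence_similarity(words_i, words_j):
    """Shared-word count divided by log(|Si|) + log(|Sj|)."""
    if not words_i or not words_j:
        return 0.0  # log(0) is undefined, so empty sentences get similarity 0
    shared = len(set(words_i) & set(words_j))  # distinct words in both sentences
    denominator = math.log(len(words_i)) + math.log(len(words_j))
    if shared == 0 or denominator <= 0:
        return 0.0  # no overlap, or both sentences contain a single word
    return shared / denominator


# Hypothetical tokenized sentences, for illustration only.
s1 = ["textrank", "ranks", "words"]
s2 = ["textrank", "ranks", "sentences", "too"]
print(sentence_similarity(s1, s2))
```

The logarithms in the denominator keep long sentences from scoring high simply because they contain more words.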
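Since this is a weighted graph, the PageRank formula is modified slightly:
$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)$$
Here w_{ji} is the weight of the edge from node j to node i.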
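A weighted variant of the iteration can be sketched in Python 3 as below, assuming the weighted formula from the TextRank paper, in which each neighbour's score is shared out in proportion to edge weight. The function name `weighted_pagerank`, the nested-dictionary layout and the similarity values are assumptions made for this example.

```python
def weighted_pagerank(weights, d=0.85, iterations=100):
    """weights[j][i] is the weight of the edge from node j to node i."""
    nodes = list(weights)
    # Total outgoing weight of each node, used to normalize its contributions.
    out_sum = {j: sum(weights[j].values()) for j in nodes}
    scores = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        scores = {
            i: (1 - d) + d * sum(
                weights[j][i] / out_sum[j] * scores[j]
                for j in nodes
                if weights[j].get(i, 0) > 0
            )
            for i in nodes
        }
    return scores


# Hypothetical similarities between three sentences; undirected, so stored both ways.
sim = {
    "s1": {"s2": 0.8, "s3": 0.2},
    "s2": {"s1": 0.8, "s3": 0.4},
    "s3": {"s1": 0.2, "s2": 0.4},
}
scores = weighted_pagerank(sim)
print(sorted(scores, key=scores.get, reverse=True))
```

The highest-scoring sentences would then be taken as the summary.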
Implementing TextRank
Because I wanted to test a variety of settings, I implemented TextRank4ZH, a Python 2.7 library that applies TextRank to Chinese text. It is available at:
https://github.com/someus/TextRank4ZH
Here is an example:
#-*- encoding:utf-8 -*-
import codecs
from textrank4zh import TextRank4Keyword, TextRank4Sentence

text = codecs.open('./text/01.txt', 'r', 'utf-8').read()
tr4w = TextRank4Keyword(stop_words_file='./stopword.data')  # load the stop words

# Use part-of-speech filtering, lowercase the text, window size 2
tr4w.train(text=text, speech_tag_filter=True, lower=True, window=2)

print 'Keywords:'
# 20 keywords, each with a minimum length of 1
print '/'.join(tr4w.get_keywords(word_min_len=1))

print 'Key phrases:'
# Build phrases from 20 keywords; a phrase must occur at least 2 times in the original text
print '/'.join(tr4w.get_keyphrases(keywords_num=20, min_occur_num=2))

tr4s = TextRank4Sentence(stop_words_file='./stopword.data')

# Use part-of-speech filtering, lowercase the text, use words_all_filters to compute the similarity between sentences
tr4s.train(text=text, speech_tag_filter=True, lower=True, source='all_filters')

print 'Summary:'
print '\n'.join(tr4s.get_key_sentences(num=3))  # the three most important sentences
The output is as follows:
Keywords:
media/high circle/micro/guest/Zhao Youting/Thank you/Sheenah/show/reporter/New/beijing/BO/display/join/gift/Zhang Jie/night/Dai/hotel/jacket
Key phrases:
Weibo
Summary:
Beijing, December 1, Xinhua (Reporter Zhang Xi) 30th night, high round and Zhao Ting in Beijing to hold a thank feast, Many stars appear to join in, including Zhang Jie (Weibo), Sheenah (Weibo) couple, he Jiong (Weibo), Kevin Tsai (Weibo), Tsui Hark, Zhang Kaili, Huang Xuan (micro bo) and other
high-round wearing a pink coat, see a large number of reporters present shy look, Zhao and Ting wearing a cap, very calm, two people trot into the elevator, Did not accept the media interview
reporters learned that the high-round, Zhao and the feast of appreciation of the guests nearly hundred people, many of whom are the female high school classmates
In addition, the jieba word segmentation library provides a TextRank-based keyword extraction tool. SnowNLP also implements keyword extraction and summary generation.