Analysis of the HanLP keyword extraction algorithm
- Reference paper: "TextRank: Bringing Order into Texts"
- Java implementation of keyword extraction with the TextRank algorithm
- Java implementation of automatic summarization with the TextRank algorithm (the author of that article also explains the TextRank formula)
1. Paper
In this paper, we introduce the TextRank graph-based ranking model for graphs extracted from natural language texts.
TextRank is an unsupervised algorithm that builds a graph from a text, treating the units of interest (such as the words produced by word segmentation) as vertices, and then applies graph-based ranking to extract information from the text.
Such keywords may constitute useful entries for building an automatic index for a document collection, can be used to classify a text, or may serve as a concise summary for a given document.
The extracted keywords can be used for text classification or to summarize the central idea of a text.
TextRank extracts keywords iteratively: each iteration rescores every vertex in the graph, and the iteration stops when a condition is met (for example, after 200 iterations, or when the change in scores falls below a threshold).
For loosely connected graphs, with the number of edges proportional with the number of vertices, undirected graphs tend to have more gradual convergence curves.
That is, for sparse graphs whose number of edges is linear in the number of vertices, undirected graphs show a more gradual convergence curve (in other words, slower convergence).
It may be therefore useful to indicate and incorporate into the model the "strength" of the connection between two vertices $V_i$ and $V_j$ as a weight $w_{ij}$ added to the corresponding edge that connects the two vertices.
Sometimes the relationships between vertices are not all equal: some vertices are more closely related than others. Edge weights measure the strength of the relationship between two vertices, which gives the weighted graph model.
2. Source code implementation
2.1 Keyword extraction process
Given a number of sentences, the task is to extract keywords. TextRank is a graph-based ranking model, so a graph must be constructed first, which means deciding what its vertices are: each sentence is segmented into words, and each word becomes a vertex of the graph.
When selecting words as vertices, some filtering rules can be applied: for example, removing stop words from the segmentation result, or adding vertices according to part of speech (such as using only nouns and verbs as vertices).
The vertices added to the graph can be restricted with syntactic filters, which select only lexical units of a certain part of speech. One can for instance consider only nouns and verbs for addition to the graph, and consequently draw potential edges based only on relations that can be established between nouns and verbs.
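The filtering rules above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: the `Token` class, the tag strings ("n", "v", "det") and the `selectVertices` method are hypothetical names introduced here, not part of HanLP's API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class VertexFilterSketch {
    /** Hypothetical token type: a word plus a coarse part-of-speech tag. */
    static final class Token {
        final String word;
        final String pos;

        Token(String word, String pos) {
            this.word = word;
            this.pos = pos;
        }
    }

    /** Keeps a word only if its tag is allowed and it is not a stop word. */
    static List<String> selectVertices(List<Token> tokens,
                                       Set<String> allowedPos,
                                       Set<String> stopWords) {
        List<String> vertices = new ArrayList<>();
        for (Token t : tokens) {
            if (allowedPos.contains(t.pos) && !stopWords.contains(t.word)) {
                vertices.add(t.word);
            }
        }
        return vertices;
    }

    public static void main(String[] args) {
        List<Token> tokens = Arrays.asList(
                new Token("programmer", "n"),
                new Token("is", "v"),
                new Token("a", "det"),
                new Token("professional", "n"));
        Set<String> allowedPos = new HashSet<>(Arrays.asList("n", "v"));
        Set<String> stopWords = new HashSet<>(Arrays.asList("is", "a"));
        // Prints [programmer, professional]
        System.out.println(selectVertices(tokens, allowedPos, stopWords));
    }
}
```

The order of the input is preserved, which matters later because edges are drawn between words that are close to each other in the sequence.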
Once the vertices are chosen, the next step is to determine the relationships between words, namely which vertices are joined by an edge. For example, with a window size set, an edge is added between every pair of words that fall within the same window.
it is the application that dictates the type of relations that are used to draw connections between any two such vertices,
After the edges are determined, the next step is to determine the edge weights, which also depends on the application.
2.2 Determining the adjacent vertices of a word based on window size
As mentioned earlier, word segmentation turns the sentences into a sequence of words, also called terms. Assume the window size is 5. This section explains how the Java implementation of TextRank keyword extraction determines which terms are adjacent to a given term.
For example, the term 'programmer' appears in several sentences, at four positions in total:
At index 0, the adjacent vertices of 'programmer' are:
English, programmer, engaging, procedure
At index 9, the adjacent vertices of 'programmer' are:
Development, maintenance, professional, personnel, divide, program, design, personnel
At index 26, the adjacent vertices of 'programmer' are:
China, software, practitioners, divided, senior, programmer, System analyst, project Manager
At index 28, the adjacent vertices of 'programmer' are:
Practitioners, divided into, programmer, senior, System analyst, project manager, four
Merging the words from all four windows gives the full set of adjacent vertices for 'programmer'.
Therefore, when the window size is set to 5, the four terms before and the four terms after a term are regarded as its adjacent vertices. When the term appears multiple times, the four preceding and four following terms at each occurrence are merged to form its final set of adjacent vertices.
This shows that a term appearing more than once tends to be more important: it has more adjacent vertices, so more terms vote for it. This is somewhat similar to using term frequency to measure the importance of a term.
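A minimal sketch of this neighbor merging, assuming a queue that holds the most recent terms. The class name `WindowGraphSketch`, the method `buildGraph` and its `windowSize` parameter are illustrative choices, not HanLP's API. A window of 3 is used so the result is easy to check by hand.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class WindowGraphSketch {
    /**
     * Builds an undirected graph in which each term is linked to the
     * (windowSize - 1) terms before it and, by symmetry, after it.
     */
    static Map<String, Set<String>> buildGraph(List<String> wordList, int windowSize) {
        Map<String, Set<String>> words = new TreeMap<>();
        Queue<String> que = new ArrayDeque<>();
        for (String w : wordList) {
            if (!words.containsKey(w)) {
                words.put(w, new TreeSet<String>());
            }
            // Drop the oldest term so at most windowSize - 1 predecessors remain.
            if (que.size() >= windowSize) {
                que.poll();
            }
            for (String qWord : que) {
                if (w.equals(qWord)) {
                    continue; // no self-loops
                }
                // Adjacency is mutual, so both directions are recorded.
                words.get(w).add(qWord);
                words.get(qWord).add(w);
            }
            que.offer(w);
        }
        return words;
    }

    public static void main(String[] args) {
        // "a" occurs twice; its neighbor sets from both positions are merged.
        List<String> wordList = Arrays.asList("a", "b", "c", "d", "e", "f", "a");
        // Prints [b, c, e, f]
        System.out.println(buildGraph(wordList, 3).get("a"));
    }
}
```

The first occurrence of "a" contributes "b" and "c", the second contributes "e" and "f", so the repeated term ends up with as many neighbors as a term in the middle of the text.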
2.3 Score update algorithm
    m.put(key, m.get(key) + d / size * (score.get(element) == null ? 0 : score.get(element)));
Interpretation of the code:
m.get(key): on the first pass through for (String element : value), it returns 1-d, the first half of the formula. On later passes it returns the running total, so the for loop is equivalent to the summation \(\sum_{V_j \in In(V_i)}\):
    for (String element : value) {
        int size = words.get(element).size();
        m.put(key, m.get(key) + d / size * (score.get(element) == null ? 0 : score.get(element)));
    }
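Take the sentence "he said it's true" as an example. With a window size of 5, word segmentation and stop-word removal leave three words: 'reason', 'true' and 'say'.
The resulting graph has these three words as vertices, each connected to the other two, and every edge has weight 1.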
Take the vertex 'reason' as an example to see how its score is updated. In for (String element : value), two vertices vote for 'reason'. The first is 'true', which has two adjacent vertices, so int size = words.get(element).size(); gives size=2. Next, let's break down this line of code:
    m.put(key, m.get(key) + d / size * (score.get(element) == null ? 0 : score.get(element)));
m.get(key) is 1-d, because the outer for loop has already stored the first half of the formula, (1-d), with m.put(key, 1 - d).
score.get(element) == null ? 0 : score.get(element) fetches the result of the previous iteration. In the first iteration, score.get(element) is 0.8807971, the initial score of each vertex in this example (sigMoid(2), since every vertex has two adjacent vertices):
    // Set the initial value based on TF; words represents an undirected graph
    for (Map.Entry<String, Set<String>> entry : words.entrySet()) {
        score.put(entry.getKey(), sigMoid(entry.getValue().size())); // initialize the score of each vertex of the undirected graph
    }
score.get(element) is equivalent to \(WS(V_j)\) in the formula.
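The initial value 0.8807971 can be reproduced with a short sketch. It assumes that sigMoid is the standard logistic function 1 / (1 + e^(-x)), which matches the number quoted above; the class and method names are illustrative only.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class InitialScoreSketch {
    /** Assumed definition: the standard logistic function. */
    static float sigmoid(float x) {
        return (float) (1d / (1d + Math.exp(-x)));
    }

    /** Initial score of a vertex = sigmoid(number of adjacent vertices). */
    static Map<String, Float> initialScores(Map<String, Set<String>> words) {
        Map<String, Float> score = new HashMap<>();
        for (Map.Entry<String, Set<String>> entry : words.entrySet()) {
            score.put(entry.getKey(), sigmoid(entry.getValue().size()));
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> words = new TreeMap<>();
        words.put("reason", new TreeSet<>(Arrays.asList("true", "say")));
        words.put("true", new TreeSet<>(Arrays.asList("reason", "say")));
        words.put("say", new TreeSet<>(Arrays.asList("reason", "true")));
        // Every vertex has two neighbors, so every initial score is sigmoid(2).
        System.out.println(initialScores(words));
    }
}
```

Because the logistic function is increasing, a vertex with more adjacent vertices starts with a higher score, although all initial scores stay between 0.5 and 1.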
- Finally, consider size, which is obtained by int size = words.get(element).size(). Because each edge weight is 1, size is equivalent to \(\sum_{V_k \in Out(V_j)} w_{jk}\).
In('reason') = {'true', 'say'}
When \(V_j\) is 'true', \(Out(V_j)\) = {'say', 'reason'}, so \(\sum_{V_k \in Out(V_j)} w_{jk} = 2\). With d = 0.85, the score of vertex 'reason' is updated to \(1-d+d*\frac{1}{2}*0.8807971=0.5243387\). This temporary result is then saved through m.put.
Next, for (String element : value) continues. Now \(V_j\) is the vertex 'say', which also has two adjacent edges, so again \(\sum_{V_k \in Out(V_j)} w_{jk} = 2\). The score of vertex 'reason' becomes \(0.5243387+d*\frac{1}{2}*0.8807971=0.89867747\), which is the score of vertex 'reason' after the first iteration.
As steps 1 and 2 above show, for (String element : value) is equivalent to \(\sum_{V_j \in In(V_i)}\), because each partial result is put back into the HashMap m.
So, after the first iteration, the vertex 'reason' scores 0.89867747.
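The arithmetic of this first update can be checked with a small sketch. It assumes d = 0.85 (the value consistent with the figures above) and labels the three vertices 'reason', 'true' and 'say'. The class `SingleUpdateSketch` and its methods are hypothetical helpers written for this walkthrough.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class SingleUpdateSketch {
    /** The three-vertex example graph: every vertex is adjacent to the other two. */
    static Map<String, Set<String>> exampleGraph() {
        Map<String, Set<String>> words = new TreeMap<>();
        words.put("reason", new TreeSet<>(Arrays.asList("true", "say")));
        words.put("true", new TreeSet<>(Arrays.asList("reason", "say")));
        words.put("say", new TreeSet<>(Arrays.asList("reason", "true")));
        return words;
    }

    /** One update of a single vertex, using the scores of the previous iteration. */
    static float updateScore(String key, Map<String, Set<String>> words,
                             Map<String, Float> score, float d) {
        float m = 1 - d; // first half of the formula
        for (String element : words.get(key)) {
            int size = words.get(element).size(); // out-degree of the voter
            Float previous = score.get(element);
            m += d / size * (previous == null ? 0 : previous);
        }
        return m;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> words = exampleGraph();
        Map<String, Float> score = new HashMap<>();
        for (String key : words.keySet()) {
            score.put(key, 0.8807971f); // initial value quoted in the text
        }
        // Approximately 0.8986775 (subject to float rounding)
        System.out.println(updateScore("reason", words, score, 0.85f));
    }
}
```

Each of the two voters contributes 0.85 / 2 * 0.8807971, which is about 0.3743388, on top of the base value 0.15.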
The other vertices are updated in the same way, and the iteration stops after max_iter iterations or once the threshold is reached:
if (max_diff <= min_diff) break;
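Putting the update and the stopping rule together gives roughly the following loop. This is a sketch rather than HanLP's exact code: the method signature, the `completeGraph` helper and the value min_diff = 0.001 used in `main` are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

class TextRankLoopSketch {
    /** Helper for the demo: a graph in which every vertex is adjacent to all others. */
    static Map<String, Set<String>> completeGraph(String... names) {
        Map<String, Set<String>> words = new TreeMap<>();
        for (String name : names) {
            Set<String> neighbours = new TreeSet<>(Arrays.asList(names));
            neighbours.remove(name);
            words.put(name, neighbours);
        }
        return words;
    }

    /** Iterates until maxIter is reached or the largest score change is <= minDiff. */
    static Map<String, Float> rank(Map<String, Set<String>> words, Map<String, Float> initial,
                                   float d, int maxIter, float minDiff) {
        Map<String, Float> score = new HashMap<>(initial);
        for (int i = 0; i < maxIter; ++i) {
            Map<String, Float> m = new HashMap<>();
            float maxDiff = 0;
            for (Map.Entry<String, Set<String>> entry : words.entrySet()) {
                String key = entry.getKey();
                float sum = 1 - d;
                for (String element : entry.getValue()) {
                    int size = words.get(element).size();
                    if (size == 0) {
                        continue; // defensive: a voter with no edges casts no vote
                    }
                    sum += d / size * score.getOrDefault(element, 0f);
                }
                m.put(key, sum);
                maxDiff = Math.max(maxDiff, Math.abs(sum - score.getOrDefault(key, 0f)));
            }
            score = m; // all vertices are updated from the previous iteration's scores
            if (maxDiff <= minDiff) {
                break;
            }
        }
        return score;
    }

    public static void main(String[] args) {
        Map<String, Set<String>> words = completeGraph("reason", "true", "say");
        Map<String, Float> initial = new HashMap<>();
        for (String key : words.keySet()) {
            initial.put(key, 0.8807971f);
        }
        System.out.println(rank(words, initial, 0.85f, 200, 0.001f));
    }
}
```

On this symmetric graph the update for every vertex is s = 0.15 + 0.85 * s, whose fixed point is 1, so all three scores approach 1 and the loop stops well before 200 iterations.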
Here's a general description of the code:
The undirected graph is constructed as follows:
    for (String w : wordList) {
        if (!words.containsKey(w)) {
            // Skip terms already seen in wordList; for each new term, store its adjacent vertices in a TreeSet<String>
            words.put(w, new TreeSet<String>());
        }
        // Complexity O(n-1)
        if (que.size() >= 5) {
            // The window size of 5 is hard-coded. For a term_a, the 4 terms before it and the 4 terms after it are its adjacent vertices
            que.poll();
        }
        for (String qWord : que) {
            if (w.equals(qWord)) {
                continue;
            }
            // Adjacency is mutual, so a single traversal is enough
            words.get(w).add(qWord);
            words.get(qWord).add(w);
        }
        que.offer(w);
    }
Each vertex in the graph is then assigned an initial score:
    Map<String, Float> score = new HashMap<String, Float>(); // holds the final score of each keyword
    // Set the initial value based on TF; words represents an undirected graph
    for (Map.Entry<String, Set<String>> entry : words.entrySet()) {
        score.put(entry.getKey(), sigMoid(entry.getValue().size())); // initialize the score of each vertex of the undirected graph
    }
Next come three nested for loops. The first loop counts the iterations. The second loop computes the score of each vertex in the undirected graph. The third loop, for a specific vertex, adds up the voting weight of each of its adjacent vertices.
    for (int i = 0; i < max_iter; ++i) {
        //....
        for (Map.Entry<String, Set<String>> entry : words.entrySet()) {
            //...
            for (String element : value) {
                //...
            }
        }
    }
In this way, the formula in the paper is implemented:
\[WS(V_i) = (1-d) + d * \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} * WS(V_j)\]
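Since every edge weight is 1 in the code discussed here, the formula can be written in a simpler form, where \(|Out(V_j)|\) is the number of adjacent vertices of \(V_j\) (the variable size in the code). The following is a worked restatement under that assumption, not a formula taken from the paper:

\[WS(V_i) = (1-d) + d * \sum_{V_j \in In(V_i)} \frac{WS(V_j)}{|Out(V_j)|}\]

For the vertex 'reason' in the three-word example, with d = 0.85 and both voters starting at 0.8807971:

\[WS(V_i) = 0.15 + 0.85 * \left(\frac{0.8807971}{2} + \frac{0.8807971}{2}\right) \approx 0.8986775\]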
And the extracted keywords are:
[reason, true, say]
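Turning the final scores into a keyword list amounts to sorting the vertices by score and keeping the first few. A minimal sketch follows; the `topKeywords` name and its signature are hypothetical, not HanLP's API, and the scores in `main` are made-up values chosen only to show the ordering.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopKeywordsSketch {
    /** Returns up to n keys, ordered by descending score. */
    static List<String> topKeywords(Map<String, Float> score, int n) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(score.entrySet());
        entries.sort((a, b) -> Float.compare(b.getValue(), a.getValue()));
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Float> score = new HashMap<>();
        score.put("a", 0.5f); // illustrative scores, not real output
        score.put("b", 0.9f);
        score.put("c", 0.7f);
        // Prints [b, c]
        System.out.println(topKeywords(score, 2));
    }
}
```

In the three-word example all vertices receive the same score by symmetry, so the order of the extracted keywords there depends only on how ties are broken.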
This example uses only the phrase "he said it's true" to demonstrate the details of the TextRank algorithm, which may be unrealistic in practice, because:
- With so little statistical information, TextRank cannot reliably establish the importance of a word; the algorithm has limitations.
As shown, the keywords TextRank extracts are affected by the word segmentation results and also by the window size. Although the code is broadly understandable, some questions remain. Why are the words that score highly under the formula above the keywords? What advantage do TextRank keywords have over keywords found by term frequency? Which words in the text should be chosen as vertices when building the model? Which relationships in the text should be used as edges?
Original: https://www.cnblogs.com/hapjin/p/9157515.html