When it comes to automatic summarization algorithms, the most common and the easiest to implement is TF-IDF, but in my experience TF-IDF gives mediocre results and is not as good as TextRank.
TextRank is inspired by Google's PageRank algorithm. It is a weighting algorithm designed for the sentences of a text, aimed at automatic summarization. It works on the principle of voting: each word casts an approving vote for its neighbors (known technically as the window), and the weight of a vote depends on how many votes the voter itself holds. This is a "which came first, the chicken or the egg" paradox, which PageRank resolves by iterating a matrix computation until it converges. TextRank is no exception:
PageRank formula:
$$S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} S(V_j)$$
Here S(V_i) is the score of node V_i, d is the damping factor, In(V_i) is the set of nodes that link to V_i, and Out(V_j) is the set of nodes that V_j links to.
The formal TextRank formula
The formal TextRank formula builds on the PageRank formula by introducing edge weights, which represent the similarity between two sentences:
$$WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} WS(V_j)$$
Here w_{ji} is the weight of the edge from V_j to V_i; d, In and Out have the same meaning as in PageRank.
But obviously I only want to extract keywords. If each word is treated as a sentence, then no two sentences (words) share anything (no intersection, no similarity), so the edge weights carry no information and every edge counts the same. The weights w in the numerator and denominator then cancel, and the algorithm degenerates to PageRank. So one could say that the keyword extraction algorithm is nothing more than PageRank.
In addition, if you want to extract key sentences (automatic summarization), please refer to the companion article "Java implementation of automatic summarization with the TextRank algorithm."
Java implementation of TextRank
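The cancellation can be written out in a few lines. This is only a sketch: it assumes every edge carries the same positive weight c, a symbol introduced here for illustration that does not appear in the original formulas.

```latex
\begin{aligned}
WS(V_i) &= (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j) \\
        &= (1 - d) + d \sum_{V_j \in In(V_i)} \frac{c}{c \, |Out(V_j)|} \, WS(V_j) \\
        &= (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} \, WS(V_j)
\end{aligned}
```

The last line has exactly the shape of the PageRank update, with WS playing the role of S.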
First look at the test data:
A programmer (English: Programmer) is a professional engaged in program development and maintenance. Programmers are generally divided into program designers and program coders, but the boundaries between the two are not very clear, especially in China. Software practitioners are divided into four categories: junior programmer, senior programmer, system analyst and project manager.
I took the Baidu Baike definition of "programmer" as the test case. Clearly the keyword of this definition should be "programmer", so "programmer" should score the highest.
First, segment this text into words. Any of several word segmentation projects can help here, for example the ANSJ segmenter, which gives this result:
[programmer/n, (, English/nz, programmer/en, ), is/v, engaged in/v, program/n, development/v, ,/w, maintenance/v, /uj, professional/n, personnel/n, ./w, generally/a, will/d, programmer/n, divided into/v, program/n, design/vn, personnel/n, and/c, program/n, coding/n, personnel/n, ,/w, but/c, both/r, /uj, boundaries/n, and/c, not/d, very/d, clear/a, ,/w, especially/d, is/v, in/p, China/ns, ./w, software/n, practitioners/b, personnel/n, divided into/v, junior/b, programmer/n, ,/w, senior/a, programmer/n, ,/w, system/n, analyst/n, and/c, project/n, manager/n, four/m, big/a, class/q, ./w]
Then remove the stop words. Here I removed punctuation, common words, and every word that is not a noun, verb, adjective or adverb. That leaves the words that are actually useful:
[programmer, English, program, development, maintenance, professional, personnel, programmer, divided into, program, design, personnel, program, coding, personnel, boundaries, especially, China, software, personnel, divided into, programmer, senior, programmer, system, analyst, project, manager]
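The filtering step can be sketched roughly as below. This is not the author's code: the class PosFilter, the method keep and the tag whitelist are assumptions made for illustration. It expects tokens in the word/tag form shown in the segmentation result and treats tags starting with n, v, a or d as noun, verb, adjective and adverb.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

class PosFilter {
    // Assumed whitelist: first letter of the tags for nouns, verbs, adjectives, adverbs.
    private static final Set<Character> KEPT = Set.of('n', 'v', 'a', 'd');

    // Keeps the word part of each "word/tag" token whose tag is whitelisted
    // and whose word is not in the caller-supplied stop word set.
    static List<String> keep(List<String> tagged, Set<String> stopWords) {
        List<String> result = new ArrayList<>();
        for (String item : tagged) {
            int slash = item.lastIndexOf('/');
            // Skip tokens with no word or no tag, e.g. a bare "(".
            if (slash <= 0 || slash == item.length() - 1) continue;
            String word = item.substring(0, slash);
            char tag = Character.toLowerCase(item.charAt(slash + 1));
            if (KEPT.contains(tag) && !stopWords.contains(word)) {
                result.add(word);
            }
        }
        return result;
    }
}
```

Punctuation (tag w), conjunctions (tag c) and the like fall out because their tags are not whitelisted; common words such as "is" have to be listed in the stop word set.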
After that, create two windows of size 5, one in front of each word and one behind it. Each word casts its votes for the words that fall inside these windows:
{development=[professional, programmer, maintenance, English, program, personnel],
software=[programmer, divided into, boundaries, senior, China, especially, personnel],
programmer=[development, software, analyst, maintenance, system, project, manager, divided into, English, program, professional, design, senior, personnel, China],
analyst=[programmer, system, project, manager, senior],
maintenance=[professional, development, programmer, divided into, English, program, personnel],
system=[programmer, analyst, project, manager, divided into, senior],
project=[programmer, analyst, system, manager, senior],
manager=[programmer, analyst, system, project],
divided into=[professional, software, design, programmer, maintenance, system, senior, program, China, especially, personnel],
English=[professional, development, programmer, maintenance, program],
program=[professional, development, design, programmer, coding, maintenance, boundaries, divided into, English, especially, personnel],
especially=[software, coding, divided into, boundaries, program, China, personnel],
professional=[development, programmer, maintenance, divided into, English, program, personnel],
design=[programmer, coding, divided into, program, personnel],
coding=[design, boundaries, program, China, especially, personnel],
boundaries=[software, coding, program, China, especially, personnel],
senior=[programmer, software, analyst, system, project, divided into, personnel],
China=[programmer, software, coding, divided into, boundaries, especially, personnel],
personnel=[development, programmer, software, maintenance, divided into, program, especially, professional, design, coding, boundaries, senior, China]}
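The window voting above can be sketched as follows. This is a minimal sketch, not the author's code: the class WindowGraph and the method build are names made up for illustration. It assumes the window of size 5 counts the word itself, so two words are linked when their positions differ by at most windowSize - 1. The entry for the last word, manager, which has only four neighbors, suggests this reading.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class WindowGraph {
    // Links every pair of distinct words that co-occur inside a run of
    // windowSize consecutive words. Edges are added in both directions.
    static Map<String, Set<String>> build(List<String> words, int windowSize) {
        Map<String, Set<String>> graph = new LinkedHashMap<>();
        for (String w : words) {
            graph.putIfAbsent(w, new LinkedHashSet<>());
        }
        for (int i = 0; i < words.size(); i++) {
            int end = Math.min(words.size(), i + windowSize);
            for (int j = i + 1; j < end; j++) {
                String a = words.get(i);
                String b = words.get(j);
                if (a.equals(b)) continue; // a word does not vote for itself
                graph.get(a).add(b);
                graph.get(b).add(a);
            }
        }
        return graph;
    }
}
```

Calling WindowGraph.build(filteredWords, 5) on the filtered word list would return a map from each word to the set of words it votes for.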
Then start the iterative voting:
// words maps each word to the set of words it votes for; score holds the current scores.
for (int i = 0; i < max_iter; ++i)
{
    Map<String, Float> m = new HashMap<String, Float>();
    float max_diff = 0;
    for (Map.Entry<String, Set<String>> entry : words.entrySet())
    {
        String key = entry.getKey();
        Set<String> value = entry.getValue();
        // Every word starts the round with the base score (1 - d).
        m.put(key, 1 - d);
        for (String other : value)
        {
            int size = words.get(other).size();
            if (key.equals(other) || size == 0) continue;
            // The neighbor splits its score evenly among all the words it votes for.
            m.put(key, m.get(key) + d / size * (score.get(other) == null ? 0 : score.get(other)));
        }
        // Track the largest score change in this round.
        max_diff = Math.max(max_diff, Math.abs(m.get(key) - (score.get(key) == null ? 0 : score.get(key))));
    }
    score = m;
    // Stop as soon as the scores have stabilized.
    if (max_diff <= min_diff) break;
}
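The loop above relies on variables declared elsewhere (words, score, d, max_iter, min_diff). A self-contained version might look like the sketch below. The class name TextRankScores and the three constant values are assumptions chosen for illustration, since the article does not state them.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

class TextRankScores {
    static final float D = 0.85f;         // assumed damping factor
    static final int MAX_ITER = 200;      // assumed iteration cap
    static final float MIN_DIFF = 0.001f; // assumed convergence threshold

    // graph maps each word to the set of words it votes for.
    static Map<String, Float> rank(Map<String, Set<String>> graph) {
        Map<String, Float> score = new HashMap<>();
        for (int i = 0; i < MAX_ITER; i++) {
            Map<String, Float> next = new HashMap<>();
            float maxDiff = 0;
            for (Map.Entry<String, Set<String>> e : graph.entrySet()) {
                String key = e.getKey();
                float s = 1 - D;
                for (String other : e.getValue()) {
                    Set<String> out = graph.get(other);
                    if (key.equals(other) || out == null || out.isEmpty()) continue;
                    // Each neighbor shares its previous score evenly among its votes.
                    s += D / out.size() * score.getOrDefault(other, 0f);
                }
                next.put(key, s);
                maxDiff = Math.max(maxDiff, Math.abs(s - score.getOrDefault(key, 0f)));
            }
            score = next;
            if (maxDiff <= MIN_DIFF) break;
        }
        return score;
    }
}
```

On a star-shaped graph, where one hub word is linked to every other word, the hub ends up with the highest score, which is the behavior expected of "programmer" in the test data.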
Voting results after sorting:
[programmer=1.9249977,
personnel=1.6290349,
divided into=1.4027836,
program=1.4025855,
senior=0.9747374,
software=0.93525416,
China=0.93414587,
especially=0.93352026,
maintenance=0.9321688,
professional=0.9321688,
system=0.885048,
coding=0.82671607,
boundaries=0.82206935,
development=0.82074183,
analyst=0.77101076,
project=0.77101076,
English=0.7098714,
design=0.6992446,
manager=0.64640945]
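Sorting the scores and keeping the best few as keywords could be done along these lines. This is a sketch under assumptions: the class TopKeywords and the parameter k are illustrative names, not part of the original implementation.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class TopKeywords {
    // Returns up to k words, ordered from the highest score to the lowest.
    static List<String> top(Map<String, Float> score, int k) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(score.entrySet());
        entries.sort(Map.Entry.<String, Float>comparingByValue().reversed());
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(k, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }
}
```

With the scores listed above, asking for the top word would give "programmer", as expected for this test case.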
Java implementation of keyword extraction with the TextRank algorithm