Java implementation of keyword extraction with the TextRank algorithm

Source: Internet
Author: User
Tags: idf

Reprinted from: Code farm » Java implementation of keyword extraction with the TextRank algorithm

When it comes to automatic summarization algorithms, the most common and easiest to implement is TF-IDF. In my experience, though, TF-IDF gives only mediocre results and is not as good as TextRank.

TextRank is inspired by Google's PageRank algorithm. It is a weighting algorithm designed for the sentences in a text, with automatic summarization as its goal. It works on a voting principle: each word casts votes in favour of its neighbors (the words inside its window), and the weight of a vote depends on how many votes the voter itself has received. This is a "chicken or egg" paradox, which PageRank resolves by iterating on the matrix until it converges. TextRank is no exception:

The PageRank formula:
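The formula itself appears to have been an image that did not survive on this page. As a hedged reconstruction, PageRank in the form given by the TextRank paper can be written as below, where S(V_i) is the score of vertex V_i, d is the damping factor (commonly set to 0.85), In(V_i) is the set of vertices that link to V_i, and Out(V_j) is the set of vertices that V_j links to:

```latex
S(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} \, S(V_j)
```

This agrees with the Java loop further down, which starts every score at 1 - d and then adds d / size times the score of each neighbor.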

The formal TextRank formula:

Building on the PageRank formula, the formal TextRank formula introduces the concept of edge weights, which represent the similarity between two sentences.
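The weighted formula is also missing from this page. A hedged reconstruction, following the TextRank paper, is shown below; w_{ji} denotes the weight of the edge from V_j to V_i, and the remaining symbols have their usual PageRank meaning (d is the damping factor, In and Out are the incoming and outgoing neighbor sets):

```latex
WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}} \, WS(V_j)
```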

But I only want to extract keywords. If each word is treated as a sentence, then the edges between these sentences (words) are all unweighted (single words have no overlap, so no similarity can be computed between them). The weights w in the numerator and denominator therefore cancel out, and the algorithm degenerates to PageRank. So it is no exaggeration to say that this keyword extraction algorithm is simply PageRank.
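The cancellation can be spelled out in one line. The sketch below assumes that every edge carries the same nonzero weight c (an assumption made here so that the fraction is well defined), and uses w_{ji} for the weight of the edge from V_j to V_i:

```latex
\frac{w_{ji}}{\sum_{V_k \in Out(V_j)} w_{jk}}
  = \frac{c}{|Out(V_j)| \cdot c}
  = \frac{1}{|Out(V_j)|}
\quad\Longrightarrow\quad
WS(V_i) = (1 - d) + d \sum_{V_j \in In(V_i)} \frac{1}{|Out(V_j)|} \, WS(V_j)
```

The right-hand side is exactly the unweighted PageRank update, with the neighbor count |Out(V_j)| playing the role of `size` in the Java loop.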

In addition, if you want to extract key sentences (automatic summarization), please see the companion article "Java implementation of automatic summarization with the TextRank algorithm."

Java implementation of TextRank

First look at the test data:

Programmers (English: programmer) are professionals engaged in program development and maintenance. Programmers are generally divided into program designers and program coders, but the boundary between the two is not very clear, especially in China. Software practitioners are divided into four major categories: junior programmer, senior programmer, system analyst and project manager.

I took the Baidu Encyclopedia definition of "programmer" as the test case. Clearly, the keyword of this definition should be "programmer", and "programmer" should receive the highest score.

First, segment the text into words. Any of the various word segmentation projects can help here, such as the ANSJ segmenter. The result is:

[programmer/n, (, English/nz, programmer/en, ), is/v, engaged in/v, program/n, development/v, ,/w, maintenance/v, of/uj, professional/n, personnel/n, ./w, general/a, will/d, programmer/n, divided into/v, program/n, design/vn, personnel/n, and/c, program/n, coding/n, personnel/n, ,/w, but/c, both/r, of/uj, boundaries/n, and/c, not/d, very/d, clear/a, ,/w, special/d, is/v, in/p, China/ns, ./w, software/n, practitioners/b, personnel/n, divided into/v, beginner/b, programmer/n, ,/w, advanced/a, programmer/n, ,/w, system/n, analyst/n, and/c, project/n, manager/n, four/m, big/a, class/q, ./w]

Then remove the stop words. Here I removed punctuation, common words, and every word that is not a noun, verb, adjective or adverb. That leaves the words that are actually useful:

[programmer, English, program, development, maintenance, professional, personnel, programmer, divided into, program, design, personnel, program, coding, personnel, boundaries, special, China, software, personnel, divided into, programmer, advanced, programmer, system, analyst, project, manager]
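The filtering step could look roughly like the sketch below. It is not the article's own code: the class name `PosFilterSketch`, the tiny stop-word list, and the "word/tag" token format are assumptions made for illustration, and the rule "keep tags that start with n, v, a or d" is only an approximation of "nouns, verbs, adjectives and adverbs" for ANSJ-style tags.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class PosFilterSketch {
    // Hypothetical stop-word list; a real one would be loaded from a file.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("is", "will", "not", "very"));

    // Keep a token only if its tag looks like a noun, verb, adjective or adverb
    // (assumed here to be tags starting with n, v, a or d) and it is not a stop word.
    static boolean shouldKeep(String word, String tag) {
        if (word.isEmpty() || tag.isEmpty()) return false;
        char first = Character.toLowerCase(tag.charAt(0));
        boolean contentWord = first == 'n' || first == 'v' || first == 'a' || first == 'd';
        return contentWord && !STOP_WORDS.contains(word.toLowerCase());
    }

    // Tokens are assumed to have the form "word/tag"; anything else is dropped.
    static List<String> filter(List<String> taggedTokens) {
        List<String> kept = new ArrayList<>();
        for (String token : taggedTokens) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0 || slash == token.length() - 1) continue;
            String word = token.substring(0, slash);
            String tag = token.substring(slash + 1);
            if (shouldKeep(word, tag)) kept.add(word);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList(
                "programmer/n", "is/v", "program/n", "development/v",
                ",/w", "and/c", "maintenance/v");
        // prints [programmer, program, development, maintenance]
        System.out.println(filter(tokens));
    }
}
```

In a real pipeline the tags would come from the segmenter's own token objects rather than from string splitting, and the stop-word list would be much larger.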

After that, set up two windows of size 5, one before and one after each word. Each word casts its votes for the words that fall inside its windows:

{development=[professional, programmer, maintenance, English, program, personnel],
  software=[programmer, divided into, boundaries, advanced, China, special, personnel],
  programmer=[development, software, analyst, maintenance, system, project, manager, divided into, English, program, professional, design, advanced, personnel, China],
  analyst=[programmer, system, project, manager, advanced],
  maintenance=[professional, development, programmer, divided into, English, program, personnel],
  system=[programmer, analyst, project, manager, divided into, advanced],
  project=[programmer, analyst, system, manager, advanced],
  manager=[programmer, analyst, system, project],
  divided into=[professional, software, design, programmer, maintenance, system, advanced, program, China, special, personnel],
  English=[professional, development, programmer, maintenance, program],
  program=[professional, development, design, programmer, coding, maintenance, boundaries, divided into, English, special, personnel],
  special=[software, coding, divided into, boundaries, program, China, personnel],
  professional=[development, programmer, maintenance, divided into, English, program, personnel],
  design=[programmer, coding, divided into, program, personnel],
  coding=[design, boundaries, program, China, special, personnel],
  boundaries=[software, coding, program, China, special, personnel],
  advanced=[programmer, software, analyst, system, project, divided into, personnel],
  China=[programmer, software, coding, divided into, boundaries, special, personnel],
  personnel=[development, programmer, software, maintenance, divided into, program, special, professional, design, coding, boundaries, advanced, China]}
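A neighbor map like the one above can be built with a sliding window. The sketch below is one way to do it, not necessarily the author's: `WindowGraphSketch` and `buildGraph` are hypothetical names, and the window is assumed to hold 5 consecutive words, so linked words are at most 4 positions apart. That assumption is consistent with the listing, where the final word, manager, is linked to the four words before it but not to the word five positions back.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class WindowGraphSketch {
    // Builds an undirected co-occurrence graph: two distinct words are linked
    // when they appear together inside a window of `window` consecutive words.
    static Map<String, Set<String>> buildGraph(List<String> wordList, int window) {
        Map<String, Set<String>> graph = new LinkedHashMap<>();
        Deque<String> recent = new ArrayDeque<>(); // the previous (window - 1) words
        for (String word : wordList) {
            graph.computeIfAbsent(word, k -> new LinkedHashSet<>());
            // Link the new word with every earlier word still inside the window.
            for (String earlier : recent) {
                if (earlier.equals(word)) continue; // a word does not vote for itself
                graph.get(earlier).add(word);
                graph.get(word).add(earlier);
            }
            recent.addLast(word);
            if (recent.size() > window - 1) recent.removeFirst();
        }
        return graph;
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList(
                "programmer", "English", "program", "development",
                "maintenance", "professional", "personnel", "programmer");
        Map<String, Set<String>> graph = buildGraph(words, 5);
        // prints [programmer, English, program, maintenance, professional, personnel]
        System.out.println(graph.get("development"));
    }
}
```

Run on the first eight filtered words, this gives development the same six neighbors as in the listing. Because links are added in both directions, one pass over the word list covers both the "before" and the "after" window.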

Then start the iterative voting:

        // words maps each word to its neighbors; score holds the previous round's scores.
        // d is the damping factor; max_iter and min_diff bound the iteration.
        for (int i = 0; i < max_iter; ++i)
        {
            Map<String, Float> m = new HashMap<String, Float>();
            float max_diff = 0;
            for (Map.Entry<String, Set<String>> entry : words.entrySet())
            {
                String key = entry.getKey();
                Set<String> value = entry.getValue();
                m.put(key, 1 - d);
                for (String other : value)
                {
                    int size = words.get(other).size();
                    if (key.equals(other) || size == 0) continue;
                    // Each neighbor splits its previous score evenly among its own neighbors.
                    m.put(key, m.get(key) + d / size * (score.get(other) == null ? 0 : score.get(other)));
                }
                // Track the largest score change in this round.
                max_diff = Math.max(max_diff, Math.abs(m.get(key) - (score.get(key) == null ? 0 : score.get(key))));
            }
            score = m;
            // Stop once the scores have converged.
            if (max_diff <= min_diff) break;
        }

Voting results after sorting:

[programmer=1.9249977,
  personnel=1.6290349,
  divided into=1.4027836,
  program=1.4025855,
  advanced=0.9747374,
  software=0.93525416,
  China=0.93414587,
  special=0.93352026,
  maintenance=0.9321688,
  professional=0.9321688,
  system=0.885048,
  coding=0.82671607,
  boundaries=0.82206935,
  development=0.82074183,
  analyst=0.77101076,
  project=0.77101076,
  English=0.7098714,
  design=0.6992446,
  manager=0.64640945]

Sure enough, "programmer" tops the list, and the scores show some differentiation. Well, just barely.
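The sorting step is not shown in the article. A minimal sketch of how the score map could be turned into a ranked keyword list follows; `TopKeywordsSketch` and `topKeywords` are hypothetical names, and the alphabetical tie-break is a choice made here to keep the order deterministic.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopKeywordsSketch {
    // Returns up to n words, ordered from the highest score to the lowest.
    static List<String> topKeywords(Map<String, Float> score, int n) {
        List<Map.Entry<String, Float>> entries = new ArrayList<>(score.entrySet());
        // Highest score first; equal scores are ordered alphabetically.
        entries.sort((a, b) -> {
            int byScore = Float.compare(b.getValue(), a.getValue());
            return byScore != 0 ? byScore : a.getKey().compareTo(b.getKey());
        });
        List<String> result = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            result.add(entries.get(i).getKey());
        }
        return result;
    }

    public static void main(String[] args) {
        // A few of the scores reported above, used as sample input.
        Map<String, Float> score = new HashMap<>();
        score.put("programmer", 1.9249977f);
        score.put("divided into", 1.4027836f);
        score.put("program", 1.4025855f);
        score.put("manager", 0.64640945f);
        // prints [programmer, divided into, program]
        System.out.println(topKeywords(score, 3));
    }
}
```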
