Lucene 6.0: Extracting the Top-N Hot Words from a News Article

I. Requirement

Given a news document, count the words that occur most frequently.

II. Approach

There are many algorithms for extracting keywords from text, and more than one open-source tool implements them. This article only describes how to extract the top-n terms by term frequency from a Lucene index. Indexing is essentially the process of building a term-based inverted index: during analysis, punctuation and stop words are stripped and terms are produced. The implementation idea is to get the terms of a document field with IndexReader's getTermVector, read each term's TF (term frequency) from the resulting TermsEnum, collect the term/frequency pairs in a map, sort them by frequency in descending order, and take the top-n.

III. Code Implementation

The project layout is simple: there are only two important classes, IndexDocs.java and GetTopTerms.java.

For how to use the IK analyzer with Lucene 6.0, refer to http://blog.csdn.net/napoay/article/details/51911875.
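
That post wraps IK in a small adapter so it can be used through Lucene 6's Analyzer API. Below is a rough sketch of that wrapper idea, not the exact code from this project; the org.wltea.analyzer classes (IKSegmenter, Lexeme) and their method names come from the IK Analyzer library and should be checked against the version you use.

package lucene.ik;

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.Tokenizer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.wltea.analyzer.core.IKSegmenter;
import org.wltea.analyzer.core.Lexeme;

// Adapts IK to Lucene 6: Lucene asks the Analyzer for TokenStreamComponents per field.
public class IKAnalyzer6x extends Analyzer {

    private final boolean useSmart;

    public IKAnalyzer6x(boolean useSmart) {
        this.useSmart = useSmart;
    }

    @Override
    protected TokenStreamComponents createComponents(String fieldName) {
        return new TokenStreamComponents(new IKTokenizer6x(useSmart));
    }
}

// Tokenizer that feeds Lucene from IK's IKSegmenter (IK API names are assumptions here).
class IKTokenizer6x extends Tokenizer {

    private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
    private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
    private final TypeAttribute typeAtt = addAttribute(TypeAttribute.class);
    private final IKSegmenter segmenter;
    private int endPosition;

    IKTokenizer6x(boolean useSmart) {
        segmenter = new IKSegmenter(input, useSmart);
    }

    @Override
    public boolean incrementToken() throws IOException {
        clearAttributes();
        Lexeme lexeme = segmenter.next();          // next word produced by IK
        if (lexeme == null) {
            return false;
        }
        termAtt.append(lexeme.getLexemeText());    // term text
        termAtt.setLength(lexeme.getLength());
        offsetAtt.setOffset(lexeme.getBeginPosition(), lexeme.getEndPosition());
        typeAtt.setType(lexeme.getLexemeTypeString());
        endPosition = lexeme.getEndPosition();
        return true;
    }

    @Override
    public void reset() throws IOException {
        super.reset();
        segmenter.reset(input);                    // bind IK to the Reader Lucene supplies
    }

    @Override
    public void end() throws IOException {
        super.end();
        int finalOffset = correctOffset(endPosition);
        offsetAtt.setOffset(finalOffset, finalOffset);
    }
}

IndexDocs.java below simply instantiates this analyzer with useSmart set to true.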

3.1 Indexing the News

I picked a news article at random from Baidu News, "Kai-fu Lee: driverless cars are entering a golden age, and AI holds huge investment opportunities"; its content is Kai-fu Lee's keynote speech on AI. Put the text of the article into the file testfile/news.txt. The content of IndexDocs.java:

package lucene.test;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileReader;
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.cn.smart.SmartChineseAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import lucene.ik.IKAnalyzer6x;

public class IndexDocs {

    public static void main(String[] args) throws IOException {
        File newsFile = new File("testfile/news.txt");
        String text1 = textToString(newsFile);

        // Switch between the two analyzers for sections 4.1 and 4.2
        // Analyzer smcAnalyzer = new SmartChineseAnalyzer(true);
        Analyzer smcAnalyzer = new IKAnalyzer6x(true);

        IndexWriterConfig indexWriterConfig = new IndexWriterConfig(smcAnalyzer);
        indexWriterConfig.setOpenMode(OpenMode.CREATE);

        // Storage path of the index
        Directory directory = FSDirectory.open(Paths.get("indexdir"));
        // Index additions, deletions and updates are done through IndexWriter
        IndexWriter indexWriter = new IndexWriter(directory, indexWriterConfig);

        // New FieldType, specifying how the field is indexed
        FieldType type = new FieldType();
        // Index documents, term frequencies, positions and offsets
        type.setIndexOptions(IndexOptions.DOCS_AND_FREQS_AND_POSITIONS_AND_OFFSETS);
        type.setStored(true);            // store the original string in the index
        type.setStoreTermVectors(true);  // store term vectors
        type.setTokenized(true);         // tokenize the field value

        Document doc1 = new Document();
        Field field1 = new Field("content", text1, type);
        doc1.add(field1);
        indexWriter.addDocument(doc1);

        indexWriter.close();
        directory.close();
    }

    public static String textToString(File file) {
        StringBuilder result = new StringBuilder();
        try {
            // Construct a BufferedReader to read the file
            BufferedReader br = new BufferedReader(new FileReader(file));
            String str = null;
            // Read the file one line at a time with readLine
            while ((str = br.readLine()) != null) {
                result.append(System.lineSeparator() + str);
            }
            br.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
        return result.toString();
    }
}
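
Note that type.setStoreTermVectors(true) is the setting that matters most here: without stored term vectors, the IndexReader.getTermVector(docID, fieldName) call used in the next section returns null, which is exactly the pitfall discussed in reference 1 below.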

3.2 Getting the Hot Words

package lucene.test;

import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.BytesRef;

public class GetTopTerms {

    public static void main(String[] args) throws IOException {
        Directory directory = FSDirectory.open(Paths.get("indexdir"));
        IndexReader reader = DirectoryReader.open(directory);

        // Only one document was indexed, so its docID is 0;
        // get the terms of the "content" field through getTermVector
        Terms terms = reader.getTermVector(0, "content");

        // Iterate over the terms
        TermsEnum termsEnum = terms.iterator();
        BytesRef thisTerm = null;
        Map<String, Integer> map = new HashMap<String, Integer>();
        while ((thisTerm = termsEnum.next()) != null) {
            // term text
            String termText = thisTerm.utf8ToString();
            // for a per-document term vector, totalTermFreq() is the term's
            // frequency inside this document, i.e. its TF
            map.put(termText, (int) termsEnum.totalTermFreq());
        }

        // Sort the entries by value in descending order
        List<Map.Entry<String, Integer>> sortedMap =
                new ArrayList<Map.Entry<String, Integer>>(map.entrySet());
        Collections.sort(sortedMap, new Comparator<Map.Entry<String, Integer>>() {
            public int compare(Map.Entry<String, Integer> o1, Map.Entry<String, Integer> o2) {
                return o2.getValue() - o1.getValue();
            }
        });

        System.out.println(sortedMap);
        getTopN(sortedMap, 10);
    }

    // Print the top-n terms
    public static void getTopN(List<Entry<String, Integer>> sortedMap, int n) {
        for (int i = 0; i < n; i++) {
            System.out.println(sortedMap.get(i).getKey() + ":" + sortedMap.get(i).getValue());
        }
    }
}
IV. Results Analysis

4.1 SmartChineseAnalyzer Extraction Results

The first result is the top-10 produced with Lucene's built-in SmartChineseAnalyzer. It is clearly not what we expected.

4.2 IKAnalyzer Extraction Results

With the IK analyzer the result is still poor. The problem is that there are too many stop words: words like "I" and "this" occur with very high frequency but carry no meaning.

4.3 IK + Extended Stop Word Dictionary Extraction Results

Download the HIT (Harbin Institute of Technology) Chinese stop word list and add it to src/stopword.dic (see the configuration sketch below), re-index the document, and run GetTopTerms.java again. The results are as follows:

The news content is Kai-fu Lee's speech about artificial intelligence, driverless cars, and AI; this time the extraction results are much more reliable.
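
For reference, a minimal sketch of how an extended stop word dictionary is usually hooked into IK: the analyzer reads an IKAnalyzer.cfg.xml from the classpath root, and the stopword.dic added under src/ is listed there. The entry key name (ext_stopwords) comes from the stock IK configuration file and may differ between IK versions, so treat this as an assumption to check against your setup.

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extended configuration</comment>
    <!-- extended stop word dictionary, path relative to the classpath -->
    <entry key="ext_stopwords">stopword.dic</entry>
</properties>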

V. References

1. In Lucene 4, IndexReader.getTermVector(docID, fieldName) returns null for every doc
2. Lucene indexing process analysis

VI. Source Code

Criticism and suggestions are welcome.
The source code can be downloaded from the Lucene/ES/ELK development exchange group (370734940).
