An Illustrated Guide to Chinese Word Segmentation with Lucene

Source: Internet
Author: User

This document records how to use Lucene together with the Paoding analyzer for Chinese word segmentation.

First, download Lucene (official site: http://archive.apache.org/dist/lucene/java/). This article uses version 2.9.4. After downloading and unzipping, Lucene needs the following basic jar files:
lucene-core-2.9.4.jar — Lucene core
lucene-analyzers-2.9.4.jar — Lucene analyzers (tokenizers)
lucene-highlighter-2.9.4.jar — Lucene highlighting

Second, because Lucene's built-in Chinese tokenization cannot achieve what we need, download the third-party Paoding package (named after the idiom "Paoding carves the ox"; official site: http://code.google.com/p/paoding/). The latest version is paoding-analysis-2.0.4-beta.zip. After unzipping, Lucene needs the following Paoding files:
paoding-analysis.jar — the jar Lucene needs for Chinese segmentation
commons-logging.jar — logging
{paoding_home}/dic — the dictionary directory ({paoding_home} is the directory Paoding was extracted into)

Third, open Eclipse and create a Java project (neither the project name nor the project path may contain spaces). In this example the project is named paoding.
1_1: Create a folder named lib in the paoding project (to hold all the jars), copy the jar files listed above into it, and add every jar under lib to the project classpath.
1_2: Copy the {paoding_home}/dic directory into the paoding project's src folder. The full project layout is shown in the diagram below.

Fourth, create the TestFileIndex.java class. It reads every file matching D:\data\*.txt into memory and writes an index to the index directory (D:\luceneindex).

TestFileIndex.java

package com.lixing.paoding.index;

import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.InputStreamReader;

import net.paoding.analysis.analyzer.PaodingAnalyzer;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class TestFileIndex {
    public static void main(String[] args) throws Exception {
        String dataDir = "D:/data";
        String indexDir = "D:/luceneindex";

        File[] files = new File(dataDir).listFiles();
        System.out.println(files.length);

        Analyzer analyzer = new PaodingAnalyzer();
        Directory dir = FSDirectory.open(new File(indexDir));
        IndexWriter writer = new IndexWriter(dir, analyzer, IndexWriter.MaxFieldLength.UNLIMITED);

        for (int i = 0; i < files.length; i++) {
            StringBuffer strBuffer = new StringBuffer();
            String line = "";
            FileInputStream is = new FileInputStream(files[i].getCanonicalPath());
            // The source files are GB2312-encoded, so decode them explicitly.
            BufferedReader reader = new BufferedReader(new InputStreamReader(is, "gb2312"));
            line = reader.readLine();
            while (line != null) {
                strBuffer.append(line);
                strBuffer.append("\n");
                line = reader.readLine();
            }

            Document doc = new Document();
            doc.add(new Field("FileName", files[i].getName(), Field.Store.YES, Field.Index.ANALYZED));
            doc.add(new Field("Contents", strBuffer.toString(), Field.Store.YES, Field.Index.ANALYZED));
            writer.addDocument(doc);
            reader.close();
            is.close();
        }

        writer.optimize();
        writer.close();
        dir.close();
        System.out.println("ok");
    }
}

Fifth, create TestFileSearcher.java. Its function is to search the index and read back the indexed contents.

TestFileSearcher.java

package com.lixing.paoding.index;

import java.io.File;

import net.paoding.analysis.analyzer.PaodingAnalyzer;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;

public class TestFileSearcher {
    public static void main(String[] args) throws Exception {
        String indexDir = "D:/luceneindex";
        Analyzer analyzer = new PaodingAnalyzer();
        Directory dir = FSDirectory.open(new File(indexDir));
        IndexSearcher searcher = new IndexSearcher(dir, true);
        QueryParser parser = new QueryParser(Version.LUCENE_29, "Contents", analyzer);
        // In the original article the query string is a Chinese word
        // (rendered in translation as "Cry for Help").
        Query query = parser.parse("Cry for Help");
        // Term term = new Term("FileName", "university");
        // TermQuery query = new TermQuery(term);

        TopDocs docs = searcher.search(query, 1000);
        ScoreDoc[] hits = docs.scoreDocs;
        System.out.println(hits.length);
        for (int i = 0; i < hits.length; i++) {
            Document doc = searcher.doc(hits[i].doc);
            System.out.print(doc.get("FileName") + "--:\n");
            System.out.println(doc.get("Contents") + "\n");
        }

        searcher.close();
        dir.close();
    }
}
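A note on the dictionary directory: step 1_2 above copies {paoding_home}/dic into src so that the dictionaries end up on the classpath. Paoding can also be pointed at the dictionary directory explicitly, either through the PAODING_DIC_HOME environment variable or through a paoding-dic-home.properties file on the classpath. A minimal sketch of that properties file, assuming the dictionaries sit in a dic folder under the project (the exact path is your choice):

# paoding-dic-home.properties — place on the classpath (e.g. in src/)
# Tells Paoding where its dictionary directory lives; the
# PAODING_DIC_HOME environment variable is an alternative.
paoding.dic.home=dic

If neither is set and the dic folder is not found, PaodingAnalyzer will fail at construction time complaining that the dictionary home cannot be located, so it is worth configuring this before running TestFileIndex.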

This article is from the "Li Xin Blog"; please keep this source link: http://kinglixing.blog.51cto.com/3421535/702663
