Lucene 5.2.1 + Jcseg 1.9.6 Chinese word segmentation index (Lucene learning series 2)
Jcseg is an open-source Chinese word segmenter written in Java and based on the popular MMSEG algorithm. It is a standalone segmenter, not developed specifically for Lucene, but it provides analyzer interfaces for the latest versions of Lucene and Solr.
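To give a feel for dictionary-based segmentation, here is a minimal sketch of forward maximum matching, the simple strategy that MMSEG builds on. It is not Jcseg's implementation: the class name, the tiny dictionary and the sample input are all made up for illustration, and Latin letters stand in for an unspaced Chinese sentence so the result is easy to read.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative only: a toy forward maximum matcher, not Jcseg's code.
class ForwardMaxMatchSketch {

    // Repeatedly takes the longest dictionary word that starts at the current
    // position; if no dictionary word matches, emits a single character.
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            // Shrink the candidate until it is a dictionary word or one character long.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            words.add(text.substring(pos, end));
            pos = end;
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical mini dictionary, chosen only for this example.
        Set<String> dictionary = new HashSet<>(Arrays.asList("the", "cat", "cats", "sat", "at"));
        System.out.println(segment("thecatsat", dictionary, 4)); // prints [the, cats, at]
    }
}
```

The greedy matcher returns "the / cats / at" rather than "the / cat / sat". This is the kind of ambiguity that MMSEG's complex mode tries to resolve by comparing chunks of several words instead of committing to the first longest match.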
Java Code
package com.qiuzhping.lucene;

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.lionsoul.jcseg.analyzer.JcsegAnalyzer5X;
import org.lionsoul.jcseg.core.JcsegTaskConfig;

/**
 * Chinese word segmentation indexing with Jcseg and Lucene.
 * <p>
 * Jcseg is an open-source Chinese word segmenter written in Java that implements
 * the popular MMSEG algorithm and provides analyzer interfaces for the latest
 * versions of Lucene, Solr and Elasticsearch (new).
 * <p>
 * This program was tested with Jcseg 1.9.6 and Lucene 5.2.1.
 * For details on Jcseg, see http://www.oschina.net/p/jcseg
 *
 * @author Peter.Qiu
 * @version [version NO, July 31, 2015]
 */
public class LuceneChineseSplit {

    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new JcsegAnalyzer5X(JcsegTaskConfig.COMPLEX_MODE);

        // Optional (only needed to change the default configuration):
        // get the segmenter's configuration instance.
        JcsegAnalyzer5X jcseg = (JcsegAnalyzer5X) analyzer;
        JcsegTaskConfig config = jcseg.getTaskConfig();
        // Append synonyms to the segmentation result;
        // requires jcseg.loadsyn=1 in jcseg.properties.
        config.setAppendCJKSyn(true);
        // Append pinyin to the segmentation result;
        // requires jcseg.loadpinyin=1 in jcseg.properties.
        config.setAppendCJKPinyin(true);
        // For more options, see the JcsegTaskConfig class.

        // ===== Index =====
        // Build an in-memory index.
        Directory directory = new RAMDirectory();
        IndexWriterConfig iwConfig = new IndexWriterConfig(analyzer);
        iwConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);
        IndexWriter iwriter = new IndexWriter(directory, iwConfig);

        // QueryDataFromDB is the database helper; see the previous article.
        Connection conn = QueryDataFromDB.getConnection();
        Statement st = conn.createStatement();
        long count = 0;
        // The loop bound is missing from the original listing. 10 is assumed here
        // because 10 batches of 100,000 rows match the 1,000,000 records reported
        // in the test results.
        for (int i = 0; i < 10; i++) {
            String query = "select * from student limit " + i * 100000 + ", " + 100000;
            ResultSet result = st.executeQuery(query);
            while (result.next()) {
                Document document = new Document();
                // id and math are indexed as single tokens; name is segmented by Jcseg.
                document.add(new StringField("id", result.getString("id"), Field.Store.YES));
                document.add(new TextField("name", result.getString("name"), Field.Store.YES));
                document.add(new StringField("math", result.getString("math"), Field.Store.YES));
                iwriter.addDocument(document);
                count++;
            }
            result.close();
        }
        st.close();
        conn.close();
        System.out.println("Total record:" + count);
        iwriter.commit();
        iwriter.close();

        // ===== Search =====
        IndexReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);
        String keyword = "hello";
        // Use the QueryParser to build the Query object.
        QueryParser qp = new QueryParser("name", analyzer);
        qp.setDefaultOperator(QueryParser.AND_OPERATOR);
        Query query = qp.parse(keyword);
        System.out.println("query =" + query);

        long start = System.currentTimeMillis();
        // Search for the 2 records with the highest similarity.
        System.out.println("Search for the 2 records with the highest similarity");
        TopDocs topDocs = isearcher.search(query, 2);
        System.out.println("Hit:" + topDocs.totalHits);
        for (ScoreDoc sd : topDocs.scoreDocs) {
            Document doc = isearcher.doc(sd.doc);
            System.out.println("ID:" + doc.get("id"));
            System.out.println("Name:" + doc.get("name"));
            System.out.println("math:" + doc.get("math"));
        }
        System.out.println("Spend time:" + (System.currentTimeMillis() - start) + "ms");

        ireader.close();
        directory.close();
    }
}
Test results:
Total record:1000000
query =name:hello
Search for the 2 records with the highest similarity
Hit:1000000
ID:1
Name:hello
math:38
ID:2
Name:hello
math:21
Spend time:52ms
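The indexing loop in the program reads the table in pages using MySQL's "limit offset, count" form. The sketch below only builds the query strings so the paging arithmetic is visible. The class and method names are hypothetical, and the batch count of 10 is an assumption: it is the smallest number of 100,000-row pages that covers the 1,000,000 records reported above.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: shows how the per-batch LIMIT offsets are derived.
class BatchQuerySketch {

    // Builds one "limit offset, count" query per batch, where batch i
    // starts at row i * batchSize.
    static List<String> batchQueries(String table, int batchSize, int batches) {
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < batches; i++) {
            long offset = (long) i * batchSize; // long avoids int overflow on large tables
            queries.add("select * from " + table + " limit " + offset + ", " + batchSize);
        }
        return queries;
    }

    public static void main(String[] args) {
        // Assumed values: table "student", 100,000 rows per batch, 10 batches.
        for (String query : batchQueries("student", 100000, 10)) {
            System.out.println(query);
        }
    }
}
```

The first query is "select * from student limit 0, 100000" and the last is "select * from student limit 900000, 100000". Without an order by clause MySQL does not guarantee a stable row order between pages, so adding something like "order by id" may be worth considering if every row must be read exactly once.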
LuceneChineseSplit depends on the following JARs:
lucene-analyzers-common-5.2.1.jar
lucene-core-5.2.1.jar
lucene-queryparser-5.2.1.jar
mysql-connector-java-5.1.35.jar
jcseg-analyzer-1.9.6.jar
jcseg-core-1.9.6.jar
For the database operations, refer to the previous article:
Lucene 4.10 + MySQL 5.5: creating a database table index (Lucene learning series 1)
Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.