Lucene 5.2.1 + Jcseg 1.9.6 Chinese Word Segmentation Index (Lucene Learning Series 2)


Jcseg is an open-source Chinese word segmenter written in Java that implements the popular MMSEG algorithm. It is a standalone segmenter, not developed specifically for Lucene, but it provides analyzer interfaces for the latest versions of Lucene and Solr.
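To make the dictionary-matching idea behind MMSEG concrete, here is a minimal sketch of forward maximum matching, the simplest building block of the algorithm. It is not Jcseg's implementation: the class name `MaxMatchDemo`, the method `segment`, and the four-word dictionary are all assumptions made up for this illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy forward maximum matching. Illustrative only; not Jcseg's actual code. */
class MaxMatchDemo {

    /**
     * Splits text greedily: at each position take the longest dictionary
     * word (up to maxWordLen characters), or a single character if none matches.
     */
    static List<String> segment(String text, Set<String> dict, int maxWordLen) {
        List<String> words = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLen);
            // Try the longest candidate first and shrink until a word is found.
            while (end > pos + 1 && !dict.contains(text.substring(pos, end))) {
                end--;
            }
            words.add(text.substring(pos, end));
            pos = end;
        }
        return words;
    }

    public static void main(String[] args) {
        // Hypothetical mini-dictionary, chosen to show an ambiguous case.
        Set<String> dict = new HashSet<>(Arrays.asList("研究", "研究生", "生命", "起源"));
        System.out.println(segment("研究生命起源", dict, 3));
        // prints [研究生, 命, 起源]
    }
}
```

On this input the greedy pass yields 研究生 / 命 / 起源 instead of the intended 研究 / 生命 / 起源. The full MMSEG algorithm reduces such errors by comparing three-word chunks and applying further tie-breaking rules, such as preferring the chunk whose word lengths vary least. The code in this article selects that behavior through `JcsegTaskConfig.COMPLEX_MODE`.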

Java Code

package com.qiuzhping.lucene;

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.lionsoul.jcseg.analyzer.JcsegAnalyzer5X;
import org.lionsoul.jcseg.core.JcsegTaskConfig;

/**
 * Jcseg is an open-source Chinese word segmenter written in Java. It
 * implements the popular MMSEG algorithm and provides segmentation
 * interfaces for the latest versions of Lucene, Solr, and (newly)
 * Elasticsearch.
 *
 * This program was tested with Jcseg 1.9.6 and Lucene 5.2.1.
 * For details on Jcseg, see http://www.oschina.net/p/jcseg
 *
 * @author Peter.qiu
 * @version [version NO, July 31, 2015]
 */
public class LuceneChineseSplit {

    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new JcsegAnalyzer5X(JcsegTaskConfig.COMPLEX_MODE);

        // Optional (only needed to change the defaults): get the segmenter's configuration instance.
        JcsegAnalyzer5X jcseg = (JcsegAnalyzer5X) analyzer;
        JcsegTaskConfig config = jcseg.getTaskConfig();
        // Append synonyms to the segmentation result; requires jcseg.loadsyn=1 in jcseg.properties.
        config.setAppendCJKSyn(true);
        // Append pinyin to the segmentation result; requires jcseg.loadpinyin=1 in jcseg.properties.
        config.setAppendCJKPinyin(true);
        // For more options, see the org.lionsoul.jcseg.core.JcsegTaskConfig class.

        // === Index ===
        // Build the index in memory.
        Directory directory = new RAMDirectory();
        IndexWriterConfig iwConfig = new IndexWriterConfig(analyzer);
        iwConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);
        IndexWriter iwriter = new IndexWriter(directory, iwConfig);

        // QueryDataFromDB is the database helper from the previous article in this series.
        Connection conn = QueryDataFromDB.getConnection();
        Statement st = conn.createStatement();
        long count = 0;
        // The loop bound is missing in the source; 10 is inferred from the
        // reported 1,000,000 records read in batches of 100,000.
        for (int i = 0; i < 10; i++) {
            String query = "SELECT * FROM student LIMIT " + i * 100000 + ", " + 100000;
            ResultSet result = st.executeQuery(query);
            while (result.next()) {
                Document document = new Document();
                // StringField is indexed verbatim; TextField is run through the analyzer.
                document.add(new StringField("id", result.getString("id"), Field.Store.YES));
                document.add(new TextField("name", result.getString("name"), Field.Store.YES));
                document.add(new StringField("math", result.getString("math"), Field.Store.YES));
                iwriter.addDocument(document);
                count++;
            }
            result.close();
        }
        st.close();
        conn.close();
        System.out.println("Total record: " + count);
        iwriter.commit();
        iwriter.close();

        // === Search ===
        IndexReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);
        String keyword = "hello";
        // Use the QueryParser to build the Query object.
        QueryParser qp = new QueryParser("name", analyzer);
        qp.setDefaultOperator(QueryParser.AND_OPERATOR);
        Query query = qp.parse(keyword);
        System.out.println("query = " + query);

        long start = System.currentTimeMillis();
        // Search for the 2 records with the highest similarity.
        System.out.println("Search for the 2 records with the highest similarity");
        TopDocs topDocs = isearcher.search(query, 2);
        System.out.println("Hits: " + topDocs.totalHits);
        for (ScoreDoc sd : topDocs.scoreDocs) {
            Document doc = isearcher.doc(sd.doc);
            System.out.println("id: " + doc.get("id"));
            System.out.println("name: " + doc.get("name"));
            System.out.println("math: " + doc.get("math"));
        }
        System.out.println("Spend time: " + (System.currentTimeMillis() - start) + " ms");

        ireader.close();
        directory.close();
    }
}

Test Results:

Total record: 1000000
query = name:hello
Search for the 2 records with the highest similarity
Hits: 1000000
id: 1
name: hello
math: 38
id: 2
name: hello
math: 21
Spend time: 52 ms
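
The indexing loop in the Java code reads the table in fixed-size pages using MySQL's `LIMIT offset, row_count` form. The following is a minimal sketch of how each page's query is derived from the page number; the class `PagedQueryDemo` and the helper `pageQuery` are hypothetical names introduced for illustration, and the table name and page size simply mirror the values used in the article.

```java
/** Illustrative helper that builds "LIMIT offset, row_count" page queries. */
class PagedQueryDemo {

    /** Returns the query for the given zero-based page of pageSize rows. */
    static String pageQuery(String table, int page, int pageSize) {
        // Use long arithmetic so large page numbers cannot overflow int.
        long offset = (long) page * pageSize;
        return "SELECT * FROM " + table + " LIMIT " + offset + ", " + pageSize;
    }

    public static void main(String[] args) {
        // Print the first three page queries for a 100,000-row page size.
        for (int page = 0; page < 3; page++) {
            System.out.println(pageQuery("student", page, 100000));
        }
    }
}
```

Without an `ORDER BY` clause MySQL does not guarantee a stable row order between queries, so pages can in principle overlap or skip rows; ordering by a unique column such as `id` avoids this.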

The code depends on the following JAR files:

lucene-analyzers-common-5.2.1.jar

lucene-core-5.2.1.jar

lucene-queryparser-5.2.1.jar

mysql-connector-java-5.1.35.jar

jcseg-analyzer-1.9.6.jar

jcseg-core-1.9.6.jar

You can refer to the previous article for database operations:

Lucene 4.10 + MySQL 5.5: Creating a Database Table Index (Lucene Learning Series 1)

Copyright notice: This is an original article by the blogger and may not be reproduced without the blogger's permission.
