Lucene 5.2.1 + Jcseg 1.9.6 Chinese Word Segmentation Index (Lucene Learning Series 2)

Last Update: 2015-07-31 Source: Internet

Author: User

Tags: solr



Jcseg is an open-source Chinese word segmenter written in Java that implements the popular MMSEG algorithm. It is a standalone segmenter, not one developed specifically for Lucene, but it provides analyzers for recent versions of Lucene and Solr.
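To give a feel for dictionary-based segmentation, here is a minimal sketch of forward maximum matching, the greedy idea behind MMSEG's simple mode. It is not Jcseg's implementation: the class name MaxMatchSketch, the three-word dictionary, and the maximum word length are all assumptions made for illustration. The strings are written as Unicode escapes so the file compiles under any source encoding; they stand for 研究 (research), 研究生 (graduate student), and 生命 (life).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy forward maximum matching segmenter (illustrative only, not Jcseg code). */
final class MaxMatchSketch {

    /** Repeatedly takes the longest dictionary word that starts at the current position. */
    static List<String> segment(String text, Set<String> dictionary, int maxWordLength) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < text.length()) {
            int end = Math.min(text.length(), pos + maxWordLength);
            // Shrink the candidate until it is a dictionary word or a single character.
            while (end > pos + 1 && !dictionary.contains(text.substring(pos, end))) {
                end--;
            }
            tokens.add(text.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary: "research", "graduate student", "life".
        Set<String> dictionary = new HashSet<>(Arrays.asList(
                "\u7814\u7a76", "\u7814\u7a76\u751f", "\u751f\u547d"));
        // Input is the four characters of "research" + "life" written without spaces.
        System.out.println(segment("\u7814\u7a76\u751f\u547d", dictionary, 3));
    }
}
```

With this toy dictionary the sketch splits 研究生命 into 研究生 and 命, although 研究 followed by 生命 is the natural reading. Greedy matching cannot look ahead. MMSEG's complex mode handles such cases by comparing candidate chunks of several words, and, as far as I understand, that is the mode the COMPLEX_MODE constant used in the code below selects.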

Java Code

package com.qiuzhping.lucene;

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.lionsoul.jcseg.analyzer.JcsegAnalyzer5X;
import org.lionsoul.jcseg.core.JcsegTaskConfig;

/**
 * Jcseg is an open-source Chinese word segmenter written in Java. It implements
 * the popular MMSEG algorithm and provides analyzer interfaces for the latest
 * versions of Lucene, Solr, and Elasticsearch (new).
 * This program was tested with Jcseg 1.9.6 and Lucene 5.2.1.
 * For details on Jcseg, see http://www.oschina.net/p/jcseg
 *
 * @author Peter.qiu
 * @version [version NO, July 31, 2015]
 */
public class LuceneChineseSplit {

    public static void main(String[] args) throws Exception {
        Analyzer analyzer = new JcsegAnalyzer5X(JcsegTaskConfig.COMPLEX_MODE);

        // Optional (only needed to change the default configuration):
        // get the segmenter's configuration instance.
        JcsegAnalyzer5X jcseg = (JcsegAnalyzer5X) analyzer;
        JcsegTaskConfig config = jcseg.getTaskConfig();
        // Append synonyms to the segmentation result;
        // requires jcseg.loadsyn=1 in jcseg.properties.
        config.setAppendCJKSyn(true);
        // Append pinyin to the segmentation result;
        // requires jcseg.loadpinyin=1 in jcseg.properties.
        config.setAppendCJKPinyin(true);
        // For more options, see the org.lionsoul.jcseg.core.JcsegTaskConfig class.

        // ===== Index =====
        // Build the index in memory.
        Directory directory = new RAMDirectory();
        IndexWriterConfig iwConfig = new IndexWriterConfig(analyzer);
        iwConfig.setOpenMode(OpenMode.CREATE_OR_APPEND);
        IndexWriter iwriter = new IndexWriter(directory, iwConfig);

        // QueryDataFromDB is the JDBC helper from the previous article in this series.
        Connection conn = QueryDataFromDB.getConnection();
        Statement st = conn.createStatement();
        long count = 0;
        // The loop bound was lost in the source; 10 batches of 100,000 rows
        // is inferred from the reported total of 1,000,000 records.
        for (int i = 0; i < 10; i++) {
            String query = "SELECT * FROM student LIMIT " + i * 100000 + ", " + 100000;
            ResultSet result = st.executeQuery(query);
            while (result.next()) {
                Document document = new Document();
                // StringField is indexed as a single token; TextField is analyzed by Jcseg.
                document.add(new StringField("id", result.getString("id"), Field.Store.YES));
                document.add(new TextField("name", result.getString("name"), Field.Store.YES));
                document.add(new StringField("math", result.getString("math"), Field.Store.YES));
                iwriter.addDocument(document);
                count++;
            }
            result.close();
        }
        st.close();
        conn.close();
        System.out.println("Total record: " + count);
        iwriter.commit();
        iwriter.close();

        // ===== Search =====
        IndexReader ireader = DirectoryReader.open(directory);
        IndexSearcher isearcher = new IndexSearcher(ireader);

        String keyword = "hello";
        // Use the QueryParser to build the Query object.
        QueryParser qp = new QueryParser("name", analyzer);
        qp.setDefaultOperator(QueryParser.AND_OPERATOR);
        Query query = qp.parse(keyword);
        System.out.println("query = " + query);

        long start = System.currentTimeMillis();
        // Search for the 2 records with the highest similarity.
        System.out.println("Search for the 2 records with the highest similarity");
        TopDocs topDocs = isearcher.search(query, 2);
        System.out.println("Hits: " + topDocs.totalHits);
        for (ScoreDoc sd : topDocs.scoreDocs) {
            Document doc = isearcher.doc(sd.doc);
            System.out.println("id: " + doc.get("id"));
            System.out.println("name: " + doc.get("name"));
            System.out.println("math: " + doc.get("math"));
        }
        System.out.println("Spend time: " + (System.currentTimeMillis() - start) + " ms");

        ireader.close();
        directory.close();
    }
}
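The indexing loop above reads the table in batches using MySQL's LIMIT offset, row_count clause. The following sketch isolates just that paging arithmetic so it can be checked without a database. The class name BatchQuerySketch, the method name, and the table name student (my reading of the garbled name in the listing) are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Builds the paged SELECT statements used for batch reads (illustrative sketch). */
final class BatchQuerySketch {

    /** Returns one "LIMIT offset, count" query per batch, in order. */
    static List<String> batchQueries(String table, int batchSize, int batches) {
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < batches; i++) {
            // Use long so the offset cannot overflow int on very large tables.
            long offset = (long) i * batchSize;
            queries.add("SELECT * FROM " + table + " LIMIT " + offset + ", " + batchSize);
        }
        return queries;
    }

    public static void main(String[] args) {
        // Print the first three batches for a hypothetical "student" table.
        for (String query : batchQueries("student", 100000, 3)) {
            System.out.println(query);
        }
    }
}
```

Without an ORDER BY clause, MySQL does not guarantee a stable row order between queries, so a production version would probably page on an ordered key such as id to avoid skipping or repeating rows.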

Test Results:

Total record: 1000000
query = name:hello
Search for the 2 records with the highest similarity
Hits: 1000000
id: 1
name: hello
math: 38
id: 2
name: hello
math: 21
Spend time: 52 ms

The code depends on the following JARs:

lucene-analyzers-common-5.2.1.jar

lucene-core-5.2.1.jar

lucene-queryparser-5.2.1.jar

mysql-connector-java-5.1.35.jar

jcseg-analyzer-1.9.6.jar

jcseg-core-1.9.6.jar

For the database operations, refer to the previous article:

Lucene 4.10 + MySQL 5.5 Create Database Table Index (Lucene Learning Series 1)


