Implementing Chinese Synonym Search with Lucene and IKAnalyzer

Source: Internet
Author: User
Tags createindex

Lucene handles index creation and retrieval, and IKAnalyzer handles Chinese word segmentation. Together they are enough to implement basic Chinese search, but that alone often falls short: many projects also need the search to handle synonyms. For example, if the index contains entries that use two different words for "computer", a search for "notebook" should match those entries as well. This is where synonym-aware retrieval comes in.


There are two scenarios:

1. Handle synonyms when the index is built. As each term is indexed, its synonyms are added to the index as well, so a search only has to look up the terms the user entered.

2. Do no synonym processing when the index is built. At search time, segment the input first, treat each resulting token as a keyword and look up its synonyms, combine the tokens and their synonyms into a new set of keywords, and search the index with that.
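Scenario 2 can be sketched without Lucene at all. The class below is a minimal illustration, not the code used later in this article: the name QueryTimeSynonyms and its methods are hypothetical, and the synonym groups are plain in-memory lists. It shows only the expansion step, where each segmented token is replaced by itself plus its synonyms.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical helper: expands segmented query terms with their synonyms at search time. */
class QueryTimeSynonyms {
    // Maps each term to every member of its synonym group (including the term itself).
    private final Map<String, List<String>> groups = new HashMap<String, List<String>>();

    /** Registers a group of mutually equivalent terms. */
    void addGroup(String... terms) {
        List<String> group = Arrays.asList(terms);
        for (String term : terms) {
            groups.put(term, group);
        }
    }

    /** Returns the original terms plus their synonyms, in order and without duplicates. */
    List<String> expand(List<String> terms) {
        Set<String> out = new LinkedHashSet<String>();
        for (String term : terms) {
            out.add(term);
            List<String> group = groups.get(term);
            if (group != null) {
                out.addAll(group);
            }
        }
        return new ArrayList<String>(out);
    }
}
```

The expanded list is what would then be handed to the searcher; the index itself never needs to know about synonyms.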


Personally, I think scenario 2 is better than scenario 1, for two reasons. First, expanding synonyms at index time enlarges the index and lowers indexing efficiency. Second, if the synonym list is extended later, say a word that had 2 synonyms now has 3, the index has to be rebuilt, which is a lot of trouble.
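Both drawbacks of scenario 1 can be seen in a toy inverted index. The sketch below is an illustration only: IndexTimeSynonyms and its methods are hypothetical names, and the "index" is just an in-memory map from term to document IDs, not a Lucene index.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;
import java.util.TreeSet;

/** Hypothetical toy index that expands synonyms when documents are indexed. */
class IndexTimeSynonyms {
    private final Map<String, List<String>> groups = new HashMap<String, List<String>>();
    private final Map<String, Set<Integer>> postings = new TreeMap<String, Set<Integer>>();

    /** Registers (or replaces) a group of mutually equivalent terms. */
    void addGroup(String... terms) {
        List<String> group = Arrays.asList(terms);
        for (String term : terms) {
            groups.put(term, group);
        }
    }

    /** Indexes a document: every term is stored together with its current synonyms. */
    void index(int docId, List<String> terms) {
        for (String term : terms) {
            List<String> group = groups.containsKey(term)
                    ? groups.get(term)
                    : Collections.singletonList(term);
            for (String t : group) {
                Set<Integer> docs = postings.get(t);
                if (docs == null) {
                    docs = new TreeSet<Integer>();
                    postings.put(t, docs);
                }
                docs.add(docId);
            }
        }
    }

    /** Looks up a single term exactly as typed; no expansion happens at search time. */
    Set<Integer> search(String term) {
        Set<Integer> docs = postings.get(term);
        return docs == null ? Collections.<Integer>emptySet() : docs;
    }

    /** Number of distinct terms stored, a rough proxy for index size. */
    int termCount() {
        return postings.size();
    }
}
```

Indexing a one-word document under a two-word synonym group already stores two terms, and a synonym added after the fact is invisible to search until the document is indexed again.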


The approximate code is as follows.

Versions: Lucene 4.10.3; IKAnalyzer: IKAnalyzer2012_hf.jar

The API seems to change quite a lot from one Lucene version to the next, with many methods changing. I wonder whether everyone else has the same feeling.


Creating the index:

    /**
     * MyIndexer.java
     * V1.0
     * 2015-1-28 PM 8:53:37
     * Copyright (c) Yichang *** Limited. All rights reserved.
     */
    package com.x.same;

    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field.Store;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.util.Version;
    import org.wltea.analyzer.lucene.IKAnalyzer;

    /**
     * Builds a small test index with four documents.
     * @author yax 2015-1-28 PM 8:53:37
     * @version v1.0
     */
    public class MyIndexer {

        public static void createIndex(String indexPath) throws IOException {
            Directory directory = FSDirectory.open(new File(indexPath));
            // IKAnalyzer segments the Chinese text at index time; no synonym handling here
            Analyzer analyzer = new IKAnalyzer();
            IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_4_10_3, analyzer);
            IndexWriter indexWriter = new IndexWriter(directory, config);

            Document document1 = new Document();
            document1.add(new TextField("title", "ThinkPad fighter in ultra-polar Notebook", Store.YES));
            indexWriter.addDocument(document1);

            Document document2 = new Document();
            document2.add(new TextField("title", "user can configure their own extension dictionary here", Store.YES));
            indexWriter.addDocument(document2);

            Document document3 = new Document();
            // fixed: the TextField constructor call was missing here
            document3.add(new TextField("title", "You may refer to the word breaker source code", Store.YES));
            indexWriter.addDocument(document3);

            Document document4 = new Document();
            document4.add(new TextField("title",
                    "The first computer was developed by the U.S. military, specifically designed to calculate "
                    + "ballistic and firing characteristics, and the \"Moore Group\", which undertakes development "
                    + "tasks, consists of four scientists and engineers Eckert, Mercury, Goldstein and Boksburg. "
                    + "1946 This computer main component uses the electron tube. The machine uses 1500 "
                    + "relays, 18,800 electron tubes, covers an area of 170m2, weighs more than 30 tons, "
                    + "consumes 150KW, and costs $480,000. This computer can complete 5,000 addition operations "
                    + "per second, 400 multiplication, 300 times faster than the fastest calculation tool at the "
                    + "time, is 1000 times the relay computer, manual calculation of 200,000 times. "
                    + "by today's standards, it is such a \"clumsy\" and \"low\", its function is far less than "
                    + "a handheld programmable calculator, but it makes the scientists from the complex "
                    + "calculation of free, its birth marks the human entered a new era of information revolution.",
                    Store.YES));
            indexWriter.addDocument(document4);

            indexWriter.close();
        }
    }


Synonym-processing utility class:


    /**
     * AnalyzerUtil.java
     * V1.0
     * 2015-1-28 PM 8:42:24
     * Copyright (c) Yichang *** Limited. All rights reserved.
     */
    package com.x.same;

    import java.io.IOException;
    import java.io.StringReader;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.TokenStream;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.analysis.synonym.SynonymFilterFactory;
    import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
    import org.apache.lucene.analysis.util.FilesystemResourceLoader;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.util.Version;
    import org.wltea.analyzer.core.IKSegmenter;
    import org.wltea.analyzer.core.Lexeme;

    /**
     * Segments the search input and expands it with synonyms.
     * @author yax 2015-1-28 PM 8:42:24
     * @version v1.0
     */
    public class AnalyzerUtil {

        /**
         * Segments Chinese text and returns the tokens separated by spaces.
         */
        public static String analyzeChinese(String input, boolean useSmart) throws IOException {
            StringBuffer sb = new StringBuffer();
            StringReader reader = new StringReader(input.trim());
            // true: smart segmentation; false: finest-grained segmentation
            IKSegmenter ikSeg = new IKSegmenter(reader, useSmart);
            for (Lexeme lexeme = ikSeg.next(); lexeme != null; lexeme = ikSeg.next()) {
                // a space separator is required so that WhitespaceAnalyzer can split the tokens again
                sb.append(lexeme.getLexemeText()).append(" ");
            }
            return sb.toString();
        }

        /**
         * Matches synonyms for the segmented, space-separated input and returns a TokenStream.
         */
        public static TokenStream convertSynonym(String input) throws IOException {
            Version ver = Version.LUCENE_4_10_3;
            Map<String, String> filterArgs = new HashMap<String, String>();
            filterArgs.put("luceneMatchVersion", ver.toString());
            filterArgs.put("synonyms", "config/synonyms.txt");
            filterArgs.put("expand", "true");
            SynonymFilterFactory factory = new SynonymFilterFactory(filterArgs);
            factory.inform(new FilesystemResourceLoader());
            Analyzer whitespaceAnalyzer = new WhitespaceAnalyzer();
            TokenStream ts = factory.create(whitespaceAnalyzer.tokenStream("someField", input));
            return ts;
        }

        /**
         * Joins the tokens of the TokenStream into a space-separated string
         * that is handed to the IndexSearcher.
         */
        public static String displayTokens(TokenStream ts) throws IOException {
            StringBuffer sb = new StringBuffer();
            CharTermAttribute termAttr = ts.addAttribute(CharTermAttribute.class);
            ts.reset();
            while (ts.incrementToken()) {
                String token = termAttr.toString();
                sb.append(token).append(" ");
                System.out.print(token + "|");
                // System.out.print(offsetAttribute.startOffset() + "-" + offsetAttribute.endOffset() + "[" + token + "]");
            }
            System.out.println();
            ts.end();
            ts.close();
            return sb.toString();
        }

        public static void main(String[] args) {
            String indexPath = "D:\\search\\test";
            String input = "Super";
            System.out.println("**********************");
            try {
                // segment -> expand synonyms -> build the keyword string
                String result = displayTokens(convertSynonym(analyzeChinese(input, true)));
                // MyIndexer.createIndex(indexPath);
                List<String> docs = MySearcher.searchIndex(result, indexPath);
                for (String string : docs) {
                    System.out.println(string);
                }
            } catch (IOException e) {
                e.printStackTrace();
            } catch (ParseException e) {
                e.printStackTrace();
            }
        }
    }

Index search class:

    /**
     * MySearcher.java
     * V1.0
     * 2015-1-28 PM 9:02:32
     * Copyright (c) Yichang *** Limited. All rights reserved.
     */
    package com.x.same;

    import java.io.File;
    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    import org.apache.lucene.analysis.Analyzer;
    import org.apache.lucene.analysis.core.WhitespaceAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.queryparser.classic.ParseException;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.FSDirectory;

    /**
     * Searches the "title" field with an already segmented and expanded keyword string.
     * @author yax 2015-1-28 PM 9:02:32
     * @version v1.0
     */
    public class MySearcher {

        public static List<String> searchIndex(String keyword, String indexPath)
                throws IOException, ParseException {
            List<String> result = new ArrayList<>();
            IndexReader indexReader = DirectoryReader.open(FSDirectory.open(new File(indexPath)));
            try {
                IndexSearcher indexSearcher = new IndexSearcher(indexReader);
                // the keyword is already segmented and synonym-expanded, so only split on whitespace
                Analyzer analyzer = new WhitespaceAnalyzer();
                QueryParser queryParser = new QueryParser("title", analyzer);
                Query query = queryParser.parse(keyword);
                TopDocs td = indexSearcher.search(query, 10);
                // iterate over the returned hits only: totalHits can be larger than the 10 requested
                for (ScoreDoc scoreDoc : td.scoreDocs) {
                    Document document = indexSearcher.doc(scoreDoc.doc);
                    result.add(document.get("title"));
                }
            } finally {
                indexReader.close();
            }
            return result;
        }
    }
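The searcher receives a space-separated keyword string and relies on the classic QueryParser's default operator, which is OR, so a document matches if it contains any of the expanded terms. To make that intent explicit, the query string could be built by hand. The sketch below is one possible way to do it in plain Java: OrQueryBuilder is a hypothetical helper, and its escape method is an assumption modelled on the characters that the classic query syntax treats as special.

```java
import java.util.List;

/** Hypothetical helper: builds an explicit "field:(a OR b OR c)" query string. */
class OrQueryBuilder {
    // Characters with a special meaning in the classic query syntax.
    private static final String SPECIAL = "\\+-!():^[]\"{}~*?|&/";

    /** Prefixes every special character with a backslash. */
    static String escape(String term) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (SPECIAL.indexOf(c) >= 0) {
                sb.append('\\');
            }
            sb.append(c);
        }
        return sb.toString();
    }

    /** Joins the escaped terms with OR inside a single field group. */
    static String build(String field, List<String> terms) {
        if (terms.isEmpty()) {
            throw new IllegalArgumentException("at least one term is required");
        }
        StringBuilder sb = new StringBuilder(field).append(":(");
        for (int i = 0; i < terms.size(); i++) {
            if (i > 0) {
                sb.append(" OR ");
            }
            sb.append(escape(terms.get(i)));
        }
        return sb.append(')').toString();
    }
}
```

A string built this way should be usable as the keyword argument of searchIndex in place of the plain space-separated form, and escaping keeps a token such as "a:b" from being read as a field reference.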


Synonym file format:

    I, I, hankcs
    like, is, are => is
    a good person, kind people, enthusiastic people
    super-Ben, computer, computer

Lines 1, 3, and 4 are groups of synonyms. Line 2 is a one-way conversion: the terms to the left of "=>" are rewritten to the term on the right, which serves as an error-correction feature.
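To make the two kinds of lines concrete, here is a minimal parser sketch for this file format in plain Java. It is an illustration, not what SynonymFilterFactory does internally: SynonymRules is a hypothetical name, a later line simply overwrites an earlier rule for the same term, and multi-word terms are kept as single strings.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical parser for comma-separated synonym groups and "=>" mappings. */
class SynonymRules {

    /** Maps each input term to the list of terms it should be expanded or rewritten to. */
    static Map<String, List<String>> parse(List<String> lines) {
        Map<String, List<String>> rules = new LinkedHashMap<String, List<String>>();
        for (String raw : lines) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("#")) {
                continue; // skip blank lines and comments
            }
            int arrow = line.indexOf("=>");
            if (arrow >= 0) {
                // one-way mapping: every term on the left becomes the terms on the right
                List<String> targets = split(line.substring(arrow + 2));
                for (String source : split(line.substring(0, arrow))) {
                    rules.put(source, targets);
                }
            } else {
                // equivalence group: every term expands to the whole group
                List<String> group = split(line);
                for (String term : group) {
                    rules.put(term, group);
                }
            }
        }
        return rules;
    }

    private static List<String> split(String csv) {
        List<String> out = new ArrayList<String>();
        for (String part : csv.split(",")) {
            String term = part.trim();
            if (!term.isEmpty()) {
                out.add(term);
            }
        }
        return out;
    }
}
```

With a group line, any member maps to the full group; with a "=>" line, only the left-hand terms get an entry, so the right-hand term is never expanded back.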


The code is for reference only; discussion is welcome.

Please credit the source when reposting.
