SpellChecker is a feature available in recent versions of Lucene. Before introducing it, note that it can build its suggestions from several kinds of data source. A SpellChecker is created over a Directory, and the words it suggests are loaded through an implementation of the Dictionary interface:
package org.apache.lucene.search.spell;

/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements.  See the NOTICE file distributed with
 * this work for additional information regarding copyright ownership.
 * The ASF licenses this file to You under the Apache License, Version 2.0
 * (the "License"); you may not use this file except in compliance with
 * the License.  You may obtain a copy of the License at
 *
 *     http://www.apache.org/licenses/LICENSE-2.0
 *
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS,
 * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 * See the License for the specific language governing permissions and
 * limitations under the License.
 */

import java.io.IOException;

import org.apache.lucene.search.suggest.InputIterator;

/**
 * A simple interface representing a Dictionary. A Dictionary
 * here is a list of entries, where every entry consists of
 * term, weight and payload.
 */
public interface Dictionary {

  /**
   * Returns an iterator over all the entries
   * @return Iterator
   */
  InputIterator getEntryIterator() throws IOException;
}
The two most commonly used kinds of dictionary are text-file-based and Lucene-index-based.
Here is the code I tested, covering both building the index and querying it:
package com.tianditu.com.search;

import java.io.File;
import java.io.IOException;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.spell.LuceneDictionary;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.MMapDirectory;
import org.apache.lucene.util.Version;

public class GlobalSuggest {

    // Folder that holds the index built by the spell checker
    private final String SPELL_CHECK_FOLDER = "c:\\spellcheck\\";

    // Existing index that the spell-check index is built from
    private final String GLOBAL_PINYIN_SUGGEST = "O:\\searchwork_custom\\data_index\\pinyin2008\\";

    // Build the spell-check index from the pinyin2008 index
    public void testIndexPinyin2008() throws IOException {
        long start = System.currentTimeMillis();
        Directory direct = new MMapDirectory(new File(GLOBAL_PINYIN_SUGGEST));
        // Use the terms of the "name" field as the dictionary
        LuceneDictionary ld = new LuceneDictionary(DirectoryReader.open(direct), "name");
        ld.getEntryIterator();
        Directory spd = FSDirectory.open(new File(SPELL_CHECK_FOLDER));
        SpellChecker sc = new SpellChecker(spd);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_30, null);
        // Write the index to the spellcheck directory
        sc.indexDictionary(ld, iwc, true);
        sc.close();
        long end = System.currentTimeMillis();
        System.out.println("Index completed, time taken: " + (end - start) + " ms");
    }

    // Build the spell-check index from the global index
    public void testIndex() throws IOException {
        long start = System.currentTimeMillis();
        String indexDir = "O:\\searchwork_custom\\data_index\\GlobalIndex\\";
        Directory direct = new MMapDirectory(new File(indexDir));
        LuceneDictionary ld = new LuceneDictionary(DirectoryReader.open(direct), "name");
        ld.getEntryIterator();
        Directory spd = FSDirectory.open(new File(SPELL_CHECK_FOLDER));
        SpellChecker sc = new SpellChecker(spd);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_30, null);
        sc.indexDictionary(ld, iwc, true);
        sc.close();
        long end = System.currentTimeMillis();
        System.out.println("Index completed, time taken: " + (end - start) + " ms");
    }

    public void testSearch(String wd) throws IOException {
        // Open the spell-check directory
        Directory spd = FSDirectory.open(new File(SPELL_CHECK_FOLDER));
        // Instantiate the SpellChecker component
        SpellChecker sc = new SpellChecker(spd);
        // Get up to 10 of the closest suggestions for the input keyword.
        // The third argument is the accuracy: the larger it is, the closer
        // a match must be. In practice it needs tuning.
        String[] suggests = sc.suggestSimilar(wd, 10, 0.6f);
        if (suggests != null) {
            for (String word : suggests) {
                System.out.println("Did you mean: " + word);
            }
        }
        sc.close();
    }

    /**
     * @param args
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        GlobalSuggest spellCheck = new GlobalSuggest();
        // spellCheck.testIndexPinyin2008();
        spellCheck.testSearch("Beijing Peking Duck");
        // spellCheck.testSearch("Beijng");
    }
}
The index-building code is:
// Build the spell-check index
public void testIndexPinyin2008() throws IOException {
    long start = System.currentTimeMillis();
    Directory direct = new MMapDirectory(new File(GLOBAL_PINYIN_SUGGEST));
    LuceneDictionary ld = new LuceneDictionary(DirectoryReader.open(direct), "name");
    ld.getEntryIterator();
    Directory spd = FSDirectory.open(new File(SPELL_CHECK_FOLDER));
    SpellChecker sc = new SpellChecker(spd);
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_30, null);
    // Write the index to the spellcheck directory
    sc.indexDictionary(ld, iwc, true);
    sc.close();
    long end = System.currentTimeMillis();
    System.out.println("Index completed, time taken: " + (end - start) + " ms");
}
This code builds the index that SpellChecker needs from an existing index.
The code that queries the SpellChecker index is as follows:
// Open the spell-check directory
Directory spd = FSDirectory.open(new File(SPELL_CHECK_FOLDER));
// Instantiate the SpellChecker component
SpellChecker sc = new SpellChecker(spd);
// Get up to 10 of the closest suggestions for the input keyword.
// The third argument is the accuracy: the larger it is, the closer
// a match must be. In practice it needs tuning.
String[] suggests = sc.suggestSimilar(wd, 10, 0.6f);
if (suggests != null) {
    for (String word : suggests) {
        System.out.println("Did you mean: " + word);
    }
}
Similarity algorithm: the default is LevensteinDistance.
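To make the 0.6f accuracy argument of suggestSimilar more concrete, here is a minimal plain-Java sketch of an edit-distance similarity score. It is not Lucene's source: the class name EditSimilarity and its methods are made up for illustration, and the normalization (1 minus the distance divided by the longer length) reflects my understanding of how LevensteinDistance scores a pair, so treat it as an assumption. The candidate words in main are also just sample data.

```java
// Illustrative sketch only; EditSimilarity is a hypothetical helper, not a Lucene class.
final class EditSimilarity {

    /** Levenshtein distance: insertions, deletions and substitutions each cost 1. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Similarity in [0, 1]; 1 means identical (assumed normalization, see lead-in). */
    static float similarity(String a, String b) {
        int maxLen = Math.max(a.length(), b.length());
        if (maxLen == 0) {
            return 1.0f;
        }
        return 1.0f - (float) distance(a, b) / maxLen;
    }

    public static void main(String[] args) {
        float accuracy = 0.6f; // same threshold as in the query code
        String typed = "Beijng";
        for (String candidate : new String[] {"Beijing", "Berlin"}) {
            float score = similarity(typed, candidate);
            String verdict = score >= accuracy ? "kept" : "dropped";
            System.out.println(candidate + " -> " + score + " (" + verdict + ")");
        }
    }
}
```

Under this scoring, "Beijng" is one edit away from "Beijing", so it scores well above 0.6, while "Berlin" needs four edits and falls below the threshold. Raising the accuracy makes the filter stricter, which is why the value needs tuning per data set.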
Query examples:
1. Querying Chinese characters that contain a typo:
2. Querying pinyin:
3. Querying a mix of pinyin and Chinese characters:
(Note: this exposes a problem. Input that mixes pinyin and Chinese characters is not handled; if you want to support it, some extra processing is needed.)
4. Querying a long string of Chinese characters with a typo in the middle:
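For the mixed pinyin and Chinese case in example 3, one possible kind of extra processing is to split the input into runs of Chinese characters and runs of everything else, so that each run can be checked on its own. The sketch below is only an assumption about what such a treatment could look like; the post does not say which approach was used, and ScriptSplitter and splitByScript are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only; ScriptSplitter is a hypothetical helper, not part of Lucene.
final class ScriptSplitter {

    /** Splits input into maximal runs of Han characters and runs of non-Han characters. */
    static List<String> splitByScript(String input) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean currentIsHan = false;
        for (int i = 0; i < input.length(); ) {
            int cp = input.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isWhitespace(cp)) {
                // Whitespace always ends the current run and is dropped
                if (current.length() > 0) {
                    runs.add(current.toString());
                    current.setLength(0);
                }
                continue;
            }
            boolean isHan = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (current.length() > 0 && isHan != currentIsHan) {
                // Script changed: close the previous run
                runs.add(current.toString());
                current.setLength(0);
            }
            currentIsHan = isHan;
            current.appendCodePoint(cp);
        }
        if (current.length() > 0) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        // "beijing" followed by two Chinese characters, written as Unicode escapes
        for (String run : splitByScript("beijing\u70e4\u9e2d")) {
            System.out.println(run);
        }
    }
}
```

Each run could then be passed to suggestSimilar separately and the results recombined. Whether that gives useful suggestions depends on what the spell-check index contains, so this would need testing against real data.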
Summary: SpellChecker's capabilities still appear limited. If you need more, you may have to modify it.
Lucene 4.x spellcheck instructions for use