How to use SpellChecker in Lucene 4.x

Source: Internet
Author: User

SpellChecker is a feature of newer Lucene versions. Before introducing it, we need to know which data sources it supports. Building the spellcheck index requires passing in an implementation of the Dictionary interface:

  

    package org.apache.lucene.search.spell;

    /*
     * Licensed to the Apache Software Foundation (ASF) under one or more
     * contributor license agreements.  See the NOTICE file distributed with
     * this work for additional information regarding copyright ownership.
     * The ASF licenses this file to You under the Apache License, Version 2.0
     * (the "License"); you may not use this file except in compliance with
     * the License.  You may obtain a copy of the License at
     *
     *     http://www.apache.org/licenses/LICENSE-2.0
     *
     * Unless required by applicable law or agreed to in writing, software
     * distributed under the License is distributed on an "AS IS" BASIS,
     * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
     * See the License for the specific language governing permissions and
     * limitations under the License.
     */

    import java.io.IOException;

    import org.apache.lucene.search.suggest.InputIterator;

    /**
     * A simple interface representing a Dictionary. A Dictionary
     * here is a list of entries, where every entry consists of
     * term, weight and payload.
     */
    public interface Dictionary {

        /**
         * Returns an iterator over all the entries
         * @return Iterator
         */
        InputIterator getEntryIterator() throws IOException;
    }

Several Dictionary implementations are available. The two most commonly used are the text-file-based one (PlainTextDictionary) and the one based on an existing Lucene index (LuceneDictionary):

  

Here is the code I tested, covering both index building and index querying:

  

    package com.tianditu.com.search;

    import java.io.File;
    import java.io.IOException;

    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.search.spell.LuceneDictionary;
    import org.apache.lucene.search.spell.SpellChecker;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;
    import org.apache.lucene.store.MMapDirectory;
    import org.apache.lucene.util.Version;

    public class GlobalSuggest {

        // Index built by the spell checker
        private final String SPELL_CHECK_FOLDER = "c:\\spellcheck\\";

        // Existing index that the spellcheck index is based on
        private final String GLOBAL_PINYIN_SUGGEST = "o:\\searchwork_custom\\data_index\\pinyin2008\\";

        // Build the index
        public void testIndexPinyin2008() throws IOException {
            long start = System.currentTimeMillis();
            // Beijing Jiwei Times Software Co., Ltd.
            // String indexDir = "O:\\searchwork_custom\\data_index\\GlobalIndex\\";
            Directory direct = new MMapDirectory(new File(GLOBAL_PINYIN_SUGGEST));
            // Use the terms of the "name" field of the existing index as the dictionary
            LuceneDictionary ld = new LuceneDictionary(DirectoryReader.open(direct), "name");
            ld.getEntryIterator(); // result unused; indexDictionary obtains its own iterator
            Directory spd = FSDirectory.open(new File(SPELL_CHECK_FOLDER));
            SpellChecker sc = new SpellChecker(spd);
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_30, null);
            // Write the index to the spellcheck directory
            sc.indexDictionary(ld, iwc, true);
            sc.close();
            long end = System.currentTimeMillis();
            System.out.println("Index completed, time consuming: " + (end - start) + " ms");
        }

        public void testIndex() throws IOException {
            long start = System.currentTimeMillis();
            // Beijing Jiwei Times Software Co., Ltd.
            String indexDir = "O:\\searchwork_custom\\data_index\\GlobalIndex\\";
            Directory direct = new MMapDirectory(new File(indexDir));
            LuceneDictionary ld = new LuceneDictionary(DirectoryReader.open(direct), "name");
            ld.getEntryIterator(); // result unused; indexDictionary obtains its own iterator
            Directory spd = FSDirectory.open(new File(SPELL_CHECK_FOLDER));
            SpellChecker sc = new SpellChecker(spd);
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_30, null);
            sc.indexDictionary(ld, iwc, true);
            sc.close();
            long end = System.currentTimeMillis();
            System.out.println("Index completed, time consuming: " + (end - start) + " ms");
        }

        public void testSearch(String wd) throws IOException {
            // Open the spellcheck directory
            Directory spd = FSDirectory.open(new File(SPELL_CHECK_FOLDER));
            // Instantiate the SpellChecker component
            SpellChecker sc = new SpellChecker(spd);
            // Get up to n words closest to the input keyword. The third argument is
            // the accuracy: the larger it is, the closer a match must be. It needs
            // tuning in practice.
            String[] suggests = sc.suggestSimilar(wd, 10, 0.6f);
            if (suggests != null) {
                for (String word : suggests) {
                    System.out.println("Did you mean: " + word);
                }
            }
            sc.close();
        }

        /**
         * @param args
         * @throws IOException
         */
        public static void main(String[] args) throws IOException {
            GlobalSuggest spellCheck = new GlobalSuggest();
            // spellCheck.testIndexPinyin2008();
            spellCheck.testSearch("Beijing Peking Duck");
            // spellCheck.testSearch("Beijng");
        }
    }

The index-building code is:

  

    // Build the index
    public void testIndexPinyin2008() throws IOException {
        long start = System.currentTimeMillis();
        // Beijing Jiwei Times Software Co., Ltd.
        // String indexDir = "o:\\searchwork_custom\\data_index\\globalindex\\";
        Directory direct = new MMapDirectory(new File(GLOBAL_PINYIN_SUGGEST));
        LuceneDictionary ld = new LuceneDictionary(DirectoryReader.open(direct), "name");
        ld.getEntryIterator();
        Directory spd = FSDirectory.open(new File(SPELL_CHECK_FOLDER));
        SpellChecker sc = new SpellChecker(spd);
        IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_30, null);
        // Write the index to the spellcheck directory
        sc.indexDictionary(ld, iwc, true);
        sc.close();
        long end = System.currentTimeMillis();
        System.out.println("Index completed, time consuming: " + (end - start) + " ms");
    }

This code builds the index that SpellChecker needs, using an existing index as the dictionary source.
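The spellcheck index is kept separate from the source index because, as far as I understand it, SpellChecker stores each dictionary word together with its character n-grams, so that candidate words can later be found by the fragments they share with the query. The following is a minimal sketch of that idea only, not Lucene's code: the class NGramSketch, the method ngrams, and the gram size 3 are assumptions chosen for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified illustration of character n-grams. The names and the gram
// size used in main are hypothetical and not taken from Lucene's source.
final class NGramSketch {

    /** Returns every contiguous substring of length n, in order of position. */
    static List<String> ngrams(String word, int n) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + n <= word.length(); i++) {
            grams.add(word.substring(i, i + n));
        }
        return grams;
    }

    public static void main(String[] args) {
        // A correctly spelled word and a misspelling still share some grams,
        // which is what lets an n-gram index retrieve the former for the latter.
        System.out.println(ngrams("beijing", 3));
        System.out.println(ngrams("beijng", 3));
    }
}
```

In this sketch "beijing" and "beijng" share the grams "bei" and "eij", so a lookup on the misspelling's grams would still reach the correct word.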

The code that queries the spellcheck index is:

  

    // Open the spellcheck directory
    Directory spd = FSDirectory.open(new File(SPELL_CHECK_FOLDER));
    // Instantiate the SpellChecker component
    SpellChecker sc = new SpellChecker(spd);
    // Get up to n words closest to the input keyword. The third argument is
    // the accuracy: the larger it is, the closer a match must be. It needs
    // tuning in practice.
    String[] suggests = sc.suggestSimilar(wd, 10, 0.6f);
    if (suggests != null) {
        for (String word : suggests) {
            System.out.println("Did you mean: " + word);
        }
    }
    sc.close();

Similarity algorithm: the default is LevensteinDistance.
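To make the accuracy argument (0.6f in the code above) more concrete, here is a minimal sketch of an edit-distance score normalized to the range 0 to 1. It is written from the textbook definition of Levenshtein distance and is not Lucene's implementation; the class EditSimilarity and its methods are hypothetical names, and the normalization by the longer string's length is my assumption about how such a score is usually derived.

```java
// A simplified sketch of a normalized edit-distance score; not Lucene's source.
final class EditSimilarity {

    /** Classic Levenshtein distance: minimum number of single-character edits. */
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // turning "" into the first j characters of b takes j inserts
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i; // turning the first i characters of a into "" takes i deletes
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1), prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** Maps the distance to [0, 1]: 1.0 means identical strings. */
    static float similarity(String a, String b) {
        int longest = Math.max(a.length(), b.length());
        if (longest == 0) {
            return 1.0f;
        }
        return 1.0f - (float) distance(a, b) / longest;
    }

    public static void main(String[] args) {
        float score = similarity("beijng", "beijing");
        System.out.println(score + (score >= 0.6f ? " passes" : " fails") + " a 0.6 threshold");
    }
}
```

Under this scoring, a single missing letter in a seven-letter word costs one edit out of seven, so the pair scores well above 0.6; raising the threshold makes suggestions stricter, which is why the value needs tuning against real queries.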

     

Query samples:

1. A query in Chinese characters that contains a typo

2. A query in pinyin

3. A query that mixes pinyin and Chinese characters

(Note: this case exposes a problem. Queries that mix pinyin and Chinese characters are not handled, so such input needs some preprocessing before SpellChecker can be used on it.)

4. A long string of Chinese characters with a typo in the middle

    

Summary: SpellChecker's capabilities appear to be limited. Depending on the requirements, it may need to be modified or extended.
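As one possible form of the preprocessing mentioned in the note on mixed pinyin and Chinese input, the query could be split into runs of Han characters and runs of everything else, with each run then checked separately. The sketch below shows only the splitting step; ScriptSplitter and split are hypothetical names, nothing here is part of Lucene, and how the separate suggestions should be recombined is left open.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical preprocessing helper; not part of Lucene.
final class ScriptSplitter {

    /** Splits text into maximal runs of Han characters and runs of everything else. */
    static List<String> split(String text) {
        List<String> runs = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        Boolean currentIsHan = null; // null until the first character is seen
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            boolean isHan = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            // A change of script closes the current run and starts a new one
            if (currentIsHan != null && isHan != currentIsHan) {
                runs.add(current.toString());
                current.setLength(0);
            }
            current.appendCodePoint(cp);
            currentIsHan = isHan;
            i += Character.charCount(cp);
        }
        if (current.length() > 0) {
            runs.add(current.toString());
        }
        return runs;
    }

    public static void main(String[] args) {
        // The escapes spell the two Han characters for Beijing, followed by pinyin.
        System.out.println(split("\u5317\u4eackaoya"));
    }
}
```

Each resulting run could be passed to suggestSimilar on its own, so that a pinyin fragment is only compared against pinyin entries and a Han fragment only against Han entries.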

      

  

  

  

    
