Step by Step, Learn Lucene with Me (9) --- Lucene Search Spelling and Similarity Query Suggestions (spellcheck)

Source: Internet
Author: User

Application scenarios for suggest

User input is unpredictable, yet when we write programs we usually want users to search with specific content or in a specific format, so it becomes necessary to intervene in the user's search criteria. When using search engines such as Baidu or Google, you will often see a dropdown under the search box suggesting related content as you type. Lucene anticipated exactly this need: the suggest package it provides is designed to solve these problems.

Introduction to the suggest package

The suggest package provides support for Lucene auto-completion and spell checking;

Spell-checking-related classes are under the org.apache.lucene.search.spell package;

Auto-completion (word association) classes are under the org.apache.lucene.search.suggest package;

Analyzer-based auto-completion classes are under the org.apache.lucene.search.suggest.analyzing package;


The principle of spell checking
    • Lucene's spell checking is supported by the org.apache.lucene.search.spell.SpellChecker class;
    • SpellChecker uses a default accuracy of 0.5; if we need finer-grained control, we can change it by calling setAccuracy(float accuracy);
    • SpellChecker builds its spell index from words taken from an external source;

These sources include:

DocumentDictionary: reads the values of a field from Documents;

FileDictionary: a dictionary based on a text file, one entry per line, with the fields of an entry separated by "\t" tab characters; an entry may not contain more than two separators;

HighFrequencyDictionary: reads term values from an existing index and checks their number of occurrences;

LuceneDictionary: also reads term values from an existing index, but does not check the number of occurrences;

PlainTextDictionary: reads from a text file, line by line, with no delimiters;
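The FileDictionary line format described above can be sketched in pure Java as follows. This is only an illustration of the "entry, tab, weight" convention, not Lucene's actual parser; the class and method names are hypothetical.

```java
public class FileDictionarySketch {
    // Parses one FileDictionary-style line: the entry text, then optional
    // tab-separated fields (e.g. a weight). Per the format description,
    // an entry may not contain more than two separators.
    static Object[] parseLine(String line) {
        String[] fields = line.split("\t");
        if (fields.length > 3) { // more than two tab separators
            throw new IllegalArgumentException("too many separators: " + line);
        }
        long weight = fields.length >= 2 ? Long.parseLong(fields[1].trim()) : 1L;
        return new Object[] { fields[0], weight };
    }

    public static void main(String[] args) {
        Object[] entry = parseLine("Mercedes Benz\t101");
        System.out.println(entry[0] + " -> " + entry[1]); // Mercedes Benz -> 101
    }
}
```

An entry without a tab simply gets a default weight of 1 in this sketch.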

The principle of its index is as follows:

    1. The indexing process is wrapped in a synchronized block;
    2. It checks whether the SpellChecker has been closed; if it has, an exception is thrown with the message "Spellchecker has been closed";
    3. It iterates over the entries of the external source, checking the length of each word; words shorter than three characters are ignored. For each remaining word, a Document object is built and indexed to the local spell index, and the word is split into grams as the index is created (the addGram method). The execution process is as follows:
/**
 * Indexes the data from the given {@link Dictionary}.
 * @param dict Dictionary to index
 * @param config {@link IndexWriterConfig} to use
 * @param fullMerge whether or not the spellcheck index should be fully merged
 * @throws AlreadyClosedException if the SpellChecker is already closed
 * @throws IOException if there is a low-level I/O error.
 */
public final void indexDictionary(Dictionary dict, IndexWriterConfig config, boolean fullMerge) throws IOException {
  synchronized (modifyCurrentIndexLock) {
    ensureOpen();
    final Directory dir = this.spellIndex;
    final IndexWriter writer = new IndexWriter(dir, config);
    IndexSearcher indexSearcher = obtainSearcher();
    final List<TermsEnum> termsEnums = new ArrayList<>();

    final IndexReader reader = searcher.getIndexReader();
    if (reader.maxDoc() > 0) {
      for (final LeafReaderContext ctx : reader.leaves()) {
        Terms terms = ctx.reader().terms(F_WORD);
        if (terms != null)
          termsEnums.add(terms.iterator(null));
      }
    }

    boolean isEmpty = termsEnums.isEmpty();

    try {
      BytesRefIterator iter = dict.getEntryIterator();
      BytesRef currentTerm;

      terms: while ((currentTerm = iter.next()) != null) {
        String word = currentTerm.utf8ToString();
        int len = word.length();
        if (len < 3) {
          continue; // too short we bail but "too long" is fine...
        }

        if (!isEmpty) {
          for (TermsEnum te : termsEnums) {
            if (te.seekExact(currentTerm)) {
              continue terms;
            }
          }
        }

        // ok index the word
        Document doc = createDocument(word, getMin(len), getMax(len));
        writer.addDocument(doc);
      }
    } finally {
      releaseSearcher(indexSearcher);
    }
    if (fullMerge) {
      writer.forceMerge(1);
    }
    // close writer
    writer.close();
    // TODO: this isn't that great, maybe in the future SpellChecker should take
    // IWC in its ctor / keep its writer open?

    // also re-open the spell index to see our own changes when the next suggestion
    // is fetched:
    swapSearcher(dir);
  }
}

The method that traverses each word and splits it into grams is addGram.

Looking at its code, the suggestion index records not only the grams at the start of each word but also those at the end: the first gram is indexed as a "start" gram and the last as an "end" gram;
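As a rough illustration of this idea (a hedged sketch, not Lucene's actual addGram code; the class and method names are hypothetical), splitting a word into contiguous n-grams while keeping the first gram as the "start" gram and the last as the "end" gram looks like this:

```java
public class NGramSketch {
    // Split a word into contiguous n-grams of length n, in order.
    // E.g. formGrams("lucene", 3) -> ["luc", "uce", "cen", "ene"]
    static String[] formGrams(String word, int n) {
        int len = word.length() - n + 1;
        if (len < 1) return new String[0]; // word shorter than the gram size
        String[] grams = new String[len];
        for (int i = 0; i < len; i++) {
            grams[i] = word.substring(i, i + n);
        }
        return grams;
    }

    public static void main(String[] args) {
        String word = "lucene";
        for (int n = 3; n <= 4; n++) {
            String[] grams = formGrams(word, n);
            // the first gram doubles as the "start" gram, the last as the
            // "end" gram, mirroring the startN/endN-style fields the spell
            // index stores alongside the plain gramN field
            System.out.println("gram" + n + ": " + String.join(",", grams)
                + " start=" + grams[0] + " end=" + grams[grams.length - 1]);
        }
    }
}
```

Boosting the start and end grams at query time is what lets words sharing a prefix or suffix with the misspelled input score higher.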

  • For a suggestion query, it first checks whether the indexed grams contain grams of the word being queried; matching candidates are placed into a SuggestWordQueue, and the final String[] result is produced by traversing that queue. The code is implemented as follows:
public String[] suggestSimilar(String word, int numSug, IndexReader ir, String field,
    SuggestMode suggestMode, float accuracy) throws IOException {
  // obtainSearcher calls ensureOpen
  final IndexSearcher indexSearcher = obtainSearcher();
  try {
    if (ir == null || field == null) {
      suggestMode = SuggestMode.SUGGEST_ALWAYS;
    }
    if (suggestMode == SuggestMode.SUGGEST_ALWAYS) {
      ir = null;
      field = null;
    }

    final int lengthWord = word.length();

    final int freq = (ir != null && field != null) ? ir.docFreq(new Term(field, word)) : 0;
    final int goalFreq = suggestMode == SuggestMode.SUGGEST_MORE_POPULAR ? freq : 0;
    // if the word exists in the real index and we don't care for word frequency, return the word itself
    if (suggestMode == SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX && freq > 0) {
      return new String[] { word };
    }

    BooleanQuery query = new BooleanQuery();
    String[] grams;
    String key;

    for (int ng = getMin(lengthWord); ng <= getMax(lengthWord); ng++) {
      key = "gram" + ng; // form key
      grams = formGrams(word, ng); // form word into ngrams (allow dups too)
      if (grams.length == 0) {
        continue; // hmm
      }
      if (bStart > 0) { // should we boost prefixes?
        add(query, "start" + ng, grams[0], bStart); // matches start of word
      }
      if (bEnd > 0) { // should we boost suffixes
        add(query, "end" + ng, grams[grams.length - 1], bEnd); // matches end of word
      }
      for (int i = 0; i < grams.length; i++) {
        add(query, key, grams[i]);
      }
    }

    int maxHits = 10 * numSug;

    System.out.println("Q: " + query);
    ScoreDoc[] hits = indexSearcher.search(query, maxHits).scoreDocs;
    System.out.println("HITS: " + hits.length);

    SuggestWordQueue sugQueue = new SuggestWordQueue(numSug, comparator);

    // go thru more than 'maxr' matches in case the distance filter triggers
    int stop = Math.min(hits.length, maxHits);
    SuggestWord sugWord = new SuggestWord();
    for (int i = 0; i < stop; i++) {
      sugWord.string = indexSearcher.doc(hits[i].doc).get(F_WORD); // get orig word

      // don't suggest a word for itself, that would be silly
      if (sugWord.string.equals(word)) {
        continue;
      }

      // edit distance
      sugWord.score = sd.getDistance(word, sugWord.string);
      if (sugWord.score < accuracy) {
        continue;
      }

      if (ir != null && field != null) { // use the user index
        sugWord.freq = ir.docFreq(new Term(field, sugWord.string)); // freq in the index
        // don't suggest a word that is not present in the field
        if ((suggestMode == SuggestMode.SUGGEST_MORE_POPULAR && goalFreq > sugWord.freq)
            || sugWord.freq < 1) {
          continue;
        }
      }
      sugQueue.insertWithOverflow(sugWord);
      if (sugQueue.size() == numSug) {
        // if queue full, maintain the minScore score
        accuracy = sugQueue.top().score;
      }
      sugWord = new SuggestWord();
    }

    // convert to array string
    String[] list = new String[sugQueue.size()];
    for (int i = sugQueue.size() - 1; i >= 0; i--) {
      list[i] = sugQueue.pop().string;
    }
    return list;
  } finally {
    releaseSearcher(indexSearcher);
  }
}
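The sugWord.score < accuracy check above is where the accuracy setting (default 0.5) takes effect, applied to a string-distance score. As a minimal pure-Java sketch of such a score (an illustration of a normalized Levenshtein similarity, not Lucene's exact StringDistance implementation; the class and method names are hypothetical):

```java
public class DistanceSketch {
    // Normalized Levenshtein similarity: 1 - editDistance / maxLength.
    // Scores range from 0 (unrelated) to 1 (identical); a spell checker
    // drops candidates whose score falls below the configured accuracy.
    static float similarity(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        int max = Math.max(a.length(), b.length());
        return max == 0 ? 1f : 1f - (float) d[a.length()][b.length()] / max;
    }

    public static void main(String[] args) {
        System.out.println(similarity("lucene", "lucene")); // 1.0
        System.out.println(similarity("lucene", "lucane")); // one edit out of six, about 0.83
        System.out.println(similarity("lucene", "solr") >= 0.5f); // false: filtered at default accuracy
    }
}
```

Lowering setAccuracy (as the test program below does with 0f) widens the net, admitting candidates that are only loosely similar to the input.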
Programming practices

Here's a test program I wrote based on the FileDictionary description.

package com.lucene.search;

import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.search.spell.SpellChecker;
import org.apache.lucene.search.suggest.FileDictionary;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.wltea.analyzer.lucene.IKAnalyzer;

public class SuggestUtil {
  public static void main(String[] args) {
    Directory spellIndexDirectory;
    try {
      spellIndexDirectory = FSDirectory.open(Paths.get("suggest"));
      SpellChecker spellChecker = new SpellChecker(spellIndexDirectory);
      Analyzer analyzer = new IKAnalyzer(true);
      IndexWriterConfig config = new IndexWriterConfig(analyzer);
      config.setOpenMode(OpenMode.CREATE_OR_APPEND);
      spellChecker.setAccuracy(0f);
      // HighFrequencyDictionary dict = new HighFrequencyDictionary(reader, field, thresh);
      spellChecker.indexDictionary(
          new FileDictionary(new FileInputStream("D:\\hadoop\\lucene_suggest\\src\\suggest.txt")),
          config, false);
      String[] similars = spellChecker.suggestSimilar("China", 10);
      for (String similar : similars) {
        System.out.println(similar);
      }
      spellIndexDirectory.close();
      spellChecker.close();
    } catch (IOException e) {
      e.printStackTrace();
    }
  }
}


The content of my suggest.txt is:

Chinese people	100
Mercedes Benz	101
Mercedes China	102
Mercedes S Class	103
Mercedes A Class	104
Mercedes C Class	105

The test results are:

Chinese people
Mercedes China

The "Step by Step, Learn Lucene with Me" series is a summary of my recent work with Lucene indexing. If you have questions, contact me on QQ: 891922381, or join my new QQ group: 106570134 (Lucene, Solr, Netty, Hadoop); if you join, it would be greatly appreciated, and we can discuss together. I will strive to post daily, and I hope you keep following along.



