Lucene Tokenization Flow


I took time this week to study Lucene/Solr, and today I am writing up a summary to record the important points, so that I won't be confused by them in the future.


This article is a simple introduction to Lucene's tokenization (analysis) process, with some brief explanation of the principles behind it. Corrections from readers are greatly appreciated!


(a) The main analyzers

WhitespaceAnalyzer, StopAnalyzer, SimpleAnalyzer, and KeywordAnalyzer are all Analyzers. The Analyzer class declares an abstract method called tokenStream:

package org.apache.lucene.analysis;

import java.io.Reader;
import java.io.IOException;
import java.io.Closeable;
import java.lang.reflect.Modifier;
import org.apache.lucene.util.CloseableThreadLocal;
import org.apache.lucene.store.AlreadyClosedException;
import org.apache.lucene.document.Fieldable;

/**
 * An Analyzer builds TokenStreams, which analyze text. It thus represents a
 * policy for extracting index terms from text.
 * <p>
 * Typical implementations first build a Tokenizer, which breaks the stream of
 * characters from the Reader into raw Tokens. One or more TokenFilters are
 * applied to the output of the Tokenizer.
 * <p>
 * The {@code Analyzer}-API in Lucene is based on the decorator pattern.
 * Therefore all non-abstract subclasses must be final or their {@link #tokenStream}
 * and {@link #reusableTokenStream} implementations must be final! This is checked
 * when Java assertions are enabled.
 */
public abstract class Analyzer implements Closeable {

  // ... only the key part is extracted here

  /**
   * Creates a TokenStream which tokenizes all the text in the provided
   * Reader. Must be able to handle null field name for backward compatibility.
   */
  public abstract TokenStream tokenStream(String fieldName, Reader reader);
}
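To see what these analyzers actually do, here is a minimal sketch that runs the same sentence through all four of them and prints the resulting tokens. It assumes the Lucene 3.x API quoted in this article; the version constant Version.LUCENE_36, the field name, and the sample text are my own choices, not from the original post.

import java.io.StringReader;
import org.apache.lucene.analysis.*;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.Version;

public class AnalyzerDemo {
  public static void main(String[] args) throws Exception {
    String text = "The Quick Brown-Fox, 42 jumps!";
    Analyzer[] analyzers = {
        new WhitespaceAnalyzer(Version.LUCENE_36), // splits on whitespace only
        new SimpleAnalyzer(Version.LUCENE_36),     // letters only, lowercased
        new StopAnalyzer(Version.LUCENE_36),       // like SimpleAnalyzer, minus stop words
        new KeywordAnalyzer()                      // whole input as a single token
    };
    for (Analyzer a : analyzers) {
      TokenStream ts = a.tokenStream("f", new StringReader(text));
      CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
      ts.reset();
      System.out.print(a.getClass().getSimpleName() + ": ");
      while (ts.incrementToken()) {   // pull tokens until the stream is exhausted
        System.out.print("[" + term + "] ");
      }
      ts.end();
      ts.close();
      System.out.println();
    }
  }
}

WhitespaceAnalyzer keeps case and punctuation, SimpleAnalyzer lowercases and splits on non-letters, StopAnalyzer additionally drops common English stop words such as "the", and KeywordAnalyzer emits the input unchanged as one token.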

TokenStream has two subclasses, Tokenizer and TokenFilter. As the names suggest, the former turns a stretch of text into token units (the Analyzer hands it the Reader), which then pass through a series of filters to finally produce the complete TokenStream. See the class diagram below.




(Figure: the common Tokenizer classes)
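The TokenFilter half of this picture is the decorator pattern in action: a filter wraps an upstream TokenStream and transforms each token that flows through it. Here is a hedged sketch of such a filter; the class ReverseFilter is hypothetical, invented for illustration, and is not part of Lucene.

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class ReverseFilter extends TokenFilter {
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);

  protected ReverseFilter(TokenStream input) {
    super(input); // decorate the upstream tokenizer or filter
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) return false;   // pull the next token from upstream
    String reversed = new StringBuilder(termAtt.toString()).reverse().toString();
    termAtt.setEmpty().append(reversed);         // rewrite the term text in place
    return true;
  }
}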


Next, a brief walk through the SimpleAnalyzer flow. Let's look at its source code:

public final class SimpleAnalyzer extends ReusableAnalyzerBase {

  private final Version matchVersion;

  /**
   * Creates a new {@link SimpleAnalyzer}
   * @param matchVersion Lucene version to match; see above
   */
  public SimpleAnalyzer(Version matchVersion) {
    this.matchVersion = matchVersion;
  }

  /**
   * Creates a new {@link SimpleAnalyzer}
   * @deprecated use {@link #SimpleAnalyzer(Version)} instead
   */
  @Deprecated
  public SimpleAnalyzer() {
    this(Version.LUCENE_30);
  }

  @Override
  protected TokenStreamComponents createComponents(final String fieldName,
      final Reader reader) {
    // Overrides the method by which the parent class creates the TokenStream,
    // passing in a LowerCaseTokenizer -- that is why letters are converted to lowercase.
    return new TokenStreamComponents(new LowerCaseTokenizer(matchVersion, reader));
  }
}
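For comparison, here is a hedged sketch of a custom Analyzer in the same style. The class name MyLetterAnalyzer is made up for illustration; the point is that overriding createComponents is all it takes to decide which Tokenizer (and, later, which filters) make up the stream.

import java.io.Reader;
import org.apache.lucene.analysis.LetterTokenizer;
import org.apache.lucene.analysis.ReusableAnalyzerBase;
import org.apache.lucene.util.Version;

public final class MyLetterAnalyzer extends ReusableAnalyzerBase {
  @Override
  protected TokenStreamComponents createComponents(String fieldName, Reader reader) {
    // like SimpleAnalyzer, but without the lowercasing: a bare LetterTokenizer
    return new TokenStreamComponents(new LetterTokenizer(Version.LUCENE_36, reader));
  }
}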

As the class diagram shows, these tokenizers all share a parent class called CharTokenizer. What does this class do? As its name suggests, it splits text character by character. The TokenStream class declares a method called incrementToken:

public abstract class TokenStream extends AttributeSource implements Closeable {

  /**
   * Consumers (i.e., {@link IndexWriter}) use this method to advance the stream to
   * the next token. Implementing classes must implement this method and update
   * the appropriate {@link AttributeImpl}s with the attributes of the next token.
   * <p>
   * The producer must make no assumptions about the attributes after the method
   * has been returned: the caller may arbitrarily change it. If the producer
   * needs to preserve the state for subsequent calls, it can use
   * {@link #captureState} to create a copy of the current attribute state.
   * <p>
   * This method is called for every token of a document, so an efficient
   * implementation is crucial for good performance. To avoid calls to
   * {@link #addAttribute(Class)} and {@link #getAttribute(Class)},
   * references to all {@link AttributeImpl}s that this stream uses should be
   * retrieved during instantiation.
   * <p>
   * To ensure that filters and consumers know which attributes are available,
   * the attributes must be added during instantiation. Filters and consumers
   * are not required to check for availability of attributes in
   * {@link #incrementToken()}.
   *
   * @return false for end of stream; true otherwise
   */
  public abstract boolean incrementToken() throws IOException;
}

"Use this method to advance the stream to the next token. Implementing classes must implement this method and update the appropriate AttributeImpls with the attributes of the next token."

The Javadoc makes it clear that every subclass must implement this method to advance the stream to the next token, returning a boolean (false at end of stream).
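From the consumer side, the contract looks like this. A minimal sketch, assuming the Lucene 3.x API quoted in this article; the version constant, field name, and sample text are my own assumptions.

TokenStream ts = new SimpleAnalyzer(Version.LUCENE_36)
    .tokenStream("content", new StringReader("Hello Lucene World"));
CharTermAttribute termAtt = ts.addAttribute(CharTermAttribute.class);
ts.reset();                       // prepare the stream for consumption
while (ts.incrementToken()) {     // returns false at end of stream
  System.out.println(termAtt);    // prints: hello, lucene, world
}
ts.end();                         // record the final offset state
ts.close();                       // release the underlying Reader

With the consumer contract in mind, here is CharTokenizer's incrementToken implementation: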

@Override
public final boolean incrementToken() throws IOException {
  clearAttributes();
  if (useOldAPI) // TODO remove this in LUCENE 4.0
    return incrementTokenOld();
  int length = 0;
  int start = -1; // this variable is always initialized
  char[] buffer = termAtt.buffer();
  while (true) {
    if (bufferIndex >= dataLen) {
      offset += dataLen;
      if (!charUtils.fill(ioBuffer, input)) { // read supplementary char aware with CharacterUtils
        dataLen = 0; // so next offset += dataLen won't decrement offset
        if (length > 0) {
          break;
        } else {
          finalOffset = correctOffset(offset);
          return false;
        }
      }
      dataLen = ioBuffer.getLength();
      bufferIndex = 0;
    }
    // use CharacterUtils here to support < 3.1 UTF-16 code unit behavior if the char based methods are gone
    final int c = charUtils.codePointAt(ioBuffer.getBuffer(), bufferIndex);
    bufferIndex += Character.charCount(c);

    if (isTokenChar(c)) {                     // if it's a token char
      if (length == 0) {                      // start of token
        assert start == -1;
        start = offset + bufferIndex - 1;
      } else if (length >= buffer.length - 1) { // check if a supplementary could run out of bounds
        buffer = termAtt.resizeBuffer(2 + length); // make sure a supplementary fits in the buffer
      }
      length += Character.toChars(normalize(c), buffer, length); // buffer it, normalized
      if (length >= MAX_WORD_LEN) // buffer overflow! make sure to check for >= surrogate pair could break == test
        break;
    } else if (length > 0)                    // at non-Letter w/ chars
      break;                                  // return 'em
  }

  termAtt.setLength(length);
  assert start != -1;
  offsetAtt.setOffset(correctOffset(start), finalOffset = correctOffset(start + length));
  return true;
}
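All CharTokenizer asks of a subclass is the answer to two questions: which code points belong to a token (isTokenChar) and how each one should be normalized (normalize). A hedged sketch that reproduces LowerCaseTokenizer's behavior; the class name is hypothetical.

import java.io.Reader;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.util.Version;

public final class MyLowerCaseTokenizer extends CharTokenizer {
  public MyLowerCaseTokenizer(Version matchVersion, Reader in) {
    super(matchVersion, in);
  }

  @Override
  protected boolean isTokenChar(int c) {
    return Character.isLetter(c);      // letters are token chars; anything else ends a token
  }

  @Override
  protected int normalize(int c) {
    return Character.toLowerCase(c);   // lowercase each code point as it is buffered
  }
}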

So SimpleAnalyzer's stream consists of a single Tokenizer with no filters applied. For contrast, look at StopAnalyzer's createComponents method:

@Override
protected TokenStreamComponents createComponents(String fieldName,
    Reader reader) {
  final Tokenizer source = new LowerCaseTokenizer(matchVersion, reader);
  // after lowercasing, the stream additionally passes through a StopFilter
  return new TokenStreamComponents(source, new StopFilter(matchVersion,
      source, stopwords));
}


Only after passing through the filters does the output become the final TokenStream. So what are the common filter types?



Common filters include StopFilter, LowerCaseFilter, and so on. The Tokenizer's raw stream must pass through these filters before it becomes the final TokenStream.
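A hedged sketch of wiring such a chain by hand, in the same spirit as StopAnalyzer's createComponents above (Lucene 3.x class names; the version constant and sample text are assumptions):

Reader reader = new StringReader("The quick brown fox");
TokenStream stream = new LetterTokenizer(Version.LUCENE_36, reader);  // raw tokens
stream = new LowerCaseFilter(Version.LUCENE_36, stream);              // normalize case
stream = new StopFilter(Version.LUCENE_36, stream,
    StopAnalyzer.ENGLISH_STOP_WORDS_SET);                             // drop stop words like "the"
// 'stream' is now the final TokenStream; consuming it yields: quick, brown, fox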


(b) How token information is stored

Here we have to mention three classes: CharTermAttribute (stores the token text itself), OffsetAttribute (stores the token's character offsets), and PositionIncrementAttribute (stores the position increment between one token and the next).

With these three attributes you can pin down a token's exact location in a document. For example, a sentence like "how are you thank you" is actually stored in Lucene as a sequence of terms, each carrying its character offsets and a position increment, so the token positions advance 1, 2, 3, and so on.
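A hedged sketch that prints all three attributes for this example sentence (Lucene 3.x API; the version constant and field name are assumptions; the printed values follow from SimpleAnalyzer's behavior):

TokenStream ts = new SimpleAnalyzer(Version.LUCENE_36)
    .tokenStream("f", new StringReader("how are you thank you"));
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
PositionIncrementAttribute posInc = ts.addAttribute(PositionIncrementAttribute.class);
ts.reset();
while (ts.incrementToken()) {
  System.out.println(term + " [" + offset.startOffset() + "," + offset.endOffset()
      + ") +" + posInc.getPositionIncrement());
}
// Expected output (each token advances the position by 1):
// how [0,3) +1
// are [4,7) +1
// you [8,11) +1
// thank [12,17) +1
// you [18,21) +1
ts.end();
ts.close();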


These attributes are managed by a class called AttributeSource, which holds them. Inside it there is a static inner class called State, which stores the current attribute state of the stream. We can capture the current state later in processing using the following method:

/**
 * Captures the state of all Attributes. The return value can be passed to
 * {@link #restoreState} to restore the state of this or another AttributeSource.
 */
public State captureState() {
  final State state = this.getCurrentState();
  return (state == null) ? null : (State) state.clone();
}
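captureState and restoreState are exactly what a synonym-injecting filter needs: remember the attributes of the current token, then emit a second token restored from that state with a position increment of 0, so it shares the original token's position and offsets. A hedged sketch; the class name is hypothetical and, for simplicity, one synonym per term is assumed.

import java.io.IOException;
import java.util.Map;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public final class SimpleSynonymFilter extends TokenFilter {
  private final Map<String, String> synonyms;  // term -> synonym (one per term, for simplicity)
  private final CharTermAttribute termAtt = addAttribute(CharTermAttribute.class);
  private final PositionIncrementAttribute posIncAtt = addAttribute(PositionIncrementAttribute.class);
  private State savedState;                    // state captured before emitting a synonym
  private String pendingSynonym;

  protected SimpleSynonymFilter(TokenStream input, Map<String, String> synonyms) {
    super(input);
    this.synonyms = synonyms;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (pendingSynonym != null) {
      restoreState(savedState);                 // same offsets as the original token
      termAtt.setEmpty().append(pendingSynonym);
      posIncAtt.setPositionIncrement(0);        // share the original token's position
      pendingSynonym = null;
      return true;
    }
    if (!input.incrementToken()) return false;
    String synonym = synonyms.get(termAtt.toString());
    if (synonym != null) {
      savedState = captureState();              // remember the current token's attributes
      pendingSynonym = synonym;
    }
    return true;
  }
}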
   

Once we can get the position information of these tokens, we can do a lot of things: for example, inject synonyms (add a token with the same offsets and a position increment of 0, so it occupies the same position), remove sensitive words, and so on! This concludes the first chapter.

Please credit http://blog.csdn.net/a837199685/article when reprinting.


