Study Notes (III): Lucene Analyzers (Word Segmentation)

Source: Internet
Author: User

Lucene analyzer API

org.apache.lucene.analysis.Analyzer

Analyzer is the core API of the analysis component. It is responsible for building the TokenStream (the token processor) that actually performs the word segmentation. Call either of the following two methods, passing in the text to be analyzed, to obtain the TokenStream:

    public final TokenStream tokenStream(String fieldName, Reader reader)
    public final TokenStream tokenStream(String fieldName, String text)
    TokenStreamComponents createComponents(String fieldName)

This is the only abstract method in Analyzer and serves as its extension point. To implement your own analyzer, provide an implementation of this method.

Parameter: fieldName. If different fields need different token processor components, use this parameter to decide which to create. Otherwise the parameter can be ignored.

Return value: the TokenStreamComponents token processor component.

The createComponents method is where we create the token processor component we want.

TokenStreamComponents

Token processor component: this class wraps the TokenStream that is handed out for external use. It exposes two properties, source and sink, through which the tokenizer and the final TokenStream can be accessed.

Source code:

    public static class TokenStreamComponents {
        // The tokenizer that reads the raw character stream.
        protected final Tokenizer source;
        // The last stream in the chain; this is what callers consume.
        protected final TokenStream sink;
        transient ReusableStringReader reusableStringReader;

        public TokenStreamComponents(Tokenizer source, TokenStream result) {
            this.source = source;
            this.sink = result;
        }

        // With no filters, the tokenizer is both source and sink.
        public TokenStreamComponents(Tokenizer source) {
            this.source = source;
            this.sink = source;
        }

        protected void setReader(Reader reader) {
            this.source.setReader(reader);
        }

        public TokenStream getTokenStream() {
            return this.sink;
        }

        public Tokenizer getTokenizer() {
            return this.source;
        }
    }

org.apache.lucene.analysis.TokenStream

The token processor; it is responsible for segmenting and processing the input text.

Concept note: Token: an item split out of the character stream.

Concept note: Token attribute: a piece of information about a token, such as the term it contains, its position, and so on.

TokenStream has two kinds of subclasses:

Tokenizer: a TokenStream whose input is a Reader character stream; it splits the stream into tokens.

TokenFilter: a token filter whose input is another TokenStream; it applies special handling to the tokens flowing out of the previous TokenStream.
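
The Tokenizer-then-TokenFilter chain can be sketched without the Lucene jar. The following is a minimal stand-in, not Lucene code: MiniStream, MiniWhitespaceTokenizer, MiniLowerCaseFilter and MiniChainDemo are hypothetical names invented for illustration. One simplification to note is that each stand-in keeps its own term field, whereas the real classes share attribute objects through AttributeSource.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical stand-in for TokenStream: advances one token per call.
abstract class MiniStream {
    protected String term = "";

    String term() {
        return term;
    }

    // Returns false once no more tokens are available.
    abstract boolean incrementToken();
}

// Plays the role of a Tokenizer: its input is the raw text.
final class MiniWhitespaceTokenizer extends MiniStream {
    private final String[] parts;
    private int next = 0;

    MiniWhitespaceTokenizer(String text) {
        String trimmed = text.trim();
        parts = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    @Override
    boolean incrementToken() {
        if (next >= parts.length) {
            return false;
        }
        term = parts[next++];
        return true;
    }
}

// Plays the role of a TokenFilter: its input is another stream.
final class MiniLowerCaseFilter extends MiniStream {
    private final MiniStream input;

    MiniLowerCaseFilter(MiniStream input) {
        this.input = input;
    }

    @Override
    boolean incrementToken() {
        if (!input.incrementToken()) {
            return false;
        }
        // Post-process the token produced by the previous stage.
        term = input.term().toLowerCase(Locale.ROOT);
        return true;
    }
}

final class MiniChainDemo {
    static List<String> analyze(String text) {
        MiniStream source = new MiniWhitespaceTokenizer(text); // the "source"
        MiniStream sink = new MiniLowerCaseFilter(source);     // the "sink"
        List<String> terms = new ArrayList<>();
        while (sink.incrementToken()) {
            terms.add(sink.term());
        }
        return terms;
    }
}
```

Callers only ever pull tokens from the last stage (the sink); each filter pulls from the stage before it, which is why TokenStreamComponents keeps both ends of the chain.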

TokenStream extends AttributeSource.

Concept note: an Attribute (token attribute) holds a piece of information about a token, such as the token's term or the term's position. These values are computed by the various Tokenizer/TokenFilter stages as they process the text. Different Tokenizer/TokenFilter combinations produce different token information, so the set of attributes varies dynamically: you cannot know in advance how many there are or what they are. How, then, should a token's information be stored?

The answer is AttributeSource, Attribute, AttributeImpl, and AttributeFactory.

1. AttributeSource is responsible for storing Attribute objects and provides the methods to store and retrieve them.

2. An Attribute object can hold one or more pieces of attribute information.

3. AttributeFactory is the factory responsible for creating Attribute objects. By default TokenStream uses a factory obtained from AttributeFactory.getStaticImplementation, so we do not need to supply one; we only need to follow its rules.

AttributeSource usage rules
    • 1. To store token attributes, a TokenStream implementation adds Attribute objects to the AttributeSource through one of its two add methods:
      • <T extends Attribute> T addAttribute(Class<T> attClass): you pass the interface class (one that extends Attribute) of the attribute you want to add, and it returns an instance of the corresponding implementation class. Going from an interface to an instance is why AttributeFactory is needed.
      • void addAttributeImpl(AttributeImpl att)
    • 2. The AttributeSource holds only one instance of each Attribute implementation class. During tokenization that instance is reused for every token to hold the token's attribute information. Calling an add method again for the same attribute returns the instance that is already stored.
    • 3. To read a piece of token information, you need to hold the instance of the corresponding attribute: obtain the Attribute object with addAttribute or getAttribute, then call its methods to get or set values.
    • 4. When a TokenStream uses our own Attribute implementation together with the default factory, how does the factory know which implementation class to use when we call addAttribute? The following rules apply:
      • 1. The custom attribute interface (for example MyAttribute) must extend Attribute.
      • 2. The custom attribute implementation class must extend AttributeImpl and implement the custom interface MyAttribute.
      • 3. The custom attribute implementation class must provide a no-argument constructor.
      • 4. So that the default factory can find the implementation class from the custom interface, the implementation class name must be the interface name + Impl. You can check that the Attribute implementations provided in Lucene follow this pattern.
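
The rules above can be illustrated with a simplified model. This sketch does not use Lucene itself: Attr, AttrImpl, ColorAttr, ColorAttrImpl and MiniAttributeSource are hypothetical stand-ins, and the reflective lookup is only an assumption about how a "name + Impl" convention can be resolved, not a copy of Lucene's factory.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical marker interface standing in for Lucene's Attribute.
interface Attr {}

// Hypothetical base class standing in for Lucene's AttributeImpl.
abstract class AttrImpl implements Attr {}

// Rule 1: the custom attribute interface extends the marker interface.
interface ColorAttr extends Attr {
    String getColor();

    void setColor(String color);
}

// Rules 2-4: extends the base class, implements the custom interface,
// has a no-argument constructor, and is named <interface name> + "Impl".
class ColorAttrImpl extends AttrImpl implements ColorAttr {
    private String color = "";

    public ColorAttrImpl() {}

    @Override
    public String getColor() {
        return color;
    }

    @Override
    public void setColor(String color) {
        this.color = color;
    }
}

// Simplified registry: at most one shared instance per attribute interface.
class MiniAttributeSource {
    private final Map<Class<? extends Attr>, AttrImpl> attributes = new LinkedHashMap<>();

    <T extends Attr> T addAttribute(Class<T> attClass) {
        AttrImpl instance = attributes.get(attClass);
        if (instance == null) {
            // First request: create the implementation and remember it.
            instance = createByNamingConvention(attClass);
            attributes.put(attClass, instance);
        }
        // Repeated requests return the instance that is already stored.
        return attClass.cast(instance);
    }

    private static AttrImpl createByNamingConvention(Class<? extends Attr> attClass) {
        try {
            Class<?> implClass = Class.forName(attClass.getName() + "Impl");
            return (AttrImpl) implClass.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                "No implementation found for " + attClass.getName(), e);
        }
    }
}
```

Because the same instance comes back on every call, a consumer can fetch the attribute object once before the loop and simply re-read it after each token.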
Steps for using a TokenStream

We do not use the analyzer directly in our application; we only create the analyzer object we want and hand it to the indexing engine and the search engine. But when choosing an analyzer we need to test its results, so we need to know how to use the TokenStream it produces. The steps are:

1. Obtain from the TokenStream the attribute objects for the token information you want (the information is stored in these attribute objects).

2. Call the TokenStream's reset() method to reset it, because the TokenStream is reused.

3. Call the TokenStream's incrementToken() in a loop, advancing one token per call, until it returns false.

4. Inside the loop, read the attribute values you want for each token.

5. Call the TokenStream's end() to perform the end-of-stream work.

6. Call the TokenStream's close() method to release the resources it holds.
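
The call order in these steps can be sketched with a small stand-in class that needs no Lucene dependency. MiniTokenStream and TokenStreamUsage are hypothetical names; the stand-in only mirrors the method names reset(), incrementToken(), end() and close(), and its term() accessor plays the role of the attribute object from step 1.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a TokenStream that splits on whitespace.
final class MiniTokenStream {
    private final String text;
    private String[] pending = null; // stays null until reset() is called
    private int next = 0;
    private String term = "";

    MiniTokenStream(String text) {
        this.text = text;
    }

    // Stands in for an attribute object holding the current term.
    String term() {
        return term;
    }

    void reset() {
        String trimmed = text.trim();
        pending = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
        next = 0;
    }

    boolean incrementToken() {
        if (pending == null) {
            throw new IllegalStateException("reset() must be called first");
        }
        if (next >= pending.length) {
            return false;
        }
        term = pending[next++];
        return true;
    }

    void end() {
        term = ""; // end-of-stream work; here it just clears the state
    }

    void close() {
        pending = null; // release what the stream holds
    }
}

final class TokenStreamUsage {
    static List<String> collectTerms(String text) {
        List<String> terms = new ArrayList<>();
        MiniTokenStream stream = new MiniTokenStream(text);
        try {
            stream.reset();                      // step 2
            while (stream.incrementToken()) {    // step 3
                terms.add(stream.term());        // step 4
            }
            stream.end();                        // step 5
        } finally {
            stream.close();                      // step 6
        }
        return terms;
    }
}
```

With Lucene itself the same sequence would typically start from analyzer.tokenStream(fieldName, text) and stream.addAttribute(CharTermAttribute.class), followed by the same reset(), incrementToken(), end() and close() calls.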
