Study Notes (III): Lucene Analyzers (Word Breakers)

Last Update:2018-05-20 Source: Internet

Author: User

Lucene analyzer (word breaker) API

org.apache.lucene.analysis.Analyzer

Analyzer is the core API of the analysis component. It is responsible for building the TokenStream that actually performs the word segmentation. Call one of the following two methods, passing in the text to be analyzed, to obtain that TokenStream:

    public final TokenStream tokenStream(String fieldName, Reader reader)
    public final TokenStream tokenStream(String fieldName, String text)

    protected abstract TokenStreamComponents createComponents(String fieldName)

This is the only abstract method in Analyzer and is its extension point. You implement your own analyzer by implementing this method.

Parameter: fieldName. If different fields need different analysis components, branch on this parameter; otherwise it can be ignored.

Return value: a TokenStreamComponents instance, that is, the analysis components for the field.

Inside createComponents we create whatever analysis components we want.
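
The extension point described above follows the template-method pattern. The following is a minimal plain-Java sketch of that pattern only; it does not use Lucene's classes. `MiniAnalyzer`, `Components`, and `PerFieldAnalyzer` are hypothetical names, and the "components" are reduced to a function from text to a list of tokens. It shows a final entry point delegating to the abstract `createComponents(fieldName)`, and an implementation that branches on the field name.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

// Hypothetical stand-in for TokenStreamComponents: wraps a text -> tokens function.
final class Components {
    final Function<String, List<String>> pipeline;

    Components(Function<String, List<String>> pipeline) {
        this.pipeline = pipeline;
    }
}

// Hypothetical stand-in for Analyzer: a final entry point plus one abstract extension point.
abstract class MiniAnalyzer {
    // Callers use this method; subclasses cannot override it.
    public final List<String> tokenStream(String fieldName, String text) {
        return createComponents(fieldName).pipeline.apply(text);
    }

    // The only method a concrete analyzer has to implement.
    protected abstract Components createComponents(String fieldName);
}

// Picks a different pipeline for the "id" field; other fields share the default.
class PerFieldAnalyzer extends MiniAnalyzer {
    @Override
    protected Components createComponents(String fieldName) {
        if ("id".equals(fieldName)) {
            // Keep identifiers as a single, unmodified token.
            return new Components(text -> Collections.singletonList(text));
        }
        // Default: lower-case the text and split it on whitespace.
        return new Components(text ->
                Arrays.asList(text.trim().toLowerCase(Locale.ROOT).split("\\s+")));
    }
}
```

With this sketch, `new PerFieldAnalyzer().tokenStream("id", "AB 12")` keeps the whole value as one token, while any other field name yields lower-cased, whitespace-separated tokens.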

Tokenstreamcomponents

Analysis components: this class encapsulates the TokenStream that is exposed for external use. It provides access to two fields, source and sink: source is the Tokenizer that receives the input Reader, and sink is the TokenStream that callers consume.

Source code:

    public static class TokenStreamComponents {
        protected final Tokenizer source;
        protected final TokenStream sink;
        transient ReusableStringReader reusableStringReader;

        public TokenStreamComponents(Tokenizer source, TokenStream result) {
            this.source = source;
            this.sink = result;
        }

        public TokenStreamComponents(Tokenizer source) {
            this.source = source;
            this.sink = source;
        }

        protected void setReader(Reader reader) {
            this.source.setReader(reader);
        }

        public TokenStream getTokenStream() {
            return this.sink;
        }

        public Tokenizer getTokenizer() {
            return this.source;
        }
    }

org.apache.lucene.analysis.TokenStream

The token stream is responsible for segmenting the input text and processing the resulting tokens.

Concept note: Token: one item split out of the character stream.

Concept note: Token attribute: a piece of information about a token, such as the term it contains, its position, and so on.

TokenStream has two kinds of subclasses:

Tokenizer: a TokenStream whose input is a Reader character stream; it splits tokens out of that stream.

TokenFilter: a TokenStream whose input is another TokenStream; it applies further processing to the tokens that flow out of the upstream TokenStream.
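
The split between a stage that reads characters and stages that wrap other stages is a decorator chain. The sketch below models it in plain Java, using `Iterator<String>` in place of Lucene's TokenStream; `WhitespaceSplitter`, `LowerCaseStage`, and `ChainDemo` are hypothetical names chosen for illustration and are not Lucene classes.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Locale;
import java.util.Scanner;

// Plays the Tokenizer role: its input is a character stream.
final class WhitespaceSplitter implements Iterator<String> {
    private final Scanner scanner;

    WhitespaceSplitter(Reader input) {
        this.scanner = new Scanner(input); // default delimiter is whitespace
    }

    @Override public boolean hasNext() { return scanner.hasNext(); }
    @Override public String next() { return scanner.next(); }
}

// Plays the TokenFilter role: its input is another token stage.
final class LowerCaseStage implements Iterator<String> {
    private final Iterator<String> input;

    LowerCaseStage(Iterator<String> input) {
        this.input = input;
    }

    @Override public boolean hasNext() { return input.hasNext(); }
    @Override public String next() { return input.next().toLowerCase(Locale.ROOT); }
}

final class ChainDemo {
    // Builds splitter -> lower-case stage and drains the chain into a list.
    static List<String> analyze(String text) {
        Iterator<String> chain =
                new LowerCaseStage(new WhitespaceSplitter(new StringReader(text)));
        List<String> tokens = new ArrayList<>();
        while (chain.hasNext()) {
            tokens.add(chain.next());
        }
        return tokens;
    }
}
```

Further stages (for example a stop-word stage) would wrap the chain in the same way, which is why filters take another stream rather than a Reader as input.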

TokenStream extends AttributeSource.

Concept note: Attribute (token attribute): a piece of information about a token, such as its term or the index position of that term. These values are computed as the text passes through the Tokenizer/TokenFilter stages, and different combinations of stages produce different sets of information. The set is therefore dynamic: you cannot know in advance how many attributes there will be or what they are. So how is a token's information stored?

The answer is AttributeSource, Attribute, AttributeImpl, and AttributeFactory.

1. AttributeSource is responsible for storing Attribute objects; it provides the methods for adding and retrieving them.

2. A single Attribute object can store one or more pieces of attribute information.

3. AttributeFactory is the factory responsible for creating Attribute objects. By default, TokenStream uses a factory obtained from AttributeFactory.getStaticImplementation, so we do not need to provide one; we only need to follow its rules.

AttributeSource usage rules

1. To store token attributes, a TokenStream implementation adds Attribute objects to the AttributeSource through one of its two add methods:

    <T extends Attribute> T addAttribute(Class<T> attClass)
    void addAttributeImpl(AttributeImpl att)

The first method takes the interface class (an interface that extends Attribute) of the attribute you want to add and returns an instance of the corresponding implementation class. Getting from the interface to an instance is why an AttributeFactory is needed.
2. Each Attribute implementation class has only one instance in an AttributeSource. During analysis, that same instance is reused for every token to hold the token's attribute values. Calling the add method again for an attribute that is already present returns the stored instance.
3. To read or write a piece of token information, you need to hold the corresponding attribute instance: obtain it with addAttribute or getAttribute, then call its methods to get or set the value.
4. When a TokenStream uses our own Attribute implementation together with the default factory, how does the add method know which implementation class to instantiate? The following rules apply:
- 1. The custom attribute interface, for example MyAttribute, extends Attribute.
- 2. The custom attribute implementation class extends AttributeImpl and implements the custom interface MyAttribute.
- 3. The custom attribute implementation class must provide a no-argument constructor.
- 4. So that the default factory can find the implementation class from the custom interface, the implementation class name must be the interface name + "Impl" (for example MyAttributeImpl). The Attribute implementations that ship with Lucene follow this convention.
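
The rules above can be illustrated with a small plain-Java model. This is a sketch under stated assumptions, not Lucene's implementation: `Attr`, `AttrImpl`, `TermAttr`, `TermAttrImpl`, and `MiniAttributeSource` are hypothetical stand-ins for Attribute, AttributeImpl, a custom attribute, and AttributeSource. Lucene's real AttributeImpl also requires methods such as clear() and copyTo(), which are omitted here. The model shows two points: the implementation is located by appending "Impl" to the interface name and calling its no-argument constructor, and repeated add calls return the same instance.

```java
import java.lang.reflect.Constructor;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-ins for Lucene's Attribute and AttributeImpl.
interface Attr {}
abstract class AttrImpl implements Attr {}

// Rule 4.1: the custom attribute interface extends the marker interface.
interface TermAttr extends Attr {
    String getTerm();
    void setTerm(String term);
}

// Rules 4.2 to 4.4: extends the base class, implements the interface, has an
// (implicit) no-argument constructor, and is named <interface name> + "Impl".
class TermAttrImpl extends AttrImpl implements TermAttr {
    private String term = "";
    @Override public String getTerm() { return term; }
    @Override public void setTerm(String term) { this.term = term; }
}

// Stand-in for AttributeSource: at most one instance per attribute interface.
class MiniAttributeSource {
    private final Map<Class<? extends Attr>, AttrImpl> attributes = new LinkedHashMap<>();

    <T extends Attr> T addAttribute(Class<T> attClass) {
        AttrImpl impl = attributes.get(attClass);
        if (impl == null) {
            // First request: create the implementation and remember it.
            impl = createByNamingConvention(attClass);
            attributes.put(attClass, impl);
        }
        // Later requests return the stored instance (rule 2).
        return attClass.cast(impl);
    }

    // Plays the role of the default factory: interface name + "Impl".
    private static AttrImpl createByNamingConvention(Class<? extends Attr> attClass) {
        try {
            Class<?> implClass = Class.forName(
                    attClass.getName() + "Impl", true, attClass.getClassLoader());
            Constructor<?> ctor = implClass.getDeclaredConstructor();
            ctor.setAccessible(true);
            return (AttrImpl) ctor.newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalArgumentException(
                    "No implementation found for " + attClass.getName(), e);
        }
    }
}
```

Because both callers receive the same object, a value written by a producing stage through `setTerm` is immediately visible to a consumer holding the reference returned by an earlier `addAttribute` call.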

Steps for using a TokenStream

We do not call analyzers directly in our applications; we only create the analyzer we want and hand it to the indexing engine and the search engine. But when choosing an analyzer we need to test its output, so we need to know how to consume the TokenStream it produces. The steps are:

1. Obtain from the TokenStream the attribute objects for the token information you want (the information is stored in those attribute objects).

2. Call the reset() method of the TokenStream. This is required because TokenStream instances are reused.

3. Call incrementToken() on the TokenStream in a loop, advancing one token per call, until it returns false.

4. Inside the loop, read the attribute values you want for each token.

5. Call end() on the TokenStream to perform end-of-stream processing.

6. Call the close() method of the TokenStream to release the resources it holds.
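
The six steps can be sketched with a toy stream that imitates the same call sequence. `ToyTokenStream` and `ConsumeDemo` are hypothetical plain-Java classes, not part of Lucene; with a real analyzer, the stream would come from `analyzer.tokenStream(fieldName, text)` and step 1 would typically be a call such as `addAttribute(CharTermAttribute.class)`. The toy keeps a single reused term slot to show why values must be read inside the loop.

```java
import java.util.ArrayList;
import java.util.List;

// Toy stream imitating the reset / incrementToken / end / close contract.
final class ToyTokenStream implements AutoCloseable {
    private final String[] words;
    private int next = -1;     // -1 means reset() has not been called yet
    private String term = "";  // single slot, overwritten for every token

    ToyTokenStream(String text) {
        String trimmed = text.trim();
        this.words = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    String term() { return term; }

    void reset() { next = 0; }

    boolean incrementToken() {
        if (next < 0) {
            throw new IllegalStateException("call reset() before incrementToken()");
        }
        if (next >= words.length) {
            return false;
        }
        term = words[next++];  // overwrite the slot in place
        return true;
    }

    void end() { term = ""; }  // end-of-stream bookkeeping would go here

    @Override public void close() { next = -1; }
}

final class ConsumeDemo {
    static List<String> tokens(String text) {
        List<String> out = new ArrayList<>();
        // Step 6: try-with-resources calls close() even if an exception is thrown.
        try (ToyTokenStream ts = new ToyTokenStream(text)) {
            // Step 1 has no counterpart here: term() plays the attribute's role.
            ts.reset();                    // step 2
            while (ts.incrementToken()) {  // step 3
                out.add(ts.term());        // step 4: copy the value, the slot is reused
            }
            ts.end();                      // step 5
        }
        return out;
    }
}
```

Skipping `reset()` in this toy raises an IllegalStateException, which mirrors the reason step 2 exists: a reused stream has to be put back into a known state before it is consumed again.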

This article is an English version of an article originally published in Chinese on aliyun.com.