Lucene analyzer (word breaker) API
org.apache.lucene.analysis.Analyzer
Analyzer is the core API of the analysis component. It is responsible for building the TokenStream (the token processor) that actually performs word segmentation. Call either of the following two methods, passing in the text, to obtain the TokenStream for that input.
public final TokenStream tokenStream(String fieldName, Reader reader)
public final TokenStream tokenStream(String fieldName, String text)
protected abstract TokenStreamComponents createComponents(String fieldName)
This is the only abstract method in Analyzer and its extension point. You implement your own analyzer by implementing this method.
Parameter: fieldName. If different fields need different token processing components, use this parameter to decide which ones to build; otherwise it can be ignored.
Return value: a TokenStreamComponents, the token processing component.
Create the components you want inside the createComponents method.
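The extension point can be pictured with a small plain-Java model. This is a sketch under assumptions, not Lucene's code: MiniAnalyzerSketch, MiniAnalyzer, MiniComponents, and PerFieldAnalyzer are hypothetical names, and tokens are plain strings rather than a real TokenStream. It only shows how a final entry method delegates to the single abstract createComponents method, and how fieldName can select different components per field.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;

class MiniAnalyzerSketch {
    // Hypothetical stand-in for TokenStreamComponents: wraps the function that splits text.
    static final class MiniComponents {
        final Function<String, List<String>> splitter;

        MiniComponents(Function<String, List<String>> splitter) {
            this.splitter = splitter;
        }
    }

    // Hypothetical stand-in for Analyzer: one final entry point, one abstract extension point.
    abstract static class MiniAnalyzer {
        // Callers use this method; subclasses cannot override it.
        public final List<String> tokenStream(String fieldName, String text) {
            return createComponents(fieldName).splitter.apply(text);
        }

        // The only abstract method: subclasses decide which components to build.
        protected abstract MiniComponents createComponents(String fieldName);
    }

    // Example subclass: the "id" field stays one token, other fields are split and lowercased.
    static final class PerFieldAnalyzer extends MiniAnalyzer {
        @Override
        protected MiniComponents createComponents(String fieldName) {
            if ("id".equals(fieldName)) {
                return new MiniComponents(text -> List.of(text));
            }
            return new MiniComponents(PerFieldAnalyzer::splitAndLowerCase);
        }

        private static List<String> splitAndLowerCase(String text) {
            List<String> tokens = new ArrayList<>();
            for (String part : text.trim().split("\\s+")) {
                if (!part.isEmpty()) {
                    tokens.add(part.toLowerCase(Locale.ROOT));
                }
            }
            return tokens;
        }
    }

    public static void main(String[] args) {
        MiniAnalyzer analyzer = new PerFieldAnalyzer();
        System.out.println(analyzer.tokenStream("title", "Hello Lucene World")); // [hello, lucene, world]
        System.out.println(analyzer.tokenStream("id", "DOC 42"));                // [DOC 42]
    }
}
```

With Lucene on the classpath the same shape applies: extend Analyzer and return a TokenStreamComponents from createComponents instead of the hypothetical MiniComponents.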
TokenStreamComponents
Token processing component: this class wraps the TokenStream that is handed out for external use. It exposes two properties, source and sink, through accessor methods.
Source code:
public static class TokenStreamComponents {
    // The Tokenizer at the head of the chain; it receives the input Reader.
    protected final Tokenizer source;
    // The last TokenStream in the chain; consumers read tokens from it.
    protected final TokenStream sink;
    transient ReusableStringReader reusableStringReader;

    public TokenStreamComponents(Tokenizer source, TokenStream result) {
        this.source = source;
        this.sink = result;
    }

    // With no filters, the Tokenizer is both source and sink.
    public TokenStreamComponents(Tokenizer source) {
        this.source = source;
        this.sink = source;
    }

    protected void setReader(Reader reader) {
        this.source.setReader(reader);
    }

    public TokenStream getTokenStream() {
        return this.sink;
    }

    public Tokenizer getTokenizer() {
        return this.source;
    }
}
org.apache.lucene.analysis.TokenStream
The token processor. It is responsible for segmenting and processing the input text.
Concept note: Token: one item split out of the character stream.
Concept note: Token attribute: a piece of information about a token, such as its term text or its position.
TokenStream has two kinds of subclasses:
Tokenizer: a TokenStream whose input is a Reader character stream; it splits tokens out of the stream.
TokenFilter: a TokenStream whose input is another TokenStream; it applies further processing to the tokens flowing from the upstream TokenStream.
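The division of labour between the two subclasses can be sketched as a decorator chain in plain Java. This is a simplified model under assumptions, not Lucene's classes: MiniStream, WhitespaceSplitter, LowerCaseStep, and StopStep are hypothetical names, and tokens are returned as plain strings instead of being exposed through attributes.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

class TokenChainSketch {
    // Hypothetical minimal contract: next() returns the next token, or null when exhausted.
    interface MiniStream {
        String next();
    }

    // Tokenizer role: the input is a Reader; tokens are split out on whitespace.
    static final class WhitespaceSplitter implements MiniStream {
        private final Reader input;

        WhitespaceSplitter(Reader input) {
            this.input = input;
        }

        @Override
        public String next() {
            StringBuilder term = new StringBuilder();
            try {
                int c;
                while ((c = input.read()) != -1) {
                    if (!Character.isWhitespace(c)) {
                        term.append((char) c);
                    } else if (term.length() > 0) {
                        break; // whitespace after some characters ends the token
                    }
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
            return term.length() > 0 ? term.toString() : null;
        }
    }

    // TokenFilter role: the input is another stream; each token is lowercased.
    static final class LowerCaseStep implements MiniStream {
        private final MiniStream input;

        LowerCaseStep(MiniStream input) {
            this.input = input;
        }

        @Override
        public String next() {
            String token = input.next();
            return token == null ? null : token.toLowerCase(Locale.ROOT);
        }
    }

    // TokenFilter role: drops tokens that appear in the stop-word set.
    static final class StopStep implements MiniStream {
        private final MiniStream input;
        private final Set<String> stopWords;

        StopStep(MiniStream input, Set<String> stopWords) {
            this.input = input;
            this.stopWords = stopWords;
        }

        @Override
        public String next() {
            String token;
            while ((token = input.next()) != null) {
                if (!stopWords.contains(token)) {
                    return token;
                }
            }
            return null;
        }
    }

    // Builds the chain tokenizer -> lowercase filter -> stop filter and drains it.
    static List<String> analyze(String text, Set<String> stopWords) {
        MiniStream chain = new StopStep(
                new LowerCaseStep(new WhitespaceSplitter(new StringReader(text))), stopWords);
        List<String> tokens = new ArrayList<>();
        for (String token = chain.next(); token != null; token = chain.next()) {
            tokens.add(token);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(analyze("The Quick brown THE fox", Set.of("the"))); // [quick, brown, fox]
    }
}
```

Only the first element of the chain touches the Reader; every later element sees tokens, which is why filters can be stacked in any number and order.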
TokenStream extends AttributeSource.
Concept note: Attribute (token attribute): a piece of information about a token, such as its term text or the index position of the term. These attributes are computed by the Tokenizer and TokenFilters as they process the text, and different Tokenizer/TokenFilter combinations produce different token information. The set of attributes therefore varies dynamically: you cannot know in advance how many there are or what they are. So how is a token's information stored?
The answer is AttributeSource, Attribute, AttributeImpl, and AttributeFactory.
1. AttributeSource is responsible for storing Attribute objects; it provides the methods for storing and retrieving them.
2. An Attribute object can hold one or more pieces of attribute information.
3. AttributeFactory is the factory that creates Attribute objects. TokenStream uses a default factory obtained from AttributeFactory.getStaticImplementation, so we do not need to supply one; we only need to follow its rules.
AttributeSource usage rules
- 1. To store token attributes, a TokenStream implementation adds Attribute objects to the AttributeSource through one of its two add methods:
- <T extends Attribute> T addAttribute(Class<T> attClass): you pass the interface class (one that extends Attribute) of the attribute you want to add, and it returns an instance of the corresponding implementation class. Going from an interface to an instance is why an AttributeFactory is needed.
- void addAttributeImpl(AttributeImpl att)
- 2. An AttributeSource holds only one instance of each Attribute implementation class. During tokenization, that instance is reused for every token to hold the token's attribute information. Calling the add method again for the same attribute returns the instance already stored.
- 3. To read a token's attribute information, you need to hold the attribute's instance object. Obtain it with addAttribute or getAttribute, then call its methods to get or set values.
- 4. When we use our own Attribute implementation in a TokenStream together with the default factory, how does the factory know which implementation class to use when we call addAttribute? A few rules must be followed:
- 1. The custom attribute interface MyAttribute extends Attribute.
- 2. The custom attribute implementation class must extend AttributeImpl and implement the custom interface MyAttribute.
- 3. The custom attribute implementation class must provide a no-argument constructor.
- 4. So that the default factory can find the implementation class from the custom interface, the implementation class name must be the interface name + Impl. You can check that the Attribute implementations shipped with Lucene follow this convention.
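The rules above can be modelled in plain Java. This is a minimal sketch under assumptions, not Lucene's implementation: MiniAttribute, MiniAttributeImpl, MiniAttributeSource, and ColorAttribute are hypothetical stand-ins for Attribute, AttributeImpl, AttributeSource, and a custom MyAttribute. The reflective lookup only imitates the interface name + Impl convention; the real default factory may resolve classes differently.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class AttributeSourceSketch {
    // Marker interface, standing in for Attribute.
    interface MiniAttribute {}

    // Base class for implementations, standing in for AttributeImpl.
    abstract static class MiniAttributeImpl implements MiniAttribute {}

    // Rule 1: the custom attribute interface extends the marker interface.
    interface ColorAttribute extends MiniAttribute {
        String getColor();
        void setColor(String color);
    }

    // Rules 2-4: extends the base class, implements the interface, has a
    // no-argument constructor, and is named <interface name> + "Impl".
    static class ColorAttributeImpl extends MiniAttributeImpl implements ColorAttribute {
        private String color = "";

        public ColorAttributeImpl() {}

        @Override
        public String getColor() {
            return color;
        }

        @Override
        public void setColor(String color) {
            this.color = color;
        }
    }

    // Stand-in for AttributeSource: stores exactly one instance per attribute interface.
    static class MiniAttributeSource {
        private final Map<Class<? extends MiniAttribute>, MiniAttribute> attributes =
                new LinkedHashMap<>();

        // Returns the stored instance, creating it on the first call only.
        <T extends MiniAttribute> T addAttribute(Class<T> attClass) {
            MiniAttribute existing = attributes.get(attClass);
            if (existing == null) {
                existing = createByNamingConvention(attClass);
                attributes.put(attClass, existing);
            }
            return attClass.cast(existing);
        }

        // Stand-in for the default factory: load "<interface name>Impl" and
        // call its no-argument constructor.
        private static MiniAttribute createByNamingConvention(Class<? extends MiniAttribute> attClass) {
            try {
                Class<?> implClass =
                        Class.forName(attClass.getName() + "Impl", true, attClass.getClassLoader());
                return (MiniAttribute) implClass.getDeclaredConstructor().newInstance();
            } catch (ReflectiveOperationException e) {
                throw new IllegalArgumentException(
                        "No implementation found for " + attClass.getName(), e);
            }
        }
    }

    public static void main(String[] args) {
        MiniAttributeSource source = new MiniAttributeSource();
        ColorAttribute first = source.addAttribute(ColorAttribute.class);
        first.setColor("red");
        ColorAttribute second = source.addAttribute(ColorAttribute.class);
        System.out.println(first == second);   // true: the same instance is returned
        System.out.println(second.getColor()); // red
    }
}
```

Because the instance is shared, a producer that sets a value and a consumer that reads it only need to ask the same source for the same interface; neither has to know the implementation class.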
Steps for using a TokenStream
We do not use a TokenStream directly in our application; we only create the analyzer objects we want and hand them to the index engine and the search engine. But when choosing an analyzer, we need to test its effect, so we need to know how to use the TokenStream it produces. The steps are:
1. Get from the TokenStream the attribute objects for the token information you want (the information is stored in the attribute objects).
2. Call the TokenStream's reset() method. This is needed because the TokenStream is reused.
3. Call the TokenStream's incrementToken() in a loop; each call advances by one token, until it returns false.
4. Inside the loop, read the attribute values you want for each token.
5. Call the TokenStream's end() to perform the end-of-stream processing.
6. Call the TokenStream's close() method to release the resources it holds.
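The six steps can be sketched with a plain-Java model that uses the same lifecycle method names. This is an illustration under assumptions, not Lucene code: MiniTokenStream and TermHolder are hypothetical classes, splitting is on whitespace only, and setText stands in for handing a reused stream new input.

```java
import java.util.ArrayList;
import java.util.List;

class ConsumeWorkflowSketch {
    // Hypothetical attribute object: one instance is reused for every token.
    static final class TermHolder {
        String term = "";
        int position = -1;
    }

    // Hypothetical stream exposing reset(), incrementToken(), end(), and close().
    static final class MiniTokenStream {
        private final TermHolder holder = new TermHolder();
        private String text;
        private String[] pending = new String[0];
        private int index;
        private boolean ready;

        MiniTokenStream(String text) {
            this.text = text;
        }

        // Gives a reused stream new input; reset() must be called again afterwards.
        void setText(String text) {
            this.text = text;
        }

        TermHolder getTermHolder() {
            return holder;
        }

        void reset() {
            String trimmed = text.trim();
            pending = trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
            index = 0;
            ready = true;
        }

        boolean incrementToken() {
            if (!ready) {
                throw new IllegalStateException("reset() must be called before incrementToken()");
            }
            if (index >= pending.length) {
                return false;
            }
            holder.term = pending[index];
            holder.position = index;
            index++;
            return true;
        }

        void end() {
            holder.term = ""; // clear per-token state once the stream is exhausted
        }

        void close() {
            pending = new String[0];
            ready = false;
        }
    }

    // Follows the six steps in order and collects "position:term" for each token.
    static List<String> consume(MiniTokenStream stream) {
        List<String> seen = new ArrayList<>();
        TermHolder term = stream.getTermHolder();          // step 1
        try {
            stream.reset();                                // step 2
            while (stream.incrementToken()) {              // step 3
                seen.add(term.position + ":" + term.term); // step 4
            }
            stream.end();                                  // step 5
        } finally {
            stream.close();                                // step 6
        }
        return seen;
    }

    public static void main(String[] args) {
        MiniTokenStream stream = new MiniTokenStream("index and search");
        System.out.println(consume(stream)); // [0:index, 1:and, 2:search]
        stream.setText("reuse");
        System.out.println(consume(stream)); // [0:reuse]
    }
}
```

Putting close() in a finally block is a design choice of this sketch, so that resources are released even if reading a token fails partway through.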
Study Notes (III) -- Lucene analyzers