IK Analyzer Source Code Explanation (VII): TokenStream and incrementToken Attribute Processing (Algorithms and Data Structures)

Source: Internet
Author: User
Tags: class definition, reflection

First, let us look at the class hierarchy of AttributeSource in Lucene:

org.apache.lucene.util.AttributeSource

· org.apache.lucene.analysis.TokenStream (implements java.io.Closeable)

  · org.apache.lucene.analysis.NumericTokenStream

  · org.apache.lucene.analysis.TeeSinkTokenFilter.SinkTokenStream

  · org.apache.lucene.analysis.TokenFilter

    · org.apache.lucene.analysis.ASCIIFoldingFilter

    · org.apache.lucene.analysis.CachingTokenFilter

    · org.apache.lucene.analysis.ISOLatin1AccentFilter

    · org.apache.lucene.analysis.LengthFilter

    · org.apache.lucene.analysis.LowerCaseFilter

    · org.apache.lucene.analysis.PorterStemFilter

    · org.apache.lucene.analysis.StopFilter

    · org.apache.lucene.analysis.TeeSinkTokenFilter

  · org.apache.lucene.analysis.Tokenizer

    · org.apache.lucene.analysis.CharTokenizer

      · org.apache.lucene.analysis.LetterTokenizer

        · org.apache.lucene.analysis.LowerCaseTokenizer

      · org.apache.lucene.analysis.WhitespaceTokenizer

    · org.apache.lucene.analysis.KeywordTokenizer

As the hierarchy shows, AttributeSource is the superclass of TokenStream.

1. First, let's look at the implementation of AttributeSource:

1.1 Member properties:

private final Map<Class<? extends Attribute>, AttributeImpl> attributes;
private final Map<Class<? extends AttributeImpl>, AttributeImpl> attributeImpls;

private AttributeFactory factory;

private static final WeakHashMap<Class<? extends AttributeImpl>, LinkedList<WeakReference<Class<? extends Attribute>>>> knownImplClasses;

1.1.1 Attribute:

This is just an empty marker interface. The interfaces that extend it include FlagsAttribute, OffsetAttribute, PayloadAttribute, PositionIncrementAttribute, TermAttribute, and TypeAttribute. As their names suggest, they correspond one-to-one with the properties of a Lucene token: each interface defines the canonical operations on one specific token property, i.e. only the behavior, while the data for that property is held in a subclass of AttributeImpl.

1.1.2 AttributeImpl:

The class definition is: public abstract class AttributeImpl implements Cloneable, Serializable, Attribute

The concrete classes derived from it are:

FlagsAttributeImpl, OffsetAttributeImpl, PayloadAttributeImpl, PositionIncrementAttributeImpl, TermAttributeImpl, Token, TypeAttributeImpl

Looking at the definition of FlagsAttributeImpl: public class FlagsAttributeImpl extends AttributeImpl implements FlagsAttribute, Cloneable, Serializable

So it implements the FlagsAttribute interface, and looking further at the source of FlagsAttributeImpl, it also holds the data for the token's flags property. We can therefore infer:

AttributeImpl is the class that holds a property's data together with the operations on that property; the operations are declared in the Attribute interface and implemented in AttributeImpl.

1.1.3 We can conclude that a token has several properties, and each property is expressed by an Attribute/AttributeImpl pair: the Attribute interface defines the operations on the property, while the AttributeImpl subclass implements that interface and contains the concrete property data.
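To make this split concrete, here is a trimmed sketch of the pattern, modeled on FlagsAttribute and FlagsAttributeImpl from the Lucene 3.x source (in Lucene these are two separate files in org.apache.lucene.analysis.tokenattributes; only the essential members are shown here):

import org.apache.lucene.util.Attribute;
import org.apache.lucene.util.AttributeImpl;

// The interface only declares the operations on a token's "flags" property.
public interface FlagsAttribute extends Attribute {
    int getFlags();
    void setFlags(int flags);
}

// The implementation class carries the actual data of the property.
public class FlagsAttributeImpl extends AttributeImpl implements FlagsAttribute {
    private int flags = 0; // the concrete property data lives here

    public int getFlags() { return flags; }

    public void setFlags(int flags) { this.flags = flags; }

    @Override
    public void clear() { flags = 0; }

    @Override
    public void copyTo(AttributeImpl target) { ((FlagsAttribute) target).setFlags(flags); }

    @Override
    public boolean equals(Object other) {
        return other instanceof FlagsAttributeImpl && ((FlagsAttributeImpl) other).flags == flags;
    }

    @Override
    public int hashCode() { return flags; }
}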

1.1.4

private final Map<Class<? extends Attribute>, AttributeImpl> attributes;
private final Map<Class<? extends AttributeImpl>, AttributeImpl> attributeImpls;

These two members store two mappings. The first maps every Attribute interface implemented by an AttributeImpl instance to that instance; the second maps the concrete AttributeImpl class to the same instance.

The purpose of the two mappings is to guarantee that, within one AttributeSource instance, there is only a single AttributeImpl instance for each Attribute interface and for each AttributeImpl class. In other words, when an instance is requested for a particular Attribute or AttributeImpl, it is not created on every call: it is created the first time, and the previously created instance is returned from then on. The implementation is analyzed below.

public void addAttributeImpl(final AttributeImpl att) {
    // If an instance of att's concrete class has already been saved in the map,
    // adding another instance of that class is a no-op.
    final Class<? extends AttributeImpl> clazz = att.getClass();
    if (attributeImpls.containsKey(clazz)) return;

    // Collect all Attribute interfaces implemented by att and save them in foundInterfaces
    // (cached per implementation class in knownImplClasses).
    LinkedList<WeakReference<Class<? extends Attribute>>> foundInterfaces;
    synchronized (knownImplClasses) {
        foundInterfaces = knownImplClasses.get(clazz);
        if (foundInterfaces == null) {
            // We have a strong reference to the Class instance holding all interfaces
            // in the list (parameter "att"), so the WeakReferences are never evicted by GC.
            knownImplClasses.put(clazz, foundInterfaces = new LinkedList<WeakReference<Class<? extends Attribute>>>());
            // Find all interfaces that this attribute instance implements
            // and that extend the Attribute interface.
            Class<?> actClazz = clazz;
            do {
                for (Class<?> curInterface : actClazz.getInterfaces()) {
                    if (curInterface != Attribute.class && Attribute.class.isAssignableFrom(curInterface)) {
                        foundInterfaces.add(new WeakReference<Class<? extends Attribute>>(curInterface.asSubclass(Attribute.class)));
                    }
                }
                actClazz = actClazz.getSuperclass();
            } while (actClazz != null);
        }
    }

    // Add the mapping from each Attribute interface implemented by att to att itself into the maps.
    for (WeakReference<Class<? extends Attribute>> curInterfaceRef : foundInterfaces) {
        final Class<? extends Attribute> curInterface = curInterfaceRef.get();
        assert (curInterface != null) :
            "We have a strong reference on the class holding the interfaces, so they should never get evicted";
        // Attribute is a superclass of this interface
        if (!attributes.containsKey(curInterface)) {
            // Invalidate state to force recomputation in captureState()
            this.currentState = null;
            attributes.put(curInterface, att);
            attributeImpls.put(clazz, att);
        }
    }
}

public <A extends Attribute> A addAttribute(Class<A> attClass) {
    AttributeImpl attImpl = attributes.get(attClass);
    if (attImpl == null) {
        if (!(attClass.isInterface() && Attribute.class.isAssignableFrom(attClass))) {
            throw new IllegalArgumentException(
                "addAttribute() only accepts an interface that extends Attribute, but " +
                attClass.getName() + " does not fulfil this contract."
            );
        }
        addAttributeImpl(attImpl = this.factory.createAttributeInstance(attClass));
    }
    return attClass.cast(attImpl);
}
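As a small usage sketch of this caching behavior (AddAttributeDemo is just an illustrative class name; AttributeSource and OffsetAttribute are the standard Lucene classes), the second addAttribute() call for the same interface does not create a new implementation object but returns the cached one:

import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.AttributeSource;

public class AddAttributeDemo {
    public static void main(String[] args) {
        AttributeSource source = new AttributeSource();
        OffsetAttribute first = source.addAttribute(OffsetAttribute.class);
        OffsetAttribute second = source.addAttribute(OffsetAttribute.class);
        // Prints true: only one OffsetAttributeImpl instance exists per AttributeSource.
        System.out.println(first == second);
    }
}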

AttributeSource has an inner class AttributeFactory, which maintains the correspondence between Attribute classes and AttributeImpl classes.

Looking at the implementation of AttributeFactory, however, there are no dedicated add/remove operations for maintaining this relationship; it is maintained entirely through one function:

public AttributeImpl createAttributeInstance(Class<? extends Attribute> attClass). When the factory already holds a mapping from attClass to its corresponding AttributeImpl class, this function instantiates that AttributeImpl class directly and returns the instance. When no such mapping has been stored yet, the AttributeFactory adds the mapping and creates the instance using the following code:

attClassImplMap.put(attClass,
    new WeakReference<Class<? extends AttributeImpl>>(
        clazz = Class.forName(attClass.getName() + "Impl", true, attClass.getClassLoader())
            .asSubclass(AttributeImpl.class)
    )
);
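A short sketch of that naming convention in action (FactoryDemo is an illustrative name; DEFAULT_ATTRIBUTE_FACTORY is the factory AttributeSource uses when none is supplied):

import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.util.AttributeImpl;
import org.apache.lucene.util.AttributeSource;

public class FactoryDemo {
    public static void main(String[] args) {
        // The default factory looks up "<interface name>Impl" with Class.forName and instantiates it.
        AttributeImpl impl = AttributeSource.AttributeFactory.DEFAULT_ATTRIBUTE_FACTORY
                .createAttributeInstance(OffsetAttribute.class);
        // Expected to print org.apache.lucene.analysis.tokenattributes.OffsetAttributeImpl
        System.out.println(impl.getClass().getName());
    }
}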

1.1.5

public <A extends Attribute> A getAttribute(Class<A> attClass): returns the instance registered in the map for attClass;

public State captureState(): returns a snapshot of all AttributeImpl instances registered at the current moment.
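A hedged sketch of captureState() together with its counterpart restoreState(), which copies a snapshot back into the registered instances (StateDemo is an illustrative name and the sample text is arbitrary):

import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.util.AttributeSource;

public class StateDemo {
    public static void main(String[] args) {
        AttributeSource source = new AttributeSource();
        CharTermAttribute term = source.addAttribute(CharTermAttribute.class);
        term.append("hello");

        // captureState() copies the values of all registered AttributeImpls into a State snapshot.
        AttributeSource.State snapshot = source.captureState();

        source.clearAttributes();      // the term attribute is now empty
        source.restoreState(snapshot); // the snapshot values are copied back into the registered instances

        System.out.println(term.toString()); // prints "hello"
    }
}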

2. Why AttributeSource is the parent class of TokenStream in Lucene

2.1 The job of a TokenStream is to keep parsing tokens out of the text it is given. Concretely, TokenStream has a method incrementToken(); each call produces the next token of the text being analyzed. What incrementToken() actually does is fill in the attributes the caller cares about, and those attributes are how the analysis results are fed back. The natural design would therefore be to give each TokenStream subclass a number of attribute member fields, and to have every call to incrementToken() first clear the previous attribute values and then analyze and repopulate them. But consider that TokenStreams are nested: the attributes produced by the inner stream become the input to the analysis done by the outer stream. If TokenStream were implemented as described above, every stream in the nested chain would hold its own attribute instances, and the same attribute type could appear at several levels, meaning multiple instances of the same attribute would exist across the stream hierarchy, which is unnecessary. A single instance of each attribute, shared across the whole stream hierarchy, is enough to satisfy the analysis.

2.2 Based on 2.1, a reader might suggest that when streams are nested and the outer stream cares about the same attribute as the inner stream, the outer stream's attribute field could simply be assigned a reference to the inner stream's attribute, avoiding the duplication described in 2.1. The flaw in this idea is that nested streams are composed according to the user's own needs: the outer stream generally cannot know in advance which inner stream it will wrap, whereas "assigning the inner stream's attribute to the outer stream" presupposes exactly that knowledge. So this approach is not feasible.

2.3 In fact, discovering which AttributeImpl subclasses the inner TokenStream holds can only be solved through Java's reflection mechanism. Why, then, does Lucene use a construct as elaborate as AttributeSource to achieve this? The reason is efficiency.

From the analysis of AttributeSource we know what happens when nested streams register the attributes they care about: the outer stream declares the attributes it is interested in, but does not instantiate them itself in its constructor. Instead it requests them from the AttributeSource; if an instance already exists it is returned directly, otherwise a new one is created. Because the outer and inner streams of a nested chain share one AttributeSource, when both levels care about the same attribute the inner stream, which is initialized first, registers that attribute with the AttributeSource, and when the outer stream is initialized later it fetches the attribute from the AttributeSource. This guarantees that there is only one instance of each attribute of interest in the whole stream hierarchy.
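This sharing can be illustrated with a minimal, hypothetical pass-through filter (NoOpFilter is an illustrative name, not part of Lucene or IK): TokenFilter's constructor hands the inner stream to super(), so both levels use one AttributeSource, and the addAttribute() call below returns the very instance the inner tokenizer already registered.

import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public final class NoOpFilter extends TokenFilter {
    private final CharTermAttribute termAtt;

    public NoOpFilter(TokenStream input) {
        super(input); // share the inner stream's AttributeSource
        // If the inner tokenizer already registered a CharTermAttribute, no new
        // implementation object is created; the cached instance is returned, so
        // the inner and outer stream see the same object.
        termAtt = addAttribute(CharTermAttribute.class);
    }

    @Override
    public boolean incrementToken() throws IOException {
        // The inner stream fills the shared attributes; this filter just passes them through.
        return input.incrementToken();
    }
}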

Why is avoiding reflection a matter of efficiency? Because the AttributeSource implementation uses reflection only when an attribute implementation class is registered for the first time; after that, the instance is obtained directly from the maps. If a pure reflection mechanism were used to guarantee the uniqueness of attribute instances across the nesting levels, then with n levels of nesting reflection would be used n-1 times. Clearly the AttributeSource implementation is more efficient.

3. The processing flow in IK

When IK initializes, the required attributes are added to the tokenizer; the following constructor is called during IK initialization:

/**
 * Lucene 4.0 Tokenizer adapter class constructor
 * @param in
 * @param useSmart
 */
public IKTokenizer(Reader in, boolean useSmart) {
    super(in);
    offsetAtt = addAttribute(OffsetAttribute.class);
    termAtt = addAttribute(CharTermAttribute.class);
    typeAtt = addAttribute(TypeAttribute.class);
    _IKImplement = new IKSegmenter(input, useSmart);
}

When the following code is executed afterwards, the objects are no longer created anew; they are fetched directly from the attributes map and returned:

// Get the token's offset attribute
OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);

// Get the token's text attribute
CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);

// Get the token's type attribute
TypeAttribute type = ts.addAttribute(TypeAttribute.class);

And when you execute the following code,

// Iterate to get the tokenization results
while (ts.incrementToken()) {
    System.out.println(offset.startOffset() + " - " + offset.endOffset()
            + " : " + term.toString() + " | " + type.type());
}

The method invoked is incrementToken() in IKTokenizer.java. Its code is shown below (to give a clearer picture of the attribute handling, the IKTokenizer constructor is repeated as well):

/**
 * Lucene 4.0 Tokenizer adapter class constructor
 * @param in
 * @param useSmart
 */
public IKTokenizer(Reader in, boolean useSmart) {
    super(in);
    offsetAtt = addAttribute(OffsetAttribute.class);
    termAtt = addAttribute(CharTermAttribute.class);
    typeAtt = addAttribute(TypeAttribute.class);
    _IKImplement = new IKSegmenter(input, useSmart);
}

/* (non-Javadoc)
 * @see org.apache.lucene.analysis.TokenStream#incrementToken()
 */
@Override
public boolean incrementToken() throws IOException {
    // Clear all token attributes
    clearAttributes();
    Lexeme nextLexeme = _IKImplement.next();
    if (nextLexeme != null) {
        // Convert the Lexeme into the attributes
        // Set the token text
        termAtt.append(nextLexeme.getLexemeText());
        // Set the token length
        termAtt.setLength(nextLexeme.getLength());
        // Set the token offsets
        offsetAtt.setOffset(nextLexeme.getBeginPosition(), nextLexeme.getEndPosition());
        // Record the end position of the last token
        endPosition = nextLexeme.getEndPosition();
        // Record the token type
        typeAtt.setType(nextLexeme.getLexemeTypeString());
        // Return true to indicate the next token is available
        return true;
    }
    // Return false to indicate token output is complete
    return false;
}

Note that the fields offsetAtt, termAtt, and typeAtt, which were initialized in the IKTokenizer constructor invoked at the very beginning, are registered in attributes only once; when incrementToken() assigns values to these attributes, they are not re-initialized.
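Putting the snippets above together, a minimal end-to-end sketch might look like the following (IKDemo and the sample sentence are illustrative, and it assumes the IK Analyzer jar for Lucene 4.0, class org.wltea.analyzer.lucene.IKTokenizer, is on the classpath; note that in Lucene 4.x the stream must also be reset() before iteration and end()/close() called afterwards, which the fragments above omit):

import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;
import org.wltea.analyzer.lucene.IKTokenizer;

public class IKDemo {
    public static void main(String[] args) throws Exception {
        // The constructor registers the attributes once (see IKTokenizer above).
        TokenStream ts = new IKTokenizer(new StringReader("这是一个中文分词的例子"), true);

        // These calls return the instances the constructor already registered; nothing new is created.
        OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
        CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
        TypeAttribute type = ts.addAttribute(TypeAttribute.class);

        ts.reset();
        while (ts.incrementToken()) {
            System.out.println(offset.startOffset() + " - " + offset.endOffset()
                    + " : " + term.toString() + " | " + type.type());
        }
        ts.end();
        ts.close();
    }
}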

Summary:

The job of a TokenStream is to parse tokens out of the text it is given; concretely, each call to its incrementToken() method produces the next token. What incrementToken() does is fill in the attributes the user cares about, and the analysis results are fed back through those attributes. The naive design, in which every TokenStream subclass holds its own attribute fields and each call to incrementToken() first clears and then repopulates them, breaks down once streams are nested: every level of the chain would carry its own instances, and the same attribute could appear at several levels, which is unnecessary. A single shared instance of each attribute across the whole stream hierarchy is enough for the analysis.

AttributeSource provides exactly that sharing. A stream does not instantiate its attributes in its constructor; it requests them from the AttributeSource, which returns the existing instance if one is registered and creates a new one otherwise. Since the outer and inner streams of a nested chain share one AttributeSource, the inner stream, initialized first, registers the attribute, and the outer stream later fetches the same instance, guaranteeing that only one instance of each attribute of interest exists in the stream hierarchy.
