Learning the open-source search framework Lucene: analyzers (4) -- learning the decorator pattern through the analyzer source code


I had studied some design patterns before, but without any practice I only half understood them, even though I thought I understood everything at the time. Personally, I think reading books and listening to other people's explanations is not enough; you have to practice, or experience a pattern's advantages firsthand, before you really understand it. In the analyzer code I found that it uses the decorator pattern, and uses it quite cleverly. So I took it as a case study, read what a few books say about the pattern, studied it carefully, and recorded some of my notes below.

The decorator pattern dynamically attaches additional responsibilities to an object. For adding functionality, the decorator pattern is more flexible than subclassing. (Quoted from "Big Talk Design Patterns".)

From this sentence we can see that the decorator pattern decorates an object, adding extra behavior to it. Each decorator is implemented without having to care about the internals of the object it decorates. We could, of course, get the same functionality without the decorator pattern, for example by adding new fields to the class itself or to a subclass. But doing so has two drawbacks. First, it makes the class more complex. Second, these extra responsibilities are often needed only in special circumstances, so writing them directly into the class is a poor fit. The decorator pattern instead implements each extra responsibility in a class of its own; if an object needs that responsibility, it is decorated with the class that implements it. One advantage is that the decoration logic is moved out of the object's class, which keeps the core class simple. Another is that client code can apply the decorators in whatever order it chooses.
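The idea described above can be sketched in a few lines of C#. This is only a minimal illustration: the class names follow the usual textbook roles of the pattern (Component, ConcreteComponent, Decorator, ConcreteDecoratorA, ConcreteDecoratorB), and the Operation method and its string results are made up for the example.

```csharp
using System;

// Component: the interface whose objects can receive extra responsibilities.
abstract class Component
{
    public abstract string Operation();
}

// ConcreteComponent: the core object, free of any decoration logic.
class ConcreteComponent : Component
{
    public override string Operation() => "core";
}

// Decorator: wraps another Component and forwards to it by default.
abstract class Decorator : Component
{
    protected readonly Component Inner;
    protected Decorator(Component inner) { Inner = inner; }
    public override string Operation() => Inner.Operation();
}

// Each concrete decorator adds exactly one responsibility around the inner call.
class ConcreteDecoratorA : Decorator
{
    public ConcreteDecoratorA(Component inner) : base(inner) { }
    public override string Operation() => "A(" + Inner.Operation() + ")";
}

class ConcreteDecoratorB : Decorator
{
    public ConcreteDecoratorB(Component inner) : base(inner) { }
    public override string Operation() => "B(" + Inner.Operation() + ")";
}

static class DecoratorDemo
{
    static void Main()
    {
        // The client decides which decorators to apply and in what order.
        Component c = new ConcreteDecoratorB(new ConcreteDecoratorA(new ConcreteComponent()));
        Console.WriteLine(c.Operation()); // prints B(A(core))
    }
}
```

Because every decorator is itself a Component, decorators can be nested to any depth, and the core class never needs to know they exist.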

Let's look at a structure diagram:

This diagram comes from "Big Talk Design Patterns". First there is the Component class, an object interface to whose objects responsibilities can be added dynamically. ConcreteComponent defines a concrete object, to which responsibilities can also be added. Decorator is the abstract decoration class: it inherits from Component and extends Component's functionality from the outside. ConcreteDecoratorA and ConcreteDecoratorB are the concrete decorators, each responsible for implementing one specific decoration. Now let's look at the class structure of the whole word segmentation module:

Doesn't it look just like the structure diagram of the decorator pattern above? TokenStream is an abstract class; we should think of it as the object interface above (there is plenty of material on the difference between abstract classes and interfaces), so it corresponds to Component. Tokenizer is our concrete object class. There is one small difference from the diagram: Tokenizer is not really a concrete class, it is also an abstract class. The actual object classes are CharTokenizer, a subclass of Tokenizer, and CharTokenizer's own subclasses WhitespaceTokenizer and LetterTokenizer. These three classes implement the virtual methods defined by TokenStream and extend them somewhat; they are the concrete object classes. TokenFilter is the abstract decoration class, equivalent to Decorator above, and it also inherits from TokenStream. Its three subclasses, PorterStemFilter, LowerCaseFilter, and StopFilter, are the concrete decorators, used to decorate the subclasses of Tokenizer.

Let's look at the CharTokenizer subclass. Its job is to split the input stream into tokens at non-English-letter characters; it does not process the tokens any further. Suppose we also need to convert all the letters in the tokens to lowercase, extract their stems, and remove the stop words. What should we do? By our earlier reasoning, we want to add extra responsibilities to an object, so we can use the decorator pattern for these three functions instead of implementing them inside CharTokenizer. Three decorator classes implement them: PorterStemFilter, LowerCaseFilter, and StopFilter. The constructor of each takes a parameter of type TokenStream, and CharTokenizer also inherits from TokenStream. So once we have created a CharTokenizer object, we can pass it to any of these three decorators, each of which implements a different responsibility.

As noted earlier, these extra responsibilities could also be implemented in a subclass of the concrete object class, but that is less flexible and adds complexity. Here we can see that the word segmentation module actually uses both approaches. Converting tokens to lowercase, for example, is implemented by the decorator class LowerCaseFilter and also by a subclass, LowerCaseTokenizer (not shown in the diagram; it is a subclass of LetterTokenizer). A CharTokenizer combined with LowerCaseFilter therefore has the same effect as LowerCaseTokenizer alone; the difference is that the former is implemented with the decorator pattern and the latter with a subclass.
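To make the comparison concrete, here is a small self-contained sketch of the two approaches to lower-casing. It does not use the Lucene API: MiniTokenStream, MiniLetterTokenizer, MiniLowerCaseTokenizer, and MiniLowerCaseFilter are hypothetical stand-ins written for this example, and tokens are plain strings rather than Token objects.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical stand-in for TokenStream: Next() returns null at end of input.
abstract class MiniTokenStream
{
    public abstract string Next();
}

// Stand-in for LetterTokenizer: emits maximal runs of letters.
class MiniLetterTokenizer : MiniTokenStream
{
    private readonly TextReader reader;
    public MiniLetterTokenizer(TextReader reader) { this.reader = reader; }

    // Hook that a subclass can override to rewrite each character.
    protected virtual char Normalize(char c) => c;

    public override string Next()
    {
        var sb = new StringBuilder();
        int ch;
        while ((ch = reader.Read()) != -1)
        {
            if (char.IsLetter((char)ch)) sb.Append(Normalize((char)ch));
            else if (sb.Length > 0) break; // a non-letter ends the current token
        }
        return sb.Length > 0 ? sb.ToString() : null;
    }
}

// Subclass approach: lower-casing is built into a tokenizer subclass.
class MiniLowerCaseTokenizer : MiniLetterTokenizer
{
    public MiniLowerCaseTokenizer(TextReader reader) : base(reader) { }
    protected override char Normalize(char c) => char.ToLowerInvariant(c);
}

// Decorator approach: lower-casing wraps any token stream.
class MiniLowerCaseFilter : MiniTokenStream
{
    private readonly MiniTokenStream input;
    public MiniLowerCaseFilter(MiniTokenStream input) { this.input = input; }
    public override string Next()
    {
        string t = input.Next();
        return t == null ? null : t.ToLowerInvariant();
    }
}

static class LowerCaseDemo
{
    // Reads a stream to the end and joins its tokens with spaces.
    public static string Drain(MiniTokenStream s)
    {
        var tokens = new List<string>();
        for (string t = s.Next(); t != null; t = s.Next()) tokens.Add(t);
        return string.Join(" ", tokens);
    }

    static void Main()
    {
        const string text = "Hello, Lucene World";
        // Both lines print: hello lucene world
        Console.WriteLine(Drain(new MiniLowerCaseFilter(new MiniLetterTokenizer(new StringReader(text)))));
        Console.WriteLine(Drain(new MiniLowerCaseTokenizer(new StringReader(text))));
    }
}
```

The output is the same either way, but the filter can be reused on top of any other token stream, while the subclass only works for this one tokenizer.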

The word segmentation module has four analyzers in total: SimpleAnalyzer, WhitespaceAnalyzer, StopAnalyzer, and StandardAnalyzer. Each of them simply combines a particular object class with particular decorators. Let's look at the code.

The code of SimpleAnalyzer:

public class SimpleAnalyzer : Analyzer
{
    /// <summary>
    /// Creates a TokenStream which tokenizes all the text in the provided reader.
    /// </summary>
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new LowerCaseTokenizer(reader);
    }
}

As we can see, SimpleAnalyzer only creates a LowerCaseTokenizer; that is, SimpleAnalyzer splits tokens at non-English-letter characters and converts each token to lowercase. Now let's look at the code of WhitespaceAnalyzer:

 
public class WhitespaceAnalyzer : Analyzer
{
    /// <summary>
    /// Creates a TokenStream which tokenizes all the text in the provided TextReader.
    /// </summary>
    public override TokenStream TokenStream(string fieldName, TextReader reader)
    {
        return new WhitespaceTokenizer(reader);
    }
}

As we can see, WhitespaceAnalyzer only creates a WhitespaceTokenizer. Next is the StopAnalyzer class; we look at just one of its methods:

 
public override TokenStream TokenStream(string fieldName, TextReader reader)
{
    return new StopFilter(new LowerCaseTokenizer(reader), stopTable);
}

Here a LowerCaseTokenizer is created first and then passed to the StopFilter decorator. The text is first split into tokens at non-English-letter characters and converted to lowercase, but one more function is still needed: removing stop words. That is the job of StopFilter. As we know, every decorator's constructor takes a TokenStream object, and LowerCaseTokenizer inherits from TokenStream, so we can pass the LowerCaseTokenizer object directly to StopFilter, which then adds stop-word removal.

If the above is not obvious enough, let's look at the code of the StandardAnalyzer class:

public override TokenStream TokenStream(string fieldName, TextReader reader)
{
    TokenStream result = new StandardTokenizer(reader);
    result = new StandardFilter(result);
    result = new LowerCaseFilter(result);
    result = new StopFilter(result, stopTable);
    return result;
}

From the code above we can clearly see that StandardAnalyzer first creates a StandardTokenizer object and then decorates it with three decorator classes, adding three extra functions to the object. Here the benefits of the decorator pattern are plain. It cleanly separates a class's core responsibility from its decorative functions, which both reduces the complexity of the class and gives each class a single responsibility. It is also flexible: if we need a different function, we implement another decorator class and use it to decorate the object. A further benefit is that the order of decoration can be adjusted freely.
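Being free to choose the decoration order also means the order can change the result. The sketch below illustrates this with hypothetical classes (SketchStream, FixedTokens, SketchLowerCaseFilter, SketchStopFilter) rather than the real Lucene ones; it assumes a case-sensitive stop filter and a fixed list of tokens in place of a real tokenizer.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for TokenStream: Next() returns null at end of stream.
abstract class SketchStream
{
    public abstract string Next();
}

// Concrete component: replays a fixed list of tokens (stands in for a tokenizer).
class FixedTokens : SketchStream
{
    private readonly Queue<string> tokens;
    public FixedTokens(params string[] tokens) { this.tokens = new Queue<string>(tokens); }
    public override string Next() => tokens.Count > 0 ? tokens.Dequeue() : null;
}

// Decorator: lower-cases every token from the wrapped stream.
class SketchLowerCaseFilter : SketchStream
{
    private readonly SketchStream input;
    public SketchLowerCaseFilter(SketchStream input) { this.input = input; }
    public override string Next()
    {
        string t = input.Next();
        return t == null ? null : t.ToLowerInvariant();
    }
}

// Decorator: skips tokens found in the stop-word set (case-sensitive by assumption).
class SketchStopFilter : SketchStream
{
    private readonly SketchStream input;
    private readonly HashSet<string> stopWords;
    public SketchStopFilter(SketchStream input, IEnumerable<string> stopWords)
    {
        this.input = input;
        this.stopWords = new HashSet<string>(stopWords);
    }
    public override string Next()
    {
        string t;
        while ((t = input.Next()) != null)
        {
            if (!stopWords.Contains(t)) return t;
        }
        return null;
    }
}

static class OrderDemo
{
    // Reads a stream to the end and joins its tokens with spaces.
    public static string Drain(SketchStream s)
    {
        var tokens = new List<string>();
        for (string t = s.Next(); t != null; t = s.Next()) tokens.Add(t);
        return string.Join(" ", tokens);
    }

    static void Main()
    {
        string[] stop = { "the", "a" };

        // Lower-case first, then remove stop words (the order used in the
        // StandardAnalyzer code above). Prints: quick fox
        Console.WriteLine(Drain(new SketchStopFilter(
            new SketchLowerCaseFilter(new FixedTokens("The", "Quick", "fox")), stop)));

        // Swapped order: "The" passes the stop filter before it is lower-cased.
        // Prints: the quick fox
        Console.WriteLine(Drain(new SketchLowerCaseFilter(
            new SketchStopFilter(new FixedTokens("The", "Quick", "fox"), stop))));
    }
}
```

Neither class had to change to produce the two behaviors; only the order in which the client nested them did.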

The above is only the NLucene code. The later Lucene.Net adds many more word segmentation classes to implement more complex segmentation methods, along with some simple Chinese word segmentation classes, and all of them are built on the decorator pattern. No matter how complicated an analyzer is, it is assembled from a few basic object classes and a few decorator classes. With the decorator pattern we can add decorator classes at will and so build our own custom analyzers, which is very flexible. This, too, is an advantage of the decorator pattern.
