How to write your own Chinese word segmenter so that Lucene can recognize it

Source: Internet
Author: User

Lucene allows its analyzers to be extended; in other words, it lets you write your own word segmenter and plug it into Lucene. How does Lucene make this possible, and what do we need to do to design our own? The following walks through StandardAnalyzer, the standard analyzer that ships with Lucene.
First, let's take a look at the StandardAnalyzer code. To keep things concise and to the point, not all of StandardAnalyzer's code is shown; here is the essential part:
    public class StandardAnalyzer extends Analyzer {
        public TokenStream tokenStream(String fieldName, Reader reader) {
            TokenStream result = new StandardTokenizer(reader);
            result = new StandardFilter(result);
            result = new LowerCaseFilter(result);
            result = new StopFilter(result, stopSet);
            return result;
        }
    }

You can see that StandardAnalyzer inherits from a class called Analyzer. If you look at the other analyzers Lucene provides, such as KeywordAnalyzer and SimpleAnalyzer, you will find that they all inherit from this same class, so it is all but certain that this class is the admission ticket Lucene offers for integrating an external word segmenter. So what is Analyzer? Here it is:
    public abstract class Analyzer {
        public abstract TokenStream tokenStream(String fieldName, Reader reader);
        public int getPositionIncrementGap(String fieldName) {
            return 0;
        }
    }

So it is an abstract class! An abstract class's main job is to provide a common interface for others to inherit, while leaving the concrete work to its subclasses, which confirms the point: to plug your own word segmenter into Lucene, you must inherit this class! Next, let's look at its methods.
public abstract TokenStream tokenStream(String fieldName, Reader reader); — TokenStream can literally be read as a "token stream". What does that mean? In my own understanding, a stream in Java is tied to memory: the segmentation result lives in memory as a stream, and if we want to get a token, we fetch it through the TokenStream. This explanation may not be easy to follow; here I am doing my best to explain it in my own terms, and later I will back it up with actual code. Lucene itself describes TokenStream as follows:
"A TokenStream enumerates the sequence of tokens, either from fields of a document or from query text." If you can follow that sentence, you don't need my explanation; just go by the English.
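To make the "stream of tokens" idea concrete, here is a self-contained mini sketch. The class names (SimpleTokenStream, WhitespaceTokens) are made up for illustration and are not the real Lucene API, but the shape is the same: a stream hands back one token per call until it is exhausted.

```java
import java.util.Arrays;
import java.util.Iterator;

// Made-up stand-in for Lucene's TokenStream: one token per call,
// null once the sequence is exhausted.
abstract class SimpleTokenStream {
    abstract String next();
}

// A trivial stream that enumerates whitespace-separated tokens of a string.
class WhitespaceTokens extends SimpleTokenStream {
    private final Iterator<String> it;
    WhitespaceTokens(String text) {
        it = Arrays.asList(text.trim().split("\\s+")).iterator();
    }
    String next() {
        return it.hasNext() ? it.next() : null;
    }
}

public class TokenStreamDemo {
    public static void main(String[] args) {
        SimpleTokenStream ts = new WhitespaceTokens("hello lucene world");
        // A consumer (such as an index writer) pulls tokens one by one.
        for (String tok = ts.next(); tok != null; tok = ts.next()) {
            System.out.println(tok); // prints "hello", "lucene", "world"
        }
    }
}
```

The point is only the consumption pattern: the caller loops on next() until it gets an end-of-stream signal, rather than receiving all tokens at once.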
In public abstract TokenStream tokenStream(String fieldName, Reader reader), the first parameter is a field name, that is, the name of the field when you create an index. For example, in Field f = new Field("title", "hello", Field.Store.YES, Field.Index.TOKENIZED); it is the "title". The reader parameter is a java.io.Reader object. Incidentally, java.io.Reader is also an abstract class, used for reading character streams. It is generally used like this:
    String s = "hello";
    Reader reader = new StringReader(s); // StringReader inherits from Reader

So the whole signature says: return a TokenStream object, where that TokenStream is determined by the field to be segmented and the Reader object wrapping that field's text. This reading is fairly easy to follow, but again I will illustrate it with code later. The method public int getPositionIncrementGap(String fieldName) in the Analyzer class is not needed when writing an analyzer, so it is not very important here; its documented purpose is that, when you create an index, it lets you control the position increment between repeated values of the same field. The long English comment is not reproduced here; if you have the source code, you can look it up yourself.
Now that Analyzer has been analyzed, let's return to StandardAnalyzer. From the analysis above we already know that public TokenStream tokenStream(String fieldName, Reader reader) returns a TokenStream object, and that this TokenStream is determined by the field to be segmented and the Reader wrapping that field's text. Now let's look at the method body, which is just a few lines of code, explained below:
    TokenStream result = new StandardTokenizer(reader); // use StandardTokenizer to process the Reader to be segmented and return a TokenStream object (I don't know why, but this reminds me of the hero of "Léon: The Professional" answering "Cleaner" when the girl asks what he does for a living)
    result = new StandardFilter(result); // filter the TokenStream that came back from the "cleaner"
    result = new LowerCaseFilter(result); // filter the StandardFilter-filtered TokenStream once more (just like the purified water in the ads, filtered however many times over)
    result = new StopFilter(result, stopSet); // filter yet again; after this pass the purified water finally meets the national standard and can be drunk
    return result; // brand-XXX purified water can officially go on sale!

From the analysis above, we can conclude that a word segmenter contains:
1. A tokenizer (e.g., StandardTokenizer)
2. Filters (e.g., LowerCaseFilter)
So the general conclusion is that an Analyzer consists of one tokenizer and several filters (the number of filters is not necessarily the same across analyzers, just as different brands of purified water are filtered a different number of times).
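The "one tokenizer plus several filters" structure is simply the decorator pattern: each filter wraps another token stream and transforms or drops tokens as they pass through. Here is a self-contained sketch with made-up mini classes (not the real Lucene ones) of a LowerCaseFilter-style and a StopFilter-style filter chained behind a tokenizer:

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

// Made-up stand-in for Lucene's TokenStream.
abstract class TokStream {
    abstract String next(); // null when exhausted
}

// The tokenizer: produces raw tokens from the input text.
class SpaceTokenizer extends TokStream {
    private final Iterator<String> it;
    SpaceTokenizer(String text) {
        it = Arrays.asList(text.trim().split("\\s+")).iterator();
    }
    String next() { return it.hasNext() ? it.next() : null; }
}

// A filter in the style of LowerCaseFilter: transforms each token it receives.
class LowerFilter extends TokStream {
    private final TokStream input;
    LowerFilter(TokStream input) { this.input = input; }
    String next() {
        String t = input.next();
        return t == null ? null : t.toLowerCase();
    }
}

// A filter in the style of StopFilter: drops tokens found in the stop set.
class StopWordFilter extends TokStream {
    private final TokStream input;
    private final Set<String> stopSet;
    StopWordFilter(TokStream input, Set<String> stopSet) {
        this.input = input;
        this.stopSet = stopSet;
    }
    String next() {
        for (String t = input.next(); t != null; t = input.next()) {
            if (!stopSet.contains(t)) return t;
        }
        return null;
    }
}

public class FilterChainDemo {
    public static void main(String[] args) {
        Set<String> stop = new HashSet<>(Arrays.asList("the", "a"));
        // Same wiring as StandardAnalyzer: tokenizer first, filters wrap it.
        TokStream result = new SpaceTokenizer("The Quick fox");
        result = new LowerFilter(result);
        result = new StopWordFilter(result, stop);
        for (String t = result.next(); t != null; t = result.next()) {
            System.out.println(t); // prints "quick" then "fox"
        }
    }
}
```

Because each filter only holds a reference to the stream it wraps, you can stack as many filters as you like without any of them knowing about the others; that is why different analyzers can differ only in how many filters they chain.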
Now the question returns to the beginning of this article: how do we get Lucene to recognize our own word segmenter and welcome it into the warm Lucene family? Obviously, we have to follow the admission rules Lucene provides. So, if you want to write a custom XxxAnalyzer, you must:
1. Inherit Analyzer, to let Lucene know you really want to join this warm family: public class XxxAnalyzer extends Analyzer
2. Once admitted, you have to earn your keep rather than idle about all day, so you do the work in public TokenStream tokenStream(String fieldName, Reader reader), and in this method you define the tools you work with: the hoe, a tokenizer (hoeing at midday, sweat dripping onto the soil), and the sickle, the filters (I reap, I reap, I reap again)
3. Then tell Lucene the job is done: return result;
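The three steps above can be sketched end to end. The mini classes below are made up for illustration (XAnalyzer, XTokenStream and friends are not the real Lucene types), but the shape mirrors the old-style Lucene API in which an analyzer exposes tokenStream(fieldName, reader):

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

// Made-up stand-in for Lucene's TokenStream.
abstract class XTokenStream {
    abstract String next(); // null when exhausted
}

// The "hoe": a tokenizer that drains the Reader and splits on whitespace.
class XWhitespaceTokenizer extends XTokenStream {
    private final List<String> tokens = new ArrayList<>();
    private int pos = 0;
    XWhitespaceTokenizer(Reader reader) {
        try {
            StringBuilder sb = new StringBuilder();
            int c;
            while ((c = reader.read()) != -1) sb.append((char) c);
            for (String t : sb.toString().trim().split("\\s+"))
                if (!t.isEmpty()) tokens.add(t);
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }
    String next() { return pos < tokens.size() ? tokens.get(pos++) : null; }
}

// The "sickle": a filter that lower-cases whatever it receives.
class XLowerCaseFilter extends XTokenStream {
    private final XTokenStream input;
    XLowerCaseFilter(XTokenStream input) { this.input = input; }
    String next() {
        String t = input.next();
        return t == null ? null : t.toLowerCase();
    }
}

// Step 1: the admission ticket, a stand-in for Lucene's abstract Analyzer.
abstract class XAnalyzer {
    abstract XTokenStream tokenStream(String fieldName, Reader reader);
}

// Steps 1-3 together: inherit, do the work, hand back the result.
class XxxAnalyzer extends XAnalyzer {
    XTokenStream tokenStream(String fieldName, Reader reader) {
        XTokenStream result = new XWhitespaceTokenizer(reader); // the hoe
        result = new XLowerCaseFilter(result);                  // the sickle
        return result;                                          // job done
    }
}

public class XxxAnalyzerDemo {
    public static void main(String[] args) {
        XTokenStream ts = new XxxAnalyzer()
                .tokenStream("title", new StringReader("Hello World"));
        for (String t = ts.next(); t != null; t = ts.next()) {
            System.out.println(t); // prints "hello" then "world"
        }
    }
}
```

Swap XWhitespaceTokenizer for a real Chinese tokenizer and add more filters, and the wiring stays exactly the same: that is the whole admission procedure.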
OK, that's it! Isn't it easy to get Lucene to recognize your own word segmenter? Lucene looks stern at first glance and can seem unapproachable, but once you get familiar with it, you'll find this "person" is actually quite nice!
Appendix:
1. Careful readers will surely have noticed that StandardAnalyzer contains a few more lines of code concerning StopFilter. Since this post focuses on analyzing StandardAnalyzer in order to learn how to make Lucene recognize our own word segmenter, and StopFilter is not strictly necessary for that purpose, it is skipped for now and will be covered later when time allows.
2. On the Chinese name for XxxAnalyzer: I uniformly call it a word segmenter. Going by the literal English, Analyzer should be translated as "analyzer", and some books do translate it that way; in my view, though, we immediately think of Chinese word segmentation, so I name it a word segmenter. Strictly speaking, the Tokenizer is the word segmenter and is only one part of XxxAnalyzer, but the main job of XxxAnalyzer is tokenization.

