Lucene.Net 2.3.1 Development Introduction - 2. Word Segmentation (6)

Source: Internet
Author: User
Tags: dot net

The previous Lucene.Net version was 2.1. Version 2.3.1 introduced the Next(Token) method overload, along with the ReusableStringReader class. As a result, every tokenizer (word segmenter) written for version 2.1 or earlier has to be modified to work with 2.3.1. A further problem is that some existing tokenizers may not be usable here at all.

 

Another way to make ReadToEnd usable is to modify the Lucene.Net source code.

 

Before modifying anything, we need to understand why the ReadToEnd method does not work, given that ReusableStringReader is a subclass of StringReader. First, look at the ReadToEnd method in the .NET Framework StringReader source code.

 

 

Code 2.1.3.7

    public override string ReadToEnd()
    {
        string str;
        if (this._s == null)
        {
            __Error.ReaderClosed();
        }
        if (this._pos == 0)
        {
            str = this._s;
        }
        else
        {
            str = this._s.Substring(this._pos, this._length - this._pos);
        }
        this._pos = this._length;
        return str;
    }

 

Code 2.1.3.7 is the source code we were looking for. Comparing it with the ReusableStringReader class, we find that ReusableStringReader never assigns a value to "_s", the private field of its parent class StringReader. That field is assigned only by calling the StringReader constructor. Why doesn't ReusableStringReader do that? It is puzzling at first, but a look at the Java version explains it. The class exists in the Java version, so it also appears in the .NET version: it was cloned from the Java code and converted by a tool. The person who ported it either did not notice that StringReader could be used directly here, or noticed and deliberately left the converted code alone to keep it consistent with the Java code.

 

Because the ReusableStringReader instance has already given StringReader a null value, re-instantiating StringReader is not a good way to keep the modification minimal. Overriding the ReadToEnd method is therefore the better choice.

 

 

Code 2.1.3.8

    /// <summary>
    /// Adds a ReadToEnd method that reads all the characters of the stream.
    /// </summary>
    /// <returns>The characters that were read.</returns>
    public override string ReadToEnd()
    {
        string str;
        if (this.s == null)
        {
            // If s is null, return an empty string; the base class raises an error at this point.
            return String.Empty;
        }
        if (this.upto == 0)
        {
            // When the pointer is at the starting position, return the entire string.
            // After Read has been called, the pointer is no longer at the starting position.
            str = this.s;
        }
        else
        {
            str = this.s.Substring(this.upto, this.left - this.upto);
        }
        this.upto = this.left;
        return str;
    }
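
The behaviour discussed above can be reproduced outside Lucene.Net with a small stand-in class. This is only a sketch: DemoReusableReader, Init and InheritedReadToEnd are hypothetical names, not the Lucene.Net source, and the base constructor is given an empty string here. The sketch also assumes that "left" holds the number of characters still unread, so the remainder after a partial Read is Substring(upto, left). If "left" holds the total length instead, the "left - upto" form in Code 2.1.3.8 is the one to use, so it is worth checking how Read updates these fields before relying on either version.

```csharp
using System;
using System.IO;

// Hypothetical stand-in for ReusableStringReader; not the Lucene.Net source.
public class DemoReusableReader : StringReader
{
    private string s;   // text held by the subclass; the base class never sees it
    private int upto;   // index of the next unread character
    private int left;   // number of unread characters (assumption of this sketch)

    // The base class only ever receives an empty string.
    public DemoReusableReader() : base(string.Empty) { }

    public void Init(string text)
    {
        s = text;
        upto = 0;
        left = text.Length;
    }

    public override int Read(char[] buffer, int index, int count)
    {
        if (left == 0) return 0;              // nothing left to read
        int n = Math.Min(count, left);
        s.CopyTo(upto, buffer, index, n);     // copy from the subclass's own field
        upto += n;
        left -= n;
        return n;
    }

    // What callers get without an override: the base class's own (empty) buffer.
    public string InheritedReadToEnd()
    {
        return base.ReadToEnd();
    }

    // Override that reads from the subclass's field instead of the base buffer.
    public override string ReadToEnd()
    {
        if (s == null) return string.Empty;
        string rest = s.Substring(upto, left);
        upto += left;
        left = 0;
        return rest;
    }
}
```

After Init("hello world"), InheritedReadToEnd() returns an empty string because the base class only knows about the string passed to its constructor, while the overridden ReadToEnd() called after reading six characters returns "world".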

 

With Code 2.1.3.8 in place, running Code 2.1.3.3 again works and the text can be read without problems. Test:

 

Search term: English
Result:
Content: English
English words, grammar, and spoken English are all important.
Spoken English, grammar, and words are all important parts of English.
-----------------------------------
Search term: syntax
Result:
Content: syntax
English words, grammar, and spoken English are all important.
Spoken English, grammar, and words are all important parts of English.
-----------------------------------
Search term: Word
Result:
Content: Word
English words, grammar, and spoken English are all important.
Spoken English, grammar, and words are all important parts of English.
To learn English well, we should not only learn grammar, words, but also spoken words.
-----------------------------------
Search term: Spoken Language
Result:
Content: Spoken Language
English words, grammar, and spoken English are all important.
Spoken English, grammar, and words are all important parts of English.
To learn English well, we should not only learn grammar, words, but also spoken words.
-----------------------------------
Search term: + content: "English" + content: "language" + content: "single" + content: "word"
Result:
+ Content: English + content: Language + content: single + content: Word
-----------------------------------

As expected, the modification was successful. Since we are working with the source code anyway, it makes sense to modify the source wherever it gets in the way.

 

Of course, a tokenizer can also skip building the whole string and process the buffered characters directly, which is faster. It is worth looking at how to do that.
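
As a rough illustration of processing buffered characters directly, the sketch below reads a TextReader in fixed-size chunks with Read(char[], int, int) and never calls ReadToEnd. BufferDemo and SplitOnWhitespace are hypothetical names, and splitting on whitespace is only a placeholder for real segmentation logic.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;

// Hypothetical helper; whitespace splitting stands in for real segmentation.
public static class BufferDemo
{
    public static List<string> SplitOnWhitespace(TextReader reader, int bufferSize)
    {
        var tokens = new List<string>();
        var current = new StringBuilder();   // carries a token across chunk boundaries
        char[] buffer = new char[bufferSize];
        int n;
        // Read returns 0 once the reader is exhausted.
        while ((n = reader.Read(buffer, 0, buffer.Length)) > 0)
        {
            for (int i = 0; i < n; i++)
            {
                char c = buffer[i];
                if (char.IsWhiteSpace(c))
                {
                    if (current.Length > 0)
                    {
                        tokens.Add(current.ToString());
                        current.Length = 0;
                    }
                }
                else
                {
                    current.Append(c);
                }
            }
        }
        if (current.Length > 0) tokens.Add(current.ToString());
        return tokens;
    }
}
```

Because the pending token is kept between reads, a word that straddles two chunks still comes out whole: SplitOnWhitespace(new StringReader("ab cd  efg"), 4) yields ab, cd, efg.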

 

Now the binary (bigram) analyzer can be used. Of course, the query expression also needs to change. How to construct a query expression was covered in section 2.1.2, Built-in Word Segmentation. More of that material is left for later; otherwise it would all end up in the word segmentation chapter.

 

Binary word segmentation has a clear advantage over single-character segmentation and achieves our original goal: reducing interference. In practice, it is best for a search to return one or a few results rather than thousands, because thousands of results are no different from not filtering at all. The goal of a search system is to help users find what they want automatically, not to show them how much data you have.

 

The disadvantage of binary word segmentation is equally plain: the segmentation is inaccurate. To make it more accurate while still building on the binary approach, consider a few rules. For example, split Chinese numerals into single characters, and split certain special words into single characters as well. Simple rules like these solve part of the problem. For a small search system, binary word segmentation is enough. For better results, match against a dictionary. From a natural language standpoint, semantic analysis is certainly best, but it brings another problem: the analysis logic is complex, which makes development harder and execution slower. All of these need to be weighed in practice.
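
The rules just described can be sketched as follows. This is a simplified illustration, not the analyzer used in this series: BigramDemo, Segment and the singleChars parameter are hypothetical names. Each run of letters or digits is emitted as overlapping two-character tokens, while any character listed in singleChars (for example Chinese numerals) is emitted on its own and breaks the run.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

// Hypothetical sketch of binary (bigram) segmentation with single-character exceptions.
public static class BigramDemo
{
    public static List<string> Segment(string text, ICollection<char> singleChars)
    {
        var tokens = new List<string>();
        var run = new StringBuilder();
        foreach (char c in text)
        {
            bool isSingle = singleChars.Contains(c);
            if (char.IsLetterOrDigit(c) && !isSingle)
            {
                run.Append(c);               // extend the current run
                continue;
            }
            FlushRun(run, tokens);           // punctuation, whitespace or a single ends the run
            if (isSingle) tokens.Add(c.ToString());
        }
        FlushRun(run, tokens);
        return tokens;
    }

    // Emits overlapping pairs for a run; a run of one character is emitted as is.
    private static void FlushRun(StringBuilder run, List<string> tokens)
    {
        if (run.Length == 1) tokens.Add(run.ToString());
        for (int i = 0; i + 1 < run.Length; i++)
        {
            tokens.Add(run.ToString(i, 2));
        }
        run.Length = 0;
    }
}
```

For example, Segment("abcd", new HashSet<char>()) yields ab, bc, cd, and adding a character to singleChars makes it a token of its own that splits the surrounding text into separate runs.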

 

As the final part of this phase's word segmentation series, this always feels a little thin. However, discussing dictionary-based word segmentation at this point would be premature, so we only take a quick look here in preparation for exploring the index part. More detailed applications of word segmentation will be covered in the advanced section.
