Lucene. Net 2.3.1 Development Introduction-2. Word Segmentation (6)

Last Update:2018-12-07 Source: Internet

Author: User

Tags dot net

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

The previous Lucene. Net version was 2.1, but the next (token) method overload was introduced in version 2.3.1, while the reusablestringreader class was also introduced in the new version. As a result, all word divider Before Version 2.1 has to be modified in version 2.3.1. Another problem is that some existing word divider may not be used here.

Another solution to use readtoend is to modify the Lucene. net source code.

Before modification, we need to know why the readtoend method is invalid because reusablestringreader is a subclass of stringreader. First, check the section about the readtoend method of the. NET Framework stringreader source code.

Code2.1.3.7

Code
1 Public Override String Readtoend ()
2 {
3 String STR;
4 If ( This . _ S = Null )
5 {
6_ Error. readerclosed ();
7}
8 If ( This . _ POS = 0 )
9 {
10Str= This. _ S;
11}
12 Else
13 {
14Str= This. _ S. substring (This. _ POs,This. _ Length- This. _ POS );
15}
16 This . _ POS = This . _ Length;
17 Return STR;
18 }

Code 2.1.3.7 is the source code we are looking for. Compared with the reusablestringreader class, we found that the reusablestringreader class does not assign a value to the private field "_ s" of the parent class -- stringreader. The assignment method is to call the constructor. Why didn't reusablestringreader do that? This is hard to understand, but it is relieved to look at its Java version. This class exists in the Java version, so it appears in the dot net version. This class is completely cloned according to Java code and is converted by tools. The interpreter should not have noticed that stringreader can be used directly here, or you may have noticed that the conversion was not intentionally made to maintain code consistency.

Because the reusablestringreader instance has given stringreader a null value, it is not a good idea to re-instantiate stringreader for minimum modification. Therefore, reloading a readtoend method is a good choice.

Code 2.1.3.8

Code
1 /**//// <Summary>
2///Add the readtoend method to read the characters of the entire stream.
3/// </Summary>
4/// <Returns>Returns the read character.</Returns>
5 Public Override String Readtoend ()
6 {
7 String STR;
8 If ( This . S = Null )
9 {
10Return String. Empty;//If it is null, the error is returned here.
11}
12 If ( This . Upto = 0 )
13 {< br> 14 // when the pointer is at the starting position, the entire character is returned.
15 // after the read method is called, the pointer is not in the starting position.
16 STR = This . s;
17 }
18 Else
19 {
20Str= This. S. substring (This. Upto,This. Left- This. Upto );
21}
22 This . Upto = This . Left;
23 Return STR;
24 }

After code 2.1.3.8 is written and code 2.1.3.3 is run again, it will be OK and there will be no issues that cannot be read. Test:

Search term: English
Result:
Content: English
English words, grammar, and spoken English are all important.
Spoken English, grammar, and words are all important parts of English.
-----------------------------------
Search term: syntax
Result:
Content: syntax
English words, grammar, and spoken English are all important.
Spoken English, grammar, and words are all important parts of English.
-----------------------------------
Search term: Word
Result:
Content: Word
English words, grammar, and spoken English are all important.
Spoken English, grammar, and words are all important parts of English.
To learn English well, we should not only learn grammar, words, but also spoken words.
-----------------------------------
Search term: Spoken Language
Result:
Content: Spoken Language
English words, grammar, and spoken English are all important.
Spoken English, grammar, and words are all important parts of English.
To learn English well, we should not only learn grammar, words, but also spoken words.
-----------------------------------
Search term: + content: "English" + content: "language" + content: "single" + content: "word"
Result:
+ Content: English + content: Language + content: single + content: Word
-----------------------------------

As expected, the modification was successful. Now that the source code is used, it is still good to modify the source code in an uncomfortable place.

Of course, this conversion can also be used when making word segmentation. You can directly use the buffer characters for processing, so the speed will be faster. Let's take a look at how to do it.

Now the binary analyzer can be used. Of course, the query expression also needs to be changed. How to construct a query expression is shown in2.1.2 built-in Word SegmentationAfter talking about it, we will keep more content in the future, or else the content will be put into word segmentation.

Binary word segmentation has obvious advantages over word segmentation, achieving our original goal: to reduce interference. In fact, it is best for a user to search for one or several results instead of thousands of results, which is no different from that without filtering. As a search system, the goal is to automatically help users get what they want, rather than show users how much data you have.

The disadvantage of binary word segmentation is fully exposed, that is, word segmentation is inaccurate. If you want to make Word Segmentation accurate based on binary word segmentation, you can consider this. For example, Chinese numbers are split by single words, while some special words, such as ", and so on, are also split by single words, in this way, you can use simple word segmentation to solve the problem. In a small search system, binary word segmentation is enough. If you want to achieve better results, you need to use the dictionary to match. In terms of natural language, semantic analysis is certainly the best. However, there will be another problem, because the analysis business is complex, resulting in increased development difficulty and slow operation speed, which are all issues that need to be considered during use.

As the final part of Word Segmentation in a phase, it always feels a bit cool. However, if we are talking about Word Segmentation Based on the word library, we will feel a little too early in terms of language segmentation. Therefore, we will take a quick look at it to prepare for the exploration of the index part. The more detailed application of Word Segmentation will be expanded in the advanced section.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More