Lucene Application Experience and a Comparison of Several Chinese Word Segmenters

Last Update:2018-12-03 Source: Internet

Author: User

1. Index creation and keyword search running in different systems

If index creation and keyword search are implemented in separate back-end and front-end systems, and both systems are deployed under the same application server (for example, Tomcat 6.0), the following happens: (a) Clicking "Create index" in the back end runs normally, but a subsequent keyword search in the front end throws an exception. (b) Searching in the front end first (with an index already created) runs normally, but then clicking "Create index" in the back end throws an exception. In both cases (a) and (b) the exception is "java.lang.OutOfMemoryError: Java heap space". The exception information is as follows:
2011-6-1 14:34:00 org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet springmvc threw exception
java.lang.OutOfMemoryError: Java heap space
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:167)
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:180)
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:180)
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:180)
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:156)
    at org.wltea.analyzer.dic.Dictionary.loadMainDict(Dictionary.java:97)
    at org.wltea.analyzer.dic.Dictionary.<init>(Dictionary.java:71)
    at org.wltea.analyzer.dic.Dictionary.<clinit>(Dictionary.java:41)
    at org.wltea.analyzer.cfg.Configuration.loadSegmenter(Configuration.java:110)
    at org.wltea.analyzer.IKSegmentation.<init>(IKSegmentation.java:54)
    at org.wltea.analyzer.lucene.IKTokenizer.<init>(IKTokenizer.java:44)
    at org.wltea.analyzer.lucene.IKAnalyzer.tokenStream(IKAnalyzer.java:45)
    at org.apache.lucene.analysis.Analyzer.reusableTokenStream(Analyzer.java:52)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:126)
    at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:246)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:773)
    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:751)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1928)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1902)
    at com.fasdq.fangdake.index.IndexNews.buildDocument(IndexNews.java:210)
    at com.fasdq.fangdake.index.IndexNews.createIndexIKAnalyzer(IndexNews.java:55)
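Analysis and testing led to this conclusion: when index creation and keyword search live in two separate systems, deploy them under two separate Tomcat servers, and the problem does not occur. The stack trace is consistent with this: the heap runs out while IKAnalyzer is loading its main dictionary, and two web applications in one Tomcat presumably each load their own copy of that dictionary into the same JVM heap.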
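To check whether heap pressure is really the cause, it can help to log heap usage just before and after the analyzer is first constructed in each web application. The following is a minimal sketch using only the standard java.lang.Runtime API; the class name HeapReport and its methods are hypothetical and do not come from the original project.

```java
import java.util.Locale;

/** Minimal heap report; class and method names are illustrative, not from the article. */
final class HeapReport {

    /** Fraction of the maximum heap currently in use, given raw byte counts. */
    static double usedFraction(long usedBytes, long maxBytes) {
        if (maxBytes <= 0) {
            throw new IllegalArgumentException("maxBytes must be positive");
        }
        return (double) usedBytes / maxBytes;
    }

    /** Snapshot of the current JVM heap, formatted in megabytes. */
    static String describe() {
        Runtime rt = Runtime.getRuntime();
        long max = rt.maxMemory();                      // upper bound, controlled by -Xmx
        long used = rt.totalMemory() - rt.freeMemory(); // bytes currently occupied by objects
        long mb = 1024L * 1024L;
        return String.format(Locale.ROOT, "heap used=%d MB, max=%d MB (%.0f%%)",
                used / mb, max / mb, 100.0 * usedFraction(used, max));
    }

    public static void main(String[] args) {
        // In a web application, log this before and after the analyzer is first created.
        System.out.println(describe());
    }
}
```

If the "used" figure jumps close to "max" once the second application has loaded its analyzer, that supports the explanation above. Splitting the applications across two Tomcat instances, as the article recommends, gives each its own heap.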

2. Comparison of several Chinese analyzers (PaodingAnalyzer, IKAnalyzer):

/**
* Comparison of word segmenters:
* PaodingAnalyzer vs. IKAnalyzer
* 1) When the keyword is "Haidian District"
* PaodingAnalyzer: 'body:Haidian body:region'
* IKAnalyzer: '(body:Haidian body:Zone)'
* Both can find the document "6: Haidian Beijing hello"
* 2) When the keyword is "Haidian District"
* PaodingAnalyzer: 'body:"Haidian District"'
* IKAnalyzer: '(body:Haidian District body:Diandian district)'
* Only IKAnalyzer can find "6: Haidian Beijing hello"
*/
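One plausible reading of the second case is that the difference lies in the shape of the parsed query: the PaodingAnalyzer output is a quoted phrase, which in Lucene query syntax requires the tokens to appear next to each other, while the IKAnalyzer output is a parenthesized group of separate terms, any of which may match. The sketch below illustrates that distinction with plain Java lists. It does not call Lucene or either analyzer, and the English tokens are stand-ins for the original Chinese tokens, which the translated text does not preserve.

```java
import java.util.Arrays;
import java.util.List;

/** Toy matcher; tokens are English stand-ins, not real analyzer output. */
final class QueryShapes {

    /** Term-group (OR) query: the document matches if it contains at least one query term. */
    static boolean matchesAnyTerm(List<String> docTokens, List<String> queryTerms) {
        for (String term : queryTerms) {
            if (docTokens.contains(term)) {
                return true;
            }
        }
        return false;
    }

    /** Phrase query: all query terms must appear consecutively and in order. */
    static boolean matchesPhrase(List<String> docTokens, List<String> phrase) {
        if (phrase.isEmpty()) {
            return false;
        }
        for (int start = 0; start + phrase.size() <= docTokens.size(); start++) {
            if (docTokens.subList(start, start + phrase.size()).equals(phrase)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<String> doc = Arrays.asList("Haidian", "Beijing", "hello");
        List<String> query = Arrays.asList("Haidian", "District");
        System.out.println(matchesAnyTerm(doc, query)); // true: "Haidian" is in the document
        System.out.println(matchesPhrase(doc, query));  // false: "Haidian District" is not adjacent
    }
}
```

Under this reading, a segmenter that splits the keyword into several independent terms tends to increase recall, while one that keeps it as a phrase is stricter.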

The mmseg4j-1.8.2 analyzer was also tested. Its results are similar to PaodingAnalyzer's, but it produces fewer tokens. For example:

For "Shandong Beijing", mmseg4j-1.8.2 produces "Shandong Beijing", while PaodingAnalyzer produces "Shandong Northeast Beijing". However, the analyzer is complicated to configure.

Based on the experiments above, IKAnalyzer proved comparatively easy to use.
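The token-count difference in the "Shandong Beijing" example can be illustrated with two simple dictionary strategies: greedy maximum matching, which emits non-overlapping words, and exhaustive matching, which emits every dictionary word it finds, including overlapping ones. The sketch below is only an illustration under assumptions: the input 山东北京 is a guess at the original Chinese behind the translated example, the three-word dictionary is made up, and neither method reproduces the actual algorithm of mmseg4j or PaodingAnalyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Toy segmenters over a hypothetical dictionary; not the real mmseg4j or Paoding logic. */
final class SegmentationModes {

    /** Greedy forward maximum matching: take the longest dictionary word at each position. */
    static List<String> maxMatch(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            while (end > i + 1 && !dict.contains(text.substring(i, end))) {
                end--;
            }
            out.add(text.substring(i, end)); // falls back to a single character
            i = end;
        }
        return out;
    }

    /** Exhaustive matching: emit every dictionary word found at any position. */
    static List<String> allWords(String text, Set<String> dict, int maxLen) {
        List<String> out = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            int limit = Math.min(text.length(), i + maxLen);
            for (int end = i + 1; end <= limit; end++) {
                String candidate = text.substring(i, end);
                if (dict.contains(candidate)) {
                    out.add(candidate);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Shandong, Northeast, Beijing (assumed originals of the translated tokens)
        Set<String> dict = new HashSet<>(Arrays.asList("山东", "东北", "北京"));
        System.out.println(maxMatch("山东北京", dict, 2)); // [山东, 北京]
        System.out.println(allWords("山东北京", dict, 2)); // [山东, 东北, 北京]
    }
}
```

The exhaustive strategy yields the extra overlapping token (Northeast), which matches the pattern reported above, where one analyzer returns more tokens than the other for the same input.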

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
