Lucene application experience and a comparison of several Chinese word segmenters

Source: Internet
Author: User

1. Index creation and keyword search running in separate applications

Suppose index creation lives in the back-end application and keyword search lives in the front-end application, and both applications are deployed to the same application server (for example, a single Tomcat 6.0 instance). The following then happens: (a) if you first click "Create index" in the back end, it runs normally, but clicking "Search" in the front end afterwards throws an exception; (b) if you first click "Search" in the front end (with an index already built), it runs normally, but clicking "Create index" in the back end afterwards throws an exception. In both cases (a) and (b) the exception is
"java.lang.OutOfMemoryError: Java heap space". The exception information is as follows:
2011-6-1 14:34:00 org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet springmvc threw exception
java.lang.OutOfMemoryError: Java heap space
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:167)
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:180)
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:180)
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:180)
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:156)
    at org.wltea.analyzer.dic.Dictionary.loadMainDict(Dictionary.java:97)
    at org.wltea.analyzer.dic.Dictionary.<init>(Dictionary.java:71)
    at org.wltea.analyzer.dic.Dictionary.<clinit>(Dictionary.java:41)
    at org.wltea.analyzer.cfg.Configuration.loadSegmenter(Configuration.java:110)
    at org.wltea.analyzer.IKSegmentation.<init>(IKSegmentation.java:54)
    at org.wltea.analyzer.lucene.IKTokenizer.<init>(IKTokenizer.java:44)
    at org.wltea.analyzer.lucene.IKAnalyzer.tokenStream(IKAnalyzer.java:45)
    at org.apache.lucene.analysis.Analyzer.reusableTokenStream(Analyzer.java:52)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:126)
    at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:246)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:773)
    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:751)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1928)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1902)
    at com.fasdq.fangdake.index.IndexNews.buildDocument(IndexNews.java:210)
    at com.fasdq.fangdake.index.IndexNews.createIndexIKAnalyzer(IndexNews.java:55)
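The stack trace shows that the error is raised while IK Analyzer is loading its main dictionary (Dictionary.loadMainDict). A likely explanation is that each web application loads its own copy of the dictionary, so two applications sharing one JVM exhaust the heap. After analysis and testing, we conclude that when index creation and keyword search live in two separate applications, they should be deployed to two separate Tomcat servers; the problem then no longer occurs.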
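One way to check whether heap exhaustion is really what separates the working and failing cases is to log the remaining heap before the analyzer is first used in each application. The following is a minimal sketch using only the standard `java.lang.Runtime` API; the class name `HeapReport`, its method names, and the 64 MB threshold are hypothetical choices for illustration and are not part of Lucene, IK Analyzer, or the original applications.

```java
// Hypothetical helper for illustration; not part of Lucene or IK Analyzer.
final class HeapReport {

    /** Bytes that can still be allocated before the heap limit is reached. */
    static long headroomBytes(long maxBytes, long totalBytes, long freeBytes) {
        long usedBytes = totalBytes - freeBytes; // memory currently occupied by objects
        return maxBytes - usedBytes;             // what is left up to the -Xmx limit
    }

    /** Headroom of the running JVM, read from java.lang.Runtime. */
    static long currentHeadroomBytes() {
        Runtime rt = Runtime.getRuntime();
        return headroomBytes(rt.maxMemory(), rt.totalMemory(), rt.freeMemory());
    }

    /** True if at least requiredBytes can still be allocated (approximately). */
    static boolean hasHeadroom(long requiredBytes) {
        return currentHeadroomBytes() >= requiredBytes;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        System.out.println("Heap headroom (MB): " + currentHeadroomBytes() / mb);
        // 64 MB is an assumed, illustrative threshold; measure the real
        // dictionary footprint in your own environment.
        if (!hasHeadroom(64 * mb)) {
            System.out.println("Warning: little heap left before loading the dictionary");
        }
    }
}
```

If the headroom turns out to be small when both applications run in one Tomcat, raising the maximum heap with the JVM `-Xmx` option is an alternative worth testing alongside the separate-deployment approach described above.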

2. Comparison of several Chinese analyzers (PaodingAnalyzer, IKAnalyzer):

/**
 * Comparison of analyzers:
 * PaodingAnalyzer vs. IKAnalyzer
 * 1) When the keyword is "Haidian District":
 *    PaodingAnalyzer: 'body:Haidian body:region'
 *    IKAnalyzer:      '(body:Haidian body:Zone)'
 *    Both find the document "6: Haidian Beijing hello".
 * 2) When the keyword is "Haidian District":
 *    PaodingAnalyzer: 'body:"Haidian District"'
 *    IKAnalyzer:      '(body:Haidian District body:Diandian district)'
 *    Only IKAnalyzer finds the document "6: Haidian Beijing hello".
 */

In addition, the mmseg4j-1.8.2 analyzer was tested. Its results are similar to those of PaodingAnalyzer, but it produces fewer tokens than PaodingAnalyzer. For example:

For "Shandong Beijing", mmseg4j-1.8.2 produces "Shandong Beijing", while PaodingAnalyzer produces "Shandong Northeast Beijing". However, the configuration of the analyzer is complicated.
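The difference in token counts can be illustrated with a toy segmenter. The sketch below is not the actual algorithm of mmseg4j or PaodingAnalyzer; it only contrasts two general strategies, forward maximum matching (one token per position, longest word wins) and exhaustive matching (every dictionary word found anywhere is emitted). It assumes that the original input was the four-character string 山东北京 and that the dictionary contains 山东 (Shandong), 东北 (Northeast), and 北京 (Beijing). The class `ToySegmenter` and its methods are hypothetical names, and the Chinese strings are written as Unicode escapes so the file compiles under any source encoding.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy illustration only; real analyzers use far more elaborate rules.
final class ToySegmenter {

    /** Forward maximum matching: at each position take the longest dictionary word. */
    static List<String> maxMatch(String text, Set<String> dict, int maxLen) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < text.length()) {
            int end = Math.min(text.length(), i + maxLen);
            String word = null;
            for (int j = end; j > i; j--) {          // try the longest candidate first
                String candidate = text.substring(i, j);
                if (dict.contains(candidate)) {
                    word = candidate;
                    break;
                }
            }
            if (word == null) {
                word = text.substring(i, i + 1);     // unknown character becomes its own token
            }
            tokens.add(word);
            i += word.length();                      // continue after the matched word
        }
        return tokens;
    }

    /** Exhaustive matching: emit every dictionary word found at any position. */
    static List<String> allWords(String text, Set<String> dict, int maxLen) {
        List<String> tokens = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            int end = Math.min(text.length(), i + maxLen);
            for (int j = i + 1; j <= end; j++) {
                String candidate = text.substring(i, j);
                if (dict.contains(candidate)) {
                    tokens.add(candidate);           // overlapping words are all kept
                }
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        Set<String> dict = new HashSet<>(Arrays.asList(
                "\u5c71\u4e1c",   // Shandong
                "\u4e1c\u5317",   // Northeast
                "\u5317\u4eac")); // Beijing
        String text = "\u5c71\u4e1c\u5317\u4eac"; // "Shandong Beijing"
        System.out.println(maxMatch(text, dict, 2)); // two tokens: Shandong, Beijing
        System.out.println(allWords(text, dict, 2)); // three tokens: Shandong, Northeast, Beijing
    }
}
```

The exhaustive strategy yields the extra overlapping token "Northeast", which increases recall at search time at the cost of possible false matches. This mirrors the difference reported above without claiming to reproduce either library.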

Based on the experiments above, IKAnalyzer proved easy to use.
