Lucene application experience and a comparison of several Chinese word segmenters (analyzers):
1. Index creation and keyword search running in different systems
If index creation is implemented in a back-end system and keyword search in a front-end system, and both systems are deployed in the same application server (for example, a single Tomcat 6.0 instance), the following happens:
(a) Creating the index from the back end runs normally, but a keyword search from the front end afterwards throws an exception.
(b) Searching from the front end (with an index created earlier) runs normally, but creating the index from the back end afterwards throws an exception.
In both cases (a) and (b) the exception is "java.lang.OutOfMemoryError: Java heap space". The exception information is as follows:
2011-6-1 14:34:00 org.apache.catalina.core.StandardWrapperValve invoke
SEVERE: Servlet.service() for servlet springmvc threw exception
java.lang.OutOfMemoryError: Java heap space
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:167)
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:180)
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:180)
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:180)
    at org.wltea.analyzer.dic.DictSegment.fillSegment(DictSegment.java:156)
    at org.wltea.analyzer.dic.Dictionary.loadMainDict(Dictionary.java:97)
    at org.wltea.analyzer.dic.Dictionary.<init>(Dictionary.java:71)
    at org.wltea.analyzer.dic.Dictionary.<clinit>(Dictionary.java:41)
    at org.wltea.analyzer.cfg.Configuration.loadSegmenter(Configuration.java:110)
    at org.wltea.analyzer.IKSegmentation.<init>(IKSegmentation.java:54)
    at org.wltea.analyzer.lucene.IKTokenizer.<init>(IKTokenizer.java:44)
    at org.wltea.analyzer.lucene.IKAnalyzer.tokenStream(IKAnalyzer.java:45)
    at org.apache.lucene.analysis.Analyzer.reusableTokenStream(Analyzer.java:52)
    at org.apache.lucene.index.DocInverterPerField.processFields(DocInverterPerField.java:126)
    at org.apache.lucene.index.DocFieldProcessorPerThread.processDocument(DocFieldProcessorPerThread.java:246)
    at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:773)
    at org.apache.lucene.index.DocumentsWriter.addDocument(DocumentsWriter.java:751)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1928)
    at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1902)
    at com.fasdq.fangdake.index.IndexNews.buildDocument(IndexNews.java:210)
    at com.fasdq.fangdake.index.IndexNews.createIndexIKAnalyzer(IndexNews.java:55)
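After analysis and testing, we concluded that when index creation and keyword search live in two separate systems, the two systems should be deployed in two separate Tomcat servers; the problem then no longer occurs. The stack trace shows that the error is raised while IKAnalyzer loads its main dictionary, so a likely cause is that each web application loads its own copy of the dictionary into the single JVM heap that both applications share.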
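One way to check whether the shared heap really is the bottleneck is to log the heap figures just before the analyzer is first used in each web application. The following is a minimal sketch using only the standard java.lang.Runtime API; the class HeapCheck and its method names are hypothetical and are not part of Lucene, IKAnalyzer, or the original project.

```java
// Sketch only: HeapCheck is a hypothetical helper, not part of Lucene or IKAnalyzer.
final class HeapCheck {
    private HeapCheck() {}

    /** Bytes of heap currently occupied by objects in this JVM. */
    static long usedBytes() {
        Runtime rt = Runtime.getRuntime();
        return rt.totalMemory() - rt.freeMemory();
    }

    /** Approximate bytes that can still be allocated before the heap limit is reached. */
    static long availableBytes() {
        return Runtime.getRuntime().maxMemory() - usedBytes();
    }

    /** True if at least requiredBytes of heap still appear to be available. */
    static boolean hasHeadroom(long requiredBytes) {
        return availableBytes() >= requiredBytes;
    }

    /** One-line summary suitable for a log statement. */
    static String summary() {
        long mb = 1024L * 1024L;
        return String.format("heap used=%dMB available=%dMB max=%dMB",
                usedBytes() / mb,
                availableBytes() / mb,
                Runtime.getRuntime().maxMemory() / mb);
    }

    public static void main(String[] args) {
        System.out.println(summary());
    }
}
```

Logging HeapCheck.summary() in both web applications before indexing and before searching would show how much of the one JVM heap each step consumes. If the figures confirm the diagnosis, raising the heap limit with the JVM's -Xmx option is another possible remedy, although the experiments described here only tested separate Tomcat servers.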
2. Comparison of several Chinese analyzers (PaodingAnalyzer, IKAnalyzer):
/**
* Analyzer comparison:
* PaodingAnalyzer vs. IKAnalyzer
* 1) When the keyword is "Haidian District"
* PaodingAnalyzer: 'body:Haidian body:region'
* IKAnalyzer: '(body:Haidian body:Zone)'
* Both analyzers find the document "6: Haidian Beijing hello"
* 2) When the keyword is "Haidian District"
* PaodingAnalyzer: 'body:"Haidian District"'
* IKAnalyzer: '(body:Haidian District body:Diandian district)'
* Only IKAnalyzer finds the document "6: Haidian Beijing hello"
*/
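The comparison above suggests that whether a document is found depends on how finely the analyzer splits the keyword. The toy model below illustrates that idea only; it does not call Lucene, PaodingAnalyzer, or IKAnalyzer, and the class TokenMatchDemo, the method matchesAny, and the English token strings are hypothetical stand-ins for the real Chinese terms.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model only: a "document" is a set of indexed terms, and a query
// matches if any of its tokens is one of those terms (an OR of term queries).
final class TokenMatchDemo {
    private TokenMatchDemo() {}

    /** True if at least one query token is present among the indexed terms. */
    static boolean matchesAny(Set<String> indexedTerms, List<String> queryTokens) {
        for (String token : queryTokens) {
            if (indexedTerms.contains(token)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Assumed terms for the document "6: Haidian Beijing hello".
        Set<String> indexed = new HashSet<>(Arrays.asList("Haidian", "Beijing", "hello"));

        // Keyword split into two tokens, as in case 1.
        List<String> fineGrained = Arrays.asList("Haidian", "District");
        // Keyword kept as one unit, as in PaodingAnalyzer's output for case 2.
        List<String> coarseGrained = Arrays.asList("Haidian District");

        System.out.println("fine-grained query matches:   " + matchesAny(indexed, fineGrained));
        System.out.println("coarse-grained query matches: " + matchesAny(indexed, coarseGrained));
    }
}
```

In this model the fine-grained query matches because the token "Haidian" is an indexed term, while the coarse-grained query does not match because the whole keyword never appears as a single term. This is one plausible reading of why PaodingAnalyzer misses the document in case 2; real Lucene queries also involve positions and scoring, which the sketch ignores.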
In addition, the mmseg4j-1.8.2 analyzer was tested. Its results are similar to those of PaodingAnalyzer, but it produces fewer tokens. For example, for "Shandong Beijing", mmseg4j-1.8.2 produces "Shandong Beijing", while PaodingAnalyzer produces "Shandong Northeast Beijing". However, the configuration of the analyzer is complicated.
Based on the experiments above, IKAnalyzer proved easy to use.