CLucene is the C++ version of Lucene. Its code and recent development information can be downloaded from the project page ("clucene - a c++ search engine"): http://sourceforge.net/projects/clucene/
About a week after reading Lucene in Action, I officially started to investigate CLucene in mid-July. Since the demo examples could not retrieve Chinese characters, I collected information on the Internet about Chinese word segmentation for CLucene. There are three main findings:
1. Character set problems:
Project settings: the project must be set to use the Unicode Character Set.
Under ANSI, the encoding of Chinese characters overlaps with that of other languages, so when analyzing text it is hard to determine whether a given char is half of a Chinese character, or which half of the character it belongs to. Since CLucene supports UCS-2 encoding, the VC project is simply set to use the Unicode Character Set.
The corresponding conversion function must also be changed: Misc::_cpycharToWide in \src\CLucene\util\Misc.cpp (around line 76).
CLucene uses this function to convert a char string to a wchar_t string, but the original code does not take the encoding into account, so the conversion from ANSI to UCS-2 fails for Chinese text. It needs to be rewritten to use the Windows API MultiByteToWideChar.
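As an illustration of what that change could look like, here is a minimal sketch of an ANSI-to-wide conversion built on MultiByteToWideChar. The function name CharToWide and its std::wstring interface are hypothetical and are not CLucene's actual Misc::_cpycharToWide signature; the point is only the two-call pattern of the Windows API replacing a byte-by-byte copy.

```cpp
// Sketch only: convert an ANSI (system code page) string to a wide string
// using the Windows API, instead of copying char by char.
#include <windows.h>
#include <string>

std::wstring CharToWide(const char* src)
{
    if (src == NULL)
        return std::wstring();

    // First call: ask how many wide characters are needed (includes the null).
    int len = MultiByteToWideChar(CP_ACP, 0, src, -1, NULL, 0);
    if (len <= 0)
        return std::wstring();

    std::wstring dst(len, L'\0');
    // Second call: perform the actual ANSI -> UTF-16 conversion.
    MultiByteToWideChar(CP_ACP, 0, src, -1, &dst[0], len);

    dst.resize(len - 1); // drop the trailing null written by the API
    return dst;
}
```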
2. Add Chinese Word Segmentation:
There are two main directions:
1) \src\CLucene\analysis\standard\StandardTokenizer.cpp: this class implements the most basic text splitting, including English word segmentation and number extraction. It already contains CJK token extraction, but it is not complete; the fix is to improve StandardTokenizer::ReadCJK, the function that handles Chinese.
2) Add a new ChineseTokenizer.cpp dedicated to Chinese word segmentation.
3. Chinese word segmentation methods: for example the simple bigram (two-character) method or dictionary-based matching. Choose the appropriate method as needed; more complicated is not necessarily better.
It took me about two weeks to trace the index-creation and query processes in CLucene. In the end I only changed a few lines of code to implement a simple bigram (2-gram) segmentation method and get Chinese search working; a minimal sketch of the idea follows.
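For reference, this is what bigram segmentation of a run of CJK characters amounts to, assuming the text is already a UCS-2/UTF-16 wide string. The function name BigramTokenize is hypothetical and is not the code actually added to CLucene; in a real integration the pairs would be emitted as tokens from a tokenizer rather than collected into a vector.

```cpp
// Sketch of bigram ("2-gram") segmentation: every pair of adjacent
// CJK characters becomes one token, e.g. "ABCD" -> "AB", "BC", "CD".
#include <string>
#include <vector>

std::vector<std::wstring> BigramTokenize(const std::wstring& text)
{
    std::vector<std::wstring> tokens;
    if (text.size() < 2) {
        if (!text.empty())
            tokens.push_back(text); // a single character becomes its own token
        return tokens;
    }
    for (size_t i = 0; i + 1 < text.size(); ++i)
        tokens.push_back(text.substr(i, 2));
    return tokens;
}
```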
Detailed URL Information:
http://hi.baidu.com/developer_chen/blog/item/8c4c62dfc5a3a7124954039c.html
http://hi.baidu.com/_1_1_1_1/blog/item/be1fe41f9fbf0f62f724e475.html
http://www.cnblogs.com/sunli
The following article is very good: it elaborates on the system architecture and some of the object hierarchies, which will help you gain a deep understanding of Lucene. Recommended:
Open-source full-text retrieval engine in Java: http://blog.csdn.net/heiyeshuwu/archive/2006/04/14/662805.aspx
For Chinese word segmentation, the requirements here are generally fairly simple and nothing too complicated is needed, so you can experiment freely. For large systems it is a different matter, since speed is, after all, very important.