CLucene Chinese Word Segmentation

Source: Internet
Author: User

CLucene is the C++ port of Lucene; it can be downloaded from the project page "CLucene - a C++ search engine". Below are notes from my recent research.

After spending a week reading Lucene in Action, I formally began investigating CLucene in mid-July. Because the demo examples cannot retrieve Chinese text, I collected information online about Chinese word segmentation in CLucene. There are three main findings:

1. Character set problems:

Project settings: the project must be set to "Use Unicode Character Set".
With ANSI (multi-byte) encoding, the byte values of Chinese characters overlap with those used by other languages' encodings. When analyzing text, it is hard to tell whether a given char is half of a Chinese character and, if so, whether it is the first half or the second half. CLucene also supports UCS-2 encoding, so the VC project is simply set to "Use Unicode Character Set".
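To make the "which half is this byte" problem concrete, here is a minimal sketch. It assumes a GB2312/GBK-style double-byte encoding in which any byte of 0x81 or above is a lead byte that consumes the byte after it. The helper name charStarts is hypothetical and is not part of CLucene.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: returns the byte offsets at which characters start.
// Assumption: a byte >= 0x81 is a lead byte and the next byte belongs to it.
std::vector<std::size_t> charStarts(const std::string& s) {
    std::vector<std::size_t> starts;
    std::size_t i = 0;
    while (i < s.size()) {
        starts.push_back(i);
        const unsigned char b = static_cast<unsigned char>(s[i]);
        // Two bytes for a double-byte character, one byte otherwise
        // (including a dangling lead byte at the very end of the buffer).
        i += (b >= 0x81 && i + 1 < s.size()) ? 2 : 1;
    }
    assert(i == s.size());  // the scan consumed the buffer exactly
    return starts;
}
```

The result is only correct when the scan starts on a real character boundary. For the bytes "a" D6 D0 CE C4 (the letter a followed by two Chinese characters in GB2312), scanning from offset 0 finds starts at 0, 1 and 3, but scanning from offset 2 pairs D0 with CE and reports a character that does not exist. With wchar_t text this ambiguity does not arise for these characters.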

The corresponding conversion function must be changed: the Misc::_cpycharToWide function at line 76 of \src\CLucene\util\Misc.cpp.
CLucene uses this function to convert a char string to a wchar_t string. The original code does not take encoding into account, so converting ANSI text to UCS-2 produces invalid results; it needs to be modified to use the Windows API function MultiByteToWideChar.
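A minimal sketch of such a conversion follows. It is not the CLucene code and not the author's actual patch: the helper name ansiToWide is hypothetical, the choice of CP_ACP (the system ANSI code page) is an assumption, and the non-Windows branch using mbstowcs is added here only so the sketch compiles elsewhere.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#else
#include <cstdlib>
#endif

// Hypothetical helper: convert a multi-byte (ANSI) string to a wide string.
std::wstring ansiToWide(const std::string& src) {
    if (src.empty()) return std::wstring();
#ifdef _WIN32
    const int srcLen = static_cast<int>(src.size());
    // First call: ask how many wide characters the result needs.
    const int wideLen = MultiByteToWideChar(CP_ACP, 0, src.data(), srcLen, nullptr, 0);
    if (wideLen <= 0) return std::wstring();
    std::wstring dst(static_cast<std::size_t>(wideLen), L'\0');
    // Second call: perform the conversion into the sized buffer.
    MultiByteToWideChar(CP_ACP, 0, src.data(), srcLen, &dst[0], wideLen);
    return dst;
#else
    // Fallback (assumption): use the current C locale's multi-byte encoding.
    // The wide result never has more characters than the input has bytes.
    std::vector<wchar_t> buf(src.size() + 1, L'\0');
    const std::size_t written = std::mbstowcs(buf.data(), src.c_str(), buf.size());
    if (written == static_cast<std::size_t>(-1)) return std::wstring();
    assert(written <= src.size());
    return std::wstring(buf.data(), written);
#endif
}
```

Unlike a byte-by-byte copy, this decodes each multi-byte sequence into one wide character, so a two-byte Chinese character becomes a single wchar_t instead of two unrelated values.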

2. Add Chinese Word Segmentation:

There are two main directions:

1) \src\CLucene\analysis\standard\StandardTokenizer.cpp
This class implements the most basic text splitting, including English word segmentation and number extraction. It does include CJK word extraction, but that part is incomplete. Improve StandardTokenizer::ReadCJK so that it handles Chinese.

2) Add a new ChineseTokenizer.cpp to handle Chinese word segmentation.

3. Chinese word segmentation methods:

Bigram (2-gram) segmentation or dictionary-based matching; choose whichever suits your needs. A more complicated method is not necessarily better.
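Dictionary-based matching can be sketched as forward maximum matching: at each position, take the longest dictionary word that starts there, and fall back to a single character when nothing matches. This is one common variant chosen for illustration; the source does not say which dictionary algorithm is meant. The function name segmentByDictionary and its parameters are hypothetical and not part of CLucene.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <set>
#include <string>
#include <vector>

// Hypothetical helper: forward maximum matching over a wide string.
// dict holds the known words; maxWordLen caps the candidate length.
std::vector<std::wstring> segmentByDictionary(const std::wstring& text,
                                              const std::set<std::wstring>& dict,
                                              std::size_t maxWordLen) {
    std::vector<std::wstring> tokens;
    const std::size_t cap = std::max<std::size_t>(1, maxWordLen);
    std::size_t pos = 0;
    while (pos < text.size()) {
        // Start with the longest candidate that fits in the remaining text.
        std::size_t len = std::min(cap, text.size() - pos);
        // Shrink until the candidate is a dictionary word or one character.
        while (len > 1 && dict.count(text.substr(pos, len)) == 0) {
            --len;
        }
        assert(len >= 1);  // always advance by at least one character
        tokens.push_back(text.substr(pos, len));
        pos += len;
    }
    return tokens;
}
```

With a dictionary containing the two words U+4E2D U+6587 and U+5206 U+8BCD, the five-character text U+4E2D U+6587 U+5206 U+8BCD U+6CD5 is split into those two words followed by the single leftover character. Each step may do several dictionary lookups, which is part of why the simpler bigram approach can be attractive when speed matters.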

It took me about two weeks to trace the index creation and query processes in CLucene. In the end, I changed only a few lines of code to implement a simple bigram (2-gram) segmentation method and get Chinese search working.
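The idea behind bigram segmentation can be sketched as follows. This is a standalone illustration, not the author's few-line change to CLucene: the names isCjk and bigramTokens are hypothetical, CJK is assumed to mean the range U+4E00 to U+9FFF, and non-CJK characters are simply skipped here rather than tokenized as English words or numbers.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Assumption: treat only the CJK Unified Ideographs block as Chinese text.
bool isCjk(wchar_t c) {
    return c >= 0x4E00 && c <= 0x9FFF;
}

// Hypothetical helper: emit overlapping two-character tokens per CJK run.
std::vector<std::wstring> bigramTokens(const std::wstring& text) {
    std::vector<std::wstring> tokens;
    std::size_t i = 0;
    while (i < text.size()) {
        if (!isCjk(text[i])) {  // skip anything that is not CJK
            ++i;
            continue;
        }
        const std::size_t start = i;
        while (i < text.size() && isCjk(text[i])) ++i;  // find the end of the run
        assert(i > start);
        if (i - start == 1) {
            // A lone character cannot form a pair; keep it as its own token.
            tokens.push_back(text.substr(start, 1));
        } else {
            for (std::size_t j = start; j + 1 < i; ++j) {
                tokens.push_back(text.substr(j, 2));
            }
        }
    }
    return tokens;
}
```

A run of four characters ABCD yields the tokens AB, BC and CD. The same splitting has to be applied to the query text as to the indexed text, so that a two-character query term matches the pairs stored in the index.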

Reference links:

The next article is very good. It elaborates on the system architecture and some of the object hierarchies, which helps in gaining a deeper understanding of Lucene; I recommend it.

Open-source full-text retrieval engine for Java SE: http://

For Chinese word segmentation, the requirements are usually fairly simple, so the method does not need to be complicated; feel free to experiment. For large systems, after all, speed is very important.
