I have open-sourced the Chinese word segmentation program I developed

Source: Internet
Author: User
Feature overview: please refer to "Chinese word segmentation developed in 2 weeks". It is a bit small (some features are not shown).
This Chinese word segmentation program is based on a matching approach. I wrote it as a practice project. You can use it directly, but I do not recommend doing so, because the overall architecture has some fundamental problems. As a reference for developing your own Chinese word segmentation, however, I believe it has some value.
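The post does not say which matching algorithm the program uses. As an illustration of the general idea only, here is a minimal sketch of forward maximum matching, one common matching-based technique. The function name `forward_max_match`, the `max_len` parameter, and the tiny dictionary are all assumptions made for this example and are not taken from the released source.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation (illustrative sketch, not the author's code).

    At each position, take the longest substring (up to max_len characters)
    that appears in the dictionary; fall back to a single character.
    """
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


# Hypothetical toy dictionary, for demonstration only.
words = {"中文", "分词", "程序", "中文分词"}
print(forward_max_match("中文分词程序", words))
```

Because the longest entry wins, the result depends entirely on what the dictionary contains, which is one reason a matching-based segmenter has to be tuned to the application.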
Recently, Mr. Lu Zhenyu released SharpICTCLAS, a C# version of ICTCLAS. It is an excellent Chinese word segmentation program, and mine is not on the same level. However, SharpICTCLAS probably cannot be dropped directly into your own application, because Chinese word segmentation is not only a matter of accuracy but also of application-specific issues. For example (the examples below are offered purely for discussion, and I hope you will read them in that spirit; no praise or criticism is intended):
1. Identifying and handling full-width characters
Example: ｓｕｎ opens the Java source code (with "sun" written in full-width characters)
Analysis: If the segmenter can identify full-width characters, the full-width "sun" can also be recognized correctly as a single word. From a search perspective, however, it seems that every form of the word needs to be normalized to one form of "sun". Of course, this can also be seen as a job for Lucene's analyzer.
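One way to fold the different forms into a single one is Unicode NFKC normalization, which maps full-width Latin letters and digits to their ASCII equivalents. The sketch below uses Python's standard `unicodedata` module. The helper name `normalize_token` and the decision to lowercase are assumptions for illustration; the original program may handle this differently, or not at all.

```python
import unicodedata


def normalize_token(token):
    """Fold full-width and mixed-case forms into one canonical form (sketch).

    NFKC converts full-width Latin letters and digits to their ASCII forms.
    Lowercasing is an extra assumption so that "Sun" and "sun" also match.
    """
    return unicodedata.normalize("NFKC", token).lower()


for form in ["ｓｕｎ", "ＳＵＮ", "Sun", "sun"]:
    print(repr(form), "->", repr(normalize_token(form)))
```

Applying the same normalization at both index time and query time is what makes the forms interchangeable in search.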

2. Identifying and handling English text
Example: U.S.A is short for the United States
Analysis: SharpICTCLAS splits U.S.A into six pieces. Its English handling is probably still relatively weak. After all, it is a free version.
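As a rough sketch of how a pre-tokenizer could keep a dotted abbreviation together, the regular expression below treats single letters joined by periods as one token. The pattern and the name `tokenize_english` are assumptions for this example; they do not describe how SharpICTCLAS or the author's program works.

```python
import re

# Alternatives are tried in order at each position:
#   1. single letters joined by dots (U.S.A, with an optional trailing dot)
#   2. an ordinary run of letters
#   3. any other single non-space character
TOKEN = re.compile(
    r"[A-Za-z](?:\.[A-Za-z])+(?![A-Za-z])\.?"
    r"|[A-Za-z]+"
    r"|\S"
)


def tokenize_english(text):
    """Split text into tokens while keeping dotted abbreviations whole (sketch)."""
    return TOKEN.findall(text)


print(tokenize_english("U.S.A is short for the United States"))
```

The negative lookahead stops the abbreviation rule from swallowing the first letter of a following word, as in "U.S.Army".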

3. Identifying and handling technical terms and special characters
For example, should Asp.net be split into Asp/./net or kept as Asp.net? Should test@test.com be treated as one word or split apart? And so on.
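Which answer is right depends on the application, so one option is to make the behavior configurable. The sketch below shows both outcomes for the two examples above. The patterns, the `tokenize` function, and the `keep_special` flag are hypothetical names introduced for illustration, and the patterns cover only ASCII letters and digits.

```python
import re

# Simplified patterns, for illustration only.
EMAIL = r"[A-Za-z0-9._%+-]+@[A-Za-z0-9-]+(?:\.[A-Za-z0-9-]+)+"
DOTTED = r"[A-Za-z0-9]+(?:\.[A-Za-z0-9]+)+"  # e.g. Asp.net
WORD = r"[A-Za-z0-9]+"
SYMBOL = r"\S"  # any other single non-space character


def tokenize(text, keep_special=True):
    """Tokenize text, either keeping e-mail addresses and dotted terms whole
    (keep_special=True) or splitting them at every symbol (keep_special=False)."""
    parts = [EMAIL, DOTTED, WORD, SYMBOL] if keep_special else [WORD, SYMBOL]
    return re.findall("|".join(parts), text)


for sample in ["Asp.net", "test@test.com"]:
    print(tokenize(sample, keep_special=True), tokenize(sample, keep_special=False))
```

A search application might even index both variants, so that a query for "net" and a query for "Asp.net" can each find the document.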

I will not give Chinese examples here. SharpICTCLAS segments pure Chinese text very well.
I raise the questions above to make one point: there is no single best Chinese word segmentation, only the one that best fits the needs of the application. The best Chinese word segmentation is therefore one developed around your own requirements. I hope that my program, poor as it is, can serve as at least a small reference.
Architecture (class relationship diagram)

Download source code
