When I was writing my graduation thesis a few months ago, I needed to use Chinese word segmentation technology. Here I summarize the materials I found.

1. What is Chinese Word Segmentation

As we all know, English text is built from words, and words are separated by spaces. Chinese text is built from characters, and the characters in a sentence run together without separators; only in combination do they express a meaning. For example, the English sentence "I am a student" is "我是一个学生" in Chinese. A computer can easily tell from the spaces that "student" is a word, but it cannot easily tell that the two characters "学" and "生" together form a single word. Splitting a sequence of Chinese characters into meaningful words is called Chinese word segmentation; some people also call it word cutting. For "我是一个学生", the segmentation result is: "我 是 一个 学生".
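To make the idea concrete, here is a minimal sketch of one classic dictionary-based approach, forward maximum matching. It is only an illustration and not the method used by any of the tools listed in this post; the function name, the toy dictionary, and the `max_len` parameter are all assumptions made up for the example.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right segmentation.

    At each position, take the longest substring (up to max_len characters)
    that appears in the dictionary; fall back to a single character.
    """
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a last resort.
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy dictionary, just large enough for the example sentence.
toy_dictionary = {"我", "是", "一个", "学生"}

print(forward_max_match("我是一个学生", toy_dictionary))
# ['我', '是', '一个', '学生']
```

A real segmenter needs a much larger dictionary and some way to resolve ambiguity and handle words that are not in the dictionary, which is where statistical models come in.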

Chinese word segmentation is the foundation for other Chinese information processing. Search engines are only one application of it; segmentation is also required in other areas such as machine translation (MT), speech synthesis, automatic classification, automatic summarization, and automatic proofreading.

At present, most Chinese word segmentation research is done at universities and research institutes. Tsinghua University, Peking University, the Chinese Academy of Sciences, Beijing Language Institute, Northeastern University, the IBM Research Institute, the Microsoft China Research Institute, and others all have their own research teams. However, almost the only commercial company that truly specializes in Chinese word segmentation is Massive Technology (Hylanda).

Google's Chinese word segmentation technology is provided by an American company, Basis Technology (http://www.basistech.com). Baidu uses word segmentation technology that it developed in-house. Zhongsou ("China Search") uses word segmentation technology provided by the domestic company Massive Technology (http://www.hylanda.com). The industry currently regards Massive Technology's word segmentation as the best in China: its segmentation accuracy is reported to exceed 99%, which also keeps the error rate of Zhongsou's search results very low.
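The post does not say how the accuracy figure above was measured. As a rough illustration of what such numbers can mean, the sketch below computes word-level precision, recall, and F1 by comparing character spans of a predicted segmentation against a reference one. The function names and the sample data are assumptions for illustration, and this is not necessarily the metric behind the 99% claim.

```python
def to_spans(words):
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def segmentation_scores(gold_words, predicted_words):
    """Return (precision, recall, f1); assumes both word lists are non-empty."""
    gold, pred = to_spans(gold_words), to_spans(predicted_words)
    correct = len(gold & pred)          # words with exactly matching boundaries
    precision = correct / len(pred)     # share of predicted words that are right
    recall = correct / len(gold)        # share of reference words that were found
    f1 = 2 * precision * recall / (precision + recall) if correct else 0.0
    return precision, recall, f1


gold = ["我", "是", "一个", "学生"]
predicted = ["我", "是", "一", "个", "学生"]   # "一个" wrongly split in two

precision, recall, f1 = segmentation_scores(gold, predicted)
print(precision, recall)   # 0.6 0.75
```

Here three of the five predicted words match the reference, so precision is 3/5, and three of the four reference words were found, so recall is 3/4.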

2. ICTCLAS, the Chinese Lexical Analysis System of the Institute of Computing Technology

Building on years of research, the Institute of Computing Technology of the Chinese Academy of Sciences developed ICTCLAS (Institute of Computing Technology, Chinese Lexical Analysis System), a Chinese lexical analysis system based on a multilayer hidden Markov model. Its functions include Chinese word segmentation, part-of-speech tagging, and recognition of unregistered (out-of-vocabulary) words. Its segmentation accuracy is as high as 97.58% (according to the most recent evaluation by the 973 Expert Group). Role-based unregistered word recognition achieves a recall above 90%, with the recall for Chinese personal names close to 98%. Segmentation together with part-of-speech tagging runs at 31.5 KB/s. ICTCLAS and 14 other results that the institute released for free were widely reported by Chinese and foreign media, and many free Chinese word segmentation modules in China borrow from the ICTCLAS code to some degree.
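ICTCLAS's multilayer hidden Markov model is considerably more elaborate than anything that fits in a few lines. As a much simpler relative, the sketch below shows a single-layer, character-level HMM that tags each character as B (begin), M (middle), E (end), or S (single-character word) and decodes the best tag sequence with the Viterbi algorithm. This is not ICTCLAS's actual model: all the weights are hand-picked toy values (not trained, and not normalized), and every name here is an assumption for illustration.

```python
import math

STATES = "BMES"
NEG_INF = float("-inf")
FLOOR = 1e-6  # toy weight for characters a state has never "seen"

# Hand-picked toy weights, assumed for this example only.
START = {"B": 0.5, "M": 0.0, "E": 0.0, "S": 0.5}
TRANS = {
    "B": {"M": 0.2, "E": 0.8},
    "M": {"M": 0.3, "E": 0.7},
    "E": {"B": 0.5, "S": 0.5},
    "S": {"B": 0.5, "S": 0.5},
}
EMIT = {
    "B": {"我": 0.1, "是": 0.1, "一": 0.8, "学": 0.9},
    "M": {},
    "E": {"个": 0.8, "生": 0.9},
    "S": {"我": 0.9, "是": 0.9, "一": 0.2, "个": 0.2, "学": 0.1, "生": 0.1},
}


def log(p):
    """Logarithm that maps a zero weight to negative infinity."""
    return math.log(p) if p > 0 else NEG_INF


def viterbi_segment(text):
    """Segment text by finding the highest-scoring B/M/E/S tag sequence."""
    if not text:
        return []
    # scores[t][s]: best log score of any tag path ending in state s at position t.
    scores = [{s: log(START[s]) + log(EMIT[s].get(text[0], FLOOR)) for s in STATES}]
    back = [{}]
    for ch in text[1:]:
        prev, cur, ptr = scores[-1], {}, {}
        for s in STATES:
            best = max(STATES, key=lambda p: prev[p] + log(TRANS[p].get(s, 0.0)))
            cur[s] = prev[best] + log(TRANS[best].get(s, 0.0)) + log(EMIT[s].get(ch, FLOOR))
            ptr[s] = best
        scores.append(cur)
        back.append(ptr)
    # A valid segmentation must finish on E or S; then follow the back pointers.
    tags = [max("ES", key=lambda s: scores[-1][s])]
    for ptr in reversed(back[1:]):
        tags.append(ptr[tags[-1]])
    tags.reverse()
    # Cut the text after every E or S tag.
    words, start = [], 0
    for i, tag in enumerate(tags):
        if tag in "ES":
            words.append(text[start:i + 1])
            start = i + 1
    return words


print(viterbi_segment("我是一个学生"))
# ['我', '是', '一个', '学生']
```

In a real system the start, transition, and emission weights would be estimated from a large hand-segmented corpus rather than written by hand.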

Download: http://www.nlp.org.cn/project/project.php?proj_id=6

Because ICTCLAS is written in C, it is not convenient to use from today's mainstream development tools, so some enthusiastic programmers have ported it to other languages such as Java and C#.

(1) fenci, a Java version of ICTCLAS. Download page: http://www.xml.org.cn/printpage.asp?BoardID=2&id=11502

(2) AutoSplit, another Java version of ICTCLAS. Its download page can no longer be found.

(3) Xiao Ding-dong Chinese word segmentation. It once had a download page, which can no longer be found. According to the author, it is an improvement on ICTCLAS and comes in three versions: Java, C#, and C++. Introduction page: http://www.donews.net/accesine

3. Massive Intelligent Word Segmentation, Research Edition

To let researchers in the field of Chinese information processing share the research results of the Massive Intelligent Computing Technology Research Center, and to jointly raise the level of Chinese information processing, the center has released the Massive Intelligent Word Segmentation research edition for study by experts, scholars, and enthusiasts.

Download: http://www.hylanda.com/cgi-bin/download/download.asp?id=8

4. Others

(1) CSW Chinese Intelligent Word Segmentation component

Operating environment: Windows NT, 2000, XP, or later. The component can be called from Microsoft development languages such as ASP and VB.

Introduction: CSW is an intelligent Chinese word segmentation DLL component. It automatically splits a passage of text into standard Chinese words, separated in a way you specify, and can annotate the resulting words with semantic tags and word frequencies. It is widely used for information retrieval and analysis in various industries.

Download page: http://www.vgoogle.net/

(2) A Chinese word segmentation component written in C#

According to the author, it is a single DLL file that can serve as a Chinese/English word segmentation component. It is written entirely in managed C# code and was developed independently.

Download: http://www.rainsts.net/article.asp?id=48

