Research on the ICTCLAS Word Segmentation System (I)

Source: Internet
Author: User

ICTCLAS is a widely praised Chinese word segmentation system developed by Zhang Huaping and Liu Qun of the Chinese Academy of Sciences. Commendably, the free version ships with its source code, which gives beginners like us valuable learning material.

One shortcoming is that the source code comes without accompanying documentation, so reading it can be difficult, especially for people unfamiliar with C++. I have used Java and VB as my main development languages. I learned C++ at university but have not used it since I started working, so I have forgotten almost all of the syntax. Still, the fundamentals of programming languages carry over, and Java was itself modeled on C++, so the two are similar in many ways. Reading through the source code, the main syntax should pose no problem.

Although ICTCLAS has no complete documentation, we can get a glimpse of its main ideas from the relevant papers published by Zhang Huaping and Liu Qun.

The core idea of the system is to segment words with a CHMM (cascaded hidden Markov model). Layering both improves segmentation accuracy and keeps segmentation efficient. The model is divided into five layers, as shown in the following figure:

Basic idea: first perform atom segmentation; on that basis, perform N-shortest-path rough segmentation to find the top N best segmentation candidates and generate the bigram segmentation table; then generate the segmentation result; finally perform POS tagging, which completes the main segmentation steps.
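To make the first two steps concrete, here is a minimal sketch of atom segmentation followed by a shortest-path rough segmentation. It is not the ICTCLAS code: the function names AtomSegment and RoughSegment, the toy dictionary, and the unknownCost penalty are all my own assumptions, the input is assumed to be UTF-8 (the original source works on a double-byte Chinese encoding), and only the single best path (N = 1) is kept, scored with unigram costs.

```cpp
#include <cassert>
#include <cctype>
#include <cmath>
#include <limits>
#include <map>
#include <string>
#include <vector>

// Step 1 (sketch): atom segmentation. Each multi-byte UTF-8 character is one
// atom; a run of ASCII letters/digits is kept together as a single atom.
std::vector<std::string> AtomSegment(const std::string& text) {
    std::vector<std::string> atoms;
    size_t i = 0;
    while (i < text.size()) {
        unsigned char c = static_cast<unsigned char>(text[i]);
        size_t len = 1;
        if (c < 0x80) {
            if (std::isalnum(c)) {
                while (i + len < text.size()) {
                    unsigned char next = static_cast<unsigned char>(text[i + len]);
                    if (next >= 0x80 || !std::isalnum(next)) break;
                    ++len;
                }
            }
        } else if ((c & 0xE0) == 0xC0) {
            len = 2;  // 2-byte UTF-8 sequence
        } else if ((c & 0xF0) == 0xE0) {
            len = 3;  // 3-byte UTF-8 sequence (most Chinese characters)
        } else if ((c & 0xF8) == 0xF0) {
            len = 4;  // 4-byte UTF-8 sequence
        }
        atoms.push_back(text.substr(i, len));
        i += len;
    }
    return atoms;
}

// Step 2 (sketch): rough segmentation. Every dictionary word that matches a
// span of atoms is an edge in a word graph, with cost -log(probability).
// The cheapest path from atom 0 to atom n is taken as the segmentation.
// unknownCost is a hypothetical penalty for a single atom not in the dictionary.
std::vector<std::string> RoughSegment(const std::vector<std::string>& atoms,
                                      const std::map<std::string, double>& dict,
                                      double unknownCost = 20.0) {
    const size_t n = atoms.size();
    const double kInf = std::numeric_limits<double>::infinity();
    std::vector<double> best(n + 1, kInf);  // best[i]: cheapest cost to reach atom i
    std::vector<size_t> prev(n + 1, 0);     // prev[i]: start of the last word on that path
    best[0] = 0.0;
    for (size_t start = 0; start < n; ++start) {
        std::string word;
        for (size_t end = start + 1; end <= n; ++end) {
            word += atoms[end - 1];
            double cost;
            auto it = dict.find(word);
            if (it != dict.end()) {
                assert(it->second > 0.0 && it->second <= 1.0);
                cost = -std::log(it->second);
            } else if (end == start + 1) {
                cost = unknownCost;  // a single atom is always a valid edge
            } else {
                continue;  // multi-atom span that is not a dictionary word
            }
            if (best[start] + cost < best[end]) {
                best[end] = best[start] + cost;
                prev[end] = start;
            }
        }
    }
    // Walk the back-pointers from the end to recover the words in order.
    std::vector<std::string> words;
    for (size_t end = n; end > 0; end = prev[end]) {
        std::string word;
        for (size_t k = prev[end]; k < end; ++k) word += atoms[k];
        words.insert(words.begin(), word);
    }
    return words;
}
```

Calling RoughSegment(AtomSegment(text), dict) with a small probability table is enough to experiment with. Keeping the N best paths instead of one would require storing several candidate costs and back-pointers per node, which this sketch leaves out.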

The following examines the main parts of the source code:

1. First, the ICTCLAS word segmenter calls CICTCLAS_WinDlg::OnBtnRun() to start execution. As the handler shows, it segments the source string. Before segmentation, the dictionaries must be loaded: creating the m_ICTCLAS object invokes its constructor, which loads the dictionary library. For an analysis of the dictionary structure, see "Research on the ICTCLAS Word Segmentation System (II)".

void CICTCLAS_WinDlg::OnBtnRun()
{

    ......

    // Perform word segmentation and POS tagging here

    if (!m_ICTCLAS.ParagraphProcessing((char *)(LPCTSTR)m_sSource, sResult))
        m_sResult.Format("Error: program initialization exception!");
    else
        m_sResult.Format("%s", sResult); // Output the final segmentation result

    ......

}
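The pattern described in step 1, loading the dictionary in the constructor and reporting a failure only when processing is requested, can be sketched as follows. This is a hypothetical stand-in, not the real class: the name Segmenter, the "word frequency" text format, and reading from a std::istream are assumptions of mine, and the segmentation itself is replaced by a placeholder.

```cpp
#include <istream>
#include <map>
#include <sstream>
#include <string>

// Hypothetical stand-in for the m_ICTCLAS object.
class Segmenter {
public:
    // The constructor loads the dictionary, so a constructed object is ready to use.
    // Assumed format: one "word frequency" pair per line.
    explicit Segmenter(std::istream& dictSource) {
        std::string word;
        long freq = 0;
        while (dictSource >> word >> freq) {
            dict_[word] = freq;
        }
        initialized_ = !dict_.empty();
    }

    // Returns false if the dictionary failed to load, mirroring the
    // "program initialization exception" branch in OnBtnRun().
    bool ParagraphProcessing(const std::string& source, std::string& result) const {
        if (!initialized_) return false;
        result = source;  // placeholder: real segmentation would happen here
        return true;
    }

private:
    std::map<std::string, long> dict_;
    bool initialized_ = false;
};
```

Constructing a Segmenter from a std::istringstream holding a few "word frequency" lines is enough to try it. With an empty stream, ParagraphProcessing returns false, which is the case the dialog handler reports as an initialization error.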

2. OnBtnRun() calls the segmentation method bool CResult::ParagraphProcessing(char *sParagraph, char *sResult), which carries out the whole segmentation process, including POS tagging. The first argument is the source string and the second receives the segmented result. Together these two methods complete the entire segmentation process. What we need to understand next is how this method calls other methods step by step, following the framework diagram above, to complete segmentation. For simplicity, we set aside the analysis of unknown (out-of-vocabulary) words for now.

// Paragraph segmentation and POS tagging
bool CResult::ParagraphProcessing(char *sParagraph, char *sResult)
{

    ........

    Processing(sSentence, 1); // Process the current sentence and output its result
    Output(m_pResult[0], sSentenceResult, bFirstIgnore); // Output to the intermediate result

    .......

}
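The excerpt above shows that the paragraph is handled one sentence at a time. A minimal sketch of that outer loop might look like the following. The function name ProcessParagraph, the callback parameter, and the ASCII-only delimiter set are assumptions of mine; the real method also recognizes Chinese punctuation and writes into a char buffer.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical sketch of the sentence loop inside ParagraphProcessing():
// cut the paragraph at sentence delimiters, process each sentence, and
// append the per-sentence results in order.
std::string ProcessParagraph(
    const std::string& paragraph,
    const std::function<std::string(const std::string&)>& processSentence) {
    assert(static_cast<bool>(processSentence));  // a callable must be supplied
    const std::string delimiters = ".!?;\n";     // ASCII only, for brevity
    std::string result;
    std::string sentence;
    for (char ch : paragraph) {
        sentence += ch;
        if (delimiters.find(ch) != std::string::npos) {
            result += processSentence(sentence);  // plays the role of Processing() + Output()
            sentence.clear();
        }
    }
    if (!sentence.empty()) {
        result += processSentence(sentence);      // trailing text without a delimiter
    }
    return result;
}
```

Passing the per-sentence work in as a callback keeps the loop independent of how a sentence is segmented, so a segmentation routine can be plugged in later without touching the splitting logic.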

3. The main segmentation work is done in the Processing() method, which we analyze further below.

bool CResult::Processing(char *sSentence, unsigned int nCount)
{

    ......

    // Perform bigram segmentation

    m_Seg.BiSegment(sSentence, m_dSmoothingPara, m_dictCore, m_dictBigram, nCount);

    ......

    // Perform POS tagging here

    m_POSTagger.POSTagging(m_Seg.m_pWordSeg[nIndex], m_dictCore, m_dictCore);

    ......

}

4. For now we set POS tagging aside and focus on bigram segmentation, because it is the first of the two key steps of segmentation.
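To see why a bigram model helps, here is a small sketch that scores candidate segmentations with bigram costs. BiSegment() receives a smoothing parameter and a bigram dictionary, but this article does not show how they are combined, so the formula below is a generic linear interpolation of unigram and bigram probabilities and is not taken from the ICTCLAS source. The struct BigramModel, its fields, and the "<BOS>" start marker are all hypothetical.

```cpp
#include <cassert>
#include <cmath>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical bigram scoring model (generic linear interpolation).
struct BigramModel {
    std::map<std::string, double> unigram;                         // P(word)
    std::map<std::pair<std::string, std::string>, double> bigram;  // P(cur | prev)
    double smoothing;  // weight of the unigram part, 0 < smoothing < 1
    double floorProb;  // small probability for words missing from unigram

    // Cost of moving from prev to cur: -log of the interpolated probability.
    double Cost(const std::string& prev, const std::string& cur) const {
        assert(smoothing > 0.0 && smoothing < 1.0 && floorProb > 0.0);
        double pUni = floorProb;
        auto u = unigram.find(cur);
        if (u != unigram.end()) pUni = u->second;
        double pBi = 0.0;
        auto b = bigram.find(std::make_pair(prev, cur));
        if (b != bigram.end()) pBi = b->second;
        return -std::log(smoothing * pUni + (1.0 - smoothing) * pBi);
    }

    // Total cost of one candidate segmentation; lower is better.
    double PathCost(const std::vector<std::string>& words) const {
        double total = 0.0;
        std::string prev = "<BOS>";  // hypothetical sentence-start marker
        for (const std::string& w : words) {
            total += Cost(prev, w);
            prev = w;
        }
        return total;
    }
};
```

Given several candidates from rough segmentation, the one with the lowest PathCost would be preferred. The unigram term keeps the probability above zero when a word pair was never seen, which is the usual reason for a smoothing parameter.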

Reference articles:

1. "Chinese Lexical Analysis Based on Cascaded Hidden Markov Models", Liu Qun, Zhang Huaping, et al.

2. "A Rough Segmentation Model for Chinese Words Based on the N-Shortest-Paths Method", Zhang Huaping, Liu Qun
