ICTCLAS is a widely acclaimed Chinese word segmentation system developed by Zhang Huaping and Liu Qun of the Institute of Computing Technology, Chinese Academy of Sciences. Commendably, the free version is open source, which gives many beginners valuable learning material.
However, the source code comes with no supporting documentation, which can be an obstacle, especially for readers unfamiliar with C/C++. My main development languages have been Java and VB. I studied C/C++ in college but never used it after starting work, so I have forgotten much of the syntax. Still, the fundamentals of programming languages are the same, and Java itself grew out of C/C++ and shares many similarities with it, so my reading of the main syntax in the source code should be correct.
Although ICTCLAS does not come with complete documentation, we can refer to the papers published by Zhang Huaping and Liu Qun to work out the main ideas.
The core of the system is a CHMM (cascaded hidden Markov model) that performs the segmentation. Layering improves segmentation accuracy while keeping segmentation efficient. There are five layers, as shown in the figure:
Basic idea: first perform atomic segmentation; on that basis, perform N-shortest-path rough segmentation to find the N best candidate segmentations; generate the bigram segmentation table and, from it, the segmentation result; finally, perform part-of-speech tagging, which completes the main segmentation steps.
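To make the first step concrete, here is a minimal sketch of atomic segmentation. It is not the ICTCLAS implementation: the function names AtomSegment and Utf8Length are hypothetical, the input is assumed to be UTF-8 (not necessarily the encoding ICTCLAS itself uses), and the grouping rule (each non-ASCII character is one atom, a run of ASCII letters and digits is one atom, any other ASCII character is its own atom) is an assumption chosen for illustration.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: byte length of the UTF-8 sequence that starts with lead byte c.
static std::size_t Utf8Length(unsigned char c) {
    if (c < 0x80) return 1;
    if ((c >> 5) == 0x6) return 2;
    if ((c >> 4) == 0xE) return 3;
    if ((c >> 3) == 0x1E) return 4;
    return 1; // invalid lead byte: fall back to a single byte
}

// Hypothetical sketch: split a UTF-8 string into atoms.
std::vector<std::string> AtomSegment(const std::string& s) {
    std::vector<std::string> atoms;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (c < 0x80 && std::isalnum(c)) {
            // Group a run of ASCII letters and digits into one atom.
            std::size_t j = i;
            while (j < s.size()) {
                unsigned char d = static_cast<unsigned char>(s[j]);
                if (d >= 0x80 || !std::isalnum(d)) break;
                ++j;
            }
            atoms.push_back(s.substr(i, j - i));
            i = j;
        } else {
            // One multi-byte character (or one ASCII symbol) is one atom.
            std::size_t len = Utf8Length(c);
            if (i + len > s.size()) len = s.size() - i; // truncated input
            atoms.push_back(s.substr(i, len));
            i += len;
        }
    }
    return atoms;
}
```

Under these assumptions, a two-character Chinese string followed by "2008" yields three atoms: one per Chinese character and one for the whole number. The later layers then combine atoms into candidate words.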
The following is a study of the main content of the source code:
1. The ICTCLAS program starts execution by calling CICTCLAS_WinDlg::OnBtnRun(). As the code shows, it segments the source string. The dictionaries are loaded before segmentation: when the m_ICTCLAS object is created, its constructor loads them. For an analysis of the dictionary structure, see section 2.
void CICTCLAS_WinDlg::OnBtnRun()
{
    ......
    // Perform word segmentation and part-of-speech tagging here
    if (!m_ICTCLAS.ParagraphProcessing((char *)(LPCTSTR)m_sSource, sResult))
        m_sResult.Format("error: Program initialization exception!");
    else
        m_sResult.Format("%s", sResult); // output the final segmentation result
    ......
}
2. OnBtnRun() calls bool CResult::ParagraphProcessing(char *sParagraph, char *sResult), which carries out the entire processing pipeline, including word segmentation and part-of-speech tagging. The first parameter is the source string, and the second receives the segmented result. The whole segmentation process is completed within these two methods. Following the framework shown in the figure above, the steps below describe how the other methods are called to complete segmentation. For simplicity, we will not analyze unknown (out-of-vocabulary) words for now.
// Paragraph segmentation and POS tagging
bool CResult::ParagraphProcessing(char *sParagraph, char *sResult)
{
    ........
    Processing(sSentence, 1); // Process and output the result of the current sentence.
    Output(m_pResult[0], sSentenceResult, bFirstIgnore); // Output to the intermediate result
    .......
}
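The excerpt passes a single sentence (sSentence) to Processing(), which suggests that the paragraph is split into sentences and each one is handled in turn. The following is only a sketch of that pattern, not the ICTCLAS code: the names ProcessParagraph and SentenceProcessor are hypothetical, and the ASCII delimiter set is an assumption (the real system would also need to handle Chinese punctuation).

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical callback type: takes one sentence, returns its segmented form.
using SentenceProcessor = std::function<std::string(const std::string&)>;

// Hypothetical sketch: split a paragraph at sentence delimiters, process each
// sentence separately, and concatenate the per-sentence results.
std::string ProcessParagraph(const std::string& paragraph,
                             const SentenceProcessor& processSentence) {
    const std::string kDelimiters = ".!?;\n"; // assumed delimiter set
    std::string result;
    std::string sentence;
    for (char c : paragraph) {
        sentence += c;
        if (kDelimiters.find(c) != std::string::npos) {
            // A delimiter closes the current sentence.
            result += processSentence(sentence);
            sentence.clear();
        }
    }
    if (!sentence.empty()) {
        // Trailing text without a closing delimiter is still processed.
        result += processSentence(sentence);
    }
    return result;
}
```

Processing sentence by sentence keeps each segmentation problem small, since the cost of searching candidate segmentations grows with the length of the input.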
3. The main word segmentation process occurs in the Processing () method. Next we will analyze it further.
bool CResult::Processing(char *sSentence, unsigned int nCount)
{
    ......
    // Perform bigram word segmentation
    m_Seg.BiSegment(sSentence, m_dSmoothingPara, m_dictCore, m_dictBigram, nCount);
    ......
    // Perform part-of-speech tagging here
    m_POSTagger.POSTagging(m_Seg.m_pWordSeg[nIndex], m_dictCore, m_dictCore);
    ......
}
4. For now, let's set part-of-speech tagging aside and focus on bigram segmentation, because it is the first of the two key steps in word segmentation.
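As a rough illustration of what the BiSegment step has to do, the sketch below finds the single cheapest path through a lattice of candidate words, where each step is scored by a cost that depends on the previous word and the current word. This is a simplification under stated assumptions, not the ICTCLAS algorithm: the real system keeps the N best paths and derives its costs from m_dictBigram with smoothing, while here the names Word, BigramCost, and BestBigramPath, the "<s>" start marker, and the cost function itself are all hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <limits>
#include <string>
#include <vector>

// Hypothetical candidate word covering atoms [begin, end).
struct Word {
    std::size_t begin;
    std::size_t end;
    std::string text;
};

// Hypothetical transition cost, e.g. a negative log of a smoothed bigram probability.
using BigramCost = std::function<double(const std::string&, const std::string&)>;

// Sketch: return the lowest-cost word sequence that covers atoms [0, nAtoms).
std::vector<std::string> BestBigramPath(std::vector<Word> words, std::size_t nAtoms,
                                        const BigramCost& cost) {
    // Sort by start position so every predecessor precedes its successors.
    std::sort(words.begin(), words.end(),
              [](const Word& a, const Word& b) { return a.begin < b.begin; });
    const double kInf = std::numeric_limits<double>::infinity();
    std::vector<double> best(words.size(), kInf);
    std::vector<int> prev(words.size(), -1);
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (words[i].begin == 0) {
            best[i] = cost("<s>", words[i].text); // transition from sentence start
            continue;
        }
        for (std::size_t j = 0; j < i; ++j) {
            if (words[j].end != words[i].begin || best[j] == kInf) continue;
            double c = best[j] + cost(words[j].text, words[i].text);
            if (c < best[i]) {
                best[i] = c;
                prev[i] = static_cast<int>(j);
            }
        }
    }
    // Pick the cheapest word that ends exactly at the end of the sentence.
    int last = -1;
    double bestTotal = kInf;
    for (std::size_t i = 0; i < words.size(); ++i) {
        if (words[i].end == nAtoms && best[i] < bestTotal) {
            bestTotal = best[i];
            last = static_cast<int>(i);
        }
    }
    // Walk the back-pointers to recover the path, then reverse it.
    std::vector<std::string> path;
    for (int k = last; k != -1; k = prev[k]) path.push_back(words[k].text);
    std::reverse(path.begin(), path.end());
    return path;
}
```

For example, with atoms a, b, c and candidate words a, b, c, ab, and bc, a cost function that favors the pair (ab, c) makes the function return the two words ab and c. If no sequence of candidates covers the whole input, the result is empty.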
References:
1. "Chinese Lexical Analysis Based on the Cascaded Hidden Markov Model", Liu Qun, Zhang Huaping, et al.
2. "N-Shortest Path-Based Chinese Word Segmentation Model", Zhang Huaping, Liu Qun