C# Chinese Word Segmentation

Source: Internet
Author: User

Chinese word segmentation technology is no longer unfamiliar. On first contact, the system you encounter most often on the Internet is ICTCLAS, the Chinese automatic word segmentation system first developed at the Chinese Academy of Sciences, together with its source code; versions in C#, C++, and VB can be downloaded. Whether or not you can follow the source code, it at least shows that this technology is quite mature in China.

This article briefly introduces the main word segmentation algorithms and compares them. Finally, we use one of them to implement a small Chinese word segmentation program.

① Chinese Word Segmentation Algorithm

Chinese word segmentation techniques fall into three categories: matching-based, statistical-based, and comprehension-based (understanding-based) word segmentation.

A. Matching-Based Word Segmentation:

Because this method is highly automated, it is often called mechanical word segmentation. It uses a given strategy to match the text being analyzed against the entries in a machine dictionary; when a match succeeds, a word is identified. Several common mechanical word segmentation methods are forward maximum matching (from left to right), reverse maximum matching (from right to left), and minimum segmentation (which minimizes the number of words each sentence is cut into).
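The two maximum matching directions can be sketched as follows. This is a minimal illustration rather than the code of any particular system: the class name MaxMatchSegmenter, the HashSet dictionary, and the maxLen parameter (the longest word length to try) are all assumptions made for the example.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not from the original article: dictionary-based
// forward and reverse maximum matching over a HashSet of known words.
public static class MaxMatchSegmenter
{
    // Forward maximum matching: scan left to right, always taking the
    // longest dictionary word that starts at the current position.
    public static List<string> Forward(string text, HashSet<string> dict, int maxLen)
    {
        var result = new List<string>();
        int pos = 0;
        while (pos < text.Length)
        {
            int len = Math.Min(maxLen, text.Length - pos);
            // Shrink the candidate until it is in the dictionary or is one character.
            while (len > 1 && !dict.Contains(text.Substring(pos, len)))
            {
                len--;
            }
            result.Add(text.Substring(pos, len));
            pos += len;
        }
        return result;
    }

    // Reverse maximum matching: scan right to left, always taking the
    // longest dictionary word that ends at the current position.
    public static List<string> Reverse(string text, HashSet<string> dict, int maxLen)
    {
        var result = new List<string>();
        int end = text.Length;
        while (end > 0)
        {
            int len = Math.Min(maxLen, end);
            while (len > 1 && !dict.Contains(text.Substring(end - len, len)))
            {
                len--;
            }
            // Words are found from the end, so insert each at the front.
            result.Insert(0, text.Substring(end - len, len));
            end -= len;
        }
        return result;
    }
}
```

With a dictionary containing 研究, 研究生, 生命, 的, and 起源 and maxLen set to 3, the text 研究生命的起源 is cut as 研究生 / 命 / 的 / 起源 by the forward scan and as 研究 / 生命 / 的 / 起源 by the reverse scan, which shows how the scanning direction alone can change the result.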

B. Comprehension-Based Word Segmentation:

The basic idea of this method is to perform syntactic and semantic analysis during word segmentation and to use the syntactic and semantic information to resolve ambiguity, thereby improving accuracy. In other words, the computer simulates the way a person understands a sentence in order to identify words. Such a system usually has three parts: a word segmentation subsystem, a syntactic and semantic subsystem, and a general control part. Under the coordination of the general control part, the word segmentation subsystem obtains syntactic and semantic information about words and sentences and uses it to resolve segmentation ambiguity. This method requires a large amount of linguistic knowledge and information. Because Chinese linguistic knowledge is so complex, it is very difficult to organize it into a form that machines can read directly, so comprehension-based word segmentation systems have not yet reached practical use.

C. Statistical-Based Word Segmentation:

The theoretical basis of this method is that a word is a stable combination of characters, so the more often adjacent characters appear together in context, the more likely they are to form a word. In other words, the probability that characters co-occur adjacently is a good indicator of how credible a word is. By counting how often each combination of adjacent characters occurs in a corpus, we can calculate their co-occurrence (mutual) information, which reflects how tightly the Chinese characters are bound together. When this closeness exceeds a threshold, the character group is considered a possible word. The disadvantage of this method is that it often extracts character groups that co-occur frequently but are not words, such as "some" and "yours". In addition, its recognition accuracy for common words is poor, and its time and space overhead is large.
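The idea of scoring adjacent characters and applying a threshold can be sketched as below. The article does not give a formula, so pointwise mutual information, log(P(xy) / (P(x) P(y))), is used here as one common choice; the class and method names are hypothetical, and each UTF-16 char is treated as one character for simplicity.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: scores adjacent character pairs by pointwise mutual
// information (PMI) estimated from a small corpus of sentences.
public static class CooccurrenceScorer
{
    // Returns PMI = log(P(xy) / (P(x) * P(y))) for every adjacent pair seen.
    public static Dictionary<string, double> PairScores(IEnumerable<string> corpus)
    {
        var charCount = new Dictionary<char, int>();
        var pairCount = new Dictionary<string, int>();
        int totalChars = 0, totalPairs = 0;

        foreach (string line in corpus)
        {
            for (int i = 0; i < line.Length; i++)
            {
                charCount[line[i]] = charCount.TryGetValue(line[i], out int c) ? c + 1 : 1;
                totalChars++;
                if (i + 1 < line.Length)
                {
                    string pair = line.Substring(i, 2);
                    pairCount[pair] = pairCount.TryGetValue(pair, out int p) ? p + 1 : 1;
                    totalPairs++;
                }
            }
        }

        var scores = new Dictionary<string, double>();
        foreach (var kv in pairCount)
        {
            double pPair = (double)kv.Value / totalPairs;
            double pFirst = (double)charCount[kv.Key[0]] / totalChars;
            double pSecond = (double)charCount[kv.Key[1]] / totalChars;
            scores[kv.Key] = Math.Log(pPair / (pFirst * pSecond));
        }
        return scores;
    }

    // Pairs whose score exceeds the threshold are treated as candidate words.
    public static List<string> Candidates(Dictionary<string, double> scores, double threshold)
    {
        var result = new List<string>();
        foreach (var kv in scores)
        {
            if (kv.Value > threshold) result.Add(kv.Key);
        }
        return result;
    }
}
```

A realistic corpus would be far larger than anything shown here, which is where the time and space overhead mentioned above comes from: every character and every adjacent pair must be counted and stored.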

② Comparison of Word Segmentation Algorithms

Comparison and Analysis of common algorithms

A. Matching-Based Word Segmentation Algorithms:

Because the dictionary is the only basis of this method, whether the words in the document being segmented are in the dictionary plays a decisive role in the final result. In general, registered (in-dictionary) words are identified well, whereas unregistered words are affected by many factors, chiefly the matching mode, and cannot be identified effectively. This method therefore has a low recognition rate for new words. In addition, because segmentation is only mechanical string matching, the context cannot be used to resolve ambiguous words. To improve accuracy with this algorithm, the system dictionary must be kept up to date: a complete dictionary is the key factor in the success of matching-based word segmentation.

B. Statistical-Based Word Segmentation Algorithms:

This method is based on how often Chinese character strings appear in documents and uses word-frequency statistics to identify words, so it has a high recognition rate for new words. However, in Chinese the same meaning can often be expressed in several forms, synonyms are numerous, and the rate at which any one word recurs is relatively low. A computer that lacks understanding treats these forms as completely different strings, and none of them may recur often enough to be recognized as an independent word. In addition, the algorithm identifies words successfully only when the corpus used for statistics is large enough for each word to recur sufficiently often. The time and space complexity of the algorithm is therefore high, and it is relatively difficult to implement.

C. Practical Application and Analysis

For the reasons above, most existing Chinese word segmentation systems use only matching-based algorithms. Some more advanced systems use a hybrid algorithm that is primarily matching-based and adds statistical ideas to improve the recognition of unregistered words. A hybrid algorithm generally gives better accuracy than a pure matching algorithm, especially for new words, which are often exactly the keywords users search for. Adding statistical ideas therefore improves the segmentation result considerably.
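One possible way to combine the two ideas is sketched below; it is an illustration only, not how any particular system works. It assumes a dictionary-based pass has already produced a token list, and that a table of pair scores (for example, co-occurrence scores computed from a corpus) is available. The class name HybridMerger and its parameters are hypothetical.

```csharp
using System.Collections.Generic;

// Hypothetical sketch of a matching-plus-statistics hybrid: leftover
// single-character tokens from dictionary matching are merged when a
// statistical score says the pair is likely to be a word.
public static class HybridMerger
{
    public static List<string> MergeSingles(
        List<string> tokens, Dictionary<string, double> pairScores, double threshold)
    {
        var result = new List<string>();
        int i = 0;
        while (i < tokens.Count)
        {
            // Only two adjacent single-character tokens are merge candidates.
            if (i + 1 < tokens.Count
                && tokens[i].Length == 1
                && tokens[i + 1].Length == 1
                && pairScores.TryGetValue(tokens[i] + tokens[i + 1], out double score)
                && score > threshold)
            {
                result.Add(tokens[i] + tokens[i + 1]);
                i += 2;
            }
            else
            {
                result.Add(tokens[i]);
                i++;
            }
        }
        return result;
    }
}
```

The dictionary still does most of the work here; the statistics are consulted only where matching produced isolated characters, which is typically where unregistered words end up.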

③ Implementing a Simple Word Segmentation System

We will use C# to develop a small system based on the matching-based word segmentation algorithm.

A. Read dictionary text files

The first step is to read the word library from a .txt file or a file in another format. The purpose of reading it is to build a tree from the words, in which each node holds one character and each word corresponds to a path leading down from the root.

The main idea of building the dictionary is to store the characters of each word as a chain of parent and child nodes, that is, as a tree and its subtrees. The main algorithm is as follows:

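/// <summary>
/// Builds the dictionary tree from one dictionary word.
/// </summary>
/// <param name="s">A word read from the dictionary file.</param>
public void BuildTree(string s)
{
    // Start at the children of the root and walk down one character at a time.
    List<CnTreeNode> tmpRoot = rootNode.ChildNodes;
    for (int i = 0; i < s.Length; i++)
    {
        int index = FindIndex(tmpRoot, s[i].ToString());
        if (-1 == index)
        {
            // Character not present at this level: add a node and descend into it.
            CnTreeNode treeNode = new CnTreeNode(s[i].ToString());
            tmpRoot.Add(treeNode);
            tmpRoot = tmpRoot[tmpRoot.Count - 1].ChildNodes;
        }
        else
        {
            // Character already present: follow the existing branch.
            tmpRoot = tmpRoot[index].ChildNodes;
        }
    }
}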
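The article does not show the node type, the lookup helper, or the file-reading step that the method above relies on. The sketch below fills them in under stated assumptions: the Text property, the linear-search FindIndex, the wrapper class CnDictionary, its Root property and Load method, and the one-word-per-line file layout are all guesses made for illustration.

```csharp
using System.Collections.Generic;
using System.IO;

// Assumed shape of the node type used by BuildTree; the article does not show it.
public class CnTreeNode
{
    public string Text { get; }
    public List<CnTreeNode> ChildNodes { get; } = new List<CnTreeNode>();

    public CnTreeNode(string text)
    {
        Text = text;
    }
}

// Hypothetical wrapper that owns the root node and the helper methods.
public class CnDictionary
{
    // The root holds no character; its children are the first characters of words.
    private readonly CnTreeNode rootNode = new CnTreeNode(string.Empty);

    public CnTreeNode Root => rootNode;

    // Returns the position of the child whose text equals ch, or -1 if none.
    public static int FindIndex(List<CnTreeNode> nodes, string ch)
    {
        for (int i = 0; i < nodes.Count; i++)
        {
            if (nodes[i].Text == ch) return i;
        }
        return -1;
    }

    // Same logic as the article's method: one node per character of the word.
    public void BuildTree(string s)
    {
        List<CnTreeNode> tmpRoot = rootNode.ChildNodes;
        for (int i = 0; i < s.Length; i++)
        {
            int index = FindIndex(tmpRoot, s[i].ToString());
            if (index == -1)
            {
                tmpRoot.Add(new CnTreeNode(s[i].ToString()));
                index = tmpRoot.Count - 1;
            }
            tmpRoot = tmpRoot[index].ChildNodes;
        }
    }

    // Reads one word per line (an assumed file layout) and adds each to the tree.
    public void Load(string path)
    {
        foreach (string line in File.ReadLines(path))
        {
            string word = line.Trim();
            if (word.Length > 0) BuildTree(word);
        }
    }
}
```

After adding 中国 and 中文, the root has a single child 中, which in turn has two children, 国 and 文, so words that share a prefix share nodes. A linear search keeps the helper simple; a per-node dictionary keyed by character would be faster for large word lists.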

B. Chinese Word Segmentation

The input text is split into words, and the words are displayed separately.

The main algorithm is as follows:

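/// <summary>
/// Chinese word segmentation
/// </summary>
/// <param name="s">The text to segment.</param>
/// <returns>The segments found, in order.</returns>
public string[] CnAnalyse(string s)
{
    StringBuilder sb = new StringBuilder();
    List<CnTreeNode> tmpRoot = rootNode.ChildNodes;
    int len = 0; // number of characters matched so far in the current segment
    for (int i = 0; i < s.Length; i++)
    {
        int index = FindIndex(tmpRoot, s[i].ToString());
        if (-1 == index)
        {
            if (len == 0)
            {
                // No dictionary entry starts with this character: emit it alone.
                sb.Append(s.Substring(i, 1) + "|");
            }
            else
            {
                // The current match ends here: close the segment and
                // retry this character from the root.
                sb.Append("|");
                i--;
            }

            tmpRoot = rootNode.ChildNodes;
            len = 0;
        }
        else
        {
            len++;
            sb.Append(s[i].ToString());
            tmpRoot = tmpRoot[index].ChildNodes;
        }
    }

    // '|' separates segments; drop the empty entry left by a trailing separator.
    string[] strs = sb.ToString().Split(new[] { '|' }, StringSplitOptions.RemoveEmptyEntries);
    return strs;
}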
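Because the tree nodes carry no end-of-word marker, the method above emits any path it can follow as a segment, including a mere prefix of a dictionary word: with only 中华人民 in the dictionary, the input 中华人 comes out as one segment. A hedged variant that records where complete words end is sketched below. It is not the article's code; TrieNode, TrieSegmenter, IsWord, AddWord, and Segment are names invented for this example.

```csharp
using System.Collections.Generic;

// Hypothetical variant, not the article's code: a trie whose nodes record
// whether a complete dictionary word ends there.
public class TrieNode
{
    public Dictionary<char, TrieNode> Children { get; } = new Dictionary<char, TrieNode>();
    public bool IsWord { get; set; }
}

public class TrieSegmenter
{
    private readonly TrieNode root = new TrieNode();

    public void AddWord(string word)
    {
        TrieNode node = root;
        foreach (char ch in word)
        {
            if (!node.Children.TryGetValue(ch, out TrieNode next))
            {
                next = new TrieNode();
                node.Children[ch] = next;
            }
            node = next;
        }
        node.IsWord = true; // mark the end of a complete word
    }

    // At each position, walk the trie as far as possible and keep the longest
    // prefix that is a complete word; fall back to a single character.
    public List<string> Segment(string text)
    {
        var result = new List<string>();
        int pos = 0;
        while (pos < text.Length)
        {
            TrieNode node = root;
            int best = 1; // length to emit if no dictionary word matches
            for (int i = pos; i < text.Length; i++)
            {
                if (!node.Children.TryGetValue(text[i], out TrieNode next)) break;
                node = next;
                if (node.IsWord) best = i - pos + 1;
            }
            result.Add(text.Substring(pos, best));
            pos += best;
        }
        return result;
    }
}
```

With 中国, 中国人, and 人民 added, the text 中国人民 is cut as 中国人 / 民, which is the forward maximum matching result. With only 中华人民 added, 中华人 is cut into three single characters instead of being reported as a word.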

Note: When implementing this, you can also refer to other similar algorithms.

This is only a preliminary study of word segmentation technology, and only a beginning. We will continue to optimize the current word segmentation system and accumulate more knowledge in this area.
