Write a simple Chinese word segmentation program

Source: Internet
Author: User
A few months ago I found a Chinese word list on the internet (a few hundred KB) and decided to write a word segmentation program. I have not studied Chinese word segmentation, so I wrote this from my own intuition. If any experts are reading, I would welcome your advice.

First, the dictionary

The dictionary has about 50,000 words (you can find it with Google; any similar word list will do). An excerpt:

Region 82
Important 81
Xinhua News Agency 80
Technology 80
Meeting 80
Himself 79
Cadre 78
Staff 78
Mass 77
No 77
Today 76
Comrade 76
Department 75
Strengthen 75
Organization 75

The first column is the word and the second column is its weight. The segmentation algorithm I wrote does not currently use the weight.
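As an illustration of the two-column format, here is a minimal sketch of reading such lines into (word, weight) pairs. The class name WordListParser is hypothetical, and the sketch assumes the weight is the last whitespace-separated token on each line; the article does not describe the actual file layout.

```csharp
using System;
using System.Collections.Generic;

public static class WordListParser
{
    // Parses lines of the form "<word> <weight>" into (word, weight) pairs.
    // Assumption: the weight is the last whitespace-separated token.
    // Lines that do not fit this shape are skipped.
    public static List<KeyValuePair<string, int>> Parse(IEnumerable<string> lines)
    {
        var entries = new List<KeyValuePair<string, int>>();
        foreach (string line in lines)
        {
            string trimmed = line.Trim();
            int split = trimmed.LastIndexOfAny(new[] { ' ', '\t' });
            if (split <= 0)
            {
                continue; // empty line, or no separator between word and weight
            }

            string word = trimmed.Substring(0, split).Trim();
            int weight;
            if (!int.TryParse(trimmed.Substring(split + 1), out weight))
            {
                continue; // last token is not an integer weight
            }

            entries.Add(new KeyValuePair<string, int>(word, weight));
        }
        return entries;
    }
}
```

Splitting on the last separator rather than the first keeps multi-token entries such as "Xinhua News Agency 80" intact.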

Second, design ideas

Brief description of the algorithm:

For a string s, scan from front to back and, for each character reached, look for the longest match in the dictionary. For example, suppose s = "我是中华人民共和国公民" ("I am a citizen of the People's Republic of China") and the dictionary contains "中华人民共和国" (People's Republic of China), "中华" (China), "公民" (citizen), "人民" (people), "共和国" (republic), and so on. When the scanner reaches the character "中", it starts from that character and takes the next 1, 2, 3, ... characters in turn ("中", "中华", "中华人", "中华人民", "中华人民共", "中华人民共和", "中华人民共和国", "中华人民共和国公"). The longest of these strings found in the dictionary is "中华人民共和国", so the text is cut there and the scanner advances to the character "公".
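The scan described above can be sketched as follows. This is a minimal sketch, not the program from this article: the class name ForwardMaxMatch is hypothetical, it looks candidates up in a plain HashSet of whole words, and it emits a character that starts no dictionary word as a one-character token, which is an assumption the description does not cover.

```csharp
using System;
using System.Collections.Generic;

public static class ForwardMaxMatch
{
    // At each position, grows the candidate one character at a time and
    // remembers the longest candidate that is a dictionary word.
    public static List<string> Segment(string text, HashSet<string> dictionary)
    {
        // No candidate longer than the longest dictionary word can match.
        int maxLen = 0;
        foreach (string word in dictionary)
        {
            maxLen = Math.Max(maxLen, word.Length);
        }

        var tokens = new List<string>();
        int pos = 0;
        while (pos < text.Length)
        {
            int best = 1; // assumption: fall back to a single character
            int limit = Math.Min(maxLen, text.Length - pos);
            for (int len = 1; len <= limit; len++)
            {
                if (dictionary.Contains(text.Substring(pos, len)))
                {
                    best = len; // longest match seen so far
                }
            }
            tokens.Add(text.Substring(pos, best));
            pos += best; // advance the scanner past the cut
        }
        return tokens;
    }
}
```

Assuming the example sentence is 我是中华人民共和国公民 and the dictionary holds 中华人民共和国, 中华, 公民, 人民 and 共和国, Segment returns the tokens 我, 是, 中华人民共和国, 公民.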

Data structure:

The choice of data structure has a significant impact on performance. I use a Hashtable, _roottable, to record the dictionary. Each entry is a (key, insert count) pair. For each dictionary word of n characters, the prefixes made of its characters 1, 1~2, 1~3, ..., 1~n are inserted into _roottable as keys. If the same key is inserted again, its value is incremented.
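A minimal sketch of building such a prefix table is shown here. It uses a generic Dictionary<string, int> in place of Hashtable, and the class name PrefixTable is hypothetical; it is meant only to illustrate the (key, insert count) idea.

```csharp
using System;
using System.Collections.Generic;

public static class PrefixTable
{
    // Builds a map from key to insert count, where the keys are all
    // prefixes (length 1..n) of every dictionary word.
    public static Dictionary<string, int> Build(IEnumerable<string> words)
    {
        var table = new Dictionary<string, int>();
        foreach (string word in words)
        {
            for (int len = 1; len <= word.Length; len++)
            {
                string prefix = word.Substring(0, len);
                int count;
                table.TryGetValue(prefix, out count); // count stays 0 when the key is missing
                table[prefix] = count + 1;
            }
        }
        return table;
    }
}
```

One likely benefit of this layout is that a scanner can stop extending a candidate as soon as it is not a key in the table, because no dictionary word starts with it. Note that the table alone does not say whether a key is a complete word or only a prefix, so this sketch would still need a separate set of whole words for that check.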

Third, the program

The code is as follows (the program includes weights, insert counts, and other elements that the current algorithm does not use; they could support a more effective segmentation algorithm):

ChineseWordUnit.cs: a struct holding a (word, weight) pair

    public struct ChineseWordUnit
    {
        private string _word;
        private int _power;

        /// <summary>
        /// The Chinese word that this unit represents.
        /// </summary>
        public string Word
        {
            get
            {
                return _word;
            }
        }

        /// <summary>
        /// The weight of the Chinese word.
        /// </summary>
        public int Power
        {
            get
            {
                return _power;
            }
        }

        /// <summary>
        /// Initializes the struct.
        /// </summary>
        /// <param name="word">The Chinese word.</param>
        /// <param name="power">The weight of the word.</param>
        public ChineseWordUnit(string word, int power)
        {
            this._word = word;
            this._power = power;
        }
    }

ChineseWordsHashCountSet.cs: the word set

    /// <summary>
    /// A dictionary class that records how many times a string occurs at the front of words in the Chinese dictionary. For example, if the string "中" occurs at the front of "中国" (China), one occurrence is recorded in the dictionary.
    /// </summary>
    public class ChineseWordsHashCountSet
