A few months ago I found a Chinese dictionary file on the internet (a few hundred KB) and decided to write a word segmentation program. I have not studied Chinese word segmentation, so I wrote it based on my own ideas. If any experts are reading, advice is welcome.
One, the dictionary
The dictionary has about 50,000 words (it can be found with Google, and a similar dictionary would work as well). An excerpt follows:
Region 82
Important 81
Xinhua News Agency 80
Technology 80
Meeting 80
Himself 79
Cadre 78
Staff 78
Mass 77
No 77
Today 76
Comrade 76
Department 75
Strengthen 75
Organization 75
The first column is the word and the second column is its weight. The segmentation algorithm I have written so far does not use the weights.
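The two-column layout can be read into a word-to-weight map with a few lines of C#. This is only a sketch: the class name `DictionaryLoader`, the method `Parse`, and the assumption that the columns are separated by spaces or tabs are mine, not taken from the original program.

```csharp
using System;
using System.Collections.Generic;

public static class DictionaryLoader
{
    // Parses lines of the form "<word> <weight>" into a word -> weight map.
    // Assumption: the weight is the last whitespace-separated token on the line.
    // Lines that do not fit this shape are skipped.
    public static Dictionary<string, int> Parse(IEnumerable<string> lines)
    {
        var weights = new Dictionary<string, int>();
        foreach (string line in lines)
        {
            string trimmed = line.Trim();
            int split = trimmed.LastIndexOfAny(new[] { ' ', '\t' });
            if (split <= 0) continue; // no separator, so no weight column

            string word = trimmed.Substring(0, split).TrimEnd();
            int weight;
            if (!int.TryParse(trimmed.Substring(split + 1), out weight)) continue;

            weights[word] = weight; // a repeated word keeps its last weight
        }
        return weights;
    }
}
```

Passing the result of `System.IO.File.ReadLines(path)` to `Parse` would load a whole file lazily, one line at a time.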
Two, design ideas
Brief description of the algorithm:
For a string s, scan from front to back. At each character, look for the longest match in the dictionary that starts there. For example, suppose s = "我是中华人民共和国公民" ("I am a citizen of the People's Republic of China") and the dictionary contains "中华人民共和国" (People's Republic of China), "中华", "公民" (citizen), "人民" (people), "共和国" (republic), and so on. When the scanner reaches the character "中", it starts from that character and takes the next 1, 2, 3, ... characters in turn: "中", "中华", "中华人", "中华人民", "中华人民共", "中华人民共和", "中华人民共和国", "中华人民共和国公". The longest of these that is in the dictionary is "中华人民共和国", so it is cut off as one word and the scanner advances to the character "公".
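The scan described above is usually called forward maximum matching. A minimal sketch follows; it is not the author's code. The names `ForwardMaxMatch` and `Segment` are hypothetical, the dictionary is a plain `HashSet<string>`, and a character that starts no dictionary word is emitted on its own, which the description above does not specify.

```csharp
using System;
using System.Collections.Generic;

public static class ForwardMaxMatch
{
    // At each position, grows the candidate one character at a time and
    // remembers the longest candidate found in the dictionary.
    // Assumption: an unmatched character becomes a one-character token.
    public static List<string> Segment(string text, ISet<string> words)
    {
        // No candidate longer than the longest dictionary word can match.
        int maxLen = 0;
        foreach (string w in words) maxLen = Math.Max(maxLen, w.Length);

        var tokens = new List<string>();
        int pos = 0;
        while (pos < text.Length)
        {
            int best = 1; // fall back to a single character
            int limit = Math.Min(maxLen, text.Length - pos);
            for (int len = 2; len <= limit; len++)
            {
                if (words.Contains(text.Substring(pos, len)))
                {
                    best = len; // keep the longest hit seen so far
                }
            }
            tokens.Add(text.Substring(pos, best));
            pos += best; // continue right after the matched word
        }
        return tokens;
    }
}
```

Bounding the candidate length by the longest dictionary word keeps the inner loop short. A prefix table, as described under the data structure, is another way to stop extending early.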
Data structure:
The choice of data structure has a significant impact on performance. I use a Hashtable, _roottable, to hold the dictionary. Each entry is a (key, number of insertions) pair. For each word of n characters, its prefixes (characters 1, 1~2, 1~3, ..., 1~n) are each inserted into _roottable as a key. If the same key is inserted again, its value is incremented.
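The prefix table can be sketched as follows. This is an illustration under assumptions, not the original implementation: it uses a generic `Dictionary<string, int>` instead of the Hashtable, the names `PrefixTable` and `Build` are hypothetical, and the count is assumed to start at 1 on the first insertion.

```csharp
using System;
using System.Collections.Generic;

public static class PrefixTable
{
    // For each word, inserts every prefix (first 1, 2, ..., n characters)
    // as a key and counts how many times each key has been inserted.
    public static Dictionary<string, int> Build(IEnumerable<string> words)
    {
        var table = new Dictionary<string, int>();
        foreach (string word in words)
        {
            for (int len = 1; len <= word.Length; len++)
            {
                string prefix = word.Substring(0, len);
                int count;
                table.TryGetValue(prefix, out count); // count is 0 if the key is absent
                table[prefix] = count + 1;
            }
        }
        return table;
    }
}
```

With such a table, a scanner can stop extending a candidate as soon as the candidate is not a key, because no dictionary word starts with it. The table alone does not say whether a key is a complete word or only a prefix of one, so a real segmenter would need to track that separately.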
Three, the program
The code is as follows. (It includes weights, insertion counts, and other elements that the current algorithm does not use. They could support a more effective segmentation algorithm.)
ChineseWordUnit.cs // struct: a (word, weight) pair
public struct ChineseWordUnit
{
    private string _word;
    private int _power;

    /// <summary>
    /// The Chinese word held by this word unit.
    /// </summary>
    public string Word
    {
        get
        {
            return _word;
        }
    }

    /// <summary>
    /// The weight of the Chinese word.
    /// </summary>
    public int Power
    {
        get
        {
            return _power;
        }
    }

    /// <summary>
    /// Initializes the structure.
    /// </summary>
    /// <param name="word">The Chinese word</param>
    /// <param name="power">The weight of the word</param>
    public ChineseWordUnit(string word, int power)
    {
        this._word = word;
        this._power = power;
    }
}
ChineseWordsHashCountSet.cs // word collection
/// <summary>
/// A dictionary class that records how many times a string occurs as a prefix of words in the Chinese dictionary. For example, the string "中" is a prefix of "中华", so one occurrence is recorded for it.
/// </summary>
public class ChineseWordsHashCountSet