Write a simple Chinese word segmentation program

Last Update:2018-07-17 Source: Internet

Author: User

A few months ago I found a Chinese word list on the internet (a few hundred KB) and decided to write a word segmentation program with it. I have not studied Chinese word segmentation, so what follows is based purely on my own ideas. If any experts are reading, I would welcome your advice.

1. The dictionary

The dictionary has about 50,000 words (you can find it, or a similar one, by searching Google). An excerpt follows:

Region 82
Important 81
Xinhua News Agency 80
Technology 80
Meeting 80
Himself 79
Cadre 78
Staff 78
Mass 77
No 77
Today 76
Comrade 76
Department 75
Strengthen 75
Organization 75

The first column is the word and the second column is its weight. The segmentation algorithm I have written so far does not use the weight.
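As an illustration of the two-column format above, here is a minimal sketch of reading such lines into (word, weight) pairs. The class name DictionaryLoader and the assumption that the columns are separated by spaces or tabs are mine; the source does not describe the file format in more detail.

```csharp
using System;
using System.Collections.Generic;

public static class DictionaryLoader
{
    // Parses lines of the form "<word> <weight>" into (word, weight) pairs.
    // Assumption: the weight is the last whitespace-separated token, so a
    // word that itself contains spaces is still read correctly.
    public static List<KeyValuePair<string, int>> Parse(IEnumerable<string> lines)
    {
        var entries = new List<KeyValuePair<string, int>>();
        foreach (string raw in lines)
        {
            string line = raw.Trim();
            if (line.Length == 0) continue;            // skip blank lines

            int split = line.LastIndexOfAny(new[] { ' ', '\t' });
            if (split < 0) continue;                   // no weight column: skip

            string word = line.Substring(0, split).Trim();
            int weight;
            if (word.Length > 0 && int.TryParse(line.Substring(split + 1), out weight))
            {
                entries.Add(new KeyValuePair<string, int>(word, weight));
            }
        }
        return entries;
    }
}
```

Calling DictionaryLoader.Parse(File.ReadLines(path)) would load a whole file. Lines without a numeric last column are skipped rather than reported, which is a simplification.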

2. Design approach

Brief description of the algorithm:

Scan the string s from front to back and, at each character, look up the longest match in the dictionary. For example, suppose s = "我是中华人民共和国公民" ("I am a citizen of the People's Republic of China") and the dictionary contains "中华人民共和国" (People's Republic of China), "中华" (China), "公民" (citizen), "人民" (people), "共和国" (republic), and so on. When the scan reaches the character "中", start from it and take the next 1, 2, 3, ... characters in turn ("中", "中华", "中华人", "中华人民", "中华人民共", "中华人民共和", "中华人民共和国", "中华人民共和国公"). The longest of these strings found in the dictionary is "中华人民共和国", so the text is cut there and the scanner advances to the character "公".
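The scan described above can be sketched roughly as follows. This is not the author's code: the class MaxMatchSegmenter, the maxWordLength parameter and the use of a plain set of words are assumptions made for illustration, and characters that start no dictionary word are simply emitted one at a time.

```csharp
using System;
using System.Collections.Generic;

public static class MaxMatchSegmenter
{
    // At each position, grows the candidate one character at a time and
    // remembers the longest candidate found in the word set, then cuts there.
    public static List<string> Segment(string text, ISet<string> words, int maxWordLength)
    {
        var result = new List<string>();
        int pos = 0;
        while (pos < text.Length)
        {
            int limit = Math.Min(maxWordLength, text.Length - pos);
            int matched = 1;                           // fallback: a single character
            for (int len = 1; len <= limit; len++)
            {
                if (words.Contains(text.Substring(pos, len)))
                {
                    matched = len;                     // keep the longest match so far
                }
            }
            result.Add(text.Substring(pos, matched));
            pos += matched;                            // advance past the cut
        }
        return result;
    }
}
```

With the example sentence and the five dictionary words listed above, and maxWordLength set to 7, this yields the tokens "我", "是", "中华人民共和国", "公民". The sketch indexes by UTF-16 code unit, which is adequate for common Chinese characters but would split characters outside the Basic Multilingual Plane.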

Data structure:

The choice of data structure has a significant impact on performance. I use a Hashtable, _roottable, to hold the dictionary. Each key-value pair is (key, number of insertions). For each word of n characters, the substrings made of its characters 1, 1~2, 1~3, ..., 1~n are each inserted into _roottable as a key. If the same key is inserted again, its value is incremented.
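A rough sketch of such a prefix table is shown below. It swaps the Hashtable for a generic Dictionary<string, int>, and the names PrefixTable, Build and LongestKnownPrefix are hypothetical. The second method shows one plausible use of the table, stopping the scan as soon as a candidate is not a prefix of any word; whether the original program uses it this way is an assumption.

```csharp
using System;
using System.Collections.Generic;

public static class PrefixTable
{
    // For every word, inserts each of its prefixes (characters 1, 1~2, ...,
    // 1~n) as a key; the value counts how many times that key was inserted.
    public static Dictionary<string, int> Build(IEnumerable<string> words)
    {
        var table = new Dictionary<string, int>();
        foreach (string word in words)
        {
            for (int len = 1; len <= word.Length; len++)
            {
                string prefix = word.Substring(0, len);
                int count;
                table.TryGetValue(prefix, out count);  // count stays 0 if the key is absent
                table[prefix] = count + 1;
            }
        }
        return table;
    }

    // Length of the longest run starting at pos whose every prefix is a key
    // in the table; the scan stops at the first prefix that is missing.
    public static int LongestKnownPrefix(string text, int pos, Dictionary<string, int> table)
    {
        int len = 0;
        while (pos + len < text.Length && table.ContainsKey(text.Substring(pos, len + 1)))
        {
            len++;
        }
        return len;
    }
}
```

Note that a key in this table is only known to be the start of some word, not necessarily a complete word ("中华人" would be a key if "中华人民共和国" is in the dictionary), so a segmenter built on it still needs some way to tell full words from bare prefixes.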

3. The program

The code is as follows. (It carries weights, insertion counts and other elements that the current algorithm does not use; they could be used to write a more effective segmentation algorithm.)

ChineseWordUnit.cs // struct: a (word, weight) pair

    public struct ChineseWordUnit
    {
        private string _word;
        private int _power;

        /// <summary>
        /// The Chinese word held by this word unit.
        /// </summary>
        public string Word
        {
            get
            {
                return _word;
            }
        }

        /// <summary>
        /// The weight of the Chinese word.
        /// </summary>
        public int Power
        {
            get
            {
                return _power;
            }
        }

        /// <summary>
        /// Initializes the struct.
        /// </summary>
        /// <param name="word">The Chinese word</param>
        /// <param name="power">The weight of the word</param>
        public ChineseWordUnit(string word, int power)
        {
            this._word = word;
            this._power = power;
        }
    }

ChineseWordsHashCountSet.cs // word store

    /// <summary>
    /// A dictionary class that records how many times a string appears at the start of words in the Chinese dictionary. For example, if the string "中" appears at the start of "中华", one occurrence is recorded in the dictionary.
    /// </summary>
    public class ChineseWordsHashCountSet

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
