C# Chinese Word Segmentation [Statistics-Based Naive Bayes Algorithm]

Source: Internet
Author: User

Main ideas:
1. Have a corpus
2. Count how often each word occurs in the corpus and use these frequencies as the naive Bayes probabilities of the candidate words.
3. Example:
The corpus contains words such as China, the people, the Chinese, and the republic.
Input: Chinese people love the People's Republic of China;
Score every candidate split and take the one with the maximum score (the score is computed from the word frequencies in the corpus);
For example: Solution1: Chinese people _ all Chinese people _ Republic
Solution2: China _ people _ love the Chinese people _ Republic
Solution3: China _ people _ love _ Zhonghua _ people _ Republic

BestSegSolution = max(solutions(segSolution[i]));

4. Segmenting a Chinese string can be viewed as a recursive problem:

seg(StringIn) = firPart + seg(StringIn - firPart); // A score measures the quality of the current segmentation result.

5. Naive Bayes here means that, after segmentation, the words are treated as independent of one another; that is, the occurrence of a word does not depend on the word before it.

6. This is only a preliminary version and it is very simple. More needs to be added to improve the results; the approach is to start simple and then iterate.
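
The ideas above (frequency counts, the independence assumption, and the recursive seg formula) can be put together in a small sketch. This is not the author's NaiveBayes class: the names SegSketch, WordScore, Best, and Segment are hypothetical, add-one smoothing is an assumption made so that unseen single characters get a finite score, and only dictionary words or single characters are tried as firPart.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch only; SegSketch and its members are hypothetical names.
public static class SegSketch
{
    // Log-probability of one word. Add-one smoothing (an assumption) keeps
    // the score finite for characters that never occur in the corpus.
    static double WordScore(string word, IDictionary<string, long> counts, long total)
    {
        long c;
        if (!counts.TryGetValue(word, out c)) c = 0;
        return Math.Log((c + 1.0) / (total + counts.Count + 1.0));
    }

    // seg(s) = firPart + seg(s - firPart): try each dictionary word (or a
    // single character) as firPart and keep the split with the highest score.
    // score[] and cut[] cache the best result per start index.
    static double Best(string s, int start, IDictionary<string, long> counts,
                       long total, int maxLen, double[] score, int[] cut)
    {
        if (start == s.Length) return 0.0;          // empty remainder
        if (cut[start] != 0) return score[start];   // already solved

        double best = double.NegativeInfinity;
        int bestLen = 1;
        for (int len = 1; len <= maxLen && start + len <= s.Length; len++)
        {
            string firPart = s.Substring(start, len);
            // Unknown strings longer than one character are not candidates.
            if (len > 1 && !counts.ContainsKey(firPart)) continue;

            // Independence assumption: log-probabilities of the parts add up.
            double candidate = WordScore(firPart, counts, total)
                             + Best(s, start + len, counts, total, maxLen, score, cut);
            if (candidate > best) { best = candidate; bestLen = len; }
        }
        score[start] = best;
        cut[start] = bestLen;
        return best;
    }

    // Returns the highest-scoring segmentation of s under the given counts.
    public static List<string> Segment(string s, IDictionary<string, long> counts)
    {
        long total = 0;
        int maxLen = 1;
        foreach (var kv in counts)
        {
            total += kv.Value;
            maxLen = Math.Max(maxLen, kv.Key.Length);
        }

        var score = new double[s.Length];
        var cut = new int[s.Length];                // 0 means "not solved yet"
        Best(s, 0, counts, total, maxLen, score, cut);

        var words = new List<string>();
        for (int i = 0; i < s.Length; i += cut[i])
            words.Add(s.Substring(i, cut[i]));
        return words;
    }
}
```

With made-up counts such as 中国=10, 人民=10, 中国人=2, 共和国=5, 中华=3, 爱=4 (illustration values, not taken from the PKU corpus), SegSketch.Segment("中国人民爱中华人民共和国", counts) returns 中国 / 人民 / 爱 / 中华 / 人民 / 共和国. Working in log space turns the product of independent word probabilities into a sum, which avoids underflow on long sentences.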
 
using System;
using System.Collections.Generic;
using System.Text;
using System.Collections;
using System.Windows.Forms;
using System.IO;
using System.Diagnostics;

namespace ChineseWordSeg
{
    class NaiveBayes
    {
        // The training library used is the PKU corpus.
        private string wordLibPath = "../WordLib/pku_training.txt";
        bool trained = false;
        private Dictionary<string, long> wordLib = new Dictionary<string, long>();
        private Dictionary<string, long> singleWordLib = new Dictionary<string, long>();
        int maxLen = 0;
        long maxScore = 0;
        // Records the split points of a single sentence; sentences are
        // separated by punctuation and other non-Chinese characters.
        private string segPos = "";
        // Records the entire paragraph.
        private string segSentence = "";

        // Checks whether chr is a Chinese character.
        bool isChineseWord(char chr)
        {
            if (chr >= 0x4E00 && chr <= 0x9FFF) return true;
            return false;
        }

        public void trainDate(string path)
        {
            // Count the number of times each word appears.
            //
            // 1. Count the frequency of each phrase. Naive Bayes then discriminates
            //    between the candidate splits and picks the combination with the
            //    highest probability.
            //    Is it still necessary to hash each word?
            // 2. Count the frequency of each single character, a kind of
            //    "centripetal force" that shows which two characters combine easily.
            //    This point is moot for now, because I have not implemented it.
            wordLib.Clear();

            DirectoryInfo dirInfo = new DirectoryInfo(path);
            DirectoryInfo tmpDir = dirInfo.Parent;
            string savePath = tmpDir.FullName;
            FileInfo fInfo = new FileInfo(wordLibPath);
            string fileNamePre = fInfo.Name;
            savePath += "\\" + fileNamePre + "_trained";
            FileInfo infoOfDB = new FileInfo(savePath);

            // Reuse the saved counts if a non-empty trained file already exists.
            // && short-circuits, so Length is only read when the file exists.
            if (File.Exists(savePath) && infoOfDB.Length > 0)
            {
                StreamReader sr1 = new StreamReader(savePath);
                char[] sep = { }; // an empty separator array makes Split break on white space

                while (sr1.Peek() != -1)
                {
                    string[] keyValue = sr1.ReadLine().Split(sep);

                    // The dictionary stores long values, so parse as a 64-bit integer.
                    wordLib[keyValue[0]] = Convert.ToInt64(keyValue[1]);
                }

                sr1.Close();
                return;
            }
