Main ideas:
1. Have a corpus
2. Count the frequency of occurrence of each word and use those frequencies as the naive Bayes probability estimates.
3. Example:
The corpus contains words such as 中国 (China), 人民 (people), 中华 (Zhonghua), and 共和国 (republic).
Input: 中国人民爱中华人民共和国 ("Chinese people love the People's Republic of China").
Use the maximum score to choose among candidate segmentations (the score comes from the word-frequency distribution). For example:
solution1: 中国人民 / 爱 / 中华人民 / 共和国
solution2: 中国 / 人民 / 爱 / 中华人民 / 共和国
solution3: 中国 / 人民 / 爱 / 中华 / 人民 / 共和国
bestSegSolution = max(score(segSolution[i]));
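The selection among the candidate solutions can be sketched as follows. The corpus counts and the hyphenated English glosses of the words are invented for illustration; they are not taken from the pku corpus.

```python
from math import log

# Invented corpus counts (stand-ins for frequencies from a real corpus).
counts = {"china": 50, "people": 80, "chinese": 30, "republic": 20,
          "love": 10, "china-people": 40, "chinese-people": 35}
total = sum(counts.values())

def score(words):
    # Unigram (naive Bayes) score: sum of log relative frequencies.
    # Unseen words get a small floor count instead of zero probability.
    return sum(log(counts.get(w, 0.01) / total) for w in words)

solutions = [
    ["china-people", "love", "chinese-people", "republic"],        # solution1
    ["china", "people", "love", "chinese-people", "republic"],     # solution2
    ["china", "people", "love", "chinese", "people", "republic"],  # solution3
]

# bestSegSolution = max over i of score(segSolution[i])
best = max(solutions, key=score)
```

With these particular counts the segmentation using the longer dictionary words wins, so solution1 comes out best; different counts can change the winner.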
4. Word segmentation of a Chinese string can be viewed recursively:
seg(stringIn) = firstPart + seg(stringIn - firstPart); // a score measures the quality of the current segmentation result
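The recursion above can be sketched as a small memoized search: try every dictionary word that starts the string, recurse on the remainder, and keep the best-scoring split. The vocabulary and its counts below are assumptions for illustration.

```python
from functools import lru_cache
from math import log

# Assumed vocabulary with invented counts.
vocab = {"中国": 50, "人民": 80, "中华": 30, "共和国": 20, "爱": 10,
         "中华人民": 35, "中国人民": 40}
total = sum(vocab.values())

@lru_cache(maxsize=None)
def seg(s):
    """Best (score, words) split: seg(s) = firstPart + seg(s - firstPart)."""
    if not s:
        return (0.0, ())
    best = (float("-inf"), ())
    for i in range(1, len(s) + 1):
        first = s[:i]
        if first in vocab:  # try every dictionary word that starts s
            rest_score, rest_words = seg(s[i:])
            cand = (log(vocab[first] / total) + rest_score,
                    (first,) + rest_words)
            best = max(best, cand)
    return best

best_score, best_words = seg("中国人民爱中华人民共和国")
```

Memoization (`lru_cache`) keeps the recursion from re-segmenting the same suffix repeatedly, turning the naive exponential search into a dynamic program.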
5. Naive Bayes here means that, after segmentation, adjacent words are treated as independent of each other; that is, the appearance of the latter word is assumed to be unrelated to the former (a unigram model).
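Concretely, the independence assumption means the probability of a word sequence factors into a product of single-word probabilities. The counts and corpus size below are invented for illustration.

```python
# Under the unigram independence assumption,
# P("中国 人民") = P("中国") * P("人民"): the second word's probability
# does not depend on the first.
counts = {"中国": 50, "人民": 80}
total = 1000  # assumed total word count of the corpus

p_joint = (counts["中国"] / total) * (counts["人民"] / total)
```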
6. This is only a preliminary version and it is very simple; adding more to it will improve the results. Per the usual principle, start simple and iterate.
using System;
using System.Collections.Generic;
using System.Text;
using System.Collections;
using System.Windows.Forms;
using System.IO;
using System.Diagnostics;

namespace ChineseWordSeg {
    class NaiveBayes {
        private string wordLibPath = "../WordLib/pku_training.txt"; // training corpus: the pku corpus
        bool trained = false;
        private Dictionary<string, long> wordLib = new Dictionary<string, long>();
        private Dictionary<string, long> singleWordLib = new Dictionary<string, long>();
        int maxLen = 0;
        long maxScore = 0;
        private string segPos = "";      // split points of a single sentence, delimited by punctuation and other non-Chinese characters
        private string segSentence = ""; // records the whole paragraph

        // Returns true if chr is a Chinese character.
        bool isChineseWord(char chr) {
            if (chr >= 0x4E00 && chr <= 0x9FFF)
                return true;
            return false;
        }

        public void trainDate(string path) {
            // Count the number of times each word appears:
            // 1. Compute each word's frequency for the naive Bayes decision;
            //    the segmentation with the highest combined probability wins.
            //    (Should each word also be hashed?)
            // 2. Counting which words tend to co-occur would also help,
            //    but that is not done here.
            wordLib.Clear();
            DirectoryInfo dirInfo = new DirectoryInfo(path);
            DirectoryInfo tmpDir = dirInfo.Parent;
            string savePath = tmpDir.FullName;
            FileInfo fInfo = new FileInfo(wordLibPath);
            string fileNamePre = fInfo.Name;
            savePath += "\\" + fileNamePre + "_trained";
            FileInfo infoOfDB = new FileInfo(savePath);
            if (File.Exists(savePath) && infoOfDB.Length > 0) {
                StreamReader sr1 = new StreamReader(savePath);
                char[] sep = { };
                while (sr1.Peek() != -1) {
                    string[] keyValue = sr1.ReadLine().Split(sep);
                    wordLib[keyValue[0]] = Convert.ToInt32(keyValue[1]);
                }
                return;
            }