Report on a Chinese Word Segmentation System
I. Research Background
With the rapid development of the Internet, the amount of information has grown explosively. How to quickly extract useful information from this vast amount of data has become a pressing problem. Because information processing is repetitive, and computers excel at mechanical, repetitive, rule-governed work, it is natural to use computers to help people process it. When computers are applied to natural language processing, the dominant approach is statistical, and it has achieved good results in practice.
Chinese sentences have a distinctive property: there is no delimiter separating the words within a sentence. Therefore, the first task in Chinese processing is to segment a Chinese sentence into words. That is exactly the function this project implements.
This project implements a word segmentation system. Its main content is to build a hidden Markov model, train the model parameters on the People's Daily corpus, use the Viterbi algorithm to find the most likely hidden state sequence, and finally convert the input sentence into a sequence of words.
II. Model and Method
This project mainly uses the hidden Markov model (HMM) and the Viterbi algorithm.
A hidden Markov model is a statistical model that can be represented by a five-tuple {S, O, π, A, B}. The table below describes the academic meaning and the engineering meaning of each element of the five-tuple, relating each to its role in this project:
| HMM five-tuple | Academic meaning | Engineering meaning |
| S | Set of hidden states | The 4 states of a character: word-initial (B), word-middle (M), word-final (E), single-character word (S) |
| O | Set of observations | All Chinese characters in the corpus |
| π | Initial state probability vector | The initial probability of each hidden state |
| A | Hidden state transition probability matrix | The transition probabilities among the 4 hidden states |
| B | Emission (observation) probability matrix | The probability of each Chinese character being emitted in each of the 4 states |
In this project, each Chinese character is assigned one of four possible states: word-initial (/B, begin), word-middle (/M, middle), word-final (/E, end), and single-character word (/S, single).
Given these states, an example illustrates the five parameters:
Assume the input sentence is: 我是中国人 (I am Chinese).
S = { /B, /M, /E, /S }
O = { ... } (the set of all distinct Chinese characters in the corpus)
π = { P(/B), P(/M), P(/E), P(/S) }, the probability of each hidden state at the first character of a sentence. Note that P(/M) = P(/E) = 0, since a sentence cannot begin in the middle or at the end of a word.
A =

|     | /B  | /M  | /E  | /S  |
| /B  | 0   | 0.3 | 0.7 | 0   |
| /M  | ... | ... | ... | ... |
| /E  | ... | ... | ... | ... |
| /S  | ... | ... | ... | ... |
B =

|     | 我  | 是  | 中  | 国  | 人  |
| /B  | 0.3 |     |     | ... |     |
| /M  |     | ... | ... |     | ... |
| /E  | ... |     |     | 0.6 |     |
| /S  |     | ... |     |     | ... |
All of the probabilities above can be estimated from the corpus by counting.
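As a minimal sketch of such counting, using the transition probabilities as an example (the container names and the numbers here are illustrative, not taken from the actual system):

    #include <iostream>
    #include <map>
    #include <string>

    int main() {
        // Hypothetical transition counts gathered from a corpus:
        // count[s1][s2] = number of times state s1 is followed by state s2.
        std::map<std::string, std::map<std::string, long long>> count = {
            {"/B", {{"/M", 30}, {"/E", 70}}},   // illustrative numbers only
            {"/E", {{"/B", 40}, {"/S", 25}}},
        };

        // Maximum likelihood estimate: P(s2 | s1) = count(s1, s2) / count(s1, *).
        for (const auto &row : count) {
            long long total = 0;
            for (const auto &cell : row.second) total += cell.second;
            for (const auto &cell : row.second)
                std::cout << "P(" << cell.first << " | " << row.first << ") = "
                          << static_cast<double>(cell.second) / total << "\n";
        }
        return 0;
    }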
III. System Design
The system is divided into two parts. One part trains the required parameter files from the corpus; this part is executed only once. The other part builds the concrete model parameters for the input sentence (as seen above, it looks up the probabilities relevant to the specific input), then runs the Viterbi algorithm to find the best hidden state sequence. The final segmentation result is produced from that hidden state sequence.
The system is developed in C++. C++ is somewhat inconvenient for handling Chinese text: an English character occupies one byte, while a Chinese character occupies two bytes (you can tell whether a byte belongs to an ASCII character or a Chinese character by checking whether its signed value is less than zero). In the end, a few techniques worked around this inconvenience.
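A minimal sketch of that byte-level test, assuming a GB-family encoding such as GB2312/GBK in which the first byte of a two-byte character has its high bit set (the function name splitChars is hypothetical):

    #include <string>
    #include <vector>

    // Split a GB-encoded string into units: one byte per ASCII character,
    // two bytes per Chinese character (the first byte has its high bit set,
    // so it is negative when read as a signed char).
    std::vector<std::string> splitChars(const std::string &line) {
        std::vector<std::string> units;
        for (std::size_t i = 0; i < line.size(); ) {
            if (static_cast<signed char>(line[i]) < 0 && i + 1 < line.size()) {
                units.push_back(line.substr(i, 2));  // Chinese character
                i += 2;
            } else {
                units.push_back(line.substr(i, 1));  // ASCII character
                i += 1;
            }
        }
        return units;
    }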
1. Corpus Processing
(1) Removing part-of-speech tags from the original corpus
A. The original corpus: each word is annotated with a part-of-speech tag (sample figure omitted).
B. The processed corpus: a space is added at the front of each line and the part-of-speech tags are removed (sample figure omitted).
C. Processing flow chart (figure omitted). A sketch of this step follows.
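As a minimal sketch of this step, assuming the People's Daily annotation format in which each token has the form word/tag and tokens are separated by spaces (the file names here are hypothetical):

    #include <fstream>
    #include <sstream>
    #include <string>

    int main() {
        std::ifstream in("corpus_tagged.txt");   // hypothetical input file
        std::ofstream out("corpus_words.txt");   // hypothetical output file
        std::string line, token;
        while (std::getline(in, line)) {
            std::istringstream iss(line);
            while (iss >> token) {
                // Drop everything from the last '/' on: "word/tag" -> "word".
                std::size_t slash = token.rfind('/');
                if (slash != std::string::npos) token = token.substr(0, slash);
                out << ' ' << token;   // a space is kept before every word
            }
            out << '\n';
        }
        return 0;
    }

Writing a space before every word also produces the leading space at the front of each line described above.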
(2) Counting the number of times each character appears in each state
A. The data structures are designed as follows:
    #include <algorithm>
    #include <list>
    #include <string>
    using namespace std;

    struct Node {
        string name;              // a single Chinese character
        int quantity;             // number of occurrences of the character
        bool operator==(const Node &a) const { return name == a.name; }
    };

    struct Word {
        string name;              // state name (/B, /M, /E or /S)
        long long num;            // total number of occurrences of the state
        list<Node> chinese;       // characters observed in this state
        bool operator==(const Word &a) const { return name == a.name; }
        bool findCh(string ch) {  // count one occurrence of character ch
            Node temp;
            temp.name = ch;
            temp.quantity = 1;
            list<Node>::iterator it = find(chinese.begin(), chinese.end(), temp);
            if (it == chinese.end()) {
                chinese.push_back(temp);   // first occurrence in this state
            } else {
                it->quantity++;            // seen before: increment its count
            }
            return true;
        }
    };
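For example (a hypothetical call, not taken from the original code), counting one occurrence of the character 中 in the /B state would look like:

    Word wordB;
    wordB.name = "/B";
    wordB.findCh("中");   // inserts 中 with quantity 1, or increments its count

A map keyed by character would avoid the linear scan over the list, but the list mirrors the structure used in this project.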
B. Processing steps (a sketch of step b is given after this subsection):
a. Read a line from the corpus, then traverse the string, extracting one Chinese character at a time.
b. Determine the character's state from the spaces before and after it (B: space before, none after; M: no space before or after; E: no space before, space after; S: spaces both before and after).
c. According to the character's state, check whether the character has already appeared in that state; if yes, add 1 to its count; if no, insert a new node with its count set to 1.
d. Repeat until the end of the file is reached.
C. After processing, the following count files are obtained (sample figure omitted).
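A minimal sketch of step b, assuming i is the byte position of the character in the line, len is its byte length, and words are separated by single spaces as produced by the preprocessing step (the function name charState is hypothetical):

    #include <string>

    // Decide the state of the character at byte position i (length len)
    // from the bytes immediately before and after it.
    std::string charState(const std::string &line, std::size_t i, std::size_t len) {
        bool spaceBefore = (i == 0) || (line[i - 1] == ' ');
        bool spaceAfter  = (i + len >= line.size()) || (line[i + len] == ' ');
        if (spaceBefore && spaceAfter)  return "/S";  // single-character word
        if (spaceBefore && !spaceAfter) return "/B";  // word-initial
        if (!spaceBefore && spaceAfter) return "/E";  // word-final
        return "/M";                                  // word-middle
    }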
(3) Counting transitions between states to obtain the state transition matrix
A. Count the number of transitions between each pair of states, and the total number of transitions, over the whole corpus, then compute the corresponding probabilities.
B. The processing flow for this step is as follows (flow chart omitted).
C. After processing, the 4x4 state transition matrix is obtained. A sketch of the counting loop follows.
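A minimal sketch of the counting loop, assuming stateSeq holds the state of every character of one line in order; the index labeling B=0, M=1, E=2, S=3 is an assumption for illustration:

    #include <vector>

    // count[s1][s2] accumulates how often state s1 is followed by state s2;
    // dividing each row by its sum yields the 4x4 matrix A. The caller is
    // expected to zero-initialize count before the first call.
    void countTransitions(const std::vector<int> &stateSeq,
                          long long count[4][4]) {
        for (std::size_t i = 0; i + 1 < stateSeq.size(); ++i)
            count[stateSeq[i]][stateSeq[i + 1]]++;
    }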
2. Viterbi decoding: finding the best hidden state sequence
(1) The Viterbi algorithm is a dynamic programming algorithm. In this project, for every state of the current character, the probability of arriving at that state from each state of the previous character is computed, and the maximum is kept as the probability of that state. Iterating in this way up to the last character gives, for each state, the maximum probability of ending in that state. The best hidden state sequence is then obtained by backtracking.
(2) The algorithm pseudo-code (original figure omitted; a sketch is given below):
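A minimal C++ sketch of the algorithm, assuming pi, A and the per-state emission maps Bm come from the training stage (the numeric values and the smoothing constant below are illustrative placeholders, not the trained parameters):

    #include <iostream>
    #include <map>
    #include <string>
    #include <vector>

    int main() {
        const int N = 4;                       // states: 0=/B, 1=/M, 2=/E, 3=/S
        const char *tag = "BMES";
        double pi[N] = {0.6, 0.0, 0.0, 0.4};   // placeholder values
        double A[N][N] = {{0.0, 0.3, 0.7, 0.0},
                          {0.0, 0.4, 0.6, 0.0},
                          {0.5, 0.0, 0.0, 0.5},
                          {0.5, 0.0, 0.0, 0.5}};
        std::vector<std::string> obs = {"我", "是", "中", "国", "人"};
        std::map<std::string, double> Bm[N];   // emission probabilities per state
        // ... Bm[s][character] would be filled from the trained model ...

        auto emit = [&](int s, const std::string &ch) {
            std::map<std::string, double>::iterator it = Bm[s].find(ch);
            return it == Bm[s].end() ? 1e-8 : it->second;  // smooth unseen chars
        };

        std::size_t T = obs.size();
        std::vector<std::vector<double>> delta(T, std::vector<double>(N, 0.0));
        std::vector<std::vector<int>> psi(T, std::vector<int>(N, 0));
        for (int s = 0; s < N; ++s)
            delta[0][s] = pi[s] * emit(s, obs[0]);
        for (std::size_t t = 1; t < T; ++t)
            for (int s = 0; s < N; ++s)
                for (int p = 0; p < N; ++p) {
                    double v = delta[t - 1][p] * A[p][s] * emit(s, obs[t]);
                    if (v > delta[t][s]) { delta[t][s] = v; psi[t][s] = p; }
                }

        // Backtrack from the best final state to recover the state sequence.
        std::vector<int> best(T);
        best[T - 1] = 0;
        for (int s = 1; s < N; ++s)
            if (delta[T - 1][s] > delta[T - 1][best[T - 1]]) best[T - 1] = s;
        for (std::size_t t = T - 1; t > 0; --t) best[t - 1] = psi[t][best[t]];

        // Print a '/' after every word-final (E) or single (S) character.
        for (std::size_t t = 0; t < T; ++t) {
            std::cout << obs[t];
            if (tag[best[t]] == 'E' || tag[best[t]] == 'S') std::cout << '/';
        }
        std::cout << '\n';
        return 0;
    }

The output loop already applies the rule described in the analysis below: a '/' is printed after every character whose state is E or S.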
IV. System Demonstration and Analysis
1. Test examples and results (screenshots omitted)
2. Analysis of results
(1) 商品和服务 (goods and services) → BESBE → 商品/和/服务/
(2) 中国在比赛中赢得了胜利 (China won the match) → BESBESBESBE → 中国/在/比赛/中/赢得/了/胜利/
(3) Producing the segmentation: given the hidden state sequence obtained by the Viterbi algorithm, the characters are output in order and a '/' is appended after every character in the E or S state; the printed result then shows the word boundaries.
(4) Because every character is tagged with a state individually, the segmentation process can split characters that form a word and merge characters that do not, causing wrong segmentations. For example, "tomorrow" in the tests above was split apart, while the character for "day" was merged with a neighbor. Likewise, 和尚 (monk) and 尚未 (not yet) can be split incorrectly, even though both words occur in the corpus.
3. Improvement Plan
This project relies on the HMM alone, so it inevitably has defects. To improve the system, it could be combined with other word segmentation methods, performing further analysis during or after the HMM step to obtain a better segmentation result.
V. Thoughts, Opinions and Suggestions on This Course
The six-week natural language understanding course has come to a successful end. In this course I opened my eyes, learned many things I did not know before, and gained much knowledge about natural language understanding. Although I have not mastered everything, it broadened my horizons and widened my field of vision. In addition, through the course project I had to find the relevant material in a huge amount of information; I read many blogs and papers (though only roughly) and gained a certain understanding of the hidden Markov model. In the actual programming practice I also discovered the shortcomings of my own programming. It was a rewarding experience.
Six weeks of study is not very long. The point is not to master many models or learn many natural language processing methods all at once, but to broaden one's horizons and get a first taste of the world of natural language processing.