Statistical language models have a natural advantage over rule-based language models, and (Chinese) word segmentation is the foundation of Chinese natural language processing. In this section we introduce statistics-based Chinese word segmentation and part-of-speech tagging. The plan is as follows: first we introduce the basic concepts involved in Chinese processing, and then we analyze the principles behind several open-source, statistics-based Chinese word segmenters.
The basic concepts involved in Chinese word segmentation include the Markov chain, the Hidden Markov Model (HMM), the N-gram model, the Maximum Entropy Markov Model (MEMM), and the Conditional Random Field (CRF).
1. Markov Chain
Informally, a Markov chain is a sequence over a state space in which the current state depends only on the preceding n (n = 1, 2, ...) states.
The specific definition is as follows:
A Markov chain is a sequence of random variables x1, x2, x3, ... with the Markov property: the next state depends only on the current state, not on the states before it.
The mathematical formula is as follows:
Pr(Xn+1 = x | X1 = x1, X2 = x2, ..., Xn = xn) = Pr(Xn+1 = x | Xn = xn)
Here the set of all possible values of Xn (n = 1, 2, 3, ...) is called the "state space", and the value of Xn is the state at time n.
A Markov chain is usually drawn as a directed graph: the states are the vertices of the graph, and the state transition probabilities label the edges, as shown in Figure 1.
Figure 1: A Markov chain with two states
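As a small illustration, a two-state Markov chain like the one in Figure 1 can be simulated in a few lines of Python. The states and transition probabilities below are made-up example values, not numbers taken from the figure.

```python
import random

# A made-up two-state Markov chain for illustration; the states and
# transition probabilities are arbitrary example values.
states = ["Sunny", "Rainy"]
transition = {
    "Sunny": {"Sunny": 0.9, "Rainy": 0.1},
    "Rainy": {"Sunny": 0.5, "Rainy": 0.5},
}

def simulate(start, steps):
    """Walk the chain: each next state depends only on the current state."""
    sequence = [start]
    for _ in range(steps):
        current = sequence[-1]
        weights = [transition[current][s] for s in states]
        sequence.append(random.choices(states, weights=weights)[0])
    return sequence

print(simulate("Sunny", 10))
```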
2. Hidden Markov Model
HMM Definition
An HMM is a triple (π, A, B):
π: the initial state probability vector
A = (aij): the state transition probability matrix, Pr(xi | xj)
B = (bij): the confusion (emission) probability matrix, Pr(yi | xj)
All state transition probabilities and confusion probabilities are assumed to remain constant over time, which is also the most unrealistic assumption in an HMM.
An HMM is essentially characterized by two kinds of states and three sets of probabilities.
Two kinds of states: observed states and hidden states.
Three sets of probabilities: the initial probabilities, the state transition probabilities, and the emission (confusion) probabilities that relate the two kinds of states.
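As a minimal sketch, the triple (π, A, B) can be stored directly in a small Python data structure. The container below is our own representation, not something defined in this section; only the meaning of π, A and B follows the definition above.

```python
from dataclasses import dataclass
from typing import Dict, List

# A minimal container for the HMM triple (pi, A, B).
# The field names and layout are our own choices.
@dataclass
class HMM:
    states: List[str]               # hidden states
    observations: List[str]         # observable symbols
    pi: Dict[str, float]            # initial probability: pi[state]
    A: Dict[str, Dict[str, float]]  # transition probability: A[prev_state][state]
    B: Dict[str, Dict[str, float]]  # confusion/emission probability: B[state][observation]
```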
Example
We use a part-of-speech tagging example to illustrate the principle of HMM.
Observed states (the words of the sentence): He / is / computer / doctor
Hidden states: pronoun, verb, noun
Assume that, based on the corpus, the transition probabilities between the hidden states are as follows. We call this the state transition probability matrix.
|         | Pronoun | Verb  | Noun  |
|---------|---------|-------|-------|
| Pronoun | 0.5     | 0.25  | 0.25  |
| Verb    | 0.375   | 0.125 | 0.375 |
| Noun    | 0.125   | 0.625 | 0.375 |
From the corpus we can also obtain the emission probability matrix relating hidden states to observed words, i.e. the confusion matrix, as shown below:
|         | He   | is   | computer | doctor |
|---------|------|------|----------|--------|
| Pronoun | 0.60 | 0.20 | 0.15     | 0.05   |
| Verb    | 0.25 | 0.25 | 0.25     | 0.25   |
| Noun    | 0.05 | 0.10 | 0.35     | 0.50   |
At the same time, we assume that the initial probability is as follows:
| Pronoun | Verb | Noun |
|---------|------|------|
| 0.63    | 0.17 | 0.20 |
With this, we have trained a part-of-speech tagging HMM from corpus statistics.
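Using the HMM container sketched earlier (again, our own representation), the tables above can be written down directly, which makes the questions below easy to experiment with:

```python
# The part-of-speech tagging HMM from the tables above.
pos_hmm = HMM(
    states=["Pronoun", "Verb", "Noun"],
    observations=["He", "is", "computer", "doctor"],
    pi={"Pronoun": 0.63, "Verb": 0.17, "Noun": 0.20},
    A={
        "Pronoun": {"Pronoun": 0.5,   "Verb": 0.25,  "Noun": 0.25},
        "Verb":    {"Pronoun": 0.375, "Verb": 0.125, "Noun": 0.375},
        "Noun":    {"Pronoun": 0.125, "Verb": 0.625, "Noun": 0.375},
    },
    B={
        "Pronoun": {"He": 0.60, "is": 0.20, "computer": 0.15, "doctor": 0.05},
        "Verb":    {"He": 0.25, "is": 0.25, "computer": 0.25, "doctor": 0.25},
        "Noun":    {"He": 0.05, "is": 0.10, "computer": 0.35, "doctor": 0.50},
    },
)
```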
What can we do with the HMM model?
(1) Evaluation: compute the probability of an observed sequence under a known HMM. For example, we can evaluate the probability that "He is a computer doctor" appears. The Forward algorithm computes the probability that an observed state sequence was produced by a given HMM (see the first sketch after this list).
(2) Decoding: find the hidden state sequence that most likely generated an observed state sequence. For example, from the sequence "He is a computer doctor" we can recover the corresponding "pronoun verb noun" sequence. This problem is solved with the Viterbi algorithm (see the second sketch below).
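The two sketches below illustrate these problems on the pos_hmm defined above. They are straightforward textbook-style implementations written for this example, not code taken from any particular library. First, the Forward algorithm for evaluation:

```python
def forward_probability(hmm, obs):
    """Probability of the observed word sequence under the HMM (Forward algorithm)."""
    # alpha[s]: probability of the observations seen so far, ending in hidden state s
    alpha = {s: hmm.pi[s] * hmm.B[s][obs[0]] for s in hmm.states}
    for word in obs[1:]:
        alpha = {
            s: sum(alpha[prev] * hmm.A[prev][s] for prev in hmm.states) * hmm.B[s][word]
            for s in hmm.states
        }
    return sum(alpha.values())

print(forward_probability(pos_hmm, ["He", "is", "computer", "doctor"]))
```

And the Viterbi algorithm for decoding, which recovers the most likely hidden tag sequence:

```python
def viterbi(hmm, obs):
    """Most likely hidden state sequence for the observations (Viterbi algorithm)."""
    # delta[s]: (probability of the best path ending in state s, that path)
    delta = {s: (hmm.pi[s] * hmm.B[s][obs[0]], [s]) for s in hmm.states}
    for word in obs[1:]:
        new_delta = {}
        for s in hmm.states:
            best_prev = max(hmm.states, key=lambda p: delta[p][0] * hmm.A[p][s])
            prob = delta[best_prev][0] * hmm.A[best_prev][s] * hmm.B[s][word]
            new_delta[s] = (prob, delta[best_prev][1] + [s])
        delta = new_delta
    return max(delta.values(), key=lambda t: t[0])[1]

# Prints the most likely tag sequence for the example sentence.
print(viterbi(pos_hmm, ["He", "is", "computer", "doctor"]))
```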
But where does the trained hidden Markov model on which evaluation and decoding rely come from?
This is the hardest of the HMM problems: given an observed sequence (drawn from a known symbol set) and the associated set of hidden states, estimate the most suitable hidden Markov model, that is, determine the (π, A, B) triple that best describes the known sequences. When the matrices A and B cannot be measured (estimated) directly, the forward-backward algorithm is used for learning (parameter estimation), which is also common in practical applications.
Because the accuracy of learning directly with the forward-backward algorithm is not very high, the common practice is to build the HMM from a manually tagged corpus. Note, however, that manual corpus tagging requires a large amount of work.
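To make that concrete, the following is a rough sketch of how (π, A, B) can be estimated by simple counting over a manually tagged corpus. It reuses the HMM container assumed earlier and applies no smoothing, so it illustrates the idea rather than a production implementation.

```python
from collections import Counter, defaultdict

def estimate_hmm(tagged_sentences):
    """Estimate (pi, A, B) by counting over a tagged corpus.
    Each sentence is a list of (word, tag) pairs; no smoothing is applied."""
    def normalize(counts):
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()} if total else {}

    pi_counts = Counter()
    trans_counts = defaultdict(Counter)
    emit_counts = defaultdict(Counter)
    for sentence in tagged_sentences:
        tags = [tag for _, tag in sentence]
        pi_counts[tags[0]] += 1                # first tag of each sentence
        for prev, cur in zip(tags, tags[1:]):
            trans_counts[prev][cur] += 1       # tag-to-tag transitions
        for word, tag in sentence:
            emit_counts[tag][word] += 1        # tag-to-word emissions

    states = sorted(emit_counts)
    words = sorted({word for s in tagged_sentences for word, _ in s})
    return HMM(
        states=states,
        observations=words,
        pi=normalize(pi_counts),
        A={s: normalize(trans_counts[s]) for s in states},
        B={s: normalize(emit_counts[s]) for s in states},
    )
```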
Disadvantages of Hidden Markov Model:
The HMM makes two assumptions: first, that the output observations are strictly independent of each other; second, that during state transitions the current state depends only on the previous state (the first-order Markov assumption).
Consider again the earlier example:
Observed states: He / is / computer / doctor
Hidden states: pronoun, verb, noun
When computing Pr(doctor | noun), the context of "doctor", namely "computer", is not taken into account; likewise, the probability of the first "noun" appearing depends only on the preceding "verb". The model's ability to capture context information is therefore limited.