The hidden Markov model (Hidden Markov Model, HMM) was originally introduced by L. E. Baum and other scholars in a series of statistical papers, and has since proved valuable in speech recognition, natural language processing, and bioinformatics. Articles touching on HMMs come up all the time, but I had never studied the model carefully and had only skimmed the surface, so I want to spend a little time working through it to deepen my understanding. Particular thanks to 52nlp for its detailed introduction to HMMs.
Consider the following traffic light example. One possible sequence of states is red, red/orange, green, orange, red. This sequence can be drawn as a state machine in which the states alternate, each state depending only on the previous one: if the light is currently green, the next state is orange. This is a deterministic system, so it is easy to understand and analyze once the state transitions are known. But many systems in practice are not deterministic.
In daily life we often want to forecast future weather from current conditions. Unlike the traffic light above, we cannot determine tomorrow's weather from existing knowledge, but we still hope to extract a weather pattern. One approach is to assume that each state of the model depends only on the previous state; this is called the Markov assumption, and it greatly simplifies the problem. Obviously, it is also a rather crude assumption, and a lot of important information is lost.
Applied to weather, the Markov assumption says that today's weather can be predicted from the weather of the immediately preceding day alone, rather than from the full history. This is somewhat unrealistic, but such a simplified system is easier to analyze, so we generally accept the assumption, knowing that it yields useful, if not perfectly accurate, information.
Before getting to HMMs, let us first introduce the Markov process, named after the Russian mathematician Andrey Markov. It is a discrete stochastic process with the Markov property. In such a process, each state transition depends only on the previous N states; this is called an order-N model, where N is the number of states that influence the next transition. The simplest Markov process is a first-order process, in which each transition depends only on the single previous state. Note that this is still different from a deterministic system, because the transitions are probabilistic, not deterministic.
A Markov chain is a sequence of random variables X1, ..., Xn. The range of these variables, the set of all their possible values, is called the "state space", and the value of Xn is the state at time n. If the conditional probability distribution of Xn+1 given the past states is a function of Xn alone, then

Pr(Xn+1 = x | X1 = x1, ..., Xn = xn) = Pr(Xn+1 = x | Xn = xn)

Here x is some state in the process. The identity above is called the Markov property.
Markov chains play an important role in many applications; for example, Google's page ranking algorithm (PageRank) is defined by a Markov chain.
The diagram below shows all possible first-order transitions in the weather example:
Note that a first-order process with N states has N^2 possible state transitions. The probability of each transition is called the state transition probability (transition probability), the probability of moving from one state to another. All N^2 of these probabilities can be collected into a state transition matrix, which looks like this:
The following constraints apply to the matrix:
The following is the state transition matrix for the weather example:
This matrix says that if yesterday was sunny, then today there is a 50% chance of sun, a 37.5% chance of clouds, and a 12.5% chance of rain. Note that each row of the matrix sums to 1.
In order to initialize such a system, we need an initial probability vector:
This vector says that on the first day the weather is sunny with probability 1.
Here we define the following three parts for the first-order Markov process above:
states: sunny, cloudy, and rainy
initial vector: defines the probability of each state at time 0
state transition matrix: the probability of each weather-to-weather transition
Any system that can be described in this way is a Markov process.
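As a minimal sketch of how such a chain evolves, the snippet below propagates the initial vector forward through the transition matrix. The "sunny" row matches the 50% / 37.5% / 12.5% figures quoted above; the other two rows are invented for illustration.

```python
# A first-order Markov chain for the weather example.
states = ["sunny", "cloudy", "rainy"]

pi = [1.0, 0.0, 0.0]          # initial vector: day 1 is sunny

A = [                          # A[i][j] = Pr(tomorrow = j | today = i)
    [0.500, 0.375, 0.125],     # from sunny
    [0.250, 0.125, 0.625],     # from cloudy (illustrative values)
    [0.250, 0.375, 0.375],     # from rainy  (illustrative values)
]

def step(dist):
    """Propagate a distribution over states one day forward."""
    n = len(dist)
    return [sum(dist[i] * A[i][j] for i in range(n)) for j in range(n)]

day2 = step(pi)                # distribution over the weather on day 2
day3 = step(day2)              # ... and on day 3
```

Because day 1 is sunny with probability 1, the day-2 distribution is simply the "from sunny" row of the matrix.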
But what do we do when a Markov process is not expressive enough? In some cases a Markov process alone cannot describe the pattern we want to discover.
For example, a hermit may not be able to observe the weather directly, but folklore tells us that the state of a piece of seaweed is probabilistically related to the weather. In that case we have two sets of states: a set of observable states (the state of the seaweed) and a set of hidden states (the weather). We would like an algorithm that can predict the weather from the condition of the seaweed under the Markov assumption.
A more realistic example is speech recognition. The sound we hear is the joint effect of the vocal cords, the throat, and other articulatory organs. These factors together determine the sound of each word, and what a speech recognition system detects (the observable states) is produced by physical changes inside the body (the hidden states, which by extension reflect what the person actually wants to say).
Some speech recognition systems treat the internal articulatory mechanism as a hidden state sequence, and the resulting sound as an observable state sequence that closely tracks the hidden one. An important point in both examples is that the number of hidden states and the number of observable states may differ. In a 3-state weather system (sunny, cloudy, rainy), we may observe 4 degrees of seaweed moisture (dry, dryish, damp, soggy). In speech recognition, a simple utterance may need only 80 phonemes to describe, while the articulatory mechanism can produce more or fewer than 80 distinct sounds.
In the cases above, the observable state sequence and the hidden state sequence are probabilistically related. We can therefore model such a process as a hidden Markov process together with a set of observable states probabilistically related to it. This is the hidden Markov model, the main subject of this post.
A hidden Markov model (Hidden Markov Model) is a statistical model that describes a Markov process with hidden, unknown parameters. The difficulty is to determine the hidden parameters of the process from the observable parameters and then use them for further analysis. The figure below is the state transition diagram of a three-state hidden Markov model, where x denotes a hidden state, y an observable output, a a state transition probability, and b an output probability.
The figure shows the relationship between the hidden states and the observable states in the weather example. We assume the hidden states form a simple first-order Markov process, so every pair of hidden states can transition to each other.
There are three important assumptions behind an HMM, even though they are unrealistic:
assumption 1: the Markov assumption (the states form a first-order Markov chain)
assumption 2: the stationarity assumption (transition probabilities are independent of time)
assumption 3: the output independence assumption (an output depends only on the current state)
There is a probabilistic relationship between the hidden states and the observable states: a hidden state H produces an observable state O1 with some probability P(O1 | H). If there are 3 observable states, then clearly P(O1 | H) + P(O2 | H) + P(O3 | H) = 1.
In this way we obtain another matrix, called the confusion matrix (confusion matrix). Its entries are the probabilities that each hidden state is observed as each of the observable states. In the weather example, the matrix is:
The illustration above emphasized the state changes of an HMM, and the evolution of the model is clear: the green circles are hidden states, the purple circles are observable states, and the arrows indicate probabilistic dependence between states. An HMM can be written as a 5-tuple {N, M, π, A, B}, where N is the number of hidden states (we either know the exact value or guess it), M is the number of observable states (which can be obtained from the training set), π = {πi} is the initial state probability vector, A = {aij} is the hidden-state transition matrix, with aij = Pr(xt = i | xt-1 = j), and B = {bik} is the confusion matrix, giving the probability that hidden state i is observed as state k at some moment, bik = Pr(ot = k | xt = i). Every probability in the state transition matrix and in the confusion matrix is time-independent: as the system evolves, these matrices do not change over time. For fixed N and M, the HMM parameters are written λ = {π, A, B}.
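Written out concretely for the weather/seaweed example, the 5-tuple might look as follows. The confusion-matrix entries and the non-sunny transition rows are illustrative guesses, not values from the text:

```python
# lambda = {pi, A, B} for N = 3 hidden states and M = 4 observable states.
hidden_states = ["sunny", "cloudy", "rainy"]        # N = 3
observables   = ["dry", "dryish", "damp", "soggy"]  # M = 4

hmm = {
    "pi": [1.0, 0.0, 0.0],          # initial state probabilities
    "A": [                          # a_ij: hidden-state transition matrix
        [0.500, 0.375, 0.125],
        [0.250, 0.125, 0.625],
        [0.250, 0.375, 0.375],
    ],
    "B": [                          # b_ik: confusion matrix
        [0.60, 0.20, 0.15, 0.05],   # seaweed given sunny
        [0.25, 0.25, 0.25, 0.25],   # seaweed given cloudy
        [0.05, 0.10, 0.35, 0.50],   # seaweed given rainy
    ],
}
```

Each row of A and of B is a probability distribution, so every row must sum to 1, as must π.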
In an ordinary Markov model the states are directly visible to the observer, and the state transition probabilities are the only parameters. In a hidden Markov model the states are not directly visible; only some variables influenced by the states are visible. Each state has a probability distribution over the symbols it may output, so the sequence of output symbols reveals some information about the state sequence.
There are three typical problems for HMMs:
(i) Given the model parameters, compute the probability of a particular observable state sequence.
Suppose we already have a particular hidden Markov model λ and an observable state sequence. We may want to know the probability of that observable sequence, summed over all possible hidden state sequences. First, suppose a particular hidden state sequence is given as follows:
Given the HMM and this hidden state sequence, the probability of the observable state sequence is:
The probability of the hidden state sequence itself under the HMM is:
Therefore, the joint probability of the hidden state sequence and the observable state sequence is:
Summing over all possible hidden state sequences, the probability of the observable state sequence is:
For example, we might have a "summer" model and a "winter" model of the seaweed, since its behavior should differ between summer and winter, and we want to decide from an observable state sequence (the moisture of the seaweed) whether it is summer or winter.
We can use the forward algorithm to compute the probability of an observable state sequence under each particular HMM, and then pick the most probable model.
This kind of application usually appears in speech recognition: we use many HMMs, one per word. An observable state sequence is obtained from the word that is heard, and the word is recognized by finding the HMM under which that observable sequence has the highest probability.
The following describes the forward algorithm (Forward algorithm)
How can I calculate the probability of an observable sequence?
1. Exhaustive search
Given an HMM, we want to compute the probability of an observable sequence. Consider the weather example: we have an HMM describing the weather and the seaweed, and we also have a sequence of seaweed states. Suppose that over three days the observations are (dry, damp, soggy); on each of those days the weather may be sunny, cloudy, or rainy. We can use a trellis to describe the observation sequence and the hidden sequences:
Each column in the trellis represents a possible weather state; each state in a column connects to each state in the adjacent column, and each such transition has a probability from the state transition matrix. Below each column is the observed seaweed state for that day; the probability of that observation in each weather state comes from the confusion matrix.
One possible way to compute the probability of the observations is to enumerate every possible hidden state sequence; here there are 3^3 = 27 of them. The probability of the observation sequence is then Pr(dry, damp, soggy | HMM) = Pr(dry, damp, soggy | sunny, sunny, sunny) · Pr(sunny, sunny, sunny) + ... + Pr(dry, damp, soggy | rainy, rainy, rainy) · Pr(rainy, rainy, rainy).
Obviously this computation is very inefficient, especially when the number of states is large or the sequence is long. In fact, we can exploit the assumption that the probabilities do not change over time to reduce the cost.
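The exhaustive summation can be sketched directly. The model numbers below are illustrative (the seaweed emission probabilities are made up):

```python
# Brute-force Pr(observations | HMM): sum the joint probability over
# every possible hidden-state sequence.
from itertools import product

pi = [1.0, 0.0, 0.0]
A = [[0.500, 0.375, 0.125],
     [0.250, 0.125, 0.625],
     [0.250, 0.375, 0.375]]
B = [[0.60, 0.20, 0.15, 0.05],
     [0.25, 0.25, 0.25, 0.25],
     [0.05, 0.10, 0.35, 0.50]]

obs = [0, 2, 3]                     # indices of (dry, damp, soggy)

def exhaustive(obs, pi, A, B):
    N, T = len(pi), len(obs)
    total = 0.0
    for path in product(range(N), repeat=T):   # 3^3 = 27 hidden sequences
        p = pi[path[0]] * B[path[0]][obs[0]]   # start state and first emission
        for t in range(1, T):
            p *= A[path[t - 1]][path[t]] * B[path[t]][obs[t]]
        total += p
    return total

p_exhaustive = exhaustive(obs, pi, A, B)
```

The inner loop multiplies one transition and one emission probability per step, so each of the 27 hidden sequences costs O(T) work; the total cost grows exponentially with T.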
2. Use recursion to reduce complexity
We can compute the probability of an observable sequence under a given HMM recursively. We first define a partial probability, the probability of reaching an intermediate state. Below we will see how these partial probabilities are computed at t = 1 and at t > 1.
Suppose the observable sequence over a period of T time steps is:
1) Partial probability
The following diagram shows the first-order transitions for the observation sequence (dry, damp, soggy):
We can compute the probability of reaching an intermediate state as the sum of the probabilities of all paths into that state. For example, at t = 2 the probability of cloudy is the sum of the probabilities of three paths:
We write α_t(j) for the partial probability of state j at time t:
α_t(j) = Pr(observation at t | hidden state j) × Pr(all paths to state j at time t)
The partial probability of a state at the last observation covers all possible paths that end in that state; for example, in this case the partial probabilities of the last column are computed over the following paths:
Because the partial probabilities of the last column account for every possible path, their sum is the probability of the observation sequence under the given HMM.
2) Computing the partial probabilities at t = 1
At t = 1 there are no paths into a state yet, so we use the initial probabilities, Pr(state j | t = 0) = π(j), and the partial probability at t = 1 is:
α_1(j) = π(j) · b_jk1
At the initial time the probability of state j depends not only on the state itself but also on the observation, so the confusion matrix enters here: k1 is the first observed state, and b_jk1 is the probability of observing k1 when the hidden state is j.
3) Computing the partial probabilities at t > 1
Recall the formula for the partial probability: α_t(j) = Pr(observation at t | hidden state j) × Pr(all paths to state j at time t). The first factor comes from the confusion matrix, so we only need to compute the second, which sums over all incoming paths:
The number of paths grows quickly with the length of the observation sequence, but the partial probabilities at time t already account for all earlier paths, so the partial probabilities at time t+1 can be computed from those at time t alone:
α_t+1(j) = b_jk(t+1) · Σ_i α_t(i) · a_ij
In words, b_jk(t+1) is the probability that hidden state j produces the observation at time t+1, and the sum is the total probability of transitioning from any hidden state at time t into hidden state j at time t+1. In this way each step reuses the result of the previous step, saving a great deal of time.
4) Formula derivation
5) Reduce computational complexity
Let us compare the complexity of the exhaustive and recursive algorithms. Suppose we have an HMM with N hidden states and an observation sequence of length T.
The exhaustive algorithm must evaluate every possible hidden sequence:
which requires on the order of N^T computations.
The time cost of the exhaustive algorithm is therefore exponential in T. With the recursive algorithm, since each step reuses the result of the previous step, the cost is linear in T: the complexity is N^2·T.
Our goal is to compute the probability of an observable sequence under a given HMM. By first computing partial probabilities, we recursively account for all paths through the whole sequence, saving time. At t = 1 we use the initial probabilities and the confusion matrix; at time t we reuse the results from time t-1.
In this way we can compute the probability over all possible paths recursively, and at the end all the partial probabilities of the last column are summed to give the result.
Using the weather example, the figure below shows how the partial probability of the cloudy state at t = 2 is computed:
We use the forward algorithm to compute the probability of an observable sequence under a given HMM. The forward algorithm is essentially recursion, reusing earlier results. With it, given a collection of HMMs we can find the model that best fits the current observable sequence (the one for which the forward algorithm computes the largest probability).
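A minimal sketch of the forward algorithm just described, using the same illustrative model numbers as before:

```python
# alpha[t][j] is the partial probability of the first t+1 observations,
# ending in hidden state j.

pi = [1.0, 0.0, 0.0]
A = [[0.500, 0.375, 0.125],
     [0.250, 0.125, 0.625],
     [0.250, 0.375, 0.375]]
B = [[0.60, 0.20, 0.15, 0.05],
     [0.25, 0.25, 0.25, 0.25],
     [0.05, 0.10, 0.35, 0.50]]

def forward(obs, pi, A, B):
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    for j in range(N):              # t = 1: alpha_1(j) = pi(j) * b_jk1
        alpha[0][j] = pi[j] * B[j][obs[0]]
    for t in range(1, T):           # t > 1: reuse the previous column
        for j in range(N):
            alpha[t][j] = B[j][obs[t]] * sum(
                alpha[t - 1][i] * A[i][j] for i in range(N))
    return alpha

obs = [0, 2, 3]                     # (dry, damp, soggy)
alpha = forward(obs, pi, A, B)
p_forward = sum(alpha[-1])          # Pr(observations | HMM)
```

Instead of enumerating N^T hidden sequences, each column of the trellis is computed from the previous one in O(N^2) work, giving the N^2·T complexity discussed above.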
(ii) Finding the most probable sequence of hidden states based on a sequence of observable states
Similar to the problem above, and more interesting, is finding the hidden sequence from the observable sequence. In many cases we care more about the hidden states, because they contain valuable information that cannot be observed directly. In the seaweed and weather example, the hermit can only see the state of the seaweed but wants to know the weather. Given an HMM, we can use the Viterbi algorithm to find the most probable hidden state sequence for an observable sequence.
Another area where the Viterbi algorithm is widely used is part-of-speech tagging in natural language processing. The words of a sentence are observable; the parts of speech are hidden states. By finding the most likely hidden state sequence for the word sequence, given the context, we obtain the (most likely) part of speech of each word, which can then be used for further work.
Here's a look at the Viterbi algorithm (Viterbi algorithm)
How can we find the most likely hidden state sequence?
Usually we have a specific HMM, and given an observable state sequence we want to find the hidden state sequence most likely to have generated it.
1. Exhaustive search
The relationship between each hidden state and observable state can be seen in the trellis.
By computing the probabilities of all possible hidden sequences, we can find the most likely one: the hidden sequence that maximizes the joint probability Pr(observation sequence, hidden state sequence). For example, for the observable sequence (dry, damp, soggy), the most likely hidden sequence is found by comparing quantities such as:
Pr(dry, damp, soggy | sunny, sunny, sunny) · Pr(sunny, sunny, sunny), ..., Pr(dry, damp, soggy | rainy, rainy, rainy) · Pr(rainy, rainy, rainy)
This method works, but it is computationally expensive. As with the forward algorithm, we can reduce the complexity by exploiting the time-invariance of the transition probabilities.
2. Use recursion to reduce complexity
Given an observable sequence and an HMM, we can find the most probable hidden sequence recursively. We first define a partial probability δ, the probability of reaching an intermediate state. Below we discuss how to compute the partial probabilities at t = 1 and at t > 1.
Note that these partial probabilities differ from those in the forward algorithm: here a partial probability is the probability of the single most probable path to a state at time t, not the sum over all paths.
1) Partial probability and partial optimal path
Consider the following trellis and the first-order transitions for the observable sequence (dry, damp, soggy).
For each intermediate state and each final state (t = 3) there is a single most probable path. For example, each of the three states at t = 3 has a most probable path, such as:
We call these paths partial optimal paths. Each partial optimal path has a probability, the partial probability δ. Unlike in the forward algorithm, the probability here is that of the most probable path, not the sum over all paths.
We write δ(i, t) for the probability of the most probable of all sequences (paths) ending in state i at time t; the partial optimal path is the path achieving that maximum. Such a probability, and its partial optimal path, exists for every state at every moment.
Finally, by computing the maximum probability and partial optimal path for every state at time t = T, we select the most probable final state, and its partial optimal path is the global optimal path.
2) Computing the partial probabilities at t = 1
At t = 1 there is no path into a state, so we use the initial probability of each state together with the probability of that state emitting the first observation k1 (from the confusion matrix):
δ_1(i) = π(i) · b_ik1
3) Computing the partial probabilities at t > 1
Then we can find the partial probabilities at time t from the partial probabilities at time t-1.
We can compute the probability of every path into state X and pick the most probable one, the partial optimal path. Notice that any path to X must pass through A, B, or C at time t-1, so we can reuse earlier results: the most probable path to X is one of
(sequence of states), ..., A, X
(sequence of states), ..., B, X
(sequence of states), ..., C, X
We only need to find the most probable among the paths ending in A→X, B→X, and C→X.
By the first-order Markov assumption, the occurrence of a state depends only on the state before it, so the probability of the path ending in X depends only on its immediate predecessor:
Pr(most probable path to A) · Pr(X | A) · Pr(observation at t | X)
With this formula we can reuse the results of time t-1 together with the state transition matrix and the confusion matrix:
Generalizing the expression above, we obtain the formula for the maximum partial probability of state i at time t, when the observation is kt:
δ_t(i) = max_j [ δ_t-1(j) · a_ji ] · b_ikt
where a_ji is the probability of moving from state j to state i, and b_ikt is the probability of state i being observed as kt.
4) Back pointer
Consider the trellis again.
Every intermediate and final state has a partial optimal probability δ(i, t). But our goal is the most probable sequence of hidden states, so we also need a way to remember each node on the optimal path.
When computing the partial probabilities at time t we only need the partial probabilities at time t-1, so we just record, for each state, which previous state produced its maximum partial probability, i.e. the state the system most plausibly came from. As shown below:
We use a back pointer φ to record the previous state that produced the maximum partial probability of each state:
φ_t(i) = argmax_j [ δ_t-1(j) · a_ji ]
Here argmax picks the j that maximizes the bracketed expression. Note that this formula involves only the partial probabilities at t-1 and the transition probabilities: the back pointer only answers "where did I come from?", which does not depend on the observation, so the confusion matrix factor is not needed. Globally the procedure looks like this:
5) Advantages
There are two important advantages to decoding an observable state sequence with the Viterbi algorithm:
a) it reduces complexity by recursion, just like the forward algorithm
b) it recovers the optimal hidden sequence from the observable sequence: the last state is the one maximizing δ_T(i), and earlier states are read off backwards through the back pointers, i_t = φ_t+1(i_t+1).
The forward pass itself is a left-to-right computation, each step reusing the previous result, starting from the initial vector π.
3. Supplement
Methods that decide each state left to right from local information can be thrown far off when some part of the sequence is noisy. The Viterbi algorithm instead considers the entire sequence before choosing the most likely final state, then walks back through the back pointers; this makes it good at ignoring isolated noise.
The Viterbi algorithm thus provides an efficient way to compute the hidden sequence from an observable sequence: it uses recursion to reduce computational complexity, and it uses the whole sequence to make its decisions, tolerating noise well.
During the computation the algorithm computes a partial probability for each state at each moment, and uses a back pointer to record the most likely previous state. At the end, the most likely final state becomes the last state of the hidden sequence, and the full sequence is recovered by following the back pointers.
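A minimal sketch of the Viterbi recursion and back-pointer bookkeeping described above (illustrative model numbers):

```python
# delta holds the partial (best-path) probabilities, phi the back pointers.

pi = [1.0, 0.0, 0.0]
A = [[0.500, 0.375, 0.125],
     [0.250, 0.125, 0.625],
     [0.250, 0.375, 0.375]]
B = [[0.60, 0.20, 0.15, 0.05],
     [0.25, 0.25, 0.25, 0.25],
     [0.05, 0.10, 0.35, 0.50]]

def viterbi(obs, pi, A, B):
    N, T = len(pi), len(obs)
    delta = [[0.0] * N for _ in range(T)]
    phi = [[0] * N for _ in range(T)]       # back pointers
    for i in range(N):                      # t = 1: delta_1(i) = pi(i) * b_ik1
        delta[0][i] = pi[i] * B[i][obs[0]]
    for t in range(1, T):                   # t > 1: keep only the best path
        for i in range(N):
            best_j = max(range(N), key=lambda j: delta[t - 1][j] * A[j][i])
            phi[t][i] = best_j
            delta[t][i] = delta[t - 1][best_j] * A[best_j][i] * B[i][obs[t]]
    # pick the most probable final state, then follow the back pointers
    last = max(range(N), key=lambda i: delta[-1][i])
    path = [last]
    for t in range(T - 1, 0, -1):
        path.append(phi[t][path[-1]])
    path.reverse()
    return path, delta[-1][last]

obs = [0, 2, 3]                             # (dry, damp, soggy)
best_path, best_prob = viterbi(obs, pi, A, B)
```

The only structural differences from the forward algorithm are the max in place of the sum and the phi table that remembers where each maximum came from.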
(iii) Finding the most probable HMM based on the observed sequence set.
In many practical situations the HMM cannot be written down directly, and this becomes a learning problem: for a given observable state sequence O, there is no known way to solve exactly for the HMM parameters λ that maximize P(O | λ). Instead one settles for a local optimum, and the forward-backward algorithm (also called the Baum-Welch algorithm) provides such an approximate solution to the HMM learning problem.
The forward-backward algorithm starts from an initial estimate of the HMM parameters, quite possibly a poor guess, and then updates the parameters by evaluating how well they explain the given data and reducing the resulting error. The error on the training data shrinks step by step, an iterative refinement similar in spirit to gradient descent in machine learning.
For each state in the trellis, the forward-backward algorithm computes both the forward probability of reaching that state and the backward probability of generating the remaining observations to the end of the sequence; both can be computed efficiently with the recursions introduced earlier. The current parameter estimates are then adjusted using these intermediate probabilities, and these adjustments form the basis of each iteration.
The forward-backward algorithm is in fact a special case of the EM algorithm: it avoids brute-force computation by using dynamic programming. Jelinek's book "Statistical Methods for Speech Recognition" describes the relationship between the forward-backward algorithm and EM in detail; interested readers can refer to it.
Similarly to the forward variable, we can define a backward variable β_t(i): the probability of the partial observation sequence o_t+1, o_t+2, ..., o_T, given that the hidden state at time t is i:

β_t(i) = Pr(o_t+1, o_t+2, ..., o_T | x_t = i, λ)

As with the forward variable, β_t(i) can be computed efficiently by iteration:

β_T(i) = 1
β_t(i) = Σ_j a_ij · b_jk(t+1) · β_t+1(j), for t = T-1, ..., 1

Combining the two, we find

Pr(O | λ) = Σ_i π_i · b_ik1 · β_1(i)
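The backward recursion can be sketched in the same style as the forward algorithm (illustrative model numbers):

```python
# beta[t][i] is the probability of the remaining observations
# o_{t+1}..o_T given hidden state i at time t.

pi = [1.0, 0.0, 0.0]
A = [[0.500, 0.375, 0.125],
     [0.250, 0.125, 0.625],
     [0.250, 0.375, 0.375]]
B = [[0.60, 0.20, 0.15, 0.05],
     [0.25, 0.25, 0.25, 0.25],
     [0.05, 0.10, 0.35, 0.50]]

def backward(obs, pi, A, B):
    N, T = len(pi), len(obs)
    beta = [[0.0] * N for _ in range(T)]
    for i in range(N):
        beta[T - 1][i] = 1.0                # beta_T(i) = 1
    for t in range(T - 2, -1, -1):          # work backwards through time
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return beta

obs = [0, 2, 3]                             # (dry, damp, soggy)
beta = backward(obs, pi, A, B)
# Pr(O | lambda) recovered from the backward variables:
p_backward = sum(pi[i] * B[i][obs[0]] * beta[0][i] for i in range(len(pi)))
```

Run on the same model, p_backward should agree with the result of the forward algorithm, which is a useful sanity check in practice.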
The forward-backward algorithm is introduced below.
First we define two auxiliary variables in terms of the forward and backward variables described above.
The first is the probability of being in state i at time t and in state j at time t+1, given the observations and the model:

ξ_t(i, j) = Pr(x_t = i, x_t+1 = j | O, λ)

In the trellis, ξ_t(i, j) covers the transition from node (i, t) to node (j, t+1). Expanding the conditional probability, the definition is equivalent to

ξ_t(i, j) = Pr(x_t = i, x_t+1 = j, O | λ) / Pr(O | λ)

and in terms of the forward and backward variables it can be written as

ξ_t(i, j) = α_t(i) · a_ij · b_jk(t+1) · β_t+1(j) / Pr(O | λ)

The second variable is a posterior probability: the probability of being in state i at time t, given the observed state sequence and the HMM:

γ_t(i) = Pr(x_t = i | O, λ)

which in terms of the forward and backward variables is

γ_t(i) = α_t(i) · β_t(i) / Pr(O | λ)

Summing γ_t(i) over time gives the expected number of times state i is visited, or equivalently the expected number of transitions made out of state i (excluding the final moment). Similarly, summing ξ_t(i, j) over time gives the expected number of transitions from state i to state j.
The two variables are related by

γ_t(i) = Σ_j ξ_t(i, j)
Finally, the parameter learning procedure of the forward-backward algorithm: during learning the HMM parameters are updated repeatedly so as to increase P(O | λ). Starting from initial parameters λ = {π, A, B}, we first compute the forward variables α and the backward variables β, then compute ξ and γ by the formulas just introduced, and finally update the HMM parameters with the following three re-estimation formulas:

π̄_i = γ_1(i)
ā_ij = Σ_t ξ_t(i, j) / Σ_t γ_t(i)   (sums over t = 1, ..., T-1)
b̄_ik = Σ_{t: o_t = k} γ_t(i) / Σ_t γ_t(i)

If we write the current model as λ = {π, A, B}, then it is used to evaluate the right-hand sides of the three formulas above, and the left-hand sides are the re-estimated HMM parameters. Baum and his colleagues proved in the 1970s that by iterating these three formulas, continually re-estimating the parameters, we obtain after many iterations a maximum likelihood estimate of the HMM. It is important to note, however, that the estimate found by the forward-backward algorithm is a local optimum.
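A single re-estimation pass can be sketched as follows. All numbers are illustrative starting guesses; a real run would iterate until P(O | λ) stops improving:

```python
# One re-estimation pass of the forward-backward (Baum-Welch) procedure.

pi = [0.6, 0.2, 0.2]
A = [[0.500, 0.375, 0.125],
     [0.250, 0.125, 0.625],
     [0.250, 0.375, 0.375]]
B = [[0.60, 0.20, 0.15, 0.05],
     [0.25, 0.25, 0.25, 0.25],
     [0.05, 0.10, 0.35, 0.50]]
obs = [0, 2, 3]                                  # (dry, damp, soggy)

def forward(obs, pi, A, B):
    N, T = len(pi), len(obs)
    alpha = [[0.0] * N for _ in range(T)]
    for j in range(N):
        alpha[0][j] = pi[j] * B[j][obs[0]]
    for t in range(1, T):
        for j in range(N):
            alpha[t][j] = B[j][obs[t]] * sum(
                alpha[t - 1][i] * A[i][j] for i in range(N))
    return alpha

def backward(obs, pi, A, B):
    N, T = len(pi), len(obs)
    beta = [[0.0] * N for _ in range(T)]
    beta[T - 1] = [1.0] * N
    for t in range(T - 2, -1, -1):
        for i in range(N):
            beta[t][i] = sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                             for j in range(N))
    return beta

def baum_welch_step(obs, pi, A, B):
    N, T, M = len(pi), len(obs), len(B[0])
    alpha, beta = forward(obs, pi, A, B), backward(obs, pi, A, B)
    p_obs = sum(alpha[-1])
    # gamma_t(i) = alpha_t(i) * beta_t(i) / Pr(O | lambda)
    gamma = [[alpha[t][i] * beta[t][i] / p_obs for i in range(N)]
             for t in range(T)]
    # xi_t(i, j) = alpha_t(i) * a_ij * b_j(o_{t+1}) * beta_{t+1}(j) / Pr(O | lambda)
    xi = [[[alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] / p_obs
            for j in range(N)] for i in range(N)] for t in range(T - 1)]
    # the three re-estimation formulas
    new_pi = gamma[0][:]
    new_A = [[sum(xi[t][i][j] for t in range(T - 1)) /
              sum(gamma[t][i] for t in range(T - 1))
              for j in range(N)] for i in range(N)]
    new_B = [[sum(gamma[t][i] for t in range(T) if obs[t] == k) /
              sum(gamma[t][i] for t in range(T))
              for k in range(M)] for i in range(N)]
    return new_pi, new_A, new_B

new_pi, new_A, new_B = baum_welch_step(obs, pi, A, B)
```

Each pass evaluates the right-hand sides of the three re-estimation formulas under the current λ and takes the results as the new λ; EM theory guarantees that P(O | λ) does not decrease from one pass to the next.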
Resources:
1. http://blog.csdn.net/eaglex/article/details/6376826
2. http://en.wikipedia.org/wiki/Markov_chain
3. http://en.wikipedia.org/wiki/Hidden_Markov_model
4. Lawrence R. Rabiner, "A Tutorial on Hidden Markov Models and Selected Applications in Speech Recognition," Proceedings of the IEEE, 77(2), pp. 257–286, February 1989.
5. L. R. Rabiner and B. H. Juang, "An Introduction to Hidden Markov Models," IEEE ASSP Magazine, vol. 3, no. 1, pp. 4–16, 1986.
6. http://jedlik.phy.bme.hu/~gerjanos/HMM/node2.html
7. http://www.cs.brown.edu/research/ai/dynamics/tutorial/Documents/HiddenMarkovModels.html
8. Liu Qun, Introduction to Hidden Markov Models.
Reposted from: Hidden Markov Model (HMM) guide