Hidden Markov Model (HMM) and Its Extensions


Please credit the original address when reprinting: http://blog.csdn.net/xinzhangyanxiang/article/details/8522078

When studying probability, everyone learns about Markov models, and I found them very interesting at the time. Later, in The Beauty of Mathematics, I saw the hidden Markov model applied to natural language processing. Seeing that hidden Markov models have so many applications and achieve such good results, I found it remarkable and studied them in depth; this article is my summary.

Markov process

A Markov process can be seen as an automaton that jumps between states with certain probabilities.

Consider a system that at each moment is in one of N states, with the state set {S1, S2, S3, ..., SN}. We use q1, q2, q3, ..., qT to denote the state of the system at times t = 1, 2, 3, ..., T. At t = 1, the state of the system is determined by an initial probability distribution π, where π(Si) denotes the probability that the system is in state Si at t = 1.

The Markov model makes two assumptions:

1. The state of the system at time t depends only on its state at time t-1 (also known as the no-after-effect assumption);

2. The state transition probabilities are independent of time (also known as the homogeneity or time-homogeneity assumption).

The first assumption can be expressed by the following formula:

P(qt = sj | qt-1 = si, qt-2 = sk, ...) = P(qt = sj | qt-1 = si)

where t is any time greater than 1 and sk is any state.

The second assumption can be expressed by the following formula:

P(qt = sj | qt-1 = si) = P(qk = sj | qk-1 = si)

where k is any time.

The following figure is a sample diagram of a Markov process:


The state transition probabilities can be represented by a matrix A whose number of rows and columns equals the number of states, where aij denotes P(qt = sj | qt-1 = si).
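To make the transition matrix concrete, here is a minimal sketch in Python of a Markov chain over the three weather states used later in this article; the transition probabilities and initial distribution are made-up illustrative values, not taken from the original post.

```python
import random

# Hypothetical weather Markov chain; all probabilities are illustrative only.
states = ["rainy", "cloudy", "sunny"]

# A[i][j] = P(next state = states[j] | current state = states[i]); each row sums to 1.
A = [
    [0.5, 0.3, 0.2],   # from rainy
    [0.3, 0.4, 0.3],   # from cloudy
    [0.2, 0.3, 0.5],   # from sunny
]

pi = [0.4, 0.3, 0.3]   # assumed initial distribution at t = 1

def sample_chain(T):
    """Sample a weather sequence of length T from the Markov chain."""
    i = random.choices(range(len(states)), weights=pi)[0]
    seq = [i]
    for _ in range(T - 1):
        i = random.choices(range(len(states)), weights=A[i])[0]
        seq.append(i)
    return [states[j] for j in seq]

print(sample_chain(7))   # e.g. one week of weather
```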

Hidden Markov process

Compared with a Markov process, a hidden Markov model is a doubly stochastic process: there is not only a random process of transitions between states, but also a random process from states to outputs, as shown in the following figure:

This picture is taken from elsewhere, so its notation may differ from what I used above, but it should still be easy to follow.

The graph has two rows: the top row is the Markov transition process, and the bottom row is the output, i.e., the values we can observe. We call the states in the top row's Markov transition process the hidden states, and the observed values the observation states. The set of observation states is denoted O = {o1, o2, o3, ..., oM}.

Accordingly, the hidden Markov model has one more assumption than the Markov model: the output depends only on the current state, which can be expressed by the following formula:

P(o1, o2, ..., ot | s1, s2, ..., st) = P(o1 | s1) · P(o2 | s2) · ... · P(ot | st)

where o1, o2, ..., ot is the sequence of observation states from time 1 to time t, and s1, s2, ..., st is the corresponding hidden state sequence.

This assumption is also known as the output independence assumption.

An example

Let me give an everyday example to make this easier to understand. Suppose I do different things with different probabilities depending on the weather: the set of weather states is {rainy, cloudy, sunny}, and the set of activities is {stay home, study, play}. Suppose the transition probabilities and output probabilities are already known, i.e., P(weather A | weather B) and P(activity A | weather A) are given, and suppose I do exactly one of these activities each day. Then we can ask several questions:

1. If the weather over a week is rainy → sunny → cloudy → rainy → cloudy → sunny → cloudy, what is the probability that this week I study → stay home → play → study → play → stay home → study?

2. If this week I study → stay home → play → study → play → stay home → study, what is the probability of this activity sequence when the weather is unknown?

3. If the weather over a week is rainy → sunny → cloudy → rainy → cloudy → sunny → cloudy, what is the most likely activity sequence for this week?

4. If this week I study → stay home → play → study → play → stay home → study, what is the most likely weather sequence for this week?

As for the first question, I think most readers will quickly see how to compute it; if not, the answer is at the end of this article.

Basic elements and the three problems of the hidden Markov model

To summarize, we can list the basic elements of a hidden Markov model, namely a five-tuple {S, N, A, B, π}:

S: the set of hidden states;

N: the set of observation states;

A: the transition probability matrix between hidden states;

B: the output (emission) probability matrix, i.e., the probabilities of hidden states producing observation states;

π: the initial probability distribution over the hidden states;

Among them, A, B, and π are called the parameters of the hidden Markov model, denoted by λ.
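As a concrete (and entirely hypothetical) instance of this five-tuple, here is a sketch of the weather example in Python; the probability values are invented purely for illustration and are reused by the later code sketches.

```python
# Hypothetical HMM for the weather example; all probability values are illustrative only.
S = ["rainy", "cloudy", "sunny"]        # hidden states
N = ["study", "stay_home", "play"]      # observation states

# A[i][j] = P(hidden state j at time t | hidden state i at time t-1)
A = [
    [0.5, 0.3, 0.2],
    [0.3, 0.4, 0.3],
    [0.2, 0.3, 0.5],
]

# B[i][k] = P(observation k | hidden state i)
B = [
    [0.6, 0.3, 0.1],   # rainy:  mostly study
    [0.3, 0.4, 0.3],   # cloudy
    [0.2, 0.2, 0.6],   # sunny:  mostly play
]

# pi[i] = P(hidden state i at time t = 1)
pi = [0.4, 0.3, 0.3]

# lambda = (A, B, pi), together with S and N, fully specifies the model.
```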

The example problems above correspond to two of the three basic problems of hidden Markov models. For brevity, the hidden Markov model will be written HMM (Hidden Markov Model).

The three basic problems of an HMM are:

1. Given the model (the five-tuple), compute the probability of an observation sequence O (example problem 2);

2. Given the model and an observation sequence O, find the most probable hidden state sequence (example problem 4);

3. Given an observation sequence O, adjust the parameters of the HMM so that the probability of the observation sequence is maximized.

Forward algorithm

For the first basic problem, the quantity to compute is:

P(O | λ) = Σ over all hidden state sequences S of P(O | S, λ) · P(S | λ)
That is, for the observation sequence O, we enumerate all possible hidden state sequences S, compute the probability that S produces O under the given model (which is exactly example problem 1), and then sum these probabilities.

Intuitively, if the sequence O has length T and the model has N hidden states, there are N^T possible hidden state sequences, so the computational complexity is an enormous O(N^T); this brute-force approach is far too slow.
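For a tiny example the enumeration is still feasible; the sketch below (using the hypothetical weather HMM from the earlier sketch, so the numbers are illustrative only) sums P(O, S) over all N^T hidden sequences, exactly as the formula above prescribes.

```python
from itertools import product

# Hypothetical weather HMM (same illustrative values as before).
A  = [[0.5, 0.3, 0.2], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]]
B  = [[0.6, 0.3, 0.1], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]]
pi = [0.4, 0.3, 0.3]

def brute_force_prob(obs):
    """P(O | lambda) by summing over all N**T hidden state sequences."""
    n, T = len(A), len(obs)
    total = 0.0
    for seq in product(range(n), repeat=T):          # N**T sequences
        p = pi[seq[0]] * B[seq[0]][obs[0]]
        for t in range(1, T):
            p *= A[seq[t - 1]][seq[t]] * B[seq[t]][obs[t]]
        total += p
    return total

# Observations: study -> stay_home -> play -> study -> play -> stay_home -> study
print(brute_force_prob([0, 1, 2, 0, 2, 1, 0]))
```

The forward algorithm sketched later in this section produces the same number with far less work.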

The solution is dynamic programming.

Suppose the observation sequence is o1, o2, o3, ..., ot. At time i (1 < i ≤ t), define C(i, sk) as the probability of generating the sequence o1, o2, ..., oi with qi = sk:

C(i, sk) = P(o1, o2, ..., oi, qi = sk | λ)

where sk is any hidden state value.

The recursion for C(i+1, sr) is:

C(i+1, sr) = [ Σ over k of C(i, sk) · a(sk, sr) ] · b(sr, oi+1)

where sr is any hidden state value, a is the transition probability, and b is the probability of a hidden state producing an observation state. To make this easier to understand, look at the picture:


C(3, rain) covers all state combinations at t = 1 and t = 2, and is itself a subproblem of C(4, rain), C(4, cloudy), and C(4, sunny); C(3, cloudy) and C(3, sunny) are computed in the same way. The recursion for C(i+1, sr) can thus be read off the figure:


From the figure: C(4, cloudy) = [ C(3, rain) · a(rain, cloudy) + C(3, cloudy) · a(cloudy, cloudy) + C(3, sunny) · a(sunny, cloudy) ] · b(cloudy, study).

With the figure, the algorithm should be intuitive. It is called the forward algorithm; there is also a backward algorithm, which is the same idea run in reverse, again by dynamic programming, so I will not repeat it here; interested readers can consult the references. I have also not explained how the probabilities are initialized; that too can be found in the references.
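Here is a minimal sketch of the forward recursion just described, in the C(i, s) notation of the text. It reuses the hypothetical weather HMM (illustrative values only), and the i = 1 column is initialized as π(s) · b(s, o1), the standard choice (the article itself leaves initialization to the references).

```python
# Hypothetical weather HMM (same illustrative values as before).
A  = [[0.5, 0.3, 0.2], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]]
B  = [[0.6, 0.3, 0.1], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]]
pi = [0.4, 0.3, 0.3]

def forward(obs):
    """P(O | lambda) via the forward algorithm: O(T * N^2) instead of O(N^T)."""
    n = len(A)
    # Initialization: C(1, s) = pi(s) * b(s, o1)
    C = [pi[s] * B[s][obs[0]] for s in range(n)]
    # Recursion: C(i+1, r) = (sum_k C(i, k) * a(k, r)) * b(r, o_{i+1})
    for o in obs[1:]:
        C = [sum(C[k] * A[k][r] for k in range(n)) * B[r][o] for r in range(n)]
    # Termination: P(O | lambda) = sum_s C(T, s)
    return sum(C)

# Same activity sequence as in the brute-force sketch; the results agree.
print(forward([0, 1, 2, 0, 2, 1, 0]))
```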

Viterbi algorithm

Now that the first basic problem of the HMM is solved, let us solve the second, known as decoding. Again, the brute-force approach is to compute the probability of every hidden state sequence and take the sequence with the maximum probability; as with problem one, the complexity is O(N^T).

So what approach should we use?

Dynamic programming, of course. Suppose the observation sequence is o1, o2, o3, ..., ot. At time i (1 < i ≤ t), define D(i, sk) as the maximum probability of observing o1, o2, ..., oi over all hidden state sequences ending with qi = sk:

D(i, sk) = max over s1, s2, ..., si-1 of P(o1, o2, ..., oi, q1 = s1, ..., qi-1 = si-1, qi = sk | λ)

The optimal s1, s2, ..., si-1 are also obtained at this point, because they form the subproblems.

Can you see the difference between this formula and the forward algorithm above? One sums over the subproblems; the other takes their maximum.

Of course, since for this problem we want the hidden state sequence that maximizes the probability of the observation sequence, not just the maximum probability itself, the algorithm must also record the predecessor hidden state at each step. For example, if the maximum for D(4, cloudy) comes from the subproblem D(3, rain), then the node D(4, cloudy) records rain as its predecessor state.

Because this algorithm differs from the forward algorithm only in its recursion formula, the same figure applies, and you can refer to the diagram above. Likewise, I omit the initialization; see the references.

This algorithm is called the Viterbi algorithm, after Andrew Viterbi, who invented it in the 1960s. Today it holds no mystery; a problem can look very simple once it has been solved, so whether in life or in research, do not be intimidated: only by daring to fight do you learn how easy the battle is.

I believe that once you understand the forward algorithm and the Viterbi algorithm, you can solve example problems 2 and 4. Example problem 3 is in fact similar to the Viterbi problem, except that the search for the best solution is over the observation state space.
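The sketch below is a minimal Viterbi implementation in the D(i, s) notation used above, again with the hypothetical weather HMM (values are illustrative only) and the assumed initialization D(1, s) = π(s) · b(s, o1); it records the predecessor state at every step and backtracks at the end, as described.

```python
# Hypothetical weather HMM (same illustrative values as before).
S  = ["rainy", "cloudy", "sunny"]
A  = [[0.5, 0.3, 0.2], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]]
B  = [[0.6, 0.3, 0.1], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]]
pi = [0.4, 0.3, 0.3]

def viterbi(obs):
    """Most probable hidden state sequence for obs, in the D(i, s) notation."""
    n = len(A)
    D = [pi[s] * B[s][obs[0]] for s in range(n)]        # D(1, s)
    back = []                                           # predecessor records
    for o in obs[1:]:
        prev = [max(range(n), key=lambda k: D[k] * A[k][r]) for r in range(n)]
        D = [D[prev[r]] * A[prev[r]][r] * B[r][o] for r in range(n)]
        back.append(prev)
    # Backtrack from the best final state.
    s = max(range(n), key=lambda k: D[k])
    path = [s]
    for prev in reversed(back):
        s = prev[s]
        path.append(s)
    return [S[i] for i in reversed(path)], max(D)

# Example problem 4: most likely weather for
# study -> stay_home -> play -> study -> play -> stay_home -> study
print(viterbi([0, 1, 2, 0, 2, 1, 0]))
```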

As for basic problem three, I have not understood it thoroughly enough, so I will not embarrass myself here.

Applications

Having said so much, what is the HMM actually used for?

HMM was first applied in information theory and was later applied to natural language processing, image and speech recognition, and other areas. Two examples illustrate its applications: whole-sentence decoding in a pinyin input method, and speech recognition. The figures below illustrate both:

An input method takes the pinyin as the observation states and the Chinese characters to be produced as the hidden states, which turns whole-sentence decoding in the input method into Viterbi decoding. The transition probabilities come from a bigram language model, and the output probabilities are the probabilities that each character's pronunciation corresponds to the different pinyin.

Speech recognition is a similar problem: the transition probabilities are still a bigram language model, while the output probabilities come from the acoustic model linking speech to Chinese characters.

Extensions

Although the HMM already solves these problems quite well, academia keeps striving to improve it and always looks for ways to make it better. So there are various extensions of the HMM; here are two of them.

One extension relaxes the time-homogeneity assumption, that is, it lets the state transition probabilities depend on time. This has practical significance for input methods. For example, the pronoun 'ta' (he, it, she), the noun 'ta' (tower), and the verb 'ta' (step on, collapse) generally appear in different positions: pronoun subjects usually appear at the beginning of a sentence or clause. We often say "who is he" and rarely say "who is the tower" (barring the odd person whose name is the single character for tower). So when decoding the pinyin string 'ta' 'shi' 'shui', the first syllable 'ta' should get a higher probability of being he/it/she and a lower probability of being tower.

For this, the paper "A non-time-homogeneous hidden Markov model and its application in pinyin-to-character conversion" describes one implementation: when building the statistical language model, record the position of each word within its sentence and compute the word's average position; when applying the language model during pinyin-to-character conversion, reweight the language-model probability with a function of the word's current position and its average position. The formula is as follows:

where PML(w1 | w2) is the maximum-likelihood estimate of the transition probability and f(·) is the weighting function.

The other way to extend the HMM is to relax the no-after-effect assumption. Originally a state is assumed to depend only on the previous state, which allows only a bigram language model; if instead a state is assumed to depend on the previous two (or more) states, higher-order language models can be used. Consider just the previous two states: this allows a trigram model, but the Viterbi computation becomes a problem, because the probability of the state at time t now depends not only on the state at time t-1 but also on the state at time t-2.

The usual way to make the Viterbi algorithm work with a trigram model (this is also called a second-order HMM) is to convert the second-order HMM into a first-order HMM by merging pairs of states.

Merging the second-order HMM

To see how the second-order HMM is merged, look at the following figure:

For simplicity, I reduced the hidden states to two: rainy and sunny. As the diagram shows, for t ≥ 2 each node holds several sub-nodes, one for each possible state at the previous time step, and the value of a sub-node is the maximum probability over all state sequences in which the current state is sr and the state at the previous moment is sk. For example, the highlighted sub-node's value is the maximum probability of producing study → stay home → play with the state at time 3 being rainy and the state at time 2 being rainy. (Note that a node represents a state at time i, and a sub-node represents one possible previous state held inside that node, such as the green node in the figure.)

At time i (i > 2), the value of each sub-node is

D(i, sr, sk) = max over s1, ..., si-2 of P(o1, ..., oi, q1 = s1, ..., qi-2 = si-2, qi-1 = sk, qi = sr | λ)

and for time i+1 the value of a sub-node is computed recursively as

D(i+1, sr, sk) = [ max over sq of D(i, sk, sq) · P(sr | sq, sk) ] · b(sr, oi+1)
Finally, find the largest sub-node at time T and backtrack from it.
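The following rough sketch is my own illustration of the merging trick, not code from the original post: the hidden states become pairs (state at t-1, state at t), the merged transition matrix only allows pairs whose shared component matches, and a pair emits according to its current state. The second-order probabilities A2 and emissions B are hypothetical numbers; once the merged matrices are built, the ordinary first-order Viterbi algorithm applies.

```python
from itertools import product

# Two hidden states, as in the figure; all probability values are hypothetical.
S = ["rainy", "sunny"]
# A2[q][k][r] = P(state r at time t | state k at t-1, state q at t-2)
A2 = [[[0.8, 0.2], [0.5, 0.5]],
      [[0.6, 0.4], [0.3, 0.7]]]
# B[r][o] = P(observation o | current state r), over {study, stay_home, play}
B = [[0.5, 0.4, 0.1],
     [0.2, 0.2, 0.6]]

pairs = list(product(range(len(S)), repeat=2))   # merged states (prev, cur)

# Merged transition: (q, k) -> (k2, r) is allowed only when k2 == k,
# in which case its probability is the second-order P(r | q, k).
A_merged = [[A2[q][k][r] if k2 == k else 0.0 for (k2, r) in pairs]
            for (q, k) in pairs]

# Merged emission: the pair (q, k) emits according to its current state k.
B_merged = [B[k] for (q, k) in pairs]

for p, row in zip(pairs, A_merged):
    print([S[i] for i in p], [round(x, 2) for x in row])
```

Running a first-order Viterbi over (A_merged, B_merged) then yields the most likely pair sequence, from which the original state sequence can be read off directly.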

Answer to example question 1

The answer to the first question in the example above is:

P = P(rainy) · P(sunny | rainy) · ... · P(cloudy | sunny) · P(study | rainy) · P(stay home | sunny) · ... · P(study | cloudy)

The first factor, P(rainy), comes from the initial probability distribution (recall the probability distribution of the Markov chain at time t = 1).
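As a worked illustration, with the same hypothetical probability values used in the earlier sketches (so the resulting number is meaningless beyond showing the mechanics), the answer to example problem 1 can be computed like this:

```python
# Hypothetical weather HMM (illustrative values only).
S = {"rainy": 0, "cloudy": 1, "sunny": 2}
O = {"study": 0, "stay_home": 1, "play": 2}
A  = [[0.5, 0.3, 0.2], [0.3, 0.4, 0.3], [0.2, 0.3, 0.5]]
B  = [[0.6, 0.3, 0.1], [0.3, 0.4, 0.3], [0.2, 0.2, 0.6]]
pi = [0.4, 0.3, 0.3]

weather    = ["rainy", "sunny", "cloudy", "rainy", "cloudy", "sunny", "cloudy"]
activities = ["study", "stay_home", "play", "study", "play", "stay_home", "study"]

# P = P(rainy) * P(sunny|rainy) * ... * P(cloudy|sunny)
#       * P(study|rainy) * P(stay_home|sunny) * ... * P(study|cloudy)
p = pi[S[weather[0]]]
for prev, cur in zip(weather, weather[1:]):
    p *= A[S[prev]][S[cur]]          # weather transition factors
for w, act in zip(weather, activities):
    p *= B[S[w]][O[act]]             # activity output factors
print(p)
```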


References:

A non-time-homogeneous hidden Markov model and its application in pinyin-to-character conversion

Research and application of statistical language models

Application of a language model combining statistics and rules in Chinese input methods

Research and implementation of a whole-sentence input algorithm based on Markov chains

The references and a PDF of this article can be downloaded from the original post.
