A detailed description of the hidden Markov model


When reprinting, please cite the original address: http://blog.csdn.net/xinzhangyanxiang/article/details/8522078

When studying probability we all learned about the Markov model, which I found quite interesting. Later, in The Beauty of Mathematics, I saw the application of the hidden Markov model in natural language processing; seeing that it has so many applications and achieves such good results felt almost magical, so I studied it in some depth and summarize it here.

Markov process

A Markov process can be viewed as an automaton that jumps between its states with certain probabilities.

Consider a system that at each moment can be in one of N states, with state set {S_1, S_2, S_3, ..., S_N}. We use q_1, q_2, q_3, ..., q_T to denote the state of the system at times t = 1, 2, 3, ..., T. At t = 1, the state of the system depends on an initial probability distribution π, where π(S_N) denotes the probability that the system is in state S_N at t = 1.

Markov models have two assumptions:

1. The state of the system at time t depends only on its state at time t-1 (also known as the no-after-effect, or memoryless, property);

2. The state transition probabilities do not depend on time (also known as homogeneity, or time-homogeneity).

The first assumption can be expressed by the following formula:

P(q_t = S_j | q_{t-1} = S_i, q_{t-2} = S_k, ...) = P(q_t = S_j | q_{t-1} = S_i)

where t is any time greater than 1 and S_k is any state.

The second assumption can be expressed by the following formula:

P(q_t = S_j | q_{t-1} = S_i) = P(q_k = S_j | q_{k-1} = S_i)

where k is any time.

The following diagram shows an example of a Markov process:


The state transition probabilities can be collected into an N x N matrix A, whose side length is the number of states and where a_ij = P(q_t = S_j | q_{t-1} = S_i).
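
To make the two assumptions concrete, here is a minimal Python sketch of a Markov chain as an initial distribution plus a transition matrix, and of scoring a state sequence under it. The weather states anticipate the example used later, and every number is invented for illustration, since the post gives no concrete values.

```python
# Hypothetical numbers for illustration only; the post never gives concrete values.
states = ["rain", "cloudy", "sunny"]

pi = {"rain": 0.3, "cloudy": 0.4, "sunny": 0.3}   # initial distribution at t = 1

A = {  # A[s_prev][s_next] = P(q_t = s_next | q_t-1 = s_prev), independent of t (homogeneity)
    "rain":   {"rain": 0.5, "cloudy": 0.3, "sunny": 0.2},
    "cloudy": {"rain": 0.3, "cloudy": 0.4, "sunny": 0.3},
    "sunny":  {"rain": 0.2, "cloudy": 0.3, "sunny": 0.5},
}

def markov_sequence_probability(seq):
    """P(q_1, ..., q_T) = pi(q_1) * prod_t A[q_(t-1)][q_t], using both assumptions."""
    p = pi[seq[0]]
    for prev, cur in zip(seq, seq[1:]):
        p *= A[prev][cur]
    return p

print(markov_sequence_probability(["rain", "sunny", "cloudy"]))  # 0.3 * 0.2 * 0.3 = 0.018
```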

Hidden Markov process

Compared with the Markov model, the hidden Markov model is a doubly stochastic process: not only are the transitions between states random, but the mapping from states to outputs is also a stochastic process, as shown in the figure below:

This figure comes from elsewhere, so its notation may differ from what I used above, but I believe you can follow it.

The figure has two rows. The top row is the Markov transition process, and the bottom row is the output, i.e., the values we can observe. We call the states of the Markov transition process in the top row the hidden states, and the observed values in the bottom row the observation states; the set of observation states is written O = {O_1, O_2, O_3, ..., O_M}.

Correspondingly, the hidden Markov model adds one more assumption on top of the Markov model, namely that the output depends only on the current state, which can be expressed by the following formula:

P(o_1, o_2, ..., o_t | s_1, s_2, ..., s_t) = P(o_1 | s_1) * P(o_2 | s_2) * ... * P(o_t | s_t)

where o_1, o_2, ..., o_t is the observation sequence from time 1 to time t, and s_1, s_2, ..., s_t is the hidden state sequence.

In addition, this assumption is also called the output independence assumption.

An example

Let me give an everyday example to lead into what follows and make it easier to understand. Suppose I do different things depending on the weather: the set of weather conditions is {rain, cloudy, sunny} and the set of activities is {stay home, study, play}. Assume the transition probabilities and the output probabilities are already known, i.e., P(weather_t | weather_{t-1}) and P(activity | weather) are given. Then there are a few questions to ask (note: assume I do exactly one of these activities per day):

1. If the weather over the week was rain → sunny → ... → cloudy, what is the probability that my activities that week were study → stay home → ... → study?

2. If my activities this week were study → stay home → ..., what is the probability of this activity sequence when the weather is unknown?

3. If the weather over the week was cloudy → cloudy → rain → cloudy → sunny → rain → ..., what is the most likely sequence of activities for the week?

4. If my activities this week were study → ..., what is the most likely weather sequence for the week?

For the first question, I think everyone can quickly see how to compute it; if not, the answer is at the end of this article.

Basic elements and the three basic problems of the hidden Markov model

To sum up, we can now list the basic elements of the hidden Markov model, a five-tuple {S, N, A, B, π}:

S: the set of hidden states;

N: the set of observation states;

A: the transition probability matrix between hidden states;

B: the output (emission) matrix, i.e., the probabilities from hidden states to observation states;

π: the initial probability distribution (the initial distribution over the hidden states).

Among them, A, B, and π are called the parameters of the hidden Markov model, denoted λ.
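
As a minimal sketch, the five-tuple for the weather/activity example could be laid out as plain Python structures like this; every probability is invented for illustration, since the post never gives concrete values.

```python
# Hidden state set S, observation set N, and the parameters lambda = (A, B, pi).
# All probabilities are invented for illustration.
hidden_states = ["rain", "cloudy", "sunny"]            # S
observations  = ["stay home", "study", "play"]         # N

A = {  # transition probabilities between hidden states
    "rain":   {"rain": 0.5, "cloudy": 0.3, "sunny": 0.2},
    "cloudy": {"rain": 0.3, "cloudy": 0.4, "sunny": 0.3},
    "sunny":  {"rain": 0.2, "cloudy": 0.3, "sunny": 0.5},
}
B = {  # output (emission) probabilities: hidden state -> observation
    "rain":   {"stay home": 0.6, "study": 0.3, "play": 0.1},
    "cloudy": {"stay home": 0.3, "study": 0.4, "play": 0.3},
    "sunny":  {"stay home": 0.1, "study": 0.3, "play": 0.6},
}
pi = {"rain": 0.3, "cloudy": 0.4, "sunny": 0.3}        # initial distribution over hidden states

hmm = {"states": hidden_states, "observations": observations, "A": A, "B": B, "pi": pi}
```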

The questions above already hint at two of the three basic problems of the hidden Markov model. For brevity, the hidden Markov model is abbreviated as HMM (Hidden Markov Model) from here on.

The three basic problems of the HMM are:

1. Given the model (the five-tuple), compute the probability of an observation sequence O (example question 2).

2. Given the model and an observation sequence O, find the most likely hidden state sequence (example question 4).

3. Given an observation sequence O, adjust the HMM parameters so that the probability of the observation sequence is maximized.

The forward algorithm

For the first basic problem, the quantity to compute is:

P(O | λ) = Σ_S P(O | S, λ) * P(S | λ)

That is, for the observation sequence O we enumerate every possible hidden state sequence S, compute the probability that the model follows S and outputs O (which is just example question 1), and then sum these probabilities.

Intuitively, if the observation sequence O has length T and the model has N hidden states, then there are N^T possible hidden state sequences, so the computational cost is extremely high, on the order of O(N^T); the brute-force algorithm is too slow.
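
Here is a brute-force sketch of that sum, enumerating all N^T hidden sequences with the same invented parameters as above, just to make the cost visible.

```python
from itertools import product

# Made-up parameters (same illustrative values as in the earlier sketch).
pi = {"rain": 0.3, "cloudy": 0.4, "sunny": 0.3}
A  = {"rain":   {"rain": 0.5, "cloudy": 0.3, "sunny": 0.2},
      "cloudy": {"rain": 0.3, "cloudy": 0.4, "sunny": 0.3},
      "sunny":  {"rain": 0.2, "cloudy": 0.3, "sunny": 0.5}}
B  = {"rain":   {"stay home": 0.6, "study": 0.3, "play": 0.1},
      "cloudy": {"stay home": 0.3, "study": 0.4, "play": 0.3},
      "sunny":  {"stay home": 0.1, "study": 0.3, "play": 0.6}}

def brute_force_observation_probability(obs):
    """P(O) = sum over all N**T hidden sequences S of P(S) * P(O | S)."""
    total = 0.0
    for hidden_seq in product(A.keys(), repeat=len(obs)):   # N**T sequences
        p = pi[hidden_seq[0]] * B[hidden_seq[0]][obs[0]]
        for t in range(1, len(obs)):
            p *= A[hidden_seq[t - 1]][hidden_seq[t]] * B[hidden_seq[t]][obs[t]]
        total += p
    return total

print(brute_force_observation_probability(["study", "stay home", "play"]))
```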

The remedy is dynamic programming.

Suppose the observation sequence is o_1, o_2, o_3, ..., o_T. At time i (1 < i <= T), define C(i, S_k) as the probability of generating the sequence o_1, o_2, ..., o_i with q_i = S_k:

C(i, S_k) = P(o_1, o_2, ..., o_i, q_i = S_k | λ)

where S_k is an arbitrary hidden state.

Then C(i+1, S_r) is computed as:

C(i+1, S_r) = [ Σ_k C(i, S_k) * a(S_k, S_r) ] * b(S_r, o_{i+1})

where S_r is any hidden state, a is the transition probability, and b is the probability of an observation given a hidden state. To make this easier to understand, look at the figure:


C(3, rain) accounts for all possible state combinations at t = 1 and t = 2, and it is a sub-problem of C(4, rain), C(4, cloudy), and C(4, sunny); C(3, cloudy) and C(3, sunny) are computed in the same way. Applying the recursion for C(i+1, S_r) at t = 4, the figure gives:

C(4, cloudy) = [ C(3, rain) * a(rain, cloudy) + C(3, cloudy) * a(cloudy, cloudy) + C(3, sunny) * a(sunny, cloudy) ] * b(cloudy, study)
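
Here is a minimal sketch of the forward recursion with the same invented parameters. The initialization, which the post defers to the references, is taken here as the common convention C(1, s) = π(s) * b(s, o_1).

```python
# Made-up parameters (same illustrative values as before).
pi = {"rain": 0.3, "cloudy": 0.4, "sunny": 0.3}
A  = {"rain":   {"rain": 0.5, "cloudy": 0.3, "sunny": 0.2},
      "cloudy": {"rain": 0.3, "cloudy": 0.4, "sunny": 0.3},
      "sunny":  {"rain": 0.2, "cloudy": 0.3, "sunny": 0.5}}
B  = {"rain":   {"stay home": 0.6, "study": 0.3, "play": 0.1},
      "cloudy": {"stay home": 0.3, "study": 0.4, "play": 0.3},
      "sunny":  {"stay home": 0.1, "study": 0.3, "play": 0.6}}

def forward(obs):
    """Return P(O) via the recursion C(i+1, r) = [sum_k C(i, k) * a(k, r)] * b(r, o_(i+1))."""
    # Initialization (one common convention): C(1, s) = pi(s) * b(s, o_1).
    C = {s: pi[s] * B[s][obs[0]] for s in pi}
    for o in obs[1:]:
        C = {r: sum(C[k] * A[k][r] for k in C) * B[r][o] for r in pi}
    return sum(C.values())           # sum over the final hidden state

print(forward(["study", "stay home", "play"]))   # matches the brute-force sum above
```

For the same observation sequence this returns the same value as the brute-force enumeration above, but in O(T * N^2) time instead of O(N^T).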

Through the figure you should be able to understand the algorithm intuitively. It is called the forward algorithm, and there is also a backward algorithm; yes, the backward algorithm is this algorithm run in the reverse direction, also by dynamic programming, and I will not repeat it here; interested readers can consult the references. Also, I have not explained how the probabilities are initialized; you can check the references for that as well.

The Viterbi algorithm

Now the first basic problem of the HMM is solved; next comes the second problem, also called the decoding problem. As before, the brute-force approach is to compute the probability of every possible hidden state sequence and then pick the sequence with the largest probability. Just like the brute-force solution to problem one, its complexity is O(N^T).

So what should we do instead?

Dynamic programming again, without a doubt. Suppose the observation sequence is o_1, o_2, o_3, ..., o_T. At time i (1 < i <= T), define D(i, S_k) as the maximum probability of generating the observations o_1, o_2, ..., o_i with q_i = S_k:

D(i, S_k) = max over s_1, ..., s_{i-1} of P(o_1, ..., o_i, q_1 = s_1, ..., q_{i-1} = s_{i-1}, q_i = S_k | λ)

The maximizing states s_1, s_2, ..., s_{i-1} can also be recovered at this point, because they come from the sub-problems.

Can you spot the difference between this formula and the forward algorithm above? One takes the sum over the sub-problems; the other takes the maximum over the sub-problems.

Of course, since this problem asks for the hidden state sequence that maximizes the probability of the observation sequence, rather than for the maximum probability itself, the algorithm must record the preceding hidden state as it runs. For example, if the maximum of D(4, cloudy) is attained through the sub-problem D(3, rain), then the node D(4, cloudy) needs to record that its predecessor state is rain.

Since this algorithm differs from the forward algorithm only in the computation formula, the same figure applies, so you can refer to the figure of the previous algorithm. Likewise, initialization is not covered here; see the references.
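
Here is a minimal sketch of the Viterbi recursion, identical to the forward sketch except that the sum becomes a max and each step records the best predecessor for backtracking; the parameters and the initialization convention are the same invented ones as before.

```python
# Made-up parameters (same illustrative values as before).
pi = {"rain": 0.3, "cloudy": 0.4, "sunny": 0.3}
A  = {"rain":   {"rain": 0.5, "cloudy": 0.3, "sunny": 0.2},
      "cloudy": {"rain": 0.3, "cloudy": 0.4, "sunny": 0.3},
      "sunny":  {"rain": 0.2, "cloudy": 0.3, "sunny": 0.5}}
B  = {"rain":   {"stay home": 0.6, "study": 0.3, "play": 0.1},
      "cloudy": {"stay home": 0.3, "study": 0.4, "play": 0.3},
      "sunny":  {"stay home": 0.1, "study": 0.3, "play": 0.6}}

def viterbi(obs):
    """Return the most likely hidden state sequence for the observations, and its probability."""
    D = {s: pi[s] * B[s][obs[0]] for s in pi}     # D(1, s)
    backpointers = []                             # backpointers[i][r] = best predecessor of r
    for o in obs[1:]:
        prev = {r: max(D, key=lambda k: D[k] * A[k][r]) for r in pi}
        D = {r: D[prev[r]] * A[prev[r]][r] * B[r][o] for r in pi}
        backpointers.append(prev)
    # Backtrack from the best final state.
    best = max(D, key=D.get)
    path = [best]
    for prev in reversed(backpointers):
        path.append(prev[path[-1]])
    return list(reversed(path)), D[best]

print(viterbi(["study", "stay home", "play"]))    # e.g. answers example question 4
```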

This algorithm is the Viterbi algorithm. Viterbi is the name of the gentleman who invented it decades ago, yet to modern eyes it no longer looks mysterious at all. Many problems turn out to be simple once solved, so whether in life or in research, do not be intimidated: take on the fight first, and only then will you know how easy it was.

I believe that once you understand the forward algorithm and the Viterbi algorithm, you can solve example questions 2 and 4. As for example question 3, it is essentially the Viterbi algorithm again, except that the best solution is searched for in the space of observation states.

As for basic problem three, I have not understood it thoroughly enough yet, so I will not embarrass myself here.

Applications

Having said all this, what is the HMM actually used for?

The HMM was first applied in information theory and was later applied to natural language processing, image recognition, and other areas. Two examples below illustrate its use: one is whole-sentence decoding in a pinyin input method, the other is speech recognition. Here is a figure:

An input method treats the pinyin as the observation states and the Chinese characters to be produced as the hidden states, which turns whole-sentence decoding in the input method into Viterbi decoding; the transition probabilities are given by a bigram language model, and the output probabilities are the probabilities of each (possibly polyphonic) character being read as the various pinyin.
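
As a toy illustration of that mapping, the same Viterbi routine can decode a pinyin string such as 'ta shi shui' (the example discussed again in the extension section below) into characters. The character set, the bigram values, and the pinyin probabilities here are all invented; a real input method would use a large language model and per-character pinyin statistics.

```python
# Toy pinyin-to-character decoding: hidden states = characters, observations = pinyin.
# All characters, pinyin and probabilities are invented for illustration.
pi = {"他": 0.5, "塔": 0.1, "是": 0.2, "谁": 0.2}
A  = {  # toy bigram language model P(next char | current char)
    "他": {"他": 0.05, "塔": 0.05, "是": 0.7,  "谁": 0.2},
    "塔": {"他": 0.1,  "塔": 0.1,  "是": 0.6,  "谁": 0.2},
    "是": {"他": 0.2,  "塔": 0.1,  "是": 0.05, "谁": 0.65},
    "谁": {"他": 0.3,  "塔": 0.1,  "是": 0.5,  "谁": 0.1},
}
B  = {  # P(pinyin | character); a polyphone would spread its mass over several pinyins
    "他": {"ta": 1.0, "shi": 0.0, "shui": 0.0},
    "塔": {"ta": 1.0, "shi": 0.0, "shui": 0.0},
    "是": {"ta": 0.0, "shi": 1.0, "shui": 0.0},
    "谁": {"ta": 0.0, "shi": 0.0, "shui": 1.0},
}

def decode(pinyin):
    """Viterbi decoding of a pinyin sequence into the most likely character string."""
    D = {s: pi[s] * B[s][pinyin[0]] for s in pi}
    back = []
    for o in pinyin[1:]:
        prev = {r: max(D, key=lambda k: D[k] * A[k][r]) for r in pi}
        D = {r: D[prev[r]] * A[prev[r]][r] * B[r][o] for r in pi}
        back.append(prev)
    best = max(D, key=D.get)
    path = [best]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return "".join(reversed(path))

print(decode(["ta", "shi", "shui"]))   # expected: 他是谁 with these toy numbers
```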

In the figure above, going from pronunciation to pinyin is a speech recognition problem: the transition probabilities are still a bigram language model, and the output probabilities come from the acoustic model, that is, the model relating speech to Chinese characters.

Extensions

Although the HMM already solves these problems quite well, academia always strives for excellence and keeps looking for ways to make it better, so various extensions of the HMM have appeared. Here are two of them.

One is to relax the time-homogeneity assumption among the three assumptions, that is, to let the state transition probabilities depend on time. This has practical significance for input methods. For example, 'ta' as a subject pronoun (he, it, she), as a noun (tower), and as a verb (step on, collapse) generally do not appear in the same positions: the subject pronoun usually appears at the beginning of a sentence or clause. We often say 'who is he' and rarely say 'who is the tower' (leaving aside the occasional person whose name is just the character for tower), so when converting the pinyin string 'ta' 'shi' 'shui', the first word ta should be given a larger probability of being he/it/she and a smaller probability of being tower.

In this direction, the paper in the references, "A Non-Time-Homogeneous Hidden Markov Model and Its Application to Pinyin-to-Character Conversion", gives one implementation: when building the statistical language model, record each word's position within its sentence and its average position; then, when the language model is used for pinyin-to-character conversion, re-estimate the language model probability with a function of the current position and the word's average position, roughly of the form

P(w_1 | w_2) = P_ML(w_1 | w_2) * f(·)

where P_ML(w_1 | w_2) is the maximum-likelihood estimate of the transition probability and f(·) is a weight function of the word's position and its average position.
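
Here is a heavily simplified sketch of that idea. The Gaussian-shaped weight function and all numbers are my own invention purely for illustration; the cited paper defines the actual form of f(·).

```python
import math

def reweighted_bigram(p_ml, position, avg_position, sigma=2.0):
    """Illustrative only: scale a maximum-likelihood bigram probability P_ML(w1 | w2)
    by a weight that decays as the word's current sentence position moves away from
    the average position recorded for it in the training corpus."""
    weight = math.exp(-((position - avg_position) ** 2) / (2 * sigma ** 2))
    return p_ml * weight

# E.g. a pronoun whose average position is near the start of a sentence keeps most of
# its probability at position 1 but is down-weighted deep inside the sentence.
print(reweighted_bigram(0.3, position=1, avg_position=1.2))
print(reweighted_bigram(0.3, position=8, avg_position=1.2))
```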

The other way to extend the HMM is to relax the no-after-effect assumption. Originally a state is assumed to depend only on the immediately preceding state, so only a bigram language model can be used; if we instead assume a state depends on the previous two (or even more) states, a higher-order language model can be used. Consider depending on just the previous two states: although this lets us use a trigram model, the Viterbi computation becomes a problem, because the probability of the state at time t now depends not only on the state at time t-1 but also on the state at time t-2.

To run the Viterbi algorithm under the trigram model (this is also called the second-order HMM problem), the solution is to merge each pair of adjacent states, which converts the second-order HMM problem into a first-order HMM problem.

The merged second-order HMM

For the merged second-order HMM, see the following diagram:

For brevity, I reduced the hidden states to two: rain and sunny. As the figure shows, when t >= 2 each node contains several small sub-nodes; the number of sub-nodes equals the number of possible previous states, and the value of a sub-node is the maximum probability of generating the observation sequence so far when the state at the current time is S_r and the state at the previous time is S_k. For example, the value of the small green node is the maximum probability of generating the activity sequence study → ... → play when the state at time 3 is rain and the state at time 2 is rain. (Note that a node represents a state at time i, while a sub-node, such as the green one, represents one possible previous state recorded inside that node.)

For time i (i > 2), the value of each sub-node is

D(i, S_k, S_r) = max over s_1, ..., s_{i-2} of P(o_1, ..., o_i, q_{i-1} = S_k, q_i = S_r | λ)

so for time i+1 the value of a sub-node is

D(i+1, S_r, S_q) = [ max_k D(i, S_k, S_r) * a(S_k, S_r, S_q) ] * b(S_q, o_{i+1})

where a(S_k, S_r, S_q) = P(q_{i+1} = S_q | q_{i-1} = S_k, q_i = S_r) is the second-order transition probability.

Finally, find the largest sub-node at time T and backtrack from it.
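
Here is a minimal sketch of this merged recursion, indexing the dynamic-programming table by the pair (previous state, current state). The second-order transition table A2 and every other number are invented for illustration.

```python
# Toy second-order HMM Viterbi over two hidden states (rain, sunny), as in the figure.
# All probabilities are invented; assumes at least two observations.
pi = {"rain": 0.5, "sunny": 0.5}
A1 = {"rain":  {"rain": 0.6, "sunny": 0.4},          # first-order transitions, used for step 2
      "sunny": {"rain": 0.3, "sunny": 0.7}}
A2 = {  # second-order transitions P(next | prev, cur)
    ("rain", "rain"):   {"rain": 0.7, "sunny": 0.3},
    ("rain", "sunny"):  {"rain": 0.4, "sunny": 0.6},
    ("sunny", "rain"):  {"rain": 0.5, "sunny": 0.5},
    ("sunny", "sunny"): {"rain": 0.2, "sunny": 0.8},
}
B = {"rain":  {"stay home": 0.5, "study": 0.4, "play": 0.1},
     "sunny": {"stay home": 0.1, "study": 0.3, "play": 0.6}}

def viterbi_second_order(obs):
    """DP over merged states (q_(i-1), q_i); each entry is the sub-node value D(i, k, r)."""
    states = list(pi)
    # Initialize at i = 2 with the first-order quantities.
    D = {(k, r): pi[k] * B[k][obs[0]] * A1[k][r] * B[r][obs[1]] for k in states for r in states}
    back = []
    for o in obs[2:]:
        prev, new = {}, {}
        for r in states:
            for q in states:
                # Best previous sub-node (k, r) feeding the merged state (r, q).
                k_best = max(states, key=lambda k: D[(k, r)] * A2[(k, r)][q])
                prev[(r, q)] = k_best
                new[(r, q)] = D[(k_best, r)] * A2[(k_best, r)][q] * B[q][o]
        back.append(prev)
        D = new
    # Backtrack from the best final pair.
    k, r = max(D, key=D.get)
    path = [k, r]
    for prev in reversed(back):
        path.insert(0, prev[(path[0], path[1])])
    return path, max(D.values())

print(viterbi_second_order(["study", "stay home", "play", "study"]))
```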

An answer to the example question

The answer to the first question in the example above is:

P = P(rain) * P(sunny | rain) * ... * P(cloudy | sunny) * P(study | rain) * P(stay home | sunny) * ... * P(study | cloudy)

The first factor, P(rain), comes from the initial probability distribution (remember the distribution of the Markov model at time t = 1?).
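
With the invented numbers from the earlier sketches, this product can be computed directly. Since the post does not spell out the full week, the sketch below uses a three-day version of question 1.

```python
# Made-up parameters (same illustrative values as in the earlier sketches).
pi = {"rain": 0.3, "cloudy": 0.4, "sunny": 0.3}
A  = {"rain":   {"rain": 0.5, "cloudy": 0.3, "sunny": 0.2},
      "cloudy": {"rain": 0.3, "cloudy": 0.4, "sunny": 0.3},
      "sunny":  {"rain": 0.2, "cloudy": 0.3, "sunny": 0.5}}
B  = {"rain":   {"stay home": 0.6, "study": 0.3, "play": 0.1},
      "cloudy": {"stay home": 0.3, "study": 0.4, "play": 0.3},
      "sunny":  {"stay home": 0.1, "study": 0.3, "play": 0.6}}

weather    = ["rain", "sunny", "cloudy"]          # known hidden sequence
activities = ["study", "stay home", "study"]      # observed sequence

# P = pi(rain) * P(study|rain) * P(sunny|rain) * P(stay home|sunny) * P(cloudy|sunny) * P(study|cloudy)
p = pi[weather[0]] * B[weather[0]][activities[0]]
for t in range(1, len(weather)):
    p *= A[weather[t - 1]][weather[t]] * B[weather[t]][activities[t]]
print(p)   # 0.3 * 0.3 * 0.2 * 0.1 * 0.3 * 0.4
```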


References:

A Non-Time-Homogeneous Hidden Markov Model and Its Application to Pinyin-to-Character Conversion

Research and Application of Statistical Language Models

Application of a Language Model Combining Statistics and Rules in Chinese Input Methods

Research and Implementation of a Whole-Sentence Input Algorithm Based on Markov Chains

References and PDF download: see the link in the original post.
