Statistical Learning Notes 9 -- The EM Algorithm

Preface

The EM algorithm is an iterative algorithm, proposed by Dempster et al. in 1977, for maximum likelihood estimation, or maximum a posteriori estimation, of the parameters of probabilistic models that contain latent (hidden) variables. Each iteration of the EM algorithm consists of two steps: the E-step (expectation) and the M-step (maximization), which is why it is called the expectation-maximization algorithm, or simply the EM algorithm.

Maximum likelihood estimation

Maximum likelihood estimation is a piece of probability theory applied in statistics, and it is one of the methods of parameter estimation. The setting is this: a random sample is known to follow some probability distribution, but the specific parameters are unknown; parameter estimation runs several experiments, observes the results, and uses those results to infer approximate values of the parameters. The idea behind maximum likelihood estimation is that if a particular parameter value makes the observed sample most probable, we certainly will not choose another parameter value that gives the sample a smaller probability, so we simply take that value as the estimate of the true parameter.

You can think of maximum likelihood estimation as reasoning backwards. In most cases we compute a result from known conditions; maximum likelihood estimation starts from a known result and looks for the condition that makes this result most probable, and takes that condition as the estimate. For example, suppose that, other conditions being equal, smokers are 5 times more likely to develop lung cancer than non-smokers. If I now tell you that a certain person has lung cancer and ask whether this person smokes or not, how do you judge? You probably know nothing else about this person; the only thing you are aware of is that smoking makes lung cancer more likely. Would you guess that the person does not smoke? I believe you are more likely to say that this person smokes. Why? Because that is the "most likely" answer: I can only say that he is "most likely" a smoker, since the assumption "he smokes" makes the observed result "lung cancer" most probable. This is maximum likelihood estimation.

General steps for finding the maximum likelihood estimate:

(1) Write out the likelihood function.
(2) Take the logarithm of the likelihood function and simplify.
(3) Take the derivative and set it to 0 to obtain the likelihood equation(s).
(4) Solve the likelihood equation(s); the solution is the desired parameter estimate.
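
As a small worked illustration of these four steps (this example is not in the original text): suppose a coin with unknown heads probability \theta is tossed n times and comes up heads k times.

(1) Likelihood: L(\theta) = \binom{n}{k} \theta^k (1-\theta)^{n-k}
(2) Log-likelihood: \log L(\theta) = \log \binom{n}{k} + k \log \theta + (n-k) \log(1-\theta)
(3) Derivative set to zero: \frac{d}{d\theta} \log L(\theta) = \frac{k}{\theta} - \frac{n-k}{1-\theta} = 0
(4) Solve: \hat{\theta} = k/n, i.e. the observed frequency of heads.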

Maximum likelihood estimation is also an instance of empirical risk minimization (ERM) in statistical learning: when the model is a conditional probability distribution and the loss function is the logarithmic loss, empirical risk minimization is equivalent to maximum likelihood estimation.
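
In symbols (a quick check of this equivalence, using the usual notation for a sample (x_1, y_1), \dots, (x_N, y_N)): with log loss -\log P(y \mid x; \theta),

\arg\min_\theta \frac{1}{N} \sum_{i=1}^{N} \big( -\log P(y_i \mid x_i; \theta) \big) = \arg\max_\theta \sum_{i=1}^{N} \log P(y_i \mid x_i; \theta),

so minimizing the empirical risk and maximizing the (log-)likelihood select the same parameter.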

In general, maximum likelihood estimation is the method used to estimate model parameters given a model (containing unknown parameters) and a sample set. The basic idea is to find the parameter values that make the model fit the sample best, that is, that make the sample set most probable. This approach rests on the belief that the parameters we estimate should be the ones most likely to have produced the given sample. In maximum likelihood estimation we therefore search, within the given model, for the parameters under which this particular set of samples is most likely to appear. To give an extreme example: suppose we draw a sample of the Chinese population and the ratio of men to women in the sample is 3:2, and you are asked to estimate the true nationwide ratio. You would certainly not estimate male : female = 1:0, because if the ratio were 1:0 it would be impossible to obtain a 3:2 sample. Most of us would readily estimate 3:2. Why? We are using the sample to estimate the population, and the idea behind that is exactly maximum likelihood.

If you still do not fully understand maximum likelihood, you can refer to the example links listed in the reference section at the end of this article.

Introduction of the EM algorithm

Let's look at how the book introduces it. The example used there is the classic three-coin model: there are three coins, A, B and C, whose probabilities of landing heads are \pi, p and q. Coin A is tossed first; if A shows heads, coin B is tossed, otherwise coin C is tossed, and only the result of that second toss (1 for heads, 0 for tails) is recorded.

Writing out the intermediate derivation, the model consists of three pieces: the probability that coin A selects B, and the two conditional distributions of the observed toss. From these three equations we obtain the distribution of a single observation, sketched below.
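
A sketch of those formulas, assuming the standard three-coin setup just described, with model parameter \theta = (\pi, p, q):

P(z = 1 \mid \theta) = \pi,
P(y \mid z = 1, \theta) = p^y (1-p)^{1-y},
P(y \mid z = 0, \theta) = q^y (1-q)^{1-y},

and therefore

P(y \mid \theta) = \sum_z P(z \mid \theta) P(y \mid z, \theta) = \pi p^y (1-p)^{1-y} + (1-\pi) q^y (1-q)^{1-y}.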

Note: here the random variable y is the observed variable, indicating that the result of one trial is 1 or 0; the random variable z is the latent variable, representing the unobserved result of tossing coin A; and \theta = (\pi, p, q) is the model parameter. This model is the generative model of the data above. Note that the data for the random variable y can be observed, while the data for the random variable z cannot.

If these n independent trials are repeated, write the observed data as the vector Y = (y_1, y_2, \dots, y_n) and the unobserved data as the vector Z = (z_1, z_2, \dots, z_n). The likelihood function of the observed data is obtained by summing the joint distribution over Z, and because the n observations are independent and identically distributed, this probability takes the form of a product over the trials. Maximum likelihood estimation of the model parameter then means finding the parameter value that maximizes this probability (equivalently, the maximum of its logarithm). These expressions are written out below.
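
Written out explicitly (a sketch, continuing the three-coin notation assumed above):

P(Y \mid \theta) = \sum_Z P(Z \mid \theta) P(Y \mid Z, \theta) = \prod_{j=1}^{n} \big[ \pi p^{y_j} (1-p)^{1-y_j} + (1-\pi) q^{y_j} (1-q)^{1-y_j} \big],

and the maximum likelihood estimate is

\hat{\theta} = \arg\max_\theta \log P(Y \mid \theta).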

This problem has no analytic solution and can only be solved iteratively. The EM algorithm is one such iterative algorithm. The solution process is given below; once you understand it, you will find that the idea of the EM algorithm is consistent with the idea of coordinate ascent or of the SMO algorithm; in essence they share the same underlying strategy.

In general, Y denotes the data of the observed random variable and Z denotes the data of the hidden random variable. Y and Z together are called the complete data, and the observed data Y alone is called the incomplete data. Assume the probability distribution of the observed data Y is P(Y \mid \theta), where \theta is the model parameter to be estimated; then the likelihood function of the incomplete data Y is P(Y \mid \theta) and its log-likelihood function is L(\theta) = \log P(Y \mid \theta). Assume the joint probability distribution of Y and Z is P(Y, Z \mid \theta); then the log-likelihood function of the complete data is \log P(Y, Z \mid \theta).

The EM algorithm computes the maximum likelihood estimate iteratively; each iteration consists of two steps: the E-step (expectation) and the M-step (maximization). The EM algorithm is described below.

Let's go straight to the chart (those formulas I really don't want to typeset ...); its content is sketched below.
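
A sketch of what that chart states, assuming the usual Q-function formulation of EM (with \theta^{(i)} the parameter estimate at iteration i):

Input: observed data Y, latent data Z, the joint distribution P(Y, Z \mid \theta) and the conditional distribution P(Z \mid Y, \theta); choose an initial value \theta^{(0)}.

E-step: compute
Q(\theta, \theta^{(i)}) = E_Z\big[ \log P(Y, Z \mid \theta) \mid Y, \theta^{(i)} \big] = \sum_Z \log P(Y, Z \mid \theta) \, P(Z \mid Y, \theta^{(i)}).

M-step: set
\theta^{(i+1)} = \arg\max_\theta Q(\theta, \theta^{(i)}).

Repeat the two steps until \theta changes by less than some tolerance.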

Derivation of the EM algorithm

Suppose we have a sample set {x^{(1)}, ..., x^{(m)}} containing m independent samples, but the class z^{(i)} corresponding to each sample x^{(i)} is unknown (as in clustering); z is the latent variable. We want to estimate the parameter \theta of the probability model p(x, z), but because the model contains the latent variable z it is hard to solve directly by maximum likelihood; if z were known, the problem would be easy to solve.

For parameter estimation we essentially want the parameter \theta that maximizes the likelihood function; the only difference from ordinary maximum likelihood is that the likelihood function now contains the unknown variable z. That is, our goal is to find \theta and z that make L(\theta) as large as possible. One might think: z is just another unknown, so can't we simply take partial derivatives with respect to \theta and z separately, set them to 0, and solve?
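
Writing the objective out (a sketch in the commonly used notation, where Q_i denotes some distribution over the possible values of z^{(i)}, with \sum_{z} Q_i(z) = 1 and Q_i(z) \ge 0):

L(\theta) = \sum_{i=1}^{m} \log p(x^{(i)}; \theta)
          = \sum_{i=1}^{m} \log \sum_{z^{(i)}} p(x^{(i)}, z^{(i)}; \theta)                                         (1)
          = \sum_{i=1}^{m} \log \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}        (2)
          \ge \sum_{i=1}^{m} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}      (3)

These are the formulas (1), (2), (3) discussed in the next paragraph.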

In essence we need to maximize formula (1). Recall how the marginal probability density of one variable is obtained from a joint probability density, and note that z is also a random variable: for each sample i, summing the joint density over every possible category z^{(i)} on the right gives the marginal density of the random variable x on the left, which is exactly the likelihood. But you can see that (1) contains the logarithm of a sum; after differentiation the expression becomes very complicated (imagine differentiating log(f_1(x) + f_2(x) + f_3(x) + ...)), so it is hard to solve for the unknowns z and \theta. Can we transform formula (1) somehow? Look at (2): formula (2) only multiplies the numerator and denominator by the same function, so it still contains a "logarithm of a sum" and still cannot be solved directly; so why do it? Now look at (3): formula (3) has become a "sum of logarithms", which is easy to differentiate. Notice, however, that the equals sign has become an inequality. Why is this change allowed? This is exactly where the Jensen inequality shows its power.

Jensen's inequality

Let f be a function defined on the real numbers and taking real values. If the second derivative f''(x) is greater than or equal to 0 for all real x, then f is a convex function. When x is a vector, f is convex if its Hessian matrix H is positive semidefinite. If f''(x) is strictly greater than 0 (or H is positive definite), then f is a strictly convex function.

Jensen's inequality is stated as follows:
If f is a convex function and X is a random variable, then E[f(X)] >= f(E[X]).
In particular, if f is strictly convex, equality holds if and only if X is a constant (with probability 1).

A diagram makes this clear:

In the figure, the solid curve f is the convex function; X is a random variable that takes the value a with probability 0.5 and the value b with probability 0.5 (just like a coin toss). The expected value of X is the midpoint of a and b, and E[f(X)] >= f(E[X]) can be read directly off the figure.

f is a (strictly) concave function if and only if -f is a (strictly) convex function.

When Jensen's inequality is applied to a concave function, the direction of the inequality is reversed, i.e. E[f(X)] <= f(E[X]).
Return to formula (2). Here f(x) = log x is a concave function (its second derivative is -1/x^2 < 0).

In formula (2), the inner sum \sum_{z^{(i)}} Q_i(z^{(i)}) \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} is precisely the expectation of \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} with respect to the distribution Q_i; applying Jensen's inequality for the concave function log (the log of an expectation is at least the expectation of the log) to this expectation gives the inequality of formula (3).

OK. At this point formula (3) is easy to differentiate, but formulas (2) and (3) are not equal, so the maximum of (3) is not the maximum of (2), while what we actually want is the maximum of (2). What should we do?

Now a little imagination is needed. The inequality between formulas (2) and (3) can be written as: likelihood L(\theta) >= J(z, Q), where J is the lower bound on the right-hand side. We can then keep raising the lower bound J so that L(\theta) keeps improving, until it finally reaches its maximum.

As shown in the figure above: first fix \theta and adjust Q(z) so that the lower bound J(z, Q) rises until it equals L(\theta) at this \theta (from the green curve to the blue curve); then fix Q(z) and adjust \theta so that the lower bound J(z, Q) reaches its maximum (moving from \theta_t to \theta_{t+1}); then fix \theta again and adjust Q(z) ... and so on, until convergence to a \theta^* that maximizes the likelihood L(\theta). Two questions remain: when is the lower bound J(z, Q) equal to L(\theta) at the current \theta? And why must the procedure converge?

First, by Jensen's inequality, equality holds when the random variable inside is a constant. Here that means p(x^{(i)}, z^{(i)}; \theta) / Q_i(z^{(i)}) = c, where c is a constant that does not depend on z^{(i)}.

Further, since Q_i is a probability distribution of the random variable z^{(i)}, it satisfies \sum_{z} Q_i(z) = 1. Summing the relation above over all z^{(i)} (when several fractions are equal, the ratio of the sum of their numerators to the sum of their denominators is still the same constant c) gives the expression for Q_i below.
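
Carrying that out (a sketch in the same notation):

\frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})} = c \quad\Rightarrow\quad Q_i(z^{(i)}) \propto p(x^{(i)}, z^{(i)}; \theta),

and since \sum_{z} Q_i(z) = 1,

Q_i(z^{(i)}) = \frac{p(x^{(i)}, z^{(i)}; \theta)}{\sum_{z} p(x^{(i)}, z; \theta)} = \frac{p(x^{(i)}, z^{(i)}; \theta)}{p(x^{(i)}; \theta)} = p(z^{(i)} \mid x^{(i)}; \theta).

That is, Q_i is simply the posterior distribution of z^{(i)} given x^{(i)} under the current parameters.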

At this point we have derived the formula for Q(z), which answers the question of how to choose Q(z) so that, with the parameter \theta fixed, the lower bound is pulled up to touch L(\theta). This step is the E-step, which establishes the lower bound of L(\theta). The next step, the M-step, adjusts \theta with Q(z) held fixed, to make the lower bound J of L(\theta) even larger (with Q(z) fixed, the lower bound can still be increased by moving \theta). The general EM algorithm steps are then as follows:

Algorithm flow of EM:

Initialize the distribution parameter θ;

Repeat the following steps until convergence:

E-step: using the initial parameter values or the model parameters from the previous iteration, compute the posterior probability of the latent variable, i.e. the expectation over the latent variable, and use it as the current estimate of the latent variable distribution (written out together with the M-step below):

M-step: maximize the likelihood lower bound with this latent-variable distribution fixed, to obtain the new parameter values:
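
A sketch of both updates, in the notation used in the derivation above:

E-step: for each i, set Q_i(z^{(i)}) := p(z^{(i)} \mid x^{(i)}; \theta).

M-step: set \theta := \arg\max_\theta \sum_{i} \sum_{z^{(i)}} Q_i(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta)}{Q_i(z^{(i)})}.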

By iterating in this way we obtain the parameter \theta that maximizes the likelihood function L(\theta). That leaves the second question from above: does it converge?

Intuitively, because the lower bound keeps increasing, the likelihood objective increases monotonically, so eventually we reach a (local) maximum of the likelihood. A more rigorous analysis gives the following:
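
A sketch of that monotonicity argument, writing \theta^{(t)} for the estimate at iteration t and Q_i^{(t)} for the posterior computed in the E-step at iteration t:

L(\theta^{(t+1)}) \ge \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t+1)})}{Q_i^{(t)}(z^{(i)})}
                 \ge \sum_i \sum_{z^{(i)}} Q_i^{(t)}(z^{(i)}) \log \frac{p(x^{(i)}, z^{(i)}; \theta^{(t)})}{Q_i^{(t)}(z^{(i)})}
                 = L(\theta^{(t)}).

The first inequality is just the lower bound (3), which holds for any Q and \theta; the second holds because \theta^{(t+1)} maximizes that bound over \theta; the final equality holds because Q_i^{(t)} was chosen to make the bound tight at \theta^{(t)}. So L(\theta^{(t)}) is non-decreasing, and if L is bounded above the sequence of values converges.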

For the detailed proof, see the references at the end of this article.

Next I post a few pictures (the links are also given in the reference section at the end), written by an author I follow, to help you understand:

Understanding the EM algorithm via coordinate ascent

The zigzag path in the figure is the path of the iterative optimization. You can see that each step moves further toward the optimum, and that the path is always parallel to one of the axes, because each step optimizes only one variable.

This is like finding the extremum of a curve in the x-y coordinate system when the function cannot be differentiated directly, so gradient descent does not apply. However, once one variable is fixed, the other can be optimized by differentiation, so we can use coordinate ascent: fix one variable at a time, maximize over the other, and alternate to approach the extremum step by step. In EM terms, the E-step fixes \theta and optimizes Q; the M-step fixes Q and optimizes \theta; alternating the two pushes the objective toward its maximum.

Examples to help understand the EM algorithm: instance one

This is a coin-tossing example: H indicates heads, T indicates tails, and the parameter \theta indicates the probability of heads. There are two coins, A and B, and both coins are biased. The experiment has 5 groups in total; in each group one coin is chosen at random and tossed 10 times in a row. If you know which coin was tossed in each group, then computing the parameter \theta is very simple, as shown in the figure below.

But what if you don't know which coin was tossed each time? Then we need the EM algorithm. The basic steps are: 1. give \theta_A and \theta_B initial values; 2. (E-step) estimate, for each group, the probability that it was produced by coin A (the probability that it was produced by coin B is 1 minus that), and compute for each group the expected numbers of heads and tails attributed to coin A and to coin B; 3. (M-step) use the expected counts from step 2 to recompute \theta_A and \theta_B; 4. stop when the iteration count reaches a limit or the algorithm has converged to the required precision; otherwise go back to step 2.

Let me explain the calculations in the figure above a little. The initial values are \theta_A = 0.6 and \theta_B = 0.5.

Where does the 0.45 in the figure come from? With the initial values 0.6 and 0.5, the first group (5 heads, 5 tails) has probability proportional to 0.6^5 * 0.4^5 under coin A and 0.5^5 * 0.5^5 under coin B, so the probability that this group used coin A is 0.6^5 * 0.4^5 / (0.6^5 * 0.4^5 + 0.5^5 * 0.5^5) ≈ 0.449, which rounds to 0.45. Where do the 2.2 H and 2.2 T in the figure come from? They are 0.449 * 5 ≈ 2.2, the expected numbers of heads and of tails of the first group attributed to coin A. The other values follow in the same way.

Python implementation

Now we use Python to implement the first iteration of the above example, corresponding to steps (1), (2), (3) in the figure above. The code is as follows:

# -*- coding: utf-8 -*-
import numpy
from scipy.stats import binom

# Simulate one iteration of the example above: the coin-toss experiment.
def em_single(priors, observations):
    """
    One iteration of the EM algorithm.
    :param priors: current parameters [theta_a, theta_b]
    :param observations: m*n observation matrix; m is the number of groups,
                         n is the number of tosses per group
    :return: [new_theta_a, new_theta_b]
    """
    counts = {'A': {'h': 0, 't': 0}, 'B': {'h': 0, 't': 0}}
    theta_a = priors[0]
    theta_b = priors[1]
    # E step
    for observation in observations:
        len_observation = len(observation)
        num_heads = observation.sum()                # heads
        num_tails = len_observation - num_heads      # tails
        # A group of coin tosses follows a binomial distribution, so use the binomial pmf
        contribution_a = binom.pmf(num_heads, len_observation, theta_a)
        contribution_b = binom.pmf(num_heads, len_observation, theta_b)
        # Normalize the two probabilities: the probability that this group came from coin A or B
        weight_a = contribution_a / (contribution_a + contribution_b)
        weight_b = contribution_b / (contribution_a + contribution_b)
        # Update the expected numbers of heads and tails produced by coins A and B
        # under the current parameters
        counts['A']['h'] += weight_a * num_heads
        counts['A']['t'] += weight_a * num_tails
        counts['B']['h'] += weight_b * num_heads
        counts['B']['t'] += weight_b * num_tails
    # M step
    new_theta_a = counts['A']['h'] / (counts['A']['h'] + counts['A']['t'])
    new_theta_b = counts['B']['h'] / (counts['B']['h'] + counts['B']['t'])
    return [new_theta_a, new_theta_b]


if __name__ == '__main__':
    # Data set of coin-toss results, 1: h (heads); 0: t (tails)
    observations = numpy.array([[1, 0, 0, 0, 1, 1, 0, 1, 0, 1],
                                [1, 1, 1, 1, 0, 1, 1, 1, 1, 1],
                                [1, 0, 1, 1, 1, 1, 1, 0, 1, 1],
                                [1, 0, 1, 0, 0, 0, 1, 1, 0, 0],
                                [0, 1, 1, 1, 0, 1, 1, 1, 0, 1]])
    priors = [0.6, 0.5]
    new_thetas = em_single(priors, observations)
    print(new_thetas)

Experimental results: [0.71301223540051617, 0.58133930831366265], consistent with the results in the figure above.

Below we give the main loop of the EM algorithm: it calls the single-iteration EM method above repeatedly until convergence.

"'
#EM算法的主循环
two termination conditions for a given loop: The model parameter change is less than the threshold; the
                     loop reaches the maximum number of times
'
def em (observations,prior,threshold = 1e-6 , iterations=10000): "" "
    em algorithm
    : param observations: Observational data
    : param prior: Model initial value
    : param tol: Iteration End threshold
    : param iterations: Maximum iteration count
    : return: Local Optimal model parameter ""
    "
    iteration = 0 while
    iteration < Iterations:
        new_prior = Em_single (prior,observations)
        delta_change = Numpy.abs (prior[0]-new_prior[0])
        if Delta_change < threshold:
            break
        else:
            prior = new_prior
            Iteration +=1
    #返回最终的权重, and the number of iterations
    return (new_prior,iteration)
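
To reproduce the numbers reported below, one can call this main loop on the observation matrix defined earlier (a small hypothetical driver snippet, assuming em_single, em and observations from the code above are all in scope):

    final_thetas, n_iterations = em(observations, [0.6, 0.5])
    print(final_thetas)     # expected to be close to the values reported below
    print(n_iterations)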

Experimental results:

The final probability of heads for coin A is 0.796788759383.
The final probability of heads for coin B is 0.519583935675.
The final number of iterations is 14.

Reference Links:
http://blog.csdn.net/u011300443/article/details/46763743

http://blog.csdn.net/abcjennifer/article/details/8170378

http://blog.csdn.net/zouxy09/article/details/8537620/

https://www.zhihu.com/question/27976634/answer/39132183
