Data Mining Algorithm Summary: The EM Algorithm

Source: Internet

Author: Liu Weimin

Graduated from: Institute of Computing, Chinese Academy of Sciences

Occupation: Search engine enthusiast

1. What is the EM algorithm?

The EM algorithm, that is, the Expectation-Maximization algorithm, is a very important algorithm in machine learning. It consists of two alternating steps:

E step: estimate the expected value of the complete-data log-likelihood under the current parameters

M step: re-estimate the parameters by maximizing that expectation

This algorithm is mainly used for parameter estimation. Although the EM algorithm can also perform data clustering, by fitting the data with a mixture of Gaussian distributions, it iterates very slowly, much more slowly than k-means, and the clustering results of k-means are not much worse than those of EM. Therefore, k-means, rather than EM, is generally used for clustering.
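A rough illustration of this trade-off, assuming MATLAB's Statistics and Machine Learning Toolbox (which provides kmeans and fitgmdist, an EM-based Gaussian mixture fitter); on easy data like this, k-means typically finishes faster with comparable assignments:

% Illustrative comparison (assumes the Statistics and Machine Learning Toolbox).
X = [randn(500,2)*0.5 + 3; randn(500,2)*0.5 + 7];   % two well-separated blobs

tic; idx_km = kmeans(X, 2); t_km = toc;             % k-means clustering
tic; gm = fitgmdist(X, 2); idx_em = cluster(gm, X); t_em = toc;  % EM-fitted GMM

fprintf('k-means: %.3f s   EM/GMM: %.3f s\n', t_km, t_em);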

2. EM Algorithm Overview

The algorithm can be explained with a simple metaphor. Suppose the cook in a canteen has made a dish and must divide it into two portions for two people to eat. There is obviously no need to weigh the portions exactly on a balance: the simplest way is to split the dish into two bowls at random, observe which bowl has more, move some food from the fuller bowl into the other, and repeat this process until no difference can be seen between the two bowls. The EM algorithm works the same way. Suppose we want to estimate two parameters A and B, both unknown at the start: knowing A would let us infer B, and knowing B would let us infer A. We can first assign A an initial value to obtain an estimate of B, then start from the current value of B and re-estimate A, and continue alternating until the process converges.
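A minimal sketch of this alternating idea on a made-up toy problem, with hard assignments standing in for the expectation step (all names and numbers here are illustrative assumptions):

% A = per-point group assignments, B = group means; each one is
% re-estimated from the current guess of the other (toy example).
x = [randn(1,50) + 3, randn(1,50) + 7];    % two overlapping groups
B = [0 10];                                % rough initial guess for the means
for it = 1:20
    [~, A] = min(abs(x' - B), [], 2);      % given B, assign each point to a group
    for k = 1:2
        B(k) = mean(x(A == k));            % given A, update each group mean
    end
end
disp(B);                                   % should settle near [3 7]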

The EM algorithm is a method proposed by Dempster, Laird, and Rubin in 1977 for computing maximum likelihood estimates of parameters. It can perform ML estimation of parameters from an incomplete data set and is a very simple and practical learning algorithm. The method is widely applicable to defective, truncated, or noisy data, that is, so-called incomplete data. Assume that Z = (X, Y) consists of the observed data X and the unobserved data Y; Z = (X, Y) and X are called the complete data and the incomplete data, respectively. Assume that the joint probability density of Z is parameterized as p(x, y | Θ), where Θ denotes the parameters to be estimated. The maximum likelihood estimate of Θ is obtained by maximizing the log-likelihood function of the incomplete data, L(Θ; x) = log p(x | Θ) = log ∫ p(x, y | Θ) dy. The EM algorithm consists of two steps, the E step and the M step. It maximizes the log-likelihood of the incomplete data by iteratively maximizing the expectation of the log-likelihood of the complete data, LC(Θ; z) = log p(x, y | Θ). Let Θ(t) denote the estimate obtained after the t-th iteration. In iteration t + 1, the E step computes the expectation of the complete-data log-likelihood, Q(Θ | Θ(t)) = E{LC(Θ; Z) | X; Θ(t)}, and the M step obtains the new estimate Θ(t+1) by maximizing Q(Θ | Θ(t)) over Θ.
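For readability, here are the same formulas in LaTeX form (a rendering of the reconstructed text above):

L(\Theta; x) = \log p(x \mid \Theta) = \log \int p(x, y \mid \Theta)\, dy

L_C(\Theta; z) = \log p(x, y \mid \Theta)

\text{E step:}\quad Q(\Theta \mid \Theta^{(t)}) = \mathbb{E}\left[ L_C(\Theta; Z) \mid X;\ \Theta^{(t)} \right]

\text{M step:}\quad \Theta^{(t+1)} = \arg\max_{\Theta} Q(\Theta \mid \Theta^{(t)})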

By alternating these two steps, the EM algorithm gradually improves the model parameters, increasing the likelihood of the parameters given the training samples, and eventually stops at a maximum. Intuitively, EM can be seen as a successive approximation algorithm: the model parameters are not known in advance, so a set of parameters is chosen at random, or an initial parameter Θ0 is roughly specified in advance. The most likely model state under the current parameters is determined, and the probability of each possible outcome is computed for each training sample; the samples are then used, in the current state, to adjust the parameters, Θ is re-estimated, and the model state is re-determined under the new parameters. Through many such iterations, continuing until some convergence condition is met, the model parameters gradually approach the true parameters.
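A minimal, self-contained sketch of this loop for a one-dimensional, two-component Gaussian mixture (the data, initial values, and variable names are illustrative assumptions; the full two-dimensional implementation appears in Section 4):

% Toy 1-D, two-component Gaussian mixture EM loop.
x = [0.5*randn(1,100) + 3, 0.5*randn(1,100) + 7];  % toy data
a = [0.5 0.5]; mu = [2 8]; s2 = [1 1];             % initial guesses
for it = 1:50
    % E step: responsibilities r(m,n) proportional to a_m * N(x_n | mu_m, s2_m)
    r = zeros(2, numel(x));
    for m = 1:2
        r(m,:) = a(m)*exp(-(x - mu(m)).^2/(2*s2(m)))/sqrt(2*pi*s2(m));
    end
    r = r./sum(r, 1);
    % M step: closed-form updates from the responsibilities
    nk = sum(r, 2)';                     % effective sample counts
    a  = nk/numel(x);                    % mixing weights
    mu = (r*x')'./nk;                    % component means
    s2 = sum(r.*(x - mu').^2, 2)'./nk;   % component variances
end
disp([a; mu; s2]);                       % rows: weights, means, variances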

The main purpose of the EM algorithm is to provide a simple iterative algorithm for computing the posterior density function. Its greatest advantages are simplicity and stability, but it can easily fall into a local optimum.

3. Disadvantages of the EM Algorithm

The EM algorithm converges slowly, requiring many iterations.

The EM algorithm is slow on large data volumes, and its covariance estimates are not accurate enough.

(In other words, the EM algorithm is mainly a parameter-estimation method; the covariance only appears when EM is applied to a mixture of Gaussian distributions.)
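A related practical failure mode is that a covariance estimate can become nearly singular during iteration, which breaks the inversion and determinant computations in the E step. A common guard, shown below on a toy matrix, is to add a small ridge to the estimate (this is an assumption on my part, not something the original code does; in the Section 4 code it would amount to adding the same ridge to Cov(:,:,cm) after each M-step update):

% Assumed guard: a small ridge keeps a rank-deficient covariance invertible.
S = [1 1; 1 1];                 % degenerate estimate; det(S) == 0
S_reg = S + 1e-6*eye(2);        % regularized version
disp([det(S) det(S_reg)]);      % 0 versus a small positive value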

4. The following is the EM code I implemented in MATLAB.

close all;
clear;
clc;
% Author: Liu Weimin
% Unit: Institute of Computing, Chinese Academy of Sciences
% Email: liuwm@ics.ict.ac.cn
% Function: a typical EM implementation for a two-component 2-D Gaussian mixture
% Reference: pattern.recognition.and.machine.learning.pdf
% Date: 20091118

M = 2;       % number of Gaussian components
N = 50000;   % total number of data samples
TH = 0.1;    % convergence threshold
nit = 200;   % maximal number of iterations per repetition
nrep = 10;   % number of repetitions used to search for the global maximum
K = 2;       % dimension of the data
Pi = 3.141592653589793;  % in case pi has been overwritten by a same-name variable

% parameters of the random signal generator (ground truth, set manually)
a_real = [1/2; 1/2];
mu_real = [3 7;
           7 3];
cov_real(:,:,1) = [0.5 0;
                   0 0.5];
cov_real(:,:,2) = [0.5 0;
                   0 0.5];

% generate the data
X = [mvnrnd(mu_real(:,1)', cov_real(:,:,1), N*a_real(1))', ...
     mvnrnd(mu_real(:,2)', cov_real(:,:,2), N*a_real(2))'];

% resample any point that falls outside the box (0,10) x (0,10)
for cn = 1:N*a_real(1)
    while ~((X(1,cn) > 0) && (X(2,cn) > 0) && (X(1,cn) < 10) && (X(2,cn) < 10))
        X(:,cn) = mvnrnd(mu_real(:,1)', cov_real(:,:,1), 1)';
    end
end
for cn = N*a_real(1)+1:N
    while ~((X(1,cn) > 0) && (X(2,cn) > 0) && (X(1,cn) < 10) && (X(2,cn) < 10))
        X(:,cn) = mvnrnd(mu_real(:,2)', cov_real(:,:,2), 1)';
    end
end

% EM algorithm
f_best = -inf;
for crep = 1:nrep
    fprintf('searching local maximum %04d\n', crep);
    % parameter initialization; a small random perturbation is added so that
    % the nrep repetitions start from different points (an added assumption:
    % the original initialized these values once, outside this loop)
    a = [1/3, 2/3];
    mu = [2 4;
          4 7] + randn(K, M);
    Cov(:,:,1) = [5 0; 0 0.5];
    Cov(:,:,2) = [5 0; 0 0.5];
    for cit = 1:nit
        a_old = a;
        mu_old = mu;
        cov_old = Cov;
        % unnormalized component densities p(x_n | m)
        rznk_p = zeros(M, N);
        for cm = 1:M
            mu_cm = mu(:,cm);
            cov_cm = Cov(:,:,cm);
            for cn = 1:N
                rznk_p(cm,cn) = exp(-0.5*(X(:,cn)-mu_cm)'/cov_cm*(X(:,cn)-mu_cm));
            end
            rznk_p(cm,:) = rznk_p(cm,:)/sqrt(det(cov_cm));
        end
        rznk_p = rznk_p*(2*Pi)^(-K/2);
        % E step: compute the responsibilities rznk
        rznk = zeros(M, N);
        px = zeros(1, N);                 % mixture density of each sample
        for cn = 1:N
            pikn = a(:)'.*rznk_p(:,cn)';  % a_m * p(x_n | m), m = 1..M
            px(cn) = sum(pikn);
            rznk(:,cn) = pikn'/px(cn);
        end
        % M step: re-estimate the mixing weights, means, and covariances
        nk = sum(rznk, 2)';               % effective number of samples per component
        a = nk/N;
        for cm = 1:M
            rznk_sum_mu = zeros(K, 1);    % must be reset to zero for each component
            for cn = 1:N
                rznk_sum_mu = rznk_sum_mu + rznk(cm,cn)*X(:,cn);
            end
            mu(:,cm) = rznk_sum_mu/nk(cm);
        end
        for cm = 1:M
            rznk_sum_cov = zeros(K, K);
            for cn = 1:N
                rznk_sum_cov = rznk_sum_cov + rznk(cm,cn)*(X(:,cn)-mu(:,cm))*(X(:,cn)-mu(:,cm))';
            end
            Cov(:,:,cm) = rznk_sum_cov/nk(cm);
        end
        % relative change of the parameters; stop when it falls below TH
        t = max([norm(a_old(:)-a(:))/norm(a_old(:)); ...
                 norm(mu_old(:)-mu(:))/norm(mu_old(:)); ...
                 norm(cov_old(:)-Cov(:))/norm(cov_old(:))]);
        disp(t);
        if t < TH
            break;
        end
    end
    F = sum(log(px));                     % log-likelihood of this solution
    if F > f_best
        a_best = a;
        mu_best = mu;
        cov_best = Cov;
        f_best = F;
    end
end

% output results
disp('a_best = ');
disp(a_best);
disp('mu_best = ');
disp(mu_best);
disp('cov_best = ');
disp(cov_best);
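With the ground-truth settings above, the recovered a_best, mu_best, and cov_best should come out close to a_real, mu_real, and cov_real (up to a permutation of the two components), which is a convenient sanity check on the implementation.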
