Data Mining Algorithm Summary: The EM Algorithm

Source: Internet

Author: Liu Weimin

Graduated from: Institute of Computing, Chinese Academy of Sciences

Occupation: Search engine enthusiast

1. What is the EM algorithm?

The EM algorithm, that is, the Expectation-Maximization algorithm, is a very important algorithm in machine learning. It consists of two alternating steps:

E step: estimate the expected value of the complete-data log-likelihood under the current parameters

M step: re-estimate the parameters by maximizing that expectation

This algorithm is mainly used for parameter estimation. Although the EM algorithm can also perform data clustering, by fitting the data with a mixture of Gaussian distributions, it iterates very slowly, much more slowly than k-means, and the clustering results of k-means are not much worse than those of EM. Therefore, k-means, rather than EM, is generally used for clustering.
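A rough illustration of this trade-off, assuming MATLAB's Statistics and Machine Learning Toolbox (which provides kmeans and fitgmdist, an EM-based Gaussian mixture fitter); on easy data like this, k-means typically finishes faster with comparable assignments:

% Illustrative comparison (assumes the Statistics and Machine Learning Toolbox).
X = [randn(500,2)*0.5 + 3; randn(500,2)*0.5 + 7];   % two well-separated blobs

tic; idx_km = kmeans(X, 2); t_km = toc;             % k-means clustering
tic; gm = fitgmdist(X, 2); idx_em = cluster(gm, X); t_em = toc;  % EM-fitted GMM

fprintf('k-means: %.3f s   EM/GMM: %.3f s\n', t_km, t_em);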

2. EM Algorithm Overview

The algorithm can be explained with a simple metaphor. Suppose the cook in a canteen has made a dish and must divide it into two portions for two people to eat. There is obviously no need to weigh the portions exactly on a balance: the simplest way is to split the dish into two bowls at random, observe which bowl has more, move some food from the fuller bowl into the other, and repeat this process until no difference can be seen between the two bowls. The EM algorithm works the same way. Suppose we want to estimate two parameters A and B, both unknown at the start: knowing A would let us infer B, and knowing B would let us infer A. We can first assign A an initial value to obtain an estimate of B, then start from the current value of B and re-estimate A, and continue alternating until the process converges.
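A minimal sketch of this alternating idea on a made-up toy problem, with hard assignments standing in for the expectation step (all names and numbers here are illustrative assumptions):

% A = per-point group assignments, B = group means; each one is
% re-estimated from the current guess of the other (toy example).
x = [randn(1,50) + 3, randn(1,50) + 7];    % two overlapping groups
B = [0 10];                                % rough initial guess for the means
for it = 1:20
    [~, A] = min(abs(x' - B), [], 2);      % given B, assign each point to a group
    for k = 1:2
        B(k) = mean(x(A == k));            % given A, update each group mean
    end
end
disp(B);                                   % should settle near [3 7]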

The EM algorithm is a method proposed by Dempster, Laird, and Rubin in 1977 for computing maximum likelihood estimates of parameters. It can perform ML estimation of parameters from an incomplete data set and is a very simple and practical learning algorithm. The method is widely applicable to defective, truncated, or noisy data, that is, so-called incomplete data. Assume that Z = (X, Y) consists of the observed data X and the unobserved data Y; Z = (X, Y) and X are called the complete data and the incomplete data, respectively. Assume that the joint probability density of Z is parameterized as p(x, y | Θ), where Θ denotes the parameters to be estimated. The maximum likelihood estimate of Θ is obtained by maximizing the log-likelihood function of the incomplete data, L(Θ; x) = log p(x | Θ) = log ∫ p(x, y | Θ) dy. The EM algorithm consists of two steps, the E step and the M step. It maximizes the log-likelihood of the incomplete data by iteratively maximizing the expectation of the log-likelihood of the complete data, LC(Θ; z) = log p(x, y | Θ). Let Θ(t) denote the estimate obtained after the t-th iteration. In iteration t + 1, the E step computes the expectation of the complete-data log-likelihood, Q(Θ | Θ(t)) = E{LC(Θ; Z) | X; Θ(t)}, and the M step obtains the new estimate Θ(t+1) by maximizing Q(Θ | Θ(t)) over Θ.
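For readability, here are the same formulas in LaTeX form (a rendering of the reconstructed text above):

L(\Theta; x) = \log p(x \mid \Theta) = \log \int p(x, y \mid \Theta)\, dy

L_C(\Theta; z) = \log p(x, y \mid \Theta)

\text{E step:}\quad Q(\Theta \mid \Theta^{(t)}) = \mathbb{E}\left[ L_C(\Theta; Z) \mid X;\ \Theta^{(t)} \right]

\text{M step:}\quad \Theta^{(t+1)} = \arg\max_{\Theta} Q(\Theta \mid \Theta^{(t)})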

By alternating these two steps, the EM algorithm gradually improves the model parameters, increasing the likelihood of the parameters given the training samples, and eventually stops at a maximum. Intuitively, EM can be seen as a successive approximation algorithm: the model parameters are not known in advance, so a set of parameters is chosen at random, or an initial parameter Θ0 is roughly specified in advance. The most likely model state under the current parameters is determined, and the probability of each possible outcome is computed for each training sample; the samples are then used, in the current state, to adjust the parameters, Θ is re-estimated, and the model state is re-determined under the new parameters. Through many such iterations, continuing until some convergence condition is met, the model parameters gradually approach the true parameters.
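A minimal, self-contained sketch of this loop for a one-dimensional, two-component Gaussian mixture (the data, initial values, and variable names are illustrative assumptions; the full two-dimensional implementation appears in Section 4):

% Toy 1-D, two-component Gaussian mixture EM loop.
x = [0.5*randn(1,100) + 3, 0.5*randn(1,100) + 7];  % toy data
a = [0.5 0.5]; mu = [2 8]; s2 = [1 1];             % initial guesses
for it = 1:50
    % E step: responsibilities r(m,n) proportional to a_m * N(x_n | mu_m, s2_m)
    r = zeros(2, numel(x));
    for m = 1:2
        r(m,:) = a(m)*exp(-(x - mu(m)).^2/(2*s2(m)))/sqrt(2*pi*s2(m));
    end
    r = r./sum(r, 1);
    % M step: closed-form updates from the responsibilities
    nk = sum(r, 2)';                     % effective sample counts
    a  = nk/numel(x);                    % mixing weights
    mu = (r*x')'./nk;                    % component means
    s2 = sum(r.*(x - mu').^2, 2)'./nk;   % component variances
end
disp([a; mu; s2]);                       % rows: weights, means, variances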

The main purpose of the EM algorithm is to provide a simple iterative algorithm for computing the posterior density function. Its greatest advantages are simplicity and stability, but it can easily fall into a local optimum.

3. Disadvantages of the EM Algorithm

The EM algorithm converges slowly, requiring many iterations.

The EM algorithm is slow on large data volumes, and its covariance estimates are not accurate enough.

(In other words, the EM algorithm is mainly a parameter-estimation method; the covariance only appears when EM is applied to a mixture of Gaussian distributions.)
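A related practical failure mode is that a covariance estimate can become nearly singular during iteration, which breaks the inversion and determinant computations in the E step. A common guard, shown below on a toy matrix, is to add a small ridge to the estimate (this is an assumption on my part, not something the original code does; in the Section 4 code it would amount to adding the same ridge to Cov(:,:,cm) after each M-step update):

% Assumed guard: a small ridge keeps a rank-deficient covariance invertible.
S = [1 1; 1 1];                 % degenerate estimate; det(S) == 0
S_reg = S + 1e-6*eye(2);        % regularized version
disp([det(S) det(S_reg)]);      % 0 versus a small positive value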

4. The following is the EM code I implemented in MATLAB.

close all;
clear;
clc;
% Author: Liu Weimin
% Unit: Institute of Computing, Chinese Academy of Sciences
% Email: liuwm@ics.ict.ac.cn
% Function: a typical EM implementation for a two-component 2-D Gaussian mixture
% Reference: pattern.recognition.and.machine.learning.pdf
% Date: 20091118

M = 2;       % number of Gaussian components
N = 50000;   % total number of data samples
TH = 0.1;    % convergence threshold
nit = 200;   % maximal number of iterations per repetition
nrep = 10;   % number of repetitions used to search for the global maximum
K = 2;       % dimension of the data
Pi = 3.141592653589793;  % in case pi has been overwritten by a same-name variable

% parameters of the random signal generator (ground truth, set manually)
a_real = [1/2; 1/2];
mu_real = [3 7;
           7 3];
cov_real(:,:,1) = [0.5 0;
                   0 0.5];
cov_real(:,:,2) = [0.5 0;
                   0 0.5];

% generate the data
X = [mvnrnd(mu_real(:,1)', cov_real(:,:,1), N*a_real(1))', ...
     mvnrnd(mu_real(:,2)', cov_real(:,:,2), N*a_real(2))'];

% resample any point that falls outside the box (0,10) x (0,10)
for cn = 1:N*a_real(1)
    while ~((X(1,cn) > 0) && (X(2,cn) > 0) && (X(1,cn) < 10) && (X(2,cn) < 10))
        X(:,cn) = mvnrnd(mu_real(:,1)', cov_real(:,:,1), 1)';
    end
end
for cn = N*a_real(1)+1:N
    while ~((X(1,cn) > 0) && (X(2,cn) > 0) && (X(1,cn) < 10) && (X(2,cn) < 10))
        X(:,cn) = mvnrnd(mu_real(:,2)', cov_real(:,:,2), 1)';
    end
end

% EM algorithm
f_best = -inf;
for crep = 1:nrep
    fprintf('searching local maximum %04d\n', crep);
    % parameter initialization; a small random perturbation is added so that
    % the nrep repetitions start from different points (an added assumption:
    % the original initialized these values once, outside this loop)
    a = [1/3, 2/3];
    mu = [2 4;
          4 7] + randn(K, M);
    Cov(:,:,1) = [5 0; 0 0.5];
    Cov(:,:,2) = [5 0; 0 0.5];
    for cit = 1:nit
        a_old = a;
        mu_old = mu;
        cov_old = Cov;
        % unnormalized component densities p(x_n | m)
        rznk_p = zeros(M, N);
        for cm = 1:M
            mu_cm = mu(:,cm);
            cov_cm = Cov(:,:,cm);
            for cn = 1:N
                rznk_p(cm,cn) = exp(-0.5*(X(:,cn)-mu_cm)'/cov_cm*(X(:,cn)-mu_cm));
            end
            rznk_p(cm,:) = rznk_p(cm,:)/sqrt(det(cov_cm));
        end
        rznk_p = rznk_p*(2*Pi)^(-K/2);
        % E step: compute the responsibilities rznk
        rznk = zeros(M, N);
        px = zeros(1, N);                 % mixture density of each sample
        for cn = 1:N
            pikn = a(:)'.*rznk_p(:,cn)';  % a_m * p(x_n | m), m = 1..M
            px(cn) = sum(pikn);
            rznk(:,cn) = pikn'/px(cn);
        end
        % M step: re-estimate the mixing weights, means, and covariances
        nk = sum(rznk, 2)';               % effective number of samples per component
        a = nk/N;
        for cm = 1:M
            rznk_sum_mu = zeros(K, 1);    % must be reset to zero for each component
            for cn = 1:N
                rznk_sum_mu = rznk_sum_mu + rznk(cm,cn)*X(:,cn);
            end
            mu(:,cm) = rznk_sum_mu/nk(cm);
        end
        for cm = 1:M
            rznk_sum_cov = zeros(K, K);
            for cn = 1:N
                rznk_sum_cov = rznk_sum_cov + rznk(cm,cn)*(X(:,cn)-mu(:,cm))*(X(:,cn)-mu(:,cm))';
            end
            Cov(:,:,cm) = rznk_sum_cov/nk(cm);
        end
        % relative change of the parameters; stop when it falls below TH
        t = max([norm(a_old(:)-a(:))/norm(a_old(:)); ...
                 norm(mu_old(:)-mu(:))/norm(mu_old(:)); ...
                 norm(cov_old(:)-Cov(:))/norm(cov_old(:))]);
        disp(t);
        if t < TH
            break;
        end
    end
    F = sum(log(px));                     % log-likelihood of this solution
    if F > f_best
        a_best = a;
        mu_best = mu;
        cov_best = Cov;
        f_best = F;
    end
end

% output results
disp('a_best = ');
disp(a_best);
disp('mu_best = ');
disp(mu_best);
disp('cov_best = ');
disp(cov_best);
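With the ground-truth settings above, the recovered a_best, mu_best, and cov_best should come out close to a_real, mu_real, and cov_real (up to a permutation of the two components), which is a convenient sanity check on the implementation.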
