The Expectation-Maximization (EM) Algorithm
The K-means algorithm is quite simple (see the previously published blog post) and can be readily understood by a careful reader. The EM algorithm introduced below is considerably harder, and it is closely related to maximum likelihood estimation.
1. Algorithm principle
Let's start with an example. Suppose we have 100 height measurements, drawn at random from a population. Common sense tells us that male heights follow some distribution (for example, a normal distribution) and female heights follow another distribution of the same family, but with different parameters. Not only do we not know the parameters of the two height distributions, we do not even know which of the 100 measurements come from males and which come from females. This matches the assumption of the clustering problem: apart from the data itself, we know nothing else, and our goal is to infer which category each data point belongs to. So for each sample there are two things to estimate: which distribution it comes from (male or female), and the parameters of the two height distributions.
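To make the setup concrete, here is a small R simulation of such a data set. It is only a sketch: the particular means, standard deviations, and 60/40 split are illustrative assumptions, not values from the original problem.

```r
set.seed(42)

# Hypothetical ground truth, for illustration only: 60 male heights and
# 40 female heights, in meters. In the actual problem we observe only
# `height`; the group labels are hidden from us.
male   <- rnorm(60, mean = 1.75, sd = 0.06)
female <- rnorm(40, mean = 1.62, sd = 0.05)
height <- sample(c(male, female))  # shuffled, so assignments are unknown
```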
In other words, we want to estimate two sets of quantities, call them A and B, and both are unknown at the start. But if we knew A we could derive B, and conversely, knowing B would give us A. One natural strategy is therefore: give A some initial value and use it to obtain an estimate of B, then use the current estimate of B to re-estimate A, and repeat this process until convergence. Does this sound familiar? Yes, this is exactly the essence of the K-means algorithm, so K-means in fact already contains the core idea of the EM algorithm.
EM stands for the expectation-maximization algorithm. In the height problem, you can start by simply guessing the parameters of the male height distribution: for example, assume the mean male height is 1.7 meters with a standard deviation of 0.1 meters. This is just a guess, and it will certainly not be accurate at first. But based on it, we can compute for each person whether their height is more likely to have come from the male or the female distribution. For example, a height of 1.75 meters clearly more plausibly belongs to the male distribution. In this way each data point gets a tentative assignment. Then, using maximum likelihood, we re-estimate the parameters of the male normal distribution from the points tentatively labeled male, and re-estimate the female distribution in the same way. Once the two distributions have been updated, the probability of each point under them changes again, so the assignments and parameters need to be adjusted once more. We iterate like this until the parameters essentially stop changing.
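The following R function is a minimal sketch of this iteration for a mixture of two univariate Gaussians. All names are my own, and, unlike the hard male/female labels in the description above, it uses the soft (probabilistic) assignments that EM proper works with.

```r
em_two_gaussians <- function(x, n_iter = 100) {
  # Crude initial guesses for the means, standard deviations, and weights
  mu    <- c(min(x), max(x))
  sigma <- c(sd(x), sd(x))
  w     <- c(0.5, 0.5)
  for (iter in 1:n_iter) {
    # E-step: responsibility of component 1 for each data point
    d1 <- w[1] * dnorm(x, mu[1], sigma[1])
    d2 <- w[2] * dnorm(x, mu[2], sigma[2])
    r  <- d1 / (d1 + d2)
    # M-step: weighted maximum-likelihood parameter updates
    mu    <- c(sum(r * x) / sum(r), sum((1 - r) * x) / sum(1 - r))
    sigma <- c(sqrt(sum(r * (x - mu[1])^2) / sum(r)),
               sqrt(sum((1 - r) * (x - mu[2])^2) / sum(1 - r)))
    w     <- c(mean(r), 1 - mean(r))
  }
  list(mean = mu, sd = sigma, weight = w, responsibility = r)
}

fit <- em_two_gaussians(height)  # `height` from the simulation above
fit$mean                         # should land near the two group means
```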
Before formally introducing the principle and execution process of the EM algorithm, we first need the concept of the marginal distribution.
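As a reminder, for a joint distribution $p(x, y)$, the marginal distribution of $x$ is obtained by summing (or, in the continuous case, integrating) out $y$:

$$
p(x) = \sum_{y} p(x, y) \qquad \text{or} \qquad p(x) = \int p(x, y)\, dy .
$$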
2. Convergence discussion
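The key fact underlying the convergence of EM, stated here for reference, is that each iteration never decreases the observed-data log-likelihood:

$$
\ell\bigl(\theta^{(t+1)}\bigr) \ge \ell\bigl(\theta^{(t)}\bigr), \qquad \ell(\theta) = \log p(X \mid \theta).
$$

The likelihood values therefore form a monotone sequence and, when bounded above, converge; in general the limit corresponds to a local rather than a global maximum.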
In the next article we will discuss the Gaussian mixture model (GMM), which can be seen as a concrete application of the EM algorithm, together with a data mining example in R.
To be continued ...