NG Machine Learning Video Notes (11): Theory of the K-Means Algorithm


(If you reproduce this article, please include a link back to it. --linhxx)

I. Overview

The K-means algorithm is an unsupervised learning algorithm whose core task is clustering: given a set of inputs, the algorithm groups them and outputs the resulting cluster assignments.

Since K-means is an unsupervised learning algorithm, its input differs from the algorithms covered earlier: the input consists only of the samples themselves, {x(1), x(2), ..., x(m)}, and each x has no corresponding label y(i). The algorithm itself must produce the cluster assignment for each x.

Common application scenarios for K-means include segmenting users by type in market analysis, analyzing relationships between users in social networks, designing computer cluster layouts, and analyzing the process of galaxy formation.

The difference between the inputs of unsupervised and supervised learning is illustrated below:

II. Basic Steps of the Algorithm

1. Prerequisites

Assume there are m data points that need to be divided into K clusters.

2. Steps

1) Randomly initialize K points to serve as the centers of the K clusters; each such point is called a cluster centroid. For example, when K = 2, two centroids are chosen at random.

2) Given the K centroids, go through all the samples, compute the distance from each sample to every centroid, and assign each sample to its nearest centroid.

3) Once every sample has been assigned, compute the mean of the samples in each cluster and move each cluster centroid to that mean.

4) Repeat steps 2 and 3 until the cluster centroids no longer move.

In summary, the steps are as follows:
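One way to express this loop in code is the following minimal NumPy sketch (the function and parameter names are my own choices, not from the original notes):

import numpy as np

def k_means(X, K, n_iters=100, seed=0):
    # Cluster the rows of X (an m x n array) into K groups.
    rng = np.random.default_rng(seed)
    m = X.shape[0]
    # Step 1: initialize the centroids at K randomly chosen samples.
    centroids = X[rng.choice(m, size=K, replace=False)]
    for _ in range(n_iters):
        # Step 2: assign each sample to the nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: move each centroid to the mean of its assigned samples
        # (a centroid with no samples is left where it is for now).
        new_centroids = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
            for k in range(K)
        ])
        # Step 4: stop once the centroids no longer move.
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids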

3. Special Circumstances

When the cluster centroids are chosen at random, it can happen that no sample is assigned to a particular centroid during an iteration. If a fixed number of clusters K is required, the initial centroids should be re-randomized and K-means re-run. For faster results, such empty centroids can instead simply be discarded, leaving K-1 clusters and avoiding the need to pick new random initial centroids.
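As a sketch of the re-initialization option (the helper name and the use of NumPy are my own assumptions):

import numpy as np

def reinit_empty_centroids(X, centroids, labels, rng):
    # Move any centroid that received no samples onto a randomly chosen sample,
    # so that exactly K clusters are kept. The faster alternative described above
    # is simply to delete such a centroid and continue with K-1 clusters.
    for k in range(len(centroids)):
        if not np.any(labels == k):
            centroids[k] = X[rng.integers(len(X))]
    return centroids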

III. The Cost Function

1. Symbols

In the K-means algorithm, uppercase K denotes the number of clusters (K = 3 means the samples are divided into 3 classes), lowercase k indexes the k-th cluster centroid, c(i) denotes the index of the centroid to which sample x(i) is assigned, μk denotes the position of the k-th centroid, and μc(i) denotes the position of the centroid to which x(i) is assigned.

For example, if x(i) is assigned to the 5th centroid, then c(i) = 5 and μc(i) = μ5.

2. Cost function

The cost function of the K-means algorithm, also known as the distortion function, is given by the following formula:
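In the notation above, the distortion function is:

J(c^{(1)},\dots,c^{(m)},\mu_1,\dots,\mu_K) = \frac{1}{m}\sum_{i=1}^{m}\left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2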

It can be shown from this cost function that:

1) Step 2 of the K-means algorithm (assigning each sample to its nearest centroid once the centroids are chosen) optimizes the cost function with respect to the assignments c while holding the μ values fixed.

2) Step 3 of the algorithm (recomputing each cluster centroid as the mean of the samples assigned to it) optimizes the cost function with respect to μ while holding the assignments c fixed.
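Written out, the two steps alternate between the following minimizations, where C_k denotes the set of sample indices currently assigned to cluster k:

\text{Step 2 (fix } \mu \text{):}\quad c^{(i)} := \arg\min_{k} \left\| x^{(i)} - \mu_k \right\|^2

\text{Step 3 (fix } c \text{):}\quad \mu_k := \frac{1}{|C_k|} \sum_{i \in C_k} x^{(i)}, \qquad C_k = \{\, i : c^{(i)} = k \,\}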

IV. Initializing the Cluster Centers

1. Prerequisites

The cluster centers are initialized randomly, under the condition that the number of clusters K is less than the number of samples m; otherwise the clustering is meaningless.

2. Steps

1) Randomly select K of the m samples.

2) Set μ1, μ2, ..., μK equal to these K samples.

In short, the cluster centers are initialized at randomly chosen samples, rather than at random positions anywhere in the coordinate space.
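A minimal sketch of this initialization, assuming the samples are the rows of a NumPy array X (the helper name is my own):

import numpy as np

def init_centroids(X, K, rng=None):
    # Pick K distinct samples from X and use them as the initial centroids.
    if rng is None:
        rng = np.random.default_rng()
    idx = rng.choice(X.shape[0], size=K, replace=False)
    return X[idx].copy()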

3. A Problem: Local Minima

The cost function of the K-means algorithm can also have local optima (local minima), which is bad for the algorithm, as shown below:

On the left are the samples to be clustered; the upper right shows the clustering one would expect from everyday intuition; the two results on the lower right are both local minima. In those cases, two of the randomly initialized centroids landed inside a region that should have formed a single cluster, so the subsequent optimization keeps improving along the wrong assignment.

4. Solutions

To avoid local minima, the K-means algorithm can be run more than once. Each time the clustering stabilizes, record the resulting assignment and the value of the cost function under that assignment (the sum of the squared distances from each point to its cluster centroid); then, among all runs, keep the clustering with the lowest cost as the final result.

As shown in the following:

Generally speaking, when the K-means algorithm is run more than 50 times, the misclassification caused by a local optimum can usually be avoided in the end.
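A sketch of this multiple-restart strategy, reusing the k_means function sketched in Section II (the distortion helper below computes the cost J defined above):

import numpy as np

def distortion(X, labels, centroids):
    # Average squared distance from each sample to its assigned centroid (the cost J).
    return np.mean(np.sum((X - centroids[labels]) ** 2, axis=1))

def k_means_with_restarts(X, K, n_restarts=50):
    best_cost, best_result = None, None
    for seed in range(n_restarts):
        labels, centroids = k_means(X, K, seed=seed)  # from the Section II sketch
        cost = distortion(X, labels, centroids)
        if best_cost is None or cost < best_cost:
            best_cost, best_result = cost, (labels, centroids)
    return best_result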

In addition, local optima generally arise only when K is relatively small (roughly K between 2 and 10). When K is large, local optima usually do not occur, or the gap between a local optimum and the best solution is small enough to be acceptable.

V. Determining the Number of Clusters K

When it is unclear how many clusters the data should be divided into, the number of clusters K has to be determined.

1. Method One: The Elbow Method

The elbow method works by varying K, plotting K against the cost function J, and finding the value of K at which the cost drops most sharply before flattening out (the plot resembles a human elbow); that K is taken as the desired number of clusters, as shown in the figure on the left.

The elbow method has a problem, however: as shown on the right, the K-versus-J plot may have no sharply bending point and instead decrease gently (it is impossible to tell from the plot which K is the elbow), in which case the elbow method cannot determine the best K.
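As an illustration, the elbow inspection can be sketched by computing the final cost for a range of K values and plotting it (this reuses the k_means_with_restarts and distortion helpers sketched above; matplotlib is assumed for plotting):

import matplotlib.pyplot as plt

def plot_elbow(X, k_values=range(1, 11)):
    # Run K-means for each candidate K and record the final cost J.
    costs = []
    for K in k_values:
        labels, centroids = k_means_with_restarts(X, K, n_restarts=10)
        costs.append(distortion(X, labels, centroids))
    plt.plot(list(k_values), costs, marker="o")
    plt.xlabel("K (number of clusters)")
    plt.ylabel("cost J")
    plt.show()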

2. Method Two

When the elbow method cannot determine K, the more common approach is to analyze the business scenario at hand and let the desired outcome of that scenario define the clustering.

For example, suppose T-shirt sizes need to be designed based on the distribution of people's heights and weights. The population could reasonably be divided into 3 sizes (left) or 5 sizes (right).

In that case, the best choice of K has to be identified through analysis of the business requirements and experience.

VI. Summary

As an unsupervised learning method, the K-means algorithm differs considerably in its way of thinking and analysis from supervised learning algorithms. It should be clear that supervised learning is given the classification results (labels) in advance, while unsupervised learning is not; they serve different business scenarios, so their design ideas are inevitably quite different.

--written by Linhxx

For more recent articles, follow the public account "machine learning" or scan the QR code on the right.

