The K-means algorithm is a clustering algorithm, and clustering is of course unsupervised. Given an initial data set $\left\{x_i\right\}_{i=1}^n$, K-means divides the data into $K$ clusters, each cluster representing a different category. The K-means algorithm is as follows:
1. Select $K$ centroids $\left\{\mu_k\right\}_{k=1}^K$ from the training set $\left\{x_i\right\}_{i=1}^n$;
2. Repeat the process until it converges:
2.1 For each sample $x_i$, determine its category $c_i$: \[c_i = \arg\min_k \|x_i-\mu_k\|^2\]
2.2 For each cluster $k$, recalculate the centroid: \[\mu_k=\frac{\sum_{i=1}^n 1\left\{c_i = k\right\}x_i}{\sum_{i=1}^n 1\left\{c_i = k\right\}}\]
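As a minimal sketch, the two steps above in NumPy (the function names are my own illustration, and empty clusters are deliberately left unhandled to keep it short):

```python
import numpy as np

def assign_clusters(X, centroids):
    # Step 2.1: c_i = argmin_k ||x_i - mu_k||^2
    # dists[i, k] = squared distance from sample i to centroid k
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def update_centroids(X, labels, K):
    # Step 2.2: mu_k = mean of the samples currently assigned to cluster k
    # (a cluster that ends up empty would need special handling)
    return np.array([X[labels == k].mean(axis=0) for k in range(K)])
```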
When clustering is complete, $\left\{C_k\right\}_{k=1}^K$ denotes the $K$ clusters. A loss can be defined to measure the quality of the clustering, and this loss serves as the stopping condition for the iteration. It has the following form:
\[J = \sum_{k}\sum_{x_i\in C_k}\|x_i-\mu_k\|^2\]
Iteration can be stopped when the loss $J$ barely changes between two consecutive iterations, or when the cluster assignment of the samples no longer changes.
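Continuing the sketch above, a full loop that stops once $J$ barely changes (the tolerance `tol` and `max_iter` are illustrative choices, not from the text):

```python
def kmeans(X, K, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct samples as the initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    prev_loss = np.inf
    for _ in range(max_iter):
        labels = assign_clusters(X, centroids)
        centroids = update_centroids(X, labels, K)
        # J = sum of squared distances of every sample to its centroid
        loss = ((X - centroids[labels]) ** 2).sum()
        if prev_loss - loss < tol:  # J basically unchanged: stop
            break
        prev_loss = loss
    return labels, centroids, loss
```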
K-means itself is very simple, but two key problems arise in practical applications: the choice of the value of $K$ and the choice of the initial centroids. They are discussed separately below:
K-value selection:
1) Elbow method: when the chosen $K$ is smaller than the true number of clusters, each increase of $K$ by 1 greatly reduces the cost; when $K$ is larger than the true value, each increase of $K$ by 1 changes the cost far less noticeably. The correct $K$ therefore sits at the turning point of the cost-versus-$K$ curve, which looks like an elbow.
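A sketch of the elbow method with scikit-learn, which exposes the loss $J$ as `inertia_` (the candidate range 1 to 10 is an arbitrary illustration):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, k_range=range(1, 11)):
    # cost J for each candidate K; scikit-learn calls this quantity inertia_
    costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
             for k in k_range]
    plt.plot(list(k_range), costs, marker="o")
    plt.xlabel("K")
    plt.ylabel("cost J")
    plt.show()
```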
2) Bayesian Information Criterion (BIC):
The BIC is calculated as follows:
\[BIC = -2\ln(\text{likelihood}) + \ln(N)\times k\]
where $N$ is the number of samples in the data set and $k$ is the number of model parameters. BIC measures both the goodness of fit and the complexity of a model: the $-2\ln(\text{likelihood})$ term measures the fit, and the larger its value, the worse the fit; the $\ln(N)\times k$ term measures the model complexity.
The likelihood function is generally computed from probabilities, i.e. $L(\theta)=\prod_i p(y_i\mid x_i;\theta)$, a small value between 0 and 1. From the shape of the $\ln$ function, $\ln(\text{likelihood})$ is negative, and the smaller the likelihood, the larger the absolute value of that negative number; hence the larger $-2\ln(\text{likelihood})$ is, the worse the model fits. If the model has a likelihood function (such as a GMM), BIC, DIC and similar criteria can be used directly; even when there is no likelihood function, as with K-means, a pseudo-likelihood can be constructed, for example by fitting a GMM instead.
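For instance, scikit-learn's `GaussianMixture` exposes BIC directly, so selecting $K$ by this criterion can be sketched as follows (the candidate range is again an arbitrary choice):

```python
from sklearn.mixture import GaussianMixture

def best_k_by_bic(X, k_range=range(1, 11)):
    # Fit one GMM per candidate K and keep the K with the lowest BIC
    bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
            for k in k_range}
    return min(bics, key=bics.get)
```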
Initial centroid selection:
1) The k-means++ method first selects a point uniformly at random as the first initial centroid; each subsequent centroid is then drawn from the remaining points with probability proportional to its squared distance $D(x)^2$ to the nearest centroid already chosen, so points far from the existing centroids are strongly favored; and so on, until $K$ initial centroids have been selected.
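A NumPy sketch of this $D^2$ sampling (names are illustrative):

```python
def kmeans_pp_init(X, K, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: one sample chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # D(x)^2: squared distance of each sample to its nearest chosen centroid
        d2 = (((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2)
              .sum(axis=2).min(axis=1))
        # Next centroid: drawn with probability proportional to D(x)^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```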
2) The k-means|| algorithm selects multiple points per round as candidate centroids (candidates that may become centroids later); after several rounds, enough candidates will have been selected. The number of candidates is much larger than $K$, and the number selected per round is generally large (e.g. 1000 per round), so the number of rounds is much smaller than the $K$ rounds of k-means++ and the computation is far more efficient. Finally, the candidate set $C$ is itself clustered (for example with k-means++), and the $K$ centroids of that clustering are used as the $K$ initial centroids for the original data. This not only speeds up the selection of centroids but also tends to place the $K$ chosen centroids better, because each of them is itself the centroid of a cluster. The k-means|| algorithm is also better suited to parallel computation than k-means++, because it does not require strictly selecting $K$ points as centroids, only a preselection: the sampling steps can be distributed across different machines, and a reducer finally gathers all candidate points and clusters them, with results similar to the non-parallel computation.
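A compact sketch of the sampling phase (the oversampling factor `ell` and the number of rounds are tunable assumptions; the final pass reuses `kmeans_pp_init` from above, whereas the full algorithm would first weight each candidate by the number of samples it attracts):

```python
def kmeans_ii_init(X, K, ell=None, rounds=5, seed=0):
    rng = np.random.default_rng(seed)
    ell = ell or 2 * K                    # expected candidates added per round
    C = X[rng.integers(len(X))][None, :]  # start from one random point
    for _ in range(rounds):
        # cost of each sample: squared distance to the nearest candidate
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # keep each point independently with probability ~ ell * d2 / total cost
        keep = rng.random(len(X)) < np.minimum(1.0, ell * d2 / d2.sum())
        C = np.vstack([C, X[keep]])
    # Cluster the (small) candidate set down to K points; the original paper
    # weights candidates by cluster size first, omitted here for brevity.
    return kmeans_pp_init(C, K, seed=seed)
```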
If the clustering result is still unsatisfactory, K-means can simply be rerun several times with different initial centroids, keeping the run with the lowest loss $J$.
Reference documents
- http://kunlun.info/2016/03/25/kmeans_essential/
- http://www.cnblogs.com/washa/p/4027284.html (K-value selection)
- http://www.cnblogs.com/kemaswill/archive/2013/01/26/2877434.html (initial centroid selection)
- http://www.xuebuyuan.com/2096905.html
- http://blog.jqian.net/post/k-means.html (K-means algorithm)