K-means algorithm


K-means is a clustering algorithm; clustering is of course unsupervised. Given a data set $\left\{x_i\right\}_{i=1}^n$, K-means partitions the data into $K$ clusters, each cluster representing a different category. The algorithm is as follows:

1. Select $K$ initial centroids $\left\{\mu_k\right\}_{k=1}^K$ from the training set $\left\{x_i\right\}_{i=1}^n$;

2. Repeat the process until it converges:

2.1 For each sample $x_i$, determine its category $c_i$: \[c_i = \arg \min_k \|x_i - \mu_k\|^2.\]

2.2 For each cluster $k$, recalculate the centroid: \[\mu_k = \frac{\sum_{i=1}^n 1\left\{c_i = k\right\} x_i}{\sum_{i=1}^n 1\left\{c_i = k\right\}}\]
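One assignment/update pass (steps 2.1 and 2.2) can be sketched in numpy; the toy data and the choice $K = 2$ are illustrative, not from the text:

```python
import numpy as np

# Toy data: two well-separated 2-D blobs (illustrative).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, (50, 2)),
               rng.normal(5.0, 0.5, (50, 2))])
K = 2
mu = X[rng.choice(len(X), K, replace=False)]  # step 1: pick K centroids from the data

# Step 2.1: assign each sample x_i to the nearest centroid -> c_i.
dist = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)  # shape (n, K)
c = dist.argmin(axis=1)

# Step 2.2: recompute each centroid as the mean of its assigned samples.
mu = np.array([X[c == k].mean(axis=0) for k in range(K)])
```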

When clustering is complete, $\left\{c_k\right\}_{k=1}^K$ denotes the $K$ clusters. A loss can be defined to measure the quality of the clustering and used as the stopping condition for the iteration; it has the following form:

\[J = \sum_{k}\sum_{x_i \in c_k} \|x_i - \mu_k\|^2\]

Iteration can be stopped when the loss $J$ barely changes between two consecutive iterations, or when the membership of each cluster no longer changes substantially.
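The full loop with this $J$-based stopping rule can be sketched as follows (the tolerance, iteration cap, and seed are illustrative choices):

```python
import numpy as np

def kmeans(X, K, tol=1e-6, max_iter=100, seed=0):
    """Plain K-means; stops when the loss J barely changes between iterations."""
    rng = np.random.default_rng(seed)
    mu = X[rng.choice(len(X), K, replace=False)].copy()
    prev_J = np.inf
    for _ in range(max_iter):
        # Assignment step: nearest centroid for every sample.
        dist = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        c = dist.argmin(axis=1)
        # Loss J: sum of squared distances to the assigned centroids.
        J = (dist[np.arange(len(X)), c] ** 2).sum()
        if prev_J - J < tol:          # loss basically unchanged -> stop
            break
        prev_J = J
        # Update step: move each centroid to the mean of its cluster.
        for k in range(K):
            if (c == k).any():        # guard against empty clusters
                mu[k] = X[c == k].mean(axis=0)
    return mu, c, J
```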

K-means is very simple, but two key problems arise in practical application: the choice of the value of $K$ and the choice of the initial centroids. These are discussed separately below:

K-Value selection:

1) Elbow method: when the chosen $K$ is smaller than the true number of clusters, each increase of $K$ by 1 reduces the cost greatly; when $K$ exceeds the true value, each further increase of $K$ by 1 reduces the cost much less noticeably. The correct $K$ therefore sits at this turning point, which resembles an elbow.
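The elbow can be seen numerically. A sketch (the blob data, restart count, and iteration budget are illustrative): with three well-separated clusters, the cost drops sharply up to $K = 3$ and flattens afterwards.

```python
import numpy as np

def kmeans_cost(X, K, n_iter=20, n_init=5, seed=0):
    """Run basic K-means n_init times and return the best final cost J."""
    best = np.inf
    for s in range(n_init):
        rng = np.random.default_rng(seed + s)
        mu = X[rng.choice(len(X), K, replace=False)]
        for _ in range(n_iter):
            d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
            c = d.argmin(axis=1)
            # Keep the old centroid if a cluster happens to be empty.
            mu = np.array([X[c == k].mean(axis=0) if (c == k).any() else mu[k]
                           for k in range(K)])
        d = np.linalg.norm(X[:, None, :] - mu[None, :, :], axis=2)
        best = min(best, (d.min(axis=1) ** 2).sum())
    return best

# Three well-separated blobs: the true K is 3, so the elbow appears at K=3.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(m, 0.3, (40, 2)) for m in (0.0, 5.0, 10.0)])
costs = {K: kmeans_cost(X, K) for K in range(1, 7)}
```

Plotting `costs` against $K$ would show the elbow directly; here the drop from $K=2$ to $K=3$ is much larger than the drop from $K=3$ to $K=4$.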

2) BIC (Bayesian Information Criterion), which is calculated as follows:

\[\mathrm{BIC} = -2\ln(\text{likelihood}) + \ln(N) \times k\]

where $N$ is the number of samples in the data set and $k$ is the number of model parameters. BIC measures both the fit and the complexity of a model: the $-2\ln(\text{likelihood})$ term measures the fit (the larger its value, the worse the fit), while $\ln(N) \times k$ measures the model complexity.

The likelihood is generally computed as a product of probabilities, $L(\theta) = \prod_i p(y_i \mid x_i; \theta)$, a small value between 0 and 1. From the shape of the $\ln$ function, $\ln(\text{likelihood})$ is negative, and the smaller the likelihood, the larger the absolute value of that negative number, so the larger $-2\ln(\text{likelihood})$ is, the worse the fit. If the model has a likelihood function (such as a GMM), criteria like BIC or DIC can be used for the decision directly; even models without one, such as K-means, can use a pseudo-likelihood, for example by fitting a GMM in its place.
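The formula itself is a one-liner; a sketch with made-up illustrative log-likelihood values (the specific numbers are not from the text):

```python
import math

def bic(log_likelihood, n, k):
    """BIC = -2*ln(likelihood) + ln(N)*k: a fit term plus a complexity penalty."""
    return -2.0 * log_likelihood + math.log(n) * k

# Hypothetical comparison: a better fit (higher log-likelihood) lowers BIC,
# while more parameters k raise it, so BIC trades fit against complexity.
simple_model = bic(-120.0, n=100, k=4)    # worse fit, fewer parameters
complex_model = bic(-100.0, n=100, k=12)  # better fit, more parameters
```

The model with the lower BIC is preferred; the penalty keeps a marginally better fit from always winning.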

Initial centroid selection:

1) The k-means++ method first randomly selects one point as the first initial cluster center, then selects the point farthest from it as the second initial center, then the point whose distance to the nearest of the first two centers is largest as the third, and so on, until $K$ initial cluster centers have been selected. (In the standard k-means++ algorithm the next center is actually sampled with probability proportional to this squared distance, rather than chosen deterministically as the farthest point.)
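A minimal sketch of the probabilistic variant of this seeding (data and seed are illustrative):

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """k-means++ seeding: each new centroid is drawn with probability
    proportional to its squared distance to the nearest centroid chosen so far."""
    rng = np.random.default_rng(seed)
    centroids = [X[rng.integers(len(X))]]          # first centroid: uniform
    for _ in range(K - 1):
        # Squared distance from every sample to its nearest chosen centroid.
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centroids], axis=0)
        probs = d2 / d2.sum()
        centroids.append(X[rng.choice(len(X), p=probs)])
    return np.array(centroids)
```

Far-away points get large sampling weight, so the initial centroids tend to spread across the data.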

2) The k-means|| algorithm selects multiple points per round as candidate centroids (candidates that may later become centroids); after $n$ rounds, enough candidates have been selected. The number of candidates is much larger than $K$, and the number selected per round is generally large (say, 1000 per round), so the number of rounds is much smaller than $K$ and the computation is far more efficient. Finally, the candidate set $C$ is clustered (for example with the k-means++ algorithm), and the $K$ centroids of that clustering are used as the $K$ initial centroids for the original data. This not only speeds up centroid selection, but the chosen $K$ centroids also tend to be better placed, because they are themselves the result of a clustering. The k-means|| algorithm is better suited to parallel computation than k-means++, because it does not require strictly selecting $K$ points as centroids, only a preselection: the sampling rounds can be distributed across different machines, all selected points gathered together in a reducer and then clustered, and the result is similar to the non-parallel computation.
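A much-simplified single-machine sketch of the idea (the oversampling factor, round count, and the final weight-based reduction are illustrative stand-ins; the actual algorithm reclusters the weighted candidates, e.g. with k-means++):

```python
import numpy as np

def kmeans_ll_init(X, K, ell=8, rounds=5, seed=0):
    """Simplified k-means||-style seeding: oversample roughly ell candidates
    per round, then reduce the (much larger than K) candidate set to K."""
    rng = np.random.default_rng(seed)
    C = X[rng.integers(len(X))][None, :]           # one random starting point
    for _ in range(rounds):
        # Squared distance from every sample to its nearest candidate so far.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        cost = d2.sum()
        # Each point is kept independently with probability ~ ell * d2 / cost,
        # so each round adds roughly ell new candidates (the parallel step).
        keep = rng.random(len(X)) < np.minimum(1.0, ell * d2 / cost)
        if keep.any():
            C = np.vstack([C, X[keep]])
    # Reduction step: weight each candidate by the number of points nearest to
    # it, then pick the K heaviest candidates (the real algorithm reclusters
    # the weighted candidates with k-means++ instead).
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    weights = np.bincount(d2.argmin(axis=1), minlength=len(C))
    return C[np.argsort(-weights)[:K]]
```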


Reference documents

http://kunlun.info/2016/03/25/kmeans_essential/

http://www.cnblogs.com/washa/p/4027284.html (K value selection)

http://www.cnblogs.com/kemaswill/archive/2013/01/26/2877434.html (selection of initial centroids)

http://www.xuebuyuan.com/2096905.html

http://blog.jqian.net/post/k-means.html
