The K-means algorithm is a clustering algorithm, and clustering is of course unsupervised. Given an initial data set $\left\{x_i\right\}_{i=1}^n$, K-means divides the data into $K$ clusters, each cluster representing a different category. The K-means algorithm is as follows:
1. Select $K$ centroids $\left\{\mu_k\right\}_{k=1}^K$ from the training set $\left\{x_i\right\}_{i=1}^n$;
2. Repeat the process until it converges:
2.1 For each sample $x_i$, determine its category $c_i$: \[c_i = \arg\min_k \|x_i-\mu_k\|^2\]
2.2 For each cluster $k$, recalculate the centroid: \[\mu_k=\frac{\sum_{i=1}^n 1\left\{c_i = k\right\}x_i}{\sum_{i=1}^n 1\left\{c_i = k\right\}}\]
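As a minimal sketch, the two steps above in NumPy (the function names are my own illustration, and empty clusters are deliberately left unhandled to keep it short):

```python
import numpy as np

def assign_clusters(X, centroids):
    # Step 2.1: c_i = argmin_k ||x_i - mu_k||^2
    # dists[i, k] = squared distance from sample i to centroid k
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def update_centroids(X, labels, K):
    # Step 2.2: mu_k = mean of the samples currently assigned to cluster k
    # (a cluster that ends up empty would need special handling)
    return np.array([X[labels == k].mean(axis=0) for k in range(K)])
```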
When clustering is complete, $\left\{C_k\right\}_{k=1}^K$ denotes the $K$ clusters. A loss can be defined to measure the quality of the clustering, and this loss serves as the stopping condition for the iteration. It has the following form:
\[J = \sum_{k}\sum_{x_i\in C_k}\|x_i-\mu_k\|^2\]
Iteration can be stopped when the loss $J$ barely changes between two consecutive iterations, or when the cluster assignment of the samples no longer changes.
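Continuing the sketch above, a full loop that stops once $J$ barely changes (the tolerance `tol` and `max_iter` are illustrative choices, not from the text):

```python
def kmeans(X, K, max_iter=100, tol=1e-6, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K distinct samples as the initial centroids
    centroids = X[rng.choice(len(X), size=K, replace=False)]
    prev_loss = np.inf
    for _ in range(max_iter):
        labels = assign_clusters(X, centroids)
        centroids = update_centroids(X, labels, K)
        # J = sum of squared distances of every sample to its centroid
        loss = ((X - centroids[labels]) ** 2).sum()
        if prev_loss - loss < tol:  # J basically unchanged: stop
            break
        prev_loss = loss
    return labels, centroids, loss
```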
K-means itself is very simple, but two key problems arise in practical applications: the choice of the value of $K$ and the choice of the initial centroids. They are discussed separately below:
K-value selection:
1) Elbow method: when the chosen $K$ is smaller than the true number of clusters, each increase of $K$ by 1 greatly reduces the cost; when $K$ is larger than the true value, each increase of $K$ by 1 changes the cost far less noticeably. The correct $K$ therefore sits at the turning point of the cost-versus-$K$ curve, which looks like an elbow.
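A sketch of the elbow method with scikit-learn, which exposes the loss $J$ as `inertia_` (the candidate range 1 to 10 is an arbitrary illustration):

```python
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans

def elbow_plot(X, k_range=range(1, 11)):
    # cost J for each candidate K; scikit-learn calls this quantity inertia_
    costs = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_
             for k in k_range]
    plt.plot(list(k_range), costs, marker="o")
    plt.xlabel("K")
    plt.ylabel("cost J")
    plt.show()
```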
2) Bayesian Information Criterion (BIC):
The BIC is calculated as follows:
\[BIC = -2\ln(\text{likelihood}) + \ln(N)\times k\]
where $N$ is the number of samples in the data set and $k$ is the number of model parameters. BIC measures both the goodness of fit and the complexity of a model: the $-2\ln(\text{likelihood})$ term measures the fit, and the larger its value, the worse the fit; the $\ln(N)\times k$ term measures the model complexity.
The likelihood function is generally computed from probabilities, i.e. $L(\theta)=\prod_i p(y_i\mid x_i;\theta)$, a small value between 0 and 1. From the shape of the $\ln$ function, $\ln(\text{likelihood})$ is negative, and the smaller the likelihood, the larger the absolute value of that negative number; hence the larger $-2\ln(\text{likelihood})$ is, the worse the model fits. If the model has a likelihood function (such as a GMM), BIC, DIC and similar criteria can be used directly; even when there is no likelihood function, as with K-means, a pseudo-likelihood can be constructed, for example by fitting a GMM instead.
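For instance, scikit-learn's `GaussianMixture` exposes BIC directly, so selecting $K$ by this criterion can be sketched as follows (the candidate range is again an arbitrary choice):

```python
from sklearn.mixture import GaussianMixture

def best_k_by_bic(X, k_range=range(1, 11)):
    # Fit one GMM per candidate K and keep the K with the lowest BIC
    bics = {k: GaussianMixture(n_components=k, random_state=0).fit(X).bic(X)
            for k in k_range}
    return min(bics, key=bics.get)
```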
Initial centroid selection:
1) The k-means++ method first selects a point uniformly at random as the first initial centroid; each subsequent centroid is then drawn from the remaining points with probability proportional to its squared distance $D(x)^2$ to the nearest centroid already chosen, so points far from the existing centroids are strongly favored; and so on, until $K$ initial centroids have been selected.
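A NumPy sketch of this $D^2$ sampling (names are illustrative):

```python
def kmeans_pp_init(X, K, seed=0):
    rng = np.random.default_rng(seed)
    # First centroid: one sample chosen uniformly at random
    centroids = [X[rng.integers(len(X))]]
    for _ in range(K - 1):
        # D(x)^2: squared distance of each sample to its nearest chosen centroid
        d2 = (((X[:, None, :] - np.array(centroids)[None, :, :]) ** 2)
              .sum(axis=2).min(axis=1))
        # Next centroid: drawn with probability proportional to D(x)^2
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```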
2) The k-means|| algorithm selects multiple points per round as candidate centroids (candidates that may become centroids later); after several rounds, enough candidates will have been selected. The number of candidates is much larger than $K$, and the number selected per round is generally large (e.g. 1000 per round), so the number of rounds is much smaller than the $K$ rounds of k-means++ and the computation is far more efficient. Finally, the candidate set $C$ is itself clustered (for example with k-means++), and the $K$ centroids of that clustering are used as the $K$ initial centroids for the original data. This not only speeds up the selection of centroids but also tends to place the $K$ chosen centroids better, because each of them is itself the centroid of a cluster. The k-means|| algorithm is also better suited to parallel computation than k-means++, because it does not require strictly selecting $K$ points as centroids, only a preselection: the sampling steps can be distributed across different machines, and a reducer finally gathers all candidate points and clusters them, with results similar to the non-parallel computation.
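A compact sketch of the sampling phase (the oversampling factor `ell` and the number of rounds are tunable assumptions; the final pass reuses `kmeans_pp_init` from above, whereas the full algorithm would first weight each candidate by the number of samples it attracts):

```python
def kmeans_ii_init(X, K, ell=None, rounds=5, seed=0):
    rng = np.random.default_rng(seed)
    ell = ell or 2 * K                    # expected candidates added per round
    C = X[rng.integers(len(X))][None, :]  # start from one random point
    for _ in range(rounds):
        # cost of each sample: squared distance to the nearest candidate
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # keep each point independently with probability ~ ell * d2 / total cost
        keep = rng.random(len(X)) < np.minimum(1.0, ell * d2 / d2.sum())
        C = np.vstack([C, X[keep]])
    # Cluster the (small) candidate set down to K points; the original paper
    # weights candidates by cluster size first, omitted here for brevity.
    return kmeans_pp_init(C, K, seed=seed)
```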
If the clustering result is still unsatisfactory, K-means can simply be rerun several times with different initial centroids, keeping the run with the lowest loss $J$.
Reference documents
- http://kunlun.info/2016/03/25/kmeans_essential/
- http://www.cnblogs.com/washa/p/4027284.html (K-value selection)
- http://www.cnblogs.com/kemaswill/archive/2013/01/26/2877434.html (initial centroid selection)
- http://www.xuebuyuan.com/2096905.html
- http://blog.jqian.net/post/k-means.html (K-means algorithm)