Introduction to and Comparison of Three Improvements to the K-means Clustering Algorithm (K-means++, ISODATA, Kernel K-means)


I. Overview

In this article, four clustering algorithms (K-means, K-means++, ISODATA, and Kernel K-means) are described in detail, and a data set is used to illustrate the differences between them.

The first thing to be clear about is that all four of these algorithms are "hard clustering" algorithms: each sample in the data set is assigned 100% to exactly one category. By contrast, "soft clustering" can be understood as assigning each sample a probability of belonging to each category.

First, the relationships among the four algorithms are briefly outlined; readers who already know the classical K-means algorithm should find them intuitive. Readers who are not familiar with K-means can read the introduction to the classical K-means algorithm below and then return to this section.

(1) K-means and K-means++: The original K-means algorithm initially selects K points in the data set at random as cluster centers, while K-means++ selects the K cluster centers as follows: assuming that n initial cluster centers (0 < n < K) have already been selected, when choosing the (n+1)-th cluster center, points farther away from the current n cluster centers have a higher probability of being selected. The first cluster center (n = 1) is still chosen uniformly at random. This matches our intuition: cluster centers should be as far apart from each other as possible. The improvement is straightforward but very effective.

(2) K-means and ISODATA: The full name of ISODATA is the Iterative Self-Organizing Data Analysis Technique. In K-means, the value of K must be fixed in advance by hand and cannot change during the algorithm. For high-dimensional, massive data sets it is often difficult to estimate K accurately. ISODATA addresses this problem, and its idea is very intuitive: when the number of samples belonging to a category is too small, that category is removed; when a category contains too many samples that are too dispersed, it is split into two subcategories.

(3) K-means and Kernel K-means: Traditional K-means uses Euclidean distance to measure the similarity between samples, and obviously not all data sets are suited to this measure. Borrowing the idea of kernel functions from support vector machines, the clustering result can be improved by mapping all samples into another feature space and then clustering there. This article does not introduce Kernel K-means in detail.
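Still, as a rough illustration of the idea, here is a minimal sketch of one Kernel K-means assignment step, assuming an RBF kernel. The kernel choice, variable names, and the fact that empty clusters are not handled are all illustrative assumptions; the point is only that the feature-space distance to a cluster mean can be computed entirely from the kernel matrix.

% Minimal sketch of one kernel K-means assignment step (RBF kernel assumed).
% X: n-by-d data, labels: n-by-1 current cluster indices, k: number of clusters.
function labels = kernel_kmeans_assign(X, labels, k, sigma)
    n = size(X, 1);
    % Gram matrix of the RBF kernel: K(i,j) = exp(-||xi - xj||^2 / (2*sigma^2))
    sq = sum(X.^2, 2);
    D2 = max(sq + sq' - 2 * (X * X'), 0);        % pairwise squared Euclidean distances
    K = exp(-D2 / (2 * sigma^2));
    dist = zeros(n, k);                          % feature-space distance to each cluster mean
    for c = 1:k
        idx = (labels == c);
        nc = sum(idx);                           % (empty clusters are not handled in this sketch)
        % ||phi(x_i) - m_c||^2 = K(i,i) - 2/nc * sum_j K(i,j) + 1/nc^2 * sum_{j,l} K(j,l)
        dist(:, c) = diag(K) - (2 / nc) * sum(K(:, idx), 2) + sum(sum(K(idx, idx))) / nc^2;
    end
    [~, labels] = min(dist, [], 2);              % reassign each sample to the nearest mean
end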

As can be seen, the three improvements to K-means above approach it from different angles and are therefore quite representative. The most widely used of them is probably the K-means++ algorithm (for example, there was a NIPS paper on K-means++ at the end of 2016; interested readers can look it up).

II. The Classical K-means Algorithm

The algorithm is described below and is clear and easy to understand. The classical K-means algorithm appears at the beginning of almost every unsupervised learning tutorial, so it is not elaborated on further here.

Figure 1. Classical K-means algorithm
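For concreteness, here is a minimal MATLAB sketch of the classical K-means loop described in Figure 1 (random initialization, nearest-center assignment, mean update). The function name, variable names, and stopping rule are illustrative choices, not the code released with this article.

% Minimal sketch of classical K-means. X: n-by-d data, k: number of clusters.
function [centroid, labels] = simple_kmeans(X, k, max_iter)
    n = size(X, 1);
    centroid = X(randperm(n, k), :);             % step 1: pick k random samples as initial centers
    for it = 1:max_iter
        % step 2: assign each sample to the nearest cluster center
        dist = zeros(n, k);
        for c = 1:k
            dist(:, c) = sum((X - centroid(c, :)).^2, 2);
        end
        [~, labels] = min(dist, [], 2);
        % step 3: recompute each center as the mean of its assigned samples
        new_centroid = centroid;
        for c = 1:k
            if any(labels == c)
                new_centroid(c, :) = mean(X(labels == c, :), 1);
            end
        end
        % step 4: stop when the centers no longer move
        if max(abs(new_centroid(:) - centroid(:))) < 1e-10
            centroid = new_centroid;
            break;
        end
        centroid = new_centroid;
    end
end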

It is worth mentioning that there is a feasible method for choosing the number of clusters (the value of K), called the Elbow method: plot the K-means cost function against the number of clusters K, and choose the K at the inflection point of the curve as the best number of cluster centers. This method is not covered in depth here, because such a clear inflection point is rarely seen in practice. The more commonly advocated practice is to specify a reasonable K manually based on the actual problem, and to pick a satisfactory result from several runs with different random initializations of the cluster centers.
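A minimal sketch of the Elbow method follows, reusing the simple_kmeans helper sketched above; the range of K is arbitrary, and data is assumed to be an n-by-d sample matrix.

% Sketch of the Elbow method: plot the K-means cost (sum of squared distances
% of samples to their assigned centers) against K and look for an inflection point.
costs = zeros(1, 10);
for k = 1:10
    [centroid, labels] = simple_kmeans(data, k, 100);
    costs(k) = 0;
    for c = 1:k
        costs(k) = costs(k) + sum(sum((data(labels == c, :) - centroid(c, :)).^2));
    end
end
plot(1:10, costs, '-o');
xlabel('number of clusters K'); ylabel('K-means cost');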

III. The K-means++ Algorithm

K-means++, proposed by D. Arthur and others in 2007, improves the first step in Figure 1. The improvement can be understood intuitively as requiring that the K initial cluster centers be spread as far apart from each other as possible. The full algorithm is described as follows:

Figure 2. k-means++ algorithm

The following simple example shows how K-means++ selects the initial cluster centers. The data set contains 8 samples, with the distribution and corresponding index numbers shown below:

Figure 3. k-means++ Example

Suppose that point 6 is selected as the first initial cluster center in step 1 of Figure 2. Then D(x) for each sample, and the probability of each sample being selected as the second cluster center in step 2, are shown in the following table:

P(x) is the probability that each sample is selected as the next cluster center. The Sum row in the last line is the cumulative sum of P(x), which is used by the roulette-wheel method to select the second cluster center. The method is to generate a random number between 0 and 1 and determine which interval it falls into; the index corresponding to that interval gives the selected second cluster center. For example, the interval of point 1 is [0, 0.2] and the interval of point 2 is [0.2, 0.525].

From the table above it can be seen intuitively that the second initial cluster center will be one of points 1, 2, 3, or 4 with probability 0.9. These four points are exactly the four points farthest from the first initial cluster center, point 6. This validates the idea behind K-means++: points farther from the current cluster centers have a higher probability of being selected as the next cluster center. It can also be seen that K = 2 is appropriate for this example. When K is greater than 2, each sample has a distance to each of the already-chosen centers, and the minimum of these distances is taken as D(x).
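Below is a minimal sketch of this seeding procedure with roulette-wheel selection. Following Arthur and Vassilvitskii's paper, the selection probability here is taken proportional to D(x) squared; the function and variable names are illustrative.

% Minimal sketch of K-means++ seeding with roulette-wheel selection.
% X: n-by-d data, k: desired number of initial cluster centers.
function centroid = kmeanspp_seed(X, k)
    n = size(X, 1);
    centroid = zeros(k, size(X, 2));
    centroid(1, :) = X(randi(n), :);          % first center: chosen uniformly at random
    for c = 2:k
        % D(x)^2: squared distance from each sample to its nearest already-chosen center
        d2 = inf(n, 1);
        for j = 1:c-1
            d2 = min(d2, sum((X - centroid(j, :)).^2, 2));
        end
        p = d2 / sum(d2);                     % P(x): selection probability of each sample
        % roulette wheel: draw r in [0,1] and find the interval of cumsum(P) it falls into
        idx = find(cumsum(p) >= rand(), 1, 'first');
        centroid(c, :) = X(idx, :);
    end
end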

IV. The ISODATA Algorithm

The last and most complex is the ISODATA algorithm. As mentioned earlier, in K-means and K-means++ the number of cluster centers K is fixed and does not change. The ISODATA algorithm, in contrast, can adjust the number of cluster centers K according to the actual situation of each category: (1) a splitting operation, which increases the number of cluster centers, and (2) a merging operation, which decreases the number of cluster centers.

The inputs to the ISODATA algorithm are given first (the input data and the number of iterations are not listed separately):

[1] Expected number of cluster centers Ko: Although the number of cluster centers varies during an ISODATA run, a reference value still needs to be specified by the user. In fact, this value also determines the range of the number of cluster centers: the number of cluster centers in the final output lies in [Ko/2, 2Ko].

[2] Minimum number of samples per class Nmin: Used to decide whether a class whose samples are highly dispersed may be split. If splitting would cause a subclass to contain fewer than Nmin samples, the class is not split.

[3] Maximum variance Sigma: Used to measure the degree of dispersion of the samples within a category. When the dispersion of a category's samples exceeds this value, a splitting operation may be performed (note that the condition in [2] must also be satisfied).

[4] Minimum allowed distance between two cluster centers Dmin: If two categories are very close together (that is, the distance between their cluster centers is very small), the two categories need to be merged. Whether to merge is determined by the threshold Dmin.

Many readers will already have some idea of how the ISODATA algorithm proceeds after reading the inputs above. Indeed, the principle of ISODATA is very intuitive, but because it requires more parameters than the other two methods, and some of these parameters are difficult to set to reasonable values, ISODATA is not as popular in practice as K-means++.

First, the main body of the ISODATA algorithm is described, as shown below:

Figure 4. The main body of the ISODATA algorithm
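As a rough guide, the sketch below shows one common formulation of the ISODATA main loop; the exact conditions, their order, and the handling of small or empty classes are my own assumptions and may differ from Figure 4. The helpers isodata_merge and isodata_split are sketched below, after Figures 5 and 6.

% Rough sketch of one common formulation of the ISODATA main loop.
function [centroid, labels] = isodata_outline(X, Ko, iteration, Nmin, sigma_max, Dmin)
    centroid = X(randperm(size(X, 1), Ko), :);            % start from Ko random samples
    for it = 1:iteration
        labels = assign_to_nearest(X, centroid);          % nearest-center assignment, as in K-means
        % discard classes with fewer than Nmin samples and reassign their samples
        counts = accumarray(labels, 1, [size(centroid, 1), 1]);
        centroid = centroid(counts >= Nmin, :);
        labels = assign_to_nearest(X, centroid);
        % update every remaining center to the mean of its samples
        % (empty classes are not handled in this sketch)
        for c = 1:size(centroid, 1)
            centroid(c, :) = mean(X(labels == c, :), 1);
        end
        counts = accumarray(labels, 1, [size(centroid, 1), 1]);
        % adjust the number of centers: split when there are too few, merge when too many
        if size(centroid, 1) <= Ko / 2
            centroid = isodata_split(X, labels, centroid, Nmin, sigma_max);
        elseif size(centroid, 1) >= 2 * Ko
            centroid = isodata_merge(centroid, Dmin, counts);
        end
    end
end

function labels = assign_to_nearest(X, centroid)
    d = zeros(size(X, 1), size(centroid, 1));
    for c = 1:size(centroid, 1)
        d(:, c) = sum((X - centroid(c, :)).^2, 2);
    end
    [~, labels] = min(d, [], 2);
end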

What the description above does not spell out is the splitting operation in step 5 and the merging operation in step 6. The merging operation is described first:

Figure 5. The merging operation of the ISODATA algorithm
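Here is a sketch of one way to implement the merging operation: any pair of centers closer than Dmin is replaced by the mean of the two centers weighted by their sample counts. Merging each center at most once per call is a simplifying assumption of this sketch, as is the function signature.

% Sketch of the merging operation of ISODATA.
% centroid: current k-by-d cluster centers, counts: samples per class.
function centroid = isodata_merge(centroid, Dmin, counts)
    k = size(centroid, 1);
    merged = false(k, 1);
    keep = true(k, 1);
    for i = 1:k-1
        for j = i+1:k
            if ~merged(i) && ~merged(j) && norm(centroid(i, :) - centroid(j, :)) < Dmin
                % replace the pair by its sample-count-weighted mean
                w = counts(i) + counts(j);
                centroid(i, :) = (counts(i) * centroid(i, :) + counts(j) * centroid(j, :)) / w;
                keep(j) = false;                           % the second center is dropped
                merged([i, j]) = true;
            end
        end
    end
    centroid = centroid(keep, :);
end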

Finally, the splitting operation of the ISODATA algorithm:

Figure 6. The splitting operation of the ISODATA algorithm
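A sketch of one common form of the splitting operation follows: a class is split along the dimension with the largest standard deviation, provided that standard deviation exceeds the threshold and both resulting subclasses would contain at least Nmin samples. The split coefficient gamma and the way dispersion is measured are assumptions that may differ from Figure 6.

% Sketch of the splitting operation of ISODATA.
function centroid = isodata_split(X, labels, centroid, Nmin, sigma_max)
    gamma = 0.5;                                           % split coefficient, typically in (0, 1]
    new_centers = zeros(0, size(X, 2));
    for c = 1:size(centroid, 1)
        members = X(labels == c, :);
        [s, dim] = max(std(members, 0, 1));                % largest per-dimension standard deviation
        if s > sigma_max && size(members, 1) >= 2 * Nmin
            % split: replace the center by two centers shifted +/- along that dimension
            shift = zeros(1, size(X, 2));
            shift(dim) = gamma * s;
            new_centers = [new_centers; centroid(c, :) + shift; centroid(c, :) - shift];
        else
            new_centers = [new_centers; centroid(c, :)];
        end
    end
    centroid = new_centers;
end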

In summary, the ISODATA algorithm can dynamically adjust the number of cluster centers during clustering according to the actual situation of each class: if a class's samples are highly dispersed (measured by the variance) and the class contains many samples, it is split; and if two classes are very close to each other (measured by the distance between their cluster centers), they are merged.

What may still be unclear are steps 1 and 2 of the ISODATA splitting operation. Again taking the data set in Figure 3 as an example, and assuming that points 1 through 8 were originally assigned to the same class, steps 1 and 2 proceed as follows:

In the case of the correct classification (that is, points 1, 2, 3, and 4 form one class and points 5, 6, 7, and 8 form another), the variance is 0.33. The variance of the current single class is therefore far greater than this ideal variance, so the ISODATA algorithm is very likely to perform a splitting operation.

V. Clustering Algorithm Source Code

I have integrated the three algorithms above into a single MATLAB function, clustering.m. Readers can use this function to cluster data sets directly. Because the code is fairly long and the blog's code plugin is inconvenient to use, it is not listed in the article. Readers who need it can download it from the link below (stars and forks are welcome; new algorithms and optimizations will be added from time to time):

https://github.com/AaronX121/Unsupervised-Learning-Clustering

Usage is very simple. Three forms of input are currently supported, corresponding to the three algorithms above:

[centroid, result] = clustering(data, 'kmeans', k, iteration);

[centroid, result] = clustering(data, 'kmeans++', k, iteration);

[centroid, result] = clustering(data, 'isodata', desired_k, iteration, minimum_n, maximum_variance, minimum_d);

The input data is a matrix in which each row represents one sample in the data set. The other inputs correspond one-to-one to the parameters in the algorithm descriptions above. The output centroid gives the locations of the cluster centers, and result is the category index of each sample.

VI. Data Set Testing

Finally, a simple example on a data set drawn from two Gaussian distributions shows the clustering results of the three algorithms, as shown below.

Figure 7. Clustering results of the three algorithms on a simple data set (the green plus signs mark the cluster center locations)
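As an illustration, the following script generates a similar two-Gaussian data set and calls clustering.m in its three modes. All parameter values are arbitrary choices, and the exact method strings should be checked against the function's own documentation.

% Illustrative test: two 2-D Gaussian clusters, clustered with each calling mode of clustering.m.
rng(0);                                                     % reproducible random numbers
data = [randn(200, 2) + [2, 2]; randn(200, 2) - [2, 2]];    % two Gaussian blobs
[centroid1, result1] = clustering(data, 'kmeans',   2, 100);
[centroid2, result2] = clustering(data, 'kmeans++', 2, 100);
[centroid3, result3] = clustering(data, 'isodata',  2, 100, 20, 1.5, 1.0);
% plot one of the results, with the cluster centers marked by green plus signs
scatter(data(:, 1), data(:, 2), 15, result2, 'filled'); hold on;
plot(centroid2(:, 1), centroid2(:, 2), 'g+', 'MarkerSize', 12, 'LineWidth', 2);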

