K-means Clustering algorithm


Cluster analysis finds relationships among data objects and groups the data accordingly; the greater the similarity within a group and the greater the difference between groups, the better the clustering result.

Different cluster types

Clustering is designed to find useful groups of objects. In practice many cluster types are used, and different cluster types partition the data differently; several common cluster types are described below.

Well-separated

In (a) you can see that the distance between any two points in different groups is greater than the distance between any two points within the same group. Well-separated clusters are not necessarily globular and can have arbitrary shapes.

Prototype-based

A cluster is a set of objects in which each object is closer to the prototype that defines its cluster than to the prototype of any other cluster. In (b) the prototype is the center point, and the data in each cluster are closer to their own center than to the center of any other cluster. This is the common center-based cluster, and K-means, the most widely used algorithm, produces this cluster type.
Such clusters tend to be globular.

Density-based

A cluster is a dense region of objects. (d) shows density-based clusters: the clusters are irregular or intertwined, and there are noise points and outliers, so a density-based definition of a cluster is often used in such cases.

Refer to Introduction to Data Mining for more on cluster types.

Basic cluster analysis algorithms

1. K-means:
A prototype-based, partitional clustering technique that attempts to find a user-specified number of clusters (K).

2. Agglomerative hierarchical clustering:
The idea is to start with each point as a singleton cluster, then repeatedly merge the two closest clusters until a single, all-inclusive cluster remains.

3. DBSCAN:
A density-based clustering algorithm in which the number of clusters is determined automatically; low-density points are treated as noise and ignored, so it does not produce a complete clustering (a quick sketch of all three families follows below).
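As a quick side-by-side illustration of these three families (not part of the original article, which implements K-means from scratch), here is a minimal scikit-learn sketch; the data and parameter values are placeholders only:

import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering, DBSCAN

X = np.random.rand(100, 2)                                            # toy 2-D data

km_labels = KMeans(n_clusters=3, n_init=10).fit_predict(X)           # prototype-based
agg_labels = AgglomerativeClustering(n_clusters=3).fit_predict(X)    # agglomerative hierarchical
db_labels = DBSCAN(eps=0.3, min_samples=5).fit_predict(X)            # density-based; -1 marks noise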

Distance measurement

Different distance measures affect the clustering result. Common distance metrics include Euclidean distance, Manhattan distance, and cosine similarity.
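The original figure listing the metrics is not reproduced here; as a small illustration (not from the original article), two of the most common metrics can be computed with NumPy as follows:

import numpy as np

a = np.array([1.0, 2.0])
b = np.array([4.0, 6.0])

euclidean = np.sqrt(np.sum((a - b) ** 2))   # sqrt((1-4)^2 + (2-6)^2) = 5.0
manhattan = np.sum(np.abs(a - b))           # |1-4| + |2-6| = 7.0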

K-means algorithm

The K-means algorithm is described below:

Advantages: easy to implement.
Disadvantages: may converge to a local minimum; converges slowly on large data sets.

The idea of the algorithm is simple:

Select K points as the initial centroids
repeat
    Assign each point to its nearest centroid, forming K clusters
    Recompute the centroid of each cluster
until the clusters no longer change or the maximum number of iterations is reached

Here the recalculation of each cluster's centroid is determined by the objective function, so we have to settle on the distance measure and the objective function at the start.

Considering data measured with Euclidean distance, we use the sum of squared errors (SSE) as the objective function of the clustering and prefer the clustering with the smaller SSE; two runs of K-means can produce two different sets of clusters.

SSE = Σ_{i=1}^{K} Σ_{x ∈ C_i} dist(c_i, x)²

Here K is the number of cluster centers, c_i is the centroid of the i-th cluster, and dist is the Euclidean distance. A natural question is why the centroid is updated as the mean of all points in the cluster; this is exactly what minimizing the SSE dictates, since the mean is the point that minimizes the sum of squared Euclidean distances to the cluster's points.
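As a quick sanity check (not from the original article; the point values are made up), the following small NumPy sketch shows that using the cluster mean as the centroid gives a smaller SSE than a nearby alternative:

import numpy as np

pts = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 0.0]])

def sse(points, centroid):
    # sum of squared Euclidean distances from each point to the centroid
    return np.sum((points - centroid) ** 2)

mean_centroid = pts.mean(axis=0)
print(sse(pts, mean_centroid))                # 16.0, SSE with the mean as centroid
print(sse(pts, mean_centroid + [0.5, 0.5]))   # 17.5, any other centroid gives a larger SSE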

The following is implemented in Python

# dataSet: sample points; k: number of clusters
# distMeas: distance metric, the default is Euclidean distance
# createCent: how the initial centroids are chosen
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = shape(dataSet)[0]                      # number of samples
    clusterAssment = mat(zeros((m,2)))         # m x 2 matrix: assigned centroid, squared error
    centroids = createCent(dataSet, k)         # initialize k centroids
    clusterChanged = True
    while clusterChanged:                      # stop when the assignments no longer change
        clusterChanged = False
        for i in range(m):
            minDist = inf; minIndex = -1
            for j in range(k):                 # find the nearest centroid
                distJI = distMeas(centroids[j,:], dataSet[i,:])
                if distJI < minDist:
                    minDist = distJI; minIndex = j
            if clusterAssment[i,0] != minIndex:
                clusterChanged = True
            # column 0: index of the assigned centroid, column 1: squared distance
            clusterAssment[i,:] = minIndex, minDist**2
        print centroids
        for cent in range(k):                  # recompute the position of each centroid
            ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]
            centroids[cent,:] = mean(ptsInClust, axis=0)
    return centroids, clusterAssment

Focus on understanding:

  for cent in range(k):
      ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]
      centroids[cent,:] = mean(ptsInClust, axis=0)

Loop through each centroid, find all the points that belong to the current centroid, and then update the current centroid based on those points.
nonzero() returns a tuple of arrays giving the positions of the elements that are not 0.

>>> from numpy import *
>>> a = array([[1,0,0],[0,1,2],[2,0,0]])
>>> a
array([[1, 0, 0],
       [0, 1, 2],
       [2, 0, 0]])
>>> nonzero(a)
(array([0, 1, 1, 2]), array([0, 1, 2, 0]))

This means the elements at positions [0,0], [1,1], [1,2] and [2,0] are non-zero: the first array holds the row indices, the second holds the column indices, and the two are paired element-wise.

ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]
So clusterAssment[:,0].A == cent is evaluated first; for the rows where it is True, nonzero() records the row indices, and the slice then selects those rows from dataSet.
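A tiny made-up example of how this selection works (the values are illustrative only; output formatting may differ slightly between NumPy versions):

>>> from numpy import *
>>> dataSet = mat([[1.0, 1.0], [9.0, 9.0], [1.2, 0.8]])
>>> clusterAssment = mat([[0, 0.1], [1, 0.2], [0, 0.05]])   # column 0: assigned cluster
>>> cent = 0
>>> nonzero(clusterAssment[:,0].A == cent)[0]    # row indices of points assigned to cluster 0
array([0, 2])
>>> dataSet[nonzero(clusterAssment[:,0].A == cent)[0]]      # the corresponding points
matrix([[ 1. ,  1. ],
        [ 1.2,  0.8]])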

Some of the auxiliary functions:

def loadDataSet(fileName):        # general function to parse tab-delimited floats
    dataMat = []                  # assume last column is target value
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        fltLine = map(float, curLine)   # map all elements to float()
        dataMat.append(fltLine)
    return dataMat

def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))   # Euclidean distance; la.norm(vecA-vecB) also works

def randCent(dataSet, k):
    n = shape(dataSet)[1]
    centroids = mat(zeros((k,n)))   # create centroid mat
    for j in range(n):              # create random cluster centers, within bounds of each dimension
        minJ = min(dataSet[:,j])
        rangeJ = float(max(dataSet[:,j]) - minJ)
        centroids[:,j] = mat(minJ + rangeJ * random.rand(k,1))
    return centroids
Run and results

Write the above code into kMeans.py and open the Python interactive shell.

>>> from numpy import *
>>> import kMeans
>>> dat = mat(kMeans.loadDataSet('testSet.txt'))   # load the data
>>> center, clust = kMeans.kMeans(dat, 4)
[[ 0.90796996  5.05836784]
 [-2.88425582  0.01687006]
 [-3.3447423  -1.01730512]
 [-0.32810867  0.48063528]]
[[ 1.90508653  3.530091  ]
 [-3.00984169  2.66771831]
 [-3.38237045 -2.9473363 ]
 [ 2.22463036 -1.37361589]]
[[ 2.54391447  3.21299611]
 [-2.46154315  2.78737555]
 [-3.38237045 -2.9473363 ]
 [ 2.8692781  -2.54779119]]
[[ 2.6265299   3.10868015]
 [-2.46154315  2.78737555]
 [-3.38237045 -2.9473363 ]
 [ 2.80293085 -2.7315146 ]]
>>> draw(dat, center)   # plot the result

The drawing program is as follows:

import matplotlib.pyplot as plt

def draw(data, center):
    length = len(center)
    fig = plt.figure()
    # plot a scatter of the original data
    plt.scatter(data[:,0], data[:,1], s=25, alpha=0.4)
    # mark the centroid of each cluster
    for i in range(length):
        plt.annotate('center', xy=(center[i,0], center[i,1]),
                     xytext=(center[i,0]+1, center[i,1]+1),
                     arrowprops=dict(facecolor='red'))
    plt.show()

Defects of the K-means algorithm

The K-means algorithm is very simple and widely used, but it has the following main defects:
1. The value of K must be given in advance, i.e. it is prior knowledge. In many cases estimating K is very difficult; for a scenario such as discovering the social circles of all users, K-means cannot be applied at all. For scenarios where K is known to be bounded but not exactly, one can iterate over candidate K values and pick the one that minimizes the cost function, which often describes how many clusters there really are reasonably well (a rough sketch of this follows the list).
2. The K-means algorithm is sensitive to the initially selected cluster centers; different random seeds can produce completely different clusterings.
3. K-means is not suitable for all data types. It cannot handle non-globular clusters, or clusters of different sizes and densities, although when the specified number of clusters is large enough it can usually find pure sub-clusters.
4. K-means also has problems when the data contains outliers; in that case outlier detection and removal can help.
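For the K-selection point above, here is a rough sketch (assuming the kMeans and distEclud functions defined earlier in this article) of trying several candidate values of K and recording the total SSE; the "elbow" where the SSE stops dropping sharply is often a reasonable choice of K:

def sseForK(dataSet, maxK):
    results = []
    for k in range(1, maxK + 1):
        centroids, clusterAssment = kMeans(dataSet, k)
        totalSSE = sum(clusterAssment[:, 1])   # column 1 holds each point's squared error
        results.append((k, totalSSE))
    return results

Note that kMeans prints the centroids at every iteration, so this helper is noisy; it is only meant to show the idea.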

The following discusses the selection of the initial centroids:

Poor initial centroids

When the initial centroids are chosen randomly, each run of K-means produces a different SSE, and a random choice of initial centroids may be poor: we may obtain only a local optimum rather than the global optimum, as shown below:


It can be seen that after 4 iterations the program has settled into a local optimum; clearly it is not the global optimum, since a clustering with a smaller SSE can still be found.

Limitations of random initialization

You might think: run it multiple times, each time with a different set of random initial centroids, and select the clustering with the smallest SSE (sketched below). The strategy is simple, but it may not work well; its effectiveness depends on the data set and the number of clusters sought.
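A simple sketch of that strategy (assuming the kMeans function defined above; the helper name is illustrative): run K-means several times with different random initial centroids and keep the run with the smallest SSE.

def bestOfNRuns(dataSet, k, numRuns=10):
    bestSSE = inf
    bestCentroids, bestAssment = None, None
    for _ in range(numRuns):
        # randCent picks a new random start on every call
        centroids, clusterAssment = kMeans(dataSet, k)
        totalSSE = sum(clusterAssment[:, 1])
        if totalSSE < bestSSE:
            bestSSE = totalSSE
            bestCentroids, bestAssment = centroids, clusterAssment
    return bestCentroids, bestAssment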

For more, refer to Introduction to Data Mining.

K-means optimization Algorithm

To overcome the problem that the K-means algorithm may converge to a local minimum, bisecting K-means was proposed.

bisecting K-means

The pseudo code of the algorithm is as follows:

Treat all the points as one cluster
while the number of clusters is less than k:
    for each cluster:
        compute the total error
        run K-means on the cluster with k = 2
        compute the total error after splitting the cluster into two
    split the cluster whose split yields the smallest total error

The complete Python code is as follows:

def biKmeans(dataSet, k, distMeas=distEclud):
    m = shape(dataSet)[0]
    # first column: cluster index, second column: squared error (SSE contribution)
    clusterAssment = mat(zeros((m,2)))
    # treat all the data as one cluster
    centroid0 = mean(dataSet, axis=0).tolist()[0]
    centList = [centroid0]                      # create a list with one centroid
    for j in range(m):                          # error when there is only one cluster
        clusterAssment[j,1] = distMeas(mat(centroid0), dataSet[j,:])**2
    # core code
    while (len(centList) < k):
        lowestSSE = inf
        # for each centroid, try to split it
        for i in range(len(centList)):
            # get the points that belong to this centroid
            ptsInCurrCluster = dataSet[nonzero(clusterAssment[:,0].A==i)[0],:]
            # split this cluster into two with K-means
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
            # SSE of the cluster that was split
            sseSplit = sum(splitClustAss[:,1])
            # SSE of the clusters that did not take part in the split
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:,0].A!=i)[0],1])
            print "sseSplit, and notSplit: ", sseSplit, sseNotSplit
            # keep the split that gives the smallest total SSE
            if (sseSplit + sseNotSplit) < lowestSSE:
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit
        # the harder part to understand: relabel the two new clusters
        bestClustAss[nonzero(bestClustAss[:,0].A == 1)[0],0] = len(centList)   # change 1 to 3, 4, or whatever
        bestClustAss[nonzero(bestClustAss[:,0].A == 0)[0],0] = bestCentToSplit
        print 'the bestCentToSplit is: ', bestCentToSplit
        print 'the len of bestClustAss is: ', len(bestClustAss)
        centList[bestCentToSplit] = bestNewCents[0,:].tolist()[0]   # replace a centroid with the best centroids
        centList.append(bestNewCents[1,:].tolist()[0])
        # reassign new clusters, and SSE
        clusterAssment[nonzero(clusterAssment[:,0].A == bestCentToSplit)[0],:] = bestClustAss
    return mat(centList), clusterAssment

The trickier parts of the code are explained below:

      bestClustAss[nonzero(bestClustAss[:,0].A == 1)[0],0] = len(centList)   # change 1 to 3, 4, or whatever
      bestClustAss[nonzero(bestClustAss[:,0].A == 0)[0],0] = bestCentToSplit

This reassigns the cluster labels. bestClustAss = splitClustAss.copy() is the assignment matrix returned by K-means, where the first column is the cluster index and the second column is the squared error. Because K-means with k=2 returns only the labels 0 and 1, the points labeled 1 are relabeled with the new index len(centList), while the points labeled 0 keep the index of the cluster that was just split.

As an example:
Suppose the data is currently divided into two clusters, 0 and 1, and we want 3 clusters. When the algorithm runs, assume splitting cluster 1 gives the smallest SSE; cluster 1 is then split into two clusters, returned with labels 0 and 1. The points returned with label 1 are relabeled 2, and the points returned with label 0 are relabeled 1, so there are now three clusters: 0, 1 and 2.
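The same example in code (made-up values, for illustration only):

>>> from numpy import *
>>> centList = [[0.0, 0.0], [5.0, 5.0]]                  # two existing clusters: 0 and 1
>>> bestCentToSplit = 1                                   # suppose splitting cluster 1 gave the smallest SSE
>>> bestClustAss = mat([[0, 0.3], [1, 0.7], [0, 0.1]])    # labels 0/1 returned by kMeans with k=2
>>> bestClustAss[nonzero(bestClustAss[:,0].A == 1)[0],0] = len(centList)    # 1 -> 2 (new cluster)
>>> bestClustAss[nonzero(bestClustAss[:,0].A == 0)[0],0] = bestCentToSplit  # 0 -> 1 (the split cluster)
>>> bestClustAss[:,0].T
matrix([[ 1.,  2.,  1.]])

Note that label 1 is rewritten before label 0; doing it the other way around would first turn the 0-labeled points into bestCentToSplit and, when bestCentToSplit happens to be 1, relabel them again in the second step.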

 centList[bestCentToSplit] = bestNewCents[0,:].tolist()[0]   # replace a centroid with the best centroids
 centList.append(bestNewCents[1,:].tolist()[0])
 clusterAssment[nonzero(clusterAssment[:,0].A == bestCentToSplit)[0],:] = bestClustAss   # reassign new clusters, and SSE

bestNewCents is the matrix of cluster centers returned by K-means; it has two rows, the coordinates of the first and the second cluster (k=2). The first centroid replaces the centroid of the split cluster with centList[bestCentToSplit] = bestNewCents[0,:].tolist()[0], and the second centroid is appended to the list with centList.append(bestNewCents[1,:].tolist()[0]).

Run with results
>>> from numpy import *
>>> import kMeans
>>> dat = mat(kMeans.loadDataSet('testSet2.txt'))
>>> cent, assment = kMeans.biKmeans(dat, 3)
sseSplit, and notSplit:  570.722757425 0.0
the bestCentToSplit is:  0
the len of bestClustAss is:  60
sseSplit, and notSplit:  68.6865481262 38.0629506357
sseSplit, and notSplit:  22.9717718963 532.659806789
the bestCentToSplit is:  0
the len of bestClustAss is:  40

It can be seen from the output that two splits were performed: the first split divided the single initial cluster (60 points), and the second split again chose cluster 0, which by then contained 40 points.
The visualization is as follows:

Mini Batch K-means

In the original K-means algorithm every assignment step uses all of the samples; when the data set is very large this is very expensive, so a batch-processing improvement was proposed.
The distances between data points are computed using mini batches.
The benefit of mini batches: instead of using all of the data, a subset of samples is drawn from each category to represent it in the computation. Because fewer samples are used, the running time drops accordingly; on the other hand, sampling inevitably brings some loss of accuracy.
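As an illustration (not part of the original article), scikit-learn ships a mini-batch variant of K-means that updates the centroids from small random batches; the parameter values below are placeholders:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.random.rand(10000, 2)                                   # toy data
mbk = MiniBatchKMeans(n_clusters=3, batch_size=100, n_init=10)
labels = mbk.fit_predict(X)                                    # cluster index for each sample
centers = mbk.cluster_centers_                                 # learned centroids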
