Advantages and disadvantages of the bisecting K-means clustering algorithm:
Since this is an improved version of K-means, its pros and cons are largely the same as those of K-means.
Algorithm idea:
1. To understand this algorithm you should first understand K-means. The idea is: start by treating all points as a single cluster, then split that cluster in two. After that, pick the cluster whose split most reduces the clustering cost function (i.e., the sum of squared errors, SSE) and divide it into two clusters (alternatively, simply pick the largest cluster; several selection strategies exist). Repeat until the number of clusters equals the user-specified K.
2. The reasoning behind this is that the sum of squared errors measures clustering quality: the smaller the value, the closer the data points are to their centroids and the better the clustering. So we should split the cluster with the largest SSE, because a large SSE suggests that the cluster's points are poorly represented by a single centroid and the cluster is more likely to really be several clusters, so it should be partitioned first (a short sketch of this criterion follows the list below).
3. Regarding its advantages, "Machine Learning in Action" says bisecting K-means can overcome K-means' tendency to converge to a local minimum. Having thought about it, though, this still does not guarantee convergence to the global optimum (and some later runs of the code also produce less-good results; I am not sure whether that counts as evidence).
4. From various references and summaries, the advantages of bisecting K-means clustering are:
- a certain speed improvement, since each bisecting step runs K-means with k = 2 on only part of the data
- less affected by initialization problems, because there is no random selection of K initial points, and each step chooses the split that minimizes the error
Overall, then, this algorithm is still not guaranteed to reach the global minimum for a given K, but it is comparatively better than plain K-means and does offer some speedup. If my understanding is off, corrections are welcome.
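As a minimal sketch of the split-selection criterion from point 2 (the helper name and array-based interface here are my own, not from the book's code): the total SSE is the sum, over all clusters, of the squared distances of each point to its cluster centroid, and the cluster with the largest SSE is the natural candidate to split first.

import numpy as np

def largest_sse_cluster(data, assignments, centroids):
    # data: (m, n) points; assignments: length-m array of cluster indices;
    # centroids: (k, n) array of cluster centers. Hypothetical helper for illustration.
    # Returns the index of the cluster with the largest sum of squared errors.
    sse = []
    for i in range(len(centroids)):
        pts = data[assignments == i]                   # points assigned to cluster i
        sse.append(((pts - centroids[i]) ** 2).sum())  # squared error of cluster i
    return int(np.argmax(sse))                         # largest SSE -> split it first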
Function:
biKmeans(dataSet, k, distMeas=distEclud)
This function implements the bisecting algorithm. The process is roughly as follows (the comments in the code are already quite detailed):
1. Initialize a single centroid from all points and set up the required data structures
2. Try a two-way split (K-means with k = 2) on each existing cluster (at first there is only one) and keep the best split
3. Update the cluster assignments and the list of centroids
from numpy import *  # shape, mat, zeros, mean, nonzero, inf, ...

# Assumes distEclud (Euclidean distance) and kMeans are defined earlier in the
# chapter, as in "Machine Learning in Action".
def biKmeans(dataSet, k, distMeas=distEclud):
    m = shape(dataSet)[0]
    clusterAssment = mat(zeros((m, 2)))              # records each point's cluster assignment and squared error
    centroid0 = mean(dataSet, axis=0).tolist()[0]    # centroid of the entire data set
    centList = [centroid0]                           # create a list with one centroid
    for j in range(m):                               # initial squared error of each point to that centroid
        clusterAssment[j, 1] = distMeas(mat(centroid0), dataSet[j, :]) ** 2
    while len(centList) < k:
        lowestSSE = inf
        for i in range(len(centList)):               # try to split each cluster
            ptsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]  # points currently in cluster i
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)      # run K-means with k=2 on this cluster
            sseSplit = sum(splitClustAss[:, 1])      # SSE of the two clusters produced by the split
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])  # SSE of all other clusters
            print("sseSplit, and notSplit:", sseSplit, sseNotSplit)
            if (sseSplit + sseNotSplit) < lowestSSE:  # if splitting this cluster gives a lower total SSE, keep it
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit
        bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)  # change index 1 to 3, 4, or whatever
        bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
        print('the bestCentToSplit is:', bestCentToSplit)
        print('the len of bestClustAss is:', len(bestClustAss))
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]  # replace the split centroid with the first new one
        centList.append(bestNewCents[1, :].tolist()[0])             # and append the second new centroid
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss  # reassign clusters and SSE
    return mat(centList), clusterAssment
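A quick way to try the function, sketched under the assumption that kMeans and distEclud from earlier in the chapter are defined in the same file (the synthetic blob data here is made up purely for illustration):

from numpy import random, vstack, mat

# two well-separated 2-D blobs (illustrative data only)
blob1 = random.randn(50, 2) + [6, 6]
blob2 = random.randn(50, 2) - [6, 6]
dataMat = mat(vstack((blob1, blob2)))

myCentroids, clustAssing = biKmeans(dataMat, 2)
print(myCentroids)            # two centroids, roughly near (6, 6) and (-6, -6)
print(clustAssing[:5, :])     # first few rows: [cluster index, squared error]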
The clustering results should still be good, generally better than those of plain K-means.