Bisecting K-means algorithm

Advantages and disadvantages of the bisecting K-means clustering algorithm:

Since this is an improved version of K-means, its pros and cons largely mirror those of K-means.

Algorithm idea:

1. To understand this, you should first understand the K-means algorithm. The idea of bisecting K-means is: start with all points in a single cluster, then split that cluster into two. Next, among all current clusters, pick one to split, chosen so that splitting it minimizes the clustering cost function (the sum of squared errors, SSE); alternatively you can simply pick the largest cluster, among other selection strategies. Repeat until the number of clusters equals the user-specified K.
2. The point of the above is that the SSE measures clustering quality: the smaller its value, the closer the data points are to their centroids and the better the clustering. So we should split the cluster with the largest SSE, because a large SSE means the points in it fit least well as a single cluster, and it should therefore be partitioned first (the SSE computation itself is sketched just after this section).
3. Regarding its advantages, Machine Learning in Action claims it can overcome K-means's tendency to converge to a local minimum. But having thought about it, this does not guarantee convergence to the global optimum (and later runs of the code do produce less-than-ideal results at times, which may bear this out).
4. From consulting various sources and summaries, the advantages of bisecting K-means clustering are:

    • Not affected by initialization problems, because there is no random selection of initial points, and each step keeps the split with the minimum error

Therefore, this algorithm still does not guarantee that the result for a given K is the global minimum, but it is comparatively better than plain K-means and offers some speed improvement. If there is any deviation in my understanding, corrections are welcome.
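Since the SSE is the yardstick throughout, here is a minimal, self-contained sketch of how that quantity is computed for a single cluster (the names here are illustrative, not taken from the post's code):

    import numpy as np

    def cluster_sse(points, centroid):
        # sum of squared Euclidean distances from each point to its centroid
        return float(((points - centroid) ** 2).sum())

    pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
    centroid = pts.mean(axis=0)
    print(cluster_sse(pts, centroid))  # 1.33...: the error this cluster contributes to the total

The cluster whose split lowers the sum of these per-cluster errors the most is the one bisecting K-means actually splits.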

Function:

biKmeans(dataSet, k, distMeas=distEclud)
This function implements the bisecting algorithm. The process is roughly as follows (the comments in the code are fairly detailed):
1. Compute the centroid of all points as the single initial cluster and set up the required data structures
2. Try a two-way split (K-means with k=2) on each cluster (at first there is only one) and keep the best split
3. Update the cluster assignments and the centroid list
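biKmeans relies on the plain kMeans routine and a Euclidean distance function defined alongside it, which the post omits. The following is a minimal sketch of those helpers, reconstructed along the lines of the kMeans.py module from Machine Learning in Action (randCent is the random-centroid initializer that plain K-means needs; bisecting K-means itself only calls it through kMeans):

    from numpy import *

    def distEclud(vecA, vecB):
        # Euclidean distance between two row vectors
        return sqrt(sum(power(vecA - vecB, 2)))

    def randCent(dataSet, k):
        # build k random centroids within the bounds of the data
        n = shape(dataSet)[1]
        centroids = mat(zeros((k, n)))
        for j in range(n):
            minJ = min(dataSet[:, j])
            rangeJ = float(max(dataSet[:, j]) - minJ)
            centroids[:, j] = minJ + rangeJ * random.rand(k, 1)
        return centroids

    def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
        m = shape(dataSet)[0]
        clusterAssment = mat(zeros((m, 2)))   # column 0: cluster index, column 1: squared error
        centroids = createCent(dataSet, k)
        clusterChanged = True
        while clusterChanged:
            clusterChanged = False
            for i in range(m):                # assign each point to its nearest centroid
                minDist = inf; minIndex = -1
                for j in range(k):
                    distJI = distMeas(centroids[j, :], dataSet[i, :])
                    if distJI < minDist:
                        minDist = distJI; minIndex = j
                if clusterAssment[i, 0] != minIndex:
                    clusterChanged = True
                clusterAssment[i, :] = minIndex, minDist ** 2
            for cent in range(k):             # recompute each centroid as the mean of its points
                # (an empty cluster would yield a nan centroid; the book's version shares this caveat)
                ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
                centroids[cent, :] = mean(ptsInClust, axis=0)
        return centroids, clusterAssment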

    def biKmeans(dataSet, k, distMeas=distEclud):
        m = shape(dataSet)[0]
        clusterAssment = mat(zeros((m, 2)))            # record each point's cluster assignment and squared error
        centroid0 = mean(dataSet, axis=0).tolist()[0]  # calculate the centroid of the entire data set
        centList = [centroid0]                         # create a list with one centroid
        for j in range(m):                             # squared distance from the initial centroid to every point
            clusterAssment[j, 1] = distMeas(mat(centroid0), dataSet[j, :]) ** 2
        while len(centList) < k:
            lowestSSE = inf
            for i in range(len(centList)):             # try to split each cluster
                ptsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]  # points currently in cluster i
                centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)      # run k-means with k=2 on this cluster
                sseSplit = sum(splitClustAss[:, 1])    # SSE of the two new sub-clusters
                sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])  # SSE of all other clusters
                print("sseSplit, and notSplit:", sseSplit, sseNotSplit)
                if (sseSplit + sseNotSplit) < lowestSSE:  # if the split gives a better (lower) total SSE
                    bestCentToSplit = i
                    bestNewCents = centroidMat
                    bestClustAss = splitClustAss.copy()
                    lowestSSE = sseSplit + sseNotSplit
            bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)    # relabel sub-cluster 1 with the next free index
            bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit  # sub-cluster 0 keeps the split cluster's index
            print('the bestCentToSplit is:', bestCentToSplit)
            print('the len of bestClustAss is:', len(bestClustAss))
            centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]  # replace the split centroid with the first new centroid
            centList.append(bestNewCents[1, :].tolist()[0])             # append the second new centroid
            clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss  # reassign clusters and SSE
        return mat(centList), clusterAssment
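For reference, a small usage example on made-up synthetic data (the book's own demo loads its testSet2.txt data file instead; the blobs below are purely for illustration):

    # toy run: three Gaussian blobs, clustered into k = 3
    blobs = vstack([random.randn(50, 2) + array([ 3.0,  3.0]),
                    random.randn(50, 2) + array([-3.0, -3.0]),
                    random.randn(50, 2) + array([ 3.0, -3.0])])
    centList, clusterAssment = biKmeans(mat(blobs), 3)
    print(centList)                   # 3 x 2 matrix of final centroids
    print(sum(clusterAssment[:, 1]))  # total SSE of the final clustering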

The clustering results are still quite good, better than plain K-means.


