Bisecting K-Means Algorithm

Source: Internet
Author: User

First of all, we know that the K-means algorithm has a serious drawback: in many cases it converges only to a local minimum rather than the global minimum. To address this problem, scholars have proposed many methods; here we introduce one of them, called bisecting K-means.

The algorithm first treats all the points as a single cluster and then splits that cluster into two. After that, it selects one of the resulting clusters to split again. A simple criterion is to split the cluster with the largest SSE (sum of squared errors); the implementation below goes one step further and tries a 2-means split on every cluster, keeping whichever split leaves the lowest total SSE. This SSE-based splitting is repeated until the number of clusters specified by the user is reached.

Consider all points as one cluster. While the number of clusters is less than K: for each cluster, compute its total error, run K-means clustering (k=2) on it, and compute the total error after that trial split; then actually perform the split that results in the lowest overall SSE. Repeat until the number of clusters equals K.
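Before the full listing, here is a minimal self-contained sketch of that loop. It is added purely for illustration and is not the original author's code; simple_kmeans, total_sse and bisecting_kmeans are helper names introduced here, and the real implementation follows below.

import numpy as np

def total_sse(points, centroid):
    # sum of squared distances from every point in `points` to `centroid`
    return float(((points - centroid) ** 2).sum())

def simple_kmeans(points, k=2, iters=20):
    # tiny k-means: random initial centroids, a fixed number of update steps
    rng = np.random.default_rng(0)
    cents = points[rng.choice(len(points), k, replace=False)]
    for _ in range(iters):
        labels = ((points[:, None, :] - cents[None, :, :]) ** 2).sum(-1).argmin(1)
        cents = np.array([points[labels == j].mean(0) if np.any(labels == j) else cents[j]
                          for j in range(k)])
    return cents, labels

def bisecting_kmeans(points, K):
    clusters = [points]                                  # start with one cluster holding everything
    while len(clusters) < K:
        best_idx, best_parts, best_total = None, None, np.inf
        for idx, cluster in enumerate(clusters):
            if len(cluster) < 2:                         # cannot split a singleton cluster
                continue
            cents, labels = simple_kmeans(cluster, 2)    # trial 2-means split
            sse_split = sum(total_sse(cluster[labels == j], cents[j]) for j in range(2))
            sse_rest = sum(total_sse(c, c.mean(0)) for i, c in enumerate(clusters) if i != idx)
            if sse_split + sse_rest < best_total:        # keep the split with the lowest total SSE
                best_idx = idx
                best_total = sse_split + sse_rest
                best_parts = [cluster[labels == 0], cluster[labels == 1]]
        if best_idx is None:                             # nothing left to split
            break
        clusters[best_idx:best_idx + 1] = best_parts     # replace the cluster with its two halves
    return clusters

For example, bisecting_kmeans(createDataSet(), 4) on the toy data from the listing below returns a list of four point arrays.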

1. First, here is the K-means function

# coding=utf-8
from numpy import *
import matplotlib
import matplotlib.pyplot as plt
import operator
from os import listdir
import time


def distEclud(vecA, vecB):
    # Euclidean distance between two vectors; la.norm(vecA - vecB) would also work
    return sqrt(sum(power(vecA - vecB, 2)))


def randCent(dataSet, k):
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))  # create the centroid matrix
    for j in range(n):              # create random cluster centers within the bounds of each dimension
        minJ = min(dataSet[:, j])
        rangeJ = float(max(dataSet[:, j]) - minJ)
        centroids[:, j] = mat(minJ + rangeJ * random.rand(k, 1))
    return centroids


def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = dataSet.shape[0]
    clusterAssment = zeros((m, 2))
    centroids = createCent(dataSet, k)
    # print(centroids)
    show(centroids)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):
            point = dataSet[i, :]                  # traverse every point
            minDist = inf
            minIndex = -1
            for n in range(k):
                heart = centroids[n, :]            # traverse every centroid
                distance = distMeas(point, heart)  # distance between the point and the centroid
                if distance < minDist:
                    minDist = distance             # update the minimum distance
                    minIndex = n                   # update the index of the nearest centroid
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2  # store cluster index and squared error
        # print(clusterAssment)
        for cent in range(k):
            ptsInClust = dataSet[clusterAssment[:, 0] == cent]  # all the points in this cluster
            # print(ptsInClust)
            if len(ptsInClust):
                centroids[cent, :] = mean(ptsInClust, axis=0)   # move the centroid to the cluster mean
            else:
                centroids[cent, :] = array([0, 0])              # empty cluster: reset the centroid (the data here is 2-D)
    show(centroids, color='green')
    return centroids, clusterAssment


def show(data, color=None):
    if not color:
        color = 'green'
    group = createDataSet()
    data = asarray(data)  # accept both matrices and plain arrays
    fig = plt.figure(1)
    axes = fig.add_subplot(111)
    axes.scatter(group[:, 0], group[:, 1], s=40, c='red')
    axes.scatter(data[:, 0], data[:, 1], s=50, c=color)
    plt.show()


def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1],
                   [2, 1.0], [2.1, 0.9], [0.3, 0.0], [1.1, 0.9],
                   [2.2, 1.0], [2.1, 0.8], [3.3, 3.5], [2.1, 0.9],
                   [2, 1.0], [2.1, 0.9], [3.5, 3.4], [3.6, 3.5]])
    return group

# centroids, clusterAssment = kMeans(createDataSet(), 4)
# show(centroids, color='yellow')
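To try the K-means function by itself before the bisecting wrapper is added, you can uncomment the two lines at the bottom of the listing, or append something along these lines to the same script (this small usage snippet is not part of the original post):

data = createDataSet()
centroids, clusterAssment = kMeans(data, 4)      # plain k-means with k=4
print("total SSE:", sum(clusterAssment[:, 1]))   # sum of squared errors over all points
show(centroids, color='yellow')

Because randCent picks random initial centroids, the total SSE can change noticeably from run to run, which is exactly the weakness the bisecting version addresses.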

  



On this basis, we add the bisecting step:

def biKMeans(dataSet, k, distMeas=distEclud):
    m = shape(dataSet)[0]                        # number of points
    clusterAssment = mat(zeros((m, 2)))          # empty assignment matrix
    centroid0 = mean(dataSet, axis=0).tolist()   # mean of the whole data set, converted to a list
    # print(centroid0)
    centList = [centroid0]                       # list of centroids, starting with this single one
    for j in range(m):                           # traverse the data set
        # squared distance from every point to the initial centroid, stored in the
        # second column of clusterAssment; the first column stays 0 because there
        # is currently only one cluster
        clusterAssment[j, 1] = distMeas(mat(centroid0), dataSet[j, :]) ** 2
    # print(clusterAssment)
    while len(centList) < k:                     # while the number of clusters is less than k
        lowestSSE = inf                          # initialize lowestSSE to positive infinity
        for i in range(len(centList)):
            # This loop only decides which cluster is the most profitable one to split;
            # it does not yet commit to the split. Because k-means itself starts from
            # random centroids, a single run may not give the exact result.
            print(i)
            ptsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]  # all points owned by cluster i
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)      # k-means with k=2
            # axes.scatter(centroidMat[:, 0], centroidMat[:, 1], s=40, c='blue')    # draw the two new centroids
            sseSplit = sum(splitClustAss[:, 1])                                     # SSE of the split cluster
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])  # SSE of all other points
            print("sseSplit, and notSplit:", sseSplit, sseNotSplit)
            if (sseSplit + sseNotSplit) < lowestSSE:   # this split lowers the total error
                bestCentToSplit = i                    # tentatively mark this cluster as the best one to split
                bestNewCents = centroidMat             # and remember the two centroids it produced
                bestClustAss = splitClustAss.copy()    # copy the point-to-centroid assignments of this split
                lowestSSE = sseSplit + sseNotSplit     # update lowestSSE
        # The trial 2-means split labels its points 0 or 1; remap label 1 to the next new
        # cluster number and label 0 to the cluster being split, so cluster indices stay unique.
        bestClustAss[nonzero(bestClustAss[:, 0] == 1)[0], 0] = len(centList)
        bestClustAss[nonzero(bestClustAss[:, 0] == 0)[0], 0] = bestCentToSplit
        print('the bestCentToSplit is: ', bestCentToSplit)
        print('the len of bestClustAss is: ', len(bestClustAss))
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]   # replace the split centroid with one of the new centroids
        centList.append(bestNewCents[1, :].tolist()[0])              # append the other new centroid
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss  # write the new assignments back
        # print(clusterAssment)
    show(mat(centList), color='blue')
    return mat(centList), clusterAssment


cent, clusterAssment = biKMeans(createDataSet(), 4)
show(cent, color='yellow')

So we end up with bisecting K-means, and the clustering result is much improved over plain K-means.
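To see that improvement numerically rather than only in the plots, you can compare the total SSE of a plain K-means run with the bisecting version on the same data. This small check is added here for illustration; it is not in the original code, it assumes it is appended to the same script, and it will pop up the plot windows again:

data = createDataSet()

_, plainAssment = kMeans(data, 4)      # plain k-means, sensitive to the random start
_, biAssment = biKMeans(data, 4)       # bisecting k-means

print("plain k-means SSE:    ", sum(plainAssment[:, 1]))
print("bisecting k-means SSE:", sum(biAssment[:, 1]))

On this toy data set the bisecting version usually reports the lower of the two values, matching what the plots show.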



