K-means clustering algorithm; bisecting K-means clustering algorithm


Following Chapter 10 of "Machine Learning in Action", I studied the K-means and bisecting K-means clustering algorithms by typing in the code myself and correcting some minor errors in the book's original code along the way. At present the code still occasionally produces the four error messages below, which I need to keep investigating and improving.

Error messages:

Warning (from warnings module):
  File "F:\Python2.7.6\lib\site-packages\numpy\core\_methods.py", line ...
    warnings.warn("Mean of empty slice.", RuntimeWarning)
RuntimeWarning: Mean of empty slice.

Warning (from warnings module):
  File "F:\Python2.7.6\lib\site-packages\numpy\core\_methods.py", line ...
    ret, rcount, out=ret, casting='unsafe', subok=False)
RuntimeWarning: invalid value encountered in true_divide

Warning (from warnings module):
  File "E:\learning\Python books\Machine Learning in Action source code\Ch10\kMeans.py", line ...
    return sqrt(sum(power(vecA - vecB, 2)))  # la.norm(vecA - vecB)
RuntimeWarning: invalid value encountered in power

Traceback (most recent call last):
  File "E:\learning\Python books\Machine Learning in Action source code\Ch10\kMeans.py", line 147, in <module>
    centList, myNewAssments = biKmeans(newDataMat, i)
  File "E:\learning\Python books\Machine Learning in Action source code\Ch10\kMeans.py", line ..., in biKmeans
    centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
  File "E:\learning\Python books\Machine Learning in Action source code\Ch10\kMeans.py", line ..., in kMeans
    centroids = createCent(dataSet, k)
  File "E:\learning\Python books\Machine Learning in Action source code\Ch10\kMeans.py", line ..., in randCent
    minJ = min(dataSet[:, j])
ValueError: min() arg is an empty sequence
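All four messages appear to share one root cause, which the post leaves open: randCent draws each initial centroid coordinate uniformly at random, so a centroid can end up with no points assigned to it. mean() over that empty cluster emits the "Mean of empty slice" and "invalid value encountered in true_divide" warnings and produces NaN centroids (hence "invalid value encountered in power"), and when biKmeans later tries to re-split the empty cluster, min(dataSet[:, j]) inside randCent receives an empty sequence and raises the ValueError. One possible repair, sketched below, is to reseed an empty cluster's centroid from a random data point; this guard is my own addition, not part of the book's code, and would replace the centroid-update loop inside kMeans in the full listing that follows.

# Sketch of a guard against empty clusters (editor's addition, not the
# book's code): replace the centroid-update loop in kMeans with this.
for cent in range(k):
    ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
    if shape(ptsInClust)[0] == 0:
        # no point was assigned to centroid `cent`: reseed it with a
        # randomly chosen data point and force another iteration
        centroids[cent, :] = dataSet[random.randint(0, m), :]
        clusterChanged = True
    else:
        centroids[cent, :] = mean(ptsInClust, axis=0)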

The Python code for the K-means clustering algorithm and the bisecting K-means clustering algorithm is as follows:

# -*- coding: utf-8 -*-
'''
K-means clustering, basic idea:
1. Given a distance measure and a predefined number of clusters k, first
   randomly select k initial cluster centers (different initial centers lead
   to different convergence speed and clustering results, and the algorithm
   may fall into a local optimum).
2. Compute the distance from each point to each cluster center and assign
   the point to the nearest cluster.
3. Recompute the center of each cluster.
4. Repeat steps 2 and 3 until the cluster centers no longer change, then stop.

Bisecting K-means clustering, basic idea:
Start by treating all points as one cluster and split it in two. Then
repeatedly choose one of the existing clusters to split, picking the cluster
whose split gives the largest reduction in SSE (sum of squared errors).
Repeat this SSE-based splitting until the user-specified number of clusters
is reached.

This code slightly adjusts the book's original code and corrects some minor
errors.
Author: <[email protected]>
Date: 2016-06-11
'''
import pickle
import matplotlib
import matplotlib.pyplot as plt
from numpy import *
from pprint import pprint


def loadDataSet(fileName):
    '''Load a data matrix from a file.
    param fileName: name of the file holding the data matrix, str
    return dataMat: data matrix [[], [], ...], list of lists
    '''
    dataMat = []
    with open(fileName) as f:
        for line in f.readlines():
            curLine = line.strip().split('\t')
            fltLine = map(float, curLine)
            dataMat.append(fltLine)
    return dataMat


def distEclud(vecA, vecB):
    '''Compute the Euclidean distance between two vectors.
    param vecA, vecB: the two vectors to compare, numpy.ndarray
    return: Euclidean distance between the two vectors
    '''
    return sqrt(sum(power(vecA - vecB, 2)))


def randCent(dataSet, k):
    '''Randomly select k initial cluster centers.
    param dataSet: data matrix
    param k: number of clusters
    return: k initial cluster centers
    '''
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))
    for j in range(n):
        try:
            minJ = min(dataSet[:, j])
        except:
            # debugging aid: an empty dataSet ends up here and triggers the
            # "min() arg is an empty sequence" error shown above
            print dataSet
        maxJ = max(dataSet[:, j])
        rangeJ = float(maxJ - minJ)
        centroids[:, j] = minJ + rangeJ * random.rand(k, 1)
    return centroids


def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    '''K-means clustering.
    param dataSet: data set, numpy.matrix
    param k: number of clusters, int
    param distMeas: distance measure, Euclidean by default, function
    param createCent: how the initial cluster centers are generated,
                      random by default, function
    return centroids: the cluster centers, numpy.matrix
    return clusterAssment: the cluster assignment of each record, numpy.matrix
        matrix([[cluster tag, squared Euclidean distance],
                [cluster tag, squared Euclidean distance], ...])
    '''
    m = shape(dataSet)[0]
    clusterAssment = mat(zeros((m, 2)))
    centroids = createCent(dataSet, k)
    clusterChanged = True
    # record the number of iterations
    iteration = 1
    while clusterChanged:
        clusterChanged = False
        for i in range(m):
            minDist = inf; minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI; minIndex = j
            # check whether the assignment changed; iteration stops only
            # once no point changes its cluster
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
##        print "iter:", iteration
##        iteration += 1
##        print centroids
        for cent in range(k):
            '''
            mean(matrix, axis=0) computes the mean along the columns.
            clusterAssment[:, 0] == cent returns a bool matrix.
            nonzero(clusterAssment[:, 0] == cent) returns the coordinates of
            the True values as (array(x1, x2, ...), array(y1, y2, ...)),
            i.e. (first-dimension coordinates, second-dimension coordinates).
            nonzero(clusterAssment[:, 0] == cent)[0] returns the
            first-dimension coordinates array(x1, x2, ...).
            '''
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = mean(ptsInClust, axis=0)
    return centroids, clusterAssment


def biKmeans(dataSet, k, distMeas=distEclud):
    '''Bisecting K-means clustering.
    param dataSet: data set, numpy.matrix
    param k: number of clusters
    param distMeas: distance measure, Euclidean by default, function
    return mat(centList): the cluster centers, numpy.matrix
    return clusterAssment: the cluster assignment of each record, numpy.matrix
        matrix([[cluster tag, squared Euclidean distance],
                [cluster tag, squared Euclidean distance], ...])
    '''
    m = shape(dataSet)[0]                # number of records
    clusterAssment = mat(zeros((m, 2)))  # initialize assignments to zero
    # center of the initial cluster: the mean of all records, as a list
    centroid0 = mean(dataSet, axis=0).tolist()[0]
    # centList keeps the cluster centers; its length is the current number
    # of clusters; start with the single initial center
    centList = [centroid0]
    # squared distance from each record to the initial cluster center
    for j in range(m):
        clusterAssment[j, 1] = distMeas(dataSet[j, :], mat(centroid0)) ** 2
    while len(centList) < k:             # stop once there are k clusters
        lowestSSE = inf                  # initialize the SSE to infinity
        # try to split each existing cluster and keep the split that gives
        # the largest drop in SSE
        for i in range(len(centList)):
            # records belonging to cluster i
            ptsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]
            # 2-means split of cluster i
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
            # SSE of the two clusters produced by the split
            sseSplit = sum(splitClustAss[:, 1])
            # SSE of all clusters other than cluster i
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])
##            print "sseSplit, sseNotSplit:", sseSplit, sseNotSplit
            # keep this split if it gives the lowest total SSE so far
            if sseSplit + sseNotSplit < lowestSSE:
                lowestSSE = sseSplit + sseNotSplit
                bestCentToSplit = i
                bestNewCents = centroidMat.copy()
                bestClustAss = splitClustAss.copy()
        # relabel the two clusters produced by the best split
        bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)
        bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
##        print "the bestCentToSplit is:", bestCentToSplit
##        print "the length of bestClustAss is:", len(bestClustAss)
        # update the saved list of cluster centers
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]
        centList.append(bestNewCents[1, :].tolist()[0])
        # update the cluster assignments
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss
    return mat(centList), clusterAssment


def distSLC(vecA, vecB):
    '''Distance between two points on a sphere.
    For point A at latitude beta1 and longitude alpha1, and point B at
    latitude beta2 and longitude alpha2, the distance is
        S = R * arccos(cos(beta1)*cos(beta2)*cos(alpha1 - alpha2)
                       + sin(beta1)*sin(beta2))
    param vecA: coordinates of point A
    param vecB: coordinates of point B
    return: the spherical distance
    '''
    a = sin(vecA[0, 1] * pi / 180) * sin(vecB[0, 1] * pi / 180)
    b = cos(vecA[0, 1] * pi / 180) * cos(vecB[0, 1] * pi / 180) * \
        cos(pi * (vecB[0, 0] - vecA[0, 0]) / 180)
    return arccos(a + b) * 6371.0


def clusterClubs(numClust=5):
    '''Cluster the points on the map by spherical distance and show them
    on the map.
    param numClust: number of clusters, 5 by default, int
    '''
    datList = []
    with open('places.txt') as f:
        for line in f.readlines():
            lineArr = line.strip().split('\t')
            datList.append([float(lineArr[4]), float(lineArr[3])])
    datMat = mat(datList)
    myCentroids, clustAssing = biKmeans(datMat, numClust, distMeas=distSLC)
    fig = plt.figure()
    rect = [0.1, 0.1, 0.8, 0.8]
    scatterMarkers = ['s', 'o', '^', '8', 'p', 'd', 'v', 'h', '>', '<']
    axprops = dict(xticks=[], yticks=[])
    ax0 = fig.add_axes(rect, label='ax0', **axprops)
    imgP = plt.imread('Portland.png')
    ax0.imshow(imgP)
    ax1 = fig.add_axes(rect, label='ax1', frameon=False)
    for i in range(numClust):
        ptsInCurrCluster = datMat[nonzero(clustAssing[:, 0].A == i)[0], :]
        markerStyle = scatterMarkers[i % len(scatterMarkers)]
        ax1.scatter(ptsInCurrCluster[:, 0].flatten().A[0],
                    ptsInCurrCluster[:, 1].flatten().A[0],
                    marker=markerStyle, s=90)
    ax1.scatter(myCentroids[:, 0].flatten().A[0],
                myCentroids[:, 1].flatten().A[0],
                marker='+', s=300)
    plt.show()


def resuDisp(clusterAssing):
    '''Print the clustering result.
    param clusterAssing: clustering result matrix
    return resuDict: clustering result dictionary
    '''
    resuList = clusterAssing.tolist()
    resuDict = {}
    for i in range(len(resuList)):
        label = resuList[i][0]
        resuDict[label] = resuDict.get(label, [])
        resuDict[label].append(i)
    for key, value in resuDict.iteritems():
        print key, ':', value
    return resuDict


def storeResu(resuDict, fileName):
    '''Store the clustering result in a file.'''
    with open(fileName, 'w') as f:
        pickle.dump(resuDict, f)


def getResu(fileName):
    '''Read a clustering result back from a file.'''
    with open(fileName) as f:
        resuDict = pickle.load(f)
    return resuDict


def plotCluster(datMat, clustAssing, myCentroids, numClust):
    '''Plot the clustering result.
    param datMat: data set
    param clustAssing: clustering result
    param myCentroids: cluster centers
    param numClust: number of clusters
    '''
    fig = plt.figure()
    rect = [0.1, 0.1, 0.8, 0.8]
    scatterMarkers = ['s', 'o', '^', '8', 'p', 'd', 'v', 'h', '>', '<']
    ax1 = fig.add_axes(rect, label='ax1', frameon=False)
    for i in range(numClust):
        ptsInCurrCluster = datMat[nonzero(clustAssing[:, 0].A == i)[0], :]
        markerStyle = scatterMarkers[i % len(scatterMarkers)]
        ax1.scatter(ptsInCurrCluster[:, 0].flatten().A[0],
                    ptsInCurrCluster[:, 1].flatten().A[0],
                    marker=markerStyle, s=90)
    ax1.scatter(myCentroids[:, 0].flatten().A[0],
                myCentroids[:, 1].flatten().A[0],
                marker='+', s=300)
    plt.show()


if __name__ == "__main__":
##    dataMat = mat(loadDataSet('testSet.txt'))
##    newDataMat = mat(loadDataSet('testSet2.txt'))
##    print "minimum of column 0:", min(dataMat[:, 0])
##    print "minimum of column 1:", min(dataMat[:, 1])
##    print "maximum of column 1:", max(dataMat[:, 1])
##    print "maximum of column 0:", max(dataMat[:, 0])
##    print "random centroids:\n"
##    pprint(randCent(dataMat, 2))
##    print "Euclidean distance of record_0 and record_1:", distEclud(dataMat[0], dataMat[1])
##    # Because the initial cluster centers are chosen randomly, the number
##    # of iterations needed to converge differs from run to run.
##    myCentroids, clusterAssing = kMeans(dataMat, 3)
##    resuDict = resuDisp(clusterAssing)
##    storeResu(resuDict, 'clusterResult')
##    result = getResu('clusterResult')
##    centList, myNewAssments = biKmeans(newDataMat, 3)
##    plotCluster(newDataMat, myNewAssments, centList, 3)
##    resuDict = resuDisp(myNewAssments)
##    totalSSE = sum(myNewAssments[:, 1])
    # requires places.txt and Portland.png from the book's Ch10 data
    clusterClubs(numClust=5)
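To try the functions without the book's testSet.txt, testSet2.txt, places.txt, and Portland.png data files, a minimal driver like the one below can be used. This is my own sketch, not part of the original post: the synthetic two-blob data and the random seed are assumptions, and it assumes the functions above are defined in the same file (or imported from it).

# Editor's sketch of a minimal demo on synthetic data (assumptions noted above).
random.seed(42)                  # numpy's random, via `from numpy import *`
blob1 = random.randn(50, 2)      # points scattered around (0, 0)
blob2 = random.randn(50, 2) + 5  # points scattered around (5, 5)
demoMat = mat(vstack((blob1, blob2)))
centroids, assments = biKmeans(demoMat, 2)
print "centroids:\n", centroids
print "total SSE:", sum(assments[:, 1])

With two well-separated blobs like these, biKmeans should recover centers near (0, 0) and (5, 5) and report a small total SSE.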


I hope readers will point out the remaining imperfections so that we can learn from each other.


