K-means clustering implementation in Python
The k-means algorithm is simple in concept and easy to understand, and it takes only a little time to implement in Python. It does have shortcomings of its own. One is the choice of the initial positions of the k centroids, for which the k-means++ algorithm has been proposed. Another is that there is no well-developed theory for choosing k itself; the classic approaches are the silhouette coefficient and the bisecting clustering algorithm. An implementation of bisecting k-means is included at the end. The code mainly follows the book Machine Learning in Action.
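To make the k-means++ idea mentioned above concrete, here is a minimal sketch of its seeding step, assuming the data is a 2-D numeric array with one point per row. The function name `kmeans_pp_init` and the optional `rng` parameter are hypothetical and do not come from the book. The first centroid is a uniformly random data point, and each later centroid is a data point drawn with probability proportional to its squared distance from the nearest centroid already chosen.

```python
import numpy as np

def kmeans_pp_init(data, k, rng=None):
    """Pick k initial centroids from the rows of `data` (k-means++ seeding sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    data = np.asarray(data, dtype=float)
    m = data.shape[0]

    # First centroid: one data point chosen uniformly at random.
    centroids = [data[rng.integers(m)]]
    # Squared distance from every point to its nearest chosen centroid.
    closest_sq = np.sum((data - centroids[0]) ** 2, axis=1)

    for _ in range(1, k):
        total = closest_sq.sum()
        if total == 0:
            # Every point coincides with a chosen centroid: fall back to uniform.
            idx = rng.integers(m)
        else:
            # Far-away points are proportionally more likely to be picked.
            idx = rng.choice(m, p=closest_sq / total)
        centroids.append(data[idx])
        new_sq = np.sum((data - data[idx]) ** 2, axis=1)
        closest_sq = np.minimum(closest_sq, new_sq)

    return np.array(centroids)
```

Because it takes the data set and k, a function like this should be usable in place of a random initializer such as `randCent`, for example as `createCent=lambda d, k: np.mat(kmeans_pp_init(d, k))`; that wiring is an assumption and has not been checked against the book's data. The article's own listing follows.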
    # encoding: UTF-8
    '''
    Created on September 21, 2015
    @author: ZHOUMEIXU204
    '''
    import numpy as np

    path = u"D:\\Users\\zhoumeixu204\\Desktop\\python language Machine Learning\\machine learning practice code python\\machine learning practice code\\machinelearninginaction\\Ch10\\"


    def loadDataSet(fileName):
        # Read a tab-separated file into a list of rows of floats
        dataMat = []
        with open(fileName) as fr:
            for line in fr.readlines():
                curLine = line.strip().split('\t')
                fltLine = list(map(float, curLine))
                dataMat.append(fltLine)
        return dataMat


    def distEclud(vecA, vecB):
        # Euclidean distance between two vectors
        return np.sqrt(np.sum(np.power(vecA - vecB, 2)))


    def randCent(dataSet, k):
        # Build k random centroids inside the bounding box of the data
        n = np.shape(dataSet)[1]
        centroids = np.mat(np.zeros((k, n)))
        for j in range(n):
            minJ = np.min(dataSet[:, j])
            rangeJ = float(np.max(dataSet[:, j]) - minJ)
            centroids[:, j] = minJ + rangeJ * np.random.rand(k, 1)
        return centroids


    dataMat = np.mat(loadDataSet(path + 'testSet.txt'))
    print(dataMat[:, 0])


    # Every number is greater than -inf and smaller than +inf
    def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
        m = np.shape(dataSet)[0]
        # Column 0: index of the assigned cluster; column 1: squared distance to it
        clusterAssment = np.mat(np.zeros((m, 2)))
        centroids = createCent(dataSet, k)
        clusterChanged = True
        while clusterChanged:
            clusterChanged = False
            for i in range(m):
                minDist = np.inf  # np.inf is positive infinity
                minIndex = -1
                for j in range(k):
                    distJI = distMeas(centroids[j, :], dataSet[i, :])
                    if distJI < minDist:
                        minDist = distJI
                        minIndex = j
                if clusterAssment[i, 0] != minIndex:
                    clusterChanged = True
                clusterAssment[i, :] = minIndex, minDist ** 2
            print(centroids)
            for cent in range(k):
                # np.nonzero returns one index array per dimension (two here);
                # [0] keeps only the row indices of the points in this cluster
                ptsInClust = dataSet[np.nonzero(clusterAssment[:, 0].A == cent)[0]]
                centroids[cent, :] = np.mean(ptsInClust, axis=0)
        return centroids, clusterAssment


    myCentroids, clustAssing = kMeans(dataMat, 4)
    print(myCentroids, clustAssing)


    # Bisecting k-means
    def biKmeans(dataSet, k, distMeas=distEclud):
        m = np.shape(dataSet)[0]
        clusterAssment = np.mat(np.zeros((m, 2)))
        # Start with a single cluster whose centroid is the mean of all points
        centroid0 = np.mean(dataSet, axis=0).tolist()[0]
        centList = [centroid0]
        for j in range(m):
            clusterAssment[j, 1] = distMeas(np.mat(centroid0), dataSet[j, :]) ** 2
        while len(centList) < k:
            lowestSSE = np.inf
            for i in range(len(centList)):
                # Try splitting cluster i in two and measure the total SSE
                ptsInCurrCluster = dataSet[np.nonzero(clusterAssment[:, 0].A == i)[0], :]
                centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
                sseSplit = np.sum(splitClustAss[:, 1])
                sseNotSplit = np.sum(clusterAssment[np.nonzero(clusterAssment[:, 0].A != i)[0], 1])
                print("sseSplit, and notSplit:", sseSplit, sseNotSplit)
                if (sseSplit + sseNotSplit) < lowestSSE:
                    bestCentToSplit = i
                    bestNewCents = centroidMat
                    bestClustAss = splitClustAss.copy()
                    lowestSSE = sseSplit + sseNotSplit
            # Relabel the two halves: one gets a new index, the other keeps the old one
            bestClustAss[np.nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)
            bestClustAss[np.nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
            print("the bestCentToSplit is:", bestCentToSplit)
            print("the len of bestClustAss is:", len(bestClustAss))
            centList[bestCentToSplit] = bestNewCents[0, :]
            centList.append(bestNewCents[1, :])
            clusterAssment[np.nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss
        return centList, clusterAssment


    print(u"bisecting clustering analysis result start")
    # Reuse the data set loaded above
    centList, myNewAssments = biKmeans(dataMat, 3)
    print(centList)
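The article names the silhouette coefficient as a way to judge the choice of k but does not show it. The following is a minimal sketch, assuming Euclidean distance, at least two clusters, and a data set small enough for a full pairwise distance matrix. The function name `mean_silhouette` is hypothetical. For each point, a is the mean distance to the other points of its own cluster and b is the smallest mean distance to the points of any other cluster; the point's score is (b - a) / max(a, b), and the coefficient is the mean over all points.

```python
import numpy as np

def mean_silhouette(data, labels):
    """Mean silhouette coefficient of a clustering (sketch, Euclidean distance)."""
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels).ravel()

    # Full pairwise distance matrix; memory grows with the square of the point count.
    diff = data[:, None, :] - data[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))

    clusters = np.unique(labels)
    scores = np.zeros(len(data))
    for i in range(len(data)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # a singleton cluster is conventionally scored 0
        # dist[i, i] is 0, so dividing by n_same - 1 excludes the point itself.
        a = dist[i, same].sum() / (n_same - 1)
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        denom = max(a, b)
        scores[i] = 0.0 if denom == 0 else (b - a) / denom
    return scores.mean()

# Two tight, well-separated pairs of points, labeled sensibly and badly.
pts = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 0.0], [10.0, 1.0]])
good = mean_silhouette(pts, [0, 0, 1, 1])
bad = mean_silhouette(pts, [0, 1, 0, 1])
print(good > bad)
```

With the listing's output, the labels could be taken from the first column of the assignment matrix, for example `clustAssing[:, 0].A.ravel()`. Running the clustering for several values of k and keeping the one with the highest score is one common way to use it.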
That is all for this article. I hope it is helpful for your learning.