K-means clustering algorithm: advantages and disadvantages
Advantages: easy to implement.
Disadvantages: may converge to a local minimum; convergence can be slow on large data sets.
Works with: numeric data.
How the algorithm works
K-means works by computing the distance between samples to judge how closely related they are: samples that are close to each other are placed in the same cluster.
1. First choose a value for K, i.e. how many clusters to divide the data into. The choice of K has a great influence on the result. Ng's course mentions two approaches. One is the elbow method: simply put, plot the cost of the clustering result as a function of K and pick the K beyond which the improvement levels off. The other is to choose K from the specific need; for example, when clustering shirt sizes you might divide them into three classes (L, M, S).
2. Next select the initial cluster centers (also called centroids). They are usually chosen at random: in the code below they are random points within the range of the data; another option is to pick random points from the data itself. This choice strongly affects the final result; with bad luck the algorithm converges to a local minimum. There are two common remedies: run the algorithm several times with different initializations and keep the best result, or use an improved algorithm such as bisecting k-means.
3. Finally the main loop: compute the distance from every point in the data set to each centroid and assign each point to the cluster of its nearest centroid. After the assignment, compute the mean of each cluster and use that mean as the new centroid. Repeat these two steps until nothing changes any more, which gives the final result.
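To make the elbow method from step 1 concrete, here is a minimal self-contained sketch (the helper `kmeans_sse` and the synthetic data are my own, not the implementation from this post): run k-means for several values of K, record the total squared error (SSE), and look for the K after which the curve stops dropping sharply.

```python
import numpy as np

def kmeans_sse(data, k, n_iter=100, seed=0):
    """One run of a basic k-means; returns the total squared error (SSE)."""
    rng = np.random.default_rng(seed)
    # initialise with k distinct data points chosen at random
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # distance of every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # new centroid = mean of the assigned points (keep old one if a cluster is empty)
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return float(((data - centroids[labels]) ** 2).sum())

if __name__ == '__main__':
    rng = np.random.default_rng(42)
    # three well-separated synthetic blobs
    data = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                      for c in [(0, 0), (5, 5), (0, 5)]])
    for k in range(1, 7):
        print(k, round(kmeans_sse(data, k), 2))
```

On well-separated blobs like these, the SSE falls steeply until K reaches the true number of groups and only slowly afterwards; that bend is the "elbow".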
Functions
loadDataSet(fileName)
Reads the data set from a file.
distEclud(vecA, vecB)
Computes the distance between two vectors; the Euclidean distance is used here, but any other reasonable distance works as well.
randCent(dataSet, k)
Randomly generates the initial centroids as points within the range of the data.
kMeans(dataSet, k, distMeas=distEclud, createCent=randCent)
The k-means algorithm. Takes the data and the value of K; the last two optional arguments are the distance function and the method for choosing the initial centroids.
show(dataSet, k, centroids, clusterAssment)
Visualizes the result.
# coding=utf-8
from numpy import *


def loadDataSet(fileName):
    dataMat = []
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        fltLine = list(map(float, curLine))
        dataMat.append(fltLine)
    return dataMat


# compute the distance between two vectors, here the Euclidean distance
def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))


# randomly generate the initial centroids within the range of the data
# (Ng's course suggests instead picking k random data points)
def randCent(dataSet, k):
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))
    for j in range(n):
        minJ = min(dataSet[:, j])
        rangeJ = float(max(array(dataSet)[:, j]) - minJ)
        centroids[:, j] = minJ + rangeJ * random.rand(k, 1)
    return centroids


def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = shape(dataSet)[0]
    # first column: assigned centroid; second column: squared error of the point
    clusterAssment = mat(zeros((m, 2)))
    centroids = createCent(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):  # assign each data point to the closest centroid
            minDist = inf
            minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
        print(centroids)
        for cent in range(k):  # recalculate the centroids
            # get all the points in this cluster
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = mean(ptsInClust, axis=0)
    return centroids, clusterAssment


def show(dataSet, k, centroids, clusterAssment):
    from matplotlib import pyplot as plt
    numSamples, dim = dataSet.shape
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    for i in range(numSamples):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])
    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize=12)
    plt.show()


def main():
    dataMat = mat(loadDataSet('testSet.txt'))
    myCentroids, clustAssing = kMeans(dataMat, 4)
    print(myCentroids)
    show(dataMat, 4, myCentroids, clustAssing)


if __name__ == '__main__':
    main()
Here is a clustering result, and it is quite good; sometimes, however, the algorithm converges to a local minimum, as in the following run, which unluckily ended up at a local optimum.
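The "run it several times" remedy from step 2 can be sketched as follows: repeat the clustering with different random initializations and keep the run with the smallest total squared error. This is a self-contained sketch with its own minimal k-means (`lloyd` and `kmeans_best_of` are my names, not functions from the listing above).

```python
import numpy as np

def lloyd(data, k, rng, n_iter=100):
    """One k-means run from a random initialization; returns (centroids, labels, sse)."""
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    sse = float(((data - centroids[labels]) ** 2).sum())
    return centroids, labels, sse

def kmeans_best_of(data, k, n_restarts=10, seed=0):
    """Run k-means n_restarts times and keep the result with the lowest SSE."""
    rng = np.random.default_rng(seed)
    return min((lloyd(data, k, rng) for _ in range(n_restarts)),
               key=lambda result: result[2])
```

A single unlucky initialization can land in a local minimum like the one above, but the chance that all of ten independent restarts do so is much smaller, which is why taking the best of several runs is a common practical fix.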