K-means Clustering Algorithm Python implementation


K-means clustering algorithm: advantages and disadvantages

Advantages: easy to implement.
Disadvantages: may converge to a local minimum; convergence is slow on large-scale datasets.
Applicable data types: numeric data.

Algorithm overview

The K-means algorithm groups samples by computing the distances between them: samples that are close to one another are placed in the same cluster.

1. First we need to choose a value of K, i.e., decide how many clusters to divide the data into. This choice has a great influence on the result. Ng's course mentions two ways to pick it. One is the elbow method: simply put, plot the clustering cost as a function of K and choose the K at which the improvement levels off. The other is to let specific needs decide; for example, when clustering shirt sizes you might divide them into three classes (L, M, S).

2. Then we need to select the initial cluster centers (also called centroids). They are usually chosen at random: in the code here they are random points within the range of the data, while an alternative is to pick random points from the data itself. This choice greatly affects the final result; with bad luck the algorithm converges to a local minimum. There are two common remedies: one is to run the algorithm multiple times and keep the best result, the other is an improved algorithm (bisecting K-means).

3. Finally we get to the main loop: compute the distance from every point in the data set to each centroid, and assign each point to the cluster of its nearest centroid. Once that is done, compute the mean of each cluster and use it as that cluster's new centroid. Repeat these two steps over and over until the assignments stop changing and the result has converged.
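The two alternating steps above can be sketched directly in NumPy. This is a minimal illustration under stated assumptions, not the full implementation; `kmeans_sketch` and the toy data are names made up here:

```python
import numpy as np

def kmeans_sketch(X, k, n_iters=10, seed=0):
    """Minimal K-means: alternate assignment and mean-update steps."""
    rng = np.random.default_rng(seed)
    # initialize centroids by picking k distinct data points at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        # squared Euclidean distance from every point to every centroid
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)  # step 1: nearest-centroid assignment
        for j in range(k):             # step 2: move centroids to cluster means
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

# two well-separated blobs, so the clustering is unambiguous
X = np.array([[0.0, 0.0], [0.1, 0.2], [10.0, 10.0], [9.9, 10.1]])
centroids, labels = kmeans_sketch(X, k=2)
```

With clearly separated data like this, the loop stabilizes after a couple of iterations; on real data one would also stop early once the assignments no longer change.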

Functions

loadDataSet(fileName)
Reads the data set from a file.
distEclud(vecA, vecB)
Computes the distance between two vectors; Euclidean distance is used here, but other reasonable distances work too.
randCent(dataSet, k)
Randomly generates the initial centroids as points within the range of the data.
kMeans(dataSet, k, distMeas=distEclud, createCent=randCent)
The K-means algorithm. Takes the data and a K value; the last two optional arguments are the distance function and the initial-centroid selection method.
show(dataSet, k, centroids, clusterAssment)
Visualizes the result.

    # coding=utf-8
    from numpy import *

    def loadDataSet(fileName):
        dataMat = []
        fr = open(fileName)
        for line in fr.readlines():
            curLine = line.strip().split('\t')
            fltLine = list(map(float, curLine))
            dataMat.append(fltLine)
        return dataMat

    # Compute the distance between two vectors (Euclidean distance here)
    def distEclud(vecA, vecB):
        return sqrt(sum(power(vecA - vecB, 2)))

    # Randomly generate the initial centroids within the range of the data
    # (Ng's course suggests picking k random data points instead)
    def randCent(dataSet, k):
        n = shape(dataSet)[1]
        centroids = mat(zeros((k, n)))
        for j in range(n):
            minJ = min(dataSet[:, j])
            rangeJ = float(max(dataSet[:, j]) - minJ)
            centroids[:, j] = minJ + rangeJ * random.rand(k, 1)
        return centroids

    def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
        m = shape(dataSet)[0]
        # cluster index and squared error for each point
        clusterAssment = mat(zeros((m, 2)))
        centroids = createCent(dataSet, k)
        clusterChanged = True
        while clusterChanged:
            clusterChanged = False
            for i in range(m):  # assign each point to the closest centroid
                minDist = inf
                minIndex = -1
                for j in range(k):
                    distJI = distMeas(centroids[j, :], dataSet[i, :])
                    if distJI < minDist:
                        minDist = distJI
                        minIndex = j
                if clusterAssment[i, 0] != minIndex:
                    clusterChanged = True
                clusterAssment[i, :] = minIndex, minDist ** 2
            print(centroids)
            for cent in range(k):  # recalculate the centroids
                # get all the points in this cluster
                ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
                # assign centroid to the mean of its cluster
                centroids[cent, :] = mean(ptsInClust, axis=0)
        return centroids, clusterAssment

    def show(dataSet, k, centroids, clusterAssment):
        from matplotlib import pyplot as plt
        numSamples, dim = dataSet.shape
        mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
        for i in range(numSamples):
            markIndex = int(clusterAssment[i, 0])
            plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])
        mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
        for i in range(k):
            plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize=12)
        plt.show()

    def main():
        dataMat = mat(loadDataSet('testSet.txt'))
        myCentroids, clustAssing = kMeans(dataMat, 4)
        print(myCentroids)
        show(dataMat, 4, myCentroids, clustAssing)

    if __name__ == '__main__':
        main()

Here is a clustering result, and quite a good one; sometimes, however, the algorithm converges to a local minimum, as in the following example, which unfortunately got stuck in a local optimum.
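As mentioned in step 2, one simple defense against a bad initialization is to restart the algorithm several times and keep the run with the lowest total squared error (SSE). A minimal NumPy sketch of that idea, with made-up names (`kmeans_once`, `kmeans_restarts`) rather than the functions from the listing above:

```python
import numpy as np

def kmeans_once(X, k, rng, n_iters=10):
    """One K-means run from a random initialization; returns its SSE too."""
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iters):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centroids[j] = X[labels == j].mean(axis=0)
    sse = ((X - centroids[labels]) ** 2).sum()  # total squared error
    return centroids, labels, sse

def kmeans_restarts(X, k, n_restarts=10, seed=0):
    """Run K-means several times and keep the run with the lowest SSE."""
    rng = np.random.default_rng(seed)
    runs = [kmeans_once(X, k, rng) for _ in range(n_restarts)]
    return min(runs, key=lambda run: run[2])

# three tight pairs: a single unlucky run may merge two of them,
# but the best of ten restarts should separate all three
X = np.array([[0.0, 0.0], [0.2, 0.1],
              [5.0, 5.0], [5.1, 4.9],
              [10.0, 10.0], [9.8, 10.2]])
centroids, labels, sse = kmeans_restarts(X, k=3)
```

Bisecting K-means attacks the same problem differently: it starts from a single cluster and repeatedly splits the cluster whose split reduces the SSE the most.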


