K-means clustering algorithm: advantages and disadvantages
Advantages: easy to implement.
Disadvantages: may converge to a local minimum; convergence can be slow on large data sets.
Works with: numeric data.
How the algorithm works
K-means works by computing the distance between samples to judge how closely related they are: samples that are close to each other are placed in the same cluster.
1. First choose a value for K, i.e. how many clusters to divide the data into. The choice of K has a great influence on the result. Ng's course mentions two approaches. One is the elbow method: simply put, plot the cost of the clustering result as a function of K and pick the K beyond which the improvement levels off. The other is to choose K from the specific need; for example, when clustering shirt sizes you might divide them into three classes (L, M, S).
2. Next select the initial cluster centers (also called centroids). They are usually chosen at random: in the code below they are random points within the range of the data; another option is to pick random points from the data itself. This choice strongly affects the final result; with bad luck the algorithm converges to a local minimum. There are two common remedies: run the algorithm several times with different initializations and keep the best result, or use an improved algorithm such as bisecting k-means.
3. Finally the main loop: compute the distance from every point in the data set to each centroid and assign each point to the cluster of its nearest centroid. After the assignment, compute the mean of each cluster and use that mean as the new centroid. Repeat these two steps until nothing changes any more, which gives the final result.
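To make the elbow method from step 1 concrete, here is a minimal self-contained sketch (the helper `kmeans_sse` and the synthetic data are my own, not the implementation from this post): run k-means for several values of K, record the total squared error (SSE), and look for the K after which the curve stops dropping sharply.

```python
import numpy as np

def kmeans_sse(data, k, n_iter=100, seed=0):
    """One run of a basic k-means; returns the total squared error (SSE)."""
    rng = np.random.default_rng(seed)
    # initialise with k distinct data points chosen at random
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        # distance of every point to every centroid, shape (n_points, k)
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # new centroid = mean of the assigned points (keep old one if a cluster is empty)
        new_centroids = np.array([
            data[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return float(((data - centroids[labels]) ** 2).sum())

if __name__ == '__main__':
    rng = np.random.default_rng(42)
    # three well-separated synthetic blobs
    data = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
                      for c in [(0, 0), (5, 5), (0, 5)]])
    for k in range(1, 7):
        print(k, round(kmeans_sse(data, k), 2))
```

On well-separated blobs like these, the SSE falls steeply until K reaches the true number of groups and only slowly afterwards; that bend is the "elbow".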
Functions
loadDataSet(fileName)
Reads the data set from a file.
distEclud(vecA, vecB)
Computes the distance between two vectors; the Euclidean distance is used here, but any other reasonable distance works as well.
randCent(dataSet, k)
Randomly generates the initial centroids as points within the range of the data.
kMeans(dataSet, k, distMeas=distEclud, createCent=randCent)
The k-means algorithm. Takes the data and the value of K; the last two optional arguments are the distance function and the method for choosing the initial centroids.
show(dataSet, k, centroids, clusterAssment)
Visualizes the result.
# coding=utf-8
from numpy import *


def loadDataSet(fileName):
    dataMat = []
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        fltLine = list(map(float, curLine))
        dataMat.append(fltLine)
    return dataMat


# compute the distance between two vectors, here the Euclidean distance
def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))


# randomly generate the initial centroids within the range of the data
# (Ng's course suggests instead picking k random data points)
def randCent(dataSet, k):
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))
    for j in range(n):
        minJ = min(dataSet[:, j])
        rangeJ = float(max(array(dataSet)[:, j]) - minJ)
        centroids[:, j] = minJ + rangeJ * random.rand(k, 1)
    return centroids


def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = shape(dataSet)[0]
    # first column: assigned centroid; second column: squared error of the point
    clusterAssment = mat(zeros((m, 2)))
    centroids = createCent(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):  # assign each data point to the closest centroid
            minDist = inf
            minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
        print(centroids)
        for cent in range(k):  # recalculate the centroids
            # get all the points in this cluster
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = mean(ptsInClust, axis=0)
    return centroids, clusterAssment


def show(dataSet, k, centroids, clusterAssment):
    from matplotlib import pyplot as plt
    numSamples, dim = dataSet.shape
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    for i in range(numSamples):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])
    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize=12)
    plt.show()


def main():
    dataMat = mat(loadDataSet('testSet.txt'))
    myCentroids, clustAssing = kMeans(dataMat, 4)
    print(myCentroids)
    show(dataMat, 4, myCentroids, clustAssing)


if __name__ == '__main__':
    main()
Here is a clustering result, and it is quite good; sometimes, however, the algorithm converges to a local minimum, as in the following run, which unluckily ended up at a local optimum.
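The "run it several times" remedy from step 2 can be sketched as follows: repeat the clustering with different random initializations and keep the run with the smallest total squared error. This is a self-contained sketch with its own minimal k-means (`lloyd` and `kmeans_best_of` are my names, not functions from the listing above).

```python
import numpy as np

def lloyd(data, k, rng, n_iter=100):
    """One k-means run from a random initialization; returns (centroids, labels, sse)."""
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(n_iter):
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new = np.array([data[labels == j].mean(axis=0) if np.any(labels == j)
                        else centroids[j] for j in range(k)])
        if np.allclose(new, centroids):
            break
        centroids = new
    sse = float(((data - centroids[labels]) ** 2).sum())
    return centroids, labels, sse

def kmeans_best_of(data, k, n_restarts=10, seed=0):
    """Run k-means n_restarts times and keep the result with the lowest SSE."""
    rng = np.random.default_rng(seed)
    return min((lloyd(data, k, rng) for _ in range(n_restarts)),
               key=lambda result: result[2])
```

A single unlucky initialization can land in a local minimum like the one above, but the chance that all of ten independent restarts do so is much smaller, which is why taking the best of several runs is a common practical fix.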