A Python implementation of the K-means clustering algorithm from "Machine Learning in Action"



My most recent project is about "circuit failure analysis based on data mining". The project itself is largely my seniors' work; I am just studying the algorithms used in it: bisecting K-means clustering, nearest-neighbor classification, rule-based classifiers, and support vector machines. Out of confidentiality (in truth there is nothing confidential, but I'd rather my boss not see this blog post, so, you understand), I won't describe the ideas behind the circuit failure analysis itself here.

Enough preamble; on to the subject.

Basic K-Means clustering algorithm

The basic idea of K-means is: first select K initial centroids (points from the data set), where K, the expected number of clusters, is specified by the user. Each point is then assigned to the nearest centroid, with the distance between two points measured by the Euclidean distance. The centroid of each cluster is then recomputed from the points assigned to that cluster. Assignment and update are repeated until the clusters no longer change, or some other termination condition is met.
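The distance measure here is the ordinary Euclidean distance. As a quick sanity check, the legs of a 3-4-5 right triangle give a distance of exactly 5; this small standalone helper mirrors the euclDistance function used in the implementation:

```python
import numpy as np

# Euclidean distance: the square root of the sum of squared coordinate differences
def euclDistance(vector1, vector2):
    return np.sqrt(np.sum(np.power(vector2 - vector1, 2)))

a = np.array([0.0, 0.0])
b = np.array([3.0, 4.0])
print(euclDistance(a, b))  # 5.0
```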

Its pseudo-code is as follows:

Create K points as the initial centroids (randomly selected)
While the cluster assignment of any point has changed:
    For each point in the data set:
        For each centroid:
            Calculate the distance between the centroid and the point
        Assign the point to the nearest cluster
    For each cluster, compute the mean of the points in the cluster and use the mean as the new centroid

The Python implementation code follows. The comments are fairly detailed; since I am still a Python beginner and find sparsely commented code hard to read, please bear with me, and corrections are welcome.
The libraries used are NumPy and matplotlib, which can be installed with the following commands.

pip install numpy
pip install matplotlib

kmeans.py file

from numpy import *
import matplotlib.pyplot as plt

# calculate the Euclidean distance between two row vectors
def euclDistance(vector1, vector2):
    return sqrt(sum(power(vector2 - vector1, 2)))

# init centroids: randomly pick k samples from the data set
def initCentroids(dataSet, k):
    numSamples, dim = dataSet.shape  # number of rows and columns of the matrix
    centroids = zeros((k, dim))
    for i in range(k):
        index = int(random.uniform(0, numSamples))  # random float, truncated to int
        centroids[i, :] = dataSet[index, :]
    return centroids

# k-means clustering
# dataSet is a matrix; k is the number of clusters to split the samples into
def kmeans(dataSet, k):
    numSamples = dataSet.shape[0]  # number of samples (rows of dataSet)
    # first column stores which cluster this sample belongs to,
    # second column stores the squared error between this sample and its centroid
    clusterAssment = mat(zeros((numSamples, 2)))
    clusterChanged = True

    ## step 1: init centroids
    centroids = initCentroids(dataSet, k)

    while clusterChanged:
        clusterChanged = False
        ## for each sample
        for i in range(numSamples):
            minDist = 100000.0
            minIndex = 0
            ## step 2: find the closest centroid
            for j in range(k):
                distance = euclDistance(centroids[j, :], dataSet[i, :])
                if distance < minDist:
                    minDist = distance
                    minIndex = j

            ## step 3: update the sample's cluster
            # if no sample changes its cluster, the while loop exits
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
                clusterAssment[i, :] = minIndex, minDist ** 2  # ** is the power operator

        ## step 4: update centroids
        for j in range(k):
            # clusterAssment[:, 0].A == j finds the row indices of samples labeled j
            pointsInCluster = dataSet[nonzero(clusterAssment[:, 0].A == j)[0]]
            if len(pointsInCluster) > 0:  # guard against an empty cluster
                centroids[j, :] = mean(pointsInCluster, axis=0)

    print('Congratulations, cluster complete!')
    return centroids, clusterAssment

# show your cluster (only available for 2-D data)
# centroids holds the centroid of each of the k clusters
# clusterAssment holds each sample's cluster index (column 1)
# and squared distance to its centroid (column 2)
def showCluster(dataSet, k, centroids, clusterAssment):
    numSamples, dim = dataSet.shape
    if dim != 2:
        print("Sorry! I can not draw because the dimension of your data is not 2!")
        return 1

    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    if k > len(mark):
        print("Sorry! Your k is too large!")
        return 1

    # draw all samples, colored by cluster
    for i in range(numSamples):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])

    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    # draw the centroids
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize=12)

    plt.show()
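The trickiest line above is the centroid update: nonzero(clusterAssment[:, 0].A == j)[0] returns the row indices of all samples currently assigned to cluster j. A small sketch of what it does, on a made-up 4-sample assignment matrix:

```python
import numpy as np

# made-up 4x2 assignment matrix: column 0 = cluster index, column 1 = squared error
clusterAssment = np.mat([[0, 1.5], [1, 0.2], [0, 0.7], [1, 2.1]])
dataSet = np.mat([[1.0, 1.0], [9.0, 9.0], [2.0, 0.0], [11.0, 9.0]])

j = 0
# .A converts the matrix to a plain ndarray; nonzero(...)[0] yields the row indices
rows = np.nonzero(clusterAssment[:, 0].A == j)[0]
print(rows)  # [0 2]

# the new centroid of cluster j is the mean of exactly those rows
centroid = np.mean(dataSet[rows], axis=0)
print(centroid)  # [[1.5 0.5]]
```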

Test file test.py

from numpy import *
import kmeans

## step 1: load data
print("step 1: load data...")
# dataSet will be a list of [x, y] samples, i.e. an n*2 matrix after conversion;
# each line of the input file is one sample with its attribute values
dataSet = []
fileIn = open("D:/xuepython/testSet.txt")  # forward slashes also work on Windows
for line in fileIn.readlines():
    lineArr = line.strip().split('\t')  # strip() removes the trailing '\n'
    dataSet.append([float(lineArr[0]), float(lineArr[1])])
fileIn.close()

## step 2: clustering
print("step 2: clustering...")
dataSet = mat(dataSet)  # mat() is a NumPy function that turns a nested list into a matrix
k = 4
centroids, clusterAssment = kmeans.kmeans(dataSet, k)  # call kmeans defined in kmeans.py

## step 3: show the result
print("step 3: show the result...")
kmeans.showCluster(dataSet, k, centroids, clusterAssment)
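The testSet.txt file comes with the book's source code. If you don't have it handy, a quick substitute is to generate four Gaussian blobs yourself and feed the resulting matrix to kmeans.kmeans; the centers and spread below are my own arbitrary choices, not from the book:

```python
import numpy as np

np.random.seed(42)  # reproducible sample data
centers = [(-3, -3), (-3, 3), (3, -3), (3, 3)]
# 50 points per blob, standard deviation 0.8, shifted to each center
blobs = [np.random.randn(50, 2) * 0.8 + c for c in centers]
dataSet = np.mat(np.vstack(blobs))
print(dataSet.shape)  # (200, 2)
```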

Run the result diagram as follows:

Above are the results of two different clustering runs. Because the basic K-means algorithm selects its initial centroids at random, the clustering results vary from run to run and are often far from ideal; the final clusters frequently fail to match the natural clusters. To avoid this problem, this series uses the bisecting K-means clustering algorithm.
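Short of switching to bisecting K-means, a common workaround for the random initialization is to run K-means several times and keep the run with the lowest total squared error (SSE). This restart sketch is my own illustration, not code from the book; it uses a compact vectorized K-means rather than the kmeans.py above:

```python
import numpy as np

def kmeans_sse(data, k, rng):
    # one K-means run; returns (centroids, total squared error)
    n = data.shape[0]
    centroids = data[rng.choice(n, k, replace=False)].copy()
    labels = np.zeros(n, dtype=int)
    for it in range(100):
        # distance from every point to every centroid, shape (n, k)
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if it > 0 and np.array_equal(new_labels, labels):
            break  # assignments stable: converged
        labels = new_labels
        for j in range(k):
            pts = data[labels == j]
            if len(pts):  # guard against an empty cluster
                centroids[j] = pts.mean(axis=0)
    sse = ((data - centroids[labels]) ** 2).sum()
    return centroids, sse

rng = np.random.default_rng(0)
data = np.vstack([rng.normal(c, 0.5, (40, 2)) for c in [(-3, -3), (3, 3), (-3, 3)]])
# 10 restarts; keep the run with the smallest SSE
best = min((kmeans_sse(data, 3, rng) for _ in range(10)), key=lambda t: t[1])
print(f"best SSE: {best[1]:.1f}")
```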

A Python implementation of bisecting K-means clustering will be given in the next blog post.

The complete code and test data can be obtained here. You may prefer to get the source from that link, because code copied from this page loses its indentation and you would have to re-add it yourself, which is tedious. If you run into the error IndentationError: unindent does not match any outer indentation level, it is caused by indentation; that blog post gives a solution.

Besides the book "Machine Learning in Action", I also referred to the blog below, which itself largely follows the book; my thanks to its author.

Copyright notice: This is an original article by the blogger; please do not reproduce it without the blogger's permission.

