K-means clustering and its Python implementation


The main reference for the K-means clustering algorithm and its Python implementation is the book "Machine Learning in Action"; the link mentioned earlier also draws on this book. Once you understand the principle, you can put it to use.

1. Overview

The K-means algorithm is a distance-based clustering algorithm, both simple and classic.

It uses distance as the measure of similarity: the closer two objects are, the more similar they are considered to be.

The algorithm regards a cluster as a group of objects that lie close together, so compact and well-separated clusters are the ultimate goal.

Put plainly, this is unsupervised clustering: the samples all carry the same label, or no labels at all. This heap of data is one class, that heap is another; you choose how many classes there are, and the algorithm separates the classes for you automatically, keeping the samples within each class as compact as possible.
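As a quick illustration of this idea, here is a minimal sketch using scikit-learn's KMeans (an assumption on my part; scikit-learn is not used elsewhere in this post, whose own implementation below relies only on NumPy):

import numpy as np
from sklearn.cluster import KMeans

# two unlabeled "heaps" of 2-D points
rng = np.random.default_rng(0)
points = np.vstack([rng.normal(loc=(-3, -3), scale=0.5, size=(40, 2)),
                    rng.normal(loc=(3, 3), scale=0.5, size=(40, 2))])

km = KMeans(n_clusters=2, n_init=10, random_state=0)  # you choose the number of classes
labels = km.fit_predict(points)  # the algorithm separates the classes automatically
print(labels[:5], km.cluster_centers_)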

2. Core Ideas

The idea is to iteratively search for a partition into K clusters such that, when each cluster is represented by its mean, the total error over the corresponding samples is minimized.

The K clusters have the following characteristics: each cluster itself is as compact as possible, and the clusters are as separated from each other as possible.

The K-means algorithm is based on the minimum sum-of-squared-errors criterion.

The cost function is:

$$J = \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^{2}$$

where $\mu_{c^{(i)}}$ is the mean (centroid) of the cluster to which sample $x^{(i)}$ is assigned.

The more similar the samples within a cluster, the smaller their squared error to the cluster mean. Summing the squared errors over all clusters gives a criterion for judging how good a partition into K classes is.

This cost function cannot be minimized analytically; it can only be minimized iteratively.
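To make the criterion concrete, here is a minimal sketch that evaluates the cost J for a given assignment (the function and variable names are my own, not from the book's code):

import numpy as np

def kmeans_cost(data, labels, centroids):
    # J = sum_i || x_i - mu_{c(i)} ||^2 :
    # subtract from each sample the centroid of its own cluster
    diffs = data - centroids[labels]
    return float(np.sum(diffs ** 2))

A smaller J means tighter clusters; since K-means can converge to a local minimum, comparing J across several random initializations is one way to keep the best run.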

3. Algorithm step diagram

(Figure: the effect of K-means clustering on n sample points, with K = 2.)

4. Algorithm implementation steps

The K-means algorithm clusters the samples into k clusters, where k is given by the user. The solution process is quite straightforward; the specific algorithm is described as follows:

1) Randomly select k cluster centroids, $\mu_1, \mu_2, \ldots, \mu_k$.

2) Repeat the following process until convergence {

For each sample $i$, calculate the cluster it should belong to:

$$c^{(i)} := \arg\min_{j} \left\| x^{(i)} - \mu_{j} \right\|^{2}$$

For each cluster $j$, recalculate the centroid of the cluster:

$$\mu_{j} := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}$$

}

Its pseudo-code is as follows:

******************************************************************************

Create k points as the initial centroids (randomly selected)

While the cluster assignment of any point has changed:

    For each point in the data set:

        For each centroid:

            Calculate the distance between the centroid and the data point

        Assign the data point to the nearest cluster

    For each cluster, calculate the mean of the points in the cluster
    and use that mean as the new centroid

******************************************************************************
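The loop above can also be written compactly with NumPy broadcasting. The sketch below shows one iteration of the two steps (assignment, then centroid update); it is only an illustration, not the book's code, which follows in the next section:

import numpy as np

def kmeans_step(data, centroids):
    # assignment step: squared distance from every point to every centroid,
    # shape (m, k); then c(i) = argmin_j ||x_i - mu_j||^2
    d2 = ((data[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    # update step: each centroid becomes the mean of its assigned points
    # (an empty cluster would yield NaN here; the book's version shares this caveat)
    new_centroids = np.array([data[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    return labels, new_centroids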

5. K-means clustering algorithm: Python in action

This is the code from the book.

Task: cluster a given set of data.

This example uses a two-dimensional data set with 80 samples in 4 classes.

$ wc -l TestSet.txt; head TestSet.txt
80 TestSet.txt
1.658985	4.285136
-3.453687	3.424321
4.838138	-1.151539
-5.379713	-3.362104
0.972564	2.924086
-3.567919	1.531611
0.450614	-3.302219
-3.487105	-1.724432
2.668759	1.594842
-3.156485	3.191137
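As an aside, NumPy can read such a tab-separated file in one call; this is just an alternative to the loadDataSet function in the listing below:

import numpy as np

data = np.loadtxt('TestSet.txt')  # whitespace-delimited by default
print(data.shape)                 # expected: (80, 2)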

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# time: 18-8-8 pm 2:17
# Author: dahu
# File: kmeans2.py
# Software: pycharm
# from: https://www.cnblogs.com/ahu-lichang/p/7161613.html
from numpy import *
import matplotlib.pyplot as plt


# load the data
def loadDataSet(fileName):
    # parse the file, splitting fields on TAB, into a matrix of floats
    dataMat = []
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        fltLine = list(map(float, curLine))  # turn each element into a float
        dataMat.append(fltLine)
    return dataMat


# calculate the Euclidean distance between two vectors
def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))


# construct the cluster centers: take k (in this case k=4) random centroids
def randCent(dataSet, k):
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))  # each centroid has n coordinates; k centroids in total
    for j in range(n):
        minJ = min(dataSet[:, j])
        maxJ = max(dataSet[:, j])
        rangeJ = float(maxJ - minJ)
        centroids[:, j] = minJ + rangeJ * random.rand(k, 1)
    return centroids


# K-means clustering algorithm
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    """
    :param dataSet: a data set without labels (two-dimensional in this case)
    :param k: the number of clusters to divide into
    :param distMeas: function that calculates distances
    :param createCent: function that generates k random centroids
    :return: centroids: the k centroids finally determined
             clusterAssment: the cluster each sample belongs to and its
                             squared distance to that cluster's centroid
    """
    m = shape(dataSet)[0]  # m = 80, the number of samples
    # first column: index of the centroid the point belongs to;
    # second column: squared distance from the point to that centroid
    clusterAssment = mat(zeros((m, 2)))
    centroids = createCent(dataSet, k)
    clusterChanged = True  # used to determine whether clustering has converged
    while clusterChanged:
        clusterChanged = False
        for i in range(m):  # assign each data point to its nearest centroid
            minDist = inf
            minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    # point i is closer to centroid j, so i is assigned to j
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True  # an assignment changed, so keep iterating
            clusterAssment[i, :] = minIndex, minDist ** 2
        # print(centroids)
        for cent in range(k):  # recompute the centroids
            # take all rows whose first column equals cent
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = mean(ptsInClust, axis=0)  # mean of the cluster's points
    return centroids, clusterAssment


# -------------------- Test ---------------------------------------------------
# use the test data to exercise the kMeans algorithm
if __name__ == '__main__':
    datMat = mat(loadDataSet('TestSet.txt'))
    # print(min(datMat[:, 0]))
    # print(max(datMat[:, 1]))
    # print(randCent(datMat, 4))
    myCentroids, clustAssing = kMeans(datMat, 4)
    print(myCentroids)
    # print(clustAssing, len(clustAssing))
    plt.figure(1)
    x = array(datMat[:, 0]).ravel()
    y = array(datMat[:, 1]).ravel()
    plt.scatter(x, y, marker='o')
    xcent = array(myCentroids[:, 0]).ravel()
    ycent = array(myCentroids[:, 1]).ravel()
    plt.scatter(xcent, ycent, marker='x', color='r', s=50)
    plt.show()

Operation result:

(Figure: the 80 sample points plotted as circles, with the 4 computed centroids marked by red crosses.)

The code is not particularly difficult and reads clearly. A few NumPy operations turned up, some of them based on NumPy boolean-array indexing, which are worth brushing up on; see: No. 04 NumPy Basics: Arrays and vector calculations.

Briefly, what each function does:

loadDataSet: loads the data
distEclud: calculates the distance; the comment says Euclidean distance, and in practice it is used to compute the distance from each sample to each cluster centroid, which determines which cluster a sample is assigned to
kMeans: the main function, which implements the K-means algorithm
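As a usage sketch (assuming the listing above was saved as kmeans2.py in the current directory):

from numpy import mat, nonzero
from kmeans2 import loadDataSet, kMeans

datMat = mat(loadDataSet('TestSet.txt'))
centroids, clusterAssment = kMeans(datMat, 4)
print(centroids)  # 4 rows of (x, y) centroid coordinates
# pick out the points assigned to cluster 0, as kMeans itself does internally
cluster0 = datMat[nonzero(clusterAssment[:, 0].A == 0)[0]]
print(cluster0.shape)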

The comments are already fairly detailed, so the code is not elaborated on further here. Later in the book there is also an optimization of K-means (bisecting K-means), which is not covered here.
