A detailed implementation of the k-means clustering algorithm in Python



Algorithm advantages and disadvantages:

Advantages: simple to implement
Disadvantages: may converge to a local minimum; slow on large-scale data sets
Applicable data type: numeric data

Algorithm Idea

At its core, the k-means algorithm measures the similarity between samples by computing the distance between them; similar samples end up in the same cluster.

1. First, we need to choose a value of k, that is, the number of clusters we want to divide the data into. The choice of k has a large impact on the results. Ng's course describes two selection methods. One is the elbow method: roughly, plot the clustering cost as a function of k and pick the k at the bend of the curve. The other is to choose k based on specific requirements; for example, you might divide shirt sizes into three classes (L, M, S).
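As a rough illustration of the elbow method, the sketch below computes the within-cluster sum of squared errors (SSE) for several values of k on two synthetic, well-separated blobs. The `sse_for_k` helper is a minimal k-means written just for this sketch, not the full implementation from this article:

```python
import numpy as np

def sse_for_k(data, k, n_iter=20, seed=0):
    """Run a minimal k-means and return the within-cluster sum of squared errors."""
    rng = np.random.default_rng(seed)
    # initialize centroids as k distinct random data points
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    return ((data - centroids[labels]) ** 2).sum()

# Two well-separated blobs: the SSE drops sharply from k=1 to k=2,
# then flattens out -- the "elbow" suggests k=2.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
for k in (1, 2, 3, 4):
    print(k, round(sse_for_k(data, k), 2))
```

Plotting these SSE values against k and looking for the bend is exactly the elbow method described above.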

2. Then we need to choose the initial cluster centers (centroids). This is usually done at random: the code here picks random points within the range of the data; another option is to pick random points from the data itself. The choice of these points can greatly affect the final result; with bad luck, the algorithm converges to a local minimum. There are two common remedies: one is to run the algorithm multiple times and keep the best result, and the other is an improved algorithm, bisecting k-means.

3. Finally, we get to the main loop. We compute the distance between every point in the dataset and each centroid, and assign each point to the class of its nearest centroid. Then we compute the mean of each cluster and use that point as the cluster's new centroid. These two steps are repeated until convergence.
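The two alternating steps can be sketched compactly with NumPy broadcasting. This is a standalone illustration on made-up data, independent of the full implementation below:

```python
import numpy as np

data = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [4.9, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

# Step 1: assign every point to its nearest centroid (Euclidean distance).
dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)  # -> [0, 0, 1, 1]

# Step 2: move each centroid to the mean of the points assigned to it.
centroids = np.array([data[labels == j].mean(axis=0) for j in range(len(centroids))])
print(labels, centroids)
```

Repeating these two steps until no assignment changes is the whole algorithm.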

Functions

loadDataSet(fileName)

Read a dataset from a file

distEclud(vecA, vecB)

Calculate the distance. Here we use the Euclidean distance. Of course, other reasonable distances are acceptable.
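For instance, the Euclidean distance between two vectors can be checked directly; here is a small standalone version of the same formula using plain NumPy:

```python
import numpy as np

def distEclud(vecA, vecB):
    # square root of the sum of squared component differences
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))

# classic 3-4-5 right triangle -> distance 5.0
print(distEclud(np.array([0.0, 0.0]), np.array([3.0, 4.0])))
```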

randCent(dataSet, k)

Randomly generate the initial centroid, which is a point within the selected data range.

kMeans(dataSet, k, distMeas=distEclud, createCent=randCent)

The k-means algorithm itself; takes the data and the value of k. The last two parameters, the distance function and the initial-centroid function, are optional and default to the two functions above.
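Because the distance function is a parameter, other metrics can be plugged in. As a sketch, here is a Manhattan (L1) distance; the `distManhattan` name is my own, and the commented call assumes the `kMeans` function from this article is in scope:

```python
import numpy as np

def distManhattan(vecA, vecB):
    # sum of absolute component differences (L1 norm)
    return np.sum(np.abs(vecA - vecB))

# Hypothetical usage with the kMeans function defined in this article:
# myCentroids, clusterAssment = kMeans(dataMat, 4, distMeas=distManhattan)

print(distManhattan(np.array([1.0, 2.0]), np.array([4.0, 6.0])))  # -> 7.0
```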

show(dataSet, k, centroids, clusterAssment)

Visualizes the clustering result.

# coding=utf-8
from numpy import *

def loadDataSet(fileName):
    dataMat = []
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        fltLine = list(map(float, curLine))
        dataMat.append(fltLine)
    return dataMat

# Calculate the distance between two vectors, here the Euclidean distance
def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))

# Randomly generate the initial centroids within the range of the data
# (Ng's class says another option is to randomly pick k data points)
def randCent(dataSet, k):
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))
    for j in range(n):
        minJ = min(dataSet[:, j])
        rangeJ = float(max(dataSet[:, j]) - minJ)
        centroids[:, j] = minJ + rangeJ * random.rand(k, 1)
    return centroids

def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = shape(dataSet)[0]
    # one row per point: assigned cluster index and squared error
    clusterAssment = mat(zeros((m, 2)))
    centroids = createCent(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):  # assign each point to the closest centroid
            minDist = inf
            minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
        print(centroids)
        for cent in range(k):  # recalculate the centroids
            # get all the points assigned to this cluster
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
            # move the centroid to the mean of its points
            centroids[cent, :] = mean(ptsInClust, axis=0)
    return centroids, clusterAssment

def show(dataSet, k, centroids, clusterAssment):
    from matplotlib import pyplot as plt
    numSamples, dim = dataSet.shape
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    for i in range(numSamples):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])
    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize=12)
    plt.show()

def main():
    dataMat = mat(loadDataSet('testSet.txt'))
    myCentroids, clustAssing = kMeans(dataMat, 4)
    print(myCentroids)
    show(dataMat, 4, myCentroids, clustAssing)

if __name__ == '__main__':
    main()

Here is the clustering result, which is quite good.

But sometimes it converges to a local minimum, as in the run shown below, which unfortunately got stuck in a local optimum.
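A common remedy, as noted above, is to restart the algorithm several times from different random initializations and keep the run with the lowest total squared error. Below is a minimal, self-contained sketch; the helper names are my own, not from the original code:

```python
import numpy as np

def kmeans_once(data, k, rng, n_iter=20):
    """One k-means run from a random init; returns (centroids, total squared error)."""
    # initialize centroids as k distinct random data points
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    sse = ((data - centroids[labels]) ** 2).sum()
    return centroids, sse

def kmeans_restarts(data, k, n_restarts=20, seed=0):
    """Keep the restart with the lowest SSE to reduce the risk of a bad local minimum."""
    rng = np.random.default_rng(seed)
    return min((kmeans_once(data, k, rng) for _ in range(n_restarts)),
               key=lambda result: result[1])

# Three well-separated blobs; with enough restarts, at least one run
# should find all three clusters and give a small SSE.
rng = np.random.default_rng(2)
data = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0.0, 4.0, 8.0)])
centroids, sse = kmeans_restarts(data, 3)
print(sse)
```

The other remedy mentioned above, bisecting k-means, attacks the same problem by repeatedly splitting the cluster whose split most reduces the SSE.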

Summary

That covers the details of implementing the k-means clustering algorithm in Python; I hope it helps. If you have any questions, feel free to leave a message. Thank you for your support!
