A detailed implementation of the k-means clustering algorithm in Python



Algorithm advantages and disadvantages:

Advantages: simple to implement
Disadvantages: may converge to a local minimum; slow on large-scale data sets
Applicable data type: numeric data

Algorithm Idea

At its core, the k-means algorithm measures the similarity between samples by computing the distance between them; similar samples end up in the same cluster.

1. First, we need to choose a value of k, that is, the number of clusters we want to divide the data into. The choice of k has a large impact on the results. Ng's course describes two selection methods. One is the elbow method: roughly, plot the clustering cost as a function of k and pick the k at the bend of the curve. The other is to choose k based on specific requirements; for example, you might divide shirt sizes into three classes (L, M, S).
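As a rough illustration of the elbow method, the sketch below computes the within-cluster sum of squared errors (SSE) for several values of k on two synthetic, well-separated blobs. The `sse_for_k` helper is a minimal k-means written just for this sketch, not the full implementation from this article:

```python
import numpy as np

def sse_for_k(data, k, n_iter=20, seed=0):
    """Run a minimal k-means and return the within-cluster sum of squared errors."""
    rng = np.random.default_rng(seed)
    # initialize centroids as k distinct random data points
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of the points assigned to it
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    return ((data - centroids[labels]) ** 2).sum()

# Two well-separated blobs: the SSE drops sharply from k=1 to k=2,
# then flattens out -- the "elbow" suggests k=2.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(0, 0.3, (50, 2)), rng.normal(5, 0.3, (50, 2))])
for k in (1, 2, 3, 4):
    print(k, round(sse_for_k(data, k), 2))
```

Plotting these SSE values against k and looking for the bend is exactly the elbow method described above.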

2. Then we need to choose the initial cluster centers (centroids). This is usually done at random: the code here picks random points within the range of the data; another option is to pick random points from the data itself. The choice of these points can greatly affect the final result; with bad luck, the algorithm converges to a local minimum. There are two common remedies: one is to run the algorithm multiple times and keep the best result, and the other is an improved algorithm, bisecting k-means.

3. Finally, we get to the main loop. We compute the distance between every point in the dataset and each centroid, and assign each point to the class of its nearest centroid. Then we compute the mean of each cluster and use that point as the cluster's new centroid. These two steps are repeated until convergence.
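The two alternating steps can be sketched compactly with NumPy broadcasting. This is a standalone illustration on made-up data, independent of the full implementation below:

```python
import numpy as np

data = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.1], [4.9, 5.0]])
centroids = np.array([[0.0, 0.0], [5.0, 5.0]])

# Step 1: assign every point to its nearest centroid (Euclidean distance).
dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
labels = dists.argmin(axis=1)  # -> [0, 0, 1, 1]

# Step 2: move each centroid to the mean of the points assigned to it.
centroids = np.array([data[labels == j].mean(axis=0) for j in range(len(centroids))])
print(labels, centroids)
```

Repeating these two steps until no assignment changes is the whole algorithm.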

Functions

loadDataSet(fileName)

Read a dataset from a file

distEclud(vecA, vecB)

Calculate the distance. Here we use the Euclidean distance. Of course, other reasonable distances are acceptable.
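For instance, the Euclidean distance between two vectors can be checked directly; here is a small standalone version of the same formula using plain NumPy:

```python
import numpy as np

def distEclud(vecA, vecB):
    # square root of the sum of squared component differences
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))

# classic 3-4-5 right triangle -> distance 5.0
print(distEclud(np.array([0.0, 0.0]), np.array([3.0, 4.0])))
```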

randCent(dataSet, k)

Randomly generate the initial centroid, which is a point within the selected data range.

kMeans(dataSet, k, distMeas=distEclud, createCent=randCent)

The k-means algorithm itself; takes the data and the value of k. The last two parameters, the distance function and the initial-centroid function, are optional and default to the two functions above.
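Because the distance function is a parameter, other metrics can be plugged in. As a sketch, here is a Manhattan (L1) distance; the `distManhattan` name is my own, and the commented call assumes the `kMeans` function from this article is in scope:

```python
import numpy as np

def distManhattan(vecA, vecB):
    # sum of absolute component differences (L1 norm)
    return np.sum(np.abs(vecA - vecB))

# Hypothetical usage with the kMeans function defined in this article:
# myCentroids, clusterAssment = kMeans(dataMat, 4, distMeas=distManhattan)

print(distManhattan(np.array([1.0, 2.0]), np.array([4.0, 6.0])))  # -> 7.0
```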

show(dataSet, k, centroids, clusterAssment)

Visualizes the clustering result.

# coding=utf-8
from numpy import *

def loadDataSet(fileName):
    dataMat = []
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        fltLine = list(map(float, curLine))
        dataMat.append(fltLine)
    return dataMat

# Calculate the distance between two vectors, here the Euclidean distance
def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))

# Randomly generate the initial centroids within the range of the data
# (Ng's class says another option is to randomly pick k data points)
def randCent(dataSet, k):
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))
    for j in range(n):
        minJ = min(dataSet[:, j])
        rangeJ = float(max(dataSet[:, j]) - minJ)
        centroids[:, j] = minJ + rangeJ * random.rand(k, 1)
    return centroids

def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = shape(dataSet)[0]
    # one row per point: assigned cluster index and squared error
    clusterAssment = mat(zeros((m, 2)))
    centroids = createCent(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):  # assign each point to the closest centroid
            minDist = inf
            minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
        print(centroids)
        for cent in range(k):  # recalculate the centroids
            # get all the points assigned to this cluster
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
            # move the centroid to the mean of its points
            centroids[cent, :] = mean(ptsInClust, axis=0)
    return centroids, clusterAssment

def show(dataSet, k, centroids, clusterAssment):
    from matplotlib import pyplot as plt
    numSamples, dim = dataSet.shape
    mark = ['or', 'ob', 'og', 'ok', '^r', '+r', 'sr', 'dr', '<r', 'pr']
    for i in range(numSamples):
        markIndex = int(clusterAssment[i, 0])
        plt.plot(dataSet[i, 0], dataSet[i, 1], mark[markIndex])
    mark = ['Dr', 'Db', 'Dg', 'Dk', '^b', '+b', 'sb', 'db', '<b', 'pb']
    for i in range(k):
        plt.plot(centroids[i, 0], centroids[i, 1], mark[i], markersize=12)
    plt.show()

def main():
    dataMat = mat(loadDataSet('testSet.txt'))
    myCentroids, clustAssing = kMeans(dataMat, 4)
    print(myCentroids)
    show(dataMat, 4, myCentroids, clustAssing)

if __name__ == '__main__':
    main()

Here is the clustering result, which is quite good.

But sometimes it converges to a local minimum, as in the run shown below, which unfortunately got stuck in a local optimum.
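A common remedy, as noted above, is to restart the algorithm several times from different random initializations and keep the run with the lowest total squared error. Below is a minimal, self-contained sketch; the helper names are my own, not from the original code:

```python
import numpy as np

def kmeans_once(data, k, rng, n_iter=20):
    """One k-means run from a random init; returns (centroids, total squared error)."""
    # initialize centroids as k distinct random data points
    centroids = data[rng.choice(len(data), k, replace=False)]
    for _ in range(n_iter):
        # assign each point to its nearest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each centroid to the mean of its points
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = data[labels == j].mean(axis=0)
    sse = ((data - centroids[labels]) ** 2).sum()
    return centroids, sse

def kmeans_restarts(data, k, n_restarts=20, seed=0):
    """Keep the restart with the lowest SSE to reduce the risk of a bad local minimum."""
    rng = np.random.default_rng(seed)
    return min((kmeans_once(data, k, rng) for _ in range(n_restarts)),
               key=lambda result: result[1])

# Three well-separated blobs; with enough restarts, at least one run
# should find all three clusters and give a small SSE.
rng = np.random.default_rng(2)
data = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0.0, 4.0, 8.0)])
centroids, sse = kmeans_restarts(data, 3)
print(sse)
```

The other remedy mentioned above, bisecting k-means, attacks the same problem by repeatedly splitting the cluster whose split most reduces the SSE.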

Summary

That covers the details of implementing the k-means clustering algorithm in Python; I hope it helps. If you have any questions, feel free to leave a message. Thank you for your support!
