Principle analysis and code implementation of K-means clustering algorithm

Last Update:2017-10-08 Source: Internet

Author: User


Reposted from Mu Chen

Contents

Preface
A real-world clustering problem: the presidential election
K-means clustering algorithm
K-means performance optimization
Bisecting K-means algorithm
Summary

Preface

The machine learning algorithms covered in the previous articles were all supervised learning algorithms.

Supervised learning is learning with a training process. More precisely, it is learning from examples that come with a set of classification labels.

From this article on, we enter the field of unsupervised learning and discuss the classical clustering problem. Clustering is classification without a known classification scheme (although the number of categories may be known in advance).

This article introduces one of the most classical clustering algorithms, the K-means clustering algorithm, and two implementations of it.

A real-world clustering problem: the presidential election

Suppose country M is electing a president again. Mr. OBM currently has the support of 48% of all voters and Mr. MKN has 47%, while the remaining voters have not committed to either candidate for "various reasons".

For either camp, it is natural to want as many of these remaining votes as possible, because they could decide the outcome of the election.

However, you cannot win over all of these people, because satisfying one group may hurt the interests of another.

A good idea is to divide these people into K groups and then work primarily on the largest groups.

This calls for a clustering strategy.

The strategy is to collect information about the remaining voters (what they are satisfied or dissatisfied with), feed that information into a clustering algorithm, and then concentrate the persuasion effort on the voters in the largest resulting clusters.

You may find that the voters in a cluster share a community, a religious belief, or some other common trait, which makes canvassing much easier.
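As a small illustration of the last step of this strategy: once every remaining voter has been given a cluster label, the largest cluster can be found by counting labels. The sketch below assumes the labels already exist as an integer array; the `labels` values and the choice of three clusters are made up for illustration.

```python
import numpy as np

# Hypothetical cluster labels for 10 undecided voters (k = 3 clusters).
labels = np.array([0, 2, 1, 2, 2, 0, 2, 1, 2, 0])

# Count how many voters fall into each cluster.
sizes = np.bincount(labels, minlength=3)

# The cluster with the most members is the first canvassing target.
largest = int(np.argmax(sizes))

print(sizes)    # [3 2 5]
print(largest)  # 2
```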

K-means clustering algorithm

The K in the name means that the algorithm finds K clusters. "Means" refers to the fact that each cluster center (centroid) is computed as the mean of the points in that cluster.
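To make the "means" part concrete, here is a minimal sketch that computes the centroid of a single cluster with NumPy. The three points are invented for illustration and are simply assumed to belong to the same cluster.

```python
import numpy as np

# Three 2-D points that are assumed to belong to the same cluster.
cluster_points = np.array([[1.0, 2.0],
                           [3.0, 4.0],
                           [5.0, 9.0]])

# The centroid is the per-coordinate mean of the cluster's points.
centroid = cluster_points.mean(axis=0)

print(centroid)  # [3. 5.]
```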

The pseudo-code is given below:

Create K points as the starting centroids (chosen at random)
While the cluster assignment of any point changes:
    For each data point in the dataset:
        For each centroid:
            Calculate the distance between the centroid and the data point
        Assign the data point to the cluster with the nearest centroid
    For each cluster:
        Calculate the mean of its points and update the centroid to that mean

A concrete implementation in Python follows:

#!/usr/bin/env python
# -*- coding: utf-8 -*-

'''
Created on 20**-**-**

@author: Fangmeng
'''

import numpy as np

#==================================
# Input:
#     fileName: data file name (with path)
# Output:
#     dataMat: data set
#==================================
def loadDataSet(fileName):
    'Load a tab-separated data file'
    dataMat = []
    with open(fileName) as fr:
        for line in fr:
            curLine = line.strip().split('\t')
            dataMat.append(list(map(float, curLine)))
    return dataMat

#==================================================
# Input:
#     vecA: sample A
#     vecB: sample B
# Output:
#     sqrt(sum(power(vecA - vecB, 2))): distance between the samples
#==================================================
def distEclud(vecA, vecB):
    'Calculate the Euclidean distance between two samples'
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))

#===========================================
# Input:
#     dataSet: data set (2-D array, one sample per row)
#     k: number of clusters
# Output:
#     centroids: cluster centroids (one row per cluster)
#===========================================
def randCent(dataSet, k):
    'Randomly initialize the centroids'
    n = dataSet.shape[1]
    centroids = np.zeros((k, n))
    # Draw each coordinate uniformly within the bounds of that dimension
    for j in range(n):
        minJ = dataSet[:, j].min()
        rangeJ = float(dataSet[:, j].max() - minJ)
        centroids[:, j] = minJ + rangeJ * np.random.rand(k)
    return centroids

#===========================================
# Input:
#     dataSet: data set
#     k: number of clusters
#     distMeas: distance function
#     createCent: centroid generator
# Output:
#     centroids: cluster centroids (one row per cluster)
#     clusterAssment: clustering result
#===========================================
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    'Basic K-means implementation'
    m = dataSet.shape[0]
    # Cluster assignment result: column 0 is the cluster index,
    # column 1 is the squared error (squared distance to the centroid)
    clusterAssment = np.zeros((m, 2))
    # Create the initial centroids
    centroids = createCent(dataSet, k)
    # Flag: did any assignment change in the last pass?
    clusterChanged = True

    while clusterChanged:
        clusterChanged = False

        # Each sample point joins its nearest cluster
        for i in range(m):
            minDist = np.inf
            minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist**2

        # Update the centroids
        for cent in range(k):
            ptsInClust = dataSet[clusterAssment[:, 0] == cent]
            # Skip empty clusters so the centroid does not become NaN
            if len(ptsInClust) > 0:
                centroids[cent, :] = np.mean(ptsInClust, axis=0)

    return centroids, clusterAssment

def main():
    'Demonstrate the K-means clustering operation'
    datMat = np.array(loadDataSet('/home/Fangmeng/testset.txt'))
    myCentroids, clustAssing = kMeans(datMat, 4)

    # print(myCentroids)
    print(clustAssing)

if __name__ == "__main__":
    main()


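K-means performance optimization

The quality of a K-means result can be improved by post-processing the clusters. There are two main ways of doing this: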

1. Split the cluster with the largest SSE (sum of squared errors).

PS: This is done by running K-means with k=2 directly on the points of that cluster.

2. Merge the two closest clusters, or the two clusters whose merger increases the SSE the least.
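Both strategies rely on the SSE of each cluster. The following is a minimal sketch of how the per-cluster SSE could be computed and the cluster with the largest SSE selected as the candidate for splitting. The helper name `cluster_sse` and the toy data are assumptions made for illustration, not part of the original program.

```python
import numpy as np

def cluster_sse(points, labels, centroids):
    """Return the sum of squared errors of each cluster as a 1-D array."""
    sse = np.zeros(len(centroids))
    for c, centroid in enumerate(centroids):
        members = points[labels == c]
        # Squared Euclidean distance of every member to its centroid
        sse[c] = np.sum((members - centroid) ** 2)
    return sse

# Toy data: a tight cluster 0 and a spread-out cluster 1
points = np.array([[0.0, 0.0], [0.0, 2.0],
                   [10.0, 0.0], [10.0, 8.0]])
labels = np.array([0, 0, 1, 1])
centroids = np.array([[0.0, 1.0], [10.0, 4.0]])

sse = cluster_sse(points, labels, centroids)
worst = int(np.argmax(sse))

print(sse)    # per-cluster SSE: 2.0 for cluster 0, 32.0 for cluster 1
print(worst)  # 1, the cluster that would be split first
```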

Building on these two basic optimization strategies, there is a more principled clustering algorithm, the bisecting K-means algorithm, which is described in detail below.

Bisecting K-means algorithm

The idea of the algorithm is as follows: first, treat all points as a single cluster and split that cluster in two. Then select one of the existing clusters and split it again.

Naturally, the cluster chosen is the one whose split gives the lowest total SSE.

This "fission" continues until the user-specified number of clusters is reached.

Pseudo code:

Treat all points as one cluster
While the number of clusters is less than K:
    For each cluster:
        Calculate the total SSE
        Run K-means with k=2 on that cluster
        Calculate the total SSE after the cluster has been split in two
    Select the split that gives the lowest total SSE and carry it out

The implementation (it reuses kMeans and distEclud from the program above):

import numpy as np

#======================================
# Input:
#     dataSet: data set
#     k: number of clusters
#     distMeas: distance function
# Output:
#     np.array(centList): cluster centroids (one row per cluster)
#     clusterAssment: clustering result
#======================================
def biKmeans(dataSet, k, distMeas=distEclud):
    'Bisecting K-means clustering algorithm'
    dataSet = np.asarray(dataSet)
    m = dataSet.shape[0]
    # Clustering result: column 0 is the cluster index, column 1 the squared error
    clusterAssment = np.zeros((m, 2))
    # The initial centroid is the mean of the whole data set
    centroid0 = np.mean(dataSet, axis=0)
    centList = [centroid0]

    # SSE of the initial single cluster
    for j in range(m):
        clusterAssment[j, 1] = distMeas(centroid0, dataSet[j, :])**2

    # Keep splitting until there are k clusters
    while len(centList) < k:
        # Lowest total SSE found so far
        lowestSSE = np.inf
        # Find the cluster whose split gives the lowest total SSE
        for i in range(len(centList)):
            ptsInCurrCluster = dataSet[clusterAssment[:, 0] == i, :]
            # A cluster with fewer than two points cannot be split
            if len(ptsInCurrCluster) < 2:
                continue
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
            centroidMat = np.asarray(centroidMat)
            splitClustAss = np.asarray(splitClustAss)
            sseSplit = np.sum(splitClustAss[:, 1])
            sseNotSplit = np.sum(clusterAssment[clusterAssment[:, 0] != i, 1])
            if sseSplit + sseNotSplit < lowestSSE:
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit

        # Relabel the two halves: half 1 gets a new index, half 0 keeps the old one
        bestClustAss[bestClustAss[:, 0] == 1, 0] = len(centList)
        bestClustAss[bestClustAss[:, 0] == 0, 0] = bestCentToSplit

        # Update the centroid list
        centList[bestCentToSplit] = bestNewCents[0, :]
        centList.append(bestNewCents[1, :])
        # Update the clustering result for the points of the split cluster
        clusterAssment[clusterAssment[:, 0] == bestCentToSplit, :] = bestClustAss

    return np.array(centList), clusterAssment


Summary

1. K-means is widely used. For example: if you plan to travel to 100 cities in China, how do you plan your route?

---> Clustering can group these cities into several clusters, which you then visit one cluster at a time. The centroid plays the role of the airport, and the squared error corresponds to the squared distance from a city to the centroid :)

2. K-means is a very common clustering algorithm, but its shortcomings should also be mentioned: the choice of the initial centroids and of the value of K has a large influence on the result. This problem has given rise to many research papers, which interested readers can study further.
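The sensitivity to the initial centroids and to K can be observed with a small experiment. The sketch below is not the program from this article: `simple_kmeans` and `assign` are hypothetical helpers written for this illustration, the initial centroids are drawn from the data points (a different initialization from randCent), and the data are synthetic blobs invented for the example.

```python
import numpy as np

def assign(points, centroids):
    """Return the index of the nearest centroid for every point."""
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    return dists.argmin(axis=1)

def simple_kmeans(points, k, rng, max_iter=100):
    """Compact K-means; initial centroids are k distinct random data points."""
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(max_iter):
        labels = assign(points, centroids)
        new_centroids = centroids.copy()
        for c in range(k):
            members = points[labels == c]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[c] = members.mean(axis=0)
        converged = np.allclose(new_centroids, centroids)
        centroids = new_centroids
        if converged:
            break
    labels = assign(points, centroids)
    sse = float(np.sum((points - centroids[labels]) ** 2))
    return centroids, labels, sse

# Synthetic data: three blobs of 50 points each (made up for illustration)
data_rng = np.random.default_rng(0)
points = np.vstack([data_rng.normal(loc, 0.5, size=(50, 2))
                    for loc in ([0, 0], [5, 5], [0, 5])])

# Different random starts may end in different local optima
for seed in range(5):
    _, _, sse = simple_kmeans(points, 3, np.random.default_rng(seed))
    print(f"seed={seed}  SSE={sse:.2f}")

# The optimal SSE never increases as K grows, so SSE alone cannot choose K
for k in (1, 2, 3, 4):
    best = min(simple_kmeans(points, k, np.random.default_rng(s))[2]
               for s in range(5))
    print(f"k={k}  best SSE over 5 starts={best:.2f}")
```

Running K-means several times and keeping the result with the lowest SSE is a common way to reduce the dependence on the starting centroids; choosing K usually needs an additional criterion or domain knowledge.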


This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
