Reposted from Mu Chen: Principle Analysis and Code Implementation of the K-means Clustering Algorithm
Contents
- Preface
- A real-world clustering problem: the presidential election
- The K-means clustering algorithm
- K-means performance optimization
- The bisecting K-means algorithm
- Summary
Preface
In the previous articles, the machine learning algorithms covered were all supervised learning algorithms.
So-called supervised learning is learning whose training process is guided by labels; more precisely, the training data comes with a set of classification labels.
From now on we enter the field of unsupervised learning and discuss the classic problem of clustering: grouping data without knowing the classification scheme in advance (although the number of categories may be given).
This article introduces one of the most classic clustering algorithms, the K-means clustering algorithm, along with two implementations of it.
Back to the top reality clustering problem-presidential election
Suppose country M is holding another presidential election. Mr. OBM currently holds 48% of the vote (as a percentage of all voters) and Mr. MKN holds 47%, while the rest have not voted for "various reasons".
For either camp, it is natural to try to win as many of these remaining votes as possible, because they could completely change the outcome of the election.
However, you cannot win over all of these people, because courting one group of people may hurt the interests of another.
A good idea is to divide these people into K groups and then focus the campaign work on the most numerous groups.
This calls for a clustering strategy.
The strategy is to collect information about the remaining voters (all kinds of satisfaction/dissatisfaction information), feed that information into a clustering algorithm, and then direct the persuasion effort at the voters in the largest clusters of the result.
You may find that the voters in one cluster share a community, a religion, or something else in common, which makes all kinds of canvassing activities easier.
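As a rough sketch of what the input to such an algorithm could look like, each voter can be encoded as a row of numeric features, and the resulting matrix is what gets clustered. The survey fields and values below are invented for illustration, not real campaign data:

# Hypothetical voter survey data encoded as numeric feature vectors.
# Fields and values are invented for illustration only.
from numpy import mat

# Each row: [age, income bracket 1-5, economy satisfaction 0-10, healthcare satisfaction 0-10]
voters = mat([[34, 2, 3.0, 8.0],
              [61, 4, 7.5, 2.0],
              [29, 1, 2.5, 9.0],
              [58, 5, 8.0, 3.5]])

# A matrix like this is what the kMeans() function defined below would
# consume, e.g. kMeans(voters, 2) to look for 2 voter groups.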
The K-means clustering algorithm
The K refers to the fact that the algorithm finds K clusters; the means refers to the fact that each cluster's center is computed as the mean of the points in that cluster.
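In other words, a cluster's center is nothing more than the coordinate-wise mean of its points. A minimal illustration:

# A cluster's centroid is the coordinate-wise mean of its points.
from numpy import mat, mean

cluster = mat([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
print(mean(cluster, axis=0))   # [[3. 4.]] -- the centroid of these three points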
The pseudo-code is given below:
Create K points as the initial centroids (chosen at random)
While the cluster assignment of any point changes:
    For each data point in the dataset:
        For each centroid:
            Compute the distance between the centroid and the data point
        Assign the data point to the nearest cluster
    For each cluster:
        Compute the mean of its points and update the centroid to that mean
A concrete Python implementation follows:
#!/usr/bin/env python
# -*- coding: utf-8 -*-

'''
Created on 20**-**-**

@author: Fangmeng
'''

from numpy import *

#==================================
# Input:
#     fileName: data file name (with path)
# Output:
#     dataMat: data set
#==================================
def loadDataSet(fileName):
    'Load the data file'

    dataMat = []
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        fltLine = [float(x) for x in curLine]
        dataMat.append(fltLine)
    return dataMat

#==================================================
# Input:
#     vecA: sample A
#     vecB: sample B
# Output:
#     sqrt(sum(power(vecA - vecB, 2))): distance between the samples
#==================================================
def distEclud(vecA, vecB):
    'Compute the Euclidean distance between two samples'

    return sqrt(sum(power(vecA - vecB, 2)))

#===========================================
# Input:
#     dataSet: data set
#     k: number of clusters
# Output:
#     centroids: set of cluster centroids (one centroid per row)
#===========================================
def randCent(dataSet, k):
    'Randomly initialize the centroids'

    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))
    # Create random cluster centers within the bounds of each dimension
    for j in range(n):
        minJ = min(dataSet[:, j])
        rangeJ = float(max(dataSet[:, j]) - minJ)
        centroids[:, j] = mat(minJ + rangeJ * random.rand(k, 1))
    return centroids

#===========================================
# Input:
#     dataSet: data set
#     k: number of clusters
#     distMeas: distance function
#     createCent: centroid initializer
# Output:
#     centroids: set of cluster centroids (one centroid per row)
#     clusterAssment: clustering result
#===========================================
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    'Basic K-means implementation'

    m = shape(dataSet)[0]
    # Cluster assignment matrix: one column for the assigned cluster, one for the squared error.
    clusterAssment = mat(zeros((m, 2)))
    # Create the initial centroid set
    centroids = createCent(dataSet, k)
    # Cluster-change flag
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False

        # Assign each sample point to its nearest cluster.
        for i in range(m):
            minDist = inf; minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI; minIndex = j
            if clusterAssment[i, 0] != minIndex: clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist**2

        # Update the centroids
        for cent in range(k):
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = mean(ptsInClust, axis=0)

    return centroids, clusterAssment

def main():
    'K-means clustering demo'

    datMat = mat(loadDataSet('/home/Fangmeng/testSet.txt'))
    myCentroids, clustAssing = kMeans(datMat, 4)

    #print(myCentroids)
    print(clustAssing)

if __name__ == "__main__":
    main()
Test results:
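If the testSet.txt file from the original post is not at hand, the code can be exercised on synthetic data instead. The following sketch generates four Gaussian blobs and clusters them; the blob centers and spread are my own arbitrary choices:

# Sketch for testing kMeans() without testSet.txt: generate four Gaussian
# blobs around arbitrary centers and cluster them.
from numpy import mat, vstack, random

def makeBlobs():
    'Generate 4 Gaussian blobs of 25 points each around fixed centers.'
    centers = [(-3, -3), (-3, 3), (3, -3), (3, 3)]
    return mat(vstack([random.randn(25, 2) * 0.5 + c for c in centers]))

myCentroids, clustAssing = kMeans(makeBlobs(), 4)
print(myCentroids)   # should land near the four blob centers
# Note: with unlucky initial centroids a cluster can come up empty
# (a known quirk of this implementation); simply rerun if that happens.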
K-means performance optimization
There are two main optimization strategies:
1. Split the cluster with the largest SSE (sum of squared errors).
PS: The split is performed by running K-means with k = 2 on the points inside that cluster.
2. Merge the two nearest clusters, or merge the two clusters whose merger causes the smallest increase in total SSE.
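For concreteness, here is what the SSE means in code. The kMeans() function above already stores each point's squared error in the second column of clusterAssment, so the total SSE of a clustering is just a column sum (the helper name below is my own, not from the original post):

# Helper (my addition): total SSE of a clustering produced by kMeans().
# Column 0 of clusterAssment holds the assigned cluster index, column 1
# the point's squared distance to its centroid, so SSE is the column sum.
def totalSSE(clusterAssment):
    return clusterAssment[:, 1].sum()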
Based on these two basic optimization strategies, a more principled clustering algorithm can be built: the bisecting K-means algorithm, described in detail below.
The bisecting K-means algorithm
The idea of the algorithm is as follows: first treat all the points as one cluster and split it in two; then repeatedly select one of the existing clusters and split it further.
The cluster to split is naturally chosen as the one whose splitting reduces the total SSE the most.
This "fission" continues until the user-specified number of clusters is reached.
Pseudo-code:
Treat all the points as one cluster
While the number of clusters is less than K:
    For each cluster:
        Compute the total SSE
        Run K-means with k = 2 on the cluster
        Compute the total SSE after the cluster is split in two
    Select the cluster whose split gives the smallest error and split it
The concrete implementation:
#======================================
# Input:
#     dataSet: data set
#     k: number of clusters
#     distMeas: distance function
# Output:
#     mat(centList): set of cluster centroids (one centroid per element)
#     clusterAssment: clustering result
#======================================
def biKmeans(dataSet, k, distMeas=distEclud):
    'Bisecting K-means clustering algorithm'

    m = shape(dataSet)[0]
    # Cluster result data structure
    clusterAssment = mat(zeros((m, 2)))
    # The original centroid
    centroid0 = mean(dataSet, axis=0).tolist()[0]
    centList = [centroid0]

    # Compute the original SSE
    for j in range(m):
        clusterAssment[j, 1] = distMeas(mat(centroid0), dataSet[j, :])**2

    # Loop until k clusters are obtained
    while (len(centList) < k):
        # Smallest SSE seen so far
        lowestSSE = inf
        # Find the most suitable cluster to split
        for i in range(len(centList)):
            ptsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
            sseSplit = sum(splitClustAss[:, 1])
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])
            if (sseSplit + sseNotSplit) < lowestSSE:
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit

        # Relabel the two halves of the split
        bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)
        bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit

        # Update the centroid list
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]
        centList.append(bestNewCents[1, :].tolist()[0])
        # Update the cluster result set
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss

    return mat(centList), clusterAssment
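A call mirrors the kMeans() demo above. Assuming a data matrix datMat has already been loaded as in main(), usage might look like this (the choice of k = 3 is arbitrary):

# Usage sketch (assumes datMat was loaded as in main() above).
centList, myNewAssments = biKmeans(datMat, 3)
print(centList)            # one centroid per row, 3 rows
print(myNewAssments[:5])   # per point: assigned cluster index and squared error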
Test results:
Summary
1. K-means has wide applications. For example: if you plan to travel to 100 cities around China, how do you plan the route?
---> Clustering can group these cities into several clusters, which you then visit cluster by cluster. Each centroid acts like an airport, and the squared error corresponds to the distance from a city to that centroid :)
2. The K-means algorithm is a very commonly used clustering algorithm; however, its shortcomings should also be mentioned: the results are strongly affected by the choice of the initial centroids and of the value of K. Many research papers have grown out of this issue, and interested readers can explore it further.
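One common mitigation, not from the original post but standard practice, is simply to run K-means several times from different random initializations and keep the result with the lowest SSE:

# Common workaround (my addition): restart kMeans() several times with
# different random initial centroids and keep the lowest-SSE clustering.
from numpy import inf

def bestOfNRuns(dataSet, k, n=10):
    'Run kMeans() n times and return the (centroids, assignments) with the lowest SSE.'
    bestSSE, best = inf, None
    for _ in range(n):
        centroids, clusterAssment = kMeans(dataSet, k)
        sse = clusterAssment[:, 1].sum()
        if sse < bestSSE:
            bestSSE, best = sse, (centroids, clusterAssment)
    return best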