Clustering is an unsupervised learning technique that places similar objects in the same cluster.
This article introduces a clustering algorithm called K-means. It is so named because it discovers k different clusters, and the center of each cluster is computed as the mean of the points assigned to it.
Clustering puts similar objects into the same cluster and assigns dissimilar objects to different clusters.
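For example, if a cluster currently contains the two points (1, 2) and (3, 4), its center is ((1 + 3) / 2, (2 + 4) / 2) = (2, 3).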
The following is a simple example of how this algorithm is implemented in Python:
The loadDataSet function reads a text file line by line, converts each tab-separated line into a list of floats, and appends it to a list; the result is the training data that needs to be loaded.
    from numpy import *  # sqrt, power, mat, zeros, random, etc. used by the functions below

    def loadDataSet(fileName):
        dataMat = []
        fr = open(fileName)
        for line in fr.readlines():
            curLine = line.strip().split('\t')
            fltLine = list(map(float, curLine))  # map returns an iterator in Python 3, so convert to a list
            dataMat.append(fltLine)
        return dataMat
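As a quick illustration (the file name data.txt below is hypothetical, not something from this article), the loaded data is usually wrapped in a NumPy matrix before clustering:

    # Hypothetical usage: 'data.txt' stands for any tab-delimited file of points.
    dataMat = mat(loadDataSet('data.txt'))
    print(shape(dataMat))  # (number of points, number of coordinates per point)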
The function distEclud is used to calculate the Euclidean distance between two vectors, and randCent constructs the initial random centroids:
    def distEclud(vecA, vecB):
        return sqrt(sum(power(vecA - vecB, 2)))  # Euclidean distance between two vectors

    def randCent(dataSet, k):
        n = shape(dataSet)[1]           # number of columns, i.e. coordinates per point
        centroids = mat(zeros((k, n)))  # k is the number of centroids, n the number of coordinates of each
        for j in range(n):
            minJ = min(dataSet[:, j])                  # minimum value of column j
            rangeJ = float(max(dataSet[:, j]) - minJ)  # range of column j
            centroids[:, j] = minJ + rangeJ * random.rand(k, 1)  # random coordinates within the data range
        return centroids
The function randCent takes two parameters: the dataset and k, the number of centroids the user specifies (that is, the number of clusters into which the data will finally be divided). Its job is to construct a set of k random centroids for the given dataset, each lying within the range of the data.
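A minimal sketch of what randCent produces, using a tiny hand-written 2-D dataset (the four points below are made up purely for illustration):

    # Four made-up 2-D points spanning the unit square.
    smallSet = mat([[0.0, 0.0],
                    [1.0, 1.0],
                    [0.0, 1.0],
                    [1.0, 0.0]])
    print(randCent(smallSet, 2))  # two random centroids, each coordinate within [0, 1]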
The three functions above are helpers; the following is the complete K-means algorithm:
    def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
        m = shape(dataSet)[0]                # number of training points
        clusterAssment = mat(zeros((m, 2)))  # for each point: index of its centroid and squared distance to it
        centroids = createCent(dataSet, k)   # initialize the centroids
        clusterChanged = True
        while clusterChanged:
            clusterChanged = False
            for i in range(m):               # traverse all data points and find the closest centroid for each
                minDist = inf
                minIndex = -1
                for j in range(k):           # traverse all centroids
                    distJI = distMeas(centroids[j, :], dataSet[i, :])
                    if distJI < minDist:
                        minDist = distJI
                        minIndex = j
                if clusterAssment[i, 0] != minIndex:
                    clusterChanged = True    # any point changing its centroid requires another full pass
                clusterAssment[i, :] = minIndex, minDist ** 2
            print(centroids)
            for cent in range(k):            # update the location of each centroid
                ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]  # points assigned to this cluster
                centroids[cent, :] = mean(ptsInClust, axis=0)
        return centroids, clusterAssment
The idea of the algorithm is:
- Traverse all m training data points.
- For each data point, iterate over all k centroids.
- Compute the distance between the data point and each centroid and record the nearest centroid.
- Compare it with the centroid previously assigned to the point (that is, the cluster the point belongs to, stored in clusterAssment); if the assignment changed, the algorithm has not converged and another full pass over the data is needed.
- For each cluster, compute the mean of the points in the cluster and use it as the new centroid.
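Putting the pieces together, a minimal usage sketch (the file name testSet.txt and the choice of k = 4 are assumptions made for illustration, not part of the article):

    # Hypothetical run: load a tab-delimited file of 2-D points and cluster it into 4 groups.
    dataMat = mat(loadDataSet('testSet.txt'))
    centroids, clusterAssment = kMeans(dataMat, 4)
    print(centroids)           # final positions of the 4 centroids
    print(clusterAssment[:5])  # cluster index and squared distance for the first 5 points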
The K-means algorithm sometimes produces poor clusterings because it converges to a local minimum rather than the global minimum. A common measure of clustering quality is the SSE (sum of squared errors): the smaller the SSE, the closer the data points are to their centroids and the better the clustering. To improve the result, the cluster with the largest SSE can be split into two clusters, and, to keep the total number of clusters unchanged, two other clusters can be merged.
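Since clusterAssment already stores each point's squared distance to its centroid in its second column, the SSE of a clustering can be read off directly; a small sketch, assuming the kMeans output above:

    # SSE is the sum of the squared distances stored in the second column of clusterAssment.
    sse = sum(clusterAssment[:, 1])
    print(sse)  # smaller values mean the points lie closer to their centroids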