K-Means Clustering Algorithm


Clustering is a form of unsupervised learning: it places similar objects in the same cluster and assigns dissimilar objects to different clusters.

This article introduces a clustering algorithm called K-means. The name comes from the fact that it discovers k different clusters, and the center of each cluster is computed as the mean of the points assigned to it.

The following is a simple example of how this algorithm is implemented in Python:

The loadDataSet function reads a text file line by line, converts each tab-separated line into a list of floats, and appends it to a list; the result is the training data to be loaded.

def loadDataSet(fileName):
    dataMat = []
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        fltLine = list(map(float, curLine))   # convert each field to a float
        dataMat.append(fltLine)
    return dataMat

The function distEclud computes the Euclidean distance between two vectors; randCent, defined alongside it, builds the initial random centroids:

from numpy import *

def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))   # Euclidean distance between two vectors

def randCent(dataSet, k):
    n = shape(dataSet)[1]             # number of columns, i.e. coordinates per point
    centroids = mat(zeros((k, n)))    # k centroids with n coordinates each
    for j in range(n):
        minJ = min(dataSet[:, j])                    # minimum value of column j
        rangeJ = float(max(dataSet[:, j]) - minJ)    # range of column j
        centroids[:, j] = minJ + rangeJ * random.rand(k, 1)   # random coordinates inside the range
    return centroids

The function randCent takes two parameters: the dataset and k, the number of centroids the user specifies (that is, the number of clusters the data is finally divided into). Its job is to construct a set of k random centroids (centroids) within the bounds of the given dataset.
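As a quick check, here is a minimal sketch that exercises both helpers; the data values are made up for illustration and are not from the original article:

dataMat = mat([[1.0, 2.0], [1.5, 1.8], [5.0, 8.0], [8.0, 8.0]])   # four illustrative 2-D points
print(distEclud(dataMat[0], dataMat[1]))   # distance between the first two points
print(randCent(dataMat, 2))                # two random centroids inside the data's bounds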

The three functions above are helpers; the following is the complete K-means algorithm:

def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = shape(dataSet)[0]                  # number of training samples
    clusterAssment = mat(zeros((m, 2)))    # cluster index and squared distance for each point
    centroids = createCent(dataSet, k)     # initialize the centroids and save them
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):                 # traverse all data points
            minDist = inf                  # distance to the closest centroid so far
            minIndex = -1
            for j in range(k):             # traverse all centroids
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True      # any point changing centroid forces another pass
            clusterAssment[i, :] = minIndex, minDist ** 2
        print(centroids)
        for cent in range(k):              # update the location of each centroid
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = mean(ptsInClust, axis=0)
    return centroids, clusterAssment

The idea is:

  • Traverse all training data points (m points)
    • For each of the k centroids
      • Compute the distance between the data point and the centroid, and record the nearest centroid
    • Compare the nearest centroid with the one previously saved for this point (that is, the cluster the point belongs to, stored in clusterAssment); if it changed, the algorithm has not converged and another pass over the data is needed
  • For each cluster, compute the mean of the points in the cluster and use it as the new centroid
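Putting the pieces together, a minimal usage sketch; 'testSet.txt' here is a hypothetical tab-separated file of 2-D points (one per line), and k = 4 is an illustrative choice, neither of which comes from the original article:

dataMat = mat(loadDataSet('testSet.txt'))
myCentroids, clusterAssment = kMeans(dataMat, 4)   # cluster the points into 4 groups
print(myCentroids)                                 # final centroid coordinates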

The K-means algorithm sometimes yields poor clusterings because it converges to a local minimum rather than the global minimum. A common measure of clustering quality is the SSE (sum of squared errors): the smaller the SSE, the closer the data points are to their centroids, and the better the clustering. To improve a result, the cluster with the largest SSE can be split into two clusters; to keep the total number of clusters constant, two other clusters can then be merged.
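Since the second column of the clusterAssment matrix returned by kMeans already stores each point's squared distance to its assigned centroid, the total and per-cluster SSE can be read off directly. A small sketch under that assumption, continuing from the usage example above:

sse = sum(clusterAssment[:, 1])    # total SSE over all points
for cent in range(4):              # per-cluster SSE (k = 4 as in the sketch above)
    inClust = clusterAssment[nonzero(clusterAssment[:, 0].A == cent)[0], 1]
    print(cent, sum(inClust))      # the cluster with the largest SSE is the split candidate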
