Clustering is an unsupervised learning technique that places similar objects in the same cluster.
This article introduces a clustering algorithm called K-means. It is so named because it discovers k different clusters, and the center of each cluster is computed as the mean of the points assigned to it.
Clustering puts similar objects into the same cluster and assigns dissimilar objects to different clusters.
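For example, if a cluster currently contains the two points (1, 2) and (3, 4), its center is ((1 + 3) / 2, (2 + 4) / 2) = (2, 3).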
The following is a simple example of how this algorithm is implemented in Python:
The loadDataSet function reads a text file line by line, converts each tab-separated line into a list of floats, and appends it to a list; the result is the training data that needs to be loaded.
    from numpy import *  # sqrt, power, mat, zeros, random, etc. used by the functions below

    def loadDataSet(fileName):
        dataMat = []
        fr = open(fileName)
        for line in fr.readlines():
            curLine = line.strip().split('\t')
            fltLine = list(map(float, curLine))  # map returns an iterator in Python 3, so convert to a list
            dataMat.append(fltLine)
        return dataMat
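As a quick illustration (the file name data.txt below is hypothetical, not something from this article), the loaded data is usually wrapped in a NumPy matrix before clustering:

    # Hypothetical usage: 'data.txt' stands for any tab-delimited file of points.
    dataMat = mat(loadDataSet('data.txt'))
    print(shape(dataMat))  # (number of points, number of coordinates per point)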
The function distEclud is used to calculate the Euclidean distance between two vectors, and randCent constructs the initial random centroids:
    def distEclud(vecA, vecB):
        return sqrt(sum(power(vecA - vecB, 2)))  # Euclidean distance between two vectors

    def randCent(dataSet, k):
        n = shape(dataSet)[1]           # number of columns, i.e. coordinates per point
        centroids = mat(zeros((k, n)))  # k is the number of centroids, n the number of coordinates of each
        for j in range(n):
            minJ = min(dataSet[:, j])                  # minimum value of column j
            rangeJ = float(max(dataSet[:, j]) - minJ)  # range of column j
            centroids[:, j] = minJ + rangeJ * random.rand(k, 1)  # random coordinates within the data range
        return centroids
The function randCent takes two parameters: the dataset and k, the number of centroids the user specifies (that is, the number of clusters into which the data will finally be divided). Its job is to construct a set of k random centroids for the given dataset, each lying within the range of the data.
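A minimal sketch of what randCent produces, using a tiny hand-written 2-D dataset (the four points below are made up purely for illustration):

    # Four made-up 2-D points spanning the unit square.
    smallSet = mat([[0.0, 0.0],
                    [1.0, 1.0],
                    [0.0, 1.0],
                    [1.0, 0.0]])
    print(randCent(smallSet, 2))  # two random centroids, each coordinate within [0, 1]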
The three functions above are helpers; the following is the complete K-means algorithm:
    def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
        m = shape(dataSet)[0]                # number of training points
        clusterAssment = mat(zeros((m, 2)))  # for each point: index of its centroid and squared distance to it
        centroids = createCent(dataSet, k)   # initialize the centroids
        clusterChanged = True
        while clusterChanged:
            clusterChanged = False
            for i in range(m):               # traverse all data points and find the closest centroid for each
                minDist = inf
                minIndex = -1
                for j in range(k):           # traverse all centroids
                    distJI = distMeas(centroids[j, :], dataSet[i, :])
                    if distJI < minDist:
                        minDist = distJI
                        minIndex = j
                if clusterAssment[i, 0] != minIndex:
                    clusterChanged = True    # any point changing its centroid requires another full pass
                clusterAssment[i, :] = minIndex, minDist ** 2
            print(centroids)
            for cent in range(k):            # update the location of each centroid
                ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]  # points assigned to this cluster
                centroids[cent, :] = mean(ptsInClust, axis=0)
        return centroids, clusterAssment
The idea of the algorithm is:
- Traverse all m training data points.
- For each data point, iterate over all k centroids.
- Compute the distance between the data point and each centroid and record the nearest centroid.
- Compare it with the centroid previously assigned to the point (that is, the cluster the point belongs to, stored in clusterAssment); if the assignment changed, the algorithm has not converged and another full pass over the data is needed.
- For each cluster, compute the mean of the points in the cluster and use it as the new centroid.
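Putting the pieces together, a minimal usage sketch (the file name testSet.txt and the choice of k = 4 are assumptions made for illustration, not part of the article):

    # Hypothetical run: load a tab-delimited file of 2-D points and cluster it into 4 groups.
    dataMat = mat(loadDataSet('testSet.txt'))
    centroids, clusterAssment = kMeans(dataMat, 4)
    print(centroids)           # final positions of the 4 centroids
    print(clusterAssment[:5])  # cluster index and squared distance for the first 5 points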
The K-means algorithm sometimes produces poor clusterings because it converges to a local minimum rather than the global minimum. A common measure of clustering quality is the SSE (sum of squared errors): the smaller the SSE, the closer the data points are to their centroids and the better the clustering. To improve the result, the cluster with the largest SSE can be split into two clusters, and, to keep the total number of clusters unchanged, two other clusters can be merged.
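Since clusterAssment already stores each point's squared distance to its centroid in its second column, the SSE of a clustering can be read off directly; a small sketch, assuming the kMeans output above:

    # SSE is the sum of the squared distances stored in the second column of clusterAssment.
    sse = sum(clusterAssment[:, 1])
    print(sse)  # smaller values mean the points lie closer to their centroids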