The K-means algorithm and its Python implementation

Source: Internet
Author: User

Chapter ten uses the K-means clustering algorithm to group unlabeled data.

One: Introduction

A clustering algorithm can be regarded as an unsupervised classification method: like classification, it assigns objects to groups, but unlike classification, the categories are not defined in advance. Cluster identification is a concept commonly used with clustering algorithms; it refers to giving meaning to the clusters the algorithm produces.

Clustering can be applied to almost any kind of object, and the more similar the objects within a cluster, the better the clustering result.

Two: Basic concepts of the K-means clustering algorithm

The purpose of the K-means clustering algorithm is to partition the data into K clusters. Its general procedure is as follows:

Randomly select K points as the initial centroids
While the cluster assignment of any point changes:
    For each data point:
        For each centroid:
            Calculate the distance from the data point to the centroid
        Assign the data point to the cluster of the nearest centroid
    For each cluster, recompute its centroid

In the above process, each centroid is usually recomputed as the mean of the points in its cluster, while the distance measure can be chosen freely (for example, Euclidean distance). Note that different distance measures may produce different clusterings.
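As a concrete example, a minimal Euclidean distance function (named distEclud here to match the naming convention of the code later in this article) might look like:

```python
import numpy as np

def distEclud(vecA, vecB):
    # Euclidean distance between two row vectors; other metrics such as
    # Manhattan distance could be substituted and may change the clustering.
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))
```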

Characteristics of the K-means clustering algorithm:

1. Advantages: simple algorithm, easy to compute and implement.

2. Disadvantages: prone to falling into local minima; convergence is slow on large data sets.

3. Applicable data types: numeric data (nominal data can first be converted to numeric form).
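For the nominal-to-numeric conversion mentioned in point 3, one common option is one-hot encoding. A small illustrative sketch (the function name oneHot is ours, not from the original article):

```python
def oneHot(values):
    """Map each distinct nominal value to a 0/1 indicator vector,
    so the feature can be used with a numeric distance measure."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1.0 if index[v] == i else 0.0 for i in range(len(categories))]
            for v in values]
```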

Three: The process of the K-means algorithm

1. Read the data from a file and store it in an array.

2. Define a distance function.

3. Initialize the centroids. A common method is to generate each feature of a centroid at random within that feature's observed range.

Four: Implementation of the K-means algorithm
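The K-means implementation itself is not reproduced in this excerpt. The following is a sketch consistent with the steps and pseudocode above; the names distEclud, randCent, and kMeans follow the conventions assumed by the bisecting code later in the article:

```python
import numpy as np

def distEclud(vecA, vecB):
    # step 2: Euclidean distance between two row vectors
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))

def randCent(dataSet, k):
    # step 3: pick each feature uniformly at random within its observed range
    n = np.shape(dataSet)[1]
    centroids = np.mat(np.zeros((k, n)))
    for j in range(n):
        minJ = np.min(dataSet[:, j])
        rangeJ = float(np.max(dataSet[:, j]) - minJ)
        centroids[:, j] = minJ + rangeJ * np.random.rand(k, 1)
    return centroids

def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    # step 4: the K-means loop described in section two
    m = np.shape(dataSet)[0]
    clusterAssment = np.mat(np.zeros((m, 2)))   # cluster index, squared error
    centroids = createCent(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):                      # assign each point to the nearest centroid
            minDist, minIndex = np.inf, -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist, minIndex = distJI, j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
        for cent in range(k):                   # recompute each centroid as the cluster mean
            ptsInClust = dataSet[np.nonzero(clusterAssment[:, 0].A == cent)[0]]
            if len(ptsInClust) > 0:
                centroids[cent, :] = np.mean(ptsInClust, axis=0)
    return centroids, clusterAssment
```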

Five: Using post-processing to improve clustering performance

Because clustering algorithms can easily converge to a local minimum rather than the global minimum, we need to take some measures to improve clustering quality.

Of course, the simplest fix is to increase the number of clusters, but that violates the original goal of the optimization, so we use post-processing instead. Post-processing means finding the cluster with the largest sum of squared errors (SSE) and splitting it into two clusters; then, to keep the total number of clusters unchanged, we merge two centroids. There are two common criteria for choosing which centroids to merge: merge the two closest centroids, or merge the two centroids whose merging increases the total SSE the least.
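As an illustrative sketch of the first merge criterion, a helper that finds the two nearest centroids could look like this (the name closestCentroids is ours, not from the original article):

```python
import numpy as np

def closestCentroids(centroids):
    """Return the indices (i, j) of the two centroids with the smallest
    pairwise Euclidean distance; these are the candidates for merging."""
    k = len(centroids)
    best, bestPair = np.inf, (0, 1)
    for i in range(k):
        for j in range(i + 1, k):
            d = np.linalg.norm(np.asarray(centroids[i]) - np.asarray(centroids[j]))
            if d < best:
                best, bestPair = d, (i, j)
    return bestPair
```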

Six: The bisecting K-means algorithm

The bisecting K-means algorithm was proposed to address the local-minimum problem of the basic clustering algorithm. Its basic idea is to first treat all the points as a single cluster, split that cluster in two with 2-means, and then repeatedly choose one cluster to split again, where the cluster is chosen so that splitting it minimizes the total sum of squared errors. A common procedure is as follows:


Treat all the points as one cluster
While the number of clusters is less than K:
    For each cluster:
        Compute the total error of the cluster (error 1)
        Compute the total error after splitting the cluster in two with 2-means (error 2)
    Split the cluster with the largest difference between error 1 and error 2

(Alternatively, select the cluster whose split yields the smallest total error.)

An even simpler approach is to directly pick the cluster with the largest total error as the next cluster to split.

Based on the above procedure, we can write the following code:

```python
from numpy import *

def biKmeans(dataSet, k, distMeas=distEclud):
    # bisecting K-means; relies on kMeans() and distEclud() defined earlier
    m = shape(dataSet)[0]
    clusterAssment = mat(zeros((m, 2)))
    centroid0 = mean(dataSet, axis=0).tolist()[0]
    centList = [centroid0]                       # create a list with one centroid
    for j in range(m):                           # calc initial error
        clusterAssment[j, 1] = distMeas(mat(centroid0), dataSet[j, :]) ** 2
    while len(centList) < k:
        lowestSSE = inf
        for i in range(len(centList)):
            # get the data points currently in cluster i
            ptsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
            sseSplit = sum(splitClustAss[:, 1])  # compare the SSE to the current minimum
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])
            print("sseSplit, and notSplit:", sseSplit, sseNotSplit)
            if (sseSplit + sseNotSplit) < lowestSSE:
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit
        # relabel the split result: 1 becomes the new cluster index, 0 keeps the old one
        bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)
        bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
        print('the bestCentToSplit is:', bestCentToSplit)
        print('the len of bestClustAss is:', len(bestClustAss))
        # replace the split centroid with the two new centroids
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]
        centList.append(bestNewCents[1, :].tolist()[0])
        # reassign cluster indices and SSE for the points that were split
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss
    return mat(centList), clusterAssment
```

Seven: Clustering points on a map

When clustering points on a map, the first steps are obtaining the data and parsing it, both of which are omitted here. Suppose we already have the data stored in the file places.txt, where the fourth and fifth columns contain the values we need. We now use the biKmeans algorithm to find five clusters in this data and display the results.
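As a sketch of the data-loading step: we assume the file is tab-delimited, and the sample lines below are stand-ins for the real contents of places.txt.

```python
import numpy as np

# Hypothetical stand-in for the contents of places.txt (tab-delimited);
# in practice this string would come from open('places.txt').read().
sample = "A\tB\tC\t41.0\t-73.9\nD\tE\tF\t40.7\t-74.0\nG\tH\tI\t40.8\t-73.8\n"

datList = []
for line in sample.strip().split("\n"):
    fields = line.split("\t")
    datList.append([float(fields[3]), float(fields[4])])  # columns 4 and 5
datMat = np.mat(datList)
# myCentroids, clustAssing = biKmeans(datMat, 5)  # then plot the five clusters
```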

The program plots the resulting clusters (the result figure is not preserved in this copy).

Eight: Summary

Clustering is an unsupervised technique. Common clustering algorithms include K-means and bisecting K-means; the latter is an improved version of the former and generally performs better, because K-means is sensitive to the choice of initial centroids and easily falls into local minima. These are not the only clustering algorithms; there are many others, such as hierarchical clustering.

In general, the goal of a clustering algorithm is to discover clusters in unlabeled data. These clusters are not defined in advance, but they reveal some of the structure of the data.
