The K-means algorithm and its Python implementation

Source: Internet
Author: User

Chapter ten uses the K-means clustering algorithm to group unlabeled data.

One: Introduction

A clustering algorithm can be regarded as an unsupervised classification method: like classification, it assigns objects to groups, but unlike classification, the categories are not defined in advance. Cluster identification is a concept commonly used with clustering algorithms; it refers to giving meaning to the clusters the algorithm produces.

Clustering can be applied to almost any kind of object, and the more similar the objects within a cluster, the better the clustering result.

Two: Basic concepts of the K-means clustering algorithm

The purpose of the K-means clustering algorithm is to partition the data into K clusters. Its general procedure is as follows:

Randomly select K points as the initial centroids
While the cluster assignment of any point changes:
    For each data point:
        For each centroid:
            Calculate the distance from the data point to the centroid
        Assign the data point to the cluster of the nearest centroid
    For each cluster, recompute its centroid

In the above process, each centroid is usually recomputed as the mean of the points in its cluster, while the distance measure can be chosen freely (for example, Euclidean distance). Note that different distance measures may produce different clusterings.
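As a concrete example, a minimal Euclidean distance function (named distEclud here to match the naming convention of the code later in this article) might look like:

```python
import numpy as np

def distEclud(vecA, vecB):
    # Euclidean distance between two row vectors; other metrics such as
    # Manhattan distance could be substituted and may change the clustering.
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))
```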

Characteristics of the K-means clustering algorithm:

1. Advantages: simple algorithm, easy to compute and implement.

2. Disadvantages: prone to falling into local minima; convergence is slow on large data sets.

3. Applicable data types: numeric data (nominal data can first be converted to numeric form).
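For the nominal-to-numeric conversion mentioned in point 3, one common option is one-hot encoding. A small illustrative sketch (the function name oneHot is ours, not from the original article):

```python
def oneHot(values):
    """Map each distinct nominal value to a 0/1 indicator vector,
    so the feature can be used with a numeric distance measure."""
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    return [[1.0 if index[v] == i else 0.0 for i in range(len(categories))]
            for v in values]
```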

Three: The process of the K-means algorithm

1. Read the data from a file and store it in an array.

2. Define a distance function.

3. Initialize the centroids. A common method is to generate each feature of a centroid at random within that feature's observed range.

Four: Implementation of the K-means algorithm
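The K-means implementation itself is not reproduced in this excerpt. The following is a sketch consistent with the steps and pseudocode above; the names distEclud, randCent, and kMeans follow the conventions assumed by the bisecting code later in the article:

```python
import numpy as np

def distEclud(vecA, vecB):
    # step 2: Euclidean distance between two row vectors
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))

def randCent(dataSet, k):
    # step 3: pick each feature uniformly at random within its observed range
    n = np.shape(dataSet)[1]
    centroids = np.mat(np.zeros((k, n)))
    for j in range(n):
        minJ = np.min(dataSet[:, j])
        rangeJ = float(np.max(dataSet[:, j]) - minJ)
        centroids[:, j] = minJ + rangeJ * np.random.rand(k, 1)
    return centroids

def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    # step 4: the K-means loop described in section two
    m = np.shape(dataSet)[0]
    clusterAssment = np.mat(np.zeros((m, 2)))   # cluster index, squared error
    centroids = createCent(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):                      # assign each point to the nearest centroid
            minDist, minIndex = np.inf, -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist, minIndex = distJI, j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
        for cent in range(k):                   # recompute each centroid as the cluster mean
            ptsInClust = dataSet[np.nonzero(clusterAssment[:, 0].A == cent)[0]]
            if len(ptsInClust) > 0:
                centroids[cent, :] = np.mean(ptsInClust, axis=0)
    return centroids, clusterAssment
```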

Five: Using post-processing to improve clustering performance

Because clustering algorithms can easily converge to a local minimum rather than the global minimum, we need to take some measures to improve clustering quality.

Of course, the simplest fix is to increase the number of clusters, but that violates the original goal of the optimization, so we use post-processing instead. Post-processing means finding the cluster with the largest sum of squared errors (SSE) and splitting it into two clusters; then, to keep the total number of clusters unchanged, we merge two centroids. There are two common criteria for choosing which centroids to merge: merge the two closest centroids, or merge the two centroids whose merging increases the total SSE the least.
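As an illustrative sketch of the first merge criterion, a helper that finds the two nearest centroids could look like this (the name closestCentroids is ours, not from the original article):

```python
import numpy as np

def closestCentroids(centroids):
    """Return the indices (i, j) of the two centroids with the smallest
    pairwise Euclidean distance; these are the candidates for merging."""
    k = len(centroids)
    best, bestPair = np.inf, (0, 1)
    for i in range(k):
        for j in range(i + 1, k):
            d = np.linalg.norm(np.asarray(centroids[i]) - np.asarray(centroids[j]))
            if d < best:
                best, bestPair = d, (i, j)
    return bestPair
```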

Six: The bisecting K-means algorithm

The bisecting K-means algorithm was proposed to address the local-minimum problem of the basic clustering algorithm. Its basic idea is to first treat all the points as a single cluster, split that cluster in two with 2-means, and then repeatedly choose one cluster to split again, where the cluster is chosen so that splitting it minimizes the total sum of squared errors. A common procedure is as follows:


Treat all the points as one cluster
While the number of clusters is less than K:
    For each cluster:
        Compute the total error of the cluster (error 1)
        Compute the total error after splitting the cluster in two with 2-means (error 2)
    Split the cluster with the largest difference between error 1 and error 2

(Alternatively, select the cluster whose split yields the smallest total error.)

An even simpler approach is to directly pick the cluster with the largest total error as the next cluster to split.

Based on the above procedure, we can write the following code:

```python
from numpy import *

def biKmeans(dataSet, k, distMeas=distEclud):
    # bisecting K-means; relies on kMeans() and distEclud() defined earlier
    m = shape(dataSet)[0]
    clusterAssment = mat(zeros((m, 2)))
    centroid0 = mean(dataSet, axis=0).tolist()[0]
    centList = [centroid0]                       # create a list with one centroid
    for j in range(m):                           # calc initial error
        clusterAssment[j, 1] = distMeas(mat(centroid0), dataSet[j, :]) ** 2
    while len(centList) < k:
        lowestSSE = inf
        for i in range(len(centList)):
            # get the data points currently in cluster i
            ptsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
            sseSplit = sum(splitClustAss[:, 1])  # compare the SSE to the current minimum
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])
            print("sseSplit, and notSplit:", sseSplit, sseNotSplit)
            if (sseSplit + sseNotSplit) < lowestSSE:
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit
        # relabel the split result: 1 becomes the new cluster index, 0 keeps the old one
        bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)
        bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
        print('the bestCentToSplit is:', bestCentToSplit)
        print('the len of bestClustAss is:', len(bestClustAss))
        # replace the split centroid with the two new centroids
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]
        centList.append(bestNewCents[1, :].tolist()[0])
        # reassign cluster indices and SSE for the points that were split
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss
    return mat(centList), clusterAssment
```

Seven: Clustering points on a map

When clustering points on a map, the first steps are obtaining the data and parsing it, both of which are omitted here. Suppose we already have the data stored in the file places.txt, where the fourth and fifth columns contain the values we need. We now use the biKmeans algorithm to find five clusters in this data and display the results.
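As a sketch of the data-loading step: we assume the file is tab-delimited, and the sample lines below are stand-ins for the real contents of places.txt.

```python
import numpy as np

# Hypothetical stand-in for the contents of places.txt (tab-delimited);
# in practice this string would come from open('places.txt').read().
sample = "A\tB\tC\t41.0\t-73.9\nD\tE\tF\t40.7\t-74.0\nG\tH\tI\t40.8\t-73.8\n"

datList = []
for line in sample.strip().split("\n"):
    fields = line.split("\t")
    datList.append([float(fields[3]), float(fields[4])])  # columns 4 and 5
datMat = np.mat(datList)
# myCentroids, clustAssing = biKmeans(datMat, 5)  # then plot the five clusters
```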

The program plots the resulting clusters (the result figure is not preserved in this copy).

Eight: Summary

Clustering is an unsupervised technique. Common clustering algorithms include K-means and bisecting K-means; the latter is an improved version of the former and generally performs better, because K-means is sensitive to the choice of initial centroids and easily falls into local minima. These are not the only clustering algorithms; there are many others, such as hierarchical clustering.

In general, the goal of a clustering algorithm is to discover clusters in unlabeled data. These clusters are not defined in advance, but they reveal some of the structure of the data.
