K-means clustering algorithm introduction and Python-based sample code

Source: Internet
Author: User

Clustering

Today we will talk about the K-means clustering algorithm, but we must first understand the difference between clustering and classification. Many business users do not distinguish the two carefully in their daily analysis, yet they are fundamentally different.

Classification is the process of learning patterns from labeled data and using them to make judgments. For example, Gmail has a spam classifier. At the beginning it may filter nothing. In daily use, I manually mark each email as "spam" or "not spam". After a while, Gmail is able to filter out some spam automatically. This is because, through this marking, each email is tagged with one of only two values, "spam" or "not spam", and Gmail keeps learning which features indicate spam and which do not, forming discriminative patterns. When a new email arrives, it can then be assigned automatically to one of the two categories we defined manually: "spam" or "not spam".

Clustering also aims to group data, but we do not know in advance how the data should be divided. The algorithm itself measures the similarity between data points and puts similar points together. Before clustering produces a result, we have no idea what features each group has; we have to analyze the clustering result afterward, using human experience, to see what characterizes each cluster.

1. Overview

K-means is a very common algorithm for clustering tasks. It is a distance-based clustering algorithm that is both simple and classic.

Distance is used as the measure of similarity: the closer two objects are, the more similar they are considered to be.

The algorithm regards a cluster as a set of objects that are close to one another, so its final goal is to obtain compact and well-separated clusters.
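To make the distance-based notion of similarity concrete, here is a minimal sketch of the Euclidean distance, the measure used by the sample code later in this article. The function name `euclidean_distance` and the three points are hypothetical and chosen only for illustration.

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line (Euclidean) distance between two points."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2)))

# Hypothetical 2-D points: p and q are close, r is far from both.
p, q, r = [1.0, 1.0], [1.0, 2.0], [4.0, 5.0]

d_pq = euclidean_distance(p, q)  # 1.0
d_pr = euclidean_distance(p, r)  # 5.0 (a 3-4-5 right triangle)
print(d_pq, d_pr)
```

Because `d_pq` is smaller than `d_pr`, K-means would treat p as more similar to q than to r.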

2. Core Ideas

The algorithm iteratively searches for a partition of the data into k clusters such that the total error incurred when each sample is represented by the mean of its cluster is minimized.

The k clusters have the following characteristics: each cluster is as compact as possible, and the clusters are separated from one another as much as possible.

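The k-means algorithm is based on the minimum squared error criterion.

The cost function is:

$$J(c, \mu) = \sum_{i=1}^{n} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$

In the formula, $x^{(i)}$ is the i-th sample, $c^{(i)}$ is the index of the cluster to which it is assigned, and $\mu_{c^{(i)}}$ is the mean of that cluster.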

The more similar the samples within each cluster are, the smaller the squared error between the samples and their cluster mean. The sum of the squared errors over all clusters can be used to judge whether a partition of the samples into k clusters is optimal.
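As a small illustration of this criterion, the sketch below computes the sum of squared errors for two different ways of splitting the same four points into two clusters. The data and the helper names `sum_of_squared_errors` and `cluster_means` are hypothetical.

```python
import numpy as np

def cluster_means(X, labels, k):
    """Mean of the samples assigned to each of the k clusters."""
    return np.array([X[labels == j].mean(axis=0) for j in range(k)])

def sum_of_squared_errors(X, labels, centroids):
    """Sum over all samples of the squared distance to their own cluster mean."""
    diffs = X - centroids[labels]  # each sample minus the mean of its cluster
    return float(np.sum(diffs ** 2))

# Hypothetical data: four points forming two obvious groups.
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])

good = np.array([0, 0, 1, 1])  # nearby points share a cluster
bad = np.array([0, 1, 0, 1])   # far-apart points share a cluster

sse_good = sum_of_squared_errors(X, good, cluster_means(X, good, 2))
sse_bad = sum_of_squared_errors(X, bad, cluster_means(X, bad, 2))
print(sse_good, sse_bad)  # 4.0 100.0
```

The partition that groups nearby points has the much smaller error, which is what the criterion is meant to reward.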

The cost function above cannot be minimized analytically, so an iterative method is used.

3. Algorithm steps

(Figure: K-means clustering of n sample points, here with k = 2.)

4. Algorithm implementation steps

The k-means algorithm groups the samples into k clusters, where k is given by the user. The solution process is intuitive and simple. The algorithm is described as follows:

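1) Randomly select k cluster center points $\mu_1, \mu_2, \ldots, \mu_k$.

2) Repeat the following process until convergence {

For each sample i, calculate the class it should belong to:

$$c^{(i)} := \arg\min_{j} \left\| x^{(i)} - \mu_j \right\|^2$$

For each class j, recalculate the center of the class:

$$\mu_j := \frac{\sum_{i=1}^{n} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{n} 1\{c^{(i)} = j\}}$$

}

Here $x^{(i)}$ is the i-th sample, $c^{(i)}$ is the index of the cluster it is assigned to, $\mu_j$ is the center of cluster j, and $1\{\cdot\}$ equals 1 when the condition holds and 0 otherwise.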
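The two steps inside the loop can be sketched with NumPy as follows. This is only an illustration of a single iteration on hypothetical data; the names `assign_step` and `update_step` are not from the original article, and leaving the center of an empty cluster unchanged is an assumption made here for robustness.

```python
import numpy as np

def assign_step(X, centroids):
    """Step 1: for every sample, the index of the nearest center."""
    # dists[i, j] = squared distance from sample i to center j
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dists.argmin(axis=1)

def update_step(X, labels, centroids):
    """Step 2: move each center to the mean of the samples assigned to it."""
    new_centroids = centroids.copy()
    for j in range(len(centroids)):
        members = X[labels == j]
        if len(members) > 0:  # assumption: an empty cluster keeps its old center
            new_centroids[j] = members.mean(axis=0)
    return new_centroids

# Hypothetical data and starting centers
X = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
centroids = np.array([[1.0, 1.0], [8.0, 1.0]])

labels = assign_step(X, centroids)
centroids = update_step(X, labels, centroids)
print(labels)     # [0 0 1 1]
print(centroids)  # centers move to (0, 1) and (10, 1)
```

Repeating the two calls until `labels` stops changing gives the full algorithm.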

Its pseudocode is as follows:

******************************************************************************

Create k points as the initial centers (randomly selected)
While the cluster assignment of any point changes:
    For each data point in the dataset:
        For each center:
            Calculate the distance between the center and the data point
        Assign the data point to the cluster with the nearest center
    For each cluster, calculate the mean of its points and use it as the new center

******************************************************************************

5. K-means clustering algorithm Python practice

Requirements:

Cluster a given dataset.

This case uses a two-dimensional dataset with a total of 80 samples and four classes.

#!/usr/bin/python
# coding=utf-8
import numpy as np


# Load data
def loadDataSet(fileName):
    """Parse a tab-separated text file into a 2-D array of floats."""
    dataMat = []
    with open(fileName) as fr:
        for line in fr:
            if not line.strip():
                continue  # skip blank lines
            curLine = line.strip().split('\t')
            dataMat.append([float(x) for x in curLine])  # convert each field to float
    return np.array(dataMat)


# Calculate the Euclidean distance between two vectors
def distEclud(vecA, vecB):
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))


# Create the initial cluster centers: k random centroids
def randCent(dataSet, k):
    n = dataSet.shape[1]
    centroids = np.zeros((k, n))  # k centers, each with n coordinates
    for j in range(n):
        # draw each coordinate uniformly within the range of column j
        minJ = dataSet[:, j].min()
        rangeJ = float(dataSet[:, j].max() - minJ)
        centroids[:, j] = minJ + rangeJ * np.random.rand(k)
    return centroids


# k-means clustering algorithm
def kMeans(dataSet, k, distMeans=distEclud, createCent=randCent):
    m = dataSet.shape[0]
    # Column 0 of clusterAssment stores the index of the center each sample
    # belongs to; column 1 stores the squared distance to that center.
    clusterAssment = np.zeros((m, 2))
    centroids = createCent(dataSet, k)
    clusterChanged = True  # used to determine whether the clustering has converged
    while clusterChanged:
        clusterChanged = False
        for i in range(m):  # assign each data point to the nearest center
            minDist = np.inf
            minIndex = -1
            for j in range(k):
                distJI = distMeans(centroids[j, :], dataSet[i, :])
                if distJI < minDist:  # center j is the closest so far
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True  # an assignment changed, so keep iterating
            clusterAssment[i, :] = minIndex, minDist ** 2
        print(centroids)
        for cent in range(k):  # recalculate the centers
            # select all samples whose assigned center is cent
            ptsInClust = dataSet[clusterAssment[:, 0] == cent]
            if len(ptsInClust) > 0:  # an empty cluster keeps its previous center
                centroids[cent, :] = np.mean(ptsInClust, axis=0)
    return centroids, clusterAssment


# ------------------ test ------------------
# Use the test data to test the k-means algorithm (four classes, so k = 4)
datMat = loadDataSet('testSet.txt')
myCentroids, clustAssing = kMeans(datMat, 4)
print(myCentroids)
print(clustAssing)

Running result:

6. K-means algorithm supplement

Disadvantages of the K-means algorithm and ways to improve it

(1) The value of k is chosen by the user, and different values of k give quite different results. In the left figure, the result for k = 3 is too coarse: the blue cluster could be further divided into two clusters. The right figure shows the result for k = 5, where the red and blue diamond clusters could be merged into one cluster:

Improvement:

To choose k, you can first use some algorithm to analyze the distribution of the data, for example its center of gravity and density, and then select an appropriate k.
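The article does not name a specific method for this analysis. One common heuristic, which is not taken from the original text, is the "elbow" method: run k-means for several values of k and look for the point where the sum of squared errors stops dropping sharply. The sketch below is a minimal version under that assumption; the synthetic data and the helper names `kmeans_sse` and `best_sse` are hypothetical.

```python
import numpy as np

def kmeans_sse(X, k, n_iter=100, seed=0):
    """Run a basic k-means started from k random samples; return the final SSE."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            if np.any(labels == j):  # an empty cluster keeps its old center
                new_centroids[j] = X[labels == j].mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return float(dists.min(axis=1).sum())

def best_sse(X, k, n_restarts=10):
    """Lowest SSE over several random starts, to reduce the effect of bad starts."""
    return min(kmeans_sse(X, k, seed=s) for s in range(n_restarts))

# Synthetic data (an assumption for this sketch): three well-separated blobs
rng = np.random.default_rng(42)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

sse_by_k = {k: best_sse(X, k) for k in range(1, 7)}
for k, sse in sse_by_k.items():
    print(k, round(sse, 1))
```

On data like this, the error typically falls steeply up to the true number of groups and only slowly afterward, and the k at that bend is a reasonable candidate. The heuristic is a guide rather than a rule, so the resulting clusters should still be inspected.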

(2) The algorithm is sensitive to the selection of the k initial centers and easily falls into a local minimum. For example, different runs of the algorithm above may produce different results, as shown in the following two cases. K-means still converges, but only to a local minimum:

Improvement:

Another algorithm, bisecting k-means, has been proposed; it is much less sensitive to the selection of the initial centers.
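The article gives no details of bisecting k-means, so the following is only a sketch of one common variant, under stated assumptions: start with all samples in one cluster, then repeatedly take the cluster with the largest sum of squared errors and split it in two with ordinary k-means (k = 2), keeping the best of a few trial splits. The function names, the parameter `n_trials`, and the synthetic data are all hypothetical.

```python
import numpy as np

def sse(points):
    """Sum of squared distances from the points to their mean."""
    return float(((points - points.mean(axis=0)) ** 2).sum())

def two_means(points, rng, n_iter=50):
    """Plain k-means with k = 2; returns a 0/1 label for every point."""
    centroids = points[rng.choice(len(points), size=2, replace=False)].copy()
    labels = np.zeros(len(points), dtype=int)
    for _ in range(n_iter):
        dists = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(2):
            if np.any(labels == j):
                new_centroids[j] = points[labels == j].mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels

def bisecting_kmeans(X, k, n_trials=5, seed=0):
    """Return a cluster label (0..k-1) for every row of X."""
    rng = np.random.default_rng(seed)
    labels = np.zeros(len(X), dtype=int)  # start with one big cluster
    for new_id in range(1, k):
        # choose the cluster with the largest SSE as the one to split
        worst = max(range(new_id), key=lambda j: sse(X[labels == j]))
        idx = np.flatnonzero(labels == worst)
        if len(idx) < 2:
            break  # nothing left to split
        best_split, best_cost = None, np.inf
        for _ in range(n_trials):  # keep the trial split with the lowest SSE
            split = two_means(X[idx], rng)
            if split.min() == split.max():
                continue  # degenerate split: one side is empty
            cost = sse(X[idx][split == 0]) + sse(X[idx][split == 1])
            if cost < best_cost:
                best_split, best_cost = split, cost
        if best_split is None:
            break
        labels[idx[best_split == 1]] = new_id
    return labels

# Synthetic data (an assumption for this sketch): three blobs of 30 points
rng = np.random.default_rng(1)
centers = np.array([[0.0, 0.0], [10.0, 0.0], [25.0, 0.0]])
X = np.vstack([c + rng.normal(scale=0.5, size=(30, 2)) for c in centers])

labels = bisecting_kmeans(X, k=3)
print(np.bincount(labels))  # number of samples in each cluster
```

Because every step only has to place two centers inside one existing cluster, a poor random start has less room to damage the final result than when all k centers are drawn at once.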

(3) The algorithm is limited in the kinds of data it handles well, for example non-spherical data distributions such as the following:

(4) When the dataset is large, convergence is slow.

Summary

That is all for this article. I hope it is of some reference value for your study or work. If you have any questions, please leave a message. Thank you for your support.
