K-means clustering algorithm: introduction and Python sample code
Clustering
Today we will talk about the K-means clustering algorithm, but first we need to understand the difference between clustering and classification. Many business analysts use the two terms loosely in everyday analysis, yet they are essentially different.
Classification is the process of learning patterns from labeled data and then making judgments. For example, Gmail's spam classifier may filter nothing at first. As I use it day to day and manually mark each email as "spam" or "not spam", Gmail gradually becomes able to filter spam automatically. This works because, during that marking process, every email is tagged with one of only two values, "spam" or "not spam"; Gmail keeps learning which features indicate spam and which do not, and builds a discriminative model from them. When a new email arrives, it can then be assigned automatically to one of the two categories we defined by hand: "spam" or "not spam".
Clustering also aims to divide the data into groups, but we do not know the division in advance. The algorithm itself decides how similar the data points are to one another and puts similar ones together. Before the clustering finishes, we have no idea what characterizes each group; we have to examine the clustering results with human experience to see what features each cluster actually shares.
1. Overview
K-means is a very common clustering algorithm, widely used for clustering tasks. It is a distance-based algorithm that is both simple and classic.
It uses distance as the similarity measure: the closer two objects are to each other, the more similar they are considered to be.
The algorithm assumes that a cluster is made up of objects that lie close together, so its goal is to produce clusters that are compact and well separated.
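To make the distance notion concrete, here is a minimal sketch of the Euclidean distance between two points in NumPy (the function name euclidean is illustrative; the full implementation in section 5 wraps the same expression in a helper called distEclud):

import numpy as np

# Euclidean distance: the smaller the value, the more similar the two objects
def euclidean(a, b):
    return np.sqrt(np.sum(np.power(a - b, 2)))

print(euclidean(np.array([0.0, 0.0]), np.array([3.0, 4.0])))  # prints 5.0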
2. Core Ideas
Iteratively search for a partition into k clusters such that, when each cluster is represented by its mean, the total error over all samples is minimized.
The k clusters should have two properties: each cluster is internally as compact as possible, and the clusters are separated from one another as much as possible.
The k-means algorithm is based on the minimum squared error criterion. The cost function is:

$$J(c, \mu) = \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^{2}$$

where $\mu_{c^{(i)}}$ denotes the mean (centroid) of the cluster to which sample $x^{(i)}$ is assigned.
The more similar the samples within a cluster are, the smaller the squared error between those samples and the cluster mean. Summing the squared error over all clusters therefore gives a single criterion for checking whether a given partition into k clusters is good.
This cost function cannot be minimized analytically; it can only be minimized iteratively.
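As a minimal sketch of evaluating this cost for a given partition (the names sse, points, labels, and centroids are illustrative, not from the original post):

import numpy as np

def sse(points, labels, centroids):
    # Sum of squared distances from each sample to the centroid of its assigned cluster
    return sum(np.sum((points[labels == j] - centroids[j]) ** 2)
               for j in range(len(centroids)))

Smaller values mean tighter clusters; comparing this value across candidate partitions is exactly the check described above.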
3. Algorithm Steps
[Figure: step-by-step K-means clustering of n sample points, with k = 2.]
4. Algorithm Implementation Steps
The k-means algorithm clusters the samples into k clusters, where k is given by the user. The solution process is intuitive and simple. The algorithm is as follows:
1) Randomly select k cluster centroids.
2) Repeat the following until convergence {
For each sample i, compute the cluster it should belong to:

$$c^{(i)} := \arg\min_{j} \left\| x^{(i)} - \mu_{j} \right\|^{2}$$

For each cluster j, recompute its centroid:

$$\mu_{j} := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}$$

}
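These two steps map directly onto NumPy. A minimal, vectorized sketch of one iteration, assuming points is an (m, n) array and centroids a (k, n) array (names are illustrative):

import numpy as np

def one_iteration(points, centroids):
    # Assignment step: c(i) = argmin_j || x(i) - mu_j ||^2
    dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
    labels = np.argmin(dists, axis=1)
    # Update step: mu_j = mean of all samples currently assigned to cluster j
    # (a cluster that receives no samples would yield NaN here; this sketch ignores that case)
    new_centroids = np.array([points[labels == j].mean(axis=0)
                              for j in range(len(centroids))])
    return labels, new_centroids

Repeating these two steps until the labels stop changing is the whole algorithm; the loop version appears in section 5.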
Its pseudocode is as follows:

Create k points as the initial centroids (chosen at random)
While the cluster assignment of any point changes:
    For each data point in the dataset:
        For each centroid:
            Compute the distance between the centroid and the data point
        Assign the data point to the nearest cluster
    For each cluster, compute the mean of its points and use that mean as the new centroid
5. K-means Clustering in Python
Requirement:
Cluster a given dataset.
This case uses a two-dimensional dataset with 80 samples in four classes.
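The original post reads the samples from a file named testSet.txt, which is not reproduced here. If you do not have it, a comparable file (80 two-dimensional samples around four centers) can be generated with a sketch like this; the centers and noise level are assumptions, not the original data:

import numpy as np

# Generate 20 points around each of four hypothetical centers and save them
# tab-separated, the format that loadDataSet() below expects
np.random.seed(0)
centers = np.array([[-3.0, -3.0], [-3.0, 3.0], [3.0, -3.0], [3.0, 3.0]])
points = np.vstack([c + np.random.randn(20, 2) for c in centers])
np.savetxt('testSet.txt', points, delimiter='\t')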
#!/usr/bin/python
# coding=utf-8
from numpy import *

# Load the data: parse the file, splitting fields on tabs, into a matrix of floats
def loadDataSet(fileName):
    dataMat = []
    fr = open(fileName)
    for line in fr.readlines():
        curLine = line.strip().split('\t')
        fltLine = list(map(float, curLine))  # convert each element to float
        dataMat.append(fltLine)
    return dataMat

# Euclidean distance between two vectors
def distEclud(vecA, vecB):
    return sqrt(sum(power(vecA - vecB, 2)))

# Construct k random centroids within the bounds of the dataset
def randCent(dataSet, k):
    n = shape(dataSet)[1]
    centroids = mat(zeros((k, n)))  # each centroid has n coordinates; k centroids in total
    for j in range(n):
        minJ = min(dataSet[:, j])
        maxJ = max(dataSet[:, j])
        rangeJ = float(maxJ - minJ)
        centroids[:, j] = minJ + rangeJ * random.rand(k, 1)
    return centroids

# k-means clustering algorithm
def kMeans(dataSet, k, distMeans=distEclud, createCent=randCent):
    m = shape(dataSet)[0]
    # clusterAssment column 0: index of the assigned centroid; column 1: squared distance to it
    clusterAssment = mat(zeros((m, 2)))
    centroids = createCent(dataSet, k)
    clusterChanged = True  # used to decide whether the clustering has converged
    while clusterChanged:
        clusterChanged = False
        for i in range(m):  # assign each data point to the nearest centroid
            minDist = inf
            minIndex = -1
            for j in range(k):
                distJI = distMeans(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j  # point i is currently closest to centroid j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True  # an assignment changed, so keep iterating
            clusterAssment[i, :] = minIndex, minDist ** 2  # record the assignment of point i
        print(centroids)
        for cent in range(k):  # recompute the centroids
            # take all rows whose assigned cluster index equals cent
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
            centroids[cent, :] = mean(ptsInClust, axis=0)  # new centroid is the mean of those points
    return centroids, clusterAssment

# ------------------ test ------------------
# Run kMeans on the test data
datMat = mat(loadDataSet('testSet.txt'))
myCentroids, clustAssing = kMeans(datMat, 4)
print(myCentroids)
print(clustAssing)
Running result: the script prints the centroids at each iteration, followed by the final centroids and the per-sample cluster assignments (the original output is not reproduced here).
6. K-means Algorithm Supplement
Shortcomings of the K-means algorithm, and ways to improve it:
(1) The value of k is chosen by the user, and different values of k can give very different results, as the figures in the original post show. The result for k = 3 (left) is too coarse: the blue cluster could be split further into two clusters. The result for k = 5 (right) has the opposite problem: the clusters marked with red and blue diamonds could be merged into one.
Improvement:
For the choice of k, first analyze the distribution of the data, for example its centers of mass and density, and then pick an appropriate k. One common heuristic is sketched below.
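One widely used heuristic, not from the original post, is the elbow method: run k-means for several values of k, record the total within-cluster squared error, and pick the k where the curve stops dropping sharply. A minimal sketch reusing kMeans and datMat from section 5:

# Elbow method sketch: total squared error for k = 1..6
# (with this simple random initialization a run can occasionally
# produce an empty cluster; re-run if that happens)
for k in range(1, 7):
    centroids, clustAssment = kMeans(datMat, k)
    print(k, clustAssment[:, 1].sum())  # column 1 holds each sample's squared distance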
(2) The algorithm is sensitive to the choice of the k initial centroids and can easily fall into a local minimum. For example, different runs of the algorithm above can produce different results, as in the two cases shown in the original post: k-means still converges in both, but only to a local minimum.
Improvement:
The bisecting k-means algorithm has been proposed; it is much less sensitive to the choice of the initial centroids. A sketch follows.
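A minimal sketch of bisecting k-means, reusing kMeans and distEclud from section 5. The idea of the usual formulation: start with all samples in one cluster, then repeatedly split, with ordinary 2-means, the cluster whose split gives the lowest total squared error, until k clusters exist:

from numpy import *

def biKmeans(dataSet, k, distMeans=distEclud):
    m = shape(dataSet)[0]
    clusterAssment = mat(zeros((m, 2)))
    # start with a single cluster whose centroid is the mean of all points
    centroid0 = mean(dataSet, axis=0).tolist()[0]
    centList = [centroid0]
    for j in range(m):
        clusterAssment[j, 1] = distMeans(mat(centroid0), dataSet[j, :]) ** 2
    while len(centList) < k:
        lowestSSE = inf
        for i in range(len(centList)):
            # try splitting cluster i in two with ordinary k-means
            ptsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeans)
            sseSplit = sum(splitClustAss[:, 1])
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1])
            if sseSplit + sseNotSplit < lowestSSE:
                bestCentToSplit = i
                bestNewCents = centroidMat
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit
        # the half labeled 1 becomes a brand-new cluster; the half labeled 0 keeps the old index
        bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)
        bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]
        centList.append(bestNewCents[1, :].tolist()[0])
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss
    return mat(centList), clusterAssment

Because every split is a local 2-means problem on a single cluster, the result depends far less on any one random initialization.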
(3) The algorithm has inherent limitations: for example, it handles non-spherical data distributions, such as the one shown in the original post's figure, poorly.
(4) When the dataset is large, convergence can be slow.
Summary
That is all for this article. I hope it provides some reference value for your study or work. If you have any questions, please leave a comment. Thank you for your support.