K-means++ algorithm example implemented in Python

This article introduces the k-means and k-means++ algorithms: it explains the shortcomings of k-means and the idea behind the k-means++ initialization, and presents k-means++ implementations in both Python and matlab.

1. Starting with k-means

K-means is a very basic clustering algorithm based on iterative refinement; its principles are not covered here. The following shows how to use the kmeans function in matlab.

Create seven two-dimensional data points:

The code is as follows:

X = [randn(3,2)*.4; randn(4,2)*.5 + ones(4,1)*[4 4]];


Use the kmeans function:

The code is as follows:

class = kmeans(X, 2);


X is the data matrix; each row of X is one data point. The 2 means that two centers are required, that is, the clustering result must have two clusters. class is a column vector with seven elements, corresponding one-to-one to the seven data points; each element's value is the cluster number of the corresponding data point. After one run, the value of class is:

The code is as follows:


2
2
2
1
1
1
1


This indicates that the first three data points of X belong to cluster 2, and the last four belong to cluster 1. The kmeans function can also be used as follows:

The code is as follows:


>> [class, C, sumd, D] = kmeans(X, 2)

class =
     2
     2
     2
     1
     1
     1
     1

C =
    4.0629    4.0845
   -0.1341    0.1201

sumd =
    1.2017
    0.2939

D =
   34.3727    0.0184
   29.5644    0.1858
   36.3511    0.0898
    0.1247   37.4801
    0.7537   24.0659
    0.1979   36.7666
    0.1256   36.2149

class again gives the classification of each data point. C contains the final centers, one row per center. sumd gives, for each cluster, the total distance from the cluster's center to the data points assigned to it. Each row of D corresponds to a data point, and the values in the row are that data point's distances to each center. kmeans uses the squared Euclidean distance by default (see [3]); other distances, such as the Manhattan distance (L1 distance), can be selected through an additional parameter.
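For readers working in Python, scikit-learn's KMeans class exposes analogous quantities. The following is only a sketch of the analogy, assuming scikit-learn and numpy are installed; it mirrors the matlab call above but is not the same function:

import numpy as np
from sklearn.cluster import KMeans

# the same kind of seven 2-d points as the matlab example
X = np.vstack([np.random.randn(3, 2) * 0.4,
               np.random.randn(4, 2) * 0.5 + [4, 4]])

km = KMeans(n_clusters=2, init='k-means++', n_init=10).fit(X)
print(km.labels_)           # like class: the cluster index of each point
print(km.cluster_centers_)  # like C: one row per center
print(km.inertia_)          # like sum(sumd): total within-cluster squared distance
print(km.transform(X))      # like sqrt(D): distance from each point to each center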

K-means has several disadvantages (described in many references):

1. The final number of clusters k (that is, the number of centers, or seed points) is not necessarily known in advance, so choosing a proper value of k is a problem in itself.
2. The quality of the initial seed points affects the clustering result.
3. It is sensitive to noise and outliers.

2. The basic idea of the k-means++ algorithm

The main work of the k-means++ algorithm lies in the selection of the seed points. The guiding principle is to make the seed points as far apart from each other as possible while suppressing the influence of noise. The basic steps are as follows:

1. Randomly select one point from the input data set (k clusters are required) as the first cluster center.
2. For each point x in the data set, compute D(x), the distance between x and the nearest of the cluster centers chosen so far.
3. Select a new data point as the next cluster center, with the rule that points with larger D(x) have a higher probability of being selected.
4. Repeat steps 2 and 3 until k cluster centers have been selected.
5. Run the standard k-means algorithm with these k initial cluster centers.

Assume the data set X has n data points X(1), X(2), ..., X(n). In step 2 we compute the distance from each data point to its nearest seed point (cluster center), obtaining D(1), D(2), ..., D(n) in turn. To resist noise, we should not simply pick the element of D with the largest value; instead we should pick an element with a comparatively large value and use the corresponding data point as the seed point.

How do we select an element with a comparatively large value? Here is one idea (I have not found its original source; it is mentioned in [2] and elsewhere, and I have rephrased it for clarity): think of each element D(x) of the set D as a line segment L(x) whose length is the value of the element. Connect the segments L(1), L(2), ..., L(n) end to end to form one long line L; L(1), L(2), ..., L(n) are then sub-segments of L. By elementary probability, if we pick a point on L uniformly at random, the sub-segment it falls in is likely to be a long one, and the data point corresponding to that sub-segment can be used as the seed point. Both k-means++ implementations below use this principle.
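A minimal sketch of this roulette-wheel selection in Python (standalone; the function name and the representation of points as (x, y) tuples are illustrative, not part of either implementation below):

import random

def choose_next_center(points, centers):
    # d[i]: squared distance from points[i] to its nearest existing center
    d = [min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers)
         for p in points]
    # pick a position uniformly at random on the long line L(1)+L(2)+...+L(n)
    r = random.random() * sum(d)
    for i, di in enumerate(d):
        r -= di
        if r <= 0:        # the position falls inside sub-segment L(i)
            return points[i]
    return points[-1]     # guard against floating-point round-off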

3. k-means++ in Python

K-means++ implementations in several programming languages can be found at http://rosettacode.org/wiki/K-means%2B%2B_clustering. The following is the Python version (the inline comments were added by this article's author and are translated here):

The code is as follows:


from math import pi, sin, cos
from collections import namedtuple
from random import random, choice
from copy import copy

try:
    import psyco
    psyco.full()
except ImportError:
    pass

FLOAT_MAX = 1e100

class Point:
    __slots__ = ["x", "y", "group"]
    def __init__(self, x=0.0, y=0.0, group=0):
        self.x, self.y, self.group = x, y, group

def generate_points(npoints, radius):
    points = [Point() for _ in xrange(npoints)]

    # note: this is not a uniform 2-d distribution
    for p in points:
        r = random() * radius
        ang = random() * 2 * pi
        p.x = r * cos(ang)
        p.y = r * sin(ang)

    return points

def nearest_cluster_center(point, cluster_centers):
    """Distance and index of the closest cluster center"""
    def sqr_distance_2D(a, b):
        return (a.x - b.x) ** 2 + (a.y - b.y) ** 2

    min_index = point.group
    min_dist = FLOAT_MAX

    for i, cc in enumerate(cluster_centers):
        d = sqr_distance_2D(cc, point)
        if min_dist > d:
            min_dist = d
            min_index = i

    return (min_index, min_dist)

'''
points is the list of data points; cluster_centers holds the initialized
centers, all starting as (0, 0).
'''
def kpp(points, cluster_centers):
    cluster_centers[0] = copy(choice(points))  # randomly select the first center
    d = [0.0 for _ in xrange(len(points))]  # d[j]: distance between point j and its nearest center

    for i in xrange(1, len(cluster_centers)):  # i = 1..len(cluster_centers)-1
        sum = 0
        for j, p in enumerate(points):
            d[j] = nearest_cluster_center(p, cluster_centers[:i])[1]  # min distance from p to the chosen centers
            sum += d[j]

        sum *= random()

        for j, di in enumerate(d):
            sum -= di
            if sum > 0:
                continue
            cluster_centers[i] = copy(points[j])
            break

    for p in points:
        p.group = nearest_cluster_center(p, cluster_centers)[0]

'''
points is the list of data points; nclusters is the specified number of clusters.
'''
def lloyd(points, nclusters):
    cluster_centers = [Point() for _ in xrange(nclusters)]  # initialize the centers, all at (0, 0)

    # call k++ init
    kpp(points, cluster_centers)  # select the initial seed points

    # kmeans
    lenpts10 = len(points) >> 10  # stop threshold: 0.1% of the points

    changed = 0
    while True:
        # group element for centroids are used as counters
        for cc in cluster_centers:
            cc.x = 0
            cc.y = 0
            cc.group = 0

        for p in points:
            cluster_centers[p.group].group += 1  # number of points in this cluster
            cluster_centers[p.group].x += p.x
            cluster_centers[p.group].y += p.y

        for cc in cluster_centers:  # compute the new centers
            cc.x /= cc.group
            cc.y /= cc.group

        # find closest centroid of each point
        changed = 0  # number of points whose cluster changed
        for p in points:
            min_i = nearest_cluster_center(p, cluster_centers)[0]
            if min_i != p.group:
                changed += 1
                p.group = min_i

        # stop when 99.9% of points are good
        if changed <= lenpts10:
            break

    for i, cc in enumerate(cluster_centers):
        cc.group = i

    return cluster_centers

def print_eps(points, cluster_centers, W=400, H=400):
    Color = namedtuple("Color", "r g b")

    colors = []
    for i in xrange(len(cluster_centers)):
        colors.append(Color((3 * (i + 1) % 11) / 11.0,
                            (7 * i % 11) / 11.0,
                            (9 * i % 11) / 11.0))

    max_x = max_y = -FLOAT_MAX
    min_x = min_y = FLOAT_MAX

    for p in points:
        if max_x < p.x: max_x = p.x
        if min_x > p.x: min_x = p.x
        if max_y < p.y: max_y = p.y
        if min_y > p.y: min_y = p.y

    scale = min(W / (max_x - min_x),
                H / (max_y - min_y))
    cx = (max_x + min_x) / 2
    cy = (max_y + min_y) / 2

    print "%%!PS-Adobe-3.0\n%%%%BoundingBox: -5 -5 %d %d" % (W + 10, H + 10)

    print ("/l {rlineto} def /m {rmoveto} def\n" +
           "/c { .25 sub exch .25 sub exch .5 0 360 arc fill } def\n" +
           "/s { moveto -2 0 m 2 2 l 2 -2 l -2 -2 l closepath " +
           "gsave 1 setgray fill grestore gsave 3 setlinewidth " +
           "1 setgray stroke grestore 0 setgray stroke } def")

    for i, cc in enumerate(cluster_centers):
        print ("%g %g %g setrgbcolor" %
               (colors[i].r, colors[i].g, colors[i].b))

        for p in points:
            if p.group != i:
                continue
            print ("%.3f %.3f c" % ((p.x - cx) * scale + W / 2,
                                    (p.y - cy) * scale + H / 2))

        print ("\n0 setgray %g %g s" % ((cc.x - cx) * scale + W / 2,
                                        (cc.y - cy) * scale + H / 2))

    print "\n%%EOF"

def main():
    npoints = 30000
    k = 7  # number of clusters

    points = generate_points(npoints, 10)
    cluster_centers = lloyd(points, k)
    print_eps(points, cluster_centers)

main()

The algorithm implemented above works on two-dimensional data, so the Point class has three attributes: the x coordinate, the y coordinate, and the ID of the cluster the point belongs to. The lloyd function is the overall driver of the k-means++ algorithm: it first selects suitable seed points through the kpp function, then runs the k-means iterations to cluster the data set. The kpp function follows steps 2, 3, and 4 of the basic idea described above exactly.
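As a minimal usage sketch (the point count and cluster count below are arbitrary, and the definitions above are assumed to be in scope), the pipeline can be exercised without the EPS output like this:

points = generate_points(1000, 10)   # 1000 random 2-d points in a disc of radius 10
centers = lloyd(points, 3)           # k-means++ seeding followed by k-means
for cc in centers:
    print "center %d: (%.3f, %.3f)" % (cc.group, cc.x, cc.y)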


4. k-means++ in matlab

The code is as follows:


function [L,C] = kmeanspp(X,k)
%KMEANS Cluster multivariate data using the k-means++ algorithm.
%   [L,C] = kmeans_pp(X,k) produces a 1-by-size(X,2) vector L with one class
%   label per column in X and a size(X,1)-by-k matrix C containing the
%   centers corresponding to each class.

%   Version: 2013-02-08
%   Authors: Laurent Sorber (Laurent.Sorber@cs.kuleuven.be)

L = [];
L1 = 0;

while length(unique(L)) ~= k

    % The k-means++ initialization.
    C = X(:,1+round(rand*(size(X,2)-1))); % size(X,2) is the number of data points; C collects the centers
    L = ones(1,size(X,2));
    for i = 2:k
        D = X-C(:,L); % -1
        D = cumsum(sqrt(dot(D,D,1))); % accumulate the distance between each data point and its nearest center
        if D(end) == 0, C(:,i:k) = X(:,ones(1,k-i+1)); return; end
        C(:,i) = X(:,find(rand < D/D(end),1)); % the second argument of find is the number of indices to return
        [~,L] = max(bsxfun(@minus,2*real(C'*X),dot(C,C,1).')); % assign each data point to its nearest center
    end

    % The k-means algorithm.
    while any(L ~= L1)
        L1 = L;
        for i = 1:k, l = L==i; C(:,i) = sum(X(:,l),2)/sum(l); end
        [~,L] = max(bsxfun(@minus,2*real(C'*X),dot(C,C,1).'),[],1);
    end

end

The implementation of this function is somewhat special: the parameter X is the data set, but each column (not row) is treated as one data point; the parameter k is the specified number of clusters. The return value L gives the classification of each data point, and the return value C stores the final centers (one column per center). A test:

The code is as follows:


>> X = [randn(3,2)*.4; randn(4,2)*.5 + ones(4,1)*[4 4]]
X =
   -0.0497    0.5669
    0.5959    0.2686
    0.5636   -0.4830
    4.3586    4.3634
    4.8151    3.8483
    4.2444    4.1469
    4.5173    3.6064

>> [L,C] = kmeanspp(X',2)
L =
     2     2     2     1     1     1     1
C =
    4.4839    0.3699
    3.9913    0.1175

Now let us work through this implementation bit by bit, consolidating some matlab knowledge along the way.

The unique function returns the distinct values in a matrix, for example:

The code is as follows:


>> unique([1 3 3 4 4 5])
ans =
     1     3     4     5
>> unique([1 3 3; 4 4 5])
ans =
     1
     3
     4
     5


So the outer loop, while length(unique(L)) ~= k, uses "exactly k clusters obtained" as its termination condition. In general this loop runs only once; all the real work happens inside it.

rand returns a random number in the range (0, 1). In the row marked with the annotation % -1, C is expanded by column indexing; the expansion works like the following:

The code is as follows:


>> C = [];
>> C(1,1) = 1
C =
     1
>> C(2,1) = 2
C =
     1
     2
>> C(:,[1 1 1])
ans =
     1     1     1
     2     2     2
>> C(:,[1 1 1 2])
Index exceeds matrix dimensions.


Each 1 in the second index actually selects the first column of C; the error "Index exceeds matrix dimensions." occurs for the value 2 because C has no second column. If C is given a second column:

The code is as follows:


>> C(1,2) = 3;
>> C(2,2) = 4;
>> C(:,[1 1 1 2])
ans =
     1     1     1     3
     2     2     2     4


The dot function multiplies two matrices element-wise and then sums the result along a given dimension:

The code is as follows:


>> TT = [1 2 3; 4 5 6];
>> dot(TT,TT)
ans =
    17    29    45
>> dot(TT,TT,1)
ans =
    17    29    45


cumsum is the cumulative-sum function:

The code is as follows:


>> cumsum([1 2 3])
ans =
     1     3     6
>> cumsum([1 2 3; 4 5 6])
ans =
     1     2     3
     5     7     9


The max function can return two values; the second is the index of the maximum (taken column-wise for a matrix):

The code is as follows:


>> [~, L] = max([1 2 3])
L =
     3
>> [~, L] = max([1 2 3; 2 3 4])
L =
     2     2     2


Here ~ is a placeholder for an output argument that is discarded.

For the bsxfun function, the official documentation says:

The code is as follows:


C = bsxfun(fun,A,B) applies the element-by-element binary operation specified by the function handle fun to arrays A and B, with singleton expansion enabled.


The parameter fun is a function handle; for details about function handles, see [9]. The following is an example of bsxfun:

The code is as follows:

>> A = [1 2 3; 2 3 4]
A =
     1     2     3
     2     3     4
>> B = [6; 7]
B =
     6
     7
>> bsxfun(@minus, A, B)
ans =
    -5    -4    -3
    -5    -4    -3


As for the line:

The code is as follows:


[~,L] = max(bsxfun(@minus,2*real(C'*X),dot(C,C,1).'));


the argument passed to max is a matrix with n columns, n being the number of data points. Each column holds, for the corresponding data point, the negatives of its distances to all the centers; the "distance" here is a deformation of the Euclidean distance, as follows.

Assume the data points are two-dimensional. If a data point is (x1, y1) and a center is (c1, d1), then the corresponding entry of bsxfun(@minus, 2*real(C'*X), dot(C,C,1).') is 2*c1*x1 + 2*d1*y1 - c1^2 - d1^2, which can be rewritten as x1^2 + y1^2 - (c1-x1)^2 - (d1-y1)^2. For a fixed data point, the term x1^2 + y1^2 is the same for every center, so it can be ignored when comparing centers; what is computed is therefore, up to that constant, the negative of the squared Euclidean distance. This also justifies the use of max: a data point belongs to the cluster of its nearest center, and with negated distances the nearest center is the one with the largest value.
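This can be checked numerically. The following sketch (Python with numpy; the variable names and sizes are illustrative) verifies that the bsxfun expression selects the same nearest center as the negative squared Euclidean distance:

import numpy as np

np.random.seed(0)
X = np.random.randn(2, 7)   # 7 two-dimensional data points, one per column
C = np.random.randn(2, 3)   # 3 centers, one per column

# the matlab expression 2*real(C'*X) - dot(C,C,1).'
score = 2 * C.T.dot(X) - np.sum(C * C, axis=0)[:, None]

# the negative squared Euclidean distance from every point to every center
neg_sq_dist = -((X[:, None, :] - C[:, :, None]) ** 2).sum(axis=0)

# the two differ only by ||x||^2, a per-point constant, so the argmax over
# centers (axis 0) is identical
print(np.array_equal(score.argmax(axis=0), neg_sq_dist.argmax(axis=0)))  # True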
