Machine learning: Kmeans

Last Update:2015-04-27 Source: Internet

Author: User


Introduction

I first encountered K-means as a senior undergraduate. Having recently picked it up again through the book Machine Learning in Action, and after rereading related articles from the past few years, I want to summarize what I know about K-means.

Algorithmic flow

First, each sample vector in the dataset can be treated as a point in a high-dimensional space.

We can start by randomly selecting K data points from the dataset as the initial class centers, or by generating K centroids that lie within the range of the dataset. Note that these K centroids need not be actual data points (Machine Learning in Action generates K random centroids within the range of the dataset); because the centroids are recalculated later, this makes no difference.

Each data point is then assigned to the class whose center is closest to it, which requires computing its distance to every class center. This forms K clusters.

Recalculate the class center of each cluster.

Repeat the assignment and recalculation steps until the clusters no longer change or the maximum number of iterations is reached.
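The steps above can be sketched in plain Python as follows. This is a minimal illustration, not the Machine Learning in Action implementation: the names kmeans and sqdist, the max_iter and seed parameters, and the use of lists of tuples instead of NumPy matrices are assumptions made to keep the example free of dependencies. It uses the first initialization option (K random data points) and simply leaves a center unchanged if its cluster becomes empty.

```python
import random

def sqdist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, max_iter=100, seed=0):
    """Cluster `points` (a list of tuples) into k groups; returns (centers, labels)."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)          # pick k data points as initial centers
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # assign every point to the index of its nearest center
        new_labels = [
            min(range(k), key=lambda j: sqdist(p, centers[j]))
            for p in points
        ]
        if new_labels == labels:             # no assignment changed: converged
            break
        labels = new_labels
        # move each center to the mean of the points assigned to it
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                      # assumption: keep old center if cluster is empty
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centers, labels

pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centers, labels = kmeans(pts, 2)
print(labels)
```

On this toy data the first three points end up with one label and the last three with the other; which group is numbered 0 depends on the random initial choice.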

Complexity of the algorithm

Time complexity: O(tkmn), where t is the number of iterations, k is the number of clusters, n is the number of samples, and m is the number of dimensions.

Space complexity: O(nm)

In general t, k, and m can be treated as constants, so both time and space complexity simplify to O(n), i.e. linear in the number of samples.
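As a rough worked example of the bound (the numbers are hypothetical and chosen only to make the arithmetic easy), take t = 20 iterations, k = 5 clusters, n = 10^6 samples, and m = 10 dimensions:

```latex
t \cdot k \cdot n \cdot m = 20 \cdot 5 \cdot 10^{6} \cdot 10 = 10^{9}
\qquad\text{and}\qquad
n \cdot m = 10^{6} \cdot 10 = 10^{7}
```

So the run costs on the order of 10^9 coordinate operations and stores about 10^7 values. With t, k, and m held fixed, doubling n doubles both figures, which is the linear behaviour described above.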

Algorithm implementation

The first step is to randomly generate K initial class centers:

    from numpy import *

    def randCent(dataSet, k):
        n = shape(dataSet)[1]                 # number of dimensions
        centroids = mat(zeros((k, n)))        # create centroid mat
        for j in range(n):  # create random cluster centers, within bounds of each dimension
            minJ = min(dataSet[:, j])
            rangeJ = float(max(dataSet[:, j]) - minJ)
            centroids[:, j] = mat(minJ + rangeJ * random.rand(k, 1))
        return centroids

For each dimension, the function draws K random values between the minimum and maximum of that dimension. Taking one value from each dimension gives one centroid, so the function produces K centroids in total.
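The same idea can be written without NumPy. The sketch below is an assumption-laden illustration rather than the book's code: rand_centroids is a hypothetical name, and points are plain tuples. It draws each coordinate uniformly between that dimension's minimum and maximum.

```python
import random

def rand_centroids(points, k, rng=random):
    """Draw k centroids uniformly inside the per-dimension bounds of `points`."""
    dims = list(zip(*points))                 # one tuple of values per dimension
    lows = [min(d) for d in dims]
    highs = [max(d) for d in dims]
    return [
        tuple(lo + (hi - lo) * rng.random() for lo, hi in zip(lows, highs))
        for _ in range(k)
    ]

data = [(0.0, 10.0), (2.0, 30.0), (1.0, 20.0)]
for c in rand_centroids(data, 3, random.Random(0)):
    print(c)
```

Every generated centroid has its first coordinate between 0 and 2 and its second between 10 and 30, even though none of them has to coincide with a data point.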

Next comes the assignment of each point.

The main kMeans function defines a flag, clusterChanged, initialized to True. The loop keeps iterating as long as any assignment still changes and stops once the clusters are stable.

For each sample i of the m samples, the distance from the sample to each of the K centroids is computed, the centroid at minimum distance is found, and the sample is assigned to that centroid's class.

The program uses clusterAssment, an m x 2 matrix that stores, for each of the m samples, the index of its class and the squared distance from the sample to that class's centroid.

Recalculate centroid

Once all points have been assigned, recalculate each centroid as the mean of the points in its cluster.

    # Assumes `from numpy import *` and the randCent function shown earlier.
    # distEclud is not listed in this article; this is the Euclidean distance
    # helper as defined in Machine Learning in Action.
    def distEclud(vecA, vecB):
        return sqrt(sum(power(vecA - vecB, 2)))

    def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
        m = shape(dataSet)[0]
        clusterAssment = mat(zeros((m, 2)))  # create mat to assign data points
                                             # to a centroid, also holds SE of each point
        centroids = createCent(dataSet, k)
        clusterChanged = True
        while clusterChanged:
            clusterChanged = False
            for i in range(m):  # for each data point assign it to the closest centroid
                minDist = inf; minIndex = -1
                for j in range(k):
                    distJI = distMeas(centroids[j, :], dataSet[i, :])
                    if distJI < minDist:
                        minDist = distJI; minIndex = j
                if clusterAssment[i, 0] != minIndex: clusterChanged = True
                clusterAssment[i, :] = minIndex, minDist**2
            print(centroids)
            for cent in range(k):  # recalculate centroids
                ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]  # get all the points in this cluster
                centroids[cent, :] = mean(ptsInClust, axis=0)  # assign centroid to mean
        return centroids, clusterAssment

Measure the quality of clustering

A common index for measuring clustering quality is SSE, the sum of squared errors: the sum of the squared distances from each sample to its centroid, i.e. the sum of the second column of the clusterAssment matrix above. The smaller the SSE, the closer the data points are to their centroids and the better the clustering result.
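A small sketch of the SSE calculation, using plain tuples rather than the clusterAssment matrix. The function name sse and the toy points are assumptions made for illustration. The same four points are scored against two sets of centers to show that centers sitting at the cluster means give the lower SSE.

```python
def sse(points, centers, labels):
    """Sum of squared Euclidean distances from each point to its assigned center."""
    total = 0.0
    for p, lab in zip(points, labels):
        total += sum((a - b) ** 2 for a, b in zip(p, centers[lab]))
    return total

points = [(0.0, 0.0), (2.0, 0.0), (10.0, 0.0), (12.0, 0.0)]
labels = [0, 0, 1, 1]
tight = sse(points, [(1.0, 0.0), (11.0, 0.0)], labels)   # centers at the cluster means
loose = sse(points, [(0.0, 0.0), (12.0, 0.0)], labels)   # centers pushed to the edges
print(tight, loose)   # 4.0 8.0
```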

Methods for determining K -values

Run multiple times and pick the K with the smallest SSE

A common approach is to run the algorithm several times, each time with a different set of random initial centroids, and then select the clustering with the smallest SSE (sum of squared errors). This is simple but may not be effective.

In practical applications, K-means is generally used for data preprocessing or to assist classification labeling, so K is usually not set very large. You can enumerate K from 2 up to a fixed value such as 10, run K-means several times for each K (to avoid local optima), compute the SSE for that K, and finally select the K with the smallest SSE as the number of clusters. Note that SSE tends to keep decreasing as K grows, so in practice it helps to look for the value of K beyond which the SSE stops dropping sharply, rather than only the raw minimum.
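The enumeration described above might look like the following sketch. The helper names kmeans_sse and sweep_k, the restarts parameter, and the synthetic three-blob data are all assumptions for illustration, and plain lists replace the NumPy matrices used in the article's code.

```python
import random

def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_sse(points, k, rng, max_iter=50):
    """One k-means run from k random data points; returns the final SSE."""
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda j: sqdist(p, centers[j])) for p in points]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous center
                centers[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return sum(sqdist(p, centers[lab]) for p, lab in zip(points, labels))

def sweep_k(points, k_values, restarts=5, seed=0):
    """For each k, keep the lowest SSE seen over several random restarts."""
    rng = random.Random(seed)
    return {k: min(kmeans_sse(points, k, rng) for _ in range(restarts)) for k in k_values}

# Synthetic data: 20 noisy points around each of three made-up blob centers.
gen = random.Random(1)
blobs = [(0.0, 0.0), (8.0, 0.0), (4.0, 7.0)]
points = [(cx + gen.gauss(0, 0.5), cy + gen.gauss(0, 0.5)) for cx, cy in blobs for _ in range(20)]

for k, err in sweep_k(points, range(2, 7)).items():
    print(k, round(err, 2))
```

Reading the printed table, one would look for the K after which the SSE stops falling steeply instead of blindly taking the smallest value, since adding clusters rarely makes the SSE worse.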

Combination with hierarchical clustering

First, a hierarchical clustering algorithm is used to determine the approximate number of clusters and to find an initial clustering; iterative relocation is then used to improve the clustering.

This is suitable when (1) the sample is relatively small, for example hundreds to thousands of points, mainly because hierarchical clustering is expensive, and (2) K is small relative to the sample size.

Initial partitioning using the Canopy algorithm

In its first phase, Canopy clustering uses a simple, computationally cheap method to measure the similarity of objects and places similar objects in a subset called a canopy. After a series of such calculations we obtain several canopies.

From the number of canopies we can roughly infer the value of K and avoid choosing K blindly.

Canopy algorithm

Initially we have a point set S and two preset distance thresholds, T1 > T2.

Select a point P from S, use it as the center of a new canopy, and compute its distance to the other points in S (using a low-cost distance calculation).

Place the points whose distance to P is within T1 into this canopy, and delete from S the points whose distance to P is within T2.

This ensures that points within T2 of the center P can no longer become the center of another canopy.

Select a new canopy center from the points remaining in S and repeat until S is empty.
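The Canopy steps can be sketched as below. This is an illustrative reading of the description, under stated assumptions: the function name canopy, the choice of always taking the first remaining point as P (instead of a random one, to keep the output deterministic), and Manhattan distance as the "low-cost" metric are not from the original article.

```python
def canopy(points, t1, t2, dist):
    """Group points into (possibly overlapping) canopies; requires t1 > t2."""
    assert t1 > t2
    remaining = list(points)
    canopies = []
    while remaining:
        center = remaining.pop(0)           # pick a point P as the canopy center
        members = [center]
        still_remaining = []
        for q in remaining:
            d = dist(center, q)
            if d < t1:
                members.append(q)           # within T1: q joins this canopy
            if d >= t2:
                still_remaining.append(q)   # not within T2: q stays in S
        remaining = still_remaining
        canopies.append((center, members))
    return canopies

def manhattan(p, q):
    """Cheap distance used only for the rough canopy pass."""
    return sum(abs(a - b) for a, b in zip(p, q))

pts = [(0.0, 0.0), (0.5, 0.2), (0.3, 0.4), (10.0, 10.0), (10.4, 9.9), (9.8, 10.5)]
result = canopy(pts, t1=2.0, t2=1.0, dist=manhattan)
print(len(result))   # 2
```

Here the two canopies suggest K = 2 as a starting value. Points that fall between T2 and T1 of a center stay in S, so they may appear in more than one canopy.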

Handling empty clusters

If no points are assigned to a cluster during the assignment step, the result is an empty cluster.

In that case a replacement centroid must be chosen, otherwise the squared error will be larger than necessary.

One approach is to choose the point farthest from any current centroid, which removes the point that currently contributes most to the total squared error.

Another approach is to choose the replacement centroid from the cluster with the largest squared error, which splits that cluster and reduces the total squared error.
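A minimal sketch of the first strategy, assuming points, centers, and labels are plain Python lists. The function name refill_empty_clusters is hypothetical, and "farthest from any current centroid" is interpreted here as the point with the largest squared distance to its own assigned center.

```python
def sqdist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def refill_empty_clusters(points, centers, labels):
    """Give each empty cluster the point that lies farthest from its own center."""
    centers = list(centers)
    labels = list(labels)
    for j in range(len(centers)):
        if j in labels:
            continue                       # cluster j has members, nothing to do
        # index of the point contributing most to the total squared error
        far = max(range(len(points)), key=lambda i: sqdist(points[i], centers[labels[i]]))
        centers[j] = points[far]           # that point becomes the new center
        labels[far] = j                    # and the only member of cluster j
    return centers, labels

points = [(0.0, 0.0), (1.0, 0.0), (9.0, 0.0)]
centers = [(0.5, 0.0), (50.0, 50.0)]       # center 1 is far from all data
labels = [0, 0, 0]                         # so cluster 1 received no points
new_centers, new_labels = refill_empty_clusters(points, centers, labels)
print(new_centers, new_labels)
```

In this toy case the point (9.0, 0.0) is pulled out of cluster 0 and becomes the center of the previously empty cluster 1.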

Scope of application

Results are best when the clusters are clearly separated from one another and of similar size.

Because the time complexity is O(tkmn), which is linear in the number of samples, the algorithm is efficient and scalable for large datasets.

However, the algorithm is sensitive to K and to the initial cluster centers, and it often converges to a local optimum.

It is also sensitive to noise and outliers.
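A small illustration of why noise matters: a centroid is a mean, so a single distant point can drag it away from the bulk of the cluster. The points below are made up for the example, and centroid is a hypothetical helper name.

```python
def centroid(points):
    """Mean of a list of equal-length tuples, one coordinate at a time."""
    return tuple(sum(c) / len(points) for c in zip(*points))

cluster = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0)]
print(centroid(cluster))                    # (1.5, 1.5)
print(centroid(cluster + [(50.0, 50.0)]))   # (11.2, 11.2)
```

One outlier among five points moves the centroid from (1.5, 1.5) to (11.2, 11.2), a location where none of the original points lie.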


This article is an English version of an article originally written in Chinese on aliyun.com and is provided for information purposes only.
