Machine learning: Kmeans

Source: Internet
Author: User

Introduction

I first came across K-means quite early, when I used it in my senior undergraduate year. Having recently gone back through the book Machine Learning in Action, and drawing on related articles I have read over the past few years, here are some notes on K-means.

Algorithmic flow

First, each sample vector in the dataset can be viewed as a point in a high-dimensional space.

We can therefore start by randomly selecting K data points from the dataset as the initial class centers, or by creating K centroids that lie within the range of the dataset. Note that in the second case the K centroids need not be actual data points (Machine Learning in Action generates K centroids at random within the range of the dataset); this is acceptable because the centroids are recalculated afterwards.

Each data point is then assigned to the class whose center is nearest to it, forming K clusters; this requires computing the distance from the point to every class center.

Next, recalculate the class center of each cluster.

Repeat the assignment and recalculation steps until the clusters no longer change or the maximum number of iterations has been reached.
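
The steps above can be sketched compactly with NumPy. This is a minimal illustration, not the book's implementation: it uses the first initialization option (K data points chosen at random) and adds an iteration cap. The function name kmeans_sketch and the parameters max_iter and seed are assumptions made for this example.

```python
import numpy as np

def kmeans_sketch(data, k, max_iter=100, seed=0):
    """Plain K-means on an (n, m) array; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    # Pick k distinct data points as the initial class centers.
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    labels = np.full(len(data), -1)
    for _ in range(max_iter):
        # Distance from every point to every center, shape (n, k).
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        # Stop when no point changes cluster.
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
        # Move each center to the mean of the points assigned to it.
        for c in range(k):
            members = data[labels == c]
            if len(members) > 0:  # an empty cluster keeps its old center
                centers[c] = members.mean(axis=0)
    return centers, labels

data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
centers, labels = kmeans_sketch(data, k=2)
print(labels)
print(centers)
```

On this toy data the two well-separated groups should end up in different clusters; which group receives label 0 depends on the random initialization.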

Complexity of the algorithm

Time complexity: O(tKmn), where t is the number of iterations, K is the number of clusters, n is the number of samples, and m is the number of dimensions

Space complexity: O(nm)

In general t, K, and m can be treated as constants, so both the time and the space complexity simplify to O(n), i.e. linear in the number of samples.

Algorithm implementation

The first step is to randomly generate K initial class centers.

import numpy as np

def randCent(dataSet, k):
    # dataSet is assumed to be a 2-D NumPy array, one sample per row
    # (plain arrays are used here in place of the matrix type)
    n = dataSet.shape[1]                  # number of dimensions
    centroids = np.zeros((k, n))          # create the centroid array
    for j in range(n):  # create random cluster centers, within the bounds of each dimension
        minJ = dataSet[:, j].min()
        rangeJ = float(dataSet[:, j].max() - minJ)
        centroids[:, j] = minJ + rangeJ * np.random.rand(k)
    return centroids

For each dimension, the function draws a random value between that dimension's minimum and maximum and uses it as the coordinate on that dimension. The values across all dimensions together form one centroid, and K centroids are generated in total.

Next comes the assignment of each point to a cluster.

The main kMeans function defines a flag, clusterChanged, initialized to True. The loop keeps iterating as long as any assignment is still changing and stops once the clusters no longer change.

For each sample i of the m samples (here m is the number of samples, as in the code), compute the distance from sample i to each of the K centroids, find the centroid at the minimum distance, and assign the sample to that centroid's class.

The program uses a matrix, clusterAssment, of size m x 2, which stores for each of the m samples the class it belongs to and the squared distance from the sample to that class's centroid.

Recalculate the centroids

Once all points have been assigned, recalculate each centroid as the mean of the points in its cluster.

import numpy as np

def distEclud(vecA, vecB):
    # Euclidean distance; this helper is not shown in the article and is assumed here
    return np.sqrt(np.sum(np.power(vecA - vecB, 2)))

def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = dataSet.shape[0]
    clusterAssment = np.zeros((m, 2))  # column 0: index of the assigned centroid
                                       # column 1: squared error (SE) of each point
    centroids = createCent(dataSet, k)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        for i in range(m):  # for each data point, assign it to the closest centroid
            minDist = np.inf
            minIndex = -1
            for j in range(k):
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist ** 2
        print(centroids)
        for cent in range(k):  # recalculate centroids
            ptsInClust = dataSet[np.nonzero(clusterAssment[:, 0] == cent)[0]]  # get all the points in this cluster
            centroids[cent, :] = np.mean(ptsInClust, axis=0)  # assign centroid to the mean
    return centroids, clusterAssment

Measure the quality of clustering

One index for judging whether a clustering result is good or bad is the SSE, the sum of squared errors, that is, the sum of the squared sample-to-centroid distances stored in the clusterAssment matrix above. The smaller the SSE, the closer the data points are to their centroids and the better the clustering result.
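
As a small illustration, the SSE can be read off the second column of the assignment matrix, since that column holds each sample's squared distance to its centroid. The array below is made-up data in the same m x 2 layout, not output from a real run, and the variable names are chosen for this example.

```python
import numpy as np

# Hypothetical assignment matrix in the layout described above:
# column 0 = cluster index, column 1 = squared distance to that centroid.
cluster_assment = np.array([[0, 0.25],
                            [0, 1.00],
                            [1, 0.50],
                            [1, 2.25]])

# Total SSE is the sum of the stored squared distances.
sse = cluster_assment[:, 1].sum()

# Per-cluster SSE shows which cluster contributes most of the error.
k = 2
per_cluster = np.array([cluster_assment[cluster_assment[:, 0] == c, 1].sum()
                        for c in range(k)])
print(sse, per_cluster)
```

Here the total is 4.0, of which cluster 0 contributes 1.25 and cluster 1 contributes 2.75.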

Methods for determining the value of K

Running multiple times to find the K with the minimum SSE

A common approach is to run the algorithm several times, each time with a different set of random initial centroids, and then select the clustering with the minimum SSE (sum of squared errors). This is simple but may not be effective.

In practical applications, K-means is generally used for data preprocessing or to assist classification labeling, so K is usually not set very large. One can enumerate K from 2 up to a fixed value such as 10, run K-means several times for each K (to avoid local optima), compute the SSE for the current K, and finally select the K with the minimum SSE as the final number of clusters.
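
The enumeration described above might look like the following sketch. The helper names kmeans_sse and best_sse_per_k, the restart count, and the synthetic three-blob data are all assumptions for illustration. Because SSE generally keeps shrinking as K grows, the sketch reports the best SSE for every K instead of returning only the minimum, so the values can be compared before a K is chosen.

```python
import numpy as np

def kmeans_sse(data, k, rng, max_iter=100):
    """One K-means run started from k random data points; returns the final SSE."""
    centers = data[rng.choice(len(data), size=k, replace=False)].astype(float)
    for _ in range(max_iter):
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Mean of each cluster; an empty cluster keeps its previous center.
        new_centers = np.array([data[labels == c].mean(axis=0)
                                if np.any(labels == c) else centers[c]
                                for c in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=2)
    return float((dists.min(axis=1) ** 2).sum())

def best_sse_per_k(data, k_values, n_restarts=10, seed=0):
    """For each k, keep the lowest SSE seen over several random restarts."""
    rng = np.random.default_rng(seed)
    return {k: min(kmeans_sse(data, k, rng) for _ in range(n_restarts))
            for k in k_values}

# Synthetic data: three well-separated blobs of 30 points each.
data_rng = np.random.default_rng(1)
data = np.vstack([data_rng.normal(loc=center, scale=0.3, size=(30, 2))
                  for center in ([0, 0], [5, 5], [0, 5])])

results = best_sse_per_k(data, range(2, 7))
for k, sse in results.items():
    print(k, round(sse, 2))
```

On data like this the SSE should fall sharply between K = 2 and K = 3 and only slightly afterwards.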

Combining with hierarchical clustering

First, a hierarchical clustering algorithm is used to determine the approximate number of clusters and to find an initial clustering; iterative relocation is then used to improve that clustering.

This approach is suitable when (1) the sample is relatively small, for example a few hundred to a few thousand points, mainly because of the overhead of hierarchical clustering, and (2) K is small relative to the sample size.
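
A rough sketch of this idea, assuming a simple centroid-based agglomerative merge written directly in NumPy instead of any particular library routine: clusters are merged until the two closest centroids are farther apart than a threshold. The function name agglomerative_seeds and the stop_dist parameter are hypothetical. The number of clusters left suggests K, and their centroids can serve as the initial centers for K-means. The repeated pairwise search is only practical for small samples.

```python
import numpy as np

def agglomerative_seeds(data, stop_dist):
    """Merge the two closest clusters until all centroids are more than
    stop_dist apart; returns the remaining centroids, one per row."""
    clusters = [[i] for i in range(len(data))]    # start with one cluster per point
    while len(clusters) > 1:
        cents = np.array([data[c].mean(axis=0) for c in clusters])
        d = np.linalg.norm(cents[:, None, :] - cents[None, :, :], axis=2)
        np.fill_diagonal(d, np.inf)               # ignore self-distances
        a, b = np.unravel_index(np.argmin(d), d.shape)
        if d[a, b] > stop_dist:
            break                                 # closest pair is too far apart
        a, b = int(min(a, b)), int(max(a, b))
        clusters[a] = clusters[a] + clusters[b]   # merge cluster b into cluster a
        del clusters[b]
    return np.array([data[c].mean(axis=0) for c in clusters])

data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
seeds = agglomerative_seeds(data, stop_dist=3.0)
k = len(seeds)   # suggested number of clusters
print(k)
print(seeds)
```

The returned seeds would then be passed to K-means as its starting centroids, and the usual assignment and recalculation loop refines them.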

Initial partitioning using the Canopy algorithm

In its first phase, Canopy clustering uses a simple, computationally cheap method to calculate the similarity between objects and places similar objects into a subset called a canopy. A series of such calculations yields several canopies.

From the number of canopies we can roughly infer the value of K, which avoids choosing K blindly.

Canopy algorithm

Initially we have a set of points S and two preset distance thresholds T1 and T2, with T1 > T2.

Select a point P, compute its distance to the other points in S (using a low-cost distance calculation), and use P as the center of a new canopy.

Put the points within distance T1 of P into this canopy, and delete from S the points within distance T2 of P.

This ensures that a point within T2 of the center P can no longer become the center of another canopy.

Select a new canopy center from the points remaining in S and repeat until S is empty.
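
The procedure above can be sketched as follows. This is a minimal illustration under stated assumptions: the cheap distance is taken to be Manhattan (L1) distance, the next center is simply the first point left in S, and the function name canopy is chosen for this example.

```python
import numpy as np

def canopy(data, t1, t2):
    """Group points into canopies; returns a list of (center_index, member_indices)."""
    assert t1 > t2 > 0, "thresholds must satisfy t1 > t2 > 0"
    remaining = np.arange(len(data))       # the set S, stored as row indices
    canopies = []
    while len(remaining) > 0:
        p = remaining[0]                   # take the next point in S as center P
        # Cheap distance (an assumption of this sketch): Manhattan / L1.
        dist = np.abs(data[remaining] - data[p]).sum(axis=1)
        members = remaining[dist < t1]     # within T1: placed in this canopy
        canopies.append((int(p), members.tolist()))
        remaining = remaining[dist >= t2]  # within T2: removed from S
    return canopies

data = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                 [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])
canopies = canopy(data, t1=4.0, t2=2.0)
print(len(canopies))   # a rough suggestion for K
print(canopies)
```

P itself is at distance 0 and is therefore always removed from S, so the loop terminates. Points that lie between T2 and T1 of a center stay in S and can appear in more than one canopy.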

Handling empty clusters

If no points are assigned to a cluster during the assignment step, the result is an empty cluster.

In this case a replacement centroid is needed; otherwise the squared error will be larger than necessary.

One approach is to select the point farthest from any current centroid, which eliminates the point that currently contributes most to the total squared error.

Another approach is to choose the replacement centroid from the cluster with the largest squared error, which splits that cluster and reduces the total squared error of the clustering.
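
Both replacement strategies can be sketched in one helper. The function name fix_empty_clusters and the strategy argument are hypothetical. "Farthest" is interpreted here as the point with the largest distance to its own assigned centroid, and for the second strategy the text does not say which point of the worst cluster to take, so this sketch assumes its farthest member.

```python
import numpy as np

def fix_empty_clusters(data, centers, labels, strategy="farthest"):
    """Give every empty cluster a new centroid; returns updated copies."""
    centers, labels = centers.copy(), labels.copy()
    for c in range(len(centers)):
        if np.any(labels == c):
            continue                                   # cluster c is not empty
        # Squared distance of each point to its currently assigned centroid.
        sq_err = ((data - centers[labels]) ** 2).sum(axis=1)
        if strategy == "farthest":
            idx = int(sq_err.argmax())                 # largest single error
        else:
            # Cluster with the largest total squared error, then its farthest member.
            sse = np.array([sq_err[labels == j].sum() for j in range(len(centers))])
            members = np.flatnonzero(labels == int(sse.argmax()))
            idx = int(members[sq_err[members].argmax()])
        centers[c] = data[idx]                         # the point becomes the new centroid
        labels[idx] = c
    return centers, labels

data = np.array([[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 12.0]])
centers = np.array([[0.0, 0.5], [10.0, 11.0], [50.0, 50.0]])   # cluster 2 is empty
labels = np.array([0, 0, 1, 1])

new_centers, new_labels = fix_empty_clusters(data, centers, labels)
alt_centers, alt_labels = fix_empty_clusters(data, centers, labels, strategy="max_sse")
print(new_labels)
print(new_centers[2])
```

The other centroids are left as they are; the next recalculation step of K-means would update them to reflect the moved point.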

Scope of application

Results are best when the differences between clusters are clear and the clusters are of similar size.

Because the time complexity is O(tKmn), which is linear in the number of samples, the algorithm is efficient and scalable for processing large datasets.

However, the algorithm is sensitive to K and to the initial cluster centers, and it often terminates at a local optimum.

It is also sensitive to noise and isolated points (outliers).
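
A tiny made-up example shows why: the centroid is a mean, so a single distant point can drag it well away from the rest of the cluster. The numbers below are invented for illustration.

```python
import numpy as np

cluster = np.array([[1.0, 1.0], [1.2, 0.8], [0.8, 1.2]])
outlier = np.array([[9.0, 9.0]])

clean_centroid = cluster.mean(axis=0)                        # centroid without the outlier
noisy_centroid = np.vstack([cluster, outlier]).mean(axis=0)  # centroid with the outlier
shift = np.linalg.norm(noisy_centroid - clean_centroid)
print(clean_centroid, noisy_centroid, shift)
```

With three points near (1, 1), one outlier at (9, 9) moves the centroid to about (3, 3), which is far from every one of the original points.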
