Machine Learning in Action with MATLAB (III): The K-means Algorithm

Last Update:2015-04-17 Source: Internet

Author: User


K-means is an unsupervised clustering algorithm. Its steps are simple and the idea is easy to grasp, and it also illustrates the idea behind the EM algorithm.

Advantages and disadvantages of the K-means algorithm:

1. Advantages: easy to implement
2. Disadvantages: may converge to a local minimum; converges slowly on large data sets

Applicable data types: numeric data

The algorithms covered earlier, such as regression, Naive Bayes, and SVM, all use a class label y and therefore belong to supervised learning. K-means clustering has only x and no y.

In the clustering problem, our training sample is

$\{x^{(1)}, x^{(2)}, \ldots, x^{(m)}\}$

where each $x^{(i)} \in \mathbb{R}^n$ is an n-dimensional real vector.

The samples have no label y. The K-means algorithm groups them into k clusters as follows:

1. Randomly select k cluster centroids, denoted $\mu_1, \mu_2, \ldots, \mu_k \in \mathbb{R}^n$.

2. Repeat the following until convergence:

{
For each sample i, compute the class it should belong to:

$c^{(i)} := \arg\min_j \lVert x^{(i)} - \mu_j \rVert^2$

For each class j, recompute the centroid:

$\mu_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}$

}
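The two steps inside the loop can be sketched as a single pass in MATLAB. This is only an illustration: the six data points and the two starting centroids are made-up values chosen by hand (the algorithm itself picks the centroids at random), and bsxfun is used so the snippet does not depend on implicit expansion.

```matlab
% Toy data: six 2-D points forming two obvious groups (made-up values).
X = [1 1; 1 2; 2 1; 8 8; 8 9; 9 8];
% Hand-picked starting centroids (an assumption; the text picks them at random).
mu = [0 0; 10 10];
k = size(mu, 1);
m = size(X, 1);

% Assignment step: c(i) = index of the centroid closest to X(i,:).
D = zeros(m, k);                 % D(i,j) = squared distance from point i to centroid j
for j = 1:k
    diffs = bsxfun(@minus, X, mu(j, :));
    D(:, j) = sum(diffs .^ 2, 2);
end
[~, c] = min(D, [], 2);

% Update step: move each centroid to the mean of the points assigned to it.
for j = 1:k
    members = (c == j);
    if any(members)              % an empty cluster keeps its old centroid
        mu(j, :) = mean(X(members, :), 1);
    end
end

disp(c');                        % cluster index of each point
disp(mu);                        % updated centroids, one per row
```

With these inputs the first three points go to cluster 1 and the last three to cluster 2, and the centroids move to (4/3, 4/3) and (25/3, 25/3).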

Here k is the number of clusters, chosen in advance. $c^{(i)}$ is the class, among the k clusters, that is nearest to sample i, and takes a value from 1 to k. The centroid $\mu_j$ is our current guess for the center of the samples belonging to class j. An intuitive explanation:

Step one: we randomly pick k stars in the sky as cluster centroids. For each star i we then compute its distance to every centroid $\mu_j$ and take the cluster with the shortest distance as $c^{(i)}$. After this step, every star belongs to a cluster.

Step two: for each cluster j, we recompute its centroid $\mu_j$ by averaging the coordinates of all stars assigned to it. Steps one and two are repeated until the centroids change very little or not at all.

The question then is how to judge whether the centroids have stopped changing, or change only slightly. The answer is the distortion function, defined as follows:

$J(c, \mu) = \sum_{i=1}^{m} \lVert x^{(i)} - \mu_{c^{(i)}} \rVert^2$
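As a small illustration, the distortion can be computed in one line once the assignments and centroids are stored as arrays. The helper name distortion, the sample points, and both centroid sets below are made up for this sketch.

```matlab
X = [1 1; 1 2; 2 1; 8 8; 8 9; 9 8];    % made-up sample points, one per row
c = [1; 1; 1; 2; 2; 2];                % cluster index of each point

% Hypothetical helper: mu(c, :) lines each point up with its own centroid,
% so the squared row differences are exactly the terms of J.
distortion = @(X, c, mu) sum(sum((X - mu(c, :)) .^ 2));

muA = [1 1; 8 8];                                    % centroids placed on data points
muB = [mean(X(c == 1, :), 1); mean(X(c == 2, :), 1)]; % centroids at the cluster means

JA = distortion(X, c, muA);            % 4
JB = distortion(X, c, muB);            % 8/3, lower than JA
disp([JA JB]);
```

For the same assignment, placing the centroids at the cluster means gives the smaller distortion.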

J is the sum of squared distances from each sample to its assigned centroid, and K-means converges by minimizing J. Suppose J has not yet reached a minimum. Holding the centroids $\mu_j$ fixed and adjusting the class $c^{(i)}$ of each sample can reduce J; likewise, holding $c^{(i)}$ fixed and adjusting the centroids $\mu_j$ can also reduce J. These two steps make up the inner loop, during which J decreases monotonically. When J reaches a minimum, $\mu$ and c converge as well. (The process closely resembles the EM algorithm.) In theory, several different pairs of $\mu$ and c can give the same minimum J, but this is rare in practice.
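Why averaging is the right centroid update can be checked with a short calculation. The sketch below writes the distortion as the sum of squared distances between each sample $x^{(i)}$ and the centroid $\mu_{c^{(i)}}$ of its class, holds the assignments fixed, and sets the gradient with respect to one centroid to zero.

```latex
% Hold the assignments c^{(i)} fixed and minimize J over a single centroid \mu_j.
\begin{align*}
J(c,\mu) &= \sum_{i=1}^{m} \lVert x^{(i)} - \mu_{c^{(i)}} \rVert^2 \\
\frac{\partial J}{\partial \mu_j}
  &= -2 \sum_{i:\,c^{(i)}=j} \bigl(x^{(i)} - \mu_j\bigr) = 0 \\
\Longrightarrow\quad \mu_j
  &= \frac{\sum_{i=1}^{m} 1\{c^{(i)}=j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)}=j\}}
\end{align*}
```

Since J is a convex quadratic in $\mu_j$ when the assignments are fixed, this stationary point is the minimizer, so the update step cannot increase J. The assignment step cannot increase J either, because each sample moves to the centroid with the smallest squared distance.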

Because the distortion function J is non-convex, the minimum reached is not guaranteed to be the global minimum; the result depends on the initial positions of the centroids. In most cases the local optimum found by K-means is good enough. If a run does get stuck in a poor local optimum, we can run K-means several times with different initial values and output the $\mu$ and c that give the smallest J.
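The restart strategy could look roughly like the following. Everything concrete here is an assumption made for the sketch: the nine sample points, k = 3, the number of restarts, and the choice to initialize from k randomly chosen data points rather than from random positions.

```matlab
X = [1 1; 1 2; 2 1; 8 8; 8 9; 9 8; 1 8; 2 9; 1 9];   % made-up points, three groups
k = 3;
nRestarts = 10;          % hypothetical number of repeated runs
m = size(X, 1);

bestJ = inf;
allJ = zeros(nRestarts, 1);
for r = 1:nRestarts
    % Initialization (assumption): use k distinct data points as centroids.
    perm = randperm(m);
    mu = X(perm(1:k), :);
    c = zeros(m, 1);
    changed = true;
    while changed
        % Assignment step: nearest centroid by squared distance.
        D = zeros(m, k);
        for j = 1:k
            D(:, j) = sum(bsxfun(@minus, X, mu(j, :)) .^ 2, 2);
        end
        [~, cNew] = min(D, [], 2);
        changed = ~isequal(cNew, c);
        c = cNew;
        % Update step; an empty cluster keeps its previous centroid.
        for j = 1:k
            if any(c == j)
                mu(j, :) = mean(X(c == j, :), 1);
            end
        end
    end
    % Distortion of this run; keep the run with the smallest value.
    allJ(r) = sum(sum((X - mu(c, :)) .^ 2));
    if allJ(r) < bestJ
        bestJ = allJ(r);
        bestMu = mu;
        bestC = c;
    end
end
disp(allJ');             % distortion of every run
disp(bestJ);             % smallest distortion found
```

Because the initialization is random, the individual values in allJ differ from run to run; only bestMu and bestC from the lowest-J run are kept.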

Another convergence test:

When actually writing the code, we can also decide that the clustering has converged by checking whether the cluster assigned to any point has changed.
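A minimal sketch of this stopping rule is shown below. The data and the deliberately poor starting centroids are made up; the loop ends on the first pass in which no point switches cluster.

```matlab
X  = [1 1; 1 2; 2 1; 8 8; 8 9; 9 8];   % made-up sample points
mu = [1 1; 2 1];                       % deliberately poor start (assumption)
k = size(mu, 1);
m = size(X, 1);
c = zeros(m, 1);                       % 0 means "not assigned yet"
iter = 0;
changed = true;
while changed
    iter = iter + 1;
    % Assignment step: nearest centroid by squared distance.
    D = zeros(m, k);
    for j = 1:k
        D(:, j) = sum(bsxfun(@minus, X, mu(j, :)) .^ 2, 2);
    end
    [~, cNew] = min(D, [], 2);
    changed = ~isequal(cNew, c);       % stop once no point switches cluster
    c = cNew;
    % Update step; an empty cluster keeps its previous centroid.
    for j = 1:k
        if any(c == j)
            mu(j, :) = mean(X(c == j, :), 1);
        end
    end
end
fprintf('stopped after %d passes\n', iter);
```

On this toy input the assignments change in the first two passes and stay the same in the third, so the loop stops after three passes.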

The distortion function described above can then be used to evaluate the quality of the converged result, as the example below shows.

MATLAB implementation

    function kmeans
    % Note: a file named kmeans.m shadows the Statistics Toolbox function of the same name.
    clc
    clear
    K = 4;
    dataSet = load('testSet.txt');
    [row, col] = size(dataSet);
    % matrix that stores the centroids, one per row
    centSet = zeros(K, col);
    % randomly initialize the centroids inside the bounding box of the data
    for i = 1:col
        minV = min(dataSet(:, i));
        rangV = max(dataSet(:, i)) - minV;
        centSet(:, i) = repmat(minV, [K, 1]) + rangV * rand(K, 1);
    end
    % column 1: cluster assigned to each point; column 2: distance to its centroid
    clusterAssment = zeros(row, 2);
    clusterChange = true;
    while clusterChange
        clusterChange = false;
        % compute which cluster each point should be assigned to
        for i = 1:row
            % this part could be optimized
            minDist = inf;   % inf so that the first centroid always wins the comparison
            minIndex = 0;
            for j = 1:K
                distCal = distEclud(dataSet(i, :), centSet(j, :));
                if distCal < minDist
                    minDist = distCal;
                    minIndex = j;
                end
            end
            if minIndex ~= clusterAssment(i, 1)
                clusterChange = true;
            end
            clusterAssment(i, 1) = minIndex;
            clusterAssment(i, 2) = minDist;
        end
        % update the centroid of each cluster
        for j = 1:K
            simpleCluster = find(clusterAssment(:, 1) == j);
            if ~isempty(simpleCluster)   % an empty cluster keeps its old centroid
                % average along dimension 1 so a one-point cluster stays a row vector
                centSet(j, :) = mean(dataSet(simpleCluster, :), 1);
            end
        end
    end
    figure
    % scatter(dataSet(:, 1), dataSet(:, 2), 5)
    for i = 1:K
        pointCluster = find(clusterAssment(:, 1) == i);
        scatter(dataSet(pointCluster, 1), dataSet(pointCluster, 2), 5)
        hold on
    end
    % hold on
    % the marker size (100) is an assumed value; the original one was lost
    scatter(centSet(:, 1), centSet(:, 2), 100, '+')
    hold off
    end

    % Euclidean distance between two row vectors
    function dist = distEclud(vecA, vecB)
    dist = sqrt(sum(power(vecA - vecB, 2)));
    end

The result is as follows:

This is a normal clustering result: the points are clearly divided into 4 classes, with different colors representing different classes and each cluster centroid marked with a "+".

Of course, this is only one possible outcome. A run may also produce the following:

This is one of the drawbacks of K-means: a poor random choice of initial points can trap the algorithm in a local optimum. When that happens, we simply rerun the program.

Even among clusterings that look normal, we use the distortion function described above to measure quality: the smaller J is, the better the clustering.

In practice, we just run the program several times and keep the clustering with the smallest J.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
