Machine Learning in Action by MATLAB (III): the K-means algorithm

Source: Internet
Author: User

K-means is an unsupervised clustering algorithm. Its steps are simple and the idea is easy to grasp, and it also illustrates the idea behind the EM algorithm.

Advantages and disadvantages of the K-means algorithm:

1. Advantages: easy to implement
2. Disadvantages: may converge to a local minimum; converges slowly on large data sets

Applicable data type: numeric data

The algorithms covered earlier (regression, Naive Bayes, SVM, and so on) all have a class label y and are therefore supervised learning. K-means clustering has only x and no y.

In the clustering problem, our training sample is

$$\{x^{(1)}, x^{(2)}, \dots, x^{(m)}\}$$

where each $x^{(i)} \in \mathbb{R}^n$, that is, each $x^{(i)}$ is an n-dimensional real vector.

The samples have no label y. The K-means algorithm groups them into k clusters as follows:

1. Randomly select k cluster centroids, denoted $\mu_1, \mu_2, \dots, \mu_k \in \mathbb{R}^n$.

2. Repeat the following until convergence:

{
For each sample i, compute the class it should belong to:

$$c^{(i)} := \arg\min_{j} \left\| x^{(i)} - \mu_j \right\|^2$$

For each class j, recompute the centroid:

$$\mu_j := \frac{\sum_{i=1}^{m} 1\{c^{(i)} = j\}\, x^{(i)}}{\sum_{i=1}^{m} 1\{c^{(i)} = j\}}$$
}

Here k is the number of clusters, chosen in advance. $c^{(i)}$ is the index of the cluster, among the k clusters, whose centroid is closest to sample i, so it takes a value from 1 to k. The centroid $\mu_j$ is our current guess for the center of the samples that belong to cluster j. An analogy with stars makes this concrete:

Step one: we randomly pick k stars in the sky as the cluster centroids. Then, for each star i, we compute its distance to every centroid $\mu_j$ and take the cluster at the shortest distance as $c^{(i)}$. After this step every star belongs to a cluster.

Step two: for each cluster j, we recompute its centroid $\mu_j$ as the average of the coordinates of all stars assigned to it. We then repeat steps one and two until the centroids stop changing or change very little.
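The two steps can be traced by hand on a very small example. The following is a minimal sketch of a single assignment step followed by a single update step, assuming six made-up 2-D points and two hand-picked starting centroids; the names X, mu and c are chosen for this illustration and do not come from the implementation later in the article.

```matlab
% Toy data: six 2-D points forming two obvious groups (made-up values).
X  = [1 1; 1 2; 2 1; 8 8; 8 9; 9 8];
mu = [0 0; 10 10];            % two hand-picked starting centroids
k  = size(mu, 1);
m  = size(X, 1);

% Step one: c(i) is the index of the centroid closest to X(i,:).
c = zeros(m, 1);
for i = 1:m
    diffs = mu - repmat(X(i,:), k, 1);   % k-by-n differences to each centroid
    d2 = sum(diffs .^ 2, 2);             % squared distance to each centroid
    [~, c(i)] = min(d2);
end

% Step two: each centroid becomes the mean of the points assigned to it.
for j = 1:k
    mu(j,:) = mean(X(c == j, :), 1);
end

disp(c');    % cluster index of each point
disp(mu);    % updated centroids
```

The first three points end up in cluster 1 and the last three in cluster 2, and each centroid moves to the mean of its three points. A full run simply repeats these two loops.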

This raises a question: how do we decide that the centroids have stopped changing, or have changed little enough? The answer is the distortion function, defined as follows:

$$J(c, \mu) = \sum_{i=1}^{m} \left\| x^{(i)} - \mu_{c^{(i)}} \right\|^2$$
J is the sum of squared distances from each sample to the centroid of its cluster, and K-means converges by driving J down. Suppose J has not yet reached a minimum. Holding the centroids $\mu_j$ fixed and adjusting each sample's assignment $c^{(i)}$ reduces J; likewise, holding $c^{(i)}$ fixed and adjusting each centroid $\mu_j$ also reduces J. These are the two steps of the inner loop, so J never increases from one iteration to the next. When J reaches a minimum, $\mu$ and $c$ have converged as well. (This alternating process closely resembles the EM algorithm.) In theory, several different pairs of $\mu$ and $c$ can attain the same minimum value of J, but this is rare in practice.
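The claim that J never increases can be checked numerically. The sketch below records J after every iteration, assuming the same kind of made-up toy points and a deliberately poor pair of starting centroids; maxIter and the variable names are assumptions of this example only.

```matlab
% Made-up toy data and deliberately poor starting centroids.
X  = [1 1; 1 2; 2 1; 8 8; 8 9; 9 8];
mu = [1 1; 2 1];
k  = size(mu, 1);
m  = size(X, 1);

maxIter = 10;                 % fixed iteration count chosen for this sketch
J = zeros(maxIter, 1);
for it = 1:maxIter
    % Hold mu fixed, pick the closest centroid for each sample.
    c = zeros(m, 1);
    for i = 1:m
        d2 = sum((mu - repmat(X(i,:), k, 1)) .^ 2, 2);
        [~, c(i)] = min(d2);
    end
    % Hold c fixed, move each centroid to the mean of its samples
    % (an empty cluster keeps its previous centroid).
    for j = 1:k
        if any(c == j)
            mu(j,:) = mean(X(c == j, :), 1);
        end
    end
    % Distortion after this iteration: sum of squared distances
    % from each sample to the centroid of its assigned cluster.
    J(it) = sum(sum((X - mu(c, :)) .^ 2, 2));
end
disp(J');
```

On this data J drops between the first and second iteration and then stays flat, which is the plateau that signals convergence.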

Because the distortion function J is non-convex, the minimum we reach is not guaranteed to be the global minimum, so the final result depends on where the initial centroids are placed. In general, though, the local optimum that K-means finds is good enough. If a run does get stuck in a poor local optimum, we can run K-means several times from different initial values and output the $\mu$ and $c$ that give the smallest J.
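The restart strategy can be sketched as a loop around the basic algorithm. The example below is only an illustration under stated assumptions: the data is made up, the number of restarts (nRestarts) is arbitrary, centroids are drawn uniformly inside the bounding box of the data, and an empty cluster simply keeps its previous centroid.

```matlab
X = [1 1; 1 2; 2 1; 8 8; 8 9; 9 8; 1 8; 2 9; 1 9];   % made-up toy data
k = 3;
[m, n] = size(X);
lo = min(X, [], 1);
hi = max(X, [], 1);

nRestarts = 20;               % number of runs, chosen arbitrarily here
allJ  = zeros(nRestarts, 1);
bestJ = inf;
for r = 1:nRestarts
    % Random centroids inside the bounding box of the data.
    mu = repmat(lo, k, 1) + rand(k, n) .* repmat(hi - lo, k, 1);
    c  = zeros(m, 1);
    changed = true;
    while changed
        cOld = c;
        for i = 1:m
            d2 = sum((mu - repmat(X(i,:), k, 1)) .^ 2, 2);
            [~, c(i)] = min(d2);
        end
        for j = 1:k
            if any(c == j)                 % empty cluster keeps its old centroid
                mu(j,:) = mean(X(c == j, :), 1);
            end
        end
        changed = any(c ~= cOld);          % stop once no assignment changes
    end
    allJ(r) = sum(sum((X - mu(c, :)) .^ 2, 2));
    if allJ(r) < bestJ                     % keep the run with the smallest J
        bestJ  = allJ(r);
        bestMu = mu;
        bestC  = c;
    end
end
fprintf('J ranged from %.3f to %.3f over %d runs; kept J = %.3f\n', ...
        min(allJ), max(allJ), nRestarts, bestJ);
```

Because the starting centroids are random, the spread of J differs from run to run; only the selection rule (keep the smallest J) is the point of the sketch.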

Another convergence test:

When writing the code, we can also decide that the clustering has converged by checking whether the cluster assigned to any point has changed.

The distortion function above can then be used to evaluate the quality of the converged result, as the example below illustrates.
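Both stopping rules mentioned so far (no assignment changes, or the centroids barely move) fit in the same loop. The following is a rough sketch on made-up data; the tolerance tol is an assumed value, and the variable names are specific to this example.

```matlab
X  = [1 1; 1 2; 2 1; 8 8; 8 9; 9 8];   % made-up toy data
mu = [1 1; 2 1];                        % hand-picked starting centroids
k  = size(mu, 1);
m  = size(X, 1);
tol = 1e-6;                             % movement threshold, an assumed value

c = zeros(m, 1);
iter = 0;
while true
    iter = iter + 1;
    cOld  = c;
    muOld = mu;
    for i = 1:m
        d2 = sum((mu - repmat(X(i,:), k, 1)) .^ 2, 2);
        [~, c(i)] = min(d2);
    end
    for j = 1:k
        if any(c == j)
            mu(j,:) = mean(X(c == j, :), 1);
        end
    end
    sameAssignments = all(c == cOld);               % rule 1: nothing reassigned
    shift = max(sqrt(sum((mu - muOld) .^ 2, 2)));   % largest centroid movement
    if sameAssignments || shift < tol               % rule 2: movement below tol
        break;
    end
end
fprintf('stopped after %d iterations, largest centroid shift %.2g\n', iter, shift);
```

Checking assignments is exact and cheap; the tolerance on centroid movement is mainly useful when one prefers to stop early on large data sets.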

MATLAB implementation

function kmeans
% K-means clustering demo on the points stored in testSet.txt.
% Note: this name shadows the toolbox function kmeans while it is on the path.
clc
clear
K = 4;
dataSet = load('testSet.txt');
[row, col] = size(dataSet);

% Matrix that stores the centroids
centSet = zeros(K, col);
% Randomly initialize the centroids within the range of each column
for i = 1:col
    minV = min(dataSet(:, i));
    rangV = max(dataSet(:, i)) - minV;
    centSet(:, i) = repmat(minV, [K, 1]) + rangV * rand(K, 1);
end

% Stores the cluster assigned to each point and its distance to the centroid
clusterAssment = zeros(row, 2);
clusterChange = true;
while clusterChange
    clusterChange = false;
    % Compute the cluster each point should be assigned to
    for i = 1:row
        % This part may be optimized
        minDist = inf;   % start from inf so any real distance is smaller
        minIndex = 0;
        for j = 1:K
            distCal = distEclud(dataSet(i, :), centSet(j, :));
            if distCal < minDist
                minDist = distCal;
                minIndex = j;
            end
        end
        if minIndex ~= clusterAssment(i, 1)
            clusterChange = true;
        end
        clusterAssment(i, 1) = minIndex;
        clusterAssment(i, 2) = minDist;
    end

    % Update the centroid of each cluster
    for j = 1:K
        simpleCluster = find(clusterAssment(:, 1) == j);
        if ~isempty(simpleCluster)   % an empty cluster keeps its old centroid
            % Average along dimension 1 so a single-point cluster stays a row
            centSet(j, :) = mean(dataSet(simpleCluster, :), 1);
        end
    end
end

figure
% scatter(dataSet(:, 1), dataSet(:, 2), 5)
for i = 1:K
    pointCluster = find(clusterAssment(:, 1) == i);
    scatter(dataSet(pointCluster, 1), dataSet(pointCluster, 2), 5)
    hold on
end
% hold on
% The centroid marker size is not legible in the source; 100 is a placeholder.
scatter(centSet(:, 1), centSet(:, 2), 100, '+')
hold off
end

% Euclidean distance between two row vectors
function dist = distEclud(vecA, vecB)
dist = sqrt(sum(power(vecA - vecB, 2)));
end

In a normal run the data is clearly divided into 4 classes. Different colors represent different classes, and each cluster centroid is marked with "+".

Of course, this is only one possible outcome; other runs may produce a visibly worse clustering.

This is one of the drawbacks of K-means: the random choice of initial points may trap the algorithm in a local optimum. When that happens, we simply rerun the program.

For clusterings that all look reasonable, we use the distortion function described above to measure quality: the smaller J is, the better the clustering.

In practice, we only need to run the program several times and keep the clustering with the smallest J.
