The K-means algorithm is an unsupervised clustering algorithm. Its steps are simple and the idea is easy to follow, and it also illustrates the idea behind the EM algorithm.
Advantages and disadvantages of the K-means algorithm:
1. Advantages: easy to implement
2. Disadvantages: may converge to a local minimum; converges slowly on large data sets
Applicable data type: numeric data
The algorithms covered previously (regression, Naive Bayes, SVM, and so on) all use a class label y and are therefore supervised learning. K-means clustering has only x and no y.
In the clustering problem, the training samples are $\{x_1, x_2, \dots, x_m\}$, where each $x_i \in \mathbb{R}^n$ is an n-dimensional real vector.
The samples have no label y. The K-means algorithm groups the samples into k clusters. The algorithm is as follows:
1. Randomly select k cluster centroids, denoted $u_1, u_2, \dots, u_k \in \mathbb{R}^n$.
2. Repeat the following process until convergence:
{
For each sample i, compute the class it should belong to:
$$c_i := \arg\min_j \lVert x_i - u_j \rVert^2$$
For each class j, recompute the centroid:
$$u_j := \frac{\sum_{i=1}^{m} 1\{c_i = j\}\, x_i}{\sum_{i=1}^{m} 1\{c_i = j\}}$$
}
Here k is the number of clusters, chosen in advance. $c_i$ is the index of the cluster, among the k clusters, whose centroid is nearest to sample i, so it takes a value from 1 to k. The centroid $u_j$ is our current guess for the center of the samples that belong to class j. An analogy helps:
Step one: in the sky we randomly pick k stars as the cluster centroids. Then, for each star i, we compute its distance to each centroid $u_j$ and take the cluster with the shortest distance as $c_i$. After step one, every star belongs to a cluster.
Step two: for each cluster j, we recompute its centroid $u_j$ by averaging the coordinates of all stars assigned to that cluster. Steps one and two are repeated until the centroids change very little or not at all.
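The two steps can be sketched in a few lines of MATLAB. This is only a minimal illustration of one pass, not the full program: the toy data `X` and the hand-picked starting centroids `u` are assumptions made for the example.

```matlab
% Toy data (illustrative values): six 2-D points forming two obvious groups.
X = [1 1; 1 2; 2 1; 8 8; 8 9; 9 8];
u = [0 0; 10 10];          % starting centroids, picked by hand for this sketch
k = size(u, 1);
m = size(X, 1);

% Step one: c(i) is the index of the centroid closest to X(i,:).
D = zeros(m, k);           % D(i,j) = squared distance from point i to centroid j
for j = 1:k
    delta = bsxfun(@minus, X, u(j, :));
    D(:, j) = sum(delta .^ 2, 2);
end
[~, c] = min(D, [], 2);

% Step two: each centroid becomes the mean of the points assigned to it.
for j = 1:k
    members = (c == j);
    if any(members)        % an empty cluster keeps its previous centroid
        u(j, :) = mean(X(members, :), 1);
    end
end

disp(c');                  % cluster index of each point
disp(u);                   % updated centroids
```

Repeating the two blocks inside a loop gives the complete algorithm. Leaving an empty cluster's centroid unchanged is one possible design choice; re-seeding it with a random point is another.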
This raises a question: how do we decide that the centroids have stopped changing, or are changing only slightly? The answer is the distortion function, defined as follows:
$$J(c, u) = \sum_{i=1}^{m} \lVert x_i - u_{c_i} \rVert^2$$
J is the sum of the squared distances from each sample point to its centroid, and K-means converges by minimizing J. Suppose J has not yet reached a minimum. With the centroid $u_j$ of each class held fixed, reassigning the class $c_i$ of each sample to the nearest centroid can only reduce J or leave it unchanged. Likewise, with $c_i$ held fixed, moving each centroid $u_j$ to the mean of its samples can only reduce J or leave it unchanged. These two steps are what make J decrease monotonically inside the loop. When J reaches a minimum, u and c have converged together. (The process closely resembles the EM algorithm.) In theory, several different pairs of u and c can yield the same minimum value of J, but this is very rare in practice.
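The monotone decrease can be checked numerically. The sketch below records J after every half-step and asserts that the sequence never increases. The data, the starting centroids, and the iteration count are assumptions chosen only for illustration.

```matlab
% Illustrative data and deliberately poor starting centroids.
X = [1 1; 1 2; 2 1; 8 8; 8 9; 9 8; 5 5];
u = [1 1; 2 1];
k = size(u, 1);
m = size(X, 1);
Jhist = [];                % J recorded after every half-step

for iter = 1:5
    % Fix u, choose c to minimize J.
    D = zeros(m, k);
    for j = 1:k
        delta = bsxfun(@minus, X, u(j, :));
        D(:, j) = sum(delta .^ 2, 2);
    end
    [d2, c] = min(D, [], 2);
    Jhist(end + 1) = sum(d2);

    % Fix c, choose u to minimize J.
    for j = 1:k
        if any(c == j)
            u(j, :) = mean(X(c == j, :), 1);
        end
    end
    delta = X - u(c, :);
    Jhist(end + 1) = sum(delta(:) .^ 2);
end

% J must never increase from one half-step to the next.
assert(all(diff(Jhist) <= 1e-9));
disp(Jhist);
```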
Because the distortion function J is non-convex, the minimum reached is not guaranteed to be the global minimum. In other words, the initial positions of the centroids affect which minimum K-means ends up in. In general, though, the local optimum that K-means reaches is good enough. If a run does get stuck in a poor local optimum, we can run K-means several times with different initial values and output the u and c that give the smallest J.
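A restart loop of this kind might look as follows. It is a sketch under several assumptions: the toy data, the number of restarts `nRestarts`, and initialization by picking k distinct data points with `randperm` are all illustrative choices, not part of the original program.

```matlab
rng(0);                    % fixed seed so that the run is repeatable
X = [1 1; 1 2; 2 1; 8 8; 8 9; 9 8; 1 8; 2 9; 1 9];   % illustrative data
k = 3;
m = size(X, 1);
nRestarts = 10;            % hypothetical number of restarts
bestJ = inf;
bestU = [];
bestC = [];

for r = 1:nRestarts
    u = X(randperm(m, k), :);      % k distinct data points as starting centroids
    c = zeros(m, 1);
    while true
        % Assignment step.
        D = zeros(m, k);
        for j = 1:k
            delta = bsxfun(@minus, X, u(j, :));
            D(:, j) = sum(delta .^ 2, 2);
        end
        [d2, cNew] = min(D, [], 2);
        if isequal(cNew, c)        % no assignment changed: converged
            break;
        end
        c = cNew;
        % Update step.
        for j = 1:k
            if any(c == j)
                u(j, :) = mean(X(c == j, :), 1);
            end
        end
    end
    J = sum(d2);                   % distortion of this run
    if J < bestJ                   % keep the run with the smallest J
        bestJ = J;
        bestU = u;
        bestC = c;
    end
end

fprintf('smallest J over %d runs: %g\n', nRestarts, bestJ);
```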
Another convergence test:
When writing the code, we can also decide that the clustering has converged by checking whether the cluster assigned to any point has changed.
The distortion function described above can then be used to evaluate the quality of the converged result, as the results of the implementation below illustrate.
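The two stopping tests mentioned so far (assignments no longer change, or centroids move very little) can be written as simple checks. The values and the tolerance `tol` below are hypothetical and only show the shape of the comparison.

```matlab
% Hypothetical state before and after one iteration.
cOld = [1; 1; 2; 2];
cNew = [1; 1; 2; 2];
uOld = [1.0 1.0; 8.0 8.0];
uNew = [1.0 1.0; 8.0 8.0005];
tol = 1e-3;                % hypothetical tolerance on centroid movement

% Test 1: no point changed cluster.
assignmentsStable = isequal(cOld, cNew);

% Test 2: the largest centroid displacement is below the tolerance.
maxShift = max(sqrt(sum((uNew - uOld) .^ 2, 2)));
centroidsStable = maxShift < tol;

fprintf('assignments stable: %d, centroids stable: %d\n', ...
        assignmentsStable, centroidsStable);
```

The assignment test needs no tolerance, which is probably why it is convenient in code; the centroid test lets a run stop early once further movement is negligible.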
MATLAB implementation

    function kmeans
    % Note: this name shadows the kmeans function of the Statistics Toolbox.
    clc
    clear
    K = 4;
    dataSet = load('testSet.txt');
    [row, col] = size(dataSet);
    % matrix that stores the centroids
    centSet = zeros(K, col);
    % randomly initialize the centroids inside the range of each column
    for i = 1:col
        minV = min(dataSet(:, i));
        rangV = max(dataSet(:, i)) - minV;
        centSet(:, i) = repmat(minV, [K, 1]) + rangV * rand(K, 1);
    end
    % column 1: cluster assigned to each point; column 2: distance to its centroid
    clusterAssment = zeros(row, 2);
    clusterChange = true;
    while clusterChange
        clusterChange = false;
        % compute the cluster each point should be assigned to
        for i = 1:row
            % this part could be optimized (vectorized)
            minDist = inf;
            minIndex = 0;
            for j = 1:K
                distCal = distEclud(dataSet(i, :), centSet(j, :));
                if distCal < minDist
                    minDist = distCal;
                    minIndex = j;
                end
            end
            if minIndex ~= clusterAssment(i, 1)
                clusterChange = true;
            end
            clusterAssment(i, 1) = minIndex;
            clusterAssment(i, 2) = minDist;
        end
        % update the centroid of each cluster
        for j = 1:K
            simpleCluster = find(clusterAssment(:, 1) == j);
            if ~isempty(simpleCluster)
                % average over rows, so a one-point cluster is handled correctly
                centSet(j, :) = mean(dataSet(simpleCluster, :), 1);
            end
        end
    end
    figure
    % scatter(dataSet(:, 1), dataSet(:, 2), 5)
    for i = 1:K
        pointCluster = find(clusterAssment(:, 1) == i);
        scatter(dataSet(pointCluster, 1), dataSet(pointCluster, 2), 5)
        hold on
    end
    % The marker size is lost in the source; 100 is a placeholder.
    scatter(centSet(:, 1), centSet(:, 2), 100, '+')
    hold off
    end

    % Euclidean distance between two row vectors
    function dist = distEclud(vecA, vecB)
    dist = sqrt(sum(power(vecA - vecB, 2)));
    end
The result is as follows:
This is a normal clustering result: the data is clearly divided into 4 classes. Different colors represent different classes, and each cluster centroid is marked with "+".
Of course, this is only one possible outcome. We may also get the following:
This is one of the drawbacks of K-means: a random choice of initial points may trap the algorithm in a local optimum. In that case we simply rerun the program.
For each clustering that looks normal, we use the distortion function described above to measure its quality: the smaller J is, the better the clustering.
In practice, we only need to run the program several times and keep the clustering with the smallest J.
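Given a two-column assignment matrix like the one the implementation keeps (cluster index, distance to the centroid), J is just the sum of the squared second column. The matrix below holds made-up values for illustration, and the per-cluster breakdown with `accumarray` is an optional extra.

```matlab
% Illustrative assignment matrix: [cluster index, distance to centroid].
clusterAssment = [1 0.5; 1 1.0; 2 2.0; 2 0.0];

% Distortion: sum of squared distances (column 2 stores plain distances).
J = sum(clusterAssment(:, 2) .^ 2);

% Contribution of each cluster to J, indexed by cluster number.
Jper = accumarray(clusterAssment(:, 1), clusterAssment(:, 2) .^ 2);

fprintf('J = %g\n', J);
disp(Jper');
```

Comparing this J across several runs is enough to pick the best clustering. The per-cluster values can hint at which cluster is responsible for a poor result.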
Machine Learning in Action with MATLAB (3): the K-means algorithm