When we implemented the K-means algorithm earlier, we pointed out its flaws:
1. It may converge to a local minimum
2. It converges slowly on large data sets
At the end of the last blog post, the suggested workaround for getting caught in a local minimum was to run the K-means algorithm several times and keep the run with the smallest distortion function J as the best clustering result. That is clearly unsatisfying; we would rather obtain a near-optimal clustering in a single run.
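That restart-and-pick-best workaround can be sketched as follows (a minimal Python illustration written for this post, not the MATLAB code used below; `kmeans` here is a bare-bones implementation):

```python
import random

def dist2(a, b):
    # squared Euclidean distance between two points
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, rng=random):
    # bare-bones k-means: returns (centroids, assignments, distortion J)
    cents = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest centroid for every point
        assign = [min(range(k), key=lambda j: dist2(p, cents[j])) for p in points]
        # update step: each centroid becomes the mean of its cluster
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                cents[j] = tuple(sum(c) / len(members) for c in zip(*members))
    J = sum(dist2(p, cents[a]) for p, a in zip(points, assign))
    return cents, assign, J

def kmeans_best_of(points, k, runs=10, seed=0):
    # run k-means several times, keep the run with the smallest distortion J
    rng = random.Random(seed)
    return min((kmeans(points, k, rng=rng) for _ in range(runs)),
               key=lambda result: result[2])
```

The restarts only reduce the chance of a bad result; each individual run can still land in a local minimum, which is the motivation for the algorithm below.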
In fact, the root cause of K-means's shortcomings is that it is sensitive to the initial selection of the K centroids: a poor initialization can easily make it fall into a local minimum.
To address this, the bisecting K-means algorithm was proposed. Its goal is to weaken the effect of the initial centroid selection on the final clustering result.
Bisecting K-means algorithm
Before introducing the bisecting K-means algorithm, we need one definition: SSE (Sum of Squared Errors), an indicator used to measure clustering quality. SSE is in fact the same quantity as the distortion function in the K-means algorithm:
SSE sums the squared distance between each point in a cluster and that cluster's centroid, so it measures how tight the clusters are. Clearly, the smaller the SSE, the better the clustering.
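In code, the definition is just a double sum (a short Python sketch; the names `points`, `centroids`, and `assign` are hypothetical):

```python
def sse(points, centroids, assign):
    # Sum of Squared Errors: for every point, the squared Euclidean
    # distance to the centroid of the cluster it is assigned to.
    return sum(
        sum((x - c) ** 2 for x, c in zip(p, centroids[j]))
        for p, j in zip(points, assign)
    )

# two points assigned to one centroid at (1, 0):
sse([(0, 0), (2, 0)], [(1, 0)], [0, 0])  # → 2
```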
The main idea of the bisecting K-means algorithm:
Start with all points in a single cluster. Then repeatedly split one cluster in two with K-means (k=2), each time choosing the cluster whose split minimizes the clustering cost function (the total sum of squared errors). This continues until the number of clusters equals the user-given K.
The pseudo-code for the bisecting K-means algorithm is as follows:
Treat all data points as one cluster
While the number of clusters is less than k:
    For each cluster:
        Compute the total error
        Run K-means (k=2) on the cluster
        Compute the total error after splitting the cluster in two
    Split the cluster whose division yields the smallest total error
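The pseudo-code can be sketched end-to-end in Python (a compact illustration, separate from the post's MATLAB implementation; `two_means` is a bare k=2 K-means written just for this sketch):

```python
import random

def dist2(a, b):
    # squared Euclidean distance between two points
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_means(pts, iters=20, rng=random):
    # bare k-means with k=2; returns (labels in {0,1}, per-point squared errors)
    cents = rng.sample(pts, 2)
    labels = [0] * len(pts)
    for _ in range(iters):
        labels = [0 if dist2(p, cents[0]) <= dist2(p, cents[1]) else 1 for p in pts]
        for j in (0, 1):
            members = [p for p, l in zip(pts, labels) if l == j]
            if members:
                cents[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, [dist2(p, cents[l]) for p, l in zip(pts, labels)]

def bisecting_kmeans(points, k, seed=0):
    rng = random.Random(seed)
    # start with every point in one cluster; errors measured to the global mean
    mean = tuple(sum(c) / len(points) for c in zip(*points))
    labels = [0] * len(points)
    errs = [dist2(p, mean) for p in points]
    num = 1
    while num < k:
        best = None
        for j in range(num):  # try splitting each existing cluster
            idx = [i for i, l in enumerate(labels) if l == j]
            sub_labels, sub_errs = two_means([points[i] for i in idx], rng=rng)
            # total SSE = SSE of the split cluster + SSE of everything else
            total = sum(sub_errs) + sum(errs[i] for i in range(len(points))
                                        if labels[i] != j)
            if best is None or total < best[0]:
                best = (total, j, idx, sub_labels, sub_errs)
        _, j, idx, sub_labels, sub_errs = best
        num += 1
        for i, sl, se in zip(idx, sub_labels, sub_errs):
            labels[i] = j if sl == 0 else num - 1  # one half keeps the old id
            errs[i] = se
    return labels
```

Note how only the chosen cluster's labels and errors are rewritten after each split; every other cluster's SSE carries over unchanged, which is exactly what makes the comparison between candidate splits cheap.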
MATLAB implementation
function biKmeans
%%
clc
clear
close all
%%
biK = 4;
biDataSet = load('testSet.txt');
[row,col] = size(biDataSet);
% matrix that stores the centroids
biCentSet = zeros(biK,col);
% initialize the number of clusters to 1
numCluster = 1;
% column 1: centroid assigned to each point; column 2: distance to that centroid
biClusterAssume = zeros(row,2);
% initialize the first centroid
biCentSet(1,:) = mean(biDataSet)
for i = 1:row
    biClusterAssume(i,1) = numCluster;
    biClusterAssume(i,2) = distEclud(biDataSet(i,:),biCentSet(1,:));
end
while numCluster < biK
    minSSE = 10000;
    % find the cluster that is best to split, i.e. the split with the smallest SSE
    for j = 1:numCluster
        curCluster = biDataSet(find(biClusterAssume(:,1)==j),:);
        [spiltCentSet,spiltClusterAssume] = kMeans(curCluster,2);
        spiltSSE = sum(spiltClusterAssume(:,2));
        noSpiltSSE = sum(biClusterAssume(find(biClusterAssume(:,1)~=j),2));
        curSSE = spiltSSE + noSpiltSSE;
        fprintf('The error after splitting cluster %d is: %f\n',[j,curSSE])
        if (curSSE < minSSE)
            minSSE = curSSE;
            bestClusterToSpilt = j;
            bestClusterAssume = spiltClusterAssume;
            bestCentSet = spiltCentSet;
        end
    end
    bestClusterToSpilt
    bestCentSet
    % update the number of clusters
    numCluster = numCluster + 1;
    bestClusterAssume(find(bestClusterAssume(:,1)==1),1) = bestClusterToSpilt;
    bestClusterAssume(find(bestClusterAssume(:,1)==2),1) = numCluster;
    % update and add centroid coordinates
    biCentSet(bestClusterToSpilt,:) = bestCentSet(1,:);
    biCentSet(numCluster,:) = bestCentSet(2,:);
    biCentSet
    % update the centroid assignment and error of every point in the split cluster
    biClusterAssume(find(biClusterAssume(:,1)==bestClusterToSpilt),:) = bestClusterAssume;
end
figure
%scatter(dataSet(:,1),dataSet(:,2),5)
for i = 1:biK
    pointCluster = find(biClusterAssume(:,1)==i);
    scatter(biDataSet(pointCluster,1),biDataSet(pointCluster,2),5)
    hold on
end
%hold on
scatter(biCentSet(:,1),biCentSet(:,2),100,'+')
hold off
end

% Euclidean (squared) distance calculation
function dist = distEclud(vecA,vecB)
dist = sum(power((vecA-vecB),2));
end

% K-means algorithm
function [centSet,clusterAssment] = kMeans(dataSet,k)
[row,col] = size(dataSet);
% matrix that stores the centroids
centSet = zeros(k,col);
% randomly initialize the centroids within the data range
for i = 1:col
    minV = min(dataSet(:,i));
    rangV = max(dataSet(:,i)) - minV;
    centSet(:,i) = repmat(minV,[k,1]) + rangV*rand(k,1);
end
% column 1: cluster assigned to each point; column 2: distance to the centroid
clusterAssment = zeros(row,2);
clusterChange = true;
while clusterChange
    clusterChange = false;
    % compute the cluster each point should be assigned to
    for i = 1:row
        % this part could be optimized
        minDist = 10000;
        minIndex = 0;
        for j = 1:k
            distCal = distEclud(dataSet(i,:),centSet(j,:));
            if (distCal < minDist)
                minDist = distCal;
                minIndex = j;
            end
        end
        if minIndex ~= clusterAssment(i,1)
            clusterChange = true;
        end
        clusterAssment(i,1) = minIndex;
        clusterAssment(i,2) = minDist;
    end
    % update the centroid of each cluster
    for j = 1:k
        simpleCluster = find(clusterAssment(:,1)==j);
        centSet(j,:) = mean(dataSet(simpleCluster',:));
    end
end
end
The iterative process of the algorithm is as follows
biCentSet =

   -0.1036    0.0543
         0         0
         0         0
         0         0

The error after splitting cluster 1 is: 792.916857

bestClusterToSpilt =

     1

bestCentSet =

   -0.2897   -2.8394
    0.0825    2.9480

biCentSet =

   -0.2897   -2.8394
    0.0825    2.9480
         0         0
         0         0

The error after splitting cluster 1 is: 409.871545
The error after splitting cluster 2 is: 532.999616

bestClusterToSpilt =

     1

bestCentSet =

   -3.3824   -2.9473
    2.8029   -2.7315

biCentSet =

   -3.3824   -2.9473
    0.0825    2.9480
    2.8029   -2.7315
         0         0

The error after splitting cluster 1 is: 395.669052
The error after splitting cluster 2 is: 149.954305
The error after splitting cluster 3 is: 393.431098

bestClusterToSpilt =

     2

bestCentSet =

    2.6265    3.1087
   -2.4615    2.7874

biCentSet =

   -3.3824   -2.9473
    2.6265    3.1087
    2.8029   -2.7315
   -2.4615    2.7874
The final clustering result:
When clustering with the bisecting K-means algorithm, runs with different initial centroids still produce slightly different results: the algorithm only weakens the influence of the random initial centroids on the clustering result, it cannot eliminate that influence entirely. In practice, though, it tends to converge to a result at or near the global minimum.
Machine Learning in Action with MATLAB (IV): the bisecting K-means algorithm