Machine Learning in Action with MATLAB (IV): The Bisecting K-means Algorithm

Source: Internet
Author: User

When we implemented the K-means algorithm earlier, we mentioned its inherent flaws:

1. It may converge to a local minimum.
2. Convergence is slow on large data sets.

At the end of the last post, the suggested workaround for getting caught in a local minimum was to run the K-means algorithm several times and keep the result with the smallest distortion function J as the best clustering. That is clearly unsatisfying; we would rather obtain a near-optimal clustering result in a single run.

In fact, the root cause of K-means's shortcomings is its sensitivity to the initial selection of the K centroids: a poor choice of centroids can easily trap the algorithm in a local minimum.
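To see this sensitivity concretely, here is a small self-contained Python sketch (not part of the original MATLAB code; the data and names are made up for illustration) where one initialization of plain K-means reaches the good clustering while another gets stuck at a much worse stable fixed point:

```python
import numpy as np

def lloyd(X, centroids, iters=20):
    """Plain Lloyd's algorithm; returns final centroids and the total SSE."""
    for _ in range(iters):
        d = ((X[:, None] - centroids[None]) ** 2).sum(2)  # squared distances to each centroid
        labels = d.argmin(1)                              # nearest-centroid assignment
        centroids = np.stack([X[labels == j].mean(0)      # move centroids to cluster means
                              for j in range(len(centroids))])
    d = ((X[:, None] - centroids[None]) ** 2).sum(2)
    return centroids, d.min(1).sum()

# Two tight pairs of points, far apart along the x-axis
X = np.array([[0., 0.], [0., 1.], [10., 0.], [10., 1.]])

# A good initialization separates the two pairs
_, good_sse = lloyd(X, np.array([[0., 0.5], [10., 0.5]]))
# A bad initialization splits along y instead; it is a stable fixed point and never recovers
_, bad_sse = lloyd(X, np.array([[5., 0.], [5., 1.]]))

print(good_sse, bad_sse)   # 1.0 100.0
```

Both runs use the same data and the same update rule; only the starting centroids differ, yet the final errors differ by two orders of magnitude.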

To address this, the bisecting K-means algorithm was proposed. Its goal is to weaken the effect of the initial centroid selection on the final clustering result.

Bisecting K-means algorithm

Before introducing the bisecting K-means algorithm, we first define SSE (Sum of Squared Error), an indicator used to measure clustering quality. SSE is in fact what we called the distortion function in the K-means algorithm:

SSE = Σ_{i=1}^{k} Σ_{x ∈ C_i} ||x − μ_i||²

where C_i is the i-th cluster and μ_i is its centroid. SSE sums the squared distance between each point in a cluster and that cluster's centroid, so it measures how tight the clusters are. Clearly, the smaller the SSE, the better the clustering.
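As a quick illustration (in Python rather than MATLAB, with made-up toy points), SSE for one cluster is just the summed squared distance from every point to the cluster's centroid:

```python
import numpy as np

def cluster_sse(points, centroid):
    """Sum of squared Euclidean distances from each point to the centroid."""
    points = np.asarray(points, dtype=float)
    return float(((points - centroid) ** 2).sum())

# A toy cluster; its centroid is the mean of its points
pts = [[1., 1.], [2., 2.], [3., 3.]]
centroid = np.mean(pts, axis=0)       # [2., 2.]
print(cluster_sse(pts, centroid))     # 4.0
```

The total SSE of a clustering is this quantity summed over all clusters, which is exactly what the algorithm below tries to minimize at each split.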

The main idea of the bisecting K-means algorithm:
Start with all points in a single cluster. Repeatedly pick one cluster and split it in two with ordinary K-means (k = 2), choosing the split that minimizes the clustering cost function (the total sum of squared errors). Continue until the number of clusters equals the user-given K.

The pseudo-code for the bisecting K-means algorithm is as follows:

Treat all data points as one cluster
While the number of clusters is less than k:
    For each cluster:
        Compute the total error
        Run K-means clustering (k = 2) on that cluster
        Compute the total error after splitting that cluster in two
    Pick the cluster whose split yields the smallest error and perform the split
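The pseudo-code above translates fairly directly into Python. The following is a minimal sketch (all names and the toy data are my own, not from the MATLAB code; the inner 2-means uses a deterministic seeding so the example is reproducible, whereas the original uses random initialization):

```python
import numpy as np

def two_means(X, iters=20):
    """k=2 k-means with deterministic seeding: the first point and the point farthest from it."""
    c = np.stack([X[0], X[((X - X[0]) ** 2).sum(1).argmax()]])
    for _ in range(iters):
        d = ((X[:, None] - c[None]) ** 2).sum(2)      # squared distances to both centroids
        c = np.stack([X[d.argmin(1) == j].mean(0) for j in (0, 1)])
    d = ((X[:, None] - c[None]) ** 2).sum(2)
    return d.argmin(1), c, d.min(1)                   # labels, centroids, per-point error

def bisecting_kmeans(X, k):
    labels = np.zeros(len(X), dtype=int)              # start with everything in one cluster
    errors = ((X - X.mean(0)) ** 2).sum(1)            # per-point squared error
    cents = [X.mean(0)]
    while len(cents) < k:
        best = None
        for j in range(len(cents)):                   # trial-split every cluster
            mask = labels == j
            sub_labels, sub_c, sub_err = two_means(X[mask])
            total = sub_err.sum() + errors[~mask].sum()
            if best is None or total < best[0]:
                best = (total, j, mask, sub_labels, sub_c, sub_err)
        _, j, mask, sub_labels, sub_c, sub_err = best
        new_id = len(cents)                           # commit the cheapest split:
        labels[mask] = np.where(sub_labels == 0, j, new_id)
        errors[mask] = sub_err
        cents[j] = sub_c[0]
        cents.append(sub_c[1])
    return labels, np.array(cents)

# Four tight pairs on a line; k = 4 should recover each pair as a cluster
X = np.array([[0.], [1.], [10.], [11.], [20.], [21.], [30.], [31.]])
labels, cents = bisecting_kmeans(X, 4)
print(labels)                         # [0 0 2 2 1 1 3 3]
print(sorted(cents.ravel().tolist())) # [0.5, 10.5, 20.5, 30.5]
```

Note how each iteration compares, for every cluster, the error after splitting it plus the unchanged error of the remaining clusters, exactly as the pseudo-code prescribes.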
MATLAB implementation

function biKmeans
%%
clc
clear
close all
%%
biK = 4;
biDataSet = load('testSet.txt');
[row, col] = size(biDataSet);
% storage for the centroid matrix
biCentSet = zeros(biK, col);
% initialize the number of clusters to 1
numCluster = 1;
% column 1 stores the centroid assigned to each point,
% column 2 stores the distance from the point to that centroid
biClusterAssume = zeros(row, 2);
% initialize the first centroid (no semicolon, so it is displayed)
biCentSet(1,:) = mean(biDataSet)
for i = 1:row
    biClusterAssume(i,1) = numCluster;
    biClusterAssume(i,2) = distEclud(biDataSet(i,:), biCentSet(1,:));
end
while numCluster < biK
    minSSE = 10000;
    % find the cluster whose split gives the smallest total SSE
    for j = 1:numCluster
        curCluster = biDataSet(find(biClusterAssume(:,1) == j), :);
        [spiltCentSet, spiltClusterAssume] = kMeans(curCluster, 2);
        spiltSSE = sum(spiltClusterAssume(:,2));
        noSpiltSSE = sum(biClusterAssume(find(biClusterAssume(:,1) ~= j), 2));
        curSSE = spiltSSE + noSpiltSSE;
        fprintf('Error after splitting cluster %d: %f\n', j, curSSE)
        if (curSSE < minSSE)
            minSSE = curSSE;
            bestClusterToSpilt = j;
            bestClusterAssume = spiltClusterAssume;
            bestCentSet = spiltCentSet;
        end
    end
    bestClusterToSpilt
    bestCentSet
    % update the number of clusters
    numCluster = numCluster + 1;
    bestClusterAssume(find(bestClusterAssume(:,1) == 1), 1) = bestClusterToSpilt;
    bestClusterAssume(find(bestClusterAssume(:,1) == 2), 1) = numCluster;
    % update and add centroid coordinates
    biCentSet(bestClusterToSpilt,:) = bestCentSet(1,:);
    biCentSet(numCluster,:) = bestCentSet(2,:);
    biCentSet
    % update the centroid assignment and error of every point
    % in the cluster that was split
    biClusterAssume(find(biClusterAssume(:,1) == bestClusterToSpilt), :) = bestClusterAssume;
end
figure
%scatter(biDataSet(:,1), biDataSet(:,2), 5)
for i = 1:biK
    pointCluster = find(biClusterAssume(:,1) == i);
    scatter(biDataSet(pointCluster,1), biDataSet(pointCluster,2), 5)
    hold on
end
%hold on
scatter(biCentSet(:,1), biCentSet(:,2), 100, '+')
hold off
end

% squared Euclidean distance between two vectors
function dist = distEclud(vecA, vecB)
dist = sum(power((vecA - vecB), 2));
end

% standard K-means algorithm
function [centSet, clusterAssment] = kMeans(dataSet, k)
[row, col] = size(dataSet);
% storage for the centroid matrix
centSet = zeros(k, col);
% randomly initialize the centroids within the data range
for i = 1:col
    minV = min(dataSet(:,i));
    rangV = max(dataSet(:,i)) - minV;
    centSet(:,i) = repmat(minV, [k,1]) + rangV * rand(k,1);
end
% column 1: cluster assigned to each point; column 2: distance to its centroid
clusterAssment = zeros(row, 2);
clusterChange = true;
while clusterChange
    clusterChange = false;
    % compute the cluster each point should be assigned to
    for i = 1:row
        % this part could be vectorized
        minDist = 10000;
        minIndex = 0;
        for j = 1:k
            distCal = distEclud(dataSet(i,:), centSet(j,:));
            if (distCal < minDist)
                minDist = distCal;
                minIndex = j;
            end
        end
        if minIndex ~= clusterAssment(i,1)
            clusterChange = true;
        end
        clusterAssment(i,1) = minIndex;
        clusterAssment(i,2) = minDist;
    end
    % update the centroid of each cluster
    for j = 1:k
        simpleCluster = find(clusterAssment(:,1) == j);
        centSet(j,:) = mean(dataSet(simpleCluster,:), 1);
    end
end
end
The iterative process of the algorithm is as follows:

biCentSet =

   -0.1036    0.0543
         0         0
         0         0
         0         0

Error after splitting cluster 1: 792.916857

bestClusterToSpilt =

     1

bestCentSet =

   -0.2897   -2.8394
    0.0825    2.9480

biCentSet =

   -0.2897   -2.8394
    0.0825    2.9480
         0         0
         0         0

Error after splitting cluster 1: 409.871545
Error after splitting cluster 2: 532.999616

bestClusterToSpilt =

     1

bestCentSet =

   -3.3824   -2.9473
    2.8029   -2.7315

biCentSet =

   -3.3824   -2.9473
    0.0825    2.9480
    2.8029   -2.7315
         0         0

Error after splitting cluster 1: 395.669052
Error after splitting cluster 2: 149.954305
Error after splitting cluster 3: 393.431098

bestClusterToSpilt =

     2

bestCentSet =

    2.6265    3.1087
   -2.4615    2.7874

biCentSet =

   -3.3824   -2.9473
    2.6265    3.1087
    2.8029   -2.7315
   -2.4615    2.7874
Finally, the clustering result (scatter plot of the four clusters and their centroids; figure not preserved in this copy).
When clustering with the bisecting K-means algorithm, the results obtained from different initial centroids will still differ slightly. The algorithm only weakens the influence of the random initial centroids on the clustering result, it cannot eliminate it entirely, but it can still converge to a result at or very near the global minimum.

