When we implemented the K-means algorithm earlier, we pointed out its flaws:
1. It may converge to a local minimum
2. It converges slowly on large data sets
At the end of the last blog post, the suggested workaround for getting caught in a local minimum was to run the K-means algorithm several times and keep the run with the smallest distortion function J as the best clustering result. That is clearly unsatisfying; we would rather obtain a near-optimal clustering in a single run.
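That restart-and-pick-best workaround can be sketched as follows (a minimal Python illustration written for this post, not the MATLAB code used below; `kmeans` here is a bare-bones implementation):

```python
import random

def dist2(a, b):
    # squared Euclidean distance between two points
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, rng=random):
    # bare-bones k-means: returns (centroids, assignments, distortion J)
    cents = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        # assignment step: nearest centroid for every point
        assign = [min(range(k), key=lambda j: dist2(p, cents[j])) for p in points]
        # update step: each centroid becomes the mean of its cluster
        for j in range(k):
            members = [p for p, a in zip(points, assign) if a == j]
            if members:
                cents[j] = tuple(sum(c) / len(members) for c in zip(*members))
    J = sum(dist2(p, cents[a]) for p, a in zip(points, assign))
    return cents, assign, J

def kmeans_best_of(points, k, runs=10, seed=0):
    # run k-means several times, keep the run with the smallest distortion J
    rng = random.Random(seed)
    return min((kmeans(points, k, rng=rng) for _ in range(runs)),
               key=lambda result: result[2])
```

The restarts only reduce the chance of a bad result; each individual run can still land in a local minimum, which is the motivation for the algorithm below.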
In fact, the root cause of K-means's shortcomings is that it is sensitive to the initial selection of the K centroids: a poor initialization can easily make it fall into a local minimum.
To address this, the bisecting K-means algorithm was proposed. Its goal is to weaken the effect of the initial centroid selection on the final clustering result.
Bisecting K-means algorithm
Before introducing the bisecting K-means algorithm, we need one definition: SSE (Sum of Squared Errors), an indicator used to measure clustering quality. SSE is in fact the same quantity as the distortion function in the K-means algorithm:
SSE sums the squared distance between each point in a cluster and that cluster's centroid, so it measures how tight the clusters are. Clearly, the smaller the SSE, the better the clustering.
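In code, the definition is just a double sum (a short Python sketch; the names `points`, `centroids`, and `assign` are hypothetical):

```python
def sse(points, centroids, assign):
    # Sum of Squared Errors: for every point, the squared Euclidean
    # distance to the centroid of the cluster it is assigned to.
    return sum(
        sum((x - c) ** 2 for x, c in zip(p, centroids[j]))
        for p, j in zip(points, assign)
    )

# two points assigned to one centroid at (1, 0):
sse([(0, 0), (2, 0)], [(1, 0)], [0, 0])  # → 2
```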
The main idea of the bisecting K-means algorithm:
Start with all points in a single cluster. Then repeatedly split one cluster in two with K-means (k=2), each time choosing the cluster whose split minimizes the clustering cost function (the total sum of squared errors). This continues until the number of clusters equals the user-given K.
The pseudo-code for the bisecting K-means algorithm is as follows:
Treat all data points as one cluster
While the number of clusters is less than k:
    For each cluster:
        Compute the total error
        Run K-means (k=2) on the cluster
        Compute the total error after splitting the cluster in two
    Split the cluster whose division yields the smallest total error
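The pseudo-code can be sketched end-to-end in Python (a compact illustration, separate from the post's MATLAB implementation; `two_means` is a bare k=2 K-means written just for this sketch):

```python
import random

def dist2(a, b):
    # squared Euclidean distance between two points
    return sum((x - y) ** 2 for x, y in zip(a, b))

def two_means(pts, iters=20, rng=random):
    # bare k-means with k=2; returns (labels in {0,1}, per-point squared errors)
    cents = rng.sample(pts, 2)
    labels = [0] * len(pts)
    for _ in range(iters):
        labels = [0 if dist2(p, cents[0]) <= dist2(p, cents[1]) else 1 for p in pts]
        for j in (0, 1):
            members = [p for p, l in zip(pts, labels) if l == j]
            if members:
                cents[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, [dist2(p, cents[l]) for p, l in zip(pts, labels)]

def bisecting_kmeans(points, k, seed=0):
    rng = random.Random(seed)
    # start with every point in one cluster; errors measured to the global mean
    mean = tuple(sum(c) / len(points) for c in zip(*points))
    labels = [0] * len(points)
    errs = [dist2(p, mean) for p in points]
    num = 1
    while num < k:
        best = None
        for j in range(num):  # try splitting each existing cluster
            idx = [i for i, l in enumerate(labels) if l == j]
            sub_labels, sub_errs = two_means([points[i] for i in idx], rng=rng)
            # total SSE = SSE of the split cluster + SSE of everything else
            total = sum(sub_errs) + sum(errs[i] for i in range(len(points))
                                        if labels[i] != j)
            if best is None or total < best[0]:
                best = (total, j, idx, sub_labels, sub_errs)
        _, j, idx, sub_labels, sub_errs = best
        num += 1
        for i, sl, se in zip(idx, sub_labels, sub_errs):
            labels[i] = j if sl == 0 else num - 1  # one half keeps the old id
            errs[i] = se
    return labels
```

Note how only the chosen cluster's labels and errors are rewritten after each split; every other cluster's SSE carries over unchanged, which is exactly what makes the comparison between candidate splits cheap.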
MATLAB implementation
function biKmeans
%%
clc
clear
close all
%%
biK = 4;
biDataSet = load('testSet.txt');
[row,col] = size(biDataSet);
% matrix that stores the centroids
biCentSet = zeros(biK,col);
% initialize the number of clusters to 1
numCluster = 1;
% column 1: centroid assigned to each point; column 2: distance to that centroid
biClusterAssume = zeros(row,2);
% initialize the first centroid
biCentSet(1,:) = mean(biDataSet)
for i = 1:row
    biClusterAssume(i,1) = numCluster;
    biClusterAssume(i,2) = distEclud(biDataSet(i,:),biCentSet(1,:));
end
while numCluster < biK
    minSSE = 10000;
    % find the cluster that is best to split, i.e. the split with the smallest SSE
    for j = 1:numCluster
        curCluster = biDataSet(find(biClusterAssume(:,1)==j),:);
        [spiltCentSet,spiltClusterAssume] = kMeans(curCluster,2);
        spiltSSE = sum(spiltClusterAssume(:,2));
        noSpiltSSE = sum(biClusterAssume(find(biClusterAssume(:,1)~=j),2));
        curSSE = spiltSSE + noSpiltSSE;
        fprintf('The error after splitting cluster %d is: %f\n',[j,curSSE])
        if (curSSE < minSSE)
            minSSE = curSSE;
            bestClusterToSpilt = j;
            bestClusterAssume = spiltClusterAssume;
            bestCentSet = spiltCentSet;
        end
    end
    bestClusterToSpilt
    bestCentSet
    % update the number of clusters
    numCluster = numCluster + 1;
    bestClusterAssume(find(bestClusterAssume(:,1)==1),1) = bestClusterToSpilt;
    bestClusterAssume(find(bestClusterAssume(:,1)==2),1) = numCluster;
    % update and add centroid coordinates
    biCentSet(bestClusterToSpilt,:) = bestCentSet(1,:);
    biCentSet(numCluster,:) = bestCentSet(2,:);
    biCentSet
    % update the centroid assignment and error of every point in the split cluster
    biClusterAssume(find(biClusterAssume(:,1)==bestClusterToSpilt),:) = bestClusterAssume;
end
figure
%scatter(dataSet(:,1),dataSet(:,2),5)
for i = 1:biK
    pointCluster = find(biClusterAssume(:,1)==i);
    scatter(biDataSet(pointCluster,1),biDataSet(pointCluster,2),5)
    hold on
end
%hold on
scatter(biCentSet(:,1),biCentSet(:,2),100,'+')
hold off
end

% Euclidean (squared) distance calculation
function dist = distEclud(vecA,vecB)
dist = sum(power((vecA-vecB),2));
end

% K-means algorithm
function [centSet,clusterAssment] = kMeans(dataSet,k)
[row,col] = size(dataSet);
% matrix that stores the centroids
centSet = zeros(k,col);
% randomly initialize the centroids within the data range
for i = 1:col
    minV = min(dataSet(:,i));
    rangV = max(dataSet(:,i)) - minV;
    centSet(:,i) = repmat(minV,[k,1]) + rangV*rand(k,1);
end
% column 1: cluster assigned to each point; column 2: distance to the centroid
clusterAssment = zeros(row,2);
clusterChange = true;
while clusterChange
    clusterChange = false;
    % compute the cluster each point should be assigned to
    for i = 1:row
        % this part could be optimized
        minDist = 10000;
        minIndex = 0;
        for j = 1:k
            distCal = distEclud(dataSet(i,:),centSet(j,:));
            if (distCal < minDist)
                minDist = distCal;
                minIndex = j;
            end
        end
        if minIndex ~= clusterAssment(i,1)
            clusterChange = true;
        end
        clusterAssment(i,1) = minIndex;
        clusterAssment(i,2) = minDist;
    end
    % update the centroid of each cluster
    for j = 1:k
        simpleCluster = find(clusterAssment(:,1)==j);
        centSet(j,:) = mean(dataSet(simpleCluster',:));
    end
end
end
The iterative process of the algorithm is as follows
biCentSet =

   -0.1036    0.0543
         0         0
         0         0
         0         0

The error after splitting cluster 1 is: 792.916857

bestClusterToSpilt =

     1

bestCentSet =

   -0.2897   -2.8394
    0.0825    2.9480

biCentSet =

   -0.2897   -2.8394
    0.0825    2.9480
         0         0
         0         0

The error after splitting cluster 1 is: 409.871545
The error after splitting cluster 2 is: 532.999616

bestClusterToSpilt =

     1

bestCentSet =

   -3.3824   -2.9473
    2.8029   -2.7315

biCentSet =

   -3.3824   -2.9473
    0.0825    2.9480
    2.8029   -2.7315
         0         0

The error after splitting cluster 1 is: 395.669052
The error after splitting cluster 2 is: 149.954305
The error after splitting cluster 3 is: 393.431098

bestClusterToSpilt =

     2

bestCentSet =

    2.6265    3.1087
   -2.4615    2.7874

biCentSet =

   -3.3824   -2.9473
    2.6265    3.1087
    2.8029   -2.7315
   -2.4615    2.7874
The final clustering result:
When clustering with the bisecting K-means algorithm, runs with different initial centroids still produce slightly different results: the algorithm only weakens the influence of the random initial centroids on the clustering result, it cannot eliminate that influence entirely. In practice, though, it tends to converge to a result at or near the global minimum.
Machine Learning in Action with MATLAB (IV): the bisecting K-means algorithm