Clustering validity: determining the optimal number of clusters

Source: Internet
Author: User

There are two kinds of evaluation criteria for clustering validity. The first is the external criterion, which evaluates a clustering result by measuring its agreement with a reference partition (ground truth). The second is the internal index, which compares the quality of the clustering results produced by the same algorithm under different numbers of clusters, and is usually used to determine the optimal number of clusters for a dataset.
1. A method for determining the optimal number of clusters
Internal indices usually fall into three types: indices based on a fuzzy partition of the dataset, indices based on the geometric structure of the dataset samples, and indices based on dataset statistics. Indices based on sample geometry evaluate clustering results according to the statistical characteristics of the dataset itself together with the clustering results, and select the optimal number of clusters according to the quality of those results. They include the Calinski-Harabasz (CH) index, the Davies-Bouldin (DB) index, the weighted inter-intra (Wint) index, the Krzanowski-Lai (KL) index, the Hartigan (Hart) index, the in-group proportion (IGP) index, and others. This article mainly introduces the Calinski-Harabasz (CH) index and the Davies-Bouldin (DB) index.
(1) CH index
The CH index measures compactness through the within-class dispersion matrix and separation through the between-class dispersion matrix. It is defined as

CH(k) = [tr B(k) / (k - 1)] / [tr W(k) / (n - k)]

where n is the number of samples, k is the current number of clusters, tr B(k) is the trace of the between-class dispersion matrix, and tr W(k) is the trace of the within-class dispersion matrix. For a more detailed explanation of the formula, refer to the paper "A dendrite method for cluster analysis".
It follows that the larger the CH value, the tighter each class is internally and the more separated the classes are from one another, which corresponds to a better clustering result.
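To make the roles of tr B(k) and tr W(k) concrete, here is a minimal from-scratch Python sketch of the CH index (the function name and looping style are our own; in practice a library routine such as MATLAB's evalclusters would be used instead):

```python
def calinski_harabasz(points, labels):
    """Compute the CH index for a dataset of coordinate tuples and a
    cluster assignment. Illustrative only, not a reference implementation."""
    n = len(points)
    dim = len(points[0])
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    k = len(clusters)
    # Overall (grand) centroid of the whole dataset
    grand = [sum(p[d] for p in points) / n for d in range(dim)]
    tr_b = 0.0  # trace of the between-class dispersion matrix B(k)
    tr_w = 0.0  # trace of the within-class dispersion matrix W(k)
    for members in clusters.values():
        centroid = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        tr_b += len(members) * sum((centroid[d] - grand[d]) ** 2 for d in range(dim))
        for p in members:
            tr_w += sum((p[d] - centroid[d]) ** 2 for d in range(dim))
    # CH(k) = [tr B(k) / (k-1)] / [tr W(k) / (n-k)]
    return (tr_b / (k - 1)) / (tr_w / (n - k))
```

A well-separated labeling yields a much larger CH value than a mixed-up labeling of the same points, matching the "larger is better" rule above.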

(2) DB indicator
The DB index is defined from the within-class scatter of each cluster and the distances between cluster centers:

DB(k) = (1/k) * Σ_{i=1..k} max_{j≠i} (W_i + W_j) / C_ij

where k is the number of clusters, W_i is the average distance from all samples in class C_i to its cluster center, W_j is the average distance from all samples in class C_j to its cluster center, and C_ij is the distance between the centers of classes C_i and C_j. The smaller the DB value, the lower the similarity between classes, which corresponds to a better clustering result.
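The DB formula can likewise be sketched from scratch in Python (a minimal illustration; the function name is ours, and a library implementation would normally be used):

```python
import math

def davies_bouldin(points, labels):
    """Compute the DB index for a dataset of coordinate tuples and a
    cluster assignment. Illustrative only, not a reference implementation."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    labs = sorted(clusters)
    k = len(labs)
    dim = len(points[0])
    centroids = {}
    scatter = {}  # W_i: average distance of members of C_i to its center
    for lab in labs:
        members = clusters[lab]
        c = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        centroids[lab] = c
        scatter[lab] = sum(math.dist(p, c) for p in members) / len(members)
    total = 0.0
    for i in labs:
        # R_i = max over j != i of (W_i + W_j) / C_ij
        total += max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
                     for j in labs if j != i)
    return total / k
```

On the same points, a clean labeling scores a much smaller DB value than a mixed-up one, matching the "smaller is better" rule.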

The process of determining the optimal number of clusters generally runs as follows: given a range of candidate values k in [kmin, kmax], run the same clustering algorithm on the dataset for each number of clusters k, producing a series of clustering results; compute the validity index for each result; and finally compare the index values. The number of clusters corresponding to the best index value is the optimal number of clusters.
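This sweep can be sketched end to end in Python. Everything below is a toy illustration: the simplified k-means (with deterministic farthest-first seeding) and the compact CH-style score are stand-ins for library routines, and the dataset is synthetic.

```python
import math

def kmeans(points, k, iters=25):
    """Tiny Lloyd's k-means with deterministic farthest-first seeding;
    an illustrative stand-in for a library routine such as MATLAB's kmeans."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        for j in range(k):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                centers[j] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

def ch(points, labels):
    """Compact CH-style score: [tr B / (k-1)] / [tr W / (n-k)]."""
    n, dim = len(points), len(points[0])
    groups = {}
    for p, l in zip(points, labels):
        groups.setdefault(l, []).append(p)
    k = len(groups)
    grand = [sum(p[d] for p in points) / n for d in range(dim)]
    tr_b = tr_w = 0.0
    for members in groups.values():
        c = [sum(p[d] for p in members) / len(members) for d in range(dim)]
        tr_b += len(members) * sum((c[d] - grand[d]) ** 2 for d in range(dim))
        tr_w += sum(sum((p[d] - c[d]) ** 2 for d in range(dim)) for p in members)
    return (tr_b / (k - 1)) / (tr_w / (n - k))

# Synthetic data: three tight, well-separated clusters of five points each.
offsets = [(-0.5, -0.5), (-0.5, 0.5), (0.5, -0.5), (0.5, 0.5), (0.0, 0.0)]
data = [(cx + dx, cy + dy)
        for cx, cy in [(0.0, 0.0), (10.0, 0.0), (5.0, 9.0)]
        for dx, dy in offsets]

# Sweep k over [kmin, kmax] and keep the k with the best (largest) CH value.
best_k = max(range(2, 6), key=lambda k: ch(data, kmeans(data, k)))
```

With three well-separated clusters in the data, the sweep selects k = 3, mirroring the procedure described above with CH as the validity index (for DB one would minimize instead).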

2. Experimental results
In MATLAB, the function evalclusters provides four criteria for evaluating clustering quality: 'CalinskiHarabasz', 'DaviesBouldin', 'gap', and 'silhouette'. A set of data is selected for clustering evaluation. Here the 'CalinskiHarabasz' and 'DaviesBouldin' criteria are chosen, and k-means is selected as the clustering algorithm.

(1) CH index
Given the range of k values, the CH index of each clustering result is calculated; the k value corresponding to the maximum index value is the optimal one.

(2) DB indicator
Given the range of k values, the DB index of each clustering result is computed; the k value corresponding to the minimum index value is the optimal one.

Note: NaN occurs because neither of these criteria is defined for a cluster number of 1.

MATLAB code

cluster = zeros(size(data, 1), 3);
for i = 1:3
    cluster(:, i) = kmeans(data, i, 'Replicates', 5);  % save each clustering result
end
eva = evalclusters(data, cluster, 'DaviesBouldin');
subplot(1, 3, 1);
plot(data(cluster(:, 1) == 1, 1), data(cluster(:, 1) == 1, 2), 'r*');
hold on
subplot(1, 3, 2);
plot(data(cluster(:, 2) == 1, 1), data(cluster(:, 2) == 1, 2), 'r*');
hold on
plot(data(cluster(:, 2) == 2, 1), data(cluster(:, 2) == 2, 2), 'b*');
subplot(1, 3, 3);
data = [C1 R1];            % C1 and R1 are defined elsewhere in the original script
[idx, ctrs] = kmeans(data, 3);
plot(data(cluster(:, 3) == 1, 1), data(cluster(:, 3) == 1, 2), 'r*');
hold on
plot(data(cluster(:, 3) == 2, 1), data(cluster(:, 3) == 2, 2), 'b*');
hold on
plot(data(cluster(:, 3) == 3, 1), data(cluster(:, 3) == 3, 2), 'k*');
hold on
