Data Analysis, Part 6: Cluster Evaluation (Determining the Number of Clusters and the Silhouette Coefficient) and Visualization

Last Update:2018-08-25 Source: Internet

Author: User


In practical clustering applications, the k-means and k-medoids algorithms are commonly used for cluster analysis. Both require the number of clusters as input. To ensure clustering quality, first determine the optimal number of clusters, then use the silhouette coefficient to evaluate the clustering result.

I. Determining the optimal number of clusters for k-means

The elbow method is typically used to determine the optimal number of clusters. It rests on the following observation: increasing the number of clusters reduces the sum of the within-cluster variances, but the reduction becomes marginal once the number of clusters exceeds the number actually present in the data. Given k > 0, compute the sum of within-cluster variances var(k) and plot var against k. The first (or most pronounced) inflection point of the curve suggests the appropriate number of clusters.
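As a minimal sketch of this idea, the curve can be computed by hand with base R's kmeans(). The built-in iris measurements and the upper limit of 10 clusters are illustrative assumptions, not part of the original analysis; tot.withinss (the total within-cluster sum of squares) stands in for var(k):

```r
# Illustrative data: the four numeric columns of the built-in iris data set.
x <- scale(iris[, 1:4])

set.seed(123)          # k-means starts from random centers
k_values <- 1:10       # assumed search range

# Total within-cluster sum of squares for each candidate k.
wss <- sapply(k_values, function(k) {
  kmeans(x, centers = k, nstart = 25)$tot.withinss
})

# The "elbow" is where the curve stops dropping steeply.
plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
```

Reading the elbow off the plot remains a judgment call, which is why the sections below cross-check it with other criteria.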

1. Use the sjc.elbow() function to calculate the elbow values

The sjc.elbow() function in the sjPlot package implements the elbow method. It calculates the elbow values of a k-means cluster analysis to help determine the optimal number of clusters:

library(sjPlot)
sjc.elbow(data, steps = 15, show.diff = FALSE)

Parameter notes:

steps: the maximum number of elbow values to calculate
show.diff: defaults to FALSE. If TRUE, an additional plot is drawn that connects the elbow values and shows the differences between them, which helps identify the "elbow" that suggests the "correct" number of clusters.

sjc.elbow() plots the elbow values of a k-means cluster analysis. It runs k-means on the supplied data frame and produces two plots: one showing the elbow values, and one showing the difference between each "step" (that is, between adjacent elbow values) on the y-axis. An increase in the second plot may indicate the elbow criterion, and therefore the "correct" number of clusters.

library(effects)
library(sjPlot)
library(ggplot2)
sjc.elbow(data, show.diff = FALSE)

In the elbow plot below, the inflection point of the curve lies at approximately 5:

2. Use the NbClust() function to verify the elbow value

From the elbow plot above, the inflection point of the curve is at 3. You can also use the NbClust() function in the NbClust package, which by default computes 26 different indices to help determine the final number of clusters:

NbClust(data = NULL, diss = NULL, distance = "euclidean", min.nc = 2, max.nc = 15, method = NULL, index = "all", alphaBeale = 0.1)

Parameter notes:

diss: the dissimilarity matrix. The default is NULL; if a dissimilarity matrix is supplied, distance should be set to NULL.
distance: the distance measure used to compute the dissimilarity matrix. Valid values are "euclidean", "maximum", "manhattan", "canberra", "binary", "minkowski", and NULL. If distance is not NULL, diss must be NULL.
min.nc: the minimum number of clusters
max.nc: the maximum number of clusters
method: the clustering method to use. Valid values are "ward.D", "ward.D2", "single", "complete", "average", "mcquitty", "median", "centroid", and "kmeans".
index: the index to compute. NbClust() provides 30 indices; the default, "all", computes the 26 indices other than Gap, Gamma, Gplus, and Tau.
alphaBeale: the significance level for Beale's index

Use the NbClust() function to determine the optimal number of clusters for k-means:

library(NbClust)
nc <- NbClust(data, min.nc = 2, max.nc = 15, method = "kmeans")
barplot(table(nc$Best.nc[1, ]), xlab = "Number of Clusters", ylab = "Number of Criteria", main = "Number of Clusters Chosen by 26 Criteria")

The bar chart shows that 3 clusters is supported by the largest number of indices, so the number of clusters for k-means can reasonably be set to 3.
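The majority rule behind the bar chart can be sketched with base R alone. The votes vector below is made up purely for illustration (it is not NbClust output); each element stands for the number of clusters one index would propose:

```r
# Hypothetical votes: the number of clusters proposed by each of ten indices.
# These values are invented for illustration only.
votes <- c(3, 3, 2, 3, 15, 3, 2, 3, 4, 3)

# Count how many indices support each candidate number of clusters.
tally <- table(votes)

# The candidate with the most supporting indices wins.
best_k <- as.integer(names(tally)[which.max(tally)])
best_k
```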

II. Determining the optimal number of clusters for k-medoids

k-medoids clustering has two implementations, PAM and CLARA. PAM is suited to small data sets. CLARA is sampling-based: rather than working on the entire data set, it draws random samples from it and applies PAM to each sample to find the best medoids.
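A short sketch of the two implementations, using pam() and clara() from the cluster package. The iris data, k = 3, and the sample settings are illustrative assumptions:

```r
library(cluster)

x <- as.matrix(iris[, 1:4])   # illustrative data

# PAM searches for medoids using all observations.
pam_fit <- pam(x, k = 3)

# CLARA runs PAM on several subsamples (here 5 samples of 50 rows each)
# and keeps the best set of medoids.
clara_fit <- clara(x, k = 3, samples = 5, sampsize = 50)

# In both cases the medoids are actual rows of the data.
pam_fit$id.med      # row indices of the PAM medoids
clara_fit$i.med     # row indices of the CLARA medoids
```

Because CLARA only ever runs PAM on subsamples, it scales to data sets where a full PAM run would be too slow, at the cost of possibly missing the best medoids.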

The optimal number of clusters can be obtained with the pamk() function in the fpc package:

pamk(data, krange = 2:10, criterion = "asw", usepam = TRUE, scaling = FALSE, alpha = 0.001, diss = inherits(data, "dist"), critout = FALSE, ns = 10, seed = NULL, ...)

Parameter notes:

krange: an integer vector of the numbers of clusters to compare
criterion: valid values are "asw" (the default), "multiasw", and "ch"
usepam: logical. If TRUE, the PAM algorithm is used; if FALSE, the CLARA algorithm is used.
scaling: logical, whether to scale (standardize) the data. If FALSE, data is not scaled. If TRUE, scaling is done by dividing the (centered) variables by their root mean square.
diss: logical. If TRUE, data is treated as a dissimilarity matrix; if FALSE, data is treated as a matrix of observations.

Use the pamk() function to obtain the optimal number of clusters for PAM or CLARA clustering:

library(fpc)
pamk.best <- pamk(dataset)
pamk.best$nc

View the clustering result with the clusplot() function in the cluster package:

library(cluster)
clusplot(pam(dataset, pamk.best$nc))
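With criterion = "asw", the choice of k is based on the average silhouette width. The following is a rough sketch of that idea using only the cluster package. It is not pamk()'s actual implementation, and the iris data is an illustrative assumption:

```r
library(cluster)

x <- scale(iris[, 1:4])       # illustrative data
k_values <- 2:10              # same range as the krange default

# Average silhouette width of the PAM solution for each candidate k.
asw <- sapply(k_values, function(k) pam(x, k)$silinfo$avg.width)

# Pick the k whose solution has the largest average silhouette width.
best_k <- k_values[which.max(asw)]
best_k
```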

III. Assessing cluster quality (silhouette coefficient)

Cluster quality is evaluated using a similarity measure between the objects in the data set. The silhouette coefficient is such a measure: it indicates how compact each cluster is and how well separated the clusters are. Its value lies between -1 and 1. The closer the value is to 1, the more compact the cluster and the farther it is from the other clusters, and so the better the clustering.

If the silhouette coefficient sil(i) of sample i is close to 1, sample i is reasonably clustered. If sil(i) is close to -1, sample i should rather be assigned to another cluster. If sil(i) is approximately 0, sample i lies on the boundary between two clusters. The mean of sil(i) over all samples is called the silhouette coefficient of the clustering result, and it measures how reasonable and effective the clustering is.
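For reference, the standard definition is sil(i) = (b(i) - a(i)) / max(a(i), b(i)), where a(i) is the mean distance from sample i to the other members of its own cluster and b(i) is the smallest mean distance from sample i to the members of any other cluster. A minimal base R sketch of this formula follows; the iris data, k = 3, and the helper name sil_one are illustrative assumptions:

```r
x <- as.matrix(iris[, 1:4])                 # illustrative data
set.seed(42)
cl <- kmeans(x, centers = 3, nstart = 25)$cluster
d <- as.matrix(dist(x))                     # pairwise Euclidean distances

# Hypothetical helper: silhouette width of observation i.
sil_one <- function(i) {
  own <- cl == cl[i]
  if (sum(own) == 1) return(0)              # singleton clusters get 0
  # d[i, i] is 0, so dividing by (cluster size - 1) excludes i itself.
  a <- sum(d[i, own]) / (sum(own) - 1)
  # Mean distance to each other cluster; keep the nearest one.
  b <- min(sapply(setdiff(unique(cl), cl[i]),
                  function(k) mean(d[i, cl == k])))
  (b - a) / max(a, b)
}

sil_manual <- sapply(seq_len(nrow(x)), sil_one)
mean(sil_manual)                            # silhouette coefficient of the result
```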

1. The fpc package

The fpc package implements several cluster evaluation statistics, including the silhouette coefficient as avg.silwidth (average silhouette width):

library(fpc)
result <- kmeans(data, k)
stats <- cluster.stats(dist(data)^2, result$cluster)
sli <- stats$avg.silwidth

2. The silhouette() function

The silhouette() function in the cluster package computes the silhouette width of each observation; the mean of these values is the average silhouette width of the clustering:

silhouette(x, dist, dmatrix, ...)

Parameter notes:

x: an integer vector of cluster assignments, that is, the result of a clustering algorithm
dist: the dissimilarity matrix (the result of the dist() function). If dist is not specified, dmatrix must be specified.
dmatrix: a symmetric dissimilarity matrix, used instead of dist, which can be more efficient
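The two ways of supplying dissimilarities can be compared in a small sketch. The iris data and k = 3 are illustrative assumptions; the returned object is a matrix with one row per observation and the columns cluster, neighbor, and sil_width:

```r
library(cluster)

x <- scale(iris[, 1:4])                 # illustrative data
set.seed(42)
cl <- kmeans(x, centers = 3, nstart = 25)$cluster

d <- dist(x)
sil_dist    <- silhouette(cl, dist = d)                  # "dist" object
sil_dmatrix <- silhouette(cl, dmatrix = as.matrix(d))    # full symmetric matrix

# Per-observation results, then the average silhouette width.
head(sil_dist[, c("cluster", "neighbor", "sil_width")])
summary(sil_dist)$avg.width
```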

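Use silhouette() to calculate the silhouette coefficient:

library(cluster)
# PAM
dis <- dist(data)
res <- pam(dis, 3)
sil <- silhouette(res$clustering, dis)
# k-means
dis <- dist(data)^2
res <- kmeans(data, 3)
sil <- silhouette(res$cluster, dis)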

IV. Visualizing clusters

Clustering results can be visualized with ggplot2, or with the dedicated functions that some clustering-related packages provide: the factoextra, sjPlot, and cluster packages.

1. The cluster package

clusplot() function

2. The sjPlot package

sjc.qclus() function

3. The factoextra package

Two functions in this package are particularly useful: one for determining the optimal number of clusters, and one for visualizing clustering results.

(1) Determining the optimal number of clusters: fviz_nbclust()

The fviz_nbclust() function determines and visualizes the optimal number of clusters for partitioning methods, using the average silhouette width or the WSS (within-cluster sum of squares):

fviz_nbclust(x, FUNcluster = NULL, method = c("silhouette", "wss"), diss = NULL, k.max = 10, ...)

Parameter notes:

FUNcluster: the clustering function. Available values include kmeans, cluster::pam, cluster::clara, cluster::fanny, hcut, and others.
method: the index used to evaluate the optimal number of clusters
diss: the dissimilarity matrix, an object produced by the dist() function. If NULL, the dissimilarity matrix is computed with dist(data, method = "euclidean").
k.max: the maximum number of clusters to consider, at least 2

For example, cluster with kmeans and use the average silhouette width to evaluate the number of clusters:

library(factoextra)
fviz_nbclust(dataset, kmeans, method = "silhouette")

(2) Visualizing clustering results

The fviz_cluster() function visualizes the result of a clustering:

fviz_cluster(object, data = NULL, choose.vars = NULL, stand = TRUE, axes = c(1, 2), geom = c("point", "text"), repel = FALSE, show.clust.cent = TRUE, ellipse = TRUE, ellipse.type = "convex", ellipse.level = 0.95, ellipse.alpha = 0.2, shape = NULL, pointsize = 1.5, labelsize = 12, main = "Cluster plot", xlab = NULL, ylab = NULL, outlier.color = "black", outlier.shape = 19, ggtheme = theme_grey(), ...)

Parameter notes:

object: the result returned by a clustering function
data: the original data set that was clustered

Use fviz_cluster() to display the clustering result:

library(factoextra)
km.res <- kmeans(dataset, 3)
fviz_cluster(km.res, data = dataset)
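When the data has more than two variables, fviz_cluster() plots the observations on principal components. A rough base R approximation of that kind of plot is sketched below; it is not factoextra's implementation, and the iris data and k = 3 are illustrative assumptions:

```r
x <- iris[, 1:4]                        # illustrative data
set.seed(123)
km <- kmeans(scale(x), centers = 3, nstart = 25)

# Project the standardized data onto its first two principal components.
pca <- prcomp(x, center = TRUE, scale. = TRUE)
scores <- pca$x[, 1:2]

# Cluster centers in the same two-dimensional coordinates.
centers_pc <- apply(scores, 2, function(v) tapply(v, km$cluster, mean))

# Observations colored by cluster, centers marked with crosses.
plot(scores, col = km$cluster, pch = 19,
     xlab = "PC1", ylab = "PC2", main = "Cluster plot")
points(centers_pc, col = 1:3, pch = 4, cex = 2, lwd = 2)
```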

