Data Analysis, Part 4: Cluster Analysis (Partitioning)

Clustering is the process of dividing a dataset into subsets, each of which is called a cluster. A good clustering makes the objects within a cluster highly similar to one another and dissimilar to the objects in other clusters; the set of clusters produced is called a clustering. Different clustering algorithms applied to the same dataset may produce different clusterings.

Cluster analysis is used to gain insight into the distribution of data: we can observe the characteristics of each cluster and then analyze particular clusters in more depth. Because objects within a cluster are similar to each other and dissimilar to objects in other clusters, clusters can be regarded as "hidden" classes of the dataset, and cluster analysis may reveal previously unknown groupings.

Clustering learns by observation: it does not need the class membership of each training example, so it belongs to unsupervised learning. Unsupervised learning looks for hidden structure in unlabeled data; the internal structure of the data describes the objects to be grouped and determines how best to group them. The main clustering algorithms fall into the following four categories:

    • Partitioning clustering
    • Hierarchical clustering
    • Density-based methods
    • Grid-based methods

This article introduces the simplest family, partitioning clustering. Given a set of n objects, a partitioning method constructs k groups of the data, where each group represents a cluster and k <= n; in other words, the data are divided into k groups such that each group contains at least one object. The basic partitioning methods produce mutually exclusive clusters, so each object belongs to exactly one cluster. Reaching the global optimum would require exhaustively enumerating every possible partition, which is computationally prohibitive, so the commonly used partitioning methods are heuristics such as k-means and k-medoids that iteratively improve clustering quality and converge to a local optimum. These heuristic methods are well suited to discovering small and medium-sized spherical clusters.

First, K-means

K-means is a centroid-based technique that requires every attribute to be numeric. The algorithm defines the centroid of a cluster as the mean of the points within that cluster. The K-means procedure is:

Input: k (the number of clusters), D (a dataset containing n objects)

Output: A collection of k clusters

Algorithm:

  1. Arbitrarily choose k objects from D as the initial cluster centers;
  2. repeat
  3. assign each object to the cluster whose mean it is most similar to;
  4. update the cluster means, i.e. recompute the mean of the objects in each cluster;
  5. until the cluster means no longer change
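
For illustration, here is a minimal from-scratch sketch of this loop in R. It is a toy version, not the stats::kmeans implementation; the function name simple_kmeans is made up, and edge cases such as empty clusters are not handled.

simple_kmeans <- function(X, k, max.iter = 100) {
  # 1. arbitrarily choose k objects as the initial cluster centers
  centers <- X[sample(nrow(X), k), , drop = FALSE]
  for (iter in seq_len(max.iter)) {
    # distances from every object to every current center
    d <- as.matrix(dist(rbind(centers, X)))[-(1:k), 1:k]
    # 2. assign each object to its most similar (nearest) center
    cl <- apply(d, 1, which.min)
    # 3. recompute the mean of the objects in each cluster
    new_centers <- t(sapply(1:k, function(j) colMeans(X[cl == j, , drop = FALSE])))
    # 4. stop when the cluster means no longer change
    if (all(abs(new_centers - centers) < 1e-8)) break
    centers <- new_centers
  }
  list(cluster = cl, centers = centers)
}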

K-means is not guaranteed to converge to the global optimum; it often terminates at a local optimum, and the result may depend on the random choice of the initial cluster centers.

K-means is not suitable for non-convex clusters or clusters of very different sizes. It is also sensitive to outliers, because a small number of outliers can strongly influence the mean and thus distort the assignment of the remaining objects.

In R, the kmeans() function in the stats package implements K-means clustering:

kmeans(x, centers, iter.max = 10, nstart = 1,
       algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"), trace = FALSE)

Parameter notes:

    • x: a numeric vector or matrix
    • centers: the number of clusters k
    • iter.max: the maximum number of iterations, default 10
    • nstart: the number of random sets of initial centers to try, default 1
    • algorithm: the algorithm to use, default "Hartigan-Wong"

The value returned by the function:

    • cluster: an integer vector with values from 1 to k, giving the cluster to which each point is assigned
    • centers: the centers of the clusters, one row per cluster
    • size: the number of points in each cluster
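
A short usage sketch of kmeans() on synthetic data (the data below are generated purely for illustration):

set.seed(42)
# two well-separated groups of 2-D points
x <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
           matrix(rnorm(100, mean = 5), ncol = 2))
fit <- kmeans(x, centers = 2, nstart = 25)   # 25 random starts reduce dependence on initialization
fit$size      # number of points in each cluster
fit$centers   # the two cluster centers
head(fit$cluster)
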
Second, K-medoids

K-medoids does not use the mean of the objects within a cluster as the reference point; instead it chooses an actual data object p to represent each cluster, and every remaining object is assigned to the cluster whose representative object it is most similar to. K-medoids clustering is usually implemented with the PAM (Partitioning Around Medoids) algorithm.

The PAM algorithm proceeds as follows:

Input: k (the number of clusters), D (a dataset containing n objects)

Output: A collection of k clusters

Algorithm:

  • Randomly choose k objects from D as the initial representative objects (seeds);
  • repeat
  • assign each remaining object to the cluster of its nearest representative object p;
  • randomly select a non-representative object r;
  • compute the total cost S of replacing the representative object p with r;
  • if S < 0, replace p with r, forming a new set of k representative objects;
  • until the representative objects no longer change

When outliers are present, the K-medoids method is more robust than K-means ("robustness" here means that the method retains its performance under perturbations of the data), because a medoid is far less influenced by outliers or other extreme values than a mean. However, when n and k are large, the computational cost of K-medoids becomes much higher than that of K-means.

In R, the pam() function in the cluster package implements K-medoids clustering:

" Dist "  = C ("Euclidean""Manhattan"= NULL, stand = FALSE, ...)

Parameter notes:

    • diss: logical. When diss is TRUE, x is treated as a dissimilarity matrix; when it is FALSE, x is treated as a matrix of observations. By default diss is TRUE whenever x is a dissimilarity object.
    • metric: character, the metric used to compute the dissimilarity between two observations. Valid values are "euclidean" and "manhattan", giving Euclidean and Manhattan distance respectively. If diss is TRUE, this parameter is ignored.
    • medoids: default NULL; optionally specifies the initial medoids (cluster centers).
    • stand: logical, default FALSE. If TRUE, the columns of x are standardized (made dimensionless) before the dissimilarities are computed: each variable (column) is standardized by subtracting its mean and dividing by its mean absolute deviation. If x is already a dissimilarity matrix, this parameter is ignored.

The object returned by the function:

    • medoids: the medoid (representative object) of each cluster
    • clustering: a vector giving the cluster to which each observation is assigned
    • silinfo: silhouette information, including the silhouette width of each observation, the average silhouette width of each cluster, and the average silhouette width of the whole dataset.
Third, cluster assessment

When we apply a clustering method to a dataset, how do we evaluate the result? In general, cluster evaluation involves the following tasks:

    • Assessing clustering tendency: clustering is only meaningful if the data contain non-random structure, so we first check that the dataset is not just uniformly (randomly) distributed.
    • Determining the number of clusters in a dataset: estimate the number of clusters before running the clustering algorithm.
    • Measuring clustering quality: after a clustering method has been applied to the dataset, the quality of the resulting clusters needs to be evaluated.

1. Assessing the clustering tendency

Clustering requires that the data are not uniformly distributed. The Hopkins statistic is a spatial statistic used to test the spatial randomness of a variable's distribution in space.

The comato package provides a hopkins.index() function that computes the Hopkins statistic: the value is close to 0.5 if the data are uniformly distributed and close to 1 if the distribution is highly skewed (i.e. the data are clusterable).
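
In one common formulation, the Hopkins statistic is H = sum(u_i) / (sum(u_i) + sum(w_i)), where the u_i are nearest-neighbour distances from points sampled uniformly over the data space to the real data, and the w_i are nearest-neighbour distances from randomly sampled real data points to the rest of the data. For uniformly distributed data the two sums are comparable and H is near 0.5, while strongly clustered data make the w_i small and push H towards 1.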

library(comato)
hopkins.index(data)

Before using hopkins.index() to assess the spatial randomness of the data, the data should be standardized (made dimensionless).

2. Determining the optimal number of clusters

The elbow method is typically used to determine the best number of clusters. It rests on the following observation: increasing the number of clusters reduces the sum of the within-cluster variances of the clusters. For each k > 0, compute the total within-cluster variance Var(k) and plot the curve of Var against k; the first (or most pronounced) inflection point of the curve suggests the correct number of clusters.
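
The same idea can be sketched directly with kmeans(), using the total within-cluster sum of squares as the variance measure (a simple sketch, not the sjc.elbow() implementation described below; data is assumed to be a numeric matrix or data frame):

wss <- sapply(1:10, function(k) kmeans(data, centers = k, nstart = 10)$tot.withinss)
plot(1:10, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
# look for the k at which the curve bends (the "elbow")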

The sjc.elbow() function in the sjPlot package implements the elbow method and can be used to determine the number of clusters for K-means clustering:

sjc.elbow(data, steps = 15, show.diff = FALSE)

Parameter notes:

    • steps: the maximum number of elbow values (candidate cluster counts) to compute
    • show.diff: default FALSE. If TRUE, an additional plot is drawn that connects the differences between successive elbow values, which helps identify the "elbow" and thus suggests the "correct" number of clusters.

The sjc.elbow() function plots the elbow values of a K-means cluster analysis: it runs K-means on the supplied data frame and produces two plots, one showing the elbow values themselves and the other showing the differences between adjacent elbow values (the "steps") on the y-axis. The inflection point of the curve in the second plot may indicate the "correct" number of clusters.

sjc.elbow(data, show.diff = FALSE)

From the elbow plot you can see that the inflection point of the curve is at 3. You can also use the NbClust() function from the NbClust package, which provides 26 different indices to help determine the final number of clusters.

library(NbClust)
nc <- NbClust(data, min.nc = 2, max.nc = 15, method = "kmeans")
barplot(table(nc$Best.nc[1,]),
        xlab = "Number of Clusters",
        ylab = "Number of Criteria",
        main = "Number of Clusters Chosen by 26 Criteria")

From the bar chart you can see that 3 clusters is supported by the largest number of indices, so we can conclude that the number of clusters for K-means clustering should be 3.

3. Measuring clustering quality

How do we compare the clusterings produced by different algorithms? In general, methods for measuring clustering quality fall into two categories depending on whether a benchmark is available: external (extrinsic) methods and internal (intrinsic) methods.

A benchmark is an ideal clustering, usually built by experts. If a benchmark is available, the clustering can be compared against it; such methods are called external methods. If no benchmark is available, the quality of the clustering is evaluated by considering how well the clusters are separated; such methods are called internal methods.

(1) External methods

When a benchmark is available, BCubed precision and recall can be used to evaluate clustering quality.

BCubed evaluates the precision and recall of every object in a clustering of a given dataset, relative to the benchmark. The precision of an object measures how many of the other objects in the same cluster belong to the same category as that object; the recall of an object reflects how many objects of the same category are assigned to the same cluster. A sketch of this computation is given below.
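
As a concrete illustration, here is a small sketch of how the per-object BCubed precision and recall described above could be computed; the helper bcubed() is hypothetical and not taken from any package:

# cluster: integer vector of cluster labels; truth: vector of benchmark category labels
bcubed <- function(cluster, truth) {
  n <- length(cluster)
  prec <- rec <- numeric(n)
  for (i in seq_len(n)) {
    same_cluster  <- cluster == cluster[i]
    same_category <- truth == truth[i]
    # precision of object i: fraction of objects in its cluster that share its category
    prec[i] <- sum(same_cluster & same_category) / sum(same_cluster)
    # recall of object i: fraction of objects of its category that ended up in its cluster
    rec[i]  <- sum(same_cluster & same_category) / sum(same_category)
  }
  c(precision = mean(prec), recall = mean(rec))
}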

(2) Internal methods

When no benchmark is available for a dataset, an internal method must be used to evaluate clustering quality. In general, internal methods evaluate a clustering by examining how well the clusters are separated and how compact they are, and they are usually implemented using a similarity metric between the objects of the dataset.

The silhouette coefficient is such a similarity metric. Its value lies between -1 and 1: the closer the value is to 1, the more compact the cluster and the better the clustering. When the silhouette coefficient is close to 1, the cluster is compact and far away from the other clusters.

If the silhouette coefficient sil(i) of sample i is close to 1, the sample is clustered reasonably; if sil(i) is close to -1, sample i should rather be assigned to another cluster; and if sil(i) is approximately 0, sample i lies on the boundary between two clusters. The mean of sil over all samples is called the silhouette coefficient of the clustering result and is a measure of how reasonable and effective the clustering is.
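
Concretely, for an observation i, let a(i) be the average distance from i to the other observations in its own cluster, and b(i) the smallest average distance from i to the observations of any other cluster; the silhouette of i is then sil(i) = (b(i) - a(i)) / max(a(i), b(i)).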

The fpc package implements several clustering evaluation indices, including the average silhouette width avg.silwidth:

library(fpc)
fit <- kmeans(data, 3)
stats <- cluster.stats(dist(data)^2, fit$cluster)
sli <- stats$avg.silwidth

The cluster package also provides a function for computing silhouette widths, silhouette():

library(cluster)
dis <- dist(data)^2
res <- kmeans(data, 3)          # a pam() result, e.g. pam(dis, 3), can be used in the same way
sil <- silhouette(res$cluster, dis)
Fourth, K-means clustering in practice

Effective cluster analysis is a multi-step process in which every decision can affect the quality and usefulness of the result. Here we use K-means clustering to analyze a dataset of 13 chemical constituents measured in wines, which is available from the rattle package.

Install.packages ("Rattle") data (Wine,package="Rattle") Head (wine) Type Alcohol malic Ash alcalinity magnesium phenols flavanoids nonflavanoids proanthocyanins Color Hue Dilution Proline1    1   14.23  1.71 2.43       15.6       127    2.80       3.06          0.28            2.29  5.64 1.04     3.92    10652    1   13.20  1.78 2.14       11.2        -    2.65       2.76          0.26            1.28  4.38 1.05     3.40    10503    1   13.16  2.36 2.67       18.6       101    2.80       3.24          0.30            2.81  5.68 1.03     3.17    11854    1   14.37  1.95 2.50       16.8       113    3.85       3.49          0.24            2.18  7.80 0.86     3.45    14805    1   13.24  2.59 2.87       21.0       118    2.80       2.69          0.39            1.82  4.32 1.04     2.93     7356    1   14.20  1.76 2.45       15.2        the    3.27       3.39          0.34            1.97  6.75 1.05     2.85    1450

1. Select the appropriate variables

The first variable, Type, can be ignored; the remaining 13 variables are the wine's chemical constituents, and these 13 variables are used for the cluster analysis. Because the variables differ greatly in scale, the data must be standardized before clustering.

2. Standardize the data

Use the scale() function to standardize the data (make it dimensionless):

df <- scale(wine[,-1])

3. Assess the clustering tendency

The values in df are now dimensionless, so we can use hopkins.index() directly to assess the spatial randomness of the data. The closer the result is to 1, the more highly skewed (the less random) the spatial distribution of the data, and hence the more clusterable it is.

Install.packages ("Comato")
Library (Comato) Hopkins.index (DF) [10.7412846

4. Determine the number of clusters

Using the sjc.elbow() function in the sjPlot package to compute the elbow values, the inflection point of the curve is at about 3, which suggests that the K-means cluster analysis should use roughly 3 clusters.

Install.packages ("effects") install.packages ("sjplot")   = FALSE)

From the elbow plot you can see that the inflection point of the curve is at 3. You can also use the NbClust() function from the NbClust package, which provides 26 different indices to help determine the final number of clusters.

install.packages("NbClust")
library(NbClust)
nc <- NbClust(df, min.nc = 2, max.nc = 15, method = "kmeans")
barplot(table(nc$Best.nc[1,]),
        xlab = "Number of Clusters",
        ylab = "Number of Criteria",
        main = "Number of Clusters Chosen by 26 Criteria")

From the bar chart you can see that 3 clusters is supported by the largest number of indices, so we can conclude that the number of clusters for K-means clustering should be 3.

5. Assess the clustering quality

We use the silhouette coefficient to judge the quality of the clustering. Its value lies between -1 and 1: the closer the value is to 1, the more compact the cluster and the better the clustering. When the silhouette coefficient is close to 1, the cluster is compact and far away from the other clusters.

Install.packages ("FPC")

Library (FPC)for in2:9<-<-cluster.stats (Dist (DF ^2<- stats$avg.silwidthprint (Paste0 (k,'-' ) , SLI)}

When the number of clusters is 3, the average silhouette width is 0.45, the best of the values tried.

[1]"2-0.425791262898175"[1]"3-0.450837233419168"[1]"4-0.35109709657011"[1]"5-0.378169006474844"[1]"6-0.292436629924875"[1]"7-0.317163857046711"[1]"8-0.229405778112672"[1]"9-0.291438101137107"

