MATLAB provides a series of functions for cluster analysis. The available approaches can be summarized as follows:
Method One: Direct clustering. The clusterdata function clusters the sample data in a single call. The user does not need to understand the principle or process of clustering, but the choices are narrow: the basic call clusterdata(X,cutoff) fixes the distance calculation method, so the clustering quality is limited.
Method Two: Hierarchical clustering. This method is more flexible but requires a detailed understanding of the clustering principle. It proceeds in four steps: (1) measure the similarity or dissimilarity between every pair of objects in the data set, using the pdist function to calculate the distances; (2) define the links between objects with the linkage function; (3) evaluate the resulting cluster tree with the cophenet function; and (4) create the clusters with the cluster function.
Method Three: Partitioning clustering, including K-means clustering and K-medoids clustering. It also takes a series of steps to complete and requires a clear understanding of the principle and process of clustering.
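As a minimal sketch of Method Three, the example below runs K-means through the kmeans function of the Statistics and Machine Learning Toolbox. The data matrix, the choice of two clusters, and the 'Replicates' setting are assumptions made for illustration only; the same toolbox also offers a kmedoids function with a similar calling pattern.

```matlab
% Two well-separated groups of 2-D points (made-up data for illustration)
X = [1.0 1.1; 1.2 0.9; 0.8 1.0; 5.0 5.1; 5.2 4.9; 4.8 5.0];

rng(1);                                   % fix the seed so the run is repeatable
[idx, C] = kmeans(X, 2, 'Replicates', 5); % idx: label per row, C: 2-by-2 centroids

% Rows 1-3 should share one label and rows 4-6 the other;
% which group is called 1 or 2 depends on the initialization
disp(idx');
disp(C);
```

Unlike hierarchical clustering, the number of clusters has to be chosen before the call.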
The following sections introduce the relevant functions and clustering methods in MATLAB.
1. Introduction to the related functions in MATLAB
1.1 pdist function
Call format: Y = pdist(X, 'metric')
Description: Calculates the distance between each pair of objects in the data matrix X, using the method specified by 'metric'.
X: An m-by-n matrix, treated as a data set of m objects, each with n dimensions.
'metric' takes the following values:
'euclidean': Euclidean distance (default)
'seuclidean': standardized Euclidean distance
'mahalanobis': Mahalanobis distance
'cityblock': city block distance
'minkowski': Minkowski distance
'cosine': one minus the cosine of the angle between the objects
'correlation': one minus the sample correlation between the objects
'jaccard': one minus the Jaccard coefficient
'chebychev': Chebychev distance
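The following sketch shows what pdist returns for a few of the metrics listed above. The three sample points are made up so that the distances can be checked by hand; the output vector lists the pairs in the order (1,2), (1,3), (2,3).

```matlab
% Three objects (rows) with two dimensions each; values chosen for easy checking
X = [0 0; 3 4; 6 8];

Y_euc  = pdist(X);                  % Euclidean (default): [5 10 5]
Y_city = pdist(X, 'cityblock');     % city block:          [7 14 7]
Y_cheb = pdist(X, 'chebychev');     % Chebychev:           [4 8 4]
Y_mink = pdist(X, 'minkowski', 1);  % Minkowski with exponent 1 equals city block

disp([Y_euc; Y_city; Y_cheb; Y_mink]);
```

With m = 3 objects, each result is a row vector of m*(m-1)/2 = 3 elements.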
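1.2 squareform function
Call format: Z = squareform(Y, ...)
Description: Converts the distance vector returned by pdist into a square, symmetric distance matrix, or converts such a matrix back into vector form.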
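A short sketch of squareform in both directions, using made-up points. It assumes the input vector comes from pdist, as in the previous section.

```matlab
X = [0 0; 3 4; 6 8];     % illustrative data, three objects

Y = pdist(X);            % 1-by-3 vector: distances for pairs (1,2), (1,3), (2,3)
D = squareform(Y);       % 3-by-3 symmetric matrix with zeros on the diagonal
Y_back = squareform(D);  % applying it to the matrix returns the vector form

disp(D);                 % D(i,j) is the distance between objects i and j
```

The square form is convenient for looking up a single pair, while the vector form is what linkage expects.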
1.3 linkage function
Call format: Z = linkage(Y, 'method')
Input description: Y is the row vector of m*(m-1)/2 elements returned by the pdist function. The cluster tree is computed with the algorithm specified by 'method'.
'method' takes the following values:
'single': shortest distance (default)
'complete': longest distance
'average': unweighted average distance
'weighted': weighted average distance
'centroid': centroid distance
'median': weighted centroid distance
'ward': inner squared distance (minimum variance algorithm)
Return value description: Z is an (m-1)-by-3 matrix containing the cluster tree information. The first two columns are indices indicating which two clusters are merged at each step, and the third column is the distance between those two clusters. The m original samples are numbered 1 to m, and each newly formed cluster is numbered m+1, m+2, ... in order of creation.
The Z matrix can be displayed more intuitively as a cluster tree with dendrogram(Z). The tree is drawn with inverted-U-shaped links: the samples sit at the bottom, are merged level by level, and end as a single cluster at the top. The height on the vertical axis is the distance at which two clusters are merged. The number of leaf nodes shown at the bottom defaults to 30 and can be changed with the parameter n in dendrogram(Z,n), where 1 < n < m. dendrogram(Z,0) corresponds to n = m and shows all leaf nodes.
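To make the layout of Z concrete, here is a small sketch on five made-up one-dimensional samples, chosen so that no two distances tie. It compares 'single' and 'complete' linkage; the merge order is the same for both, but the merge distances differ.

```matlab
X = [0; 1; 4; 6; 15];          % five samples, so m = 5 (illustrative values)
Y = pdist(X);                  % 10 pairwise distances

Zs = linkage(Y, 'single');     % shortest distance method
Zc = linkage(Y, 'complete');   % longest distance method

% With single linkage: samples 1 and 2 merge at distance 1 (new cluster 6),
% samples 3 and 4 merge at distance 2 (new cluster 7), clusters 6 and 7
% merge at distance 3 (new cluster 8), and sample 5 joins cluster 8 at 9.
disp(Zs);

% With complete linkage the last two merges happen at distances 6 and 15.
disp(Zc);
```

Indices above m in the first two columns always refer to clusters created in earlier rows.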
1.4 dendrogram function
Call format: [H, T, ...] = dendrogram(Z, p, ...)
Description: Generates a dendrogram (tree diagram) that shows only the top p nodes.
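A minimal plotting sketch, again on made-up data. It requests all leaf nodes with p = 0 and captures the outputs; the axis labels are an illustrative addition, not something dendrogram sets by itself.

```matlab
X = [0; 1; 4; 6; 15];                 % illustrative samples, m = 5
Z = linkage(pdist(X), 'single');

% p = 0 shows every leaf node instead of collapsing the tree
[H, T, outperm] = dendrogram(Z, 0);
xlabel('Sample index');
ylabel('Merge distance');

% H holds one line handle per link, T assigns each sample to a leaf node,
% and outperm is the left-to-right order of the leaves on the axis
fprintf('%d links drawn\n', numel(H));
disp(outperm);
```

For large data sets, a positive p keeps the plot readable by collapsing the lower part of the tree.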
1.5 cophenet function
Call format: c = cophenet(Z, Y)
Description: Calculates the cophenetic correlation coefficient from Y, generated by the pdist function, and Z, generated by the linkage function. The coefficient measures how well the binary cluster tree produced by a given algorithm matches the actual data: it is the correlation between the distances at which pairs of elements are joined in the tree and the actual distances computed by pdist. The closer the value is to 1, the better the tree reflects the data. The inconsistent function can also be used to quantify the differences between nodes in a hierarchical cluster tree.
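The sketch below illustrates that definition on made-up data. It uses the two-output form of cophenet, which also returns the cophenetic distances d in the same pair order as Y, and then recomputes the coefficient with corrcoef as a cross-check.

```matlab
X = [0; 1; 4; 6; 15];            % illustrative samples
Y = pdist(X);                    % actual pairwise distances
Z = linkage(Y, 'single');        % cluster tree

% c: cophenetic correlation coefficient
% d: height of the link at which each pair is first joined in the tree
[c, d] = cophenet(Z, Y);

R = corrcoef(Y, d);              % ordinary correlation between Y and d
fprintf('cophenet: %.4f, corrcoef: %.4f\n', c, R(1,2));
disp(d);                         % here: [1 3 3 9 3 3 9 2 9 9]
```

The two printed values should agree, since the coefficient is the linear correlation between the two distance vectors.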
1.6 cluster function
Call format: T = cluster(Z, ...)
Description: Creates clusters from the output Z of the linkage function.
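Two common ways of cutting the tree can be sketched as follows, using the 'maxclust' option and the 'cutoff' option with the 'distance' criterion. The data and the threshold values are assumptions chosen so that both calls yield the same grouping.

```matlab
X = [0; 1; 4; 6; 15];                 % illustrative samples
Z = linkage(pdist(X), 'single');      % merge distances are 1, 2, 3 and 9

% Ask for at most three clusters
T1 = cluster(Z, 'maxclust', 3);

% Cut the tree at height 2.5: only the merges at distances 1 and 2 survive
T2 = cluster(Z, 'cutoff', 2.5, 'criterion', 'distance');

% Both give the groups {1,2}, {3,4} and {5}; label numbers are arbitrary
disp([T1 T2]);
```

T has one entry per sample, giving the cluster number that sample is assigned to.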
1.7 clusterdata function
Call format: T = clusterdata(X, ...)
Description: Creates clusters directly from the data. For 0 < cutoff < 2, T = clusterdata(X, cutoff) is equivalent to the following set of commands:
Y = pdist(X, 'euclidean');
Z = linkage(Y, 'single');
T = cluster(Z, 'cutoff', cutoff);
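The equivalence can be checked with a short sketch. The data matrix and the cutoff of 0.9 are made up for illustration; the comparison only verifies that both routes assign the samples to the same groups.

```matlab
% Made-up data: six objects with two dimensions each
X = [1.0 1.1; 1.2 0.9; 0.8 1.0; 5.0 5.1; 5.2 4.9; 4.8 5.0];
cutoff = 0.9;

% One-step route
T1 = clusterdata(X, cutoff);

% Step-by-step route with the same defaults
Y  = pdist(X, 'euclidean');
Z  = linkage(Y, 'single');
T2 = cluster(Z, 'cutoff', cutoff);

% Compare which samples share a cluster, ignoring the label numbers
sameGrouping = isequal(bsxfun(@eq, T1, T1'), bsxfun(@eq, T2, T2'));
disp(sameGrouping);
```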
2. Designing a MATLAB clustering program
2.1 Method One: One-step clustering
X = [11978 12.5 93.5 31908; ...; 57500 67.6 238.0 15900];
T = clusterdata(X, 0.9)
2.2 Method Two and Method Three design flow: Stepwise clustering
Step 1: Use the pdist function to calculate the distance matrix. There are many ways to calculate the distance; it is best to standardize the data with the zscore function first.
X2 = zscore(X);                   % standardize the data
Y2 = pdist(X2);                   % calculate the distances
Step 2: Define the links between the objects.
Z2 = linkage(Y2);
Step 3: Evaluate the clustering information.
C2 = cophenet(Z2, Y2);            % 0.94698
Step 4: Create the clusters and draw the dendrogram.
T = cluster(Z2, 'maxclust', 6);   % at most 6 clusters
dendrogram(Z2);
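The value returned by cophenet in Step 3 can also guide the choice of linkage method in Step 2. The sketch below is one possible way to do this, not part of the original program: it tries several methods on made-up data, keeps the one with the highest cophenetic correlation coefficient, and then creates the clusters. The variable names linkMethods and best and the choice of two clusters are illustrative assumptions.

```matlab
% Made-up data with two columns on very different scales
X = [1.0 200; 1.2 220; 0.9 210; 5.0 800; 5.3 790; 4.8 820];

X2 = zscore(X);                   % standardize each column
Y2 = pdist(X2);                   % Euclidean distances on standardized data

linkMethods = {'single', 'complete', 'average', 'ward'};
c = zeros(1, numel(linkMethods));
for k = 1:numel(linkMethods)
    Zk   = linkage(Y2, linkMethods{k});
    c(k) = cophenet(Zk, Y2);      % how well this tree matches the distances
end

[~, best] = max(c);               % index of the best-matching method
Z2 = linkage(Y2, linkMethods{best});
T  = cluster(Z2, 'maxclust', 2);  % cut the chosen tree into two clusters

fprintf('Chosen method: %s (c = %.4f)\n', linkMethods{best}, c(best));
disp(T');
```

A high coefficient only says that the tree preserves the pairwise distances well; whether the resulting clusters are meaningful still has to be judged against the data.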
MATLAB provides the cophenet and inconsistent functions for assessing a cluster tree. cophenet tests how well the binary cluster tree produced by a given algorithm matches the actual data, by comparing the distances between elements in the tree with the actual distances computed by pdist. inconsistent quantifies the differences between nodes in a hierarchical cluster tree, and its output can be used as the cutoff criterion for cluster.
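As a small illustration of inconsistent, the sketch below computes the coefficients for a tree built from made-up data. Each row of the result describes one link: the mean and standard deviation of the link heights included in the calculation, the number of links included, and the inconsistency coefficient. By default, the calculation covers the link itself and the links one level below it.

```matlab
X = [0; 1; 4; 6; 15];               % illustrative samples
Z = linkage(pdist(X), 'single');    % merge distances are 1, 2, 3 and 9

I = inconsistent(Z);                % (m-1)-by-4 matrix, one row per link
disp(I);

% Links that join two original samples have nothing below them,
% so their coefficient is 0 (rows 1 and 2 here).
% Row 3 covers the heights 1, 2 and 3: mean 2, standard deviation 1,
% so its coefficient is (3 - 2) / 1 = 1.

% Without a 'criterion' option, cluster compares the cutoff with these coefficients
T = cluster(Z, 'cutoff', 0.9);
disp(T');
```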