Hierarchical clustering
Produce nested sets of clusters
Functions
cluster | Construct agglomerative clusters from a hierarchical cluster tree
clusterdata | Construct agglomerative clusters from data
cophenet | Cophenetic correlation coefficient
inconsistent | Inconsistency coefficient
linkage | Agglomerative hierarchical cluster tree
pdist | Pairwise distance between pairs of objects
sequentialfs | Sequential feature selection
squareform | Format distance matrix
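As a rough illustration of how these functions fit together, the sketch below runs a small made-up data set through the usual chain. The data values and variable names are assumptions chosen for this example, not part of the toolbox documentation.

```matlab
% Toy data: five observations (rows) in two dimensions; values are illustrative only.
X = [0 0; 0 1; 5 5; 5 6; 10 0];

Y = pdist(X);                    % pairwise distances as a 1-by-10 row vector
D = squareform(Y);               % the same distances as a symmetric 5-by-5 matrix
Z = linkage(Y);                  % (m-1)-by-3 tree; single linkage by default
coph = cophenet(Z, Y);           % cophenetic correlation between tree and distances
I = inconsistent(Z);             % one row of inconsistency statistics per link
T = cluster(Z, 'maxclust', 3);   % cluster index for each observation
```

Each step consumes the output of the previous one, so intermediate results such as Y and Z can be reused when trying several cutoffs.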
cluster
Construct agglomerative clusters from an agglomerative hierarchical cluster tree
Syntax
T = cluster(Z,'cutoff',c)
T = cluster(Z,'cutoff',c,'depth',d)
T = cluster(Z,'cutoff',c,'criterion',criterion)
T = cluster(Z,'maxclust',n)
Description
T = cluster(Z,'cutoff',c) constructs clusters from the agglomerative hierarchical cluster tree Z, as generated by the linkage function. Z is a matrix of size (m-1)-by-3, where m is the number of observations in the original data. c is the threshold for cutting Z into clusters. A cluster is formed when a node and all of its subnodes have inconsistent values less than c. All leaves at or below the node are grouped into one cluster. T is a vector of length m containing the cluster assignment of each observation.
If c is a vector, T is a matrix of cluster assignments with one column per cutoff value.
T = cluster(Z,'cutoff',c,'depth',d) evaluates inconsistent values by looking to a depth d below each node. The default depth is 2.
T = cluster(Z,'cutoff',c,'criterion',criterion) uses the specified criterion to form clusters, where criterion is 'inconsistent' (default) or 'distance'. The 'distance' criterion measures the height of a node as the distance between the two subnodes merged at that node. All leaves at or below a node with height less than c are grouped into one cluster.
T = cluster(Z,'maxclust',n) constructs a maximum of n clusters using the 'distance' criterion. cluster finds the smallest height at which a horizontal cut through the tree leaves n or fewer clusters.
If n is a vector, T is a matrix of cluster assignments with one column per maximum value.
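The difference between a height cutoff and a cluster-count limit can be seen on a small hand-made data set. The following is a minimal sketch under assumed toy values; the variable names Tdist, Tmax, and Tmulti are made up for illustration.

```matlab
% Toy data: two tight pairs and one isolated point (illustrative values).
X = [0 0; 0 1; 5 5; 5 6; 10 0];
Z = linkage(pdist(X));     % single linkage: merge heights 1, 1, sqrt(41), sqrt(50)

% Cut by height: only nodes lower than 2 are kept together, so the two
% close pairs remain clusters and the fifth point is left on its own.
Tdist = cluster(Z, 'cutoff', 2, 'criterion', 'distance');

% Ask for at most three clusters; on this data the grouping is the same.
Tmax = cluster(Z, 'maxclust', 3);

% A vector of limits returns one column of assignments per limit.
Tmulti = cluster(Z, 'maxclust', 2:3);   % 5-by-2 matrix
```

Cluster index values are labels only, so it is safer to compare which observations share a label than to compare the label numbers themselves.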
Example
Compare clusters from the Fisher iris data with the species:
load fisheriris
d = pdist(meas);
Z = linkage(d);
c = cluster(Z,'maxclust',3:5);
crosstab(c(:,1),species)
ans =
     0     0     2
     0    50    48
    50     0     0
crosstab(c(:,2),species)
ans =
     0     0     1
     0    50    47
     0     0     2
    50     0     0
crosstab(c(:,3),species)
ans =
     0     4     0
     0    46    47
     0     0     1
     0     0     2
    50     0     0
clusterdata
Construct agglomerative clusters from data
Syntax
T = clusterdata(X,cutoff)
T = clusterdata(X,Name,Value)
Description
T = clusterdata(X,cutoff) clusters the observations in X, using cutoff as described under Input parameters.
T = clusterdata(X,Name,Value) clusters with additional options specified by one or more Name,Value pair arguments.
Input parameters
X | A matrix with two or more rows. Each row represents an observation, and each column represents a category or dimension.
cutoff | When 0 < cutoff < 2, clusterdata forms clusters when inconsistent values are greater than cutoff. When cutoff is an integer greater than or equal to 2, clusterdata interprets cutoff as the maximum number of clusters to keep in the hierarchical tree generated by linkage.
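The two readings of cutoff can be tried side by side. This is only a sketch on assumed toy data; Tn and Tc are hypothetical variable names, and the result of the fractional cutoff depends on the inconsistency coefficients of the particular tree, so no output is claimed for it.

```matlab
% Toy data: two tight pairs and one isolated point (illustrative values).
X = [0 0; 0 1; 5 5; 5 6; 10 0];

% Integer cutoff >= 2: read as the maximum number of clusters.
Tn = clusterdata(X, 3);

% Cutoff between 0 and 2: read as a threshold on the inconsistency coefficient.
Tc = clusterdata(X, 1.2);
```

Both calls return one cluster index per row of X.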
Name-value pair arguments
Specify optional comma-separated pairs of Name,Value arguments. Name is the argument name and Value is the corresponding value. Name must appear inside single quotes ('). You can specify several name-value pair arguments in any order, as Name1,Value1,...,NameN,ValueN.
Input parameters
'criterion' | 'inconsistent' or 'distance'
'cutoff' | Cutoff for the inconsistent or distance measure, a positive scalar. When 0 < cutoff < 2, clusterdata forms clusters when inconsistent values are greater than cutoff. When cutoff is an integer greater than or equal to 2, clusterdata interprets cutoff as the maximum number of clusters to keep in the hierarchical tree generated by linkage.
'depth' | Depth used to compute inconsistent values, a positive integer.
'distance' | Any distance metric name accepted by pdist (the 'minkowski' option may be followed by the exponent p):
Metric | Description
'euclidean' | Euclidean distance (default).
'seuclidean' | Standardized Euclidean distance. Each coordinate difference between rows of X is scaled by dividing by the corresponding element of the standard deviation S = nanstd(X). To specify another value for S, use D = pdist(X,'seuclidean',S).
'cityblock' | City block metric.
'minkowski' | Minkowski distance. The default exponent is 2. To specify a different exponent, use D = pdist(X,'minkowski',p), where the exponent p is a positive scalar.
'chebychev' | Chebychev distance (maximum coordinate difference).
'mahalanobis' | Mahalanobis distance, using the sample covariance of X as computed by nancov. To use a different covariance, use D = pdist(X,'mahalanobis',C), where C is a symmetric positive definite matrix.
'cosine' | One minus the cosine of the angle between two points (treated as vectors).
'correlation' | One minus the sample correlation between two points (treated as sequences of values).
'spearman' | One minus the sample Spearman rank correlation between two observations (treated as sequences of values).
'hamming' | Hamming distance, the proportion of coordinates that differ.
Custom distance function | A distance function specified using @: D = pdist(X,@distfun). A distance function must have the form D2 = distfun(XI,XJ). It takes as arguments a 1-by-n vector XI, corresponding to a single row of X, and an m2-by-n matrix XJ, corresponding to multiple rows of X. distfun must accept a matrix XJ with any number of rows. distfun must return an m2-by-1 vector of distances D2, whose kth element is the distance between XI and XJ(k,:).
'linkage' | Any linkage method allowed by the linkage function: 'average', 'centroid', 'complete', 'median', 'single', 'ward', 'weighted'.
'maxclust' | The maximum number of clusters, a positive integer.
'savememory' | A string, either 'on' or 'off'. When applicable, the 'on' setting causes clusterdata to construct clusters without computing the distance matrix. savememory is applicable when both of the following hold: linkage is 'centroid', 'median', or 'ward', and distance is 'euclidean' (default). When savememory is 'on', the linkage run time is proportional to the number of dimensions (the number of columns of X). When savememory is 'off', the linkage memory requirement is proportional to N^2, where N is the number of observations. The best (least time-consuming) savememory setting therefore depends on the problem dimensions, the number of observations, and the available memory. The default savememory setting is a rough approximation of the optimal setting. Default: 'on' when X has 20 columns or fewer, or when the computer does not have enough memory to store the distance matrix; otherwise 'off'.
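The options in the table above can be combined in a single call, and a custom distance function can be checked against a built-in metric. The sketch below assumes made-up data; euclidfun is a hypothetical name for a handle written in the distfun(XI,XJ) form, and it is passed to pdist directly rather than to clusterdata.

```matlab
% Two well-separated groups of three points each (illustrative values).
X = [0 0; 0 1; 1 0; 10 10; 10 11; 11 10];

% Name-value pairs may appear in any order.
T = clusterdata(X, 'linkage', 'ward', 'savememory', 'on', 'maxclust', 2);

% A custom distance in the distfun(XI,XJ) form: XI is 1-by-n, XJ is m2-by-n,
% and the result is an m2-by-1 column. This one reproduces Euclidean distance.
% euclidfun is a hypothetical name chosen for this sketch.
euclidfun = @(XI, XJ) sqrt(sum(bsxfun(@minus, XJ, XI).^2, 2));
Dcustom = pdist(X, euclidfun);
Dbuiltin = pdist(X, 'euclidean');
maxDiff = max(abs(Dcustom - Dbuiltin));   % expected to be at rounding level
```

Comparing a hand-written distance against a built-in metric on a few rows is a cheap way to confirm that the function returns a column of the right length before using it on real data.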