Hierarchical Clustering Principle:
Unlike prototype-based and density-based clustering, hierarchical clustering tries to partition the sample set at different "levels", clustering it one layer at a time. The partitioning strategy can be bottom-up, i.e. agglomerative hierarchical clustering such as AGNES, or top-down, i.e. divisive hierarchical clustering such as DIANA.
AGNES first treats each sample point as a cluster of its own, then repeatedly finds the two clusters with the smallest distance and merges them, until the expected number of clusters or some other termination condition is reached.
DIANA first treats all samples as a single cluster, then repeatedly splits a cluster into the two sub-clusters that are farthest apart, until the expected number of clusters or some other termination condition is reached.
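As an illustration of the AGNES idea, here is a minimal sketch using SciPy's hierarchical clustering routines; the toy data and variable names are made up for this example.

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# toy 2-D samples (made up for illustration)
points = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9], [9.0, 0.1]])

# AGNES-style agglomeration: start from singleton clusters and repeatedly
# merge the closest pair; Z records every merge, i.e. the whole tree
Z = linkage(points, method='single')

# stop once 2 clusters remain (a simple termination condition)
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)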
How do we determine the distance between two clusters?
1. Minimum distance (single linkage)
The cluster distance is the distance between the closest pair of samples, one from each cluster.
2. Maximum distance (complete linkage)
The cluster distance is the distance between the farthest pair of samples, one from each cluster.
3. Average distance (average linkage)
The cluster distance is the average of the distances over all pairs of samples, one from each cluster.
Methods 1 and 2 are both susceptible to extreme values (outliers); method 3 is computationally more expensive, but its measure is more reasonable.
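To make the three definitions concrete, here is a small sketch (toy clusters and variable names assumed) that computes each inter-cluster distance directly from the pairwise distances:

import numpy as np
from scipy.spatial.distance import cdist

# two toy clusters (made up for illustration)
A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [4.0, 1.0]])

D = cdist(A, B)   # all pairwise distances between samples of A and B
print(D.min())    # minimum distance (single linkage): closest pair
print(D.max())    # maximum distance (complete linkage): farthest pair
print(D.mean())   # average distance (average linkage): mean over all pairs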
Similar to decision trees, hierarchical clustering has the advantage of producing the whole tree at once, and certain conditions, such as depth or width, can be controlled. It does, however, have several problems:
1. The amount of computation is large.
2. Once a merge or split has been made, it cannot be undone; combining agglomerative and divisive steps can mitigate this.
3. Selecting the "best" pair at each step is a greedy strategy that easily falls into a local optimum; appropriate random operations can help.
Alternatively, BIRCH (Balanced Iterative Reducing and Clustering using Hierarchies) can be used: it first groups adjacent sample points into micro-clusters, and then runs the k-means algorithm on these micro-clusters.
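As a rough illustration of the BIRCH idea, here is a minimal sketch with scikit-learn's Birch estimator; the data and parameter values are made up, and Birch itself performs the final grouping of the micro-clusters rather than requiring a separate k-means call.

import numpy as np
from sklearn.cluster import Birch

rng = np.random.RandomState(0)
X_demo = rng.rand(200, 2)  # toy data (made up for illustration)

# threshold controls the radius of the micro-clusters; n_clusters then
# reduces the micro-clusters to 3 final clusters
birch = Birch(threshold=0.1, branching_factor=50, n_clusters=3).fit(X_demo)
print(birch.labels_[:10])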
HC Applications:
AgglomerativeClustering parameter description:
AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto', connectivity=None, linkage='ward', memory=Memory(cachedir=None), n_clusters=6, pooling_func=...)
affinity='euclidean': distance metric
connectivity: optional connectivity constraint
linkage='ward': linkage criterion
memory: caching of intermediate tree computations
n_clusters=6: number of clusters
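Before the full example below, a minimal sketch (toy data and names assumed) showing how the linkage parameter corresponds to the distance definitions above:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

X_toy = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0], [10, 1]])
for link in ('ward', 'complete', 'average'):
    model = AgglomerativeClustering(n_clusters=3, linkage=link).fit(X_toy)
    print(link, model.labels_)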
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_swiss_roll

n_samples = 1500  # number of samples (value missing in the original; 1500 assumed)
noise = 0.05
X, _ = make_swiss_roll(n_samples, noise=noise)  # swiss-roll dataset
# rescale the second dimension
X[:, 1] *= .5

# unstructured Ward clustering (no connectivity constraint)
ward = AgglomerativeClustering(n_clusters=6, linkage='ward').fit(X)
label = ward.labels_  # get the label values

fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.view_init(7, -80)
for l in np.unique(label):
    ax.scatter(X[label == l, 0], X[label == l, 1], X[label == l, 2],
               color=plt.cm.jet(float(l) / np.max(label + 1)),
               s=20, edgecolor='k')
plt.show()
It can be seen that, without the connectivity constraint, the intrinsic structure of the data is ignored, and the clusters extend across different folds of the manifold.
from sklearn.neighbors import kneighbors_graph

# connectivity constraint: each sample is connected to its 10 nearest neighbors
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
ward = AgglomerativeClustering(n_clusters=6, connectivity=connectivity,
                               linkage='ward').fit(X)
After this small modification, adding the connectivity constraint gives much better results: the clusters now follow the folds of the manifold.
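For intuition about what the constraint contains, a quick inspection of the k-nearest-neighbors graph; this assumes the variables from the code above are still in scope.

# connectivity is a sparse (n_samples, n_samples) adjacency matrix in which
# row i marks the 10 nearest neighbors of sample i; Ward merges are then
# only allowed between clusters that are connected in this graph
print(connectivity.shape)
print(connectivity[0].nonzero()[1])  # neighbor indices of sample 0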