Hierarchical clustering


Hierarchical Clustering Principle:

Unlike prototype-based and density-based clustering, hierarchical clustering attempts to partition the sample dataset at different "levels", clustering layer by layer. The partitioning strategy can be bottom-up agglomeration (agglomerative hierarchical clustering), such as AGNES, or top-down division (divisive hierarchical clustering), such as DIANA.

AGNES first treats every sample point as its own cluster, then repeatedly finds the two clusters with the smallest distance and merges them, until the expected number of clusters or another termination condition is reached.
DIANA first treats all samples as one cluster, then repeatedly finds the cluster whose points are farthest apart and splits it, until the expected number of clusters or another termination condition is reached.
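To make the agglomerative (AGNES-style) procedure concrete, here is a minimal from-scratch sketch; the function name agnes, the toy data, and the choice of single linkage are illustrative assumptions, not the exact algorithm of any particular library.

import numpy as np

def agnes(X, k):
    """Minimal AGNES-style agglomerative clustering sketch (single linkage)."""
    # Start with every sample as its own cluster.
    clusters = [[i] for i in range(len(X))]
    while len(clusters) > k:
        best = (None, None, np.inf)
        # Find the pair of clusters with the smallest single-linkage distance.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if d < best[2]:
                    best = (a, b, d)
        a, b, _ = best
        # Merge the two closest clusters and repeat.
        clusters[a].extend(clusters[b])
        del clusters[b]
    return clusters

# Usage on a tiny 2-D toy dataset (assumed here for illustration).
X = np.array([[0, 0], [0, 1], [5, 5], [5, 6], [10, 0]])
print(agnes(X, k=2))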

How to determine the distance between two clusters:
1. Minimum distance (single linkage)
Determined by the two closest samples of the two clusters.
2. Maximum distance (complete linkage)
Determined by the two farthest samples of the two clusters.
3. Average distance (average linkage)
Determined jointly by all samples of the two clusters.
Methods 1 and 2 are both susceptible to extreme values, while method 3 is computationally heavier but its measure is more reasonable.
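As a sketch of the three measures, assuming Euclidean distance and two clusters given as NumPy arrays (the helper names below are made up for illustration):

import numpy as np

def pairwise_dists(A, B):
    # Euclidean distance between every sample in cluster A and every sample in cluster B.
    return np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)

def single_linkage(A, B):    # minimum distance: closest pair of samples
    return pairwise_dists(A, B).min()

def complete_linkage(A, B):  # maximum distance: farthest pair of samples
    return pairwise_dists(A, B).max()

def average_linkage(A, B):   # average distance: mean over all pairs of samples
    return pairwise_dists(A, B).mean()

# Usage on two toy clusters.
A = np.array([[0.0, 0.0], [0.0, 1.0]])
B = np.array([[3.0, 0.0], [4.0, 0.0]])
print(single_linkage(A, B), complete_linkage(A, B), average_linkage(A, B))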

Similar to decision trees, hierarchical clustering has the advantage of producing the whole tree at once, and the result can be controlled through conditions on its depth or width, but it has several problems:

Large computational cost
Once a merge or split is made, it cannot be undone
Agglomeration and division can be combined with each other

Select "Best" each time
Greedy algorithm, easy to local optimization, can be done by the appropriate random operation.
or using a balanced iteration protocol and clustering (Balanced iterative reducing and clustering Using hierarchies,BIRCH) It first divides the adjacent sample points into tiny clusters ( Microcluseters), and then use the K-means algorithm for these clusters.
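A hedged example of the BIRCH idea using scikit-learn's Birch estimator: passing a KMeans instance as n_clusters makes the global step match the "K-means on micro-clusters" description above; the dataset, threshold, and branching_factor values are illustrative assumptions.

from sklearn.cluster import Birch, KMeans
from sklearn.datasets import make_blobs

# Toy dataset, assumed here for illustration.
X, _ = make_blobs(n_samples=1000, centers=3, random_state=0)

# Birch first condenses nearby points into micro-clusters (CF subclusters),
# then the global clusterer (K-means here) is fit on those micro-clusters.
birch = Birch(threshold=0.5, branching_factor=50,
              n_clusters=KMeans(n_clusters=3))
labels = birch.fit_predict(X)
print(labels[:10])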

HC Applications:
AgglomerativeClustering parameter description:
AgglomerativeClustering(affinity='euclidean', compute_full_tree='auto', connectivity=None, linkage='ward', memory=Memory(cachedir=None), n_clusters=6, pooling_func=...)

affinity='euclidean': distance metric
connectivity: connectivity constraint (None means no constraint)
linkage='ward': linkage method
memory: caching of the computed tree
n_clusters=6: number of clusters
import numpy as np
import matplotlib.pyplot as plt
import mpl_toolkits.mplot3d.axes3d as p3
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets.samples_generator import make_swiss_roll

n_samples = 1500  # assumed value; the original omitted it
noise = 0.05
X, _ = make_swiss_roll(n_samples, noise)  # swiss-roll dataset
# rescale
X[:, 1] *= .5

ward = AgglomerativeClustering(n_clusters=6, linkage='ward').fit(X)
label = ward.labels_  # get the label values

fig = plt.figure()
ax = p3.Axes3D(fig)
ax.view_init(7, -80)
for l in np.unique(label):
    ax.scatter(X[label == l, 0], X[label == l, 1], X[label == l, 2],
               color=plt.cm.jet(float(l) / np.max(label + 1)),
               s=20, edgecolor='k')
plt.show()

  

It can be seen that without a connectivity constraint the intrinsic structure of the data is ignored, so clusters stretch across different folds of the manifold.

from sklearn.neighbors import kneighbors_graph
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
ward = AgglomerativeClustering(n_clusters=6, connectivity=connectivity,
                               linkage='ward').fit(X)

After modifying this part of the code to add the connectivity constraint, the clustering result respects the manifold structure and looks much better.
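For completeness, the constrained result can be re-plotted with the same loop as before; this assumes X, np, plt, and p3 from the earlier snippet are still in scope.

label = ward.labels_  # labels from the connectivity-constrained fit

fig = plt.figure()
ax = p3.Axes3D(fig)
ax.view_init(7, -80)
for l in np.unique(label):
    ax.scatter(X[label == l, 0], X[label == l, 1], X[label == l, 2],
               color=plt.cm.jet(float(l) / np.max(label + 1)),
               s=20, edgecolor='k')
plt.show()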
