Hierarchical clustering algorithms

First, prototype clustering and hierarchical clustering

Prototype clustering is also called prototype-based clustering (prototype-based clustering). This kind of algorithm assumes that the cluster structure can be characterized by a set of prototypes: the prototypes are first initialized and then updated iteratively. Different prototype representations and different solution methods produce different algorithms. The most common prototype clustering algorithm is K-means.
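
For concreteness, here is a minimal prototype-clustering sketch using scikit-learn's KMeans; the toy blob data and the parameter values are made up for demonstration:

    import numpy as np
    from sklearn.cluster import KMeans

    # made-up toy data: two loose blobs in 2-D
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [5.0, 5.0]])

    # K-means keeps one prototype (centroid) per cluster and updates it iteratively
    km = KMeans(n_clusters=2, n_init=10, random_state=0)
    cluster_labels = km.fit_predict(X)
    print(km.cluster_centers_)  # the learned prototypes
    print(cluster_labels[:10])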

Hierarchical clustering (hierarchical clustering) attempts to partition the data set at different levels, forming a tree-shaped cluster structure. The data set can be divided with a "bottom-up" aggregation strategy or a "top-down" splitting strategy. One advantage of hierarchical clustering is that we can draw a dendrogram (dendrogram) to help interpret the clustering results visually. Another advantage is that the number of clusters does not need to be specified beforehand.

Second, agglomerative hierarchical clustering

Hierarchical clustering can be divided into agglomerative (agglomerative) hierarchical clustering and divisive (divisive) hierarchical clustering. Divisive hierarchical clustering follows a "top-down" idea: all samples start in a single cluster, which is then split into smaller clusters iteratively until each cluster contains only one sample. Agglomerative hierarchical clustering follows a "bottom-up" idea: each sample starts as its own cluster, and the closest pair of clusters is repeatedly merged until all samples belong to the same cluster.

In agglomerative hierarchical clustering, two standard criteria for measuring the distance between clusters are single linkage (single linkage) and complete linkage (complete linkage). Single linkage computes, for each pair of clusters, the distance between their most similar (closest) members, and merges the pair of clusters whose closest members are nearest to each other. Complete linkage instead compares the most dissimilar (farthest) members of each pair of clusters, and merges the pair whose farthest members are closest.
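
To make the two criteria concrete, here is a small hand-rolled sketch; the two toy clusters are made up, and SciPy's cdist is used only to get all member-to-member distances:

    import numpy as np
    from scipy.spatial.distance import cdist

    # two made-up toy clusters in 2-D
    cluster_a = np.array([[0.0, 0.0], [1.0, 0.5]])
    cluster_b = np.array([[4.0, 4.0], [5.0, 5.5]])

    pairwise = cdist(cluster_a, cluster_b)  # all member-to-member distances
    print("single linkage distance:", pairwise.min())    # closest pair of members
    print("complete linkage distance:", pairwise.max())  # farthest pair of members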


In addition to single linkage and complete linkage, agglomerative hierarchical clustering can also determine the distance between two clusters by average linkage (average linkage) and Ward linkage. With average linkage, the two clusters with the smallest average distance over all pairs of their members are merged. With Ward linkage, the two clusters whose merger produces the smallest increase in the total within-cluster SSE are merged.
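
All four criteria are available through the method argument of SciPy's linkage function; a minimal sketch, assuming random demonstration data:

    import numpy as np
    from scipy.spatial.distance import pdist
    from scipy.cluster.hierarchy import linkage

    X = np.random.RandomState(1).random_sample((5, 3)) * 10  # made-up data
    condensed = pdist(X, metric="euclidean")  # condensed distance matrix

    # the same distances, four different merge criteria
    for method in ("single", "complete", "average", "ward"):
        print(method)
        print(linkage(condensed, method=method))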

Third, complete-linkage agglomerative hierarchical clustering

Complete-linkage agglomerative hierarchical clustering mainly consists of the following steps (a minimal NumPy sketch of these steps appears right after the list):

1, compute the distance matrix of all samples

2, treat each data point as a separate cluster

3, merge the two closest clusters, measuring cluster distance by their most dissimilar (farthest) pair of members

4, update the distance matrix

5, repeat steps 3 and 4 until all samples belong to the same cluster
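
As a compact illustration of the five steps above, here is a naive pure-NumPy sketch; the function name, the O(n^3) loop structure, and the random data are made up for demonstration, and this is not the SciPy-based implementation used below:

    import numpy as np

    def complete_linkage_merges(points):
        # step 1: distance matrix of all samples
        dist = np.linalg.norm(points[:, None] - points[None], axis=-1)
        # step 2: every data point starts as its own cluster
        clusters = [[i] for i in range(len(points))]
        merges = []
        # step 5: repeat until all samples belong to one cluster
        while len(clusters) > 1:
            best = None
            for a in range(len(clusters)):
                for b in range(a + 1, len(clusters)):
                    # complete linkage: distance between the farthest members
                    d = dist[np.ix_(clusters[a], clusters[b])].max()
                    if best is None or d < best[0]:
                        best = (d, a, b)
            d, a, b = best
            # steps 3 and 4: merge the closest pair and update the bookkeeping
            merges.append((clusters[a], clusters[b], d))
            clusters[a] = clusters[a] + clusters[b]
            del clusters[b]
        return merges

    points = np.random.RandomState(1).random_sample((5, 3)) * 10
    for left, right, d in complete_linkage_merges(points):
        print(left, "+", right, "at distance %.2f" % d)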

The following example implements complete-linkage agglomerative hierarchical clustering step by step.

1, Get the samples

Randomly generate 5 samples, each with 3 features (X, Y, Z).

import pandas as pd
import numpy as np

if __name__ == "__main__":
    np.random.seed(1)
    # set the feature names
    variables = ["X", "Y", "Z"]
    # set the sample labels
    labels = ["S1", "S2", "S3", "S4", "S5"]
    # generate a (5, 3) array of random values
    data = np.random.random_sample([5, 3]) * 10
    # convert the array into a DataFrame with pandas
    df = pd.DataFrame(data, columns=variables, index=labels)
    # inspect the data
    print(df)

2, Compute the distance matrix of all samples

Use SciPy to compute the distance matrix: calculate the pairwise Euclidean distance between samples, and wrap the resulting square matrix in a DataFrame for easy viewing.

    from scipy.spatial.distance import pdist, squareform
    # compute the distance matrix
    '''
    pdist: computes the pairwise Euclidean distances between samples
           and returns them as a condensed one-dimensional array
    squareform: converts the condensed array into a symmetric square matrix
    '''
    dist_matrix = pd.DataFrame(squareform(pdist(df, metric="euclidean")),
                               columns=labels, index=labels)
    print(dist_matrix)

3, Compute the linkage matrix using complete linkage

Use SciPy's linkage function to obtain a linkage matrix, with complete linkage as the distance criterion. Note that linkage expects either the condensed distance matrix returned by pdist or the original observations, not the square matrix produced by squareform.

    from scipy.cluster.hierarchy import linkage
    # use complete linkage as the distance criterion to obtain the linkage matrix;
    # pass the condensed output of pdist, not the square matrix from squareform
    row_clusters = linkage(pdist(df, metric="euclidean"), method="complete")
    # convert the linkage matrix into a DataFrame
    clusters = pd.DataFrame(row_clusters,
                            columns=["label 1", "label 2", "distance", "sample size"],
                            index=["cluster %d" % (i + 1) for i in range(row_clusters.shape[0])])
    print(clusters)

Each row of the linkage matrix describes one merge: the row index is the number of the newly formed cluster, the first and second columns identify the two clusters (or samples) being merged, the third column is the complete-linkage Euclidean distance between them, and the last column is the number of samples in the new cluster. Indices smaller than the number of samples refer to original samples; larger indices refer to clusters formed in earlier merges.
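
A short sketch of how to read those rows; this helper loop is hypothetical and assumes the row_clusters array computed above:

    n = row_clusters.shape[0] + 1  # number of original samples
    for i, (a, b, d, size) in enumerate(row_clusters):
        print("cluster %d merges %d and %d (distance %.2f, %d samples)"
              % (n + i, int(a), int(b), d, int(size)))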

4, Draw the dendrogram from the linkage matrix

Use SciPy's dendrogram function to draw the dendrogram.

    from scipy.cluster.hierarchy import dendrogram
    import matplotlib.pyplot as plt

    row_dendr = dendrogram(row_clusters, labels=labels)
    plt.tight_layout()
    plt.ylabel("Euclidean distance")
    plt.show()


From the dendrogram above, we can see the merge order intuitively: first S1 and S5 are merged and S2 and S3 are merged; then S4 joins the cluster of S2 and S3; finally that cluster is merged with the cluster of S1 and S5.

5, Combine the dendrogram with a heat map

In practice, we can combine the dendrogram with a heat map, which represents the individual values of the sample matrix with different colors.

    # create a figure object
    fig = plt.figure(figsize=(8, 8))
    # set the x position, y position, width, and height of the dendrogram axes
    axd = fig.add_axes([0.08, 0.1, 0.2, 0.6])
    # rotate the dendrogram 90 degrees
    row_dendr = dendrogram(row_clusters, orientation="left")
    # reorder the rows of the DataFrame according to the leaves of the dendrogram
    df_rowclust = df.iloc[row_dendr["leaves"][::-1]]
    # draw the heat map (shifted right so it does not overlap the dendrogram)
    axm = fig.add_axes([0.23, 0.1, 0.6, 0.6])
    cax = axm.matshow(df_rowclust, interpolation="nearest", cmap="hot_r")
    # remove the ticks and spines of the dendrogram axes
    axd.set_xticks([])
    axd.set_yticks([])
    for i in axd.spines.values():
        i.set_visible(False)
    fig.colorbar(cax)
    # set the x-axis and y-axis tick labels of the heat map
    axm.set_xticklabels([""] + list(df_rowclust.columns))
    axm.set_yticklabels([""] + list(df_rowclust.index))
    plt.show()


Combining the heat map with the dendrogram makes it more intuitive to see how the features influence the cluster division; the different colors represent the magnitudes of the feature values.

6, Implement agglomerative clustering with sklearn

    from sklearn.cluster import AgglomerativeClustering
    '''
    n_clusters: sets the number of clusters
    linkage: sets the distance criterion for merging
    '''
    # note: newer scikit-learn versions use metric= instead of affinity=
    ac = AgglomerativeClustering(n_clusters=2, affinity="euclidean", linkage="complete")
    labels = ac.fit_predict(data)
    print(labels)
    # [1 0 0 0 1]
sklearn's AgglomerativeClustering makes agglomerative clustering easy to apply, but the number of clusters to return must be specified in advance.
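
If you prefer to build the full tree first and choose the number of clusters afterwards, SciPy's fcluster can cut the linkage matrix computed earlier into flat clusters; a minimal sketch, where t=2 is an arbitrary choice for illustration:

    from scipy.cluster.hierarchy import fcluster
    # cut the hierarchy so that at most 2 flat clusters remain
    flat_labels = fcluster(row_clusters, t=2, criterion="maxclust")
    print(flat_labels)  # cluster ids are 1-based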

