First, prototype clustering and hierarchical clustering
Prototype clustering, also known as prototype-based clustering, assumes that the cluster structure can be characterized by a set of prototypes. The algorithm typically initializes the prototypes first and then refines and updates them iteratively; different prototype representations and different update methods lead to different algorithms. A common prototype clustering algorithm is K-means.
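As a quick illustration of prototype clustering, here is a minimal K-means sketch with scikit-learn; the toy data and parameter values are made up purely for illustration:
from sklearn.cluster import KMeans
import numpy as np

# toy data: two loose groups of 2-D points (values chosen only for illustration)
X = np.array([[1.0, 1.1], [1.2, 0.9], [0.8, 1.0],
              [5.0, 5.2], [5.1, 4.9], [4.8, 5.0]])
# n_clusters is the number of prototypes (cluster centers) to fit
km = KMeans(n_clusters=2, n_init=10, random_state=0)
cluster_labels = km.fit_predict(X)
print(cluster_labels)        # e.g. [0 0 0 1 1 1] (the label order may vary)
print(km.cluster_centers_)   # the learned prototypes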
Hierarchical clustering attempts to partition the data set at different levels, forming a tree-shaped cluster structure. The data set can be divided with a "bottom-up" agglomerative strategy or a "top-down" divisive strategy. One advantage of hierarchical clustering is that we can draw a dendrogram to interpret the clustering results visually. Another advantage is that it does not require the number of clusters to be specified beforehand.
Second, agglomerative hierarchical clustering
Hierarchical clustering can be divided into agglomerative hierarchical clustering and divisive hierarchical clustering. Divisive hierarchical clustering follows a "top-down" approach: all samples start in a single cluster, which is then split iteratively into smaller clusters until each cluster contains only one sample. Agglomerative hierarchical clustering follows a "bottom-up" approach: each sample starts as its own cluster, and the closest pair of clusters is merged repeatedly until all samples belong to the same cluster.
In agglomerative hierarchical clustering, the two standard criteria for determining the distance between clusters are single linkage and complete linkage. Single linkage computes the distance between the most similar (closest) pair of samples in each pair of clusters and merges the two clusters whose closest samples are nearest. Complete linkage merges clusters by comparing the least similar (most distant) samples in each pair of clusters.
Besides single linkage and complete linkage, agglomerative hierarchical clustering can also use average linkage and Ward linkage to determine the distance between two clusters. With average linkage, the two clusters with the smallest average distance between all of their members are merged. With Ward linkage, the two clusters whose merger yields the smallest increase in the total within-cluster SSE are merged.
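For reference, a minimal sketch of how these four criteria map onto SciPy's linkage function; the toy data is random and only the method argument differs between calls:
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist
import numpy as np

np.random.seed(0)
X = np.random.random_sample([5, 3])             # toy data: 5 samples with 3 features
dists = pdist(X, metric="euclidean")            # condensed pairwise distance array

Z_single   = linkage(dists, method="single")    # nearest pair of samples between clusters
Z_complete = linkage(dists, method="complete")  # farthest pair of samples between clusters
Z_average  = linkage(dists, method="average")   # average distance over all member pairs
Z_ward     = linkage(X, method="ward")          # smallest increase in within-cluster SSE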
Third, complete-linkage agglomerative hierarchical clustering
Complete-linkage agglomerative hierarchical clustering mainly involves the following steps (a minimal from-scratch sketch follows the list):
1, compute the distance matrix of all samples
2, treat each data point as a separate cluster
3, merge the two closest clusters, based on the distance between their most dissimilar (farthest) samples
4, update the sample distance matrix
5, repeat steps 3 and 4 until all samples belong to the same cluster.
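Before the library-based walkthrough, here is a minimal from-scratch NumPy sketch of the loop described above; the helper name complete_linkage_merge_order and the toy data are only for illustration, and the sections below use SciPy instead:
import numpy as np

def complete_linkage_merge_order(X):
    """Toy complete-linkage agglomeration: returns the merge order as (members, members, distance) tuples."""
    # step 1: distance matrix of all samples
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((diff ** 2).sum(axis=-1))
    # step 2: every data point starts as its own cluster
    clusters = {i: [i] for i in range(len(X))}
    merges = []
    # steps 3-5: repeatedly merge the two clusters whose farthest members are closest
    while len(clusters) > 1:
        best = None
        for a in clusters:
            for b in clusters:
                if a < b:
                    # complete linkage: distance between the most dissimilar members
                    d = max(D[i, j] for i in clusters[a] for j in clusters[b])
                    if best is None or d < best[0]:
                        best = (d, a, b)
        d, a, b = best
        merges.append((tuple(clusters[a]), tuple(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

# usage: print the merge order for 5 random samples (illustrative only)
np.random.seed(1)
print(complete_linkage_merge_order(np.random.random_sample([5, 3]) * 10))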
The following works through complete-linkage agglomerative hierarchical clustering step by step with pandas and SciPy.
1, generate samples
Randomly generate 5 samples, each containing 3 features (X, Y, Z).
import pandas as pd
import numpy as np

if __name__ == "__main__":
    np.random.seed(1)
    # set the feature names
    variables = ["X", "Y", "Z"]
    # set the sample labels
    labels = ["S1", "S2", "S3", "S4", "S5"]
    # generate a (5, 3) array
    data = np.random.random_sample([5, 3]) * 10
    # convert the array into a DataFrame with pandas
    df = pd.DataFrame(data, columns=variables, index=labels)
    # inspect the data
    print(df)
2, compute the distance matrix of all samples
Use SciPy to compute the distance matrix: calculate the pairwise Euclidean distance between all samples and store the resulting matrix in a DataFrame for easy viewing.
from scipy.spatial.distance import pdist, squareform
# compute the distance matrix
"""
pdist: computes the pairwise Euclidean distances between samples and returns a condensed one-dimensional array
squareform: converts the condensed array into a symmetric square matrix
"""
dist_matrix = pd.DataFrame(squareform(pdist(df, metric="euclidean")),
                           columns=labels, index=labels)
print(dist_matrix)
3, compute the linkage matrix using complete linkage
Use SciPy's linkage function to obtain the linkage matrix, with complete linkage as the criterion for determining the distance between clusters.
from scipy.cluster.hierarchy import linkage
# use complete linkage as the distance criterion to build the linkage matrix;
# linkage expects the condensed distance array (or the raw observations), not the square distance matrix
row_clusters = linkage(pdist(df, metric="euclidean"), method="complete")
# convert the linkage matrix into a DataFrame
clusters = pd.DataFrame(row_clusters,
                        columns=["label 1", "label 2", "distance", "sample size"],
                        index=["cluster %d" % (i + 1) for i in range(row_clusters.shape[0])])
print(clusters)
The row index labels each newly formed cluster; the "label 1" and "label 2" columns give the two clusters being merged, the "distance" column gives the complete-linkage Euclidean distance between them, and the "sample size" column gives the number of samples in the newly formed cluster.
4, draw the dendrogram from the linkage matrix
Use SciPy's dendrogram function to draw the tree diagram.
from scipy.cluster.hierarchy import dendrogram
import matplotlib.pyplot as plt

row_dendr = dendrogram(row_clusters, labels=labels)
plt.tight_layout()
plt.ylabel("Euclidean distance")
plt.show()
From the dendrogram we can see intuitively that S1 and S5 are merged first and S2 and S3 are merged next; S4 then joins the S2/S3 cluster, which is finally merged with the S1/S5 cluster.
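If concrete cluster assignments are needed rather than the full tree, SciPy's fcluster can cut the linkage matrix computed above; the choice of 2 flat clusters here is only for illustration:
from scipy.cluster.hierarchy import fcluster

# cut the hierarchy so that at most 2 flat clusters remain
flat_labels = fcluster(row_clusters, t=2, criterion="maxclust")
print(flat_labels)  # one cluster label per sample, in the original order S1..S5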
5, combine the dendrogram with a heat map
In practice, we can combine the dendrogram with a heat map, which represents the individual values in the sample matrix with different colors.
# create a figure object
fig = plt.figure(figsize=(8, 8))
# set the x position, y position, width, and height of the dendrogram axes
axd = fig.add_axes([0.08, 0.1, 0.2, 0.6])
# rotate the dendrogram by 90 degrees
row_dendr = dendrogram(row_clusters, orientation="left")
# reorder the DataFrame rows according to the leaf order of the dendrogram
df_rowclust = df.iloc[row_dendr["leaves"][::-1]]
# draw the heat map
axm = fig.add_axes([0.1, 0.1, 0.6, 0.6])
cax = axm.matshow(df_rowclust, interpolation="nearest", cmap="hot_r")
# remove the x-axis and y-axis ticks of the dendrogram
axd.set_xticks([])
axd.set_yticks([])
# hide the spines of the dendrogram axes
for i in axd.spines.values():
    i.set_visible(False)
fig.colorbar(cax)
# set the x-axis and y-axis tick labels of the heat map
axm.set_xticklabels([""] + list(df_rowclust.columns))
axm.set_yticklabels([""] + list(df_rowclust.index))
plt.show()
Combining the heat map with the dendrogram makes it more intuitive to see how the features influence the cluster assignments; the different colors represent the magnitude of the feature values.
6, implement agglomerative clustering with scikit-learn
from sklearn.cluster import AgglomerativeClustering
"""
n_clusters: the number of clusters to find
linkage: the linkage criterion
"""
ac = AgglomerativeClustering(n_clusters=2, affinity="euclidean", linkage="complete")
labels = ac.fit_predict(data)
print(labels)
# [1 0 0 0 1]
scikit-learn's AgglomerativeClustering makes it easy to perform agglomerative clustering; note that the number of clusters to return must be specified in advance.
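As a side note, recent scikit-learn releases rename the affinity argument to metric and also allow cutting the tree at a distance threshold instead of a fixed cluster count. A minimal sketch assuming a recent scikit-learn version; the threshold value 5.0 is arbitrary, and data is the array from the example above:
from sklearn.cluster import AgglomerativeClustering

# newer scikit-learn: "metric" instead of "affinity"; with n_clusters=None the tree is
# cut at distance_threshold, so the number of clusters need not be fixed in advance
ac = AgglomerativeClustering(n_clusters=None, distance_threshold=5.0,
                             metric="euclidean", linkage="complete")
print(ac.fit_predict(data))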