Hierarchical clustering is a widely used algorithm. To run a comparative experiment, I implemented a version of it myself and, to check the results, clustered China's provinces based on the distances between their provincial capitals to see how well the clustering works. This article is organized in three parts: it first introduces hierarchical clustering, then explains how the algorithm works, and finally analyzes and compares the results of the experiment.
1. Hierarchical Clustering
Hierarchical clustering algorithms are divided into agglomerative and divisive, depending on whether the hierarchy is built bottom-up (by merging) or top-down (by splitting).
Agglomerative hierarchical clustering uses a bottom-up strategy: it starts with each object as its own class (n classes in total) and keeps merging them into ever larger classes until all objects end up in one class or a termination condition is met. At each merge step we find the two closest classes and merge them into one, so at most n iterations are needed to merge all objects together. Divisive hierarchical clustering uses a top-down strategy: it starts with all objects in a single class and keeps splitting it into smaller classes until each of the smallest classes is cohesive enough or contains only one object.
2. Principle of the agglomerative hierarchical clustering algorithm
Input: the n objects to be clustered and their n*n distance matrix (or similarity matrix)
Steps:
1. Treat each object as its own class, giving n classes, each containing exactly one object. The distance between two classes is the distance between the objects they contain.
2. Find the two closest classes and merge them into one, reducing the total number of classes by one.
3. Recalculate the distance between the new class and all old classes.
4. Repeat steps 2 and 3 until everything is merged into a single class (containing all n objects) or a termination condition is met. A minimal C++ sketch of this loop is given right after these steps.
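The sketch below illustrates the four steps. It is not the exact code from my repository: the function name `agglomerate`, the precomputed n*n distance matrix input, and the hardcoded single-linkage (minimum distance) rule in step 3 are assumptions made only for this illustration; the loop stops once k classes remain (k = 1 merges everything into one class).

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Illustrative sketch of the agglomerative loop (names are assumptions).
std::vector<std::vector<std::size_t>>
agglomerate(const std::vector<std::vector<double>>& dist, std::size_t k) {
    const std::size_t n = dist.size();

    // Step 1: every object starts as its own class.
    std::vector<std::vector<std::size_t>> clusters(n);
    for (std::size_t i = 0; i < n; ++i) clusters[i] = {i};

    // Step 3 rule (single-linkage here): the distance between two classes
    // is the smallest object-to-object distance across them.
    auto clusterDist = [&](const std::vector<std::size_t>& a,
                           const std::vector<std::size_t>& b) {
        double d = std::numeric_limits<double>::max();
        for (std::size_t p : a)
            for (std::size_t q : b)
                d = std::min(d, dist[p][q]);
        return d;
    };

    // Steps 2-4: keep merging the two closest classes until k remain.
    while (clusters.size() > k && clusters.size() > 1) {
        std::size_t bi = 0, bj = 1;
        double best = std::numeric_limits<double>::max();
        for (std::size_t i = 0; i + 1 < clusters.size(); ++i)
            for (std::size_t j = i + 1; j < clusters.size(); ++j) {
                double d = clusterDist(clusters[i], clusters[j]);
                if (d < best) { best = d; bi = i; bj = j; }
            }
        // Merge class bj into class bi and remove bj.
        clusters[bi].insert(clusters[bi].end(),
                            clusters[bj].begin(), clusters[bj].end());
        clusters.erase(clusters.begin() + static_cast<std::ptrdiff_t>(bj));
    }
    return clusters;
}
```

Swapping the minimum in `clusterDist` for a maximum or an average gives the complete-linkage and average-linkage variants described next.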
Depending on how step 3 recomputes distances, hierarchical clustering methods are divided into the single-linkage, complete-linkage, and average-linkage algorithms. Single-linkage uses the minimum distance, complete-linkage uses the maximum distance, and average-linkage uses the average distance. They are defined as follows, where |p - p'| is the distance between two objects (points) p and p':
single-linkage: d_min(Ci, Cj) = min |p - p'| over all p in Ci, p' in Cj
complete-linkage: d_max(Ci, Cj) = max |p - p'| over all p in Ci, p' in Cj
average-linkage: d_avg(Ci, Cj) = (1 / (|Ci| * |Cj|)) * sum of |p - p'| over all p in Ci, p' in Cj
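As a sketch of these three criteria as standalone C++ functions, assuming a class is represented by the vector of indices of its objects and `dist` is the precomputed object-to-object distance matrix; the type and function names here are chosen only for illustration:

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

using Cluster = std::vector<std::size_t>;          // indices of the objects in a class
using Matrix  = std::vector<std::vector<double>>;  // pairwise object distances |p - p'|

// single-linkage: the minimum distance over all cross-cluster pairs.
double singleLinkage(const Cluster& a, const Cluster& b, const Matrix& dist) {
    double d = std::numeric_limits<double>::max();
    for (std::size_t p : a)
        for (std::size_t q : b)
            d = std::min(d, dist[p][q]);
    return d;
}

// complete-linkage: the maximum distance over all cross-cluster pairs.
double completeLinkage(const Cluster& a, const Cluster& b, const Matrix& dist) {
    double d = 0.0;
    for (std::size_t p : a)
        for (std::size_t q : b)
            d = std::max(d, dist[p][q]);
    return d;
}

// average-linkage: the mean distance over all cross-cluster pairs.
double averageLinkage(const Cluster& a, const Cluster& b, const Matrix& dist) {
    double sum = 0.0;
    for (std::size_t p : a)
        for (std::size_t q : b)
            sum += dist[p][q];
    return sum / (static_cast<double>(a.size()) * static_cast<double>(b.size()));
}
```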
3. Experimental results
Following the agglomerative algorithm principle above, I implemented a version in C++; if you are interested, you can download the source code from my GitHub. To test the effect of hierarchical clustering, I used the pairwise distances between the capitals of 32 Chinese provinces and regions as input, and clustered the 32 provinces with both the single-linkage and the complete-linkage algorithms. By the usual large-region division, China is generally split into seven parts: Central China, North China, South China, Northwest, Northeast, Southwest, and East China. So in this experiment I also stopped at 7 clusters, to see whether the actual result matches that expectation. Figure 1 below shows the single-linkage results and Figure 2 the complete-linkage results. For easier visualization, provinces in the same cluster are drawn in the same color on the map, as shown in Figures 3 and 4 (a sketch of how such a distance matrix could be built is given after the figure captions).
Figure 1 Single-linkage algorithm results
Figure 2 Complete-linkage algorithm results
Figure 3 Single-linkage algorithm results shown on the map
Figure 4 Complete-linkage algorithm results shown on the map
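The post does not state how the capital-to-capital distances were obtained. As one possibility, the sketch below builds the n*n distance matrix from approximate latitude/longitude coordinates of a few capitals using the great-circle (haversine) formula; the struct, the function name, and the three sample cities are illustrative assumptions, and the real experiment uses all 32 capitals.

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

struct City { std::string name; double latDeg, lonDeg; };

// Great-circle (haversine) distance in kilometres between two cities.
double haversineKm(const City& a, const City& b) {
    const double R   = 6371.0;                     // mean Earth radius, km
    const double PI  = 3.14159265358979323846;
    const double rad = PI / 180.0;
    double dLat = (b.latDeg - a.latDeg) * rad;
    double dLon = (b.lonDeg - a.lonDeg) * rad;
    double h = std::sin(dLat / 2) * std::sin(dLat / 2) +
               std::cos(a.latDeg * rad) * std::cos(b.latDeg * rad) *
               std::sin(dLon / 2) * std::sin(dLon / 2);
    return 2.0 * R * std::asin(std::sqrt(h));
}

int main() {
    // A few capitals with rough coordinates, just for illustration.
    std::vector<City> cities = {
        {"Beijing",  39.9, 116.4},
        {"Shanghai", 31.2, 121.5},
        {"Chengdu",  30.6, 104.1},
    };

    // Pairwise n*n distance matrix that would be fed to the clustering step.
    std::size_t n = cities.size();
    std::vector<std::vector<double>> dist(n, std::vector<double>(n, 0.0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < n; ++j)
            dist[i][j] = haversineKm(cities[i], cities[j]);

    std::printf("%s - %s: %.0f km\n",
                cities[0].name.c_str(), cities[1].name.c_str(), dist[0][1]);
    return 0;
}
```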
As you can see, the complete-linkage result is quite good. You might wonder why it still differs from our usual impression of large regions such as the Southwest and the Northwest. One reason is that I cluster each province by the distance between provincial capitals, but a provincial capital is generally not located at the geographic center of its province, so the result has some discrepancies from the intuitive regional division. Figure 1 shows that the single-linkage result deviates quite a lot, and this is exactly where single-linkage and complete-linkage differ: single-linkage easily forms ribbon-like, chain-shaped regions. In Figure 3 a large north-south strip of provinces is all colored red, whereas the complete-linkage result in Figure 4 produces much more compact clusters.