Hierarchical Clustering algorithm

Source: Internet
Author: User

Hierarchical clustering is a widely used algorithm. I implemented a version of it for some comparative experiments and, to verify the results, clustered China's provinces by the distances between their provincial capitals to see how well the clustering works. This article has three parts: it first introduces hierarchical clustering, then explains how the algorithm works, and finally analyzes and compares the results on an example.


1. Hierarchical Clustering

Hierarchical clustering algorithms are divided into agglomerative and divisive methods, depending on whether the hierarchy is built bottom-up (merging) or top-down (splitting).

Agglomerative hierarchical clustering uses a bottom-up strategy: it starts with each object as its own class (n classes in total) and repeatedly merges classes into larger ones until all objects are in one class or a termination condition is met. Each merge step finds the two closest classes and combines them into one, so at most n-1 merges are needed to bring all the objects together. Divisive hierarchical clustering uses a top-down strategy: it starts with all objects in a single class and repeatedly splits it into smaller classes until every class is sufficiently cohesive or contains only one object.


2. Principle of the agglomerative hierarchical clustering algorithm

Input: n objects to be clustered and an n*n distance matrix (or similarity matrix)

Steps:
1. Treat each object as its own class, giving n classes that each contain a single object. The distance between two classes is the distance between the objects they contain.
2. Find the two closest classes and merge them into one, so that the total number of classes decreases by one.
3. Recalculate the distance between the new class and all the old classes.
4. Repeat steps 2 and 3 until everything has been merged into a single class (containing all n objects) or a termination condition is met.
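
The steps above can be sketched in a few lines of Python. This is a minimal illustration, not the C++ implementation mentioned in this article: the function name `agglomerative`, its parameters, and the toy points are assumptions made for the example. For simplicity it recomputes class distances from the original matrix on every pass instead of updating them incrementally as step 3 describes, which gives the same merges at a higher cost.

```python
def agglomerative(dist, k=1, linkage=min):
    """Merge the two closest classes until only k classes remain.

    dist    -- symmetric n x n distance matrix (list of lists)
    k       -- termination condition: stop when k classes are left
    linkage -- min (single linkage) or max (complete linkage)
    """
    # Step 1: every object starts as its own class.
    clusters = [[i] for i in range(len(dist))]
    merges = []
    while len(clusters) > k:
        # Step 2: find the closest pair of classes.
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Step 3 (naive): class distance recomputed from object distances.
                d = linkage(dist[p][q] for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters.pop(b)          # b > a, so index a stays valid
        clusters[a] = clusters[a] + merged
    return clusters, merges


# Toy data (hypothetical): four points on a line.
points = [0, 1, 5, 6]
dist = [[abs(x - y) for y in points] for x in points]

clusters, merges = agglomerative(dist, k=2)
print(clusters)                           # [[0, 1], [2, 3]]
print(len(agglomerative(dist, k=1)[1]))   # 3 merges for n = 4 objects
```

Passing `linkage=max` instead of the default `min` switches the same loop from single linkage to complete linkage.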

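Depending on how step 3 computes the distance, hierarchical clustering can be divided into the single-linkage, complete-linkage, and average-linkage algorithms. Single linkage uses the minimum distance, complete linkage uses the maximum distance, and average linkage uses the average distance between the objects of the two classes. For two classes Ci and Cj containing ni and nj objects, they are defined as follows, where |p-p'| is the distance between two objects or points p and p':

d_min(Ci, Cj) = min of |p-p'| over all p in Ci, p' in Cj
d_max(Ci, Cj) = max of |p-p'| over all p in Ci, p' in Cj
d_avg(Ci, Cj) = (1 / (ni * nj)) * sum of |p-p'| over all p in Ci, p' in Cj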
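
To make the three linkage criteria concrete, here is a small Python sketch that evaluates each of them for two classes given a distance matrix. The function names and the toy points are assumptions for illustration only.

```python
from itertools import product


def pair_distances(dist, ci, cj):
    """All distances |p-p'| with p in class ci and p' in class cj."""
    return [dist[p][q] for p, q in product(ci, cj)]


def single_linkage(dist, ci, cj):
    return min(pair_distances(dist, ci, cj))      # minimum distance


def complete_linkage(dist, ci, cj):
    return max(pair_distances(dist, ci, cj))      # maximum distance


def average_linkage(dist, ci, cj):
    d = pair_distances(dist, ci, cj)
    return sum(d) / len(d)                        # len(d) == ni * nj


# Toy data (hypothetical): four points on a line, split into two classes.
points = [0, 1, 5, 6]
dist = [[abs(x - y) for y in points] for x in points]
ci, cj = [0, 1], [2, 3]

print(single_linkage(dist, ci, cj))    # 4   (between points 1 and 5)
print(complete_linkage(dist, ci, cj))  # 6   (between points 0 and 6)
print(average_linkage(dist, ci, cj))   # 5.0 (mean of 5, 6, 4, 5)
```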

3. Experimental results

Following the agglomerative algorithm described above, I implemented a version in C++; if you are interested, you can download the source code from my GitHub. To test the effect of hierarchical clustering, I used the distances between 32 cities in China as input and clustered the 32 provinces with both the single-linkage and the complete-linkage algorithm. By the usual large-region division, China is split into seven parts: Central, North, South, Northwest, Northeast, Southwest, and East China. The experiment therefore also clusters into 7 classes, to see whether the actual result matches what we expect. Figure 1 shows the single-linkage result and Figure 2 the complete-linkage result. For visualization, provinces in the same class are drawn in the same color on the map, as shown in Figures 3 and 4.
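
How the inter-city distances were obtained is not stated here. One plausible way to build such an input matrix, shown purely as an assumption, is to compute great-circle distances from city coordinates with the haversine formula. The coordinates below are approximate and cover only four cities for illustration; they are not the data used in the experiment.

```python
from math import radians, sin, cos, asin, sqrt

# Approximate (latitude, longitude) in degrees; illustrative values only.
CITIES = {
    "Beijing": (39.90, 116.41),
    "Shanghai": (31.23, 121.47),
    "Guangzhou": (23.13, 113.26),
    "Chengdu": (30.57, 104.07),
}

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(a, b):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(radians, (*a, *b))
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * EARTH_RADIUS_KM * asin(sqrt(h))


def distance_matrix(coords):
    """Symmetric n x n matrix of pairwise distances."""
    return [[haversine_km(p, q) for q in coords] for p in coords]


names = list(CITIES)
dist = distance_matrix([CITIES[n] for n in names])
for name, row in zip(names, dist):
    print(name, [round(d) for d in row])
```

A matrix built this way can be passed directly to any agglomerative clustering routine that accepts an n*n distance matrix.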


Figure 1 Single-linkage algorithm results

Figure 2 Complete-linkage algorithm results

Figure 3 Single-linkage algorithm results shown on the map

Figure 4 Complete-linkage algorithm results shown on the map

The complete-linkage result looks quite good. You may wonder why it still differs from the Southwest, Northwest, and other large regions we have in mind. One reason is that the provinces are clustered here by the distances between their provincial capitals, and a capital is generally not located at the center of its province, so some discrepancy is to be expected. The single-linkage result in Figure 1 deviates considerably more. This is in fact where the single-linkage and complete-linkage algorithms differ: single linkage tends to form band-shaped, chain-like regions. In Figure 3 a large region running north to south is all red, while in Figure 4 the complete-linkage ...
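
The chaining behaviour of single linkage can be reproduced on a toy example. The sketch below is an illustration under assumptions (the `cluster` function and the one-dimensional points are hypothetical, not the city data): eight points lie on a line with slowly growing gaps, and both linkages are asked for 2 classes.

```python
def cluster(points, k, linkage):
    """Naive agglomerative clustering of 1-D points into k classes.

    linkage is min (single linkage) or max (complete linkage).
    Returns classes as lists of point indices.
    """
    clusters = [[i] for i in range(len(points))]
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(abs(points[p] - points[q])
                            for p in clusters[a] for q in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        _, a, b = best
        merged = clusters.pop(b)          # b > a, so index a stays valid
        clusters[a] = clusters[a] + merged
    return clusters


# Gaps between neighbours grow slowly: 10, 11, 12, 13, 14, 15, 16.
points = [0, 10, 21, 33, 46, 60, 75, 91]

single = cluster(points, 2, min)
complete = cluster(points, 2, max)
print([len(c) for c in single])    # [7, 1]: one long chain plus a leftover point
print([len(c) for c in complete])  # [4, 4]: two compact classes
```

Single linkage keeps attaching the nearest neighbour to the growing chain, because only the smallest gap between two classes matters. Complete linkage penalises a class for its farthest members, so it prefers classes of similar extent.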

