Agglomerative Hierarchical Clustering:
The term "agglomerative" means that the algorithm starts with each point as its own cluster and, at every step, merges the two closest clusters. Even at the end, noise points and outliers usually remain as their own small clusters, unless merging goes too far. "Closest" here has three common definitions; I use MIN in the implementation: at each step, take the currently nearest point pair, and if the two points are not yet in the same cluster, merge the two clusters they belong to (a small sketch of all three definitions follows the list):
(1) Single link (MIN): the distance between two clusters is the distance between the two closest points that lie in different clusters.
(2) Complete link (MAX): the distance between two clusters is the distance between the two farthest points that lie in different clusters.
(3) Group average: the distance between two clusters is the average of the pairwise distances over all point pairs drawn from the two different clusters.
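For concreteness, here is a minimal sketch of the three definitions (my illustration, not part of the original code), assuming each cluster is a list of 2D points and using the same squared Euclidean distance as the implementation below:

def p2p(a, b):
    # squared Euclidean distance between two 2D points
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def single_link(c1, c2):
    # MIN: distance of the closest cross-cluster point pair
    return min(p2p(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    # MAX: distance of the farthest cross-cluster point pair
    return max(p2p(a, b) for a in c1 for b in c2)

def group_average(c1, c2):
    # average of all cross-cluster pairwise distances
    return sum(p2p(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))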
Based on this, the code below computes the distance between every pair of points up front, sorts the pairs by distance in descending order, and then repeatedly pops the nearest remaining pair to merge. To prevent excessive merging, the exit condition is that 90% of the clusters have been merged, i.e., the loop stops once the current number of clusters falls to 10% of the initial number.
The implementation code is as follows:
# coding=utf-8
# Agglomerative Hierarchical Clustering (AHC)
import pylab as pl
from operator import itemgetter
from collections import OrderedDict, Counter

# Each line of the input file holds one point in the form "x#y"
points = [[int(eachpoint.split('#')[0]), int(eachpoint.split('#')[1])] for eachpoint in open("points", "r")]

# Initially, each point is assigned to its own cluster
groups = [idx for idx in range(len(points))]

# Compute the (squared Euclidean) distance between every point pair
disp2p = {}
for idx1, point1 in enumerate(points):
    for idx2, point2 in enumerate(points):
        if idx1 < idx2:
            distance = pow(abs(point1[0] - point2[0]), 2) + pow(abs(point1[1] - point2[1]), 2)
            disp2p[str(idx1) + "#" + str(idx2)] = distance

# Sort the point pairs by distance in descending order,
# so popitem() always returns the currently nearest remaining pair
disp2p = OrderedDict(sorted(disp2p.items(), key=itemgetter(1), reverse=True))

# Current number of clusters
groupnum = len(groups)

# Merging too far would absorb the noise points; stop once the number
# of clusters drops to finalgroupnum (10% of the initial count)
finalgroupnum = int(groupnum * 0.1)

while groupnum > finalgroupnum:
    # Take the nearest remaining point pair
    twopoins, distance = disp2p.popitem()
    pointa = int(twopoins.split('#')[0])
    pointb = int(twopoins.split('#')[1])
    pointagroup = groups[pointa]
    pointbgroup = groups[pointb]
    # If the two nearest points are not already in the same cluster, move every
    # point of B's cluster into A's cluster and decrease the cluster count by 1
    if pointagroup != pointbgroup:
        for idx in range(len(groups)):
            if groups[idx] == pointbgroup:
                groups[idx] = pointagroup
        groupnum -= 1

# Keep the 3 largest clusters; all other clusters are treated as noise points
wantgroupnum = 3
finalgroup = Counter(groups).most_common(wantgroupnum)
finalgroup = [onecount[0] for onecount in finalgroup]
droppoints = [points[idx] for idx in range(len(points)) if groups[idx] not in finalgroup]

# Plot the points of the 3 largest clusters
group1 = [points[idx] for idx in range(len(points)) if groups[idx] == finalgroup[0]]
group2 = [points[idx] for idx in range(len(points)) if groups[idx] == finalgroup[1]]
group3 = [points[idx] for idx in range(len(points)) if groups[idx] == finalgroup[2]]
pl.plot([eachpoint[0] for eachpoint in group1], [eachpoint[1] for eachpoint in group1], 'or')
pl.plot([eachpoint[0] for eachpoint in group2], [eachpoint[1] for eachpoint in group2], 'oy')
pl.plot([eachpoint[0] for eachpoint in group3], [eachpoint[1] for eachpoint in group3], 'og')

# Plot the noise points in black
pl.plot([eachpoint[0] for eachpoint in droppoints], [eachpoint[1] for eachpoint in droppoints], 'ok')
pl.show()
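The script reads a file named points with one point per line in the form x#y. Since the original dataset is not included, here is one hypothetical way to generate a compatible file with three dense blobs plus a few scattered noise points (values invented for illustration):

import random

with open("points", "w") as f:
    # three dense blobs
    for cx, cy in [(10, 10), (40, 15), (25, 40)]:
        for _ in range(50):
            f.write("%d#%d\n" % (int(random.gauss(cx, 2)), int(random.gauss(cy, 2))))
    # a few scattered noise points
    for _ in range(10):
        f.write("%d#%d\n" % (random.randint(0, 50), random.randint(0, 50)))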
In addition, note that agglomerative hierarchical clustering has no global objective function of the kind basic K-means optimizes, so it suffers neither from local minima nor from the difficulty of choosing initial points. Merge operations, however, are final: once two clusters have been merged, the decision is never revoked. The price is that computation and storage are expensive, since all pairwise distances must be computed and kept.
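As a rough cross-check (my addition, assuming SciPy is available and that points is the list loaded above), the same single-link clustering can be reproduced with scipy.cluster.hierarchy; note this cuts the tree into exactly 3 clusters and does not reproduce the noise-point handling used above:

from scipy.cluster.hierarchy import linkage, fcluster

# single-link (MIN) hierarchical clustering on the same points
Z = linkage(points, method='single')
# cut the dendrogram so that at most 3 clusters remain
labels = fcluster(Z, t=3, criterion='maxclust')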