Agglomerative Hierarchical Clustering:
The term "agglomerative" means that the algorithm starts with each point as its own cluster and, at every step, merges the two closest clusters. Even at the end, noise points and outliers usually remain as their own small clusters, unless merging goes too far. "Closest" here has three common definitions; I use MIN in the implementation: at each step, take the currently nearest point pair, and if the two points are not yet in the same cluster, merge the two clusters they belong to (a small sketch of all three definitions follows the list):
(1) Single link (MIN): the distance between two clusters is the distance between the two closest points that lie in different clusters.
(2) Complete link (MAX): the distance between two clusters is the distance between the two farthest points that lie in different clusters.
(3) Group average: the distance between two clusters is the average of the pairwise distances over all point pairs drawn from the two different clusters.
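For concreteness, here is a minimal sketch of the three definitions (my illustration, not part of the original code), assuming each cluster is a list of 2D points and using the same squared Euclidean distance as the implementation below:

def p2p(a, b):
    # squared Euclidean distance between two 2D points
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2

def single_link(c1, c2):
    # MIN: distance of the closest cross-cluster point pair
    return min(p2p(a, b) for a in c1 for b in c2)

def complete_link(c1, c2):
    # MAX: distance of the farthest cross-cluster point pair
    return max(p2p(a, b) for a in c1 for b in c2)

def group_average(c1, c2):
    # average of all cross-cluster pairwise distances
    return sum(p2p(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))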
Based on this, the code below computes the distance between every pair of points up front, sorts the pairs by distance in descending order, and then repeatedly pops the nearest remaining pair to merge. To prevent excessive merging, the exit condition is that 90% of the clusters have been merged, i.e., the loop stops once the current number of clusters falls to 10% of the initial number.
The implementation code is as follows:
# coding=utf-8
# Agglomerative Hierarchical Clustering (AHC)
import pylab as pl
from operator import itemgetter
from collections import OrderedDict, Counter

# Each line of the input file holds one point in the form "x#y"
points = [[int(eachpoint.split('#')[0]), int(eachpoint.split('#')[1])] for eachpoint in open("points", "r")]

# Initially, each point is assigned to its own cluster
groups = [idx for idx in range(len(points))]

# Compute the (squared Euclidean) distance between every point pair
disp2p = {}
for idx1, point1 in enumerate(points):
    for idx2, point2 in enumerate(points):
        if idx1 < idx2:
            distance = pow(abs(point1[0] - point2[0]), 2) + pow(abs(point1[1] - point2[1]), 2)
            disp2p[str(idx1) + "#" + str(idx2)] = distance

# Sort the point pairs by distance in descending order,
# so popitem() always returns the currently nearest remaining pair
disp2p = OrderedDict(sorted(disp2p.items(), key=itemgetter(1), reverse=True))

# Current number of clusters
groupnum = len(groups)

# Merging too far would absorb the noise points; stop once the number
# of clusters drops to finalgroupnum (10% of the initial count)
finalgroupnum = int(groupnum * 0.1)

while groupnum > finalgroupnum:
    # Take the nearest remaining point pair
    twopoins, distance = disp2p.popitem()
    pointa = int(twopoins.split('#')[0])
    pointb = int(twopoins.split('#')[1])
    pointagroup = groups[pointa]
    pointbgroup = groups[pointb]
    # If the two nearest points are not already in the same cluster, move every
    # point of B's cluster into A's cluster and decrease the cluster count by 1
    if pointagroup != pointbgroup:
        for idx in range(len(groups)):
            if groups[idx] == pointbgroup:
                groups[idx] = pointagroup
        groupnum -= 1

# Keep the 3 largest clusters; all other clusters are treated as noise points
wantgroupnum = 3
finalgroup = Counter(groups).most_common(wantgroupnum)
finalgroup = [onecount[0] for onecount in finalgroup]
droppoints = [points[idx] for idx in range(len(points)) if groups[idx] not in finalgroup]

# Plot the points of the 3 largest clusters
group1 = [points[idx] for idx in range(len(points)) if groups[idx] == finalgroup[0]]
group2 = [points[idx] for idx in range(len(points)) if groups[idx] == finalgroup[1]]
group3 = [points[idx] for idx in range(len(points)) if groups[idx] == finalgroup[2]]
pl.plot([eachpoint[0] for eachpoint in group1], [eachpoint[1] for eachpoint in group1], 'or')
pl.plot([eachpoint[0] for eachpoint in group2], [eachpoint[1] for eachpoint in group2], 'oy')
pl.plot([eachpoint[0] for eachpoint in group3], [eachpoint[1] for eachpoint in group3], 'og')

# Plot the noise points in black
pl.plot([eachpoint[0] for eachpoint in droppoints], [eachpoint[1] for eachpoint in droppoints], 'ok')
pl.show()
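The script reads a file named points with one point per line in the form x#y. Since the original dataset is not included, here is one hypothetical way to generate a compatible file with three dense blobs plus a few scattered noise points (values invented for illustration):

import random

with open("points", "w") as f:
    # three dense blobs
    for cx, cy in [(10, 10), (40, 15), (25, 40)]:
        for _ in range(50):
            f.write("%d#%d\n" % (int(random.gauss(cx, 2)), int(random.gauss(cy, 2))))
    # a few scattered noise points
    for _ in range(10):
        f.write("%d#%d\n" % (random.randint(0, 50), random.randint(0, 50)))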
In addition, note that agglomerative hierarchical clustering has no global objective function of the kind basic K-means optimizes, so it suffers neither from local minima nor from the difficulty of choosing initial points. Merge operations, however, are final: once two clusters have been merged, the decision is never revoked. The price is that computation and storage are expensive, since all pairwise distances must be computed and kept.
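As a rough cross-check (my addition, assuming SciPy is available and that points is the list loaded above), the same single-link clustering can be reproduced with scipy.cluster.hierarchy; note this cuts the tree into exactly 3 clusters and does not reproduce the noise-point handling used above:

from scipy.cluster.hierarchy import linkage, fcluster

# single-link (MIN) hierarchical clustering on the same points
Z = linkage(points, method='single')
# cut the dendrogram so that at most 3 clusters remain
labels = fcluster(Z, t=3, criterion='maxclust')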