Clustering algorithm: Agglomerative hierarchical clustering


Agglomerative Hierarchical Clustering:

The term "agglomerative" (condensed) means that the algorithm initially treats each point as its own cluster and, at every step, merges the two closest clusters. A useful side effect is that noise points and outliers usually remain as their own small clusters at the end, unless merging goes too far. "Closest" can be defined in three ways (a short sketch comparing the three follows the list below). My implementation uses MIN: at each step it takes the current nearest point pair and, if the two points are not already in the same cluster, merges the two clusters they belong to:

(1) Single link (MIN): defines cluster proximity as the distance between the two closest points in two different clusters.

(2) Complete link (MAX): defines cluster proximity as the distance between the two farthest points in two different clusters.

(3) Group average: defines cluster proximity as the average of the pairwise distances over all point pairs drawn from two different clusters.
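To make the three definitions concrete, here is a small sketch (not part of the implementation below, but using the same squared Euclidean distance) that computes all three proximities for two example clusters:

[Python]
# Sketch: the three cluster-proximity definitions
def dist(p, q):
    # squared Euclidean distance, as in the implementation below
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

def single_link(ca, cb):
    # MIN: distance between the two closest cross-cluster points
    return min(dist(p, q) for p in ca for q in cb)

def complete_link(ca, cb):
    # MAX: distance between the two farthest cross-cluster points
    return max(dist(p, q) for p in ca for q in cb)

def group_average(ca, cb):
    # average distance over all cross-cluster point pairs
    return sum(dist(p, q) for p in ca for q in cb) / float(len(ca) * len(cb))

ca, cb = [[0, 0], [1, 0]], [[3, 0], [5, 0]]
print(single_link(ca, cb), complete_link(ca, cb), group_average(ca, cb))  # 4 25 13.5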

The code below implements this algorithm. It computes the distance for every point pair up front, then repeatedly merges the closest pair. To prevent excessive merging, the exit condition is that 90% of the clusters have been merged, i.e. the current number of clusters has dropped to 10% of the initial count:

The implementation code is as follows:

[Python]
# coding=utf-8
# Agglomerative Hierarchical Clustering (AHC)
import pylab as pl
from operator import itemgetter
from collections import OrderedDict, Counter

# Each line of the "points" file holds one point in "x#y" form
points = [[int(eachpoint.split('#')[0]), int(eachpoint.split('#')[1])] for eachpoint in open("points", "r")]

# Initially, each point is assigned its own cluster
groups = [idx for idx in range(len(points))]

# Calculate the (squared Euclidean) distance between each point pair
disp2p = {}
for idx1, point1 in enumerate(points):
    for idx2, point2 in enumerate(points):
        if idx1 < idx2:
            distance = pow(abs(point1[0] - point2[0]), 2) + pow(abs(point1[1] - point2[1]), 2)
            disp2p[str(idx1) + "#" + str(idx2)] = distance

# Sort the point pairs by distance in descending order, so that
# popitem() below always returns the current closest pair
disp2p = OrderedDict(sorted(disp2p.items(), key=itemgetter(1), reverse=True))

# Current number of clusters
groupnum = len(groups)

# Excessive merging would absorb the noise points; stop
# merging once the cluster count drops to finalgroupnum
finalgroupnum = int(groupnum * 0.1)

while groupnum > finalgroupnum:
    # Take the next closest point pair
    twopoints, distance = disp2p.popitem()
    pointa = int(twopoints.split('#')[0])
    pointb = int(twopoints.split('#')[1])
    pointagroup = groups[pointa]
    pointbgroup = groups[pointb]
    # If the current closest two points are not in the same cluster, move all
    # points in B's cluster into A's cluster and decrease the cluster count by 1
    if pointagroup != pointbgroup:
        for idx in range(len(groups)):
            if groups[idx] == pointbgroup:
                groups[idx] = pointagroup
        groupnum -= 1

# Keep the 3 largest clusters; all other clusters are treated as noise points
wantgroupnum = 3
finalgroup = Counter(groups).most_common(wantgroupnum)
finalgroup = [onecount[0] for onecount in finalgroup]
droppoints = [points[idx] for idx in range(len(points)) if groups[idx] not in finalgroup]

# Plot the points of the 3 largest clusters
group1 = [points[idx] for idx in range(len(points)) if groups[idx] == finalgroup[0]]
group2 = [points[idx] for idx in range(len(points)) if groups[idx] == finalgroup[1]]
group3 = [points[idx] for idx in range(len(points)) if groups[idx] == finalgroup[2]]
pl.plot([eachpoint[0] for eachpoint in group1], [eachpoint[1] for eachpoint in group1], 'or')
pl.plot([eachpoint[0] for eachpoint in group2], [eachpoint[1] for eachpoint in group2], 'oy')
pl.plot([eachpoint[0] for eachpoint in group3], [eachpoint[1] for eachpoint in group3], 'og')

# Plot the noise points in black
pl.plot([eachpoint[0] for eachpoint in droppoints], [eachpoint[1] for eachpoint in droppoints], 'ok')
pl.show()
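The code reads a "points" file from the working directory, with one point per line in x#y form (as the parsing at the top of the listing implies). For a quick test, a file in that format can be generated with a sketch like the following; the cluster centers and spreads are arbitrary values chosen only for illustration:

[Python]
# Sketch: generate a sample "points" file (three blobs plus scattered noise)
import random

with open("points", "w") as f:
    for cx, cy in [(20, 20), (80, 30), (50, 80)]:  # arbitrary cluster centers
        for _ in range(60):
            f.write("%d#%d\n" % (int(random.gauss(cx, 4)), int(random.gauss(cy, 4))))
    for _ in range(15):  # a few uniform noise points
        f.write("%d#%d\n" % (random.randint(0, 100), random.randint(0, 100)))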

In addition, note that agglomerative hierarchical clustering, unlike basic K-means, has no global objective function, so it suffers neither from the local-minimum problem nor from the difficulty of choosing good initial points. On the other hand, every merge is final: once two clusters have been merged, the decision is never revoked. And of course, the computation and storage costs are high, since all pairwise distances must be computed and kept.
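For comparison only (this is not the implementation above): the same single-link clustering can be expressed with SciPy's hierarchical-clustering routines, which store the pairwise distances in a condensed matrix rather than a dictionary. This sketch assumes points is the same list of [x, y] pairs as above:

[Python]
# Sketch: single-link AHC via SciPy, cut into 3 flat clusters
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

dists = pdist(points)                            # condensed pairwise distance matrix
Z = linkage(dists, method='single')              # MIN / single-link merging order
labels = fcluster(Z, t=3, criterion='maxclust')  # cut the dendrogram into 3 clusters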

