K-means is a classical clustering algorithm. The procedure is as follows:
Select K points as the initial centroids
Repeat
  Assign each point to its nearest centroid, forming K clusters
  Recalculate the centroid of each cluster
Until the clusters no longer change or the maximum number of iterations is reached
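The loop above can be sketched in Java roughly as follows. This is a minimal illustration rather than a reference implementation: the class name KMeansSketch, the method cluster, the use of plain double[] pairs for 2-D points, and the choice of the first K points as initial centroids are all assumptions made for this example.

```java
import java.util.Arrays;

/** Minimal 2-D K-means sketch; all names here are illustrative. */
class KMeansSketch {

    /**
     * Runs K-means on 2-D points and returns the cluster index of each point.
     * Assumption: the first k points are used as the initial centroids.
     */
    static int[] cluster(double[][] points, int k, int maxIterations) {
        if (k < 1 || k > points.length) {
            throw new IllegalArgumentException("k must be between 1 and the number of points");
        }
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = points[c].clone();
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1); // -1 means "not assigned yet"

        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;

            // Assign each point to its nearest centroid.
            for (int i = 0; i < points.length; i++) {
                int nearest = 0;
                double best = Double.MAX_VALUE;
                for (int c = 0; c < k; c++) {
                    double dx = points[i][0] - centroids[c][0];
                    double dy = points[i][1] - centroids[c][1];
                    double d = dx * dx + dy * dy; // squared distance is enough for comparison
                    if (d < best) {
                        best = d;
                        nearest = c;
                    }
                }
                if (assignment[i] != nearest) {
                    assignment[i] = nearest;
                    changed = true;
                }
            }

            // Stop as soon as no point changed cluster.
            if (!changed) {
                break;
            }

            // Recompute each centroid as the mean of its assigned points.
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                sums[assignment[i]][0] += points[i][0];
                sums[assignment[i]][1] += points[i][1];
                counts[assignment[i]]++;
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) { // an empty cluster keeps its old centroid
                    centroids[c][0] = sums[c][0] / counts[c];
                    centroids[c][1] = sums[c][1] / counts[c];
                }
            }
        }
        return assignment;
    }

    public static void main(String[] args) {
        // The first three points come from different groups so the initial centroids are spread out.
        double[][] points = {
            {0, 0}, {5, 5}, {10, 2}, {0, 1}, {1, 0}, {5, 6}, {6, 5}, {10, 3}, {11, 3}
        };
        System.out.println(Arrays.toString(cluster(points, 3, 100)));
        // prints [0, 1, 2, 0, 0, 1, 1, 2, 2]
    }
}
```

Taking the first K points as centroids keeps the sketch deterministic, but the result depends heavily on that initial choice, and K itself still has to be supplied by the caller.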
The value of K must be specified manually. There are several ways to determine K, such as running multiple trials with different values, computing the error for each, and picking the best K, but this takes a long time. Alternatively, the canopy algorithm gives a rough estimate of K (the number of canopies can be taken as K). The canopy algorithm works as follows:
(1) Let the sample set be S, and choose two thresholds T1 and T2 with T1 > T2.
(2) Take any sample point P from S as a new canopy, denoted C, and remove P from S.
(3) Compute the distance dist from P to every remaining point in S.
(4) If dist < T1, add the corresponding point to C as a weak association.
(5) If dist < T2, remove the corresponding point from S as a strong association.
(6) Repeat (2) to (5) until S is empty.
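Steps (1) to (6), with both thresholds in play, might look like the following Java sketch. The class name CanopySketch, the helper distance, the double[] point representation, and the threshold values 6.0 and 3.0 are hypothetical choices for illustration only.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch of canopy clustering with both thresholds; all names here are illustrative. */
class CanopySketch {

    /** Euclidean distance between two 2-D points. */
    static double distance(double[] a, double[] b) {
        return Math.hypot(a[0] - b[0], a[1] - b[1]);
    }

    /**
     * Builds canopies from the given points using thresholds t1 > t2.
     * Each canopy is a list of points whose first element is its base point P.
     */
    static List<List<double[]>> canopies(List<double[]> input, double t1, double t2) {
        if (t1 <= t2) {
            throw new IllegalArgumentException("t1 must be greater than t2");
        }
        List<double[]> s = new ArrayList<>(input); // working copy of S
        List<List<double[]>> result = new ArrayList<>();

        while (!s.isEmpty()) {
            // Step (2): take a point P as a new canopy C and remove it from S.
            double[] p = s.remove(0);
            List<double[]> canopy = new ArrayList<>();
            canopy.add(p);

            // Steps (3) to (5): compare every remaining point with P.
            List<double[]> remaining = new ArrayList<>();
            for (double[] q : s) {
                double dist = distance(p, q);
                if (dist < t1) {
                    canopy.add(q); // weak association: q joins C
                }
                if (dist >= t2) {
                    remaining.add(q); // only strongly associated points leave S
                }
            }
            s = remaining;
            result.add(canopy);
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] raw = {{0, 0}, {0, 1}, {1, 0}, {5, 5}, {5, 6}, {6, 5}, {10, 2}, {10, 3}, {11, 3}};
        List<double[]> points = new ArrayList<>();
        for (double[] r : raw) {
            points.add(r);
        }
        // Hypothetical thresholds chosen for this example only.
        List<List<double[]>> result = canopies(points, 6.0, 3.0);
        System.out.println("canopies: " + result.size()); // prints canopies: 3
        for (List<double[]> canopy : result) {
            System.out.println("size: " + canopy.size()); // prints sizes 3, 5, 3
        }
    }
}
```

With these thresholds the second canopy also picks up two points that later fall into the third canopy, which shows that canopies may overlap: T1 decides who joins a canopy, while T2 alone decides which points leave S.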
The number of canopies can be used as the K value, which reduces the blindness of choosing K to some extent. The code below applies the canopy algorithm to a set of points to count the canopies. If only the K value is needed, T1 has no effect and specifying T2 alone is enough; here the average distance between all pairs of points is used as T2.
package cn.edu.ustc.dm.cluster;

import java.util.ArrayList;
import java.util.List;

import cn.edu.ustc.dm.bean.Point;

/**
 * Uses the canopy algorithm to estimate the K value for K-means.
 * When only K is needed, the canopy threshold T1 has no effect, so only T2
 * (T1 > T2) is used. Here T2 is set to the average pairwise distance.
 *
 * Point is imported from cn.edu.ustc.dm.bean and is not shown in this article;
 * the code relies on it having public x and y fields and a Point(x, y) constructor.
 *
 * @author YD
 */
public class Canopy {

    private List<Point> points = new ArrayList<Point>(); // points to be clustered
    private List<List<Point>> clusters = new ArrayList<List<Point>>(); // resulting canopies
    private double T2 = -1; // threshold

    public Canopy(List<Point> points) {
        // Copy the list so the caller's list is not modified
        // (the Point objects themselves are shared, not cloned)
        for (Point point : points)
            this.points.add(point);
    }

    /**
     * Clusters all points according to the canopy algorithm.
     */
    public void cluster() {
        T2 = getAverageDistance(points);

        while (points.size() != 0) {
            List<Point> cluster = new ArrayList<Point>();
            Point basePoint = points.get(0); // base point of the new canopy
            cluster.add(basePoint);
            points.remove(0);
            int index = 0;
            while (index < points.size()) {
                Point anotherPoint = points.get(index);
                double distance = Math.sqrt((basePoint.x - anotherPoint.x) * (basePoint.x - anotherPoint.x)
                        + (basePoint.y - anotherPoint.y) * (basePoint.y - anotherPoint.y));

                if (distance <= T2) {
                    // Strong association: the point joins the canopy and leaves the set
                    cluster.add(anotherPoint);
                    points.remove(index);
                } else {
                    index++;
                }
            }
            clusters.add(cluster);
        }
    }

    /**
     * Gets the number of clusters.
     *
     * @return the number of clusters
     */
    public int getClusterNumber() {
        return clusters.size();
    }

    /**
     * Gets the center point of every cluster (the mean of the points in it).
     *
     * @return the center points
     */
    public List<Point> getClusterCenterPoints() {
        List<Point> centerPoints = new ArrayList<Point>();
        for (List<Point> cluster : clusters) {
            centerPoints.add(getCenterPoint(cluster));
        }
        return centerPoints;
    }

    /**
     * Computes the average distance between all pairs of points.
     *
     * @return the average pairwise distance, used as T2
     */
    private double getAverageDistance(List<Point> points) {
        int pointSize = points.size();
        if (pointSize < 2)
            return 0; // no pairs to measure

        double sum = 0;
        // Visit each unordered pair exactly once (j starts at i + 1)
        for (int i = 0; i < pointSize; i++) {
            for (int j = i + 1; j < pointSize; j++) {
                Point pointA = points.get(i);
                Point pointB = points.get(j);
                sum += Math.sqrt((pointA.x - pointB.x) * (pointA.x - pointB.x)
                        + (pointA.y - pointB.y) * (pointA.y - pointB.y));
            }
        }
        int distanceNumber = pointSize * (pointSize - 1) / 2; // number of unordered pairs
        return sum / distanceNumber;
    }

    /**
     * Computes the center point of a cluster (the mean of its points).
     *
     * @return the center point
     */
    private Point getCenterPoint(List<Point> points) {
        double sumX = 0;
        double sumY = 0;
        for (Point point : points) {
            sumX += point.x;
            sumY += point.y;
        }
        int clusterSize = points.size();
        Point centerPoint = new Point(sumX / clusterSize, sumY / clusterSize);
        return centerPoint;
    }

    /**
     * Gets the threshold T2.
     *
     * @return the threshold T2
     */
    public double getThreshold() {
        return T2;
    }

    /**
     * Runs the algorithm on 9 test points.
     *
     * @param args
     */
    public static void main(String[] args) {
        List<Point> points = new ArrayList<Point>();
        points.add(new Point(0, 0));
        points.add(new Point(0, 1));
        points.add(new Point(1, 0));

        points.add(new Point(5, 5));
        points.add(new Point(5, 6));
        points.add(new Point(6, 5));

        points.add(new Point(10, 2));
        points.add(new Point(10, 3));
        points.add(new Point(11, 3));

        Canopy canopy = new Canopy(points);
        canopy.cluster();

        // Get the number of canopies
        int clusterNumber = canopy.getClusterNumber();
        System.out.println(clusterNumber);

        // Get the value of T2 used by the canopy algorithm
        System.out.println(canopy.getThreshold());
    }
}
The code above runs the canopy algorithm on 9 points and prints the number of canopies, which serves as K, followed by the T2 value that was used.
For more articles, please visit Xiao Fat Xuan.