Using the Canopy algorithm to estimate the number of clusters

Source: Internet
Author: User

K-means is a classic clustering algorithm. It works as follows:
Select K points as the initial centroids
Repeat
    Assign each point to its nearest centroid, forming K clusters
    Recompute the centroid of each cluster
Until the clusters no longer change or the maximum number of iterations is reached
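
The loop above can be sketched in Java as follows. This is a minimal illustration, not the original author's code: it assumes 2-D points stored as double[]{x, y} arrays and initial centroids supplied by the caller, and the names KMeansSketch, kMeans and squaredDistance are made up for this example.

```java
import java.util.Arrays;

/** Minimal K-means sketch on 2-D points stored as double[]{x, y} (illustrative names). */
final class KMeansSketch {

    /** Returns the final centroids; 'initial' supplies the K starting centroids (K >= 1). */
    static double[][] kMeans(double[][] points, double[][] initial, int maxIterations) {
        int k = initial.length;
        double[][] centroids = new double[k][];
        for (int c = 0; c < k; c++) {
            centroids[c] = initial[c].clone(); // do not modify the caller's arrays
        }
        int[] assignment = new int[points.length];
        Arrays.fill(assignment, -1);

        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: attach each point to its nearest centroid.
            for (int i = 0; i < points.length; i++) {
                int best = 0;
                for (int c = 1; c < k; c++) {
                    if (squaredDistance(points[i], centroids[c])
                            < squaredDistance(points[i], centroids[best])) {
                        best = c;
                    }
                }
                if (assignment[i] != best) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // the clusters did not change, so stop early
            }
            // Update step: move each centroid to the mean of its points.
            double[][] sums = new double[k][2];
            int[] counts = new int[k];
            for (int i = 0; i < points.length; i++) {
                sums[assignment[i]][0] += points[i][0];
                sums[assignment[i]][1] += points[i][1];
                counts[assignment[i]]++;
            }
            for (int c = 0; c < k; c++) {
                if (counts[c] > 0) { // an empty cluster keeps its previous centroid
                    centroids[c][0] = sums[c][0] / counts[c];
                    centroids[c][1] = sums[c][1] / counts[c];
                }
            }
        }
        return centroids;
    }

    /** Squared Euclidean distance; enough for comparing which centroid is nearer. */
    static double squaredDistance(double[] a, double[] b) {
        double dx = a[0] - b[0];
        double dy = a[1] - b[1];
        return dx * dx + dy * dy;
    }

    public static void main(String[] args) {
        double[][] points = {
            {0, 0}, {0, 1}, {1, 0}, {5, 5}, {5, 6}, {6, 5}, {10, 2}, {10, 3}, {11, 3}
        };
        // Assumption for this demo: one starting centroid taken from each visible group.
        double[][] initial = {points[0], points[3], points[6]};
        for (double[] centroid : kMeans(points, initial, 100)) {
            System.out.println(Arrays.toString(centroid));
        }
    }
}
```

Note that K enters only through the number of initial centroids, which is exactly the value that has to be chosen beforehand.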

K must be specified by hand. There are several ways to choose it, such as running K-means with a range of candidate values, computing the clustering error for each, and keeping the best K, but this takes a long time. Instead, the Canopy algorithm can give a rough estimate of K: the number of canopies can be taken as K. The Canopy algorithm works as follows:

(1) Let S be the sample set. Choose two thresholds T1 and T2 with T1 > T2.
(2) Take any sample point P from S as the center of a new canopy C, and remove P from S.
(3) Compute the distance dist from P to every point remaining in S.
(4) If dist < T1, add the point to C (a weak association).
(5) If dist < T2, also remove the point from S (a strong association).
(6) Repeat (2) to (5) until S is empty.
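
Steps (1) to (6) can be sketched in Java with both thresholds in play. This is only an illustration under stated assumptions: points are 2-D double[]{x, y} arrays, P is always the first remaining point, and the names CanopySketch and canopies, as well as the demo values T1 = 8 and T2 = 3, are invented for this example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Canopy sketch with both thresholds on 2-D points stored as double[]{x, y} (illustrative names). */
final class CanopySketch {

    /** Builds the canopies; requires t1 > t2. The input list is not modified. */
    static List<List<double[]>> canopies(List<double[]> samples, double t1, double t2) {
        if (t1 <= t2) {
            throw new IllegalArgumentException("T1 must be greater than T2");
        }
        List<double[]> remaining = new ArrayList<>(samples); // step (1): the set S
        List<List<double[]>> result = new ArrayList<>();
        while (!remaining.isEmpty()) {                        // step (6)
            // Step (2): take a point P as the center of a new canopy and remove it from S.
            double[] center = remaining.remove(0);
            List<double[]> canopy = new ArrayList<>();
            canopy.add(center);
            Iterator<double[]> it = remaining.iterator();
            while (it.hasNext()) {
                double[] q = it.next();
                // Step (3): Euclidean distance from P to the candidate point.
                double dist = Math.hypot(q[0] - center[0], q[1] - center[1]);
                if (dist < t1) {
                    canopy.add(q);  // step (4): weak association, q joins this canopy
                }
                if (dist < t2) {
                    it.remove();    // step (5): strong association, q leaves S
                }
            }
            result.add(canopy);
        }
        return result;
    }

    public static void main(String[] args) {
        List<double[]> samples = new ArrayList<>();
        double[][] raw = {
            {0, 0}, {0, 1}, {1, 0}, {5, 5}, {5, 6}, {6, 5}, {10, 2}, {10, 3}, {11, 3}
        };
        for (double[] p : raw) {
            samples.add(p);
        }
        // Assumed demo thresholds, chosen by hand for these points.
        List<List<double[]>> result = canopies(samples, 8.0, 3.0);
        System.out.println("canopies: " + result.size());
        for (List<double[]> canopy : result) {
            System.out.println("size: " + canopy.size());
        }
    }
}
```

A point within T1 but not within T2 stays in S, so it can appear in more than one canopy. Only T2 decides when a point stops being a candidate center, which is why the canopy count depends on T2 alone.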

The number of canopies can be used as K, which reduces the guesswork in choosing K to some extent. The code below uses the Canopy algorithm to count the canopies for a set of points. If only K is needed, T1 has no effect, so specifying T2 alone is enough. Here T2 is set to the average distance between all pairs of points.

package cn.edu.ustc.dm.cluster;

import java.util.ArrayList;
import java.util.List;

import cn.edu.ustc.dm.bean.Point;

/**
 * Uses the Canopy algorithm to estimate the K value for K-means.
 * When only K is needed, T1 is irrelevant and only T2 (T1 > T2) has to be set.
 * Here T2 is set to the average distance between points.
 *
 * @author YD
 */
public class Canopy {
    private List<Point> points = new ArrayList<Point>(); // points to cluster
    private List<List<Point>> clusters = new ArrayList<List<Point>>(); // resulting clusters
    private double T2 = -1; // threshold

    public Canopy(List<Point> points) {
        // Copy the references into our own list, so that removing points
        // during clustering does not modify the caller's list.
        for (Point point : points) {
            this.points.add(point);
        }
    }

    /**
     * Clusters all points using the Canopy algorithm.
     */
    public void cluster() {
        T2 = getAverageDistance(points);
        while (points.size() != 0) {
            List<Point> cluster = new ArrayList<Point>();
            Point basePoint = points.get(0); // reference point of the new canopy
            cluster.add(basePoint);
            points.remove(0);
            int index = 0;
            while (index < points.size()) {
                Point anotherPoint = points.get(index);
                double distance = Math.sqrt((basePoint.x - anotherPoint.x)
                        * (basePoint.x - anotherPoint.x)
                        + (basePoint.y - anotherPoint.y)
                        * (basePoint.y - anotherPoint.y));
                if (distance <= T2) {
                    cluster.add(anotherPoint);
                    points.remove(index);
                } else {
                    index++;
                }
            }
            clusters.add(cluster);
        }
    }

    /**
     * Returns the number of clusters.
     *
     * @return the number of clusters
     */
    public int getClusterNumber() {
        return clusters.size();
    }

    /**
     * Returns the center point of each cluster (the mean of its points).
     *
     * @return the cluster center points
     */
    public List<Point> getClusterCenterPoints() {
        List<Point> centerPoints = new ArrayList<Point>();
        for (List<Point> cluster : clusters) {
            centerPoints.add(getCenterPoint(cluster));
        }
        return centerPoints;
    }

    /**
     * Computes the average distance between all pairs of points.
     *
     * @return the average pairwise distance, used as T2
     */
    private double getAverageDistance(List<Point> points) {
        int pointSize = points.size();
        if (pointSize < 2) {
            return 0; // fewer than two points: there are no pairs to measure
        }
        double sum = 0;
        for (int i = 0; i < pointSize; i++) {
            for (int j = 0; j < pointSize; j++) {
                if (i == j)
                    continue;
                Point pointA = points.get(i);
                Point pointB = points.get(j);
                sum += Math.sqrt((pointA.x - pointB.x) * (pointA.x - pointB.x)
                        + (pointA.y - pointB.y) * (pointA.y - pointB.y));
            }
        }
        // There are n(n-1)/2 distinct pairs. The loops visit each pair twice
        // (as i,j and as j,i), so the sum is halved to get the average.
        int distanceNumber = pointSize * (pointSize - 1) / 2;
        double T2 = sum / distanceNumber / 2;
        return T2;
    }

    /**
     * Computes the center point of a cluster (the mean of its points).
     *
     * @return the center point
     */
    private Point getCenterPoint(List<Point> points) {
        double sumX = 0;
        double sumY = 0;
        for (Point point : points) {
            sumX += point.x;
            sumY += point.y;
        }
        int clusterSize = points.size();
        Point centerPoint = new Point(sumX / clusterSize, sumY / clusterSize);
        return centerPoint;
    }

    /**
     * Returns the threshold T2.
     *
     * @return the threshold T2
     */
    public double getThreshold() {
        return T2;
    }

    /**
     * Runs the algorithm on 9 test points.
     *
     * @param args unused
     */
    public static void main(String[] args) {
        List<Point> points = new ArrayList<Point>();
        points.add(new Point(0, 0));
        points.add(new Point(0, 1));
        points.add(new Point(1, 0));

        points.add(new Point(5, 5));
        points.add(new Point(5, 6));
        points.add(new Point(6, 5));

        points.add(new Point(10, 2));
        points.add(new Point(10, 3));
        points.add(new Point(11, 3));

        Canopy canopy = new Canopy(points);
        canopy.cluster();

        // Get the number of canopies
        int clusterNumber = canopy.getClusterNumber();
        System.out.println(clusterNumber);

        // Get the value of T2 used by the canopy
        System.out.println(canopy.getThreshold());
    }
}

The code above runs the Canopy algorithm on 9 points and obtains the number of canopies, which is then used as K.
