First, the concept
Unlike traditional clustering algorithms such as K-means, the greatest feature of Canopy clustering is that it does not require the K value (the number of clusters) to be specified beforehand, which gives it great practical value. Compared with other clustering algorithms its accuracy is lower, but it has a large advantage in speed, so Canopy clustering can first be used to "coarsely" cluster the data and obtain a K value, after which K-means can be used for further "fine" clustering. This Canopy + K-means hybrid clustering approach consists of the following two steps:
Step 1: The most computationally expensive part of clustering is computing the similarity between objects. In its first stage, Canopy clustering uses a simple, computationally cheap method to compute similarity and places similar objects into subsets called canopies. Canopies may overlap one another, but there is no case in which an object belongs to no canopy at all. This stage can be regarded as data preprocessing.
Step 2: Run a traditional clustering method (such as K-means) within each canopy; similarity is never computed between objects that do not belong to the same canopy.
This method has at least two advantages. First, as long as the canopies are not too large and do not overlap too much, the number of object pairs whose similarity must be computed later is greatly reduced. Second, clustering methods like K-means require the value of K to be specified manually; the number of canopies obtained in Stage 1 can serve as the K value, which reduces the blindness of choosing K to some extent.
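The coarse-then-fine pipeline above can be sketched as follows. This is a minimal illustration under my own assumptions, not a production implementation: `canopy_count` is a simplified first pass that only counts canopies (only the tight threshold T2 affects how many canopies form), and `kmeans` is a plain Lloyd's iteration standing in for the "fine" stage; all function names and thresholds are this sketch's, not the original text's.

```python
import math
import random

def canopy_count(points, t2):
    """Coarse pass: count the canopies that would form.

    Repeatedly pick an arbitrary point as a canopy center and drop
    every remaining point closer than t2 to it; the number of picks
    is the number of canopies, which we reuse as K.
    """
    remaining = list(points)
    k = 0
    while remaining:
        center = remaining.pop(random.randrange(len(remaining)))
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
        k += 1
    return k

def kmeans(points, k, iters=20):
    """Fine pass: plain Lloyd's K-means with the K found above."""
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        buckets = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            buckets[i].append(p)
        # Move each center to the mean of its bucket (keep it if empty).
        centers = [
            tuple(sum(c) / len(b) for c in zip(*b)) if b else centers[i]
            for i, b in enumerate(buckets)
        ]
    return centers

# Two tight clusters far apart: the coarse pass finds K = 2,
# which is then handed to the fine pass.
pts = [(0, 0), (0.1, 0), (0, 0.1), (10, 10), (10.1, 10), (10, 10.1)]
k = canopy_count(pts, t2=1.0)
final_centers = kmeans(pts, k)
```

Note that this simplified pipeline runs K-means over the whole dataset and only borrows K from the canopy pass; the full method described above additionally restricts similarity computations to points sharing a canopy.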
Second, clustering accuracy
For traditional clustering methods such as K-means, expectation-maximization, and greedy agglomerative clustering, the similarity between an object and a cluster is the distance from that point to the cluster center. In this case, clustering accuracy is well preserved under the following condition:
For each cluster there is a canopy that contains all the elements belonging to that cluster.
If similarity is instead measured as the distance from the current point to the nearest point of a cluster, then clustering accuracy is well preserved under this condition:
Each cluster has several canopies, and these canopies are connected by elements of the cluster (the overlapping portions contain elements from the cluster).
After the canopy partitioning of the dataset is complete, the result resembles the following figure:
Third, the Canopy algorithm flow
(1) Vectorize the dataset to obtain a list and load it into memory, and choose two distance thresholds T1 and T2, where T1 > T2 (in the figure above, the solid circles correspond to T1 and the dashed circles to T2). T1 and T2 can be determined by cross-validation.
(2) Take any point P from the list and use a fast, computationally cheap method to compute the distance between P and every existing canopy (if no canopy exists yet, P itself becomes a canopy). If the distance between P and some canopy is within T1, add P to that canopy.
(3) If the distance between P and some canopy is within T2, remove P from the list. The meaning of this step is that P is already close enough to this canopy that it cannot become the center of another canopy.
(4) Repeat steps (2) and (3) until the list is empty.
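The four steps above can be sketched in Python. This follows the common formulation of the algorithm in which the candidate center P is popped from the list and compared against the remaining points; the function name `canopy`, the 2-D tuple point format, and Euclidean distance are assumptions of this sketch rather than requirements of the algorithm.

```python
import math
import random

def canopy(points, t1, t2):
    """Group points into (possibly overlapping) canopies.

    t1 is the loose threshold, t2 the tight one (t1 > t2).
    Returns a list of (center, members) pairs.
    """
    assert t1 > t2, "the loose threshold T1 must exceed the tight threshold T2"
    remaining = list(points)
    canopies = []
    while remaining:
        # Step (2): pick an arbitrary point P as a new canopy center.
        center = remaining.pop(random.randrange(len(remaining)))
        members = [center]
        kept = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                # Within the loose threshold: P's canopy gains this point.
                members.append(p)
            if d >= t2:
                # Step (3): only points outside the tight threshold stay
                # in the list; closer points can never seed a new canopy.
                kept.append(p)
        remaining = kept  # Step (4): loop until the list is empty.
        canopies.append((center, members))
    return canopies

# Two tight clusters far apart yield exactly two canopies.
pts = [(0, 0), (0.1, 0), (0, 0.1), (10, 10), (10.1, 10), (10, 10.1)]
result = canopy(pts, t1=3.0, t2=1.0)
```

Note that a point whose distance to a center falls between T2 and T1 is added to that canopy but stays in the list, which is exactly how overlapping canopies arise.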