Canopy Clustering algorithm

Source: Internet
Author: User

First, the concept

Unlike traditional clustering algorithms such as K-means, canopy clustering does not require the number of clusters (the K value) to be specified in advance, which gives it great practical value. Compared with other clustering algorithms, canopy clustering is less accurate but much faster. It can therefore be used first to cluster the data "coarsely" and obtain a K value, after which K-means performs a further "fine" clustering. This Canopy + K-means hybrid approach consists of the following two steps:

Step 1. The most computationally expensive part of clustering is calculating the similarity between objects. In its first stage, canopy clustering uses a simple, computationally cheap method to calculate object similarity and places similar objects into a subset, which is called a canopy. A series of such computations yields a number of canopies. Canopies may overlap, but no object is left outside every canopy. This stage can be regarded as data preprocessing.

Step 2. Apply a traditional clustering method (such as K-means) within each canopy. Similarity is not computed between objects that do not belong to the same canopy.
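The following is a minimal sketch of how Step 2 might skip distance computations, assuming the canopy membership of every point and every cluster center is already known from Step 1. The function name assign_within_canopies and the sample points, centers, and canopy labels are hypothetical, chosen only for illustration. The code shows a single assignment pass of a K-means-style method, not a complete implementation.

```python
import math

def assign_within_canopies(points, centers, point_canopies, center_canopies):
    """Assign each point to its nearest center, skipping centers
    that share no canopy with the point."""
    assignment = {}
    comparisons = 0
    for i, p in enumerate(points):
        best, best_d = None, math.inf
        for j, c in enumerate(centers):
            if not (point_canopies[i] & center_canopies[j]):
                continue  # no common canopy: the distance is never computed
            comparisons += 1
            d = math.dist(p, c)
            if d < best_d:
                best, best_d = j, d
        assignment[i] = best
    return assignment, comparisons

# Hypothetical data: canopy 0 covers the first three points,
# canopy 1 covers the last three, and the middle point lies in both.
points = [(0, 0), (1, 0), (4, 0), (9, 0), (10, 0)]
point_canopies = [{0}, {0}, {0, 1}, {1}, {1}]
centers = [(0.5, 0), (9.5, 0)]
center_canopies = [{0}, {1}]

assignment, comparisons = assign_within_canopies(
    points, centers, point_canopies, center_canopies
)
print(assignment)
print(comparisons, "distance computations instead of", len(points) * len(centers))
```

In this toy data only the point that lies in both canopies is compared against both centers, so the pass needs 6 distance computations rather than 10.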

This method has at least two advantages. First, if the canopies are not too large and do not overlap too much, the number of object pairs whose similarity must be computed in the second stage is greatly reduced. Second, methods such as K-means require the value of K to be specified manually; the number of canopies obtained in Step 1 can be used as the K value, which makes the choice of K somewhat less arbitrary.
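As an illustration of the second advantage, the sketch below takes K from the number of canopies and also uses the canopy centers as initial centers for a plain K-means loop. The seeding with canopy centers is an assumption made here, not something the text prescribes. The canopy centers are hard-coded as a hypothetical output of Step 1, and the kmeans function is a bare-bones version written for this example.

```python
import math

def kmeans(points, initial_centers, iterations=10):
    """Plain Lloyd-style K-means; K is simply len(initial_centers)."""
    centers = list(initial_centers)
    groups = []
    for _ in range(iterations):
        # Assignment step: each point goes to its nearest center.
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda j: math.dist(p, centers[j]))
            groups[nearest].append(p)
        # Update step: move each center to the mean of its group.
        new_centers = []
        for old, group in zip(centers, groups):
            if group:
                new_centers.append(
                    tuple(sum(axis) / len(group) for axis in zip(*group))
                )
            else:
                new_centers.append(old)  # keep the center of an empty group
        if new_centers == centers:
            break  # converged
        centers = new_centers
    return centers, groups

points = [(0, 0), (0, 2), (2, 0), (2, 2),
          (10, 10), (10, 12), (12, 10), (12, 12)]

# Hypothetical result of the coarse canopy stage (Step 1).
canopy_centers = [(0, 0), (10, 10)]

k = len(canopy_centers)  # K is taken from the number of canopies
final_centers, groups = kmeans(points, canopy_centers)
print(k, final_centers)
```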

Second, the accuracy of the cluster

For traditional clustering methods such as K-means, expectation-maximization, and greedy agglomerative clustering, the similarity of an object to a cluster is measured by the distance from the point to the center of the cluster. In that case, clustering accuracy is well preserved under the following condition:

For each cluster there exists a canopy that contains all the elements belonging to that cluster.

If similarity is instead measured by the distance from the current point to the nearest point of a cluster, clustering accuracy is well preserved under the following condition:

For each cluster there exist several canopies that are connected to one another through elements of the cluster (that is, the overlapping portions contain elements of the cluster).
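The two conditions can be expressed as simple set checks. The sketch below assumes clusters and canopies are given as Python sets of element identifiers; the function names and the sample data are hypothetical and serve only to make the conditions concrete.

```python
def covered_by_one_canopy(cluster, canopies):
    """First condition: some canopy contains every element of the cluster."""
    return any(cluster <= canopy for canopy in canopies)

def linked_by_canopies(cluster, canopies):
    """Second condition: the canopies that touch the cluster cover it and are
    chained together by overlaps that contain cluster elements."""
    # Keep only the part of each canopy that lies inside the cluster.
    parts = [canopy & cluster for canopy in canopies if canopy & cluster]
    if not parts or set().union(*parts) != cluster:
        return False
    reached = set(parts[0])
    pending = parts[1:]
    changed = True
    while changed:
        changed = False
        for part in list(pending):
            if part & reached:  # overlap contains a cluster element
                reached |= part
                pending.remove(part)
                changed = True
    return not pending

cluster = {1, 2, 3, 4}
chained = [{1, 2}, {2, 3}, {3, 4, 9}]   # overlaps at elements 2 and 3
split = [{1, 2}, {3, 4}]                # no overlap inside the cluster
single = [{1, 2, 3, 4, 5}]              # one canopy holds the whole cluster

print(covered_by_one_canopy(cluster, single))   # True
print(covered_by_one_canopy(cluster, chained))  # False
print(linked_by_canopies(cluster, chained))     # True
print(linked_by_canopies(cluster, split))       # False
```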

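After the canopy partition of the dataset is complete, it resembles the following:

[Figure: data points grouped into overlapping canopies; each canopy is drawn with a solid circle for T1 and a dashed circle for T2.]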

Third, the canopy algorithm flow

(1) Vectorize the data set and load the resulting list into memory. Select two distance thresholds, T1 and T2, with T1 > T2 (T1 corresponds to the solid circle and T2 to the dashed circle). The values of T1 and T2 can be determined by cross-validation.

(2) Take any point P from the list and use a computationally cheap method to quickly calculate the distance between P and every existing canopy (if no canopy exists yet, P becomes a canopy). If the distance between P and a canopy is within T1, add P to that canopy.

(3) If the distance between P and a canopy is also within T2, remove P from the list. At this point P is considered close enough to that canopy that it can no longer serve as the center of another canopy.

(4) Repeat steps 2 and 3 until the list is empty.
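The steps above can be sketched in Python as follows. This is one possible reading, not a definitive implementation. Two points are assumptions that the steps leave open: the distance to a canopy is taken to be the distance to its center, and a point that is not within T2 of any existing canopy becomes the center of a new canopy (which is also what guarantees that the list eventually empties). Euclidean distance stands in for the cheap distance measure, and the function names and sample points are hypothetical.

```python
import math

def euclidean(a, b):
    """Stand-in for the cheap distance measure; any inexpensive metric would do."""
    return math.dist(a, b)

def canopy_clustering(points, t1, t2, distance=euclidean):
    """Return a list of canopies, each a dict with a 'center' and its 'members'."""
    if t1 <= t2:
        raise ValueError("T1 must be greater than T2")
    canopies = []
    remaining = list(points)           # step 1: the in-memory list
    while remaining:                   # step 4: repeat until the list is empty
        p = remaining.pop(0)           # step 2: take a point P from the list
        strongly_bound = False
        for canopy in canopies:
            d = distance(p, canopy["center"])
            if d <= t1:                # within T1: P joins this canopy
                canopy["members"].append(p)
            if d <= t2:                # step 3: within T2, P cannot be a center
                strongly_bound = True
        if not strongly_bound:         # assumption: P then seeds a new canopy
            canopies.append({"center": p, "members": [p]})
    return canopies

points = [(0, 0), (1, 0), (0.5, 0), (5, 5), (5.5, 5), (2, 0)]
canopies = canopy_clustering(points, t1=3.0, t2=1.5)
for canopy in canopies:
    print(canopy["center"], canopy["members"])
```

With these sample values the point (2, 0) lies within T1 but outside T2 of the first canopy, so it is added to that canopy and also becomes the center of a new one, which illustrates how canopies can overlap.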
