First, the concept
Unlike traditional clustering algorithms such as K-means, the greatest feature of Canopy clustering is that it does not require the K value (the number of clusters) to be specified beforehand, which gives it great practical value. Compared with other clustering algorithms its accuracy is lower, but it has a large advantage in speed, so Canopy clustering can first be used to "coarsely" cluster the data and obtain a K value, after which K-means can be used for further "fine" clustering. This Canopy + K-means hybrid clustering approach consists of the following two steps:
Step 1: The most computationally expensive part of clustering is computing the similarity between objects. In its first stage, Canopy clustering uses a simple, computationally cheap method to compute similarity and places similar objects into subsets called canopies. Canopies may overlap one another, but there is no case in which an object belongs to no canopy at all. This stage can be regarded as data preprocessing.
Step 2: Run a traditional clustering method (such as K-means) within each canopy; similarity is never computed between objects that do not belong to the same canopy.
This method has at least two advantages. First, as long as the canopies are not too large and do not overlap too much, the number of object pairs whose similarity must be computed later is greatly reduced. Second, clustering methods like K-means require the value of K to be specified manually; the number of canopies obtained in Stage 1 can serve as the K value, which reduces the blindness of choosing K to some extent.
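The coarse-then-fine pipeline above can be sketched as follows. This is a minimal illustration under my own assumptions, not a production implementation: `canopy_count` is a simplified first pass that only counts canopies (only the tight threshold T2 affects how many canopies form), and `kmeans` is a plain Lloyd's iteration standing in for the "fine" stage; all function names and thresholds are this sketch's, not the original text's.

```python
import math
import random

def canopy_count(points, t2):
    """Coarse pass: count the canopies that would form.

    Repeatedly pick an arbitrary point as a canopy center and drop
    every remaining point closer than t2 to it; the number of picks
    is the number of canopies, which we reuse as K.
    """
    remaining = list(points)
    k = 0
    while remaining:
        center = remaining.pop(random.randrange(len(remaining)))
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
        k += 1
    return k

def kmeans(points, k, iters=20):
    """Fine pass: plain Lloyd's K-means with the K found above."""
    centers = random.sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        buckets = [[] for _ in range(k)]
        for p in points:
            i = min(range(k), key=lambda j: math.dist(p, centers[j]))
            buckets[i].append(p)
        # Move each center to the mean of its bucket (keep it if empty).
        centers = [
            tuple(sum(c) / len(b) for c in zip(*b)) if b else centers[i]
            for i, b in enumerate(buckets)
        ]
    return centers

# Two tight clusters far apart: the coarse pass finds K = 2,
# which is then handed to the fine pass.
pts = [(0, 0), (0.1, 0), (0, 0.1), (10, 10), (10.1, 10), (10, 10.1)]
k = canopy_count(pts, t2=1.0)
final_centers = kmeans(pts, k)
```

Note that this simplified pipeline runs K-means over the whole dataset and only borrows K from the canopy pass; the full method described above additionally restricts similarity computations to points sharing a canopy.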
Second, clustering accuracy
For traditional clustering methods such as K-means, expectation-maximization, and greedy agglomerative clustering, the similarity between an object and a cluster is the distance from that point to the cluster center. In this case, clustering accuracy is well preserved under the following condition:
For each cluster there is a canopy that contains all the elements belonging to that cluster.
If similarity is instead measured as the distance from the current point to the nearest point of a cluster, then clustering accuracy is well preserved under this condition:
Each cluster has several canopies, and these canopies are connected by elements of the cluster (the overlapping portions contain elements from the cluster).
After the canopy partitioning of the dataset is complete, the result resembles the following figure:
Third, the Canopy algorithm flow
(1) Vectorize the dataset to obtain a list and load it into memory, and choose two distance thresholds T1 and T2, where T1 > T2 (in the figure above, the solid circles correspond to T1 and the dashed circles to T2). T1 and T2 can be determined by cross-validation.
(2) Take any point P from the list and use a fast, computationally cheap method to compute the distance between P and every existing canopy (if no canopy exists yet, P itself becomes a canopy). If the distance between P and some canopy is within T1, add P to that canopy.
(3) If the distance between P and some canopy is within T2, remove P from the list. The meaning of this step is that P is already close enough to this canopy that it cannot become the center of another canopy.
(4) Repeat steps (2) and (3) until the list is empty.
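The four steps above can be sketched in Python. This follows the common formulation of the algorithm in which the candidate center P is popped from the list and compared against the remaining points; the function name `canopy`, the 2-D tuple point format, and Euclidean distance are assumptions of this sketch rather than requirements of the algorithm.

```python
import math
import random

def canopy(points, t1, t2):
    """Group points into (possibly overlapping) canopies.

    t1 is the loose threshold, t2 the tight one (t1 > t2).
    Returns a list of (center, members) pairs.
    """
    assert t1 > t2, "the loose threshold T1 must exceed the tight threshold T2"
    remaining = list(points)
    canopies = []
    while remaining:
        # Step (2): pick an arbitrary point P as a new canopy center.
        center = remaining.pop(random.randrange(len(remaining)))
        members = [center]
        kept = []
        for p in remaining:
            d = math.dist(center, p)
            if d < t1:
                # Within the loose threshold: P's canopy gains this point.
                members.append(p)
            if d >= t2:
                # Step (3): only points outside the tight threshold stay
                # in the list; closer points can never seed a new canopy.
                kept.append(p)
        remaining = kept  # Step (4): loop until the list is empty.
        canopies.append((center, members))
    return canopies

# Two tight clusters far apart yield exactly two canopies.
pts = [(0, 0), (0.1, 0), (0, 0.1), (10, 10), (10.1, 10), (10, 10.1)]
result = canopy(pts, t1=3.0, t2=1.0)
```

Note that a point whose distance to a center falls between T2 and T1 is added to that canopy but stays in the list, which is exactly how overlapping canopies arise.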