Introduction to the Canopy Algorithm in Mahout

Last Update: 2018-12-03 Source: Internet

Author: User


The biggest advantage of the K-means clustering algorithm is that its principle is simple and it is relatively easy to implement. It also runs efficiently and scales reasonably well to large data volumes. Its disadvantages are just as clear. First, users must fix the number of clusters K before clustering starts. For most problems K is not known in advance, and a good value generally has to be found through repeated experiments. Second, because the algorithm chooses its initial cluster centers at random, it tolerates noise and isolated points poorly. Noise here means erroneous data among the objects being clustered, while an isolated point is a data point that lies far from the rest of the data and has low similarity to it. If noise or isolated points happen to be chosen as initial cluster centers, the whole clustering process suffers. So how can we quickly find the number of clusters and their centers, and thereby greatly improve the efficiency of the K-means clustering algorithm? This article introduces another clustering method for that purpose: the canopy clustering algorithm.

The canopy clustering algorithm is a simple, fast, and accurate method for grouping objects into clusters. Each object is represented as a point in a multidimensional feature space. The algorithm uses a fast approximate distance measure and two distance thresholds, T1 > T2. The basic procedure starts with a set of points and removes one at random. It creates a canopy containing that point and then iterates over the remaining points. For each point, if its distance from the first point is less than T1, the point is added to the canopy. If the distance is also less than T2, the point is removed from the set, so points very close to the canopy's original point skip all further processing. The algorithm loops until the initial set is empty, accumulating a set of canopies, each containing one or more points. A given point may belong to more than one canopy.
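The procedure described above can be sketched in a few lines of Python. This is only an illustrative sketch, not Mahout's implementation: the function name `canopy_clusters`, the use of Euclidean distance (`math.dist`) as the distance measure, and the sample points are all assumptions made for the example.

```python
import math
import random


def canopy_clusters(points, t1, t2, distance=math.dist, seed=0):
    """Group points into possibly overlapping canopies using thresholds t1 > t2."""
    if t1 <= t2:
        raise ValueError("expected t1 > t2")
    rng = random.Random(seed)
    remaining = list(points)
    canopies = []
    while remaining:
        # Remove one point at random; it founds a new canopy.
        center = remaining.pop(rng.randrange(len(remaining)))
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                # Loosely close: the point joins this canopy.
                members.append(p)
            if d >= t2:
                # Not tightly close: the point stays in the set and may
                # found or join other canopies later.
                still_remaining.append(p)
        remaining = still_remaining
        canopies.append((center, members))
    return canopies


# Made-up sample data: two tight groups and one point in between.
points = [(0.0, 0.0), (0.5, 0.0), (0.0, 0.4),
          (10.0, 10.0), (10.3, 9.8), (5.0, 5.0)]
canopies = canopy_clusters(points, t1=8.0, t2=2.0)
for center, members in canopies:
    print(center, len(members))
```

Because a point within T1 but not within T2 stays in the set, it can appear in several canopies, which is the overlap mentioned above.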

Canopy clustering is often used as an initial step for more rigorous clustering techniques, such as K-means clustering. Starting from an initial canopy clustering, the number of more expensive distance measurements can be reduced significantly by ignoring points outside the initial canopies.

The canopy clustering algorithm is often used as a preprocessing step for the K-means clustering algorithm, to find a suitable K value and initial cluster centers.

The biggest problem with K-means is that users must specify K in advance. K is generally chosen from experience and from the results of repeated experiments, and a K value that works for one dataset cannot be carried over to another. In addition, K-means is sensitive to "noise" and isolated points: a small amount of such data can have a large effect on the mean.
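The effect of an isolated point on the mean is easy to see with a small calculation. The values below are made up for illustration; the variable names are hypothetical.

```python
from statistics import mean

# Five values that form a tight cluster around 1.0 (made-up data).
cluster = [1.0, 1.2, 0.9, 1.1, 1.0]

# The same cluster with a single isolated point added.
with_outlier = cluster + [50.0]

center_clean = mean(cluster)
center_noisy = mean(with_outlier)

print("mean without the isolated point:", center_clean)
print("mean with the isolated point:   ", center_noisy)
```

One isolated value out of six moves the mean from roughly 1 to roughly 9, far from every point that actually belongs to the cluster. Since K-means recomputes each cluster center as a mean, it inherits this sensitivity.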

Introducing the canopy clustering algorithm effectively solves the problem of choosing the initial K. In org.apache.mahout.clustering.syntheticcontrol.kmeans.Job, we can see that the CanopyDriver.run method is called first to select K and the centers, and then the KMeansDriver.run method is called to perform K-means clustering.
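The two-stage idea (canopy first, then K-means) can be sketched in plain Python as follows. This is not Mahout's API or its implementation. The helper names `canopy_centers`, `centroid`, and `kmeans` are hypothetical, Euclidean distance is assumed, each canopy center is taken here to be the mean of the points within T1 of the canopy's original point, and the data is made up.

```python
import math
import random


def centroid(pts):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(c) / len(pts) for c in zip(*pts))


def canopy_centers(points, t1, t2, seed=0):
    """Stage 1: one center per canopy; the number of centers becomes K."""
    rng = random.Random(seed)
    remaining = list(points)
    centers = []
    while remaining:
        origin = remaining.pop(rng.randrange(len(remaining)))
        members = [origin] + [p for p in remaining if math.dist(origin, p) < t1]
        centers.append(centroid(members))
        # Points within t2 of the origin are dropped from further processing.
        remaining = [p for p in remaining if math.dist(origin, p) >= t2]
    return centers


def kmeans(points, centers, iterations=10):
    """Stage 2: plain K-means iterations starting from the given centers."""
    for _ in range(iterations):
        groups = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: math.dist(p, centers[i]))
            groups[nearest].append(p)
        # Keep the previous center if a cluster ends up empty.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return centers, groups


# Made-up data: two well-separated unit squares.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0),
          (10.0, 10.0), (11.0, 10.0), (10.0, 11.0), (11.0, 11.0)]

initial_centers = canopy_centers(points, t1=3.0, t2=2.0)
k = len(initial_centers)
final_centers, groups = kmeans(points, initial_centers)
print("K chosen by the canopy stage:", k)
print("final centers:", sorted(final_centers))
```

In this sketch K is never supplied by the user: it falls out of the canopy stage, which also provides starting centers that are not purely random.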

However, introducing the canopy algorithm brings a new problem: choosing the thresholds T1 and T2 is itself a major difficulty. I could not find relevant guidance on the Internet, which is a headache...
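The sensitivity to the thresholds can at least be explored empirically. The sketch below is an illustration on made-up data, not a recommended tuning procedure: it counts how many canopies the basic procedure produces for several T2 values. The function name `count_canopies` is hypothetical and Euclidean distance is assumed. In this simplified version only T2 determines the number of canopies, because T1 only controls which points join a canopy, not which points are removed from the set.

```python
import math
import random


def count_canopies(points, t2, seed=0):
    """Count the canopies the basic procedure creates for a given t2."""
    rng = random.Random(seed)
    remaining = list(points)
    count = 0
    while remaining:
        center = remaining.pop(rng.randrange(len(remaining)))
        # Points closer than t2 to the center are removed from the set.
        remaining = [p for p in remaining if math.dist(center, p) >= t2]
        count += 1
    return count


# Made-up data: three unit squares placed far apart (12 points in total).
corners = [(0, 0), (10, 0), (0, 10)]
points = [(x + dx, y + dy)
          for (x, y) in corners for dx in (0, 1) for dy in (0, 1)]

counts = {t2: count_canopies(points, t2) for t2 in (0.5, 2.0, 20.0)}
for t2, n in counts.items():
    print(f"T2 = {t2}: {n} canopies")
```

A T2 that is too small turns every point into its own canopy, while one that is too large merges everything into a single canopy, so the K handed to K-means depends heavily on this choice.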


