Introduction to the Canopy Algorithm in Mahout

Source: Internet
Author: User

The biggest advantages of the K-means clustering algorithm are that its principle is simple and it is relatively easy to implement. It also runs efficiently and scales reasonably well to large data volumes. Its disadvantages are just as clear. First, users must fix the number of clusters K before clustering starts. For most problems K cannot be known in advance, and a good value generally has to be found through repeated experiments. Second, because the algorithm selects the initial cluster centers randomly, it tolerates noise and isolated points poorly. Noise here means erroneous data among the objects to be clustered, while an isolated point is a data point that lies far from the other data and has low similarity to it. If isolated points or noise happen to be selected as initial cluster centers, the whole clustering process suffers.

How can we quickly find the number of clusters and the cluster centers, and thereby greatly improve the efficiency of the K-means clustering algorithm? Next we introduce another clustering method: the canopy clustering algorithm.

The canopy clustering algorithm is a simple, fast, and accurate method for grouping objects into clusters. Each object is represented by a point in a multi-dimensional feature space. The algorithm uses a fast approximate distance measure and two distance thresholds, T1 > T2. The basic algorithm starts with a set of points and removes one at random. It creates a canopy containing this point and iterates over the remaining points. For each point, if its distance from the first point is less than T1, the point is added to the canopy. If the distance is also less than T2, the point is removed from the set, so that points very close to the canopy's first point avoid all further processing. The algorithm loops until the initial set is empty, accumulating a set of canopies, each containing one or more points. A given point may belong to more than one canopy.
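
The steps just described can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Mahout's implementation: the function names `euclidean` and `canopy_clusters` are hypothetical, ordinary Euclidean distance stands in for the "fast approximate" distance measure, and a fixed random seed is used so that runs are repeatable.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance; a stand-in (assumption) for the cheap approximate metric."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def canopy_clusters(points, t1, t2, distance=euclidean, seed=0):
    """Return a list of (center, members) canopies. Requires t1 > t2."""
    if t1 <= t2:
        raise ValueError("t1 must be greater than t2")
    rng = random.Random(seed)
    remaining = list(points)
    canopies = []
    while remaining:
        # Remove a random point and make it the first point of a new canopy.
        center = remaining.pop(rng.randrange(len(remaining)))
        members = [center]
        still_remaining = []
        for p in remaining:
            d = distance(center, p)
            if d < t1:
                members.append(p)          # loosely inside this canopy
            if d >= t2:
                still_remaining.append(p)  # not tightly bound: stays in the set
        remaining = still_remaining
        canopies.append((center, members))
    return canopies

# Two well-separated groups of three points each.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
          (8.0, 8.0), (8.1, 7.9), (7.9, 8.2)]
canopies = canopy_clusters(points, t1=3.0, t2=1.5)
for center, members in canopies:
    print(center, len(members))
```

Because every point within a group lies closer than T2 to any other point of that group, and the two groups are farther apart than T1, this toy data yields one canopy per group whichever points are drawn first.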

Canopy clustering is often used as an initial step for more rigorous clustering techniques, such as K-means clustering. Starting from an initial clustering significantly reduces the number of more expensive distance measurements, because points outside the initial canopies can be ignored.

The canopy clustering algorithm is often run as a preprocessing step for K-means, to find a suitable K value and initial cluster centers.

The biggest problem with K-means is that users must supply K in advance. K is generally chosen from experience and the results of multiple experiments, and a value that suits one dataset offers no guidance for another. In addition, K-means is sensitive to "noise" and isolated points: a small amount of such data can have a large impact on the mean values.
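
A small arithmetic illustration of that sensitivity, using made-up one-dimensional values (the numbers are assumptions chosen only for the example): a single isolated point drags the mean of a tight cluster far away from where most of the data sits.

```python
def mean(values):
    """Arithmetic mean, which is what K-means uses as a cluster center."""
    return sum(values) / len(values)

cluster = [1.0, 1.1, 0.9, 1.0]      # a tight group around 1.0
with_outlier = cluster + [50.0]     # the same group plus one isolated point

clean_center = mean(cluster)
noisy_center = mean(with_outlier)
print(clean_center, noisy_center)
```

The mean of the four clean values is about 1.0, while adding the single value 50.0 moves it to about 10.8, a position where none of the actual data lies.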

Introducing the canopy clustering algorithm effectively solves the problem of choosing the initial K. In the Job class of org.apache.mahout.clustering.syntheticcontrol.kmeans, the CanopyDriver.run method is called first to select K and the centers, and then the KMeansDriver.run method is called to perform K-means clustering.
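
The hand-off from canopy to K-means can be sketched as follows. This is plain Python, not Mahout's CanopyDriver or KMeansDriver API: the function `kmeans_from_centers` is hypothetical, and the `canopy_centers` list is assumed to be the output of an earlier canopy pass. K is simply the number of canopy centers, and the standard K-means iterations start from those centers instead of from random points.

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans_from_centers(points, initial_centers, max_iter=10, tol=1e-6):
    """K-means where K and the starting centers come from a canopy pass."""
    centers = [tuple(c) for c in initial_centers]
    if not centers:
        raise ValueError("at least one initial center is required")
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point joins its nearest center.
        clusters = [[] for _ in centers]
        for p in points:
            nearest = min(range(len(centers)),
                          key=lambda i: euclidean(p, centers[i]))
            clusters[nearest].append(p)
        # Update step: an empty cluster keeps its previous center.
        new_centers = [mean_point(c) if c else centers[i]
                       for i, c in enumerate(clusters)]
        shift = max(euclidean(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:
            break
    return centers, clusters

points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1),
          (8.0, 8.0), (8.1, 7.9), (7.9, 8.2)]
canopy_centers = [(1.0, 1.0), (8.0, 8.0)]  # assumed result of a canopy pass
centers, clusters = kmeans_from_centers(points, canopy_centers)
print(len(centers), centers)
```

Since the starting centers already sit inside the two groups, the iterations converge almost immediately, which is the efficiency gain the canopy step is meant to provide.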

However, introducing the canopy algorithm brings new problems of its own. For the canopy algorithm, choosing the thresholds T1 and T2 is another major issue. I could not find relevant information on the Internet, which is a headache...
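
One observation does follow directly from the basic algorithm: a point leaves the candidate set only when it lies within T2 of a canopy's first point, so T2 alone decides how many canopies (and therefore how large a K) are produced, while T1 only controls how much the canopies overlap. The sketch below is a toy experiment, not a tuning recipe: the function `count_canopies` is hypothetical and the one-dimensional data is made up for the example.

```python
import random

def count_canopies(values, t2, seed=0):
    """Count the canopies produced for 1-D data; T1 does not affect this count."""
    rng = random.Random(seed)
    remaining = list(values)
    count = 0
    while remaining:
        # Pick a random remaining value as the first point of a new canopy.
        center = remaining.pop(rng.randrange(len(remaining)))
        count += 1
        # Only values at least t2 away stay in the candidate set.
        remaining = [v for v in remaining if abs(v - center) >= t2]
    return count

values = [float(i) for i in range(10)]   # evenly spaced points 0.0 .. 9.0
counts = {t2: count_canopies(values, t2) for t2 in (0.5, 2.0, 100.0)}
print(counts)
```

With T2 smaller than the spacing between points, every point becomes its own canopy; with a very large T2, a single canopy swallows everything. Values in between give intermediate counts that also depend on the random order in which points are drawn, so trying several T2 values on a sample of the data is one practical way to see which K each setting implies.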

