Adding clustering evaluation to Mahout

Source: Internet
Author: User
Introduction to clustering algorithms and the Silhouette cluster evaluation
Introduction to Clustering algorithms

Clustering is a type of unsupervised learning that divides a set of data into several classes so that the data within each class are as similar as possible while the differences between classes are as large as possible. Clustering can provide a reference for sample selection or root cause analysis, or serve as a preprocessing step for other algorithms. The most classic clustering algorithm is K-means. Its basic idea is as follows. Suppose we want to divide a set of data into N classes, with each sample treated as a vector:
1. Randomly select N samples from the data as the initial center points of the N classes; these are called centroids.
2. For every sample in the data, compute its distance to each of the N centroids and assign the sample to the class of the nearest centroid.
3. Within each class, recompute the centroid: if the class contains k samples, the new centroid is the mean of those k sample vectors.
4. Repeat steps 2 and 3 until the change in the centroids is less than a preset threshold.
Mahout is an open-source machine learning library that provides algorithms for recommendation, clustering, classification, logistic regression, and more. Because it builds on Hadoop's large-scale data processing capabilities, each algorithm can be conveniently deployed as a standalone job on the Hadoop platform, and it has become increasingly widely used. For clustering, Mahout provides K-means, LDA, Canopy, and many other algorithms.
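The K-means steps described above can be sketched in a few lines of Python. This is a minimal single-machine illustration, not Mahout's implementation; the function name kmeans, the parameters tol, max_iter, and seed, the handling of empty clusters, and the sample points are all assumptions made for the example.

```python
import math
import random


def kmeans(points, n_clusters, tol=1e-6, max_iter=100, seed=0):
    """Cluster `points` (a list of equal-length tuples) into `n_clusters` groups."""
    rng = random.Random(seed)
    # Step 1: pick n_clusters distinct samples as the initial centroids.
    centroids = rng.sample(points, n_clusters)
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Step 2: assign every sample to its nearest centroid.
        labels = [
            min(range(n_clusters), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Step 3: move each centroid to the mean of its members.
        new_centroids = []
        for c in range(n_clusters):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if not members:
                # Assumption: an empty cluster keeps its previous centroid.
                new_centroids.append(centroids[c])
                continue
            dim = len(members[0])
            new_centroids.append(
                tuple(sum(m[d] for m in members) / len(members) for d in range(dim))
            )
        # Step 4: stop once no centroid has moved by tol or more.
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < tol:
            break
    return centroids, labels


# Hypothetical data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, n_clusters=2)
```

On this data the first three points end up in one class and the last three in the other. Which class receives which label number depends on the random initial centroids.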
Introduction to the Silhouette clustering evaluation algorithm
K-means requires us to decide in advance how many classes to cluster the data into. In practice this number cannot be known beforehand; we can only try 2 classes, 3 classes, ..., n classes in turn and evaluate the clustering result each time. In fact, because clustering is unsupervised, the result needs to be evaluated whichever algorithm is used. Some evaluation methods rely on external data; others are simple evaluations based on the clustering itself. The basic idea of the latter is that data points within the same class should be as close together as possible, while data points in different classes should be as far apart as possible. The former is called the cohesion factor (cohesion) and the latter the separation factor (separation). Combining the two gives the Silhouette factor for evaluating a clustering.
First, the clustering quality of a single point is evaluated as follows:
a = the average distance from the point to the other points in the same cluster
b = min(the average distance from the point to the points in each other cluster)
Silhouette factor s = 1 - a/b (if a < b) or b/a - 1 (if a >= b)
The overall clustering quality is the average of the Silhouette factors of all points. The value lies between -1 and 1, and the larger the value, the better the clustering.
Figure 1. Schematic of the Silhouette cohesion and separation factors
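The definition of the Silhouette factor can be written as a short Python sketch. It follows the piecewise formula given in the text and is only an illustration: the function name silhouette and the sample points are hypothetical, and scoring a point that is alone in its cluster as 0 is an assumption (a common convention, not something the text specifies). At least two clusters are required.

```python
import math


def silhouette(points, labels):
    """Average Silhouette factor over all points, using the piecewise formula."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            # Assumption: a point alone in its cluster scores 0.
            scores.append(0.0)
            continue
        # a: average distance to the other points of the same cluster
        # (the distance from p to itself is 0, so it does not affect the sum).
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest average distance to the points of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        if a < b:
            scores.append(1 - a / b)
        elif a > b:
            scores.append(b / a - 1)
        else:
            scores.append(0.0)
    return sum(scores) / len(scores)


# Hypothetical data: two tight pairs of points, far apart from each other.
points = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
good = silhouette(points, [0, 0, 1, 1])  # each pair is its own cluster
bad = silhouette(points, [0, 1, 0, 1])   # clusters cut across the pairs
```

With this data the natural grouping scores about 0.80, while the grouping that cuts across the pairs scores below zero, which matches the rule that a larger value means a better clustering.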



Take Figure 1 as an example. Figure 1 shows 9 points grouped into three clusters, each drawn as a circle, with the yellow marks representing the centroids. To evaluate the clustering of the dark blue point in Figure 1, the cohesion factor a is the average distance from that point to the other three points in its circle. The separation factor b is somewhat more involved: we compute the average distance from the point to the three points in the upper-right circle, denoted b1, and the average distance from the point to the two points in the lower-right circle, denoted b2; the smaller of b1 and b2 is b.
IBM's SPSS Clementine also implements the Silhouette evaluation algorithm, but IBM provides a simplified version in which the average distance from a point to the points of a class is replaced by the distance from the point to that class's centroid. Specifically:
Figure 2. IBM's simplified implementation of the cohesion and separation factors





The same example of 9 points clustered into 3 classes illustrates this. IBM's implementation simplifies a to the distance from the dark blue point to the centroid of its own class. To compute b, we still compute b1 and b2 first and then take the minimum, but b1 is simplified to the distance to the centroid of the upper-right circle, and b2 to the distance to the centroid of the lower-right circle. In the following, we use IBM's simplified formula to add clustering evaluation to Mahout.
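The simplified formula can be sketched in Python as follows. This is an illustration of the centroid-based variant described above, not IBM's or Mahout's code. The function names and the coordinates are hypothetical; the points are merely laid out like the example (four points in one cluster, three in the upper right, two in the lower right), and each centroid is taken to be the mean of its cluster.

```python
import math


def centroid(members):
    """Mean of a list of equal-length coordinate tuples."""
    dim = len(members[0])
    return tuple(sum(m[d] for m in members) / len(members) for d in range(dim))


def simplified_scores(points, labels):
    """Per-point Silhouette factors using centroid distances for a and b."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    centroids = {lab: centroid(members) for lab, members in clusters.items()}

    scores = []
    for p, lab in zip(points, labels):
        # a: distance to the centroid of the point's own cluster.
        a = math.dist(p, centroids[lab])
        # b: distance to the nearest centroid of any other cluster.
        b = min(math.dist(p, c) for other, c in centroids.items() if other != lab)
        if a < b:
            scores.append(1 - a / b)
        elif a > b:
            scores.append(b / a - 1)
        else:
            scores.append(0.0)
    return scores


# Hypothetical coordinates for the 9-point, 3-cluster layout.
points = [(0.0, 0.0), (0.0, 2.0), (2.0, 0.0), (2.0, 2.0),  # cluster 0
          (8.0, 8.0), (8.0, 10.0), (10.0, 9.0),            # cluster 1, upper right
          (9.0, 0.0), (9.0, 2.0)]                          # cluster 2, lower right
labels = [0, 0, 0, 0, 1, 1, 1, 2, 2]
scores = simplified_scores(points, labels)
overall = sum(scores) / len(scores)
```

For the point (2, 2), a is the distance to its own centroid (1, 1) and b is the distance to the nearer foreign centroid (9, 1), so its factor is 1 - sqrt(2)/sqrt(50) = 0.8. Each point needs only one distance per cluster instead of one per point, which is the reason the simplification is attractive on large data sets.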


Analysis of the Mahout clustering process

Introduction to the Mahout runtime environment
As mentioned earlier, Mahout depends on the Hadoop environment: every algorithm or utility runs as a separate Hadoop job, so a working Hadoop environment must be in place first (at least this is true of Mahout 0.9, the version used at the time of writing). Installing and configuring Hadoop is not covered in this article; please refer to the Hadoop website. Note that the Hadoop version used in this article is 2.2.0. After installing Hadoop, download and extract mahout-distribution-0.9. The important contents are:
bin/: the directory containing the mahout executable script
mahout-examples-0.9-job.jar: the implementation classes of the various algorithms
examples/: the source code of the various algorithm implementations
conf/: the configuration files of the implementation classes. The important one is driver.classes.default.props: if you add a class that implements a new algorithm, you can add a configuration entry to this file so that the class can be invoked through the mahout startup script.

Running the mahout script on its own prints an overview of the functions it implements. For example, run:
/data01/shanlei/src/mahout-distribution-0.9/bin/mahout
