Analysis of the Mahout K-means Clustering Algorithm

Source: Internet
Author: User

I. K-means clustering algorithm principle

The K-means algorithm takes a parameter k and partitions the n input data objects into k clusters so that objects in the same cluster are highly similar, while objects in different clusters have low similarity. Similarity is measured against a "central object" (the center of gravity, or centroid) obtained by averaging the objects in each cluster.

The K-means algorithm is the most classic partition-based clustering method and one of the top 10 classic data mining algorithms. Its basic idea is to cluster around k center points in space, assigning each object to the center closest to it, and to update the cluster centers iteratively until the result stabilizes. The result is a local optimum that depends on the initial centers, not necessarily the best possible clustering.

Assume that the sample set is to be divided into C classes. The algorithm proceeds as follows:

(1) Choose suitable initial centers for the C classes;

(2) In the k-th iteration (here k counts iterations), compute the distance from each sample to each of the C centers, and assign the sample to the class whose center is nearest;

(3) Update each class center to the mean of the samples assigned to that class;

(4) If none of the C cluster centers changes after steps (2) and (3), the iteration ends; otherwise it continues.
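
The four steps above can be sketched in a few lines of plain Python. This is only an in-memory illustration under some assumptions: the distance is Euclidean, an empty class keeps its old center, and the names kmeans, euclidean and max_iter are made up for this example rather than taken from Mahout.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(samples, centers, max_iter=100):
    """Run steps (2)-(4) starting from the given initial centers (step 1)."""
    centers = [tuple(c) for c in centers]
    for _ in range(max_iter):
        # Step (2): assign every sample to the class with the nearest center.
        groups = [[] for _ in centers]
        for s in samples:
            nearest = min(range(len(centers)),
                          key=lambda i: euclidean(s, centers[i]))
            groups[nearest].append(s)
        # Step (3): move each center to the mean of its assigned samples.
        new_centers = []
        for old, members in zip(centers, groups):
            if members:
                new_centers.append(tuple(
                    sum(m[d] for m in members) / len(members)
                    for d in range(len(old))))
            else:
                # Assumption: a class with no samples keeps its old center.
                new_centers.append(old)
        # Step (4): stop as soon as no center has changed.
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

For example, kmeans([(1.0, 1.0), (1.0, 2.0), (8.0, 8.0), (9.0, 8.0)], [(0, 0), (10, 10)]) settles on one center per visible group of points after two passes.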

The biggest advantage of this algorithm is its simplicity and speed. The keys to the algorithm are the choice of initial centers and the distance formula.

II. Implementation of Mahout K-means clustering
(1) The input parameter specifies all data points to be clustered, and the clusters parameter specifies the initial cluster centers.
If the parameter k is specified, k points are read at random from the specified input file through org.apache.hadoop.fs and placed in clusters.
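
As a rough illustration of picking k random starting points from a file that is read sequentially, the sketch below uses reservoir sampling. That choice is an assumption made for this example; the text above only says that k points are read at random, and the function name pick_initial_centers and its seed parameter are hypothetical.

```python
import random


def pick_initial_centers(points, k, seed=None):
    """Pick k points uniformly at random from an iterable in a single pass."""
    rng = random.Random(seed)
    reservoir = []
    for i, point in enumerate(points):
        if i < k:
            # The first k points fill the reservoir.
            reservoir.append(point)
        else:
            # Point i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = point
    return reservoir
```

Because the input is consumed once and only k points are held in memory, the same approach works for a large file streamed record by record. If the input holds fewer than k points, all of them are returned.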

(2) Compute the cluster centers of the current iteration from the original data points and the cluster centers of the previous iteration (or the initial centers), and write them to the clusters-N directory.
This process is implemented by KMeansMapper, KMeansCombiner, KMeansReducer and KMeansDriver.

KMeansMapper: during mapper initialization, configure reads the cluster centers produced by the previous iteration, or the initial centers (every mapper reads in all cluster centers).
The map method finds the cluster closest to each input point and assigns the point to that cluster.
The output key is the ID of the cluster the point belongs to, and the value is a KMeansInfo instance holding the number of points and the per-component sum of their coordinates.
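
The map step can be pictured with the following plain-Python sketch. It is not Mahout's Java code: the function names are hypothetical, the (count, sum) tuple merely stands in for KMeansInfo, and squared Euclidean distance is assumed as the distance formula.

```python
def squared_distance(a, b):
    """Squared Euclidean distance; it ranks centers the same way as the true distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans_map(point, centers):
    """Emit (cluster_id, (count, component_sums)) for a single input point.

    centers maps cluster ID -> center; every mapper holds all of them.
    """
    cluster_id = min(centers, key=lambda cid: squared_distance(point, centers[cid]))
    # A single point contributes a count of 1 and its own coordinates as the sum.
    return cluster_id, (1, tuple(point))
```

Emitting a count and a sum instead of the raw point is what lets later stages add partial results together without looking at individual points again.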

KMeansCombiner: locally accumulates the point counts and the per-component sums emitted by KMeansMapper under the same cluster ID.
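
A minimal sketch of this local accumulation, again in plain Python with hypothetical names and with (count, sum) tuples standing in for KMeansInfo:

```python
def kmeans_combine(mapped):
    """Merge (cluster_id, (count, component_sums)) pairs that share a cluster ID."""
    partial = {}
    for cluster_id, (count, sums) in mapped:
        if cluster_id in partial:
            old_count, old_sums = partial[cluster_id]
            partial[cluster_id] = (
                old_count + count,
                tuple(a + b for a, b in zip(old_sums, sums)),
            )
        else:
            partial[cluster_id] = (count, tuple(sums))
    return partial
```

The output has the same shape as the input values, so the merge can be applied any number of times before the reducer sees the data. This shrinks what each mapper sends over the network to one record per cluster.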

KMeansReducer: accumulates the point counts and per-component sums under the same cluster ID and divides the sums by the count to obtain the cluster center for this iteration.
It then decides whether the cluster has converged using the input delta: the cluster has converged if the distance between its center in the previous iteration and its center in the current iteration is less than delta.
It outputs each cluster center together with a flag indicating whether the cluster has converged.
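
The reduce step and its convergence test could look roughly like this in plain Python. The function name and signature are hypothetical, and Euclidean distance is assumed for the comparison against delta.

```python
import math


def kmeans_reduce(cluster_id, partials, old_center, delta):
    """Combine partial (count, component_sums) records into a new center.

    Returns (cluster_id, new_center, converged). The reducer is only called
    for cluster IDs that received at least one point, so the total is > 0.
    """
    total = 0
    sums = [0.0] * len(old_center)
    for count, vec in partials:
        total += count
        for d, value in enumerate(vec):
            sums[d] += value
    new_center = tuple(s / total for s in sums)
    # Converged when the center moved less than delta since the last iteration.
    shift = math.sqrt(sum((a - b) ** 2 for a, b in zip(old_center, new_center)))
    return cluster_id, new_center, shift < delta
```

With partials [(2, (2.0, 3.0)), (1, (3.0, 3.0))] and an old center of (1.5, 2.0), the new center is roughly (1.667, 2.0); it counts as converged for delta = 0.5 but not for delta = 0.1.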

KMeansDriver: controls the iteration process, which runs until the maximum number of iterations is reached or all clusters have converged.
After each iteration, KMeansDriver reads all the clusters in that iteration's clusters-N directory.
If every cluster has converged, the whole K-means clustering process has converged.
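
The driver's control loop can be sketched as follows. Everything here runs in memory: run_iteration is a hypothetical stand-in for one map/combine/reduce job, and its return value stands in for what would be read back from the clusters-N directory. Euclidean distance is assumed.

```python
import math


def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def run_iteration(points, centers, delta):
    """One pass over the data; returns {cluster_id: (new_center, converged)}."""
    sums = {}
    for p in points:
        cid = min(centers, key=lambda c: distance(p, centers[c]))
        count, vec = sums.get(cid, (0, (0.0,) * len(p)))
        sums[cid] = (count + 1, tuple(a + b for a, b in zip(vec, p)))
    result = {}
    for cid, old in centers.items():
        if cid in sums:
            count, vec = sums[cid]
            new = tuple(v / count for v in vec)
        else:
            new = old  # assumption: a cluster with no points keeps its center
        result[cid] = (new, distance(old, new) < delta)
    return result


def drive(points, centers, delta, max_iterations):
    """Iterate until every cluster has converged or max_iterations is reached."""
    converged_counts = []
    for _ in range(max_iterations):
        result = run_iteration(points, centers, delta)
        centers = {cid: center for cid, (center, _) in result.items()}
        n_converged = sum(1 for _, done in result.values() if done)
        converged_counts.append(n_converged)
        if n_converged == len(result):
            break
    return centers, converged_counts
```

Returning the number of converged clusters per iteration alongside the final centers makes it easy to see how quickly a given delta is reached.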

Parameter Adjustment:

Mahout K-means clustering has two important parameters: the convergence delta and the maximum number of iterations.

The smaller the delta, the stricter the convergence condition, so the number of clusters that eventually converge may decrease.

The maximum number of iterations can be chosen by watching the number of converged clusters after each iteration: iteration can stop once that number has nearly stopped changing, or only fluctuates.
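
One way to turn this rule of thumb into a check is sketched below. The function has_plateaued and its patience and tolerance parameters are hypothetical names chosen for this example; the input is simply the list of converged-cluster counts recorded after each iteration.

```python
def has_plateaued(converged_counts, patience=3, tolerance=0):
    """True when the last `patience` counts differ by at most `tolerance`."""
    if len(converged_counts) < patience:
        return False
    tail = converged_counts[-patience:]
    return max(tail) - min(tail) <= tolerance
```

A tolerance of 0 demands that the count stop changing entirely, while a small positive tolerance accepts minor fluctuation. The iteration at which the check first succeeds is a reasonable candidate for the maximum number of iterations.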

 
