Mahout clustering algorithm-kmeans Analysis

Last Update:2018-12-05 Source: Internet

Author: User


I. K-means clustering algorithm principle

The K-means algorithm takes a parameter K and partitions N input data objects into K clusters so that objects in the same cluster are highly similar, while objects in different clusters are dissimilar. Similarity is measured against a "central object" (the center of gravity, or centroid) obtained as the mean of the objects in each cluster.

The K-means algorithm is the most classic partition-based clustering method and one of the top 10 typical data mining algorithms. Its basic idea is to cluster around K points in the space, assigning each object to the point closest to it, and then to update the cluster centers iteratively until they stop changing. The result is a local optimum that depends on the starting centers, not necessarily the best possible clustering.

Assume that the sample set is to be divided into c classes (c plays the role of K above). The algorithm is as follows:

(1) Select initial centers for the c classes as appropriate;

(2) In the k-th iteration, compute the distance from each sample to each of the c centers and assign the sample to the class whose center is nearest;

(3) Update each class center to the mean of the samples assigned to that class;

(4) If none of the c cluster centers changes after steps (2) and (3), the iteration ends. Otherwise, the iteration continues.
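Steps (1) to (4) can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not Mahout code: the function name `kmeans`, the choice of random samples as initial centers, Euclidean distance, and the `max_iterations` safety cap are all assumptions that the text above does not fix.

```python
import math
import random


def kmeans(samples, c, max_iterations=100, seed=0):
    """Cluster `samples` (equal-length numeric tuples) into `c` classes."""
    rng = random.Random(seed)
    # Step (1): pick c of the samples as initial centers (an assumed strategy).
    centers = [tuple(s) for s in rng.sample(samples, c)]
    for _ in range(max_iterations):
        # Step (2): assign each sample to the class with the nearest center.
        members = [[] for _ in range(c)]
        for s in samples:
            nearest = min(range(c), key=lambda j: math.dist(s, centers[j]))
            members[nearest].append(s)
        # Step (3): move each center to the mean of its members.
        new_centers = []
        for j in range(c):
            if members[j]:
                n = len(members[j])
                new_centers.append(tuple(sum(col) / n for col in zip(*members[j])))
            else:
                new_centers.append(centers[j])  # empty class: keep the old center
        # Step (4): stop when no center changed.
        if new_centers == centers:
            break
        centers = new_centers
    return centers
```

For two well-separated groups of points, such as three points near (0, 0) and three near (10, 10), calling `kmeans(points, 2)` should return one center per group. Because the outcome depends on the initial centers, less clearly separated data can settle into different local optima for different seeds.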

The biggest advantage of this algorithm is its simplicity and speed. The key to the algorithm is the choice of initial centers and of the distance formula.

II. Implementation of Mahout k-means clustering
(1) The input parameter specifies all the data points to be clustered, and the clusters parameter specifies the initial cluster centers.
If the parameter k is specified, K points are read at random from the specified input file using org.apache.hadoop.fs and placed in clusters.

(2) Each iteration is computed from the original data points and the cluster centers of the previous iteration (or the initial cluster centers), and its result is written to the clusters-N directory.
This process is implemented by KMeansMapper, KMeansCombiner, KMeansReducer, and KMeansDriver.

KMeansMapper: during mapper initialization, in configure, reads the cluster centers produced by the previous iteration, or the initial cluster centers (every mapper reads all the cluster centers).
The map method finds the cluster closest to each input point and assigns the point to that cluster.
The output key is the ID of the point's cluster, and the value is a KMeansInfo instance, which holds the number of points and the sum of each component.

KMeansCombiner: for each cluster ID, accumulates the point counts and the per-component sums output by KMeansMapper.

KMeansReducer: for each cluster ID, accumulates the point counts and the per-component sums, and from them computes the cluster center for this iteration.
It decides whether the cluster has converged based on the input delta: the cluster has converged when the distance between its center in the previous iteration and its center in the current iteration is less than delta.
It outputs the cluster centers together with a flag indicating whether each cluster has converged.

KMeansDriver: controls the iteration process until the maximum number of iterations is exceeded or all clusters have converged.
After each iteration, KMeansDriver reads all the clusters in that iteration's clusters-N directory
to determine whether the entire k-means clustering process has converged.
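The division of work among the four classes can be imitated in a single Python process. The sketch below is not Mahout's Java implementation and uses no Hadoop API. The function names (`map_point`, `combine`, `reduce_cluster`, `run_iteration`, `drive`), the use of Euclidean distance, and the rule that an empty cluster keeps its old center are all assumptions made for illustration. A `(count, component sums)` tuple stands in for the KMeansInfo value described above.

```python
import math
from collections import defaultdict


def map_point(point, centers):
    """Mapper role: emit (nearest cluster ID, (count, component sums))."""
    cid = min(centers, key=lambda c: math.dist(point, centers[c]))
    return cid, (1, tuple(point))


def combine(infos):
    """Combiner role: add up counts and per-component sums for one cluster ID."""
    count = sum(n for n, _ in infos)
    sums = tuple(sum(col) for col in zip(*(s for _, s in infos)))
    return count, sums


def reduce_cluster(old_center, infos, delta):
    """Reducer role: new center = summed components / count, plus a converged flag."""
    count, sums = combine(infos)
    new_center = tuple(x / count for x in sums)
    return new_center, math.dist(old_center, new_center) < delta


def run_iteration(points, centers, delta):
    """One map/combine/reduce pass; `centers` maps cluster ID -> center tuple."""
    grouped = defaultdict(list)
    for p in points:
        cid, info = map_point(p, centers)
        grouped[cid].append(info)
    result = {}
    for cid, old in centers.items():
        if grouped[cid]:
            result[cid] = reduce_cluster(old, [combine(grouped[cid])], delta)
        else:
            result[cid] = (old, True)  # assumption: an empty cluster keeps its center
    return result


def drive(points, centers, delta, max_iterations):
    """Driver role: iterate until every cluster has converged or the cap is hit."""
    iterations = 0
    while iterations < max_iterations:
        result = run_iteration(points, centers, delta)
        centers = {cid: c for cid, (c, _) in result.items()}
        iterations += 1
        if all(flag for _, flag in result.values()):
            break
    return centers, iterations
```

Emitting sums and counts rather than means is what makes the combiner step safe: partial sums from different mappers can be added in any order, and the division happens only once, in the reducer.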

Parameter Adjustment:

Mahout k-means clustering has two important parameters: the convergence delta and the maximum number of iterations.

The smaller the delta, the stricter the convergence condition, so the number of clusters that eventually converge may decrease.

The maximum number of iterations can be chosen by observing the number of converged clusters after each iteration. Iteration can be stopped once that number has almost stopped changing or is merely fluctuating.
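As a rough illustration of that stopping rule, the helper below scans a list of converged-cluster counts, one per iteration, and reports the first iteration at which the count has stayed nearly flat. The function `plateau_iteration` and its `window` and `tolerance` parameters are hypothetical and are not Mahout options. The counts themselves would have to be collected from the per-iteration output.

```python
def plateau_iteration(converged_counts, window=3, tolerance=1):
    """Return the 1-based iteration at which the converged-cluster count has
    varied by at most `tolerance` over the last `window` iterations, or None
    if that never happens."""
    for end in range(window, len(converged_counts) + 1):
        recent = converged_counts[end - window:end]
        if max(recent) - min(recent) <= tolerance:
            return end
    return None
```

A value returned by this helper on a trial run could serve as a starting point for the maximum number of iterations on similar data.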

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
