MapReduce and K-means clustering

Google offers slides and presentations on advanced research topics online, including distributed systems. One of these presentations discusses MapReduce in the context of clustering algorithms.

One of the claims made in this particular presentation is that "It can be necessary to send tons of data to each mapper node. Depending on your bandwidth and memory available, this could be impossible." This claim is false, which in turn removes much of the motivation for the alternative algorithm, which is called "canopy clustering".

The K-means clustering algorithm

There are many good introductions to K-means clustering available, including our book Data Mining Techniques for Marketing, Sales, and Customer Support. The Google presentation mentioned above provides a very brief introduction.

Let's review the K-means clustering algorithm. Given a data set where all the columns are numeric, the algorithm for K-means clustering is basically the following:

(1) Start with K cluster centers (chosen randomly or according to some specific procedure).
(2) Assign each row in the data to its nearest cluster center.
(3) Recalculate the cluster centers as the "average" of the rows assigned in (2).
(4) Repeat until the cluster centers no longer change or some other stopping criterion has been met.
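
The four steps above can be sketched on a single machine as follows. This is only an illustration, assuming rows are tuples of numbers, Euclidean distance, and random initial centers drawn from the data; the names `nearest` and `kmeans` are made up for this example.

```python
import math
import random


def nearest(row, centers):
    """Return the index of the center closest to row (Euclidean distance)."""
    return min(range(len(centers)), key=lambda k: math.dist(row, centers[k]))


def kmeans(rows, k, max_iter=100, seed=0):
    """Plain single-machine K-means; returns (centers, assignments)."""
    rng = random.Random(seed)
    centers = rng.sample(rows, k)          # step 1: pick K rows as starting centers
    assign = []
    for _ in range(max_iter):
        # Step 2: assign each row to its nearest center.
        assign = [nearest(r, centers) for r in rows]
        # Step 3: recalculate each center as the average of its rows.
        new_centers = []
        for c in range(k):
            members = [r for r, a in zip(rows, assign) if a == c]
            if members:
                new_centers.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centers.append(centers[c])  # empty cluster keeps its old center
        # Step 4: stop when the centers no longer change.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, assign
```

For example, `kmeans([(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)], 2)` puts the first two rows in one cluster and the last two in the other.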

In the end, the K-means algorithm "colors" all the rows in the data set, so that similar rows have the same color.

K-means in a parallel world

To run this algorithm, it seems, at first, as though all the rows assigned to each cluster in step (2) need to be brought together to recalculate the cluster centers.

However, this is not true. K-means clustering is an example of an embarrassingly parallel algorithm, meaning that it is very well suited to parallel implementations. In fact, it is quite adaptable to both SQL and MapReduce, with efficient algorithms. By "efficient", I mean that large amounts of data do not need to be sent between processors and that the processors need only minimal communication. It is true that the entire data set does need to be read by the processors for each iteration of the algorithm, but each row only needs to be read by one processor.

A parallel version of the K-means algorithm was incorporated into the Darwin data mining package, developed by Thinking Machines Corporation in the early 1990s. I do not know if this was the first parallel implementation of the algorithm. Darwin was later purchased by Oracle, and became the basis for Oracle Data Mining.

How does the parallel version work? The data can be partitioned among multiple processors (or streams or threads). Each processor can read the previous iteration's cluster centers and assign the rows on the processor to clusters. Each processor then calculates new centers for its portion of the data. Each actual cluster center (for the data across all processors) is then the weighted average of the centers on each processor, with each processor's center weighted by the number of rows it assigned to that cluster.

In other words, the rows of data do not need to be combined globally. They can be combined locally, with the reduced set of results combined across all processors. In fact, MapReduce even contains a "combine" method for just this type of algorithm.
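
A small sketch may help show why local combination is enough. It handles the rows of a single cluster spread over several processors; the function names `local_center` and `weighted_center` are hypothetical, and rows are assumed to be tuples of numbers.

```python
def local_center(rows):
    """Return (count, mean vector) for the rows held by one processor."""
    n = len(rows)
    return n, tuple(sum(col) / n for col in zip(*rows))


def weighted_center(partials):
    """Combine (count, mean) pairs into one center, weighting each mean by its count."""
    total = sum(n for n, _ in partials)
    dims = len(partials[0][1])
    return tuple(
        sum(n * mean[d] for n, mean in partials) / total for d in range(dims))


# Rows of one cluster, split across two processors.
partitions = [[(1.0, 2.0), (3.0, 4.0)], [(5.0, 6.0)]]
partials = [local_center(p) for p in partitions]
center = weighted_center(partials)
```

Here `center` matches the average of all three rows taken together, even though only two small (count, mean) pairs crossed processor boundaries.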

All that remains is figuring out how to handle the cluster center information. Let us postulate a shared file that has the centroids as calculated for each processor. This file contains:

  • The iteration number.
  • The cluster ID.
  • The cluster coordinates.
  • The number of rows assigned to the cluster.

This is the centroid file. Each iteration through the algorithm adds another set of rows to this file. This is the only information that needs to be communicated globally.
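
As one possible layout for the centroid file, the sketch below assumes comma-separated lines of the form `iteration,cluster_id,row_count,coord_1,...,coord_n`. The column order and the helper name `latest_centers` are assumptions made for illustration; the only requirement from the text is that the four pieces of information are present.

```python
import csv
import io


def latest_centers(text):
    """Parse centroid records and return {cluster_id: coords} for the newest iteration."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        iteration, cluster_id, n_rows = int(row[0]), int(row[1]), int(row[2])
        coords = tuple(float(x) for x in row[3:])
        records.append((iteration, cluster_id, n_rows, coords))
    newest = max(rec[0] for rec in records)
    return {cid: coords for it, cid, _, coords in records if it == newest}


# Iteration 0 holds the initial centers; iteration 1 was appended afterwards.
centroid_file = (
    "0,0,0,1.0,1.0\n"
    "0,1,0,5.0,5.0\n"
    "1,0,3,1.5,1.2\n"
    "1,1,2,4.8,5.1\n"
)
centers = latest_centers(centroid_file)
```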

There are two ways to do this in the MapReduce framework. The first uses map, combine, and reduce. The second uses only map and reduce.

K-means using map, combine, reduce

Before beginning, a file accessible to all processors is created that contains the initial centers for all clusters. As the algorithm runs, this file accumulates the cluster centers for each iteration.

The map function reads this file to get the centers from the last finished iteration. It then reads the input rows (the data) and calculates the distance from each row to each center. For each row, it produces an output pair for the nearest center:

  • Key -- cluster ID;
  • Value -- coordinates of row.

Now, this is a lot of data, so we use a combine function to reduce the size before sending it to reduce. The combine function calculates the average of the coordinates for each cluster ID, along with the number of records. This is simple, and it produces one record of output for each cluster:

  • Key -- cluster ID;
  • Value -- number of records and average values of the coordinates.

The amount of data now is the number of clusters times the number of processors times the size of the information needed to define each cluster. This is small relative to the data size.

The reduce function (and one of these is probably sufficient for this problem, regardless of data size and the number of maps) calculates the weighted average of its input. Its output should be written to a file, and contain:

  • The iteration number;
  • The cluster ID;
  • The cluster center coordinates;
  • The size of the cluster.

The iteration process can then continue.
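
To make the three roles concrete, here is a sketch that simulates one iteration in plain Python rather than in a real MapReduce framework. It assumes `centers` is a dictionary from cluster ID to coordinates and that each entry of `partitions` is the list of rows on one processor; all function names are invented for this example.

```python
import math
from collections import defaultdict


def map_rows(rows, centers):
    """Map: emit (cluster ID, row) for each input row."""
    for row in rows:
        cid = min(centers, key=lambda c: math.dist(row, centers[c]))
        yield cid, row


def combine(pairs):
    """Combine: per cluster ID, emit (count, average coordinates) for one mapper."""
    groups = defaultdict(list)
    for cid, row in pairs:
        groups[cid].append(row)
    for cid, rows in groups.items():
        n = len(rows)
        yield cid, (n, tuple(sum(col) / n for col in zip(*rows)))


def reduce_centers(combined, iteration):
    """Reduce: weighted average per cluster, returned as centroid-file records."""
    totals = {}
    for cid, (n, mean) in combined:
        count, sums = totals.get(cid, (0, [0.0] * len(mean)))
        totals[cid] = (count + n, [s + n * m for s, m in zip(sums, mean)])
    return [(iteration, cid, tuple(s / count for s in sums), count)
            for cid, (count, sums) in sorted(totals.items())]


def run_iteration(partitions, centers, iteration):
    """Run map and combine on each partition, then a single reduce."""
    combined = []
    for rows in partitions:
        combined.extend(combine(map_rows(rows, centers)))
    return reduce_centers(combined, iteration)
```

Each returned record holds the iteration number, the cluster ID, the new center coordinates, and the cluster size, which is the information listed above for the output file.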

K-means using just map and reduce

Using just map and reduce, it is possible to do the same thing. In this case, the map and combine functions described above are merged into a single function.

So, the map function does the following:

  • Initializes itself with the cluster centers from the previous iteration;
  • Keeps information about each cluster in memory. This information is the total number of records assigned to the cluster in the processor and the total of each coordinate.
  • For each record, it updates the information in memory.
  • Once all records have been read, it outputs the same key-value pairs as the combine function described above.

The reduce function is the same as above.
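
A sketch of such a map function follows, written as plain Python rather than against any particular MapReduce API. It assumes `centers` maps cluster ID to coordinates; the name `map_partition` is hypothetical. Only the running count and coordinate totals per cluster are held in memory, never the rows themselves.

```python
import math


def map_partition(rows, centers):
    """Map with in-memory totals: emit one (cluster ID, (count, mean)) per cluster seen."""
    counts = {}
    sums = {}
    for row in rows:
        # Find the nearest center from the previous iteration.
        cid = min(centers, key=lambda c: math.dist(row, centers[c]))
        # Update the per-cluster record count and coordinate totals.
        counts[cid] = counts.get(cid, 0) + 1
        prev = sums.get(cid, [0.0] * len(row))
        sums[cid] = [s + x for s, x in zip(prev, row)]
    # After the last row, emit the same pairs the combine function would produce.
    for cid in sorted(counts):
        n = counts[cid]
        yield cid, (n, tuple(s / n for s in sums[cid]))
```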

K-means using SQL

Of course, one of my purposes in discussing MapReduce has been to understand whether and how it is more powerful than SQL. For fifteen years, databases have been the only data-parallel application readily available. The parallelism is hidden underneath the SQL language, so most people using SQL do not fully appreciate the power they are using.

An iteration of K-means looks like:

SELECT @iteration + 1 AS iteration, a.cluster_id,
       AVERAGE(a.data) AS center
FROM (SELECT d.data, cc.cluster_id,
             ROW_NUMBER() OVER (PARTITION BY d.data
                                ORDER BY DISTANCE(d.data, cc.center)) AS ranking
      FROM data d CROSS JOIN
           (SELECT *
            FROM cluster_centers
            WHERE iteration = @iteration) cc
     ) a
WHERE ranking = 1
GROUP BY a.cluster_id

This code assumes the existence of AVERAGE() and DISTANCE() functions. These are placeholders for the correct functions. Also, it uses analytic functions. (If you are not familiar with these, I recommend my book Data Analysis Using SQL and Excel.)

The efficiency of the SQL code is determined, to a large extent, by the analytic function that ranks all the cluster centers. We hope that a powerful parallel engine will recognize that the data for each ranking is all in one place, and hence that this function will be quite efficient.

A final note about K-means clustering

The K-means clustering algorithm does require reading through all the data for each iteration of the algorithm. In general, it tends to converge rather quickly (tens of iterations), so this may not be an issue. Also, the I/O for reading the data can all be local I/O, rather than sending large amounts of data through the network.

For most purposes, if you are dealing with a really big dataset, you can sample it down to a fraction of its original size and still get reasonable clusters. If you are not satisfied with this method, then sample the data, find the centers of the clusters, and use these to initialize the centers for the overall data. This will probably reduce the number of iterations through the entire data to fewer than 10 (one pass for the sample, a handful for the final clustering).
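
The sample-then-initialize idea might look like the following sketch. It assumes rows are tuples of numbers and Euclidean distance; `lloyd` and `sample_then_cluster` are hypothetical names, and the 10% default sampling fraction is an arbitrary choice for illustration.

```python
import math
import random


def lloyd(rows, centers, max_iter=100):
    """Run K-means from the given starting centers; return (centers, passes over rows)."""
    passes = 0
    for _ in range(max_iter):
        passes += 1
        groups = [[] for _ in centers]
        for r in rows:
            k = min(range(len(centers)), key=lambda i: math.dist(r, centers[i]))
            groups[k].append(r)
        # An empty cluster keeps its previous center.
        new = [tuple(sum(c) / len(g) for c in zip(*g)) if g else old
               for g, old in zip(groups, centers)]
        if new == centers:
            break
        centers = new
    return centers, passes


def sample_then_cluster(rows, k, fraction=0.1, seed=0):
    """Cluster a random sample first, then use its centers to start the full run."""
    rng = random.Random(seed)
    sample = rng.sample(rows, max(k, int(len(rows) * fraction)))
    seed_centers, _ = lloyd(sample, rng.sample(sample, k))
    return lloyd(rows, seed_centers)
```

The second value returned by `sample_then_cluster` is the number of passes made over the full data, which is the quantity the sampling step is meant to keep small.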

When running the algorithm on very large amounts of data, numeric overflow is a very real issue. This is another reason why clustering locally, taking averages, and then taking the weighted average globally is beneficial, and why sampling is a good way to begin.
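
One way to take the weighted average without ever forming a large sum is to merge (count, mean) summaries pairwise. The sketch below is an illustration of that idea, not something prescribed above; `merge_means` is a hypothetical name.

```python
def merge_means(n_a, mean_a, n_b, mean_b):
    """Merge two (count, mean) summaries without forming the underlying sums."""
    n = n_a + n_b
    weight_b = n_b / n
    # Move from mean_a toward mean_b in proportion to b's share of the rows.
    return n, tuple(a + (b - a) * weight_b for a, b in zip(mean_a, mean_b))


# Same result as (3 * 1.0 + 1 * 5.0) / 4 and (3 * 2.0 + 1 * 10.0) / 4.
merged = merge_means(3, (1.0, 2.0), 1, (5.0, 10.0))

# Values near the floating-point limit still merge cleanly,
# whereas adding them first would overflow to infinity.
huge = merge_means(1, (1e308,), 1, (1e308,))
```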

Also, before clustering, it is a good idea to standardize numeric variables (subtract the average and divide by the standard deviation).
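
A minimal sketch of this standardization, assuming rows are tuples of numbers and using the population standard deviation (the sample standard deviation would work just as well). The name `standardize` is made up here, and constant columns are mapped to 0.0 to avoid dividing by zero.

```python
import statistics


def standardize(rows):
    """Return rows with each column shifted to mean 0 and scaled to standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [tuple((x - m) / s if s else 0.0 for x, m, s in zip(row, means, stdevs))
            for row in rows]
```

For example, `standardize([(1.0, 10.0), (3.0, 10.0)])` rescales the first column to -1.0 and 1.0 and maps the constant second column to 0.0.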

-- Gordon
Check out my latest book, Data Analysis Using SQL and Excel.
