Implementing the K-means algorithm on Hadoop with MapReduce


We want to write a MapReduce program that implements the K-means algorithm. The basic idea is as follows:

1. Start from a set of centroids: for the first iteration they are chosen at random from the samples; afterwards they are the centroids produced by the previous iteration.

2. Map: compute the distance between the sample and each centroid, find the centroid closest to the sample, and output that centroid as the key with the sample as the value (a standalone sketch of this step follows the list).

3. Reduce: the input key is a centroid and the values are the samples assigned to it; recompute the cluster center and store it in a global variable T.

4. Main: check whether the previous centroids and the newly computed centroids differ; if they have changed, continue iterating, otherwise exit.
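Step 2 hinges on the nearest-centroid search. As a small sketch of just that step, independent of Hadoop (the full MapReduce implementation appears later in this article):

// Returns the index of the centroid nearest to the sample, using Euclidean distance.
static int nearestCentroid(double[][] centers, double[] sample) {
    int index = 0;
    double min = Double.MAX_VALUE;
    for (int i = 0; i < centers.length; i++) {
        double dis = 0;
        for (int j = 0; j < sample.length; j++) {
            dis += Math.pow(centers[i][j] - sample[j], 2);
        }
        // Comparing squared distances preserves the ordering, so no sqrt is needed here.
        if (dis < min) {
            min = dis;
            index = i;
        }
    }
    return index;
}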

This article basically follows the steps above; only a few problems need to be solved first.

1. Hadoop has no user-defined global variables, so the idea of holding the centroids in a global variable is not feasible. The alternative is to store the centroids in a file.

2. Where should the centroid file be read? If it is read in map, then a single MapReduce job certainly cannot implement a full iteration. So we read the centroids in the main function and put them into the Configuration, which is readable in both map and reduce.

3. How do we detect whether the centroids changed? One option is to compare in main: read the new centroids and the previous centroids from their files and compare them. This works, but it is not elegant. Instead we use a user-defined counter. A counter is a global variable that can be read and written in both map and reduce. Looking at the steps above, reduce already sees both the centroid from the previous iteration (the key) and the freshly computed one, so the comparison can be done directly in reduce: if a centroid has not changed, increment the counter by 1. Main then only needs to read the counter value. A minimal sketch of this pattern follows.
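A minimal sketch of the counter pattern, kept separate from the full implementation below (the group and counter names match the ones used in the full code):

// In reduce: increment a user-defined counter when a centroid did not change.
if (centerUnchanged) {
    context.getCounter("myCounter", "kmeansCounter").increment(1L);
}

// In main, after job.waitForCompletion(true): read the counter back.
long unchanged = job.getCounters()
        .findCounter("myCounter", "kmeansCounter").getValue();
if (unchanged == k) {
    // all k centroids are unchanged, so the algorithm has converged
}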

To sum up, the detailed process is as follows:

1. The main function reads the centroid file.

2. The centroids are placed, as a single string, into the Configuration.

3. In the Mapper class, the setup method is overridden to get the centroid string from the Configuration and parse it into a two-dimensional array representing the centroids.

4. The map method of the Mapper class reads the sample file and compares each sample with every centroid, determines which centroid is closest, then outputs <centroid, sample>.

5. The Reducer class recomputes the centroids; if a recomputed centroid is identical to the one that came in, the user-defined counter is incremented by 1.

6. In main, the counter value is read and checked against the number of centroids; if it is not equal, the iteration continues, otherwise the program exits.

The detailed implementation is as follows.

1. POM dependency

The Hadoop version in the POM should be consistent with the cluster. If they are inconsistent, other computations may still run without problems, but using a counter will fail with:

java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected

The reason: starting with Hadoop 2.0, org.apache.hadoop.mapreduce.Counter changed from the class it was in version 1.0 into an interface. Check whether the Counter your project imports resolves to a class or an interface; if it is the class, the dependency you compile against does not match the cluster and needs to be changed.
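For reference, a minimal sketch of the dependency, assuming the cluster runs Hadoop 2.x and the project uses the hadoop-client artifact; the version shown is only a placeholder for whatever the cluster actually runs:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <!-- replace with the exact version deployed on the cluster -->
    <version>2.7.3</version>
</dependency>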

2. Sample

The sample file content is as follows:

1,1
2,2
3,3
-3,-3
-4,-4
-5,-5

3. Centroid

The initial centroids are picked at random from the samples:

1,1
2,2
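As a hand-worked example of one iteration on this data (the arithmetic is ours, not taken from the original post): with centroids (1,1) and (2,2), the map phase assigns (1,1) to the first centroid and (2,2) and (3,3) to the second, while (-3,-3), (-4,-4) and (-5,-5) are all closer to (1,1) than to (2,2). The reduce phase then averages each group, yielding the new centroids (-2.75,-2.75) and (2.5,2.5). These differ from the old centroids, so the counter is not incremented and the iteration continues.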

4. Code implementation

First, define a Center class. It holds the number of centroids, k, and two methods for reading centroids from HDFS: one reads the initial centroid file (a single real file), the other reads the centroid directory produced by each iteration (a directory of output files). The code is as follows.

Center class

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

public class Center {

    protected static int k = 2;    // number of centroids

    /**
     * Loads the centroids from the initial centroid file and returns them
     * as a single string, with a tab between centroids.
     */
    public String loadInitCenter(Path path) throws IOException {
        StringBuffer sb = new StringBuffer();
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FSDataInputStream dis = hdfs.open(path);
        LineReader in = new LineReader(dis, conf);
        Text line = new Text();
        while (in.readLine(line) > 0) {
            sb.append(line.toString().trim());
            sb.append("\t");
        }
        return sb.toString().trim();
    }

    /**
     * Reads the centroids from the centroid directory produced by an
     * iteration and returns them as a single string.
     */
    public String loadCenter(Path path) throws IOException {
        StringBuffer sb = new StringBuffer();
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FileStatus[] files = hdfs.listStatus(path);
        for (int i = 0; i < files.length; i++) {
            Path filePath = files[i].getPath();
            // Only the part-* files contain reducer output.
            if (!filePath.getName().contains("part")) continue;
            FSDataInputStream dis = hdfs.open(filePath);
            LineReader in = new LineReader(dis, conf);
            Text line = new Text();
            while (in.readLine(line) > 0) {
                sb.append(line.toString().trim());
                sb.append("\t");
            }
        }
        return sb.toString().trim();
    }
}
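Note the check for "part" in loadCenter: a completed job's output directory contains the reducer output files (part-r-00000 and so on) alongside a _SUCCESS marker file, and only the part files hold centroids, so everything else is skipped.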

KMeansMR class

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansMR {

    private static String FLAG = "Kcluster";

    public static class TokenizerMapper extends Mapper<Object, Text, Text, Text> {

        double[][] centers = new double[Center.k][];
        String[] centerstrArray = null;

        @Override
        public void setup(Context context) {
            // Convert the cluster-center string stored in the configuration
            // into a two-dimensional array for easy use.
            String kMeansS = context.getConfiguration().get(FLAG);
            centerstrArray = kMeansS.split("\t");
            for (int i = 0; i < centerstrArray.length; i++) {
                String[] segs = centerstrArray[i].split(",");
                centers[i] = new double[segs.length];
                for (int j = 0; j < segs.length; j++) {
                    centers[i][j] = Double.parseDouble(segs[j]);
                }
            }
        }

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] segs = line.split(",");
            double[] sample = new double[segs.length];
            for (int i = 0; i < segs.length; i++) {
                sample[i] = Float.parseFloat(segs[i]);
            }
            // Find the centroid with the shortest distance to the sample.
            double min = Double.MAX_VALUE;
            int index = 0;
            for (int i = 0; i < centers.length; i++) {
                double dis = distance(centers[i], sample);
                if (dis < min) {
                    min = dis;
                    index = i;
                }
            }
            context.write(new Text(centerstrArray[index]), new Text(line));
        }
    }

    public static class IntSumReducer extends Reducer<Text, Text, NullWritable, Text> {

        Counter counter = null;

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Note: sized by Center.k; this assumes each sample has k dimensions,
            // which holds for the two-dimensional sample data used here.
            double[] sum = new double[Center.k];
            int size = 0;
            // Sum the values on each dimension and store them in the sum array.
            for (Text text : values) {
                String[] segs = text.toString().split(",");
                for (int i = 0; i < segs.length; i++) {
                    sum[i] += Double.parseDouble(segs[i]);
                }
                size++;
            }
            // Average each dimension of the sum array: this is the new centroid.
            StringBuffer sb = new StringBuffer();
            for (int i = 0; i < sum.length; i++) {
                sum[i] /= size;
                sb.append(sum[i]);
                sb.append(",");
            }
            // Check whether the new centroid is the same as the old centroid.
            boolean flag = true;
            String[] centerstrArray = key.toString().split(",");
            for (int i = 0; i < centerstrArray.length; i++) {
                if (Math.abs(Double.parseDouble(centerstrArray[i]) - sum[i]) > 0.00000000001) {
                    flag = false;
                    break;
                }
            }
            // If the new centroid equals the old one, increment the counter.
            if (flag) {
                counter = context.getCounter("myCounter", "kmeansCounter");
                counter.increment(1L);
            }
            context.write(NullWritable.get(), new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Path kMeansPath = new Path("/dsap/middata/kmeans/kmeans");  // initial centroid file
        Path samplePath = new Path("/dsap/middata/kmeans/sample");  // sample file
        // Load the cluster-center file.
        Center center = new Center();
        String centerString = center.loadInitCenter(kMeansPath);

        int index = 0;  // iteration number
        while (index < 5) {
            Configuration conf = new Configuration();
            conf.set(FLAG, centerString);  // put the cluster-center string into the configuration

            // Output path of this iteration, and the centroid read path of the next.
            kMeansPath = new Path("/dsap/middata/kmeans/kmeans" + index);
            // If the output path already exists, delete it.
            FileSystem hdfs = FileSystem.get(conf);
            if (hdfs.exists(kMeansPath)) hdfs.delete(kMeansPath, true);

            Job job = new Job(conf, "kmeans" + index);
            job.setJarByClass(KMeansMR.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, samplePath);
            FileOutputFormat.setOutputPath(job, kMeansPath);
            job.waitForCompletion(true);

            // Read the user-defined counter; if it equals the number of centroids,
            // no centroid has changed, so the program stops iterating.
            long counter = job.getCounters().getGroup("myCounter")
                    .findCounter("kmeansCounter").getValue();
            if (counter == Center.k) System.exit(0);

            // Load the centroids again for the next iteration.
            center = new Center();
            centerString = center.loadCenter(kMeansPath);

            index++;
        }
        System.exit(0);
    }

    public static double distance(double[] a, double[] b) {
        if (a == null || b == null || a.length != b.length) return Double.MAX_VALUE;
        double dis = 0;
        for (int i = 0; i < a.length; i++) {
            dis += Math.pow(a[i] - b[i], 2);
        }
        return Math.sqrt(dis);
    }
}
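Assuming the code is packaged into a jar named kmeans.jar (the name is hypothetical) and the sample and initial centroid files already exist at the hard-coded HDFS paths above, the iteration chain can be launched with something like:

hadoop jar kmeans.jar KMeansMR

Each iteration writes its new centroids to /dsap/middata/kmeans/kmeans<index>, which Center.loadCenter reads back for the next round.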

5. Results

Two directories were generated, containing the cluster centers of the first and second iterations respectively.

[Screenshot of the generated output directories not preserved.]

The final cluster center content was as follows:

[Screenshot of the final cluster centers not preserved.]
Copyright notice: this is an original blog article and may not be reproduced without the author's consent.

