Implementing the K-means algorithm on Hadoop with MapReduce


We want to write a MapReduce program that implements the K-means algorithm. The basic idea is as follows:

1. Start from a set of centroids: for the first iteration they are chosen at random from the samples; afterwards they are the centroids produced by the previous iteration.

2. Map: compute the distance between the sample and each centroid, find the centroid closest to the sample, and output that centroid as the key with the sample as the value (a standalone sketch of this step follows the list).

3. Reduce: the input key is a centroid and the values are the samples assigned to it; recompute the cluster center and store it in a global variable T.

4. Main: check whether the previous centroids and the newly computed centroids differ; if they have changed, continue iterating, otherwise exit.
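Step 2 hinges on the nearest-centroid search. As a small sketch of just that step, independent of Hadoop (the full MapReduce implementation appears later in this article):

// Returns the index of the centroid nearest to the sample, using Euclidean distance.
static int nearestCentroid(double[][] centers, double[] sample) {
    int index = 0;
    double min = Double.MAX_VALUE;
    for (int i = 0; i < centers.length; i++) {
        double dis = 0;
        for (int j = 0; j < sample.length; j++) {
            dis += Math.pow(centers[i][j] - sample[j], 2);
        }
        // Comparing squared distances preserves the ordering, so no sqrt is needed here.
        if (dis < min) {
            min = dis;
            index = i;
        }
    }
    return index;
}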

This article basically follows the steps above; only a few problems need to be solved first.

1. Hadoop has no user-defined global variables, so the idea of holding the centroids in a global variable is not feasible. The alternative is to store the centroids in a file.

2. Where should the centroid file be read? If it is read in map, then a single MapReduce job certainly cannot implement a full iteration. So we read the centroids in the main function and put them into the Configuration, which is readable in both map and reduce.

3. How do we detect whether the centroids changed? One option is to compare in main: read the new centroids and the previous centroids from their files and compare them. This works, but it is not elegant. Instead we use a user-defined counter. A counter is a global variable that can be read and written in both map and reduce. Looking at the steps above, reduce already sees both the centroid from the previous iteration (the key) and the freshly computed one, so the comparison can be done directly in reduce: if a centroid has not changed, increment the counter by 1. Main then only needs to read the counter value. A minimal sketch of this pattern follows.
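A minimal sketch of the counter pattern, kept separate from the full implementation below (the group and counter names match the ones used in the full code):

// In reduce: increment a user-defined counter when a centroid did not change.
if (centerUnchanged) {
    context.getCounter("myCounter", "kmeansCounter").increment(1L);
}

// In main, after job.waitForCompletion(true): read the counter back.
long unchanged = job.getCounters()
        .findCounter("myCounter", "kmeansCounter").getValue();
if (unchanged == k) {
    // all k centroids are unchanged, so the algorithm has converged
}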

To sum up, the detailed process is as follows:

1. The main function reads the centroid file.

2. The centroids are placed, as a single string, into the Configuration.

3. In the Mapper class, the setup method is overridden to get the centroid string from the Configuration and parse it into a two-dimensional array representing the centroids.

4. The map method of the Mapper class reads the sample file and compares each sample with every centroid, determines which centroid is closest, then outputs <centroid, sample>.

5. The Reducer class recomputes the centroids; if a recomputed centroid is identical to the one that came in, the user-defined counter is incremented by 1.

6. In main, the counter value is read and checked against the number of centroids; if it is not equal, the iteration continues, otherwise the program exits.

The detailed implementation is as follows.

1. POM dependency

The Hadoop version in the POM should be consistent with the cluster. If they are inconsistent, other computations may still run without problems, but using a counter will fail with:

java.lang.IncompatibleClassChangeError: Found interface org.apache.hadoop.mapreduce.Counter, but class was expected

The reason: starting with Hadoop 2.0, org.apache.hadoop.mapreduce.Counter changed from the class it was in version 1.0 into an interface. Check whether the Counter your project imports resolves to a class or an interface; if it is the class, the dependency you compile against does not match the cluster and needs to be changed.
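For reference, a minimal sketch of the dependency, assuming the cluster runs Hadoop 2.x and the project uses the hadoop-client artifact; the version shown is only a placeholder for whatever the cluster actually runs:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-client</artifactId>
    <!-- replace with the exact version deployed on the cluster -->
    <version>2.7.3</version>
</dependency>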

2. Sample

The sample file content is as follows:

1,1
2,2
3,3
-3,-3
-4,-4
-5,-5

3. Centroid

The initial centroids are picked at random from the samples:

1,1
2,2
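As a hand-worked example of one iteration on this data (the arithmetic is ours, not taken from the original post): with centroids (1,1) and (2,2), the map phase assigns (1,1) to the first centroid and (2,2) and (3,3) to the second, while (-3,-3), (-4,-4) and (-5,-5) are all closer to (1,1) than to (2,2). The reduce phase then averages each group, yielding the new centroids (-2.75,-2.75) and (2.5,2.5). These differ from the old centroids, so the counter is not incremented and the iteration continues.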

4. Code implementation

First, define a Center class. It holds the number of centroids, k, and two methods for reading centroids from HDFS: one reads the initial centroid file (a single real file), the other reads the centroid directory produced by each iteration (a directory of output files). The code is as follows.

Center class

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.util.LineReader;

public class Center {

    protected static int k = 2;    // number of centroids

    /**
     * Loads the centroids from the initial centroid file and returns them
     * as a single string, with a tab between centroids.
     */
    public String loadInitCenter(Path path) throws IOException {
        StringBuffer sb = new StringBuffer();
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FSDataInputStream dis = hdfs.open(path);
        LineReader in = new LineReader(dis, conf);
        Text line = new Text();
        while (in.readLine(line) > 0) {
            sb.append(line.toString().trim());
            sb.append("\t");
        }
        return sb.toString().trim();
    }

    /**
     * Reads the centroids from the centroid directory produced by an
     * iteration and returns them as a single string.
     */
    public String loadCenter(Path path) throws IOException {
        StringBuffer sb = new StringBuffer();
        Configuration conf = new Configuration();
        FileSystem hdfs = FileSystem.get(conf);
        FileStatus[] files = hdfs.listStatus(path);
        for (int i = 0; i < files.length; i++) {
            Path filePath = files[i].getPath();
            // Only the part-* files contain reducer output.
            if (!filePath.getName().contains("part")) continue;
            FSDataInputStream dis = hdfs.open(filePath);
            LineReader in = new LineReader(dis, conf);
            Text line = new Text();
            while (in.readLine(line) > 0) {
                sb.append(line.toString().trim());
                sb.append("\t");
            }
        }
        return sb.toString().trim();
    }
}
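Note the check for "part" in loadCenter: a completed job's output directory contains the reducer output files (part-r-00000 and so on) alongside a _SUCCESS marker file, and only the part files hold centroids, so everything else is skipped.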

KMeansMR class

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Counter;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class KMeansMR {

    private static String FLAG = "Kcluster";

    public static class TokenizerMapper extends Mapper<Object, Text, Text, Text> {

        double[][] centers = new double[Center.k][];
        String[] centerstrArray = null;

        @Override
        public void setup(Context context) {
            // Convert the cluster-center string stored in the configuration
            // into a two-dimensional array for easy use.
            String kMeansS = context.getConfiguration().get(FLAG);
            centerstrArray = kMeansS.split("\t");
            for (int i = 0; i < centerstrArray.length; i++) {
                String[] segs = centerstrArray[i].split(",");
                centers[i] = new double[segs.length];
                for (int j = 0; j < segs.length; j++) {
                    centers[i][j] = Double.parseDouble(segs[j]);
                }
            }
        }

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            String[] segs = line.split(",");
            double[] sample = new double[segs.length];
            for (int i = 0; i < segs.length; i++) {
                sample[i] = Float.parseFloat(segs[i]);
            }
            // Find the centroid with the shortest distance to the sample.
            double min = Double.MAX_VALUE;
            int index = 0;
            for (int i = 0; i < centers.length; i++) {
                double dis = distance(centers[i], sample);
                if (dis < min) {
                    min = dis;
                    index = i;
                }
            }
            context.write(new Text(centerstrArray[index]), new Text(line));
        }
    }

    public static class IntSumReducer extends Reducer<Text, Text, NullWritable, Text> {

        Counter counter = null;

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Note: sized by Center.k; this assumes each sample has k dimensions,
            // which holds for the two-dimensional sample data used here.
            double[] sum = new double[Center.k];
            int size = 0;
            // Sum the values on each dimension and store them in the sum array.
            for (Text text : values) {
                String[] segs = text.toString().split(",");
                for (int i = 0; i < segs.length; i++) {
                    sum[i] += Double.parseDouble(segs[i]);
                }
                size++;
            }
            // Average each dimension of the sum array: this is the new centroid.
            StringBuffer sb = new StringBuffer();
            for (int i = 0; i < sum.length; i++) {
                sum[i] /= size;
                sb.append(sum[i]);
                sb.append(",");
            }
            // Check whether the new centroid is the same as the old centroid.
            boolean flag = true;
            String[] centerstrArray = key.toString().split(",");
            for (int i = 0; i < centerstrArray.length; i++) {
                if (Math.abs(Double.parseDouble(centerstrArray[i]) - sum[i]) > 0.00000000001) {
                    flag = false;
                    break;
                }
            }
            // If the new centroid equals the old one, increment the counter.
            if (flag) {
                counter = context.getCounter("myCounter", "kmeansCounter");
                counter.increment(1L);
            }
            context.write(NullWritable.get(), new Text(sb.toString()));
        }
    }

    public static void main(String[] args) throws Exception {
        Path kMeansPath = new Path("/dsap/middata/kmeans/kmeans");  // initial centroid file
        Path samplePath = new Path("/dsap/middata/kmeans/sample");  // sample file
        // Load the cluster-center file.
        Center center = new Center();
        String centerString = center.loadInitCenter(kMeansPath);

        int index = 0;  // iteration number
        while (index < 5) {
            Configuration conf = new Configuration();
            conf.set(FLAG, centerString);  // put the cluster-center string into the configuration

            // Output path of this iteration, and the centroid read path of the next.
            kMeansPath = new Path("/dsap/middata/kmeans/kmeans" + index);
            // If the output path already exists, delete it.
            FileSystem hdfs = FileSystem.get(conf);
            if (hdfs.exists(kMeansPath)) hdfs.delete(kMeansPath, true);

            Job job = new Job(conf, "kmeans" + index);
            job.setJarByClass(KMeansMR.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(NullWritable.class);
            job.setOutputValueClass(Text.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            FileInputFormat.addInputPath(job, samplePath);
            FileOutputFormat.setOutputPath(job, kMeansPath);
            job.waitForCompletion(true);

            // Read the user-defined counter; if it equals the number of centroids,
            // no centroid has changed, so the program stops iterating.
            long counter = job.getCounters().getGroup("myCounter")
                    .findCounter("kmeansCounter").getValue();
            if (counter == Center.k) System.exit(0);

            // Load the centroids again for the next iteration.
            center = new Center();
            centerString = center.loadCenter(kMeansPath);

            index++;
        }
        System.exit(0);
    }

    public static double distance(double[] a, double[] b) {
        if (a == null || b == null || a.length != b.length) return Double.MAX_VALUE;
        double dis = 0;
        for (int i = 0; i < a.length; i++) {
            dis += Math.pow(a[i] - b[i], 2);
        }
        return Math.sqrt(dis);
    }
}
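Assuming the code is packaged into a jar named kmeans.jar (the name is hypothetical) and the sample and initial centroid files already exist at the hard-coded HDFS paths above, the iteration chain can be launched with something like:

hadoop jar kmeans.jar KMeansMR

Each iteration writes its new centroids to /dsap/middata/kmeans/kmeans<index>, which Center.loadCenter reads back for the next round.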

5. Results

Two directories were generated, containing the cluster centers of the first and second iterations respectively.

[Screenshot of the generated output directories not preserved.]

The final cluster center content was as follows:

[Screenshot of the final cluster centers not preserved.]
Copyright notice: this is an original blog article and may not be reproduced without the author's consent.

