Data mining: Mahout canopy clustering in practice

Source: Internet
Author: User

1. Explanation of principle


(1) Put the original data points into a list, ordered by some rule, and choose two distance thresholds T1 and T2 with T1 > T2.

(2) Randomly pick a data vector A from the list and, using a cheap approximate distance measure, compute the distance d between A and every other data vector in the list.

(3) Using the distances from step (2), put every vector with d < T1 into the same canopy as A, and remove A and every vector with d < T2 from the list.

(4) Repeat steps (2) and (3) until the list is empty.
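The four steps above can be sketched in plain Java, independent of Mahout. This is a minimal illustration and not Mahout's implementation: the class name CanopySketch, the use of Manhattan distance as the "cheap" measure, and picking the first remaining point instead of a random one (to keep the run deterministic) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

public class CanopySketch {

    /** A canopy: its center plus every point closer than T1 to that center. */
    static class Canopy {
        final double[] center;
        final List<double[]> members = new ArrayList<>();

        Canopy(double[] center) {
            this.center = center;
        }
    }

    /** Cheap distance for the canopy pass (Manhattan distance, an assumption). */
    static double cheapDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    static List<Canopy> cluster(List<double[]> points, double t1, double t2) {
        if (t1 <= t2) {
            throw new IllegalArgumentException("T1 must be greater than T2");
        }
        List<double[]> remaining = new ArrayList<>(points);
        List<Canopy> canopies = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // Step 2: take a point as the canopy center (first one here, not random).
            double[] center = remaining.remove(0);
            Canopy canopy = new Canopy(center);
            canopy.members.add(center);
            Iterator<double[]> it = remaining.iterator();
            while (it.hasNext()) {
                double[] p = it.next();
                double d = cheapDistance(center, p);
                if (d < t1) {
                    canopy.members.add(p); // Step 3: within T1, joins this canopy
                }
                if (d < t2) {
                    it.remove();           // Step 3: within T2, leaves the list
                }
            }
            canopies.add(canopy);          // Step 4: loop until the list is empty
        }
        return canopies;
    }

    public static void main(String[] args) {
        List<double[]> points = new ArrayList<>();
        points.add(new double[] {1.0, 1.0});
        points.add(new double[] {1.5, 1.0});
        points.add(new double[] {3.0, 1.0});
        points.add(new double[] {9.0, 9.0});
        for (Canopy c : cluster(points, 3.0, 1.0)) {
            System.out.println("center=(" + c.center[0] + ", " + c.center[1]
                    + ") members=" + c.members.size());
        }
    }
}
```

With these four points, T1 = 3 and T2 = 1, the sketch produces three canopies. The point (3, 1) is a member of the first canopy and later becomes the center of the second, which shows that canopies may overlap.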



2. Download test data


cd /tmp

hadoop dfs -mkdir /input

wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data

hadoop dfs -copyFromLocal /tmp/synthetic_control.data /input/synthetic_control.data


3. Format conversion (text → Vector)


Create the file Text2VectorWritable.java


package mahout.fansy.utils.transform;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

/**
 * Transform text data to VectorWritable data.
 * @author fansy
 */
public class Text2VectorWritable extends AbstractJob {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new Text2VectorWritable(), args);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        addInputOption();
        addOutputOption();
        if (parseArguments(arg0) == null) {
            return -1;
        }
        Path input = getInputPath();
        Path output = getOutputPath();
        Configuration conf = getConf();
        // set job information
        Job job = new Job(conf, "text2vectorWritableCopy with input:" + input.getName());
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setMapperClass(Text2VectorWritableMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(VectorWritable.class);
        job.setReducerClass(Text2VectorWritableReducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(VectorWritable.class);
        job.setJarByClass(Text2VectorWritable.class);
        FileInputFormat.addInputPath(job, input);
        SequenceFileOutputFormat.setOutputPath(job, output);
        if (!job.waitForCompletion(true)) { // wait for the job to finish
            throw new InterruptedException("Canopy Job failed processing " + input);
        }
        return 0;
    }

    /**
     * Mapper: parses each text line into a vector.
     * @author fansy
     */
    public static class Text2VectorWritableMapper
            extends Mapper<LongWritable, Text, LongWritable, VectorWritable> {

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // split the line on one or more whitespace characters
            String[] str = value.toString().split("\\s{1,}");
            Vector vector = new RandomAccessSparseVector(str.length);
            for (int i = 0; i < str.length; i++) {
                vector.set(i, Double.parseDouble(str[i]));
            }
            VectorWritable va = new VectorWritable(vector);
            context.write(key, va);
        }
    }

    /**
     * Reducer: does nothing but write its input to the output.
     * @author fansy
     */
    public static class Text2VectorWritableReducer
            extends Reducer<LongWritable, VectorWritable, LongWritable, VectorWritable> {

        public void reduce(LongWritable key, Iterable<VectorWritable> values, Context context)
                throws IOException, InterruptedException {
            for (VectorWritable v : values) {
                context.write(key, v);
            }
        }
    }
}
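The parsing step inside the mapper can be tried without a Hadoop cluster. The sketch below mirrors it in plain Java, using a double[] in place of Mahout's Vector; the class name LineToVectorSketch, the sample line, the trim() call, and the empty-line guard are assumptions added for illustration and are not part of the job above.

```java
import java.util.Arrays;

public class LineToVectorSketch {

    /** Splits a line on runs of whitespace and parses every token as a double. */
    static double[] parseLine(String line) {
        String trimmed = line.trim(); // guard against leading/trailing blanks
        if (trimmed.isEmpty()) {
            return new double[0];
        }
        String[] tokens = trimmed.split("\\s+");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        // A made-up line in the same whitespace-separated layout as the input file.
        double[] v = parseLine("28.7812 34.4632   31.3381");
        System.out.println(v.length + " values: " + Arrays.toString(v));
    }
}
```

A line that starts with whitespace would yield an empty first token under a bare split, and Double.parseDouble would then throw a NumberFormatException, which is why the sketch trims first.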

Compile the class, export it as ClusteringUtils.jar, and copy the jar to /home/hadoop/mahout/mahout_jar

When exporting from Eclipse, choose Export → Runnable JAR file → Extract required libraries into generated JAR



Then execute:

hadoop jar /home/hadoop/mahout/mahout_jar/ClusteringUtils.jar mahout.fansy.utils.transform.Text2VectorWritable -i hdfs:///input/synthetic_control.data -o hdfs:///input/synthetic_control.data.transform


You may get a class-not-found error for org/apache/mahout/common/AbstractJob. This usually means that HADOOP_CLASSPATH does not include the Mahout jars.


Workaround 1:

Copy the Mahout jar files into /home/hadoop/hadoop/lib and confirm that this directory is indeed in HADOOP_CLASSPATH

cp /home/hadoop/mahout/*.jar /home/hadoop/hadoop/lib


Workaround 2 (recommended):

Add the following to hadoop-env.sh

for f in /home/hadoop/mahout/*.jar; do
  if [ "$HADOOP_CLASSPATH" ]; then
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
  else
    export HADOOP_CLASSPATH=$f
  fi
done


Remember to distribute hadoop-env.sh to the other nodes


Restart the Hadoop environment:

stop-all.sh

start-all.sh


Run the conversion:

hadoop jar /home/hadoop/mahout/mahout_jar/ClusteringUtils.jar mahout.fansy.utils.transform.Text2VectorWritable -i hdfs:///input/synthetic_control.data -o hdfs:///input/synthetic_control.data.transform

If the main class was already specified when the jar was exported, the command above fails. Use the following command instead:

hadoop jar /home/hadoop/mahout/mahout_jar/ClusteringUtils.jar -i hdfs:///input/synthetic_control.data -o hdfs:///input/synthetic_control.data.transform


The output is now a binary vector file that is no longer human-readable:

hdfs:///input/synthetic_control.data.transform/part-r-00000


4. Run canopy clustering

mahout canopy --input hdfs:///input/synthetic_control.data.transform/part-r-00000 --output /output/canopy --distanceMeasure org.apache.mahout.common.distance.EuclideanDistanceMeasure --t1 --t2 --t3
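The thresholds passed to the command are compared against distances produced by the chosen measure, so it helps to know what scale those distances have. The sketch below computes a Euclidean distance in plain Java; it is an illustration of the formula only, not Mahout's EuclideanDistanceMeasure class, and the class name EuclideanSketch and the sample points are assumptions.

```java
public class EuclideanSketch {

    /** Euclidean distance: square root of the sum of squared coordinate differences. */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double[] p = {0.0, 0.0};
        double[] q = {3.0, 4.0};
        System.out.println(distance(p, q)); // prints 5.0
    }
}
```

Computing this distance for a few pairs of rows from the input file gives a rough sense of the range in which T1 and T2 should fall.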


5. Format conversion (vector → text)

Convert the result of step 4 to text

Create the file ReadClusterWritable.java


package mahout.fansy.utils;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.clustering.iterator.ClusterWritable;
import org.apache.mahout.common.AbstractJob;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Read cluster centers.
 * @author fansy
 */
public class ReadClusterWritable extends AbstractJob {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new ReadClusterWritable(), args);
    }

    @Override
    public int run(String[] args) throws Exception {
        addInputOption();
        addOutputOption();
        if (parseArguments(args) == null) {
            return -1;
        }
        Job job = new Job(getConf(), getInputPath().toString());
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(RM.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setNumReduceTasks(0); // map-only job
        job.setJarByClass(ReadClusterWritable.class);
        FileInputFormat.addInputPath(job, getInputPath());
        FileOutputFormat.setOutputPath(job, getOutputPath());
        if (!job.waitForCompletion(true)) {
            throw new InterruptedException("Canopy Job failed processing " + getInputPath());
        }
        return 0;
    }

    /** Mapper: writes each cluster center as a formatted string. */
    public static class RM extends Mapper<Text, ClusterWritable, Text, Text> {

        private Logger log = LoggerFactory.getLogger(RM.class);

        public void map(Text key, ClusterWritable value, Context context)
                throws IOException, InterruptedException {
            String str = value.getValue().getCenter().asFormatString();
            System.out.println("center****************:" + str);
            log.info("center*****************************:" + str); // set log information
            context.write(key, new Text(str));
        }
    }
}


Package the class into ClusteringUtils.jar and upload it to /home/hadoop/mahout/mahout_jar


If you need to clear the launch configuration information in Eclipse, go to .metadata/.plugins/org.eclipse.debug.core/.launches in the folder where the project is located and delete the files inside.


Run:

hadoop jar ClusteringUtils.jar mahout.fansy.utils.ReadClusterWritable -i /output/canopy/clusters-0-final/part-r-00000 -o /output/canopy-output

If that does not succeed, run the following command instead:

hadoop jar ClusteringUtils.jar -i /output/canopy/clusters-0-final/part-r-00000 -o /output/canopy-output


At this point, /output/canopy-output/part-m-00000 is the text file containing the clustering result.

