1. How canopy clustering works
(1) Arrange the original data set into a list according to some rule, and set two initial distance thresholds T1 and T2, with T1 > T2.
(2) Randomly select a data vector A from the list and use a cheap, approximate distance measure to compute the distance d between A and every other sample vector in the list.
(3) Using the distances d from step (2), put every sample vector with d < T1 into the same canopy as A, and remove every sample vector with d < T2 from the list.
(4) Repeat steps (2) and (3) until the list is empty.
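The four steps above can be sketched on a single machine as follows. This is a minimal illustration, not Mahout's implementation: the class name `CanopySketch`, the choice of Manhattan distance as the "cheap" measure, and taking the first remaining point as the center (instead of a random one, so the result is reproducible) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Minimal single-machine sketch of canopy generation (hypothetical, not Mahout code). */
final class CanopySketch {

    /** Cheap distance assumed for the sketch: Manhattan (L1) distance. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    /**
     * Builds canopies with thresholds t1 > t2.
     * Each canopy is a list of points whose first element is the canopy center.
     */
    static List<List<double[]>> canopies(List<double[]> points, double t1, double t2) {
        if (t1 <= t2) {
            throw new IllegalArgumentException("t1 must be greater than t2");
        }
        List<double[]> remaining = new ArrayList<>(points);
        List<List<double[]>> result = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // Step (2): pick a center. The text picks one at random; the sketch
            // takes the first remaining point to keep the output deterministic.
            double[] center = remaining.remove(0);
            List<double[]> canopy = new ArrayList<>();
            canopy.add(center);
            List<double[]> next = new ArrayList<>();
            for (double[] p : remaining) {
                double d = distance(center, p);
                if (d < t1) {
                    canopy.add(p); // step (3): within T1, joins this canopy
                }
                if (d >= t2) {
                    next.add(p);   // step (3): not within T2, stays in the list
                }
            }
            remaining = next;      // step (4): continue with what is left
            result.add(canopy);
        }
        return result;
    }

    public static void main(String[] args) {
        List<double[]> points = new ArrayList<>();
        points.add(new double[] {0, 0});
        points.add(new double[] {1, 0});
        points.add(new double[] {3, 0});
        points.add(new double[] {10, 0});
        System.out.println("canopies: " + canopies(points, 4.0, 2.0).size());
    }
}
```

Note that a point with T2 <= d < T1 joins the current canopy but stays in the list, so it can later appear in another canopy or become a center itself. That overlap is what distinguishes canopies from a hard partition.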
2. Download test data
cd /tmp
hadoop dfs -mkdir /input
wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
hadoop dfs -copyFromLocal /tmp/synthetic_control.data /input/synthetic_control.data
3. Format conversion (text → Vector)
Create the file Text2VectorWritable.java:
package mahout.fansy.utils.transform;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

/**
 * Transform text data to VectorWritable data.
 * @author fansy
 */
public class Text2VectorWritable extends AbstractJob {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new Text2VectorWritable(), args);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        addInputOption();
        addOutputOption();
        if (parseArguments(arg0) == null) {
            return -1;
        }
        Path input = getInputPath();
        Path output = getOutputPath();
        Configuration conf = getConf();
        // Set job information
        Job job = new Job(conf, "text2vectorWritableCopy with input: " + input.getName());
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setMapperClass(Text2VectorWritableMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(VectorWritable.class);
        job.setReducerClass(Text2VectorWritableReducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(VectorWritable.class);
        job.setJarByClass(Text2VectorWritable.class);
        FileInputFormat.addInputPath(job, input);
        SequenceFileOutputFormat.setOutputPath(job, output);
        if (!job.waitForCompletion(true)) { // wait for the job to finish
            throw new InterruptedException("Canopy Job failed processing " + input);
        }
        return 0;
    }

    /**
     * Mapper: parses each text line into a vector.
     * @author fansy
     */
    public static class Text2VectorWritableMapper
            extends Mapper<LongWritable, Text, LongWritable, VectorWritable> {
        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the line on one or more whitespace characters
            String[] str = value.toString().split("\\s{1,}");
            Vector vector = new RandomAccessSparseVector(str.length);
            for (int i = 0; i < str.length; i++) {
                vector.set(i, Double.parseDouble(str[i]));
            }
            VectorWritable va = new VectorWritable(vector);
            context.write(key, va);
        }
    }

    /**
     * Reducer: does nothing but write out its input.
     * @author fansy
     */
    public static class Text2VectorWritableReducer
            extends Reducer<LongWritable, VectorWritable, LongWritable, VectorWritable> {
        @Override
        public void reduce(LongWritable key, Iterable<VectorWritable> values, Context context)
                throws IOException, InterruptedException {
            for (VectorWritable v : values) {
                context.write(key, v);
            }
        }
    }
}
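The parsing step in the mapper can be tried without a Hadoop cluster. The sketch below isolates it in plain Java; the class name `LineParserSketch` and the use of a dense `double[]` instead of Mahout's `RandomAccessSparseVector` are assumptions made only so the example has no external dependencies. It also trims the line first, which the mapper above does not do.

```java
/** Hypothetical stand-alone version of the mapper's line parsing (no Hadoop or Mahout needed). */
final class LineParserSketch {

    /** Splits a line on runs of whitespace and parses each token as a double. */
    static double[] parseLine(String line) {
        // trim() avoids an empty first token when the line starts with spaces
        String[] tokens = line.trim().split("\\s+");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        double[] v = parseLine("28.7812 34.4632   31.3381");
        System.out.println("parsed " + v.length + " values, first = " + v[0]);
    }
}
```

An empty or non-numeric line makes `Double.parseDouble` throw `NumberFormatException`, so a production mapper would probably want to skip or count such lines rather than fail the task.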
Compile the class, export it as ClusteringUtils.jar, and copy the jar to /home/hadoop/mahout/mahout_jar.
When exporting from Eclipse, choose Export → Runnable JAR file → Extract required libraries into generated JAR.
Then execute:
hadoop jar /home/hadoop/mahout/mahout_jar/ClusteringUtils.jar mahout.fansy.utils.transform.Text2VectorWritable -i hdfs:///input/synthetic_control.data -o hdfs:///input/synthetic_control.data.transform
You may hit a "class not found" error for org/apache/mahout/common/AbstractJob. This usually means that HADOOP_CLASSPATH does not include the Mahout jars.
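One way to confirm whether a class is visible before rerunning the whole job is a tiny check program. This is a sketch under assumptions: the class name `ClasspathCheck` is made up for the example, and it only tells you about the classpath of the JVM it runs in, so it needs to be started with the same classpath that the Hadoop job uses.

```java
/** Hypothetical helper: reports whether a class can be found on the current classpath. */
final class ClasspathCheck {

    /** Returns true if the named class can be located, without initializing it. */
    static boolean isOnClasspath(String className) {
        try {
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // LinkageError covers NoClassDefFoundError from missing dependencies
            return false;
        }
    }

    public static void main(String[] args) {
        String name = args.length > 0 ? args[0] : "org.apache.mahout.common.AbstractJob";
        System.out.println(name + " visible: " + isOnClasspath(name));
    }
}
```

If the check prints `false` for `org.apache.mahout.common.AbstractJob`, the Mahout jars are not on the classpath and one of the workarounds below applies.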
Workaround 1:
Copy the Mahout jar files into /home/hadoop/hadoop/lib and confirm that /home/hadoop/hadoop/lib is indeed in HADOOP_CLASSPATH:
cp /home/hadoop/mahout/*.jar /home/hadoop/hadoop/lib
Workaround 2 (recommended):
Add the following to hadoop-env.sh:
for f in /home/hadoop/mahout/*.jar; do
  if [ "$HADOOP_CLASSPATH" ]; then
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
  else
    export HADOOP_CLASSPATH=$f
  fi
done
Remember to distribute hadoop-env.sh to the other nodes.
Then restart the Hadoop environment:
stop-all.sh
start-all.sh
Run the conversion:
hadoop jar /home/hadoop/mahout/mahout_jar/ClusteringUtils.jar mahout.fansy.utils.transform.Text2VectorWritable -i hdfs:///input/synthetic_control.data -o hdfs:///input/synthetic_control.data.transform
(If the main class was already specified when the jar was exported, this command fails; use the following command instead.)
hadoop jar /home/hadoop/mahout/mahout_jar/ClusteringUtils.jar -i hdfs:///input/synthetic_control.data -o hdfs:///input/synthetic_control.data.transform
The output is now a binary vector file that is no longer human-readable:
hdfs:///input/synthetic_control.data.transform/part-r-00000
4. Run canopy clustering
mahout canopy --input hdfs:///input/synthetic_control.data.transform/part-r-00000 --output /output/canopy --distanceMeasure org.apache.mahout.common.distance.EuclideanDistanceMeasure --t1 --t2 --t3
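The command above selects the Euclidean distance as the measure that is compared against the thresholds. As a plain-Java illustration of what that measure computes (the class name `EuclideanSketch` is made up here, and this is not Mahout's own code), the distance is the square root of the summed squared differences:

```java
/** Hypothetical plain-Java illustration of Euclidean distance between two vectors. */
final class EuclideanSketch {

    /** Returns sqrt(sum_i (a[i] - b[i])^2); both arrays must have the same length. */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double sumOfSquares = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sumOfSquares += diff * diff;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        double d = distance(new double[] {0, 0}, new double[] {3, 4});
        System.out.println("distance = " + d);
    }
}
```

Because the thresholds are compared against this value, sensible T1 and T2 settings depend on the scale and dimensionality of the data: the same thresholds that separate points well in 2 dimensions may be far too small for 60-dimensional rows.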
5. Format conversion (vector → text)
Convert the result of step 4 to text.
Create the file ReadClusterWritable.java:
package mahout.fansy.utils;

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.clustering.iterator.ClusterWritable;
import org.apache.mahout.common.AbstractJob;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Read cluster centers.
 * @author fansy
 */
public class ReadClusterWritable extends AbstractJob {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new ReadClusterWritable(), args);
    }

    @Override
    public int run(String[] args) throws Exception {
        addInputOption();
        addOutputOption();
        if (parseArguments(args) == null) {
            return -1;
        }
        Job job = new Job(getConf(), getInputPath().toString());
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(RM.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setNumReduceTasks(0); // map-only job
        job.setJarByClass(ReadClusterWritable.class);
        FileInputFormat.addInputPath(job, getInputPath());
        FileOutputFormat.setOutputPath(job, getOutputPath());
        if (!job.waitForCompletion(true)) {
            throw new InterruptedException("Canopy Job failed processing " + getInputPath());
        }
        return 0;
    }

    /** Mapper: writes each cluster center as a formatted text string. */
    public static class RM extends Mapper<Text, ClusterWritable, Text, Text> {
        private Logger log = LoggerFactory.getLogger(RM.class);

        @Override
        public void map(Text key, ClusterWritable value, Context context)
                throws IOException, InterruptedException {
            String str = value.getValue().getCenter().asFormatString();
            System.out.println("center****************:" + str);
            log.info("center*****************************:" + str); // log the center
            context.write(key, new Text(str));
        }
    }
}
Package the class into ClusteringUtils.jar and upload it to /home/hadoop/mahout/mahout_jar.
If you need to clear the launch configuration information in Eclipse, go to /.metadata/.plugins/org.eclipse.debug.core/.launches under the folder where the project is located and delete the files inside.
Run:
hadoop jar ClusteringUtils.jar mahout.fansy.utils.ReadClusterWritable -i /output/canopy/clusters-0-final/part-r-00000 -o /output/canopy-output
(If that does not succeed, run the following command instead.)
hadoop jar ClusteringUtils.jar -i /output/canopy/clusters-0-final/part-r-00000 -o /output/canopy-output
At this point, /output/canopy-output/part-m-00000 is the clustering result file.