Data mining in practice: Canopy clustering with Mahout

Last Update:2015-10-23 Source: Internet

Author: User


1. Explanation of principle

(1) Sort the original data set into a list according to some rule, and set two initial distance thresholds, T1 and T2, with T1 > T2.

(2) Randomly select a data vector A from the list and use a cheap (rough) distance measure to compute the distance d between A and every other sample vector in the list.

(3) Using the distances d from step 2, put every sample vector with d < T1 into the canopy centered on A, then remove A and every sample vector with d < T2 from the list.

(4) Repeat steps 2 and 3 until the list is empty.
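The four steps above can be sketched on a single machine in plain Java. This is only an illustration of the principle, not Mahout's MapReduce implementation: the class name CanopySketch, the use of Manhattan distance as the "rough" measure, and picking the first remaining point instead of a random one (so the result is reproducible) are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Illustrative single-machine canopy clustering; not Mahout's implementation. */
final class CanopySketch {

    /** Cheap distance for the rough comparison; Manhattan distance is an assumption. */
    static double roughDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            sum += Math.abs(a[i] - b[i]);
        }
        return sum;
    }

    /** Returns the canopies as lists of points; requires t1 > t2. */
    static List<List<double[]>> cluster(List<double[]> points, double t1, double t2) {
        if (t1 <= t2) {
            throw new IllegalArgumentException("T1 must be greater than T2");
        }
        List<double[]> remaining = new ArrayList<>(points);
        List<List<double[]>> canopies = new ArrayList<>();
        while (!remaining.isEmpty()) {
            // Step 2: take a point as the canopy center (first element, for determinism).
            double[] center = remaining.remove(0);
            List<double[]> canopy = new ArrayList<>();
            canopy.add(center);
            Iterator<double[]> it = remaining.iterator();
            while (it.hasNext()) {
                double[] p = it.next();
                double d = roughDistance(center, p);
                if (d < t1) {
                    canopy.add(p);   // Step 3: within T1, the point joins this canopy
                }
                if (d < t2) {
                    it.remove();     // Step 3: within T2, the point leaves the list
                }
            }
            canopies.add(canopy);    // Step 4: continue until the list is empty
        }
        return canopies;
    }

    public static void main(String[] args) {
        List<double[]> pts = new ArrayList<>();
        pts.add(new double[] {0.0, 0.0});
        pts.add(new double[] {1.0, 0.0});
        pts.add(new double[] {10.0, 10.0});
        pts.add(new double[] {11.0, 10.0});
        System.out.println(cluster(pts, 5.0, 2.0).size() + " canopies"); // prints "2 canopies"
    }
}
```

Because a point with T2 <= d < T1 joins the canopy but stays in the list, it can later end up in another canopy as well, so canopies may overlap.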

2. Download test data

cd /tmp
hadoop dfs -mkdir /input
wget http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data
hadoop dfs -copyFromLocal /tmp/synthetic_control.data /input/synthetic_control.data

3. Format conversion (text → Vector)

Create the file Text2VectorWritable.java:

package mahout.fansy.utils.transform;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

/**
 * Transforms text data into VectorWritable data.
 * @author Fansy
 */
public class Text2VectorWritable extends AbstractJob {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new Text2VectorWritable(), args);
    }

    @Override
    public int run(String[] arg0) throws Exception {
        addInputOption();
        addOutputOption();
        if (parseArguments(arg0) == null) {
            return -1;
        }
        Path input = getInputPath();
        Path output = getOutputPath();
        Configuration conf = getConf();
        // Set job information
        Job job = new Job(conf, "text2vectorWritableCopy with input: " + input.getName());
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        job.setMapperClass(Text2VectorWritableMapper.class);
        job.setMapOutputKeyClass(LongWritable.class);
        job.setMapOutputValueClass(VectorWritable.class);
        job.setReducerClass(Text2VectorWritableReducer.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(VectorWritable.class);
        job.setJarByClass(Text2VectorWritable.class);
        FileInputFormat.addInputPath(job, input);
        SequenceFileOutputFormat.setOutputPath(job, output);
        if (!job.waitForCompletion(true)) { // wait for the job to finish
            throw new InterruptedException("Canopy Job failed processing " + input);
        }
        return 0;
    }

    /**
     * Mapper: parses each text line into a vector.
     * @author Fansy
     */
    public static class Text2VectorWritableMapper
            extends Mapper<LongWritable, Text, LongWritable, VectorWritable> {

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Split the line on one or more whitespace characters
            String[] str = value.toString().split("\\s{1,}");
            Vector vector = new RandomAccessSparseVector(str.length);
            for (int i = 0; i < str.length; i++) {
                vector.set(i, Double.parseDouble(str[i]));
            }
            VectorWritable va = new VectorWritable(vector);
            context.write(key, va);
        }
    }

    /**
     * Reducer: does nothing but write its input to the output.
     * @author Fansy
     */
    public static class Text2VectorWritableReducer
            extends Reducer<LongWritable, VectorWritable, LongWritable, VectorWritable> {

        public void reduce(LongWritable key, Iterable<VectorWritable> values, Context context)
                throws IOException, InterruptedException {
            for (VectorWritable v : values) {
                context.write(key, v);
            }
        }
    }
}
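To see what the mapper does to a single input line without starting a Hadoop job, the parsing step can be tried in isolation. The following is a minimal sketch in plain Java: the class name LineToVectorSketch and the sample values are made up for illustration, and a double[] stands in for Mahout's Vector.

```java
import java.util.Arrays;

/** Stand-alone version of the mapper's parsing step; double[] replaces Mahout's Vector. */
final class LineToVectorSketch {

    /** Splits a line on runs of whitespace and parses every token as a double. */
    static double[] parse(String line) {
        // trim() is an addition for this sketch: without it, a line with leading
        // whitespace yields an empty first token and Double.parseDouble fails.
        String[] tokens = line.trim().split("\\s+");
        double[] values = new double[tokens.length];
        for (int i = 0; i < tokens.length; i++) {
            values[i] = Double.parseDouble(tokens[i]);
        }
        return values;
    }

    public static void main(String[] args) {
        // Made-up sample values, not taken from synthetic_control.data.
        System.out.println(Arrays.toString(parse("28.78  34.46 31.33")));
        // prints "[28.78, 34.46, 31.33]"
    }
}
```

The pattern "\\s+" matches the same input as the "\\s{1,}" used in the mapper; both mean one or more whitespace characters.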

Compile the class, export it as ClusteringUtils.jar, and copy the jar to /home/hadoop/mahout/mahout_jar.

When exporting from Eclipse, choose Export → Runnable JAR file → Extract required libraries into generated JAR.

Then execute:

hadoop jar /home/hadoop/mahout/mahout_jar/ClusteringUtils.jar mahout.fansy.utils.transform.Text2VectorWritable -i hdfs:///input/synthetic_control.data -o hdfs:///input/synthetic_control.data.transform

You may run into a class-not-found error for org/apache/mahout/common/AbstractJob. This generally means that the locations configured in HADOOP_CLASSPATH do not include the Mahout jars.
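One way to check whether a class is visible before running the whole job is a small probe like the one below. It is only a diagnostic sketch: ClasspathProbe is a hypothetical helper name, and the result is meaningful only if the probe is started with the same classpath that the Hadoop job uses.

```java
/** Hypothetical diagnostic helper: reports whether a class can be located. */
final class ClasspathProbe {

    /** Returns true if the named class can be found by this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String name = "org.apache.mahout.common.AbstractJob";
        System.out.println(name + (isOnClasspath(name) ? " found" : " NOT found"));
    }
}
```

If the probe reports that the class is not found, the Mahout jars are missing from the classpath it was started with.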

Workaround 1:

Copy the Mahout jar files into /home/hadoop/hadoop/lib and confirm that this directory is indeed in HADOOP_CLASSPATH:

cp /home/hadoop/mahout/*.jar /home/hadoop/hadoop/lib

Workaround 2 (Recommended):

Add the following to hadoop-env.sh:

for f in /home/hadoop/mahout/*.jar; do
  if [ "$HADOOP_CLASSPATH" ]; then
    export HADOOP_CLASSPATH=$HADOOP_CLASSPATH:$f
  else
    export HADOOP_CLASSPATH=$f
  fi
done

Remember to distribute hadoop-env.sh to the other nodes.

Restart the Hadoop environment:

stop-all.sh

start-all.sh

Run the conversion:

hadoop jar /home/hadoop/mahout/mahout_jar/ClusteringUtils.jar mahout.fansy.utils.transform.Text2VectorWritable -i hdfs:///input/synthetic_control.data -o hdfs:///input/synthetic_control.data.transform

If the main class was already specified when the jar was exported, the command above fails; use the following command instead:

hadoop jar /home/hadoop/mahout/mahout_jar/ClusteringUtils.jar -i hdfs:///input/synthetic_control.data -o hdfs:///input/synthetic_control.data.transform

The output is now a binary vector file and is no longer human-readable:

hdfs:///input/synthetic_control.data.transform/part-r-00000

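4. Run Canopy clustering

mahout canopy --input hdfs:///input/synthetic_control.data.transform/part-r-00000 --output /output/canopy --distanceMeasure org.apache.mahout.common.distance.EuclideanDistanceMeasure --t1 T1 --t2 T2 --t3 T3

(T1, T2, and T3 stand for the threshold values, which are not given here; choose values suited to your data, with T1 > T2.)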
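The distance measure passed to the command decides how each point's distance d to a canopy center is computed before it is compared with the thresholds. As a rough illustration of what a Euclidean measure computes, here is a plain-Java sketch; EuclideanSketch is a made-up name and the code is not Mahout's EuclideanDistanceMeasure class.

```java
/** Plain-Java illustration of Euclidean distance; not Mahout's implementation. */
final class EuclideanSketch {

    /** Square root of the sum of squared coordinate differences. */
    static double distance(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have the same length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double diff = a[i] - b[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    public static void main(String[] args) {
        double d = distance(new double[] {0.0, 0.0}, new double[] {3.0, 4.0});
        System.out.println(d); // prints "5.0"
    }
}
```

With thresholds T1 = 6 and T2 = 4, for example, a point at distance 5.0 from the center would join the canopy (5 < T1) but stay in the list of candidate centers (5 >= T2).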

5. Format conversion (vector → text)

Convert the result of step 4 to text.

Create the file ReadClusterWritable.java:

package mahout.fansy.utils;

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.clustering.iterator.ClusterWritable;
import org.apache.mahout.common.AbstractJob;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

/**
 * Reads cluster centers.
 * @author Fansy
 */
public class ReadClusterWritable extends AbstractJob {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new ReadClusterWritable(), args);
    }

    @Override
    public int run(String[] args) throws Exception {
        addInputOption();
        addOutputOption();
        if (parseArguments(args) == null) {
            return -1;
        }
        Job job = new Job(getConf(), getInputPath().toString());
        job.setInputFormatClass(SequenceFileInputFormat.class);
        job.setMapperClass(RM.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setNumReduceTasks(0); // map-only job
        job.setJarByClass(ReadClusterWritable.class);
        FileInputFormat.addInputPath(job, getInputPath());
        FileOutputFormat.setOutputPath(job, getOutputPath());
        if (!job.waitForCompletion(true)) {
            throw new InterruptedException("Canopy Job failed processing " + getInputPath());
        }
        return 0;
    }

    public static class RM extends Mapper<Text, ClusterWritable, Text, Text> {

        private Logger log = LoggerFactory.getLogger(RM.class);

        public void map(Text key, ClusterWritable value, Context context)
                throws IOException, InterruptedException {
            // Format the cluster center as a string
            String str = value.getValue().getCenter().asFormatString();
            System.out.println("center****************:" + str);
            log.info("center*****************************:" + str); // set log information
            context.write(key, new Text(str));
        }
    }
}

Package the class into ClusteringUtils.jar and upload the jar to /home/hadoop/mahout/mahout_jar.

If you need to clear the launch configuration information in Eclipse, go to /.metadata/.plugins/org.eclipse.debug.core/.launches in the folder where the project is located and delete the files inside.

Run:

hadoop jar ClusteringUtils.jar mahout.fansy.utils.ReadClusterWritable -i /output/canopy/clusters-0-final/part-r-00000 -o /output/canopy-output

If that does not succeed, run the following command instead:

hadoop jar ClusteringUtils.jar -i /output/canopy/clusters-0-final/part-r-00000 -o /output/canopy-output

At this point, /output/canopy-output/part-m-00000 contains the clustering result.


This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only.
