Mahout Bayesian Algorithm Development, Part 3: Classifying Data Without Labels

Test environment: Hadoop 2.4 + Mahout 1.0

The previous posts, "Mahout Bayesian Algorithm Development Ideas (Expansion) 1" and "Mahout Bayesian Algorithm Development Ideas (Expansion) 2", analyzed how the Bayesian algorithm in Mahout handles numerical data. Neither post covered how to classify raw data that has no labels.

This post deals with that kind of data.

The latest source code and jar package (for the Hadoop 2.4 + Mahout 1.0 environment) can be downloaded here: "Mahout Bayesian classification without tag data".

After downloading, test by calling fz.bayes.model.BayesRunner from the jar to build the Bayesian model; that step is not covered here. What follows is the approach to classifying unlabeled data.


Input data:

0.2,0.3,0.4
0.32,0.43,0.45
0.23,0.33,0.54
2.4,2.5,2.6
2.3,2.2,2.1
5.4,7.2,7.2
5.6,7,6
5.8,7.1,6.3
6,6,5.4
11,12,13

Compared with the original training data, each record here lacks the trailing label column.

Classification main program:

package fz.bayes;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.classifier.naivebayes.AbstractNaiveBayesClassifier;
import org.apache.mahout.classifier.naivebayes.BayesUtils;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.classifier.naivebayes.training.WeightsMapper;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.math.Vector;

/**
 * Job for classifying data of the form
 * [
 * 2.1,3.2,1.2
 * 2.1,3.2,1.3
 * ]
 * i.e. data without labels.
 * @author fansy
 */
public class BayesClassifiedJob extends AbstractJob {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new BayesClassifiedJob(), args);
    }

    @Override
    public int run(String[] args) throws Exception {
        addInputOption();
        addOutputOption();
        addOption("model", "m", "The file where the Bayesian model is stored");
        addOption("labelIndex", "labelIndex", "The file where the label index is stored");
        addOption("labelNumber", "ln", "The number of labels");
        addOption("mapreduce", "mr", "Whether to use MapReduce: true to use it, otherwise not");
        addOption("SV", "SV", "The input vector splitter, default is comma", ",");
        if (parseArguments(args) == null) {
            return -1;
        }
        Configuration conf = getConf();
        Path input = getInputPath();
        Path output = getOutputPath();
        String labelNumber = getOption("labelNumber");
        String modelPath = getOption("model");
        String useMR = getOption("mapreduce");
        String sv = getOption("SV");
        String labelIndex = getOption("labelIndex");
        int returnCode = -1;
        // equals() instead of endsWith(): null-safe, and an empty value no longer selects MR
        if ("true".equals(useMR)) {
            returnCode = useMRToClassify(conf, labelNumber, modelPath, input, output, sv, labelIndex);
        } else {
            returnCode = classify(conf, input, output, labelNumber, modelPath, sv, labelIndex);
        }
        return returnCode;
    }

    /**
     * Standalone version
     */
    private int classify(Configuration conf, Path input, Path output, String labelNumber,
            String modelPath, String sv, String labelIndex) {
        try {
            // Read the model parameters
            NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), conf);
            AbstractNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);
            Map<Integer, String> labelMap = BayesUtils.readLabelIndex(conf, new Path(labelIndex));
            Path outputPath = new Path(output, "result");
            // Read the input file line by line and write the classification results to another file
            FileSystem fs = FileSystem.get(input.toUri(), conf);
            FSDataInputStream in = fs.open(input);
            InputStreamReader istr = new InputStreamReader(in);
            BufferedReader br = new BufferedReader(istr);
            if (fs.exists(outputPath)) {
                fs.delete(outputPath, true);
            }
            FSDataOutputStream out = fs.create(outputPath);
            String lines;
            StringBuffer buff = new StringBuffer();
            while ((lines = br.readLine()) != null && !"".equals(lines)) {
                String[] line = lines.split(sv);
                if (line.length < 1) {
                    break;
                }
                Vector original = BayesUtil.transformToVector(line);
                Vector result = classifier.classifyFull(original);
                String label = BayesUtil.classifyVector(result, labelMap);
                buff.append(lines + sv + label + "\n");
                // out.writeUTF(lines + sv + label);
            }
            // Drop the trailing newline before writing
            out.writeUTF(buff.substring(0, buff.length() - 1));
            out.flush();
            out.close();
            br.close();
            istr.close();
            in.close();
            // fs.close();
        } catch (Exception e) {
            e.printStackTrace();
            return -1;
        }
        return 0;
    }

    /**
     * MR version
     */
    private int useMRToClassify(Configuration conf, String labelNumber, String modelPath,
            Path input, Path output, String sv, String labelIndex)
            throws IOException, ClassNotFoundException, InterruptedException {
        conf.set(WeightsMapper.class.getName() + ".numLabels", labelNumber);
        conf.set("SV", sv);
        conf.set("labelIndex", labelIndex);
        HadoopUtil.cacheFiles(new Path(modelPath), conf);
        HadoopUtil.delete(conf, output);
        Job job = Job.getInstance(conf, "");
        job.setJobName("Use bayesian model to classify the input:" + input.getName());
        job.setJarByClass(BayesClassifiedJob.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setMapperClass(BayesClassifyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setNumReduceTasks(0); // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, input);
        FileOutputFormat.setOutputPath(job, output);
        if (job.waitForCompletion(true)) {
            return 0;
        }
        return -1;
    }
}

If MR is used, the mapper is as follows:

package fz.bayes;

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.classifier.naivebayes.AbstractNaiveBayesClassifier;
import org.apache.mahout.classifier.naivebayes.BayesUtils;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.math.Vector;

/**
 * Custom mapper: outputs the current record and its classification result.
 * @author Administrator
 */
@SuppressWarnings("deprecation")
public class BayesClassifyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private AbstractNaiveBayesClassifier classifier;
    private String SV;
    private Map<Integer, String> labelMap;
    private String labelIndex;

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        // The model was placed in the distributed cache by the job driver
        Path modelPath = new Path(DistributedCache.getCacheFiles(conf)[0].getPath());
        NaiveBayesModel model = NaiveBayesModel.materialize(modelPath, conf);
        classifier = new StandardNaiveBayesClassifier(model);
        SV = conf.get("SV");
        labelIndex = conf.get("labelIndex");
        labelMap = BayesUtils.readLabelIndex(conf, new Path(labelIndex));
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String values = value.toString();
        if ("".equals(values)) {
            context.getCounter("Records", "Bad Record").increment(1);
            return;
        }
        String[] line = values.split(SV);
        Vector original = BayesUtil.transformToVector(line);
        Vector result = classifier.classifyFull(original);
        String label = BayesUtil.classifyVector(result, labelMap);
        // The output key is the input vector; the value is its label
        context.write(value, new Text(label));
    }
}


Utility class used:

package fz.bayes;

import java.util.Map;

import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class BayesUtil {

    /**
     * Convert the input strings to a vector
     * @param line the fields of one record
     * @return the vector, or null if a field cannot be parsed
     */
    public static Vector transformToVector(String[] line) {
        Vector v = new RandomAccessSparseVector(line.length);
        for (int i = 0; i < line.length; i++) {
            double item = 0;
            try {
                item = Double.parseDouble(line[i]);
            } catch (Exception e) {
                return null; // conversion failed, so the input data has a problem
            }
            v.setQuick(i, item);
        }
        return v;
    }

    /**
     * Classify based on the score values
     * @param v the score of each label
     * @param labelMap label index to label name
     * @return the label with the highest score
     */
    public static String classifyVector(Vector v, Map<Integer, String> labelMap) {
        int bestIdx = Integer.MIN_VALUE;
        double bestScore = Long.MIN_VALUE;
        for (Vector.Element element : v.all()) {
            if (element.get() > bestScore) {
                bestScore = element.get();
                bestIdx = element.index();
            }
        }
        if (bestIdx != Integer.MIN_VALUE) {
            ClassifierResult classifierResult = new ClassifierResult(labelMap.get(bestIdx), bestScore);
            return classifierResult.getLabel();
        }
        return null;
    }
}

Here is a brief analysis of the approach (it is the same for the standalone code and the mapper code):
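The label selection in classifyVector is an arg-max over the score vector followed by a lookup in the label map. The following is a minimal sketch of that logic in plain Java, with no Mahout dependency, so it can be run on its own. The class name ArgMaxLabelSketch, the method bestLabel, the labels A to D and the score values are all hypothetical; it also starts from Double.NEGATIVE_INFINITY rather than Long.MIN_VALUE, which is a substitution and not what the utility class above does.

```java
import java.util.HashMap;
import java.util.Map;

class ArgMaxLabelSketch {

    /** Returns the label whose score is highest, or null if there are no scores. */
    static String bestLabel(double[] scores, Map<Integer, String> labelMap) {
        int bestIdx = -1;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (int i = 0; i < scores.length; i++) {
            if (scores[i] > bestScore) {
                bestScore = scores[i];
                bestIdx = i;
            }
        }
        // Decode the winning index back to its label name
        return bestIdx < 0 ? null : labelMap.get(bestIdx);
    }

    public static void main(String[] args) {
        // Hypothetical label index: integer code -> label name
        Map<Integer, String> labelMap = new HashMap<>();
        labelMap.put(0, "A");
        labelMap.put(1, "B");
        labelMap.put(2, "C");
        labelMap.put(3, "D");

        // Hypothetical scores, one per label; scores may all be negative
        double[] scores = {-12.5, -3.1, -40.2, -9.8};
        System.out.println(bestLabel(scores, labelMap)); // prints B
    }
}
```

Because the scores may all be negative, the starting value has to be below any possible score, which is why both versions start from a very small number instead of 0.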

1. Read the model. The parameters are the model path, the label encoding file (labelindex.bin) and the number of labels (labelNumber); the model-related variables are initialized from these paths;

2. For each record, for example 0.2,0.3,0.4, split it by SV (the delimiter of the input vectors) and vectorize it to obtain the vector (0=0.2,1=0.3,2=0.4);

3. Use the model to compute the score of each label. The result is also a vector, holding one score per label: Vector result = classifier.classifyFull(original); that is, the result vector;

4. Based on the scores, decide which label the record belongs to (the one with the highest score), and finally decode it (the labels were encoded as integers, so the index has to be mapped back to the label name).
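Step 2 can be sketched without Mahout as follows. This is only an illustration under stated assumptions: the class RecordParseSketch and its methods parse and describe are hypothetical names, and a plain double[] stands in for the Mahout Vector. One difference from the code above is deliberate: String.split treats its argument as a regular expression, so the sketch wraps the separator in Pattern.quote to keep separators such as | or . literal.

```java
import java.util.regex.Pattern;

class RecordParseSketch {

    /** Splits a record on a literal separator and parses each field; returns null on a bad field. */
    static double[] parse(String record, String sv) {
        // Pattern.quote makes the separator literal, since String.split expects a regex
        String[] fields = record.split(Pattern.quote(sv));
        double[] values = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            try {
                values[i] = Double.parseDouble(fields[i].trim());
            } catch (NumberFormatException e) {
                return null; // the record cannot be converted
            }
        }
        return values;
    }

    /** Renders the values as index=value pairs, mirroring the vector notation in step 2. */
    static String describe(double[] values) {
        StringBuilder sb = new StringBuilder("(");
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(i).append('=').append(values[i]);
        }
        return sb.append(')').toString();
    }

    public static void main(String[] args) {
        System.out.println(describe(parse("0.2,0.3,0.4", ","))); // (0=0.2,1=0.3,2=0.4)
        System.out.println(parse("0.2|0.3|0.4", "|").length);    // 3
        System.out.println(parse("0.2,abc,0.4", ",") == null);   // true
    }
}
```

A caller should check for the null return before scoring the record, for example by counting it as a bad record and skipping it.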

Now let's look at the output:

MR version:

Stand-alone version:

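As you can see, the first line of the standalone version's output starts with garbled characters. This is most likely the two-byte length prefix that writeUTF puts in front of the string, and it does not affect the result: the file can still be read without problems using hadoop fs -cat.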
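The garbled prefix can be reproduced with plain java.io, on the assumption that it comes from writeUTF (FSDataOutputStream extends DataOutputStream, which is where writeUTF is defined). The sketch below writes the same text once with writeUTF and once as raw UTF-8 bytes and compares the lengths; the class name WriteUtfPrefixSketch, the helper names and the sample line are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

class WriteUtfPrefixSketch {

    /** Bytes produced by writeUTF: a 2-byte length followed by the encoded text. */
    static byte[] viaWriteUTF(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Bytes produced by writing the UTF-8 encoding directly, with no prefix. */
    static byte[] viaWrite(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String line = "0.2,0.3,0.4,A"; // hypothetical output line, 13 ASCII characters
        System.out.println(viaWriteUTF(line).length); // 15: 2-byte prefix + 13 bytes
        System.out.println(viaWrite(line).length);    // 13
    }
}
```

If the prefix is unwanted, writing the encoded bytes with out.write(...) as in viaWrite avoids it. Note also that writeUTF rejects strings whose encoded form exceeds 65535 bytes, so buffering a large result and writing it in one call would fail.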


Share, grow, be happy

When reprinting, please cite the blog address: http://blog.csdn.net/fansy1990


