Test environment: Hadoop 2.4 + Mahout 1.0
The previous posts, "mahout Bayesian algorithm development ideas (expansion) 1" and "mahout Bayesian algorithm development ideas (expansion) 2", analyzed how the Bayesian algorithm in Mahout processes numeric data. Neither post showed how to classify raw data that has no labels. This post handles such data.
The latest source code and jar packages (for the Hadoop 2.4 + Mahout 1.0 environment) for Mahout Bayesian classification of unlabeled data can be downloaded here:
After downloading, you can use fz.bayes.model.BayesRunner in the jar package to build the Bayesian model; that step is not covered here. The following describes the idea behind classifying unlabeled data.
Input data:
0.2,0.3,0.4
0.32,0.43,0.45
0.23,0.33,0.54
2.4,2.5,2.6
2.3,2.2,2.1
5.4,7.2,7.2
5.6,7,6
5.8,7.1,6.3
6,6,5.4
11,12,13
Compared with the original data, this data lacks the last column, which held the label.
Classification main program:
package fz.bayes;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
import org.apache.hadoop.util.ToolRunner;
import org.apache.mahout.classifier.naivebayes.AbstractNaiveBayesClassifier;
import org.apache.mahout.classifier.naivebayes.BayesUtils;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.classifier.naivebayes.training.WeightsMapper;
import org.apache.mahout.common.AbstractJob;
import org.apache.mahout.common.HadoopUtil;
import org.apache.mahout.math.Vector;

/**
 * Job used for classification.
 * Classifies data of the form
 * [
 * 2.1,3.2,1.2
 * 2.1,3.2,1.3
 * ]
 * (that is, data without labels).
 * @author fansy
 */
public class BayesClassifiedJob extends AbstractJob {

    public static void main(String[] args) throws Exception {
        ToolRunner.run(new Configuration(), new BayesClassifiedJob(), args);
    }

    @Override
    public int run(String[] args) throws Exception {
        addInputOption();
        addOutputOption();
        addOption("model", "m", "The file where the Bayesian model is stored");
        addOption("labelindex", "labelindex", "The file where the label index is stored");
        addOption("labelnumber", "ln", "The number of labels");
        addOption("mapreduce", "mr", "Whether to use MapReduce: true to use it, anything else not to");
        addOption("SV", "SV", "The input vector splitter, default is comma", ",");
        if (parseArguments(args) == null) {
            return -1;
        }
        Configuration conf = getConf();
        Path input = getInputPath();
        Path output = getOutputPath();
        String labelNumber = getOption("labelnumber");
        String modelPath = getOption("model");
        String useMR = getOption("mapreduce");
        String sv = getOption("SV");
        String labelIndex = getOption("labelindex");
        int returnCode = -1;
        // equals() instead of endsWith(): null-safe when the option is omitted
        if ("true".equals(useMR)) {
            returnCode = useMRToClassify(conf, labelNumber, modelPath, input, output, sv, labelIndex);
        } else {
            returnCode = classify(conf, input, output, labelNumber, modelPath, sv, labelIndex);
        }
        return returnCode;
    }

    /**
     * Standalone version.
     */
    private int classify(Configuration conf, Path input, Path output, String labelNumber,
            String modelPath, String sv, String labelIndex) {
        try {
            // Read the model parameters
            NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), conf);
            AbstractNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);
            Map<Integer, String> labelMap = BayesUtils.readLabelIndex(conf, new Path(labelIndex));
            Path outputPath = new Path(output, "result");
            // Read the file line by line and write the classification results to another file
            FileSystem fs = FileSystem.get(input.toUri(), conf);
            FSDataInputStream in = fs.open(input);
            InputStreamReader istr = new InputStreamReader(in);
            BufferedReader br = new BufferedReader(istr);
            if (fs.exists(outputPath)) {
                fs.delete(outputPath, true);
            }
            FSDataOutputStream out = fs.create(outputPath);
            String lines;
            StringBuffer buff = new StringBuffer();
            while ((lines = br.readLine()) != null && !"".equals(lines)) {
                String[] line = lines.split(sv);
                if (line.length < 1) {
                    break;
                }
                Vector original = BayesUtil.transformToVector(line);
                Vector result = classifier.classifyFull(original);
                String label = BayesUtil.classifyVector(result, labelMap);
                buff.append(lines + sv + label + "\n");
                // out.writeUTF(lines + sv + label);
            }
            // Note: writeUTF prefixes the data with a two-byte length
            out.writeUTF(buff.substring(0, buff.length() - 1));
            out.flush();
            out.close();
            br.close();
            istr.close();
            in.close();
            // fs.close();
        } catch (Exception e) {
            e.printStackTrace();
            return -1;
        }
        return 0;
    }

    /**
     * MR version.
     */
    private int useMRToClassify(Configuration conf, String labelNumber, String modelPath,
            Path input, Path output, String sv, String labelIndex)
            throws IOException, ClassNotFoundException, InterruptedException {
        conf.set(WeightsMapper.class.getName() + ".numLabels", labelNumber);
        conf.set("SV", sv);
        conf.set("labelindex", labelIndex);
        HadoopUtil.cacheFiles(new Path(modelPath), conf);
        HadoopUtil.delete(conf, output);
        Job job = Job.getInstance(conf, "");
        job.setJobName("Use Bayesian model to classify the input:" + input.getName());
        job.setJarByClass(BayesClassifiedJob.class);

        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);

        job.setMapperClass(BayesClassifyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(Text.class);
        job.setNumReduceTasks(0); // map-only job
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(job, input);
        FileOutputFormat.setOutputPath(job, output);

        if (job.waitForCompletion(true)) {
            return 0;
        }
        return -1;
    }
}
If MapReduce is used, the Mapper is as follows:
package fz.bayes;

import java.io.IOException;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.mahout.classifier.naivebayes.AbstractNaiveBayesClassifier;
import org.apache.mahout.classifier.naivebayes.BayesUtils;
import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
import org.apache.mahout.math.Vector;

/**
 * Custom Mapper that outputs the current value and its classification result.
 * @author Administrator
 */
@SuppressWarnings("deprecation")
public class BayesClassifyMapper extends Mapper<LongWritable, Text, Text, Text> {
    private AbstractNaiveBayesClassifier classifier;
    private String sv;
    private Map<Integer, String> labelMap;
    private String labelIndex;

    @Override
    public void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        Path modelPath = new Path(DistributedCache.getCacheFiles(conf)[0].getPath());
        NaiveBayesModel model = NaiveBayesModel.materialize(modelPath, conf);
        classifier = new StandardNaiveBayesClassifier(model);
        sv = conf.get("SV");
        labelIndex = conf.get("labelindex");
        labelMap = BayesUtils.readLabelIndex(conf, new Path(labelIndex));
    }

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String values = value.toString();
        if ("".equals(values)) {
            context.getCounter("Records", "Bad Record").increment(1);
            return;
        }
        String[] line = values.split(sv);
        Vector original = BayesUtil.transformToVector(line);
        Vector result = classifier.classifyFull(original);
        String label = BayesUtil.classifyVector(result, labelMap);

        // The key is the input vector, the value is its label
        context.write(value, new Text(label));
    }
}
Utility class used:
package fz.bayes;

import java.util.Map;

import org.apache.mahout.classifier.ClassifierResult;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;

public class BayesUtil {

    /**
     * Convert the input strings to a vector.
     * @param line the fields of one record
     * @return the vector, or null if a field is not a number
     */
    public static Vector transformToVector(String[] line) {
        Vector v = new RandomAccessSparseVector(line.length);
        for (int i = 0; i < line.length; i++) {
            double item = 0;
            try {
                item = Double.parseDouble(line[i]);
            } catch (Exception e) {
                return null; // the field cannot be converted, so the input data is invalid
            }
            v.setQuick(i, item);
        }
        return v;
    }

    /**
     * Classify by score: pick the label with the highest score.
     * @param v the score vector
     * @param labelMap mapping from label index to label
     * @return the label, or null if no score was found
     */
    public static String classifyVector(Vector v, Map<Integer, String> labelMap) {
        int bestIdx = Integer.MIN_VALUE;
        double bestScore = Long.MIN_VALUE;
        for (Vector.Element element : v.all()) {
            if (element.get() > bestScore) {
                bestScore = element.get();
                bestIdx = element.index();
            }
        }
        if (bestIdx != Integer.MIN_VALUE) {
            ClassifierResult classifierResult = new ClassifierResult(labelMap.get(bestIdx), bestScore);
            return classifierResult.getLabel();
        }
        return null;
    }
}
The approach is as follows (refer to the standalone code or the Mapper code):
1. Read the model: given the model path, the label index file (labelindex.bin), and the number of labels (labelNumber), initialize the model-related variables from those paths;
2. For each record, such as 0.2,0.3,0.4, split the record on SV (the separator of the input vectors) and vectorize it to obtain the vector (0=0.2, 1=0.3, 2=0.4);
3. Use the model to compute a score for every label. The scores themselves form a vector with one entry per label: Vector result = classifier.classifyFull(original);
4. Based on the label scores, determine which label the record belongs to (the one with the highest score), and finally decode it: the labels are stored as encoded indexes, so the index has to be mapped back to the label.
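Steps 2 to 4 can be sketched in plain Java without any Mahout or Hadoop classes. This is only an illustration under stated assumptions: the class ClassifyStepsSketch, its method names, the score values, and the labels labelA/labelB/labelC are all hypothetical, and the scores are hard-coded where the real job would call classifier.classifyFull(original).

```java
import java.util.HashMap;
import java.util.Map;

/** Plain-Java sketch of steps 2-4; all names here are hypothetical. */
class ClassifyStepsSketch {

    /** Step 2: split a record on the separator and parse each field as a double. */
    static double[] toFeatures(String record, String separator) {
        String[] fields = record.split(separator);
        double[] features = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            features[i] = Double.parseDouble(fields[i].trim());
        }
        return features;
    }

    /** Step 4a: index of the highest score (first one on ties), or -1 if empty. */
    static int bestIndex(double[] scores) {
        if (scores.length == 0) {
            return -1;
        }
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return best;
    }

    /** Step 4b: decode the winning index back to its label; null if unmapped. */
    static String decode(double[] scores, Map<Integer, String> labelMap) {
        return labelMap.get(bestIndex(scores));
    }

    public static void main(String[] args) {
        double[] features = toFeatures("0.2,0.3,0.4", ",");

        // Step 3 stand-in: made-up scores, one per label index. In the real
        // job these come from classifier.classifyFull(original).
        double[] scores = {-3.1, -7.4, -9.8};

        // Hypothetical label index; the real one is read from the index file.
        Map<Integer, String> labelMap = new HashMap<>();
        labelMap.put(0, "labelA");
        labelMap.put(1, "labelB");
        labelMap.put(2, "labelC");

        System.out.println(features.length + " features -> " + decode(scores, labelMap));
    }
}
```

The decode step is a plain map lookup because the model only knows label indexes; the index file is what ties an index back to the original label string.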
The output result is as follows:
MR version:
Standalone version:
The first line of the standalone output starts with a few garbled characters. They are most likely the two-byte length prefix that out.writeUTF puts in front of the data, and they do not prevent reading the data with hadoop fs -cat.
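The effect of writeUTF can be reproduced with the JDK alone. The sketch below assumes that FSDataOutputStream behaves like its parent class java.io.DataOutputStream in this respect; the class WriteUtfPrefixSketch, its method names, and the sample row are hypothetical. It compares the bytes produced by writeUTF with those produced by writing the UTF-8 bytes directly.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Compares writeUTF with a plain byte write; names are hypothetical. */
class WriteUtfPrefixSketch {

    /** Bytes produced by writeUTF: a 2-byte length, then the encoded text. */
    static byte[] viaWriteUtf(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Bytes produced by writing the UTF-8 encoding directly: the text only. */
    static byte[] viaWrite(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String row = "0.2,0.3,0.4,labelA"; // made-up output row
        System.out.println("writeUTF bytes: " + viaWriteUtf(row).length);
        System.out.println("write bytes:    " + viaWrite(row).length);
    }
}
```

If the prefix is unwanted, writing the encoded bytes as in viaWrite is one way to avoid it. writeUTF also rejects strings whose encoded form exceeds 65535 bytes, which matters when the whole result is buffered into a single string.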
Share, grow, and be happy
When reprinting, please cite the blog address: http://blog.csdn.net/fansy1990