Mahout Bayesian algorithm expansion 3 --- classification of unlabeled data

Last Update:2014-07-20 Source: Internet

Author: User



Code test environment: Hadoop 2.4 + Mahout 1.0

Previous posts: "Mahout Bayesian algorithm development ideas (expansion) 1" and "Mahout Bayesian algorithm development ideas (expansion) 2" analyzed how the Bayesian algorithm in Mahout handles numeric data. Neither post showed how to classify raw data that has no labels. This post handles such data.

The latest source code and jar packages (for the Hadoop 2.4 + Mahout 1.0 environment) for Mahout Bayesian classification of unlabeled data can be downloaded here:

After downloading, see fz.bayes.model.BayesRunner in the jar package for how to invoke the algorithm that builds the Bayesian model; it is not covered here. The following describes the idea behind classifying unlabeled data.

Input data:

    0.2,0.3,0.4
    0.32,0.43,0.45
    0.23,0.33,0.54
    2.4,2.5,2.6
    2.3,2.2,2.1
    5.4,7.2,7.2
    5.6,7,6
    5.8,7.1,6.3
    6,6,5.4
    11,12,13

Compared with the original data, this data lacks the label in the last column.

Classification main program:

    package fz.bayes;

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;
    import org.apache.hadoop.util.ToolRunner;
    import org.apache.mahout.classifier.naivebayes.AbstractNaiveBayesClassifier;
    import org.apache.mahout.classifier.naivebayes.BayesUtils;
    import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
    import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
    import org.apache.mahout.classifier.naivebayes.training.WeightsMapper;
    import org.apache.mahout.common.AbstractJob;
    import org.apache.mahout.common.HadoopUtil;
    import org.apache.mahout.math.Vector;

    /**
     * Job used for classification.
     * Classifies data of the form
     * [
     * 2.1,3.2,1.2
     * 2.1,3.2,1.3
     * ]
     * (that is, data without labels).
     * @author fansy
     */
    public class BayesClassifiedJob extends AbstractJob {

        /**
         * @param args
         * @throws Exception
         */
        public static void main(String[] args) throws Exception {
            ToolRunner.run(new Configuration(), new BayesClassifiedJob(), args);
        }

        @Override
        public int run(String[] args) throws Exception {
            addInputOption();
            addOutputOption();
            addOption("model", "m", "The file where the Bayesian model is stored");
            addOption("labelIndex", "labelIndex", "The file where the label index is stored");
            addOption("labelNumber", "ln", "The number of labels");
            addOption("mapreduce", "mr", "Whether to use MapReduce: true to use it, anything else not to");
            addOption("SV", "SV", "The input vector separator, default is comma", ",");
            if (parseArguments(args) == null) {
                return -1;
            }
            Configuration conf = getConf();
            Path input = getInputPath();
            Path output = getOutputPath();
            String labelNumber = getOption("labelNumber");
            String modelPath = getOption("model");
            String useMR = getOption("mapreduce");
            String SV = getOption("SV");
            String labelIndex = getOption("labelIndex");
            int returnCode = -1;
            // Any value other than "true" (including a missing option) runs the standalone version.
            if ("true".equals(useMR)) {
                returnCode = useMRToClassify(conf, labelNumber, modelPath, input, output, SV, labelIndex);
            } else {
                returnCode = classify(conf, input, output, labelNumber, modelPath, SV, labelIndex);
            }
            return returnCode;
        }

        /**
         * Standalone version.
         *
         * @param conf
         * @param input
         * @param output
         * @param labelNumber
         * @param modelPath
         * @param SV
         * @param labelIndex
         * @return 0 on success, -1 on failure
         */
        private int classify(Configuration conf, Path input, Path output, String labelNumber,
                String modelPath, String SV, String labelIndex) {
            // Read the model parameters.
            try {
                NaiveBayesModel model = NaiveBayesModel.materialize(new Path(modelPath), conf);
                AbstractNaiveBayesClassifier classifier = new StandardNaiveBayesClassifier(model);
                Map<Integer, String> labelMap = BayesUtils.readLabelIndex(conf, new Path(labelIndex));
                Path outputPath = new Path(output, "result");
                // Read the file line by line and write the classification results to another file.
                FileSystem fs = FileSystem.get(input.toUri(), conf);
                FSDataInputStream in = fs.open(input);
                InputStreamReader istr = new InputStreamReader(in);
                BufferedReader br = new BufferedReader(istr);
                if (fs.exists(outputPath)) {
                    fs.delete(outputPath, true);
                }
                FSDataOutputStream out = fs.create(outputPath);
                String lines;
                StringBuffer buff = new StringBuffer();
                while ((lines = br.readLine()) != null && !"".equals(lines)) {
                    String[] line = lines.split(SV);
                    if (line.length < 1) {
                        break;
                    }
                    Vector original = BayesUtil.transformToVector(line);
                    Vector result = classifier.classifyFull(original);
                    String label = BayesUtil.classifyVector(result, labelMap);
                    buff.append(lines + SV + label + "\n");
                }
                // writeUTF puts a 2-byte length in front of the text.
                out.writeUTF(buff.substring(0, buff.length() - 1));
                out.flush();
                out.close();
                br.close();
                istr.close();
                in.close();
            } catch (Exception e) {
                e.printStackTrace();
                return -1;
            }
            return 0;
        }

        /**
         * MapReduce version.
         *
         * @param conf
         * @param labelNumber
         * @param modelPath
         * @param input
         * @param output
         * @param SV
         * @param labelIndex
         * @return 0 on success, -1 on failure
         * @throws IOException
         * @throws ClassNotFoundException
         * @throws InterruptedException
         */
        private int useMRToClassify(Configuration conf, String labelNumber, String modelPath,
                Path input, Path output, String SV, String labelIndex)
                throws IOException, ClassNotFoundException, InterruptedException {
            conf.set(WeightsMapper.class.getName() + ".numLabels", labelNumber);
            conf.set("SV", SV);
            conf.set("labelIndex", labelIndex);
            HadoopUtil.cacheFiles(new Path(modelPath), conf);
            HadoopUtil.delete(conf, output);
            Job job = Job.getInstance(conf, "");
            job.setJobName("Use Bayesian model to classify the input:" + input.getName());
            job.setJarByClass(BayesClassifiedJob.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            job.setMapperClass(BayesClassifyMapper.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            job.setNumReduceTasks(0); // map-only job
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            FileInputFormat.setInputPaths(job, input);
            FileOutputFormat.setOutputPath(job, output);
            if (job.waitForCompletion(true)) {
                return 0;
            }
            return -1;
        }
    }

If MapReduce is used, the Mapper is as follows:

    package fz.bayes;

    import java.io.IOException;
    import java.util.Map;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.mahout.classifier.naivebayes.AbstractNaiveBayesClassifier;
    import org.apache.mahout.classifier.naivebayes.BayesUtils;
    import org.apache.mahout.classifier.naivebayes.NaiveBayesModel;
    import org.apache.mahout.classifier.naivebayes.StandardNaiveBayesClassifier;
    import org.apache.mahout.math.Vector;

    /**
     * Custom Mapper: outputs the current value and its classification result.
     * @author Administrator
     */
    @SuppressWarnings("deprecation")
    public class BayesClassifyMapper extends Mapper<LongWritable, Text, Text, Text> {

        private AbstractNaiveBayesClassifier classifier;
        private String SV;
        private Map<Integer, String> labelMap;
        private String labelIndex;

        @Override
        public void setup(Context context) throws IOException, InterruptedException {
            Configuration conf = context.getConfiguration();
            // The model was placed in the distributed cache by the job driver.
            Path modelPath = new Path(DistributedCache.getCacheFiles(conf)[0].getPath());
            NaiveBayesModel model = NaiveBayesModel.materialize(modelPath, conf);
            classifier = new StandardNaiveBayesClassifier(model);
            SV = conf.get("SV");
            labelIndex = conf.get("labelIndex");
            labelMap = BayesUtils.readLabelIndex(conf, new Path(labelIndex));
        }

        @Override
        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String values = value.toString();
            if ("".equals(values)) {
                context.getCounter("Records", "Bad Record").increment(1);
                return;
            }
            String[] line = values.split(SV);
            Vector original = BayesUtil.transformToVector(line);
            Vector result = classifier.classifyFull(original);
            String label = BayesUtil.classifyVector(result, labelMap);
            // The key is the input vector; the value is its label.
            context.write(value, new Text(label));
        }
    }

Utility class used:

    package fz.bayes;

    import java.util.Map;

    import org.apache.mahout.classifier.ClassifierResult;
    import org.apache.mahout.math.RandomAccessSparseVector;
    import org.apache.mahout.math.Vector;

    public class BayesUtil {

        /**
         * Convert the input strings to a Vector.
         * @param line the fields of one record
         * @return the vector, or null if a field is not numeric
         */
        public static Vector transformToVector(String[] line) {
            Vector v = new RandomAccessSparseVector(line.length);
            for (int i = 0; i < line.length; i++) {
                double item = 0;
                try {
                    item = Double.parseDouble(line[i]);
                } catch (Exception e) {
                    return null; // If a field cannot be converted, the input data is invalid.
                }
                v.setQuick(i, item);
            }
            return v;
        }

        /**
         * Classify by score: pick the label with the highest score.
         *
         * @param v the score vector, one entry per label index
         * @param labelMap maps a label index to the label name
         * @return the label name, or null if the score vector is empty
         */
        public static String classifyVector(Vector v, Map<Integer, String> labelMap) {
            int bestIdx = Integer.MIN_VALUE;
            double bestScore = Long.MIN_VALUE;
            for (Vector.Element element : v.all()) {
                if (element.get() > bestScore) {
                    bestScore = element.get();
                    bestIdx = element.index();
                }
            }
            if (bestIdx != Integer.MIN_VALUE) {
                ClassifierResult classifierResult = new ClassifierResult(labelMap.get(bestIdx), bestScore);
                return classifierResult.getLabel();
            }
            return null;
        }
    }
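
Note that transformToVector returns null when a field cannot be parsed, and the callers in this post pass its result straight to classifyFull. The following is a minimal sketch, in plain Java without Mahout, of a parser that makes the failure explicit so that a caller could skip or count bad records. The class and method names (RecordParser, parse) are hypothetical and are not part of the original jar.

```java
import java.util.Optional;

/** Hypothetical helper for illustration; not part of the original jar. */
class RecordParser {

    /**
     * Splits one input line on the separator and parses every field as a double.
     * Returns Optional.empty() instead of null when the line is blank or a field is not numeric.
     */
    static Optional<double[]> parse(String line, String separator) {
        if (line == null || line.trim().isEmpty()) {
            return Optional.empty();
        }
        String[] fields = line.split(separator);
        double[] values = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            try {
                values[i] = Double.parseDouble(fields[i].trim());
            } catch (NumberFormatException e) {
                return Optional.empty(); // malformed record
            }
        }
        return Optional.of(values);
    }

    public static void main(String[] args) {
        System.out.println(parse("0.2,0.3,0.4", ",").isPresent()); // well-formed record
        System.out.println(parse("0.2,abc,0.4", ",").isPresent()); // malformed record
    }
}
```

With a helper like this, the standalone loop or the Mapper could check isPresent() before classifying and increment a counter for records that fail to parse.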

The approach is as follows (refer to the standalone code or the Mapper code):

1. Read the parameters: the model path, the label index file (labelindex.bin), and the number of labels (labelNumber); then initialize the model-related variables from those paths;

2. Each record, for example 0.2,0.3,0.4, is split on SV (the separator of the input vectors) and vectorized to obtain the vector (0=0.2, 1=0.3, 2=0.4);

3. Use the model to compute a score for each label. The scores also form a vector with one entry per label: Vector result = classifier.classifyFull(original), that is, the result vector;

4. Based on the label scores, determine the label the record belongs to, and finally decode it (labels are stored as encoded indices, so the index must be mapped back to the label name).
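
Steps 2 to 4 can be sketched in plain Java without Mahout. In the sketch below, the scoring function is a stand-in (negative squared distance to a hypothetical per-label centre), not the naive Bayes scoring that classifyFull performs, and the label map and centres are made-up values. Only the flow of split, score, arg-max, and decode mirrors the code in this post.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch only; class name, centres and labels are hypothetical. */
class ClassifyFlowSketch {

    /** Step 2: split a record on the separator and parse it into a dense vector. */
    static double[] toVector(String record, String sv) {
        String[] fields = record.split(sv);
        double[] v = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            v[i] = Double.parseDouble(fields[i]);
        }
        return v;
    }

    /** Step 3 stand-in: one score per label index (higher is better). */
    static double[] scoreAll(double[] v, double[][] centres) {
        double[] scores = new double[centres.length];
        for (int label = 0; label < centres.length; label++) {
            double sum = 0;
            for (int i = 0; i < v.length; i++) {
                double d = v[i] - centres[label][i];
                sum += d * d;
            }
            scores[label] = -sum;
        }
        return scores;
    }

    /** Step 4: pick the index with the highest score and decode it to a label name. */
    static String decode(double[] scores, Map<Integer, String> labelMap) {
        int best = 0;
        for (int i = 1; i < scores.length; i++) {
            if (scores[i] > scores[best]) {
                best = i;
            }
        }
        return labelMap.get(best);
    }

    public static void main(String[] args) {
        Map<Integer, String> labelMap = new LinkedHashMap<>();
        labelMap.put(0, "A");
        labelMap.put(1, "B");
        double[][] centres = { {0.25, 0.35, 0.45}, {2.35, 2.35, 2.35} };
        String sv = ",";
        for (String record : new String[] {"0.2,0.3,0.4", "2.4,2.5,2.6"}) {
            String label = decode(scoreAll(toVector(record, sv), centres), labelMap);
            // Same output shape as the job: the original record, the separator, then the label.
            System.out.println(record + sv + label);
        }
    }
}
```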

The output result is as follows:

MapReduce version:

Standalone version:

In the standalone version, the first line of output starts with a few garbled characters; this does not affect reading the data with hadoop fs -cat.
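
A likely cause of those leading characters is the call to writeUTF in the standalone code: java.io.DataOutputStream.writeUTF, which FSDataOutputStream inherits, writes a two-byte length before the encoded text. The sketch below reproduces this without Hadoop by writing the same text once with writeUTF and once as raw UTF-8 bytes. The class name and sample text are made up, and switching the job to raw bytes is a suggestion, not something the original code does.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

/** Illustrative sketch; the sample text is made up. */
class WriteUtfPrefixDemo {

    /** Bytes produced by writeUTF: a 2-byte length followed by the encoded text. */
    static byte[] viaWriteUtf(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(text);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    /** Bytes produced by writing the UTF-8 encoding directly: no length prefix. */
    static byte[] viaRawBytes(String text) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    public static void main(String[] args) {
        String text = "0.2,0.3,0.4,1\n2.4,2.5,2.6,2";
        System.out.println("writeUTF length:  " + viaWriteUtf(text).length);
        System.out.println("raw bytes length: " + viaRawBytes(text).length);
    }
}
```

writeUTF also rejects strings whose encoded form exceeds 65535 bytes, so writing the results as raw bytes (or line by line through a Writer) may be the safer choice for larger inputs.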

Share, grow, and be happy

When reprinting, please cite the blog address: http://blog.csdn.net/fansy1990

