Hadoop learning notes (III): a complete example
1. Helper classes: GenericOptionsParser, Tool and ToolRunner
As described in the previous chapter, the GenericOptionsParser class interprets the common Hadoop command-line options and sets the corresponding values on a Configuration object as needed. GenericOptionsParser is usually not used directly; the more convenient approach is to implement the Tool interface and run the program through ToolRunner, which in turn calls GenericOptionsParser.
public interface Tool extends Configurable {
    int run(String[] args) throws Exception;
}
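To see what ToolRunner and GenericOptionsParser actually do with the command line, here is a minimal sketch (the class name MinimalTool and the sample invocation are my own illustration, not from the original program): generic options such as -D key=value are consumed by GenericOptionsParser and applied to the Configuration, and only the remaining arguments reach run().

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

public class MinimalTool extends Configured implements Tool {
    @Override
    public int run(String[] args) throws Exception {
        // Generic options (e.g. -D mapred.reduce.tasks=5) have already been
        // parsed by GenericOptionsParser and applied to this Configuration.
        Configuration conf = getConf();
        System.out.println("mapred.reduce.tasks = " + conf.get("mapred.reduce.tasks"));
        System.out.println("remaining args: " + java.util.Arrays.toString(args));
        return 0;
    }

    public static void main(String[] args) throws Exception {
        // e.g. hadoop jar mytool.jar MinimalTool -D mapred.reduce.tasks=5 20120801
        System.exit(ToolRunner.run(new Configuration(), new MinimalTool(), args));
    }
}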
Here is a simple implementation of Tool, which I use to process the daily log data:
import java.util.Calendar;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;
import org.apache.log4j.Logger;

public class Load extends Configured implements Tool {
    private final static Logger logger = Logger.getLogger(Load.class);

    /** @param args */
    public static void main(String[] args) throws Exception {
        int res = ToolRunner.run(new Configuration(), new Load(), args);
        System.exit(res);
    }

    @Override
    public int run(String[] args) throws Exception {
        // The date to process: either args[0] in yyyyMMdd format, or yesterday.
        java.util.Date date;
        if (args.length >= 1) {
            String st = args[0];
            java.text.SimpleDateFormat sdf = new java.text.SimpleDateFormat("yyyyMMdd");
            try {
                date = sdf.parse(st);
            } catch (java.text.ParseException ex) {
                throw new RuntimeException("input format error, " + st);
            }
        } else {
            Calendar cal = Calendar.getInstance();
            cal.add(Calendar.DAY_OF_MONTH, -1);
            date = cal.getTime();
        }

        Job job = new Job(this.getConf(), "load_" + new java.text.SimpleDateFormat("MMdd").format(date));
        job.setJarByClass(Load.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setMapperClass(LoadMapper.class);
        job.setNumReduceTasks(0);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(SearchLogProtobufWritable.class);
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);

        // Input directory on HDFS.
        java.text.SimpleDateFormat sdf = new java.text.SimpleDateFormat("yyyyMMdd");
        String inputDir = "/log/raw/" + sdf.format(date) + "/*access.log*.gz";
        logger.info("inputDir = " + inputDir);
        /* Local-file variant:
        java.text.SimpleDateFormat sdf = new java.text.SimpleDateFormat("yyyy/yyyyMM/yyyyMMdd");
        String inputDir = "file:///opt/log/" + sdf.format(date) + "/*access.log*.gz";
        logger.info("inputDir = " + inputDir);
        */

        final String outFileName = "/log/" + new java.text.SimpleDateFormat("yyyyMMdd").format(date) + "/access";
        logger.info("outfile dir = " + outFileName);
        Path outFile = new Path(outFileName);
        FileSystem.get(job.getConfiguration()).delete(outFile, true);

        FileInputFormat.addInputPath(job, new Path(inputDir));
        FileOutputFormat.setOutputPath(job, outFile);
        job.waitForCompletion(true);
        return 0;
    }
}
The code above is the job driver. It uses the public constructor Job(Configuration conf, String jobName). The classes involved in this constructor are related as follows:
JobContext provides read-only properties and has two members, a JobConf and a JobID. Apart from the job ID, which is stored in the JobID member, all the read-only properties it exposes are read from the JobConf. They include the following (a short sketch that reads them back follows the list):
1. mapred.reduce.tasks, default value 1
2. mapred.working.dir, the working directory of the file system
3. mapred.job.name, the job name set by the user
4. mapreduce.map.class
5. mapreduce.inputformat.class
6. mapreduce.combine.class
7. mapreduce.reduce.class
8. mapreduce.outputformat.class
9. mapreduce.partitioner.class
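These properties are exposed as read-only getters on JobContext, which Job (and the Mapper/Reducer Context objects) build on. A minimal sketch that reads them back; the class name JobContextDemo and the "demo" job name are my own illustration:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class JobContextDemo {
    public static void main(String[] args) throws Exception {
        // Job exposes the same read-only JobContext getters.
        Job job = new Job(new Configuration(), "demo");

        System.out.println(job.getNumReduceTasks());    // mapred.reduce.tasks, default 1
        System.out.println(job.getWorkingDirectory());  // mapred.working.dir
        System.out.println(job.getJobName());           // mapred.job.name
        System.out.println(job.getMapperClass());       // mapreduce.map.class
        System.out.println(job.getInputFormatClass());  // mapreduce.inputformat.class
        System.out.println(job.getCombinerClass());     // mapreduce.combine.class (null if unset)
        System.out.println(job.getReducerClass());      // mapreduce.reduce.class
        System.out.println(job.getOutputFormatClass()); // mapreduce.outputformat.class
        System.out.println(job.getPartitionerClass());  // mapreduce.partitioner.class
    }
}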
Job: "The job submitter's view of the Job. It allows the user to configure the job, submit it, control its execution, and query the state. The set methods only work until the job is submitted, afterwards they will throw an IllegalStateException." In other words, it can configure a job, submit it, control its execution and query its status. It has two members: a JobClient and a RunningJob.
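A minimal sketch of that life cycle, assuming a Job that has already been fully configured as in the driver above (the helper name runAndWatch is my own):

import org.apache.hadoop.mapreduce.Job;

public class SubmitAndPoll {
    /** Submit asynchronously, then poll until the job finishes. */
    static boolean runAndWatch(Job job) throws Exception {
        job.submit();                      // returns immediately; the job now runs on the cluster
        while (!job.isComplete()) {        // query the state
            System.out.printf("map %.0f%% reduce %.0f%%%n",
                    job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(5000);
        }
        // Any further set method, e.g. job.setNumReduceTasks(2), would now
        // throw an IllegalStateException because the job has been submitted.
        return job.isSuccessful();
    }
}

The waitForCompletion(true) call used in the driver does essentially the same submit-and-poll loop internally, printing progress as it goes.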
Now let's look at the Mapper:
Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT>
This is the Mapper class; it defines a nested Context class that is passed to setup(), map() and cleanup().
import java.io.IOException;
import java.text.SimpleDateFormat;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.log4j.Logger;

public class MrLoadMapper extends Mapper<LongWritable, Text, LongWritable, SearchLogProtobufWritable> {
    private final static Logger logger = Logger.getLogger(MrLoadMapper.class);
    private long id = 0;
    private final SimpleDateFormat sdf = new SimpleDateFormat("-dd/MMM/yyyy:HH:mm:ss Z");
    private final LongWritable outputKey = new LongWritable();
    private final SearchLogProtobufWritable outputValue = new SearchLogProtobufWritable();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // Use the task id as the high bits of the record id so ids are unique across tasks.
        int taskId = context.getTaskAttemptID().getTaskID().getId();
        this.id = (long) taskId << 40;
    }

    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        String line = value.toString();
        final SearchLog.Builder builder;
        SearchLog msg = null;
        try {
            String[] cols = line.split("\t");
            if (cols.length != 9) {
                context.getCounter("bad", "colsMissMatch").increment(1);
                logger.info("Bad row at offset " + key.get() + ", content is [" + line
                        + "], the number of columns does not match");
                return;
            }
            String remote_addr = cols[0];
            String time_local = cols[1]; // TODO: set time
            String remote_user = cols[2];
            String url = cols[3];
            String status = cols[4];
            String body_bytes_sent = cols[5];
            String http_user_agent = cols[6];
            String dummy2 = cols[7];
            String dummy3 = cols[8];
            builder = SearchLog.newBuilder()
                    .setRemoteAddr(remote_addr)
                    .setStatus(Integer.valueOf(status))
                    .setId(++this.id)
                    .setTime(this.sdf.parse(time_local).getTime())
                    .setBodyBytesSent(Integer.valueOf(body_bytes_sent))
                    .setUa(http_user_agent);
            com.sunchangming.searchlog.LoadPlainLog.parseUrl(url, builder);
            msg = builder.build();
        } catch (LocationCannotNull ex) {
            context.getCounter("bad", "lo is blank").increment(1);
            context.getCounter("bad", "parseError").increment(1);
            logger.info("Error while parsing the line: lo is empty, content is [" + line + "]");
        } catch (URISchemeIsLocal ex) {
            context.getCounter("bad", "uriScheme is file").increment(1);
            context.getCounter("bad", "parseError").increment(1);
        } catch (URISchemeError ex) {
            context.getCounter("bad", "uriSchemeError").increment(1);
            context.getCounter("bad", "parseError").increment(1);
        } catch (Exception ex) {
            context.getCounter("bad", "parseError").increment(1);
            logger.info("Error while parsing the line, content is [" + line
                    + "], error is [" + ex.toString() + "]");
        }
        if (msg != null) {
            this.outputKey.set(this.id);
            this.outputValue.set(msg);
            context.write(this.outputKey, this.outputValue);
        }
    }
}
The whole program uses only a Mapper (no reducer) to convert the raw logs and write them to HDFS. Next, let's see how to process that data with Pig.
2. Pig Introduction
http://pig.apache.org/docs/r0.10.0/start.html
This link is the Pig tutorial.
Loading data
Use the LOAD operator and the load/store functions to read data into Pig (PigStorage is the default load function).
Working with data
Pig allows you to transform data in different ways. As a starting point, become familiar with these operators:
Use the FILTER operator to work with tuples or rows of data. Use the FOREACH operator to work with columns of data.
Use the GROUP operator to group data in a single relation. Use the COGROUP, inner JOIN, and outer JOIN operators to group or join data in two or more relations.
Use the UNION operator to merge the contents of two or more relations. Use the SPLIT operator to partition the contents of a relation into multiple relations.
Storing intermediate results
Pig stores the intermediate data generated between MapReduce jobs in a temporary location on HDFS. This location must already exist on HDFS prior to use. It can be configured using the pig.temp.dir property; the property's default value is "/tmp", which is the same as the hardcoded location in Pig 0.7.0 and earlier versions.
Storing final results
Use the STORE operator and the load/store functions to write results to the file system (PigStorage is the default store function).
Compared with raw MapReduce, Pig provides richer data structures and more powerful data transformation operations.
Here is an example of how to process the data produced by the Mapper above:
REGISTER /home/app_admin/scripts/piglib/*.jar
/* the SearchLog class lives in this jar */
REGISTER /home/app_admin/scripts/loadplainlog.jar

rmf /result/keyword/$logdate

a = LOAD '/log/$logdate/access/part-m-*' USING com.searchlog.AccessLogPigLoader();

/* Filter: the host of the current page is www.****.com, the keyword is not empty,
   it is the first page, there are direct-zone results or ordinary search results,
   and the URL of the current page is the specific path. */
a_all = FILTER a BY location.host == 'www.****.com' AND keyword IS NOT NULL AND keyword != '' AND pageno == 1;

/* Group by keyword */
b_all = GROUP a_all BY keyword;

/* Compute the count for each group */
c_all = FOREACH b_all GENERATE group, COUNT(a_all.id) AS keywordsearchcount, MAX(a_all.vs) AS vs;

/* Sort the results by number of searches */
d_all = ORDER c_all BY keywordsearchcount DESC;
result = FOREACH d_all GENERATE group, keywordsearchcount, vs;

/* Save the result to a file */
STORE result INTO '/result/keyword/$logdate/' USING PigStorage();