[Reading the Hadoop source code] [9] - The MapReduce job submission process

Source: Internet
Author: User

1. Using WordCount as the entry point

public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    // The Job object's main task is to collect the various job parameters
    Job job = new Job(conf, "word count");
    // the jar that contains WordCount.class will be uploaded with the job
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    // true: print the job's progress information while it runs; false: just wait until the job ends
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
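The TokenizerMapper and IntSumReducer referenced above are not part of the submission path, but for completeness they are the mapper and reducer of the standard WordCount example. A sketch, nested inside the WordCount class and assuming the usual org.apache.hadoop.io and org.apache.hadoop.mapreduce imports plus java.util.StringTokenizer:

public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  // emit (token, 1) for every token in the input line
  public void map(Object key, Text value, Context context)
      throws IOException, InterruptedException {
    StringTokenizer itr = new StringTokenizer(value.toString());
    while (itr.hasMoreTokens()) {
      word.set(itr.nextToken());
      context.write(word, one);
    }
  }
}

public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
  private IntWritable result = new IntWritable();

  // sum the counts for each word; also used as the combiner above
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int sum = 0;
    for (IntWritable val : values) {
      sum += val.get();
    }
    result.set(sum);
    context.write(key, result);
  }
}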
 
 

 

2. Job.waitForCompletion(true) mainly calls the submit() method:

 
public void submit() throws IOException, InterruptedException, ClassNotFoundException {
  ensureState(JobState.DEFINE);
  // Job.setMapperClass(Xxx.class) sets mapreduce.map.class (the new API), while
  // JobConf.setMapperClass(Xxx.class) sets mapred.mapper.class (the old API).
  // So the new API is selected whenever JobConf.setMapperClass has not been called.
  setUseNewAPI();
  // The actual submission; info is used to interact with the JobTracker,
  // e.g. to monitor or kill the submitted job.
  info = jobClient.submitJobInternal(conf);
  state = JobState.RUNNING;
}
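For reference, waitForCompletion itself is roughly the following in the 1.x Job class (paraphrased from the source, so details may differ slightly); the boolean argument only controls whether progress is printed while waiting:

public boolean waitForCompletion(boolean verbose)
    throws IOException, InterruptedException, ClassNotFoundException {
  if (state == JobState.DEFINE) {
    submit();                                  // the method shown above
  }
  if (verbose) {
    jobClient.monitorAndPrintJob(conf, info);  // poll the job and print progress and counters
  } else {
    info.waitForCompletion();                  // block silently until the job finishes
  }
  return isSuccessful();
}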

 

3. Before reading the JobClient.submitJobInternal(conf) function, first look at how the JobClient object is constructed:

3.1 Static initializer blocks first load mapred-site.xml and core-site.xml into the Configuration.
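Roughly, those static blocks look like this in the 1.x source (quoted from memory, so treat the exact resource names as approximate):

// In org.apache.hadoop.conf.Configuration:
static {
  addDefaultResource("core-default.xml");
  addDefaultResource("core-site.xml");
}

// In org.apache.hadoop.mapred.JobConf:
static {
  Configuration.addDefaultResource("mapred-default.xml");
  Configuration.addDefaultResource("mapred-site.xml");
}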

3.2 init Function

public void init(JobConf conf) throws IOException {
  // falls back to the XML configuration files loaded above; defaults to "local" if unset
  String tracker = conf.get("mapred.job.tracker", "local");
  if ("local".equals(tracker)) {
    conf.setNumMapTasks(1);
    // in local mode the job runs in a LocalJobRunner, which supports only one reducer
    this.jobSubmitClient = new LocalJobRunner(conf);
  } else {
    // create the RPC proxy to the JobTracker
    this.jobSubmitClient = createRPCProxy(JobTracker.getAddress(conf), conf);
  }
  // jobSubmitClient is one of the implementations of JobSubmissionProtocol
  // (currently there are two: JobTracker and LocalJobRunner)
}

3.3 The submitJobInternal function

public RunningJob submitJobInternal(JobConf job) {
  JobID jobId = jobSubmitClient.getNewJobId();
  // getSystemDir(): conf.get("mapred.system.dir", "/tmp/hadoop/mapred/system")
  Path submitJobDir = new Path(getSystemDir(), jobId.toString());
  Path submitJarFile = new Path(submitJobDir, "job.jar");
  Path submitSplitFile = new Path(submitJobDir, "job.split");
  /* 1. create the submitJobDir directory,
   * 2. put the jars, files and archives given on the command line into the distributed cache,
   * 3. upload the jar containing the main class as submitJarFile,
   * 4. set user and group (these permissions affect the HDFS file operations) and the current working directory
   */
  configureCommandLineOptions(job, submitJobDir, submitJarFile);
  Path submitJobFile = new Path(submitJobDir, "job.xml");
  int reduces = job.getNumReduceTasks();
  JobContext context = new JobContext(job, jobId);

  // Check whether the output directory already exists; if it does, fail the submission
  org.apache.hadoop.mapreduce.OutputFormat<?, ?> output =
      ReflectionUtils.newInstance(context.getOutputFormatClass(), job);
  output.checkOutputSpecs(context);

  // Create the splits for the job
  LOG.debug("Creating splits " + fs.makeQualified(submitSplitFile));
  int maps = writeNewSplits(context, submitSplitFile);     // determine the split information
  job.set("mapred.job.split.file", submitSplitFile.toString());
  job.setNumMapTasks(maps);                                 // determine the number of maps

  // Write all job parameters to job.xml on HDFS
  FSDataOutputStream out =
      FileSystem.create(fs, submitJobFile, new FsPermission(JOB_FILE_PERMISSION));
  job.writeXml(out);

  // The actual submission: the job is handed over to the JobTracker (or LocalJobRunner) for processing
  JobStatus status = jobSubmitClient.submitJob(jobId);
  if (status != null) {
    return new NetworkedJob(status);
  } else {
    throw new IOException("Could not launch job");
  }
}

 

 

 

The rest focuses on how the splits are determined. Taking FileInputFormat as an example, the main functions involved form the following call chain:

1.1 int writeNewSplits(JobContext job, Path submitSplitFile)

2.1 ----> List<InputSplit> getSplits(JobContext job)

3.1 ----> List<FileStatus> listStatus(JobContext job) // lists the input paths, skipping files whose names start with "_" or "." and applying the filter configured via mapred.input.pathfilter.class, to obtain the final file list

3.2 If a file is compressed, and therefore not splittable, the entire file becomes a single split.

3.3 If a file is splittable, the split size is computed first: mapred.min.split.size defaults to 1, mapred.max.split.size defaults to Long.MAX_VALUE, and the block size defaults to 64 MB. The split size is Math.max(minSize, Math.min(maxSize, blockSize)). From this formula: if maxSize is set larger than blockSize, each block becomes one split; otherwise a single block is cut into several splits. If the data remaining at the end of the file is smaller than the split size, it becomes an independent split (see the sketch after this list).

3.4 Save each split's path, length, start offset, and block location information into the split array.

2.2 Sort the splits by size, largest first, and serialize them to the split file.
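A minimal, runnable sketch of the split computation described in 3.3. The formula and the SPLIT_SLOP constant (1.1, which lets the last split be up to 10% larger instead of producing a tiny trailing split) follow my reading of FileInputFormat in the 1.x source, but the class itself is illustrative rather than the actual Hadoop code:

import java.util.ArrayList;
import java.util.List;

public class SplitSizeSketch {

  // allow the final split to be up to 10% larger than splitSize
  static final double SPLIT_SLOP = 1.1;

  // splitSize = max(minSize, min(maxSize, blockSize))
  static long computeSplitSize(long blockSize, long minSize, long maxSize) {
    return Math.max(minSize, Math.min(maxSize, blockSize));
  }

  // Returns the (offset, length) pairs a splittable file of the given length is cut into.
  static List<long[]> splitFile(long fileLength, long blockSize, long minSize, long maxSize) {
    List<long[]> splits = new ArrayList<long[]>();
    long splitSize = computeSplitSize(blockSize, minSize, maxSize);
    long bytesRemaining = fileLength;
    while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
      splits.add(new long[] { fileLength - bytesRemaining, splitSize });
      bytesRemaining -= splitSize;
    }
    if (bytesRemaining != 0) {
      // the leftover at the end of the file becomes an independent split
      splits.add(new long[] { fileLength - bytesRemaining, bytesRemaining });
    }
    return splits;
  }

  public static void main(String[] args) {
    long mb = 1024L * 1024L;
    // defaults: minSize = 1, maxSize = Long.MAX_VALUE, blockSize = 64 MB  =>  splitSize = 64 MB,
    // so a 200 MB file yields splits of 64, 64, 64 and 8 MB
    for (long[] s : splitFile(200 * mb, 64 * mb, 1L, Long.MAX_VALUE)) {
      System.out.println("offset=" + s[0] + " length=" + s[1]);
    }
  }
}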

 

 

