1. WordCount as the entry point
public class WordCount {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
    Job job = new Job(conf, "word count");   // the Job object's main task is to hold the job's parameters
    job.setJarByClass(WordCount.class);      // the jar containing WordCount.class will be uploaded
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
    FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));
    // true: print the job's progress while it runs; false: just wait until the job ends
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
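The TokenizerMapper and IntSumReducer classes referenced above are not shown in this post. As a rough, Hadoop-free sketch of what they do (the class and method names below are made up for illustration; the real classes operate on Text/IntWritable pairs through Hadoop's Mapper/Reducer API):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountSketch {
    // "map" phase: emit a (word, 1) pair for every token in a line
    public static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            out.add(Map.entry(itr.nextToken(), 1));
        }
        return out;
    }

    // "reduce" phase (the same logic serves as the combiner): sum the counts per word
    public static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> sums = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs) {
            sums.merge(p.getKey(), p.getValue(), Integer::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        pairs.addAll(map("hello hadoop"));
        pairs.addAll(map("hello world"));
        System.out.println(reduce(pairs)); // {hadoop=1, hello=2, world=1}
    }
}
```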
2. Job.waitForCompletion(true) mainly calls the submit() function:
public void submit() throws IOException, InterruptedException, ClassNotFoundException {
  ensureState(JobState.DEFINE);
  // Job.setMapperClass(XXX.class) actually sets mapreduce.map.class, i.e. the new API;
  // JobConf.setMapperClass(XXX.class) actually sets mapred.mapper.class, i.e. the old API.
  // So the new API is used whenever JobConf.setMapperClass was not called.
  setUseNewAPI();
  // the actual job submission; info is used to interact with the JobTracker,
  // e.g. to monitor and kill the submitted job
  info = jobClient.submitJobInternal(conf);
  state = JobState.RUNNING;
}
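The new-API-vs-old-API decision in the comment above boils down to which configuration key the mapper class was stored under. A minimal, dependency-free simulation of that check (the key names are the ones quoted in the comment; the lookup logic here is a simplification, not Hadoop's real setUseNewAPI code):

```java
import java.util.HashMap;
import java.util.Map;

public class ApiChoice {
    // Old API wins only if mapred.mapper.class was explicitly set via JobConf.setMapperClass;
    // otherwise the new API (mapreduce.map.class, set via Job.setMapperClass) is used.
    public static boolean useNewApi(Map<String, String> conf) {
        return conf.get("mapred.mapper.class") == null;
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put("mapreduce.map.class", "TokenizerMapper"); // as set by Job.setMapperClass
        System.out.println(useNewApi(conf)); // true: new API
        conf.put("mapred.mapper.class", "SomeOldMapper");   // as set by JobConf.setMapperClass
        System.out.println(useNewApi(conf)); // false: old API
    }
}
```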
3. Before reading the JobClient.submitJobInternal(conf) function, first look at how the JobClient object is constructed:
3.1 A static initializer block first loads mapred-site.xml and core-site.xml into the Configuration.
3.2 The init function
public void init(JobConf conf) throws IOException {
  // falls back to "local" if mapred.job.tracker is not set in the xml configuration files above
  String tracker = conf.get("mapred.job.tracker", "local");
  if ("local".equals(tracker)) {
    conf.setNumMapTasks(1); // the number of reduces in local mode can only be 1
    this.jobSubmitClient = new LocalJobRunner(conf);
  } else {
    // create an RPC proxy to the JobTracker
    this.jobSubmitClient = createRPCProxy(JobTracker.getAddress(conf), conf);
  }
}
// jobSubmitClient is one of the implementations of JobSubmissionProtocol
// (currently there are two: JobTracker and LocalJobRunner)
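The point of the JobSubmissionProtocol interface mentioned above is that the rest of JobClient never needs to know whether it is talking to a local runner or a remote JobTracker. A hypothetical, dependency-free sketch of this dispatch (names follow the text, but the bodies are stand-ins, not Hadoop's real implementations):

```java
public class SubmitClientChoice {
    // Both implementations are used through this common interface
    interface JobSubmissionProtocol {
        String describe();
    }

    static class LocalJobRunner implements JobSubmissionProtocol {
        public String describe() { return "local"; }
    }

    static class RpcJobTrackerProxy implements JobSubmissionProtocol {
        public String describe() { return "rpc"; }
    }

    // Mirrors the branch in JobClient.init(): "local" selects the in-process
    // runner, anything else an RPC proxy to the JobTracker address
    public static JobSubmissionProtocol pickClient(String tracker) {
        return "local".equals(tracker) ? new LocalJobRunner() : new RpcJobTrackerProxy();
    }

    public static void main(String[] args) {
        System.out.println(pickClient("local").describe());     // local
        System.out.println(pickClient("host:9001").describe()); // rpc
    }
}
```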
3.3 The submitJobInternal function
public RunningJob submitJobInternal(JobConf job) throws IOException {
  JobID jobId = jobSubmitClient.getNewJobId();
  // conf.get("mapred.system.dir", "/tmp/hadoop/mapred/system")
  Path submitJobDir = new Path(getSystemDir(), jobId.toString());
  Path submitJarFile = new Path(submitJobDir, "job.jar");
  Path submitSplitFile = new Path(submitJobDir, "job.split");
  /* 1. create the submitJobDir directory;
   * 2. put the jars, files and archives given as parameters into the distributed cache;
   * 3. upload the jar containing the main class as submitJarFile;
   * 4. set user and group (these permissions affect HDFS file operations) and the current working directory. */
  configureCommandLineOptions(job, submitJobDir, submitJarFile);
  Path submitJobFile = new Path(submitJobDir, "job.xml");
  int reduces = job.getNumReduceTasks();
  JobContext context = new JobContext(job, jobId);
  // check whether the output directory already exists; if it does, an error is thrown
  org.apache.hadoop.mapreduce.OutputFormat<?, ?> output =
      ReflectionUtils.newInstance(context.getOutputFormatClass(), job);
  output.checkOutputSpecs(context);
  // create the splits for the job
  LOG.debug("Creating splits " + fs.makeQualified(submitSplitFile));
  int maps = writeNewSplits(context, submitSplitFile); // determine the split information
  job.set("mapred.job.split.file", submitSplitFile.toString());
  job.setNumMapTasks(maps); // determine the number of maps
  // write all job parameters to job.xml on HDFS
  FSDataOutputStream out = FileSystem.create(fs, submitJobFile, new FsPermission(JOB_FILE_PERMISSION));
  job.writeXml(out);
  // the actual submission: the job is handed over to the JobTracker for processing
  JobStatus status = jobSubmitClient.submitJob(jobId);
  if (status != null) {
    return new NetworkedJob(status);
  } else {
    throw new IOException("Could not launch job");
  }
}
The rest of this post focuses on how the splits are determined. Taking FileInputFormat as an example, the main call chain is as follows:
1.1 int writeNewSplits(JobContext job, Path submitSplitFile)
2.1 ----> List<InputSplit> getSplits(JobContext job)
3.1 ----> List<FileStatus> listStatus(JobContext job) // filters out input paths starting with _ or ., then applies the filter configured by mapred.input.pathfilter.class to obtain the file list
3.2 If a file is compressed, i.e. not splittable, the whole file becomes a single split.
3.3 If a file is splittable, the size of each split is computed first: mapred.min.split.size defaults to 1,
mapred.max.split.size defaults to Long.MAX_VALUE, and the block size defaults to 64 MB.
The split size is Math.max(minSize, Math.min(maxSize, blockSize)). From this formula:
if maxSize is set larger than blockSize, each block becomes one split; otherwise a single block is cut into multiple splits.
If the small amount of data left over in a block is less than the split size, it becomes an independent split.
3.4 The path, length, offset and host locations of each split are saved into the splits array.
2.2 The splits are then sorted by size, largest first, and serialized to the split file.
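The split-size formula from step 3.3 is easy to check with a standalone calculation (plain Java, no Hadoop dependencies; the default values are the ones quoted above):

```java
public class SplitSizeDemo {
    // splitSize = max(minSize, min(maxSize, blockSize)), as described in step 3.3
    public static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024; // default block size: 64 MB
        // defaults: mapred.min.split.size = 1, mapred.max.split.size = Long.MAX_VALUE
        // -> splitSize == blockSize, so each block becomes one split
        System.out.println(splitSize(1, Long.MAX_VALUE, blockSize) == blockSize); // true
        // maxSize set below blockSize -> one block is cut into multiple splits
        System.out.println(splitSize(1, 16L * 1024 * 1024, blockSize)); // 16777216
        // minSize set above blockSize -> a split spans more than one block
        System.out.println(splitSize(128L * 1024 * 1024, Long.MAX_VALUE, blockSize)); // 134217728
    }
}
```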