In Hadoop, writing a MapReduce job in Java usually means implementing a Mapper and a Reducer, creating a Job object, using the Job's setter methods to configure the Mapper, Reducer, input, output, and other parameters, and finally calling the Job object's waitForCompletion(true) method to submit the job and wait for it to finish. Although creating and submitting a job can be described in a few sentences, the actual process is considerably more involved. This article walks through the source code to examine that process in detail.
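To make the typical workflow concrete, a minimal 1.x driver might look like the following sketch. The class name JobSubmissionExample is invented for illustration, and the identity Mapper and Reducer base classes stand in for real user code:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class JobSubmissionExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Create the Job object and give it a name (stored as mapred.job.name)
    Job job = new Job(conf, "job submission example");
    job.setJarByClass(JobSubmissionExample.class);
    // Identity Mapper/Reducer stand in for real user classes in this sketch
    job.setMapperClass(Mapper.class);
    job.setReducerClass(Reducer.class);
    // With TextInputFormat (the default), the identity classes emit LongWritable/Text pairs
    job.setOutputKeyClass(LongWritable.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    // Submit the job and wait for completion, printing progress to the console
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}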
Typically a Job object is created with public Job(Configuration conf, String jobName), which also specifies the job name; the Hadoop code simply stores jobName as the value of mapred.job.name. Besides setting the job name, the Job constructor uses the Configuration object to initialize its internal org.apache.hadoop.mapred.JobConf object, conf, and obtains the UGI of the current user via UserGroupInformation.getCurrentUser(). JobConf is the main interface used to describe MapReduce jobs; many operations, including setting the job name, are ultimately carried out by this class. The UserGroupInformation class holds information about a user and its groups; it wraps JAAS (Java Authentication and Authorization Service) and provides methods for determining user names and group membership.
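The pieces the constructor wires together can also be exercised directly. The following small sketch is illustrative only, not the constructor's actual source; it shows the JobConf wrapping and the UGI lookup that the constructor performs internally:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.security.UserGroupInformation;

public class JobConstructorSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // The Job constructor wraps the Configuration in a JobConf ...
    JobConf jobConf = new JobConf(conf);
    // ... setJobName(jobName) ends up storing the name under mapred.job.name ...
    jobConf.setJobName("job submission example");
    // ... and the current user's UGI is captured for later privileged actions
    UserGroupInformation ugi = UserGroupInformation.getCurrentUser();
    System.out.println(jobConf.get("mapred.job.name") + " as user " + ugi.getUserName());
  }
}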
After the Job object is created, the Mapper and Reducer are usually set, for example via Job.setMapperClass. As mentioned above, this operation is actually delegated to the JobConf object. The specific code is shown below; the other setters are similar:
public void setMapperClass(Class<? extends Mapper> cls) throws IllegalStateException {
  ensureState(JobState.DEFINE);
  conf.setClass(MAP_CLASS_ATTR, cls, Mapper.class);
}
After all the parameters required to run the job have been set, Job.waitForCompletion(true) submits the job to the cluster and waits for it to finish. The boolean parameter controls whether the job's execution progress is printed to the user. The code of this method is as follows:
public boolean waitForCompletion(boolean verbose)
    throws IOException, InterruptedException, ClassNotFoundException {
  if (state == JobState.DEFINE) {
    submit();
  }
  if (verbose) {
    jobClient.monitorAndPrintJob(conf, info);
  } else {
    info.waitForCompletion();
  }
  return isSuccessful();
}
For a newly created job the state is JobState.DEFINE, so the submit method in the code above is executed. After submit returns, either the monitoring branch or the waiting branch runs, depending on whether verbose is true or false. The implementation of submit is:
public void submit() throws IOException, InterruptedException, ClassNotFoundException {
  ensureState(JobState.DEFINE);
  setUseNewAPI(); // use the new API by default, unless the old API was explicitly configured
  // Connect to the JobTracker and submit the job
  connect();
  info = jobClient.submitJobInternal(conf);
  super.setJobID(info.getID());
  state = JobState.RUNNING;
}
submit first verifies that the job state is JobState.DEFINE and, once the job has been submitted, sets the state to JobState.RUNNING. The connect method opens the connection to the JobTracker; its code is:
private void connect() throws IOException, InterruptedException {
  ugi.doAs(new PrivilegedExceptionAction<Object>() {
    public Object run() throws IOException {
      jobClient = new JobClient((JobConf) getConfiguration());
      return null;
    }
  });
}
Before going further, two classes need to be introduced: JobClient and RunningJob. JobClient is the primary interface through which user jobs interact with the JobTracker; it can submit jobs, track their progress, access task logs, and obtain the status of the MapReduce cluster. RunningJob is an interface for querying the details of a running MapReduce job. When JobClient's submitJobInternal is called, it returns an instance of JobClient's inner class NetworkedJob (which implements RunningJob). In the connect method, the JobClient object is instantiated inside ugi.doAs, which returns the value of the run method; this doAs pattern appears again later (in fact it is used heavily). The JobClient constructor establishes the connection to the JobTracker, delegating the work to the init method, whose implementation is:
public void init(JobConf conf) throws IOException {
  String tracker = conf.get("mapred.job.tracker", "local");
  // mapreduce.client.tasklog.timeout
  tasklogtimeout = conf.getInt(TASKLOG_PULL_TIMEOUT_KEY, DEFAULT_TASKLOG_TIMEOUT);
  this.ugi = UserGroupInformation.getCurrentUser();
  if ("local".equals(tracker)) {
    conf.setNumMapTasks(1);
    this.jobSubmitClient = new LocalJobRunner(conf);
  } else {
    this.rpcJobSubmitClient = createRPCProxy(JobTracker.getAddress(conf), conf);
    this.jobSubmitClient = createProxy(this.rpcJobSubmitClient, conf);
  }
}
This article focuses on the non-local (distributed) mode, that is, the case where mapred.job.tracker is not local, which corresponds to the else branch above. rpcJobSubmitClient and jobSubmitClient are both objects of type JobSubmissionProtocol, the interface through which JobClient and the JobTracker communicate: JobClient uses it to submit jobs and to query the state of the system. The methods createRPCProxy and createProxy create client-side proxy objects that implement JobSubmissionProtocol.
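For a rough idea of what createRPCProxy does, the sketch below obtains a JobSubmissionProtocol proxy through Hadoop's generic RPC layer. It is a simplified sketch: the real method also passes the current UGI and a socket factory, and JobSubmissionProtocol is internal to the org.apache.hadoop.mapred package, so code like this effectively lives inside JobClient itself:

// Simplified sketch of creating an RPC proxy for JobSubmissionProtocol
// (assumed to run inside the org.apache.hadoop.mapred package, like JobClient does)
InetSocketAddress addr = JobTracker.getAddress(conf);   // host:port from mapred.job.tracker
JobSubmissionProtocol proxy = (JobSubmissionProtocol) RPC.getProxy(
    JobSubmissionProtocol.class,            // the protocol interface
    JobSubmissionProtocol.versionID,        // client-side protocol version
    addr,                                   // JobTracker address
    conf);                                  // configuration (timeouts, socket factory, ...)
// Every method call on 'proxy' is now sent to the JobTracker over Hadoop RPC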
After the connection to the JobTracker has been established, JobClient's submitJobInternal submits the job to the JobTracker. This method first determines the path used to stage the job files, which is ${mapreduce.jobtracker.staging.root.dir}/${user-name}/.staging; if mapreduce.jobtracker.staging.root.dir is not set, /tmp/hadoop/mapred/staging/${user-name}/.staging is used. A subdirectory named after the job ID is then created under this directory, and the parameter mapreduce.job.dir is set to this value, i.e. ${mapreduce.jobtracker.staging.root.dir}/${user-name}/.staging/jobId; all of these directories are relative to the file system configured by fs.default.name. Next, the job's JAR file is copied to ${mapreduce.jobtracker.staging.root.dir}/${user-name}/.staging/jobId and renamed job.jar; this is done by the copyAndConfigureFiles method. A job.xml file is then created in the same directory, the number of reduce tasks is obtained, the input is split, and the number of map tasks is set according to the number of splits. Once all of this is done, the job is submitted with the following code:
status = jobSubmitClient.submitJob(
    jobId, submitJobDir.toString(), jobCopy.getCredentials());
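To make the staging-directory layout described above concrete, here is a minimal sketch of how those paths fit together. It is illustrative only; 'user' and 'jobId' are placeholders for the submitting user and the assigned job ID:

// Minimal sketch of the staging-directory layout used during job submission
String stagingRoot = conf.get("mapreduce.jobtracker.staging.root.dir",
    "/tmp/hadoop/mapred/staging");                        // default when not configured
Path userStaging = new Path(stagingRoot, user + "/.staging");
Path submitJobDir = new Path(userStaging, jobId.toString()); // one directory per job
conf.set("mapreduce.job.dir", submitJobDir.toString());
// The job's JAR is copied here as job.jar and the configuration is written as job.xml,
// all relative to the file system given by fs.default.name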
Once the job has been handed to the JobTracker, the JobTracker is responsible for executing it, and the client that submitted the job can choose whether to print the job's execution progress.
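If progress printing is not wanted, the client can also poll the RunningJob handle itself. A minimal sketch of such a polling loop, assuming 'running' is the RunningJob handle returned when the job was submitted, might look like this:

import java.io.IOException;
import org.apache.hadoop.mapred.RunningJob;

public class ProgressPoller {
  // Polls a RunningJob handle until the job finishes; 'running' is assumed to be
  // the handle returned when the job was submitted
  public static void poll(RunningJob running) throws IOException, InterruptedException {
    while (!running.isComplete()) {
      System.out.printf("map %.0f%%  reduce %.0f%%%n",
          running.mapProgress() * 100, running.reduceProgress() * 100);
      Thread.sleep(5000);   // check again in five seconds
    }
    System.out.println(running.isSuccessful() ? "Job succeeded" : "Job failed");
  }
}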
In summary, creating and submitting a job in Hadoop 1.2.1 involves the following steps:
- Set the job's input, output, and other parameters
- Copy the job JAR and configuration files to a specific directory
- Compute the input splits and set the number of map tasks
- Submit the job to the JobTracker and monitor its progress