I. Overview of the MapReduce job processing flow
When solving a problem with Hadoop's MapReduce computational model, users only need to design the mapper and reducer processing functions, and optionally a combiner function. After that, they create a new Job object, configure the job's runtime environment, and finally call the job's waitForCompletion or submit method to submit the job. The code is as follows:
// Create a new default job configuration object
Configuration conf = new Configuration();
// Create a Job object from the configuration object and the job name
Job job = new Job(conf, "name of the job");
// When running a job in a cluster, the job must be packaged as a jar;
// Hadoop finds the jar containing the class by the specified class name
job.setJarByClass(MainClass.class);
job.setMapperClass(MapperImpl.class);
// a reducer implementation used as the combiner
job.setCombinerClass(ReducerImpl.class);
job.setReducerClass(ReducerImpl.class);
// the data types of the output key and value
job.setOutputKeyClass(OutputKey.class);
job.setOutputValueClass(OutputValue.class);
// set the input and output paths of the job
FileInputFormat.addInputPath(job, new Path("input path"));
FileOutputFormat.setOutputPath(job, new Path("output path"));
// submit the job to the cluster for processing
job.waitForCompletion(true);
The MapReduce job execution process works as follows. The job's waitForCompletion method internally relies on JobClient to submit the job to the JobTracker. When the JobTracker receives the submission request from the JobClient, it adds the job to the job queue and returns to the JobClient a JobID object that uniquely identifies the job. Jobs in the JobTracker's job queue are executed by TaskTrackers. A TaskTracker periodically sends a heartbeat to the JobTracker, asking whether there are tasks to be executed. If so, the JobTracker assigns tasks to the TaskTracker through the heartbeat response. When a TaskTracker accepts a task, it creates the task locally and executes it.
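The heartbeat-driven assignment loop described above can be illustrated with a small self-contained simulation. All class and method names here are illustrative stand-ins, not Hadoop's actual classes: a toy tracker keeps a FIFO queue of pending tasks and hands one out per heartbeat.

```java
import java.util.ArrayDeque;
import java.util.Queue;

// Toy "JobTracker": keeps a queue of pending tasks and hands one out per heartbeat.
class TrackerSim {
    private final Queue<String> taskQueue = new ArrayDeque<>();

    void addJob(String jobId, int numTasks) {
        for (int i = 0; i < numTasks; i++) {
            taskQueue.add(jobId + "_task_" + i);
        }
    }

    // A TaskTracker's heartbeat: returns a task to run, or null if none is pending.
    String heartbeat(String taskTrackerName) {
        return taskQueue.poll();
    }
}

public class HeartbeatDemo {
    public static void main(String[] args) {
        TrackerSim jt = new TrackerSim();
        jt.addJob("job_1", 2);
        // Two TaskTrackers poll via heartbeats until the queue is drained.
        System.out.println(jt.heartbeat("tt1")); // job_1_task_0
        System.out.println(jt.heartbeat("tt2")); // job_1_task_1
        System.out.println(jt.heartbeat("tt1")); // null (no work left)
    }
}
```

The real JobTracker also weighs data locality and slot capacity when answering a heartbeat; this sketch only shows the pull-based shape of the exchange.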
Here are a few important classes to know about in the JobClient execution process:
1. JobConf: configuration information for MapReduce jobs
The JobConf class inherits from the Configuration class and, on top of Hadoop's basic configuration information, adds information related to MapReduce jobs. This is the class through which users configure a job.
2. JobSubmissionProtocol: the job submission interface
The JobSubmissionProtocol interface is the protocol interface required for JobClient and JobTracker to communicate. It defines methods for the JobClient to submit jobs to the JobTracker, obtain a job's execution information, and so on.
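A simplified, self-contained sketch of the kind of RPC contract this interface expresses is shown below. The method names and signatures are illustrative only; the real JobSubmissionProtocol defines more methods and uses Hadoop's JobID/JobStatus types.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the submission protocol between JobClient and JobTracker.
interface SubmissionProtocol {
    String getNewJobId();              // ask the JobTracker for a fresh job ID
    void submitJob(String jobId);      // add the job to the JobTracker's queue
    String getJobStatus(String jobId); // query the job's current state
}

// Toy in-memory "JobTracker" implementing the protocol.
class ToyJobTracker implements SubmissionProtocol {
    private int nextId = 1;
    private final Map<String, String> status = new HashMap<>();

    public String getNewJobId() { return "job_" + (nextId++); }
    public void submitJob(String jobId) { status.put(jobId, "QUEUED"); }
    public String getJobStatus(String jobId) { return status.getOrDefault(jobId, "UNKNOWN"); }
}

public class ProtocolDemo {
    public static void main(String[] args) {
        SubmissionProtocol tracker = new ToyJobTracker();
        String id = tracker.getNewJobId(); // mirrors JobClient requesting a JobID
        tracker.submitJob(id);
        System.out.println(id + " -> " + tracker.getJobStatus(id)); // job_1 -> QUEUED
    }
}
```

In Hadoop itself, the JobClient holds an RPC proxy implementing this protocol, so each call crosses the network to the JobTracker process.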
3. RunningJob: an interface to a running job
RunningJob provides users with an interface for accessing information about a running MapReduce job. RunningJob is implemented inside JobClient, so we can obtain a RunningJob instance through JobClient and then use that instance to query information about the running job.
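Assuming the old org.apache.hadoop.mapred API (Hadoop 1.x era, which this note describes), a RunningJob instance can be obtained and polled roughly as follows. This is a sketch, not a runnable sample: it requires a live cluster, and the job ID string is a placeholder.

```java
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.JobID;
import org.apache.hadoop.mapred.RunningJob;

// Query a running job's progress through JobClient (requires a live cluster;
// "job_201101010000_0001" is a placeholder job ID).
JobConf conf = new JobConf();
JobClient jobClient = new JobClient(conf);
RunningJob running = jobClient.getJob(JobID.forName("job_201101010000_0001"));
if (running != null) {
    System.out.println("map progress: " + running.mapProgress());
    System.out.println("reduce progress: " + running.reduceProgress());
    System.out.println("complete: " + running.isComplete());
}
```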
4. JobStatus and JobProfile: job status information and registration information
The JobStatus object represents the job's current runtime state, such as the completion progress of the mapper tasks that make up the job; this status information changes as the job runs. The JobProfile object represents the registration information carried when the job is added to the MapReduce framework; this registration information does not change.
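The split between immutable registration data and mutable runtime state can be sketched as follows. The class and field names are illustrative, not Hadoop's actual JobProfile/JobStatus API.

```java
// Fixed at submission time: never changes after the job is registered.
final class ProfileSketch {
    final String jobId;
    final String jobName;
    ProfileSketch(String jobId, String jobName) { this.jobId = jobId; this.jobName = jobName; }
}

// Updated continuously while the job runs.
class StatusSketch {
    float mapProgress = 0.0f;     // fraction of mapper work completed
    float reduceProgress = 0.0f;  // fraction of reducer work completed
    void update(float map, float reduce) { mapProgress = map; reduceProgress = reduce; }
}

public class StatusDemo {
    public static void main(String[] args) {
        ProfileSketch profile = new ProfileSketch("job_1", "word count"); // registration info
        StatusSketch status = new StatusSketch();
        status.update(0.5f, 0.0f);   // mappers half done
        status.update(1.0f, 0.25f);  // mappers done, reducers started
        System.out.println(profile.jobName + ": map=" + status.mapProgress
                + " reduce=" + status.reduceProgress);
    }
}
```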
5. JobSubmissionFiles: accessing the files submitted with a job
During MapReduce initialization, the Hadoop framework creates the appropriate directories for the job submitted by the user and stores the files associated with the job there, such as the jar file required by MapReduce and the job's configuration file. The JobSubmissionFiles class provides methods for accessing these job-related files and the directories in the distributed cache that hold the different types of files corresponding to the job, but this class is only used inside the Hadoop framework.
II. JobClient job submission flow
1. Request from the JobTracker a new JobID object that uniquely identifies the job
2. Check the job's input and output: the input cannot be empty, and the output directory must not already exist before the job runs
3. Compute the job's input splits (InputSplit), which determine the number of mapper tasks required
4. Set up the job-related distributed cache (DistributedCache)
5. Copy the resources required at job runtime, including the job's jar package, configuration files, and so on, from Hadoop's distributed file system to the specified directory in the JobTracker's file system
6. Submit the job to the JobTracker's job queue and monitor the job's running status
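The six steps above can be sketched as one self-contained method. The names and the split-count arithmetic are illustrative; in Hadoop the real logic lives inside JobClient's submission path.

```java
import java.util.List;

// Illustrative sketch of the JobClient submission steps (not Hadoop's actual code).
public class SubmitSketch {
    private static int nextId = 1;

    static String submit(List<String> inputPaths, boolean outputExists,
                         int splitSize, int totalInputSize) {
        // 1. Request a new job ID from the tracker.
        String jobId = "job_" + (nextId++);
        // 2. Check input and output: input must be non-empty, output must not exist yet.
        if (inputPaths.isEmpty()) throw new IllegalStateException("no input paths");
        if (outputExists) throw new IllegalStateException("output directory already exists");
        // 3. Compute the input splits; one mapper task per split (ceiling division).
        int numMappers = (totalInputSize + splitSize - 1) / splitSize;
        // 4-5. (Simulated) set up the distributed cache and copy the jar and
        //      configuration files to the tracker's file system.
        // 6. Enqueue the job and return its ID so the caller can monitor it.
        return jobId + ":" + numMappers + " mappers";
    }

    public static void main(String[] args) {
        // 200 units of input with 64-unit splits -> 4 mapper tasks.
        System.out.println(submit(List.of("/data/in"), false, 64, 200));
    }
}
```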
Hadoop Learning Note Three: JobClient execution process