Hadoop Learning Note Three: The JobClient Execution Process

I. Overview of the MapReduce Job Processing Flow

To solve a problem with Hadoop's MapReduce computing model, a user only needs to design the Mapper and Reducer processing functions, and possibly a Combiner function as well. After that, the user creates a Job object, configures the job's runtime environment, and finally calls the job's waitForCompletion or submit method to submit the job. The code looks like this (class names are placeholders):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Create a default job configuration object.
    Configuration conf = new Configuration();
    // Create a Job object from the configuration and a job name.
    Job job = new Job(conf, "name of the job");
    // When a job runs on a cluster it must be packaged as a jar;
    // Hadoop locates the jar by the class it contains.
    job.setJarByClass(MainClass.class);          // the driver (main) class
    job.setMapperClass(MyMapper.class);          // the Mapper implementation
    job.setCombinerClass(MyReducer.class);       // a Reducer implementation used as the combiner
    job.setReducerClass(MyReducer.class);        // the Reducer implementation
    job.setOutputKeyClass(Text.class);           // data type of the output key (example type)
    job.setOutputValueClass(IntWritable.class);  // data type of the output value (example type)
    FileInputFormat.addInputPath(job, new Path("input path of the job"));
    FileOutputFormat.setOutputPath(job, new Path("output path of the job"));
    // Submit the job to the cluster and wait for it to finish.
    job.waitForCompletion(true);

The overall job execution flow of MapReduce is as follows. The job's waitForCompletion method internally relies on JobClient to submit the job to the JobTracker. When the JobTracker receives the submission request from JobClient, it adds the job to its job queue and returns to JobClient a JobID object that uniquely identifies the job. Jobs in the JobTracker's queue are executed by TaskTrackers. Each TaskTracker periodically sends a heartbeat to the JobTracker, asking whether there is a task to execute; if so, the JobTracker assigns a task to that TaskTracker through the heartbeat response. When a TaskTracker receives a task, it creates a local task to execute it.
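
A minimal sketch of this heartbeat loop, using the Hadoop 1.x action class names but greatly simplifying the real exchange (the actual loop lives inside TaskTracker and talks to the JobTracker over the InterTrackerProtocol RPC interface):

    // Simplified sketch of a TaskTracker's heartbeat loop; not the actual internals.
    // (Exception handling and the full heartbeat arguments are omitted.)
    while (running) {
        // Report local status and ask the JobTracker whether there is work to do.
        HeartbeatResponse response = jobTracker.heartbeat(myStatus);
        for (TaskTrackerAction action : response.getActions()) {
            if (action instanceof LaunchTaskAction) {
                // Create a task locally (in a child JVM) to execute the assignment.
                launchTask(((LaunchTaskAction) action).getTask());
            }
        }
        Thread.sleep(heartbeatIntervalMillis);
    }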

Several classes and interfaces are important to understand in JobClient's execution process:

1. JobConf: configuration information for a MapReduce job

The JobConf class inherits from the Configuration class and, on top of Hadoop's basic configuration information, adds settings specific to MapReduce jobs. It is the class through which users configure a job.
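
Because JobConf extends Configuration, a job configuration accepts both generic Hadoop settings and MapReduce-specific setters. A small illustration (class and host names are placeholders):

    // JobConf layers MapReduce-specific setters on top of Configuration.
    JobConf conf = new JobConf(MyJob.class);    // also records the jar containing MyJob
    conf.setJobName("example job");             // MapReduce-specific
    conf.setNumReduceTasks(2);                  // MapReduce-specific
    conf.set("fs.default.name", "hdfs://namenode:9000");  // generic, inherited from Configuration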

2. JobSubmissionProtocol: the job submission interface

The JobSubmissionProtocol interface is the protocol through which JobClient communicates with the JobTracker. It defines the methods JobClient uses to submit jobs to the JobTracker, obtain a job's execution information, and so on.
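
An abridged sketch of the interface; the real org.apache.hadoop.mapred.JobSubmissionProtocol declares more methods, and the submitJob signature here is simplified:

    // Excerpt of the JobClient <-> JobTracker RPC protocol (simplified).
    public interface JobSubmissionProtocol {
        JobID getNewJobId() throws IOException;               // allocate an ID for a new job
        JobStatus submitJob(JobID jobId) throws IOException;  // submit a prepared job
        JobStatus getJobStatus(JobID jobId) throws IOException;   // current running state
        JobProfile getJobProfile(JobID jobId) throws IOException; // static registration info
        void killJob(JobID jobId) throws IOException;         // terminate a job
    }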

3. RunningJob: an interface to a running job

RunningJob provides the user with an interface for querying information about a running MapReduce job. RunningJob is implemented inside JobClient, so we can obtain a RunningJob instance through JobClient and then use that instance to query the running job.
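
For example (the job ID string below is a placeholder):

    // Obtain a RunningJob handle through JobClient and poll its progress.
    JobClient client = new JobClient(new JobConf(conf));
    RunningJob running = client.getJob(JobID.forName("job_201101010000_0001"));
    System.out.println("map progress:    " + running.mapProgress());
    System.out.println("reduce progress: " + running.reduceProgress());
    if (running.isComplete() && running.isSuccessful()) {
        System.out.println("job finished successfully");
    }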

4. JobStatus and JobProfile: job status information and registration information

A JobStatus object represents a job's current running state, for example the completion progress of all the map tasks that make up the job; this status information changes as the job runs. A JobProfile object represents the registration information recorded when the job was added to the MapReduce framework; the registration information does not change afterwards.
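
The contrast shows up in the two classes' accessors. A short illustration, assuming a JobSubmissionProtocol handle named jobSubmitClient:

    // JobStatus: mutable, reflects the current run state of the job.
    JobStatus status = jobSubmitClient.getJobStatus(jobId);
    float mapDone = status.mapProgress();     // changes while the job runs
    int state = status.getRunState();         // e.g. JobStatus.RUNNING

    // JobProfile: immutable registration info recorded at submission time.
    JobProfile profile = jobSubmitClient.getJobProfile(jobId);
    String user = profile.getUser();          // fixed once the job is registered
    String jobFile = profile.getJobFile();    // path of the submitted job.xml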

5. JobSubmissionFiles: access to a job's submitted files

During initialization, the Hadoop framework creates a directory for each job the user submits and stores the job-related files there, such as the jar file the MapReduce job needs and the job's configuration file. The JobSubmissionFiles class provides methods for accessing these job-related files, as well as the directories in the job's distributed cache that hold the different types of cached files; the class is intended for use only inside the Hadoop framework.
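
A hedged example of the path helpers this class provides (method names as in Hadoop 1.x; exact signatures vary across versions):

    // Resolve the staging area and the standard files under a job's submit directory.
    Path stagingArea = JobSubmissionFiles.getStagingDir(client, conf);
    Path submitDir = new Path(stagingArea, jobId.toString());
    Path jobJar    = JobSubmissionFiles.getJobJar(submitDir);        // job.jar
    Path jobXml    = JobSubmissionFiles.getJobConfPath(submitDir);   // job.xml
    Path jobSplits = JobSubmissionFiles.getJobSplitFile(submitDir);  // serialized input splits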

II. JobClient's Job Submission Flow

1. Request a new job ID (a JobID object) from the JobTracker

2. Check the job's input and output specifications: the input must be specified, and the output must not already exist before the job runs

3. Compute the job's input splits (InputSplit), which determine the number of map tasks required

4. Set up the job's DistributedCache, if the job uses one

5. Copy the resources the job needs at runtime, including the job's jar file and configuration file, to a designated directory in the JobTracker's filesystem (Hadoop's distributed file system)

6. Submit the job to the JobTracker's job queue and monitor the job's status; a simplified sketch of the whole flow follows this list
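
The sketch below shows how these six steps line up inside JobClient's internal submission logic. It is a simplified paraphrase of Hadoop 1.x's JobClient.submitJobInternal, not verbatim source; helper names such as stagingArea, writeSplits, and writeConf are simplifications:

    // Paraphrased outline of JobClient's job submission, following the six steps above.
    JobID jobId = jobSubmitClient.getNewJobId();                  // step 1: new job ID
    Path submitDir = new Path(stagingArea, jobId.toString());     // the job's staging directory
    copyAndConfigureFiles(jobConf, submitDir);                    // steps 4-5: cache + jar/conf upload
    jobConf.getOutputFormat().checkOutputSpecs(fs, jobConf);      // step 2: output must not exist
    int numMaps = writeSplits(jobContext, submitDir);             // step 3: compute input splits
    jobConf.setNumMapTasks(numMaps);                              // one map task per split
    writeConf(jobConf, JobSubmissionFiles.getJobConfPath(submitDir)); // persist job.xml
    JobStatus status = jobSubmitClient.submitJob(jobId);          // step 6: enqueue at the JobTracker
    RunningJob running = new NetworkedJob(status);                // handle used to monitor the job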
