The most interesting part of Hadoop is its job scheduling, and it is worth understanding it thoroughly before formally introducing how to set up Hadoop. You may never need to run Hadoop itself, but if you are fluent in its distributed scheduling principles, you could even write a mini Hadoop of your own when the need arises.
Start
Map/Reduce is a distributed computing model for large-scale data processing, originally designed and implemented by Google engineers, who published the complete MapReduce paper publicly. Its definition is that Map/Reduce is a programming model and an associated implementation for processing and generating large data sets. The user defines a map function that processes a key/value pair to generate a batch of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real-world tasks can be expressed in this model.
Hadoop's Map/Reduce framework is based on the same principle; below is a brief introduction to the framework's main components and how they relate to each other.
2.1 Overall Structure
2.1.1 Mapper and Reducer
The most basic components of a MapReduce application running on Hadoop are a Mapper class and a Reducer class, plus a driver program that creates a JobConf; some applications also include a Combiner class, which is in fact just another Reducer implementation.
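For concreteness, here is a minimal word-count sketch of these components against the classic org.apache.hadoop.mapred API described in this article; the class names (WordCountMap, WordCountReduce) are illustrative, not taken from any particular Hadoop release.

    import java.io.IOException;
    import java.util.Iterator;
    import java.util.StringTokenizer;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    // Mapper: emits (word, 1) for every word in a line of input.
    public class WordCountMap extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private final Text word = new Text();

      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        StringTokenizer tokens = new StringTokenizer(value.toString());
        while (tokens.hasMoreTokens()) {
          word.set(tokens.nextToken());
          output.collect(word, ONE);
        }
      }
    }

    // Reducer: sums the counts for each word. The same class can be registered
    // as the combiner, which is exactly the "a combiner is a Reducer" point above.
    class WordCountReduce extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int sum = 0;
        while (values.hasNext()) {
          sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
      }
    }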
2.1.2 JobTracker and TaskTracker
Scheduling is handled by two classes: a single master service, the JobTracker, and multiple slave services, the TaskTrackers, running on the cluster nodes. The master is responsible for scheduling every sub-task of a job onto the slaves and monitoring them, re-running any task that fails; the slaves simply execute each task. TaskTrackers need to run on HDFS DataNodes, whereas the JobTracker does not; in general, the JobTracker should be deployed on a separate machine.
2.1.3 JobClient
On the client side, each job packages its application code and configuration parameters into a jar file via the JobClient class, stores it in HDFS, and submits the path to the JobTracker master service. The master then creates each task (i.e., MapTask and ReduceTask) and distributes them to the TaskTracker services for execution.
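A hedged sketch of the corresponding driver side: it builds a JobConf, registers the mapper/reducer/combiner classes from the sketch above, and hands the job to the JobClient. The class name WordCountDriver and the argument handling are assumptions for illustration.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class WordCountDriver {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);  // job.jar is located from this class
        conf.setJobName("wordcount");

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        conf.setMapperClass(WordCountMap.class);
        conf.setCombinerClass(WordCountReduce.class);   // optional; just a Reducer implementation
        conf.setReducerClass(WordCountReduce.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Packages the configuration, copies job.jar into HDFS, hands the job file
        // path to the JobTracker, then blocks until the job finishes.
        JobClient.runJob(conf);
      }
    }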
2.1.4 JobInProgress
After the JobClient submits a job, the JobTracker creates a JobInProgress to track and schedule it and adds it to the job queue. Based on the input data set defined in the submitted job jar (decomposed into FileSplits), the JobInProgress creates a corresponding batch of TaskInProgress objects to monitor and schedule the MapTasks, and also creates the specified number of TaskInProgress objects to monitor and schedule the ReduceTasks; by default there is one ReduceTask.
2.1.5 TaskInProgress
When each TaskInProgress launches a task, the JobTracker serializes the Task object (a MapTask or ReduceTask) to the corresponding TaskTracker service. On receiving it, the TaskTracker creates its own TaskInProgress (this TaskInProgress is a similar but distinct class used on the non-JobTracker side) to monitor and schedule the task. The concrete task process is started through the TaskRunner object managed by this TaskInProgress: the TaskRunner automatically loads job.jar, sets up the environment variables, and starts a separate Java child process to execute the task, i.e., a MapTask or a ReduceTask; the MapTasks and ReduceTasks of a job do not necessarily run on the same TaskTracker.
2.1.6 MapTask and ReduceTask
A complete job automatically runs the Mapper, the Combiner (only when the JobConf specifies one) and the Reducer. The Mapper and Combiner are invoked by the MapTask, the Reducer by the ReduceTask; the Combiner is in fact just an implementation of the Reducer interface. The Mapper reads <key1,value1> pairs from the input data set defined in the job jar and processes them into intermediate <key2,value2> pairs. If a Combiner is defined, the MapTask calls it to merge the values of identical keys in the map output, reducing the size of the output set. Once the MapTasks are complete, the ReduceTask process calls the Reducer to produce the final <key3,value3> pairs. This process is described in more detail in the next section.
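To make the three key/value stages concrete, here is an assumed trace of the word-count example above on a single input line:

    // <key1,value1>: the record handed to the mapper, e.g. (0, "to be or not to be")
    // <key2,value2>: intermediate pairs emitted by map():
    //                ("to",1) ("be",1) ("or",1) ("not",1) ("to",1) ("be",1)
    // after the optional combiner (run inside the MapTask, per map output):
    //                ("be",2) ("not",1) ("or",1) ("to",2)
    // <key3,value3>: final pairs written by reduce() in the ReduceTask,
    //                summed across the outputs of all MapTasks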
The following figure describes the main components of the Map/Reduce framework and their relationships:
2.2 Job Creation Process
2.2.1 JobClient.runJob() starts the job and decomposes the input data set
A MapReduce job uses the JobClient class to decompose the input data set into a batch of small data sets according to the InputFormat implementation class the user specified in the JobConf, with each small data set corresponding to one MapTask. By default the JobClient uses the FileInputFormat class, calling FileInputFormat.getSplits() to generate the small data sets; if isSplitable() decides a data file can be split, a large file is broken into FileSplits, which of course only record the file's path in HDFS together with an offset and split size. This information is packaged into the job file jar and stored in HDFS, and the job file path is then submitted to the JobTracker for scheduling and execution.
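As a hedged illustration of the isSplitable() hook, the hypothetical InputFormat below keeps .gz files unsplit, so getSplits() hands each such file to a single MapTask in one FileSplit; TextInputFormat itself already makes a similar decision based on the configured compression codecs.

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.TextInputFormat;

    // Hypothetical InputFormat for illustration: files ending in ".gz" cannot be split.
    public class WholeGzipTextInputFormat extends TextInputFormat {
      @Override
      protected boolean isSplitable(FileSystem fs, Path file) {
        return !file.getName().endsWith(".gz");
      }
    }

    // The client side then effectively obtains the splits with something like:
    //   InputSplit[] splits = conf.getInputFormat().getSplits(conf, conf.getNumMapTasks());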
2.2.2 JobClient.submitJob() submits the job to the JobTracker
The job file is submitted through the RPC module, which is described in detail in a separate chapter. Roughly, the JobClient class invokes the JobTracker's submitJob() method through a proxy interface implemented over RPC, and the JobTracker must implement the JobSubmissionProtocol interface. The JobTracker then schedules and executes the job by creating a series of job-related objects, such as JobInProgress and TaskInProgress, based on the job file path it received.
Once the JobTracker has created the job successfully, it returns a JobStatus object to the JobClient, recording status information such as execution time and the completion ratio of the map and reduce tasks. From this JobStatus object the JobClient creates a NetworkedJob RunningJob object, which monitors the job by periodically fetching execution statistics from the JobTracker and printing them to the user's console.
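A minimal sketch of what client-side monitoring can look like from the RunningJob handle, polling progress explicitly instead of relying on the blocking runJob(); the class name and the sleep interval are assumptions.

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class JobMonitor {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(JobMonitor.class);
        // ... same mapper/reducer/path settings as in the driver sketch above ...

        JobClient client = new JobClient(conf);
        RunningJob running = client.submitJob(conf);   // asynchronous submission
        while (!running.isComplete()) {
          // mapProgress()/reduceProgress() reflect the statistics the JobTracker reports back.
          System.out.printf("map %.0f%%  reduce %.0f%%%n",
              running.mapProgress() * 100, running.reduceProgress() * 100);
          Thread.sleep(5000);
        }
        System.out.println(running.isSuccessful() ? "job succeeded" : "job failed");
      }
    }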
The classes and methods involved in job creation are shown in the following illustration.
2.3 Job Execution Process
As mentioned above, jobs are managed centrally by the JobTracker, and the concrete tasks are distributed to the individual TaskTracker nodes for execution. Below, the execution process is analyzed in detail from the source code, starting from the moment the JobTracker receives a submission request from the JobClient.
2.3.1 How the JobTracker initializes the job and task queues
2.3.1.1 JobTracker.submitJob() receives the request
When the JobTracker receives a new job request (that is, when its submitJob() function is invoked), it creates a JobInProgress object to manage and schedule the tasks. On construction, the JobInProgress initializes a series of job-related parameters, such as the location of the job jar (which it copies from HDFS to a temporary directory on the local file system), the map and reduce task information, the job priority, and the objects that record statistics and reports.
2.3.1.2 JobTracker.resortPriority() adds the job to the queues and sorts by priority
The newly created JobInProgress is first added to the jobs queue: a Map member variable, jobs, manages all JobInProgress objects, and a List member variable, jobsByPriority, maintains their execution order. The JobTracker then calls the resortPriority() function, which sorts the jobs by priority and then by submission time, guaranteeing that the highest-priority, earliest-submitted jobs execute first.
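The ordering rule can be summarized by a comparator like the hedged sketch below; the QueuedJob type and the meaning of its priority field are assumptions for illustration, not the JobTracker's actual data structures.

    import java.util.Comparator;

    // Hypothetical stand-in for a queued job entry.
    class QueuedJob {
      final int priority;      // larger value = higher priority (assumption)
      final long submitTime;   // milliseconds since epoch
      QueuedJob(int priority, long submitTime) {
        this.priority = priority;
        this.submitTime = submitTime;
      }
    }

    // Higher priority first; for equal priority, the earlier-submitted job first.
    class JobQueueComparator implements Comparator<QueuedJob> {
      public int compare(QueuedJob a, QueuedJob b) {
        if (a.priority != b.priority) {
          return Integer.compare(b.priority, a.priority);  // descending priority
        }
        return Long.compare(a.submitTime, b.submitTime);   // ascending submit time
      }
    }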
2.3.1.3 JobTracker notifies the initialization thread (JobInitThread)
The JobTracker then adds the job to a queue used to manage initialization, a List member variable called jobInitQueue. Calling notifyAll() on this member wakes up a thread, JobInitThread, that initializes jobs (the JobTracker has several internal threads maintaining the job queues, all implemented inside the JobTracker code; details later). When the JobInitThread receives the signal, it takes the job at the head of the queue, i.e., the one with the highest priority, and calls the JobInProgress initTasks() function to perform the real initialization work.
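The hand-off between submitJob() and the JobInitThread follows the standard queue-plus-notifyAll pattern; the following is a minimal sketch of that pattern, not the actual JobTracker code.

    import java.util.LinkedList;
    import java.util.List;

    class JobInitQueue {
      private final List<Runnable> jobInitQueue = new LinkedList<Runnable>();

      // Called from the submission path after the job joins the priority queue.
      public void enqueue(Runnable initWork) {
        synchronized (jobInitQueue) {
          jobInitQueue.add(initWork);
          jobInitQueue.notifyAll();          // wake the init thread
        }
      }

      // Body of the dedicated initialization thread.
      public void runInitLoop() throws InterruptedException {
        while (true) {
          Runnable next;
          synchronized (jobInitQueue) {
            while (jobInitQueue.isEmpty()) {
              jobInitQueue.wait();           // blocks until enqueue() signals
            }
            next = jobInitQueue.remove(0);   // head of queue = highest-priority job
          }
          next.run();                        // e.g. JobInProgress.initTasks()
        }
      }
    }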
2.3.1.4 JobInProgress.initTasks() initializes the TaskInProgress objects
Initializing the tasks is a little more complicated. First, the JobInProgress creates the monitoring objects for the maps. In the initTasks() function it obtains the list of RawSplits for the decomposed input data by calling the JobClient's readSplitFile(), and then creates the corresponding number of map execution and management objects, TaskInProgress. In this process it also records the hosts of all the DataNodes holding the HDFS blocks of each RawSplit; these come from the FileSplit's getLocations() when the RawSplit is created, which in turn invokes DistributedFileSystem's getFileCacheHints() (the details are explained in the HDFS module). If the data is stored on the local file system, i.e., LocalFileSystem is used, there is of course only one location, "localhost". (A small sketch of reading these locality hints appears at the end of this subsection.)
Second, the JobInProgress creates the monitoring objects for the reduces, which is simpler: it creates them according to the number of reduces specified in the JobConf, with only one ReduceTask created by default. The class that monitors and schedules the ReduceTasks is also TaskInProgress, but constructed differently; depending on its constructor parameters, a TaskInProgress creates either a concrete MapTask or a ReduceTask.
After the JobInProgress has created these TaskInProgress objects, it finally constructs a JobStatus recording that the job is now executing, and then calls JobHistory.JobInfo.logStarted() to log the job's execution history. At this point the JobTracker has finished initializing the job; execution itself is handled asynchronously elsewhere, as described below.
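As a side note on the locality hints mentioned in the first step above, the hedged sketch below prints which hosts hold each input split, using the public InputSplit.getLocations() call; the class name is illustrative.

    import java.io.IOException;
    import java.util.Arrays;

    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;

    public class SplitLocations {
      // For each split, show which DataNode hosts hold its blocks; this is the same
      // kind of locality information JobInProgress keeps per map TaskInProgress.
      public static void printSplitLocations(JobConf conf) throws IOException {
        InputSplit[] splits =
            conf.getInputFormat().getSplits(conf, conf.getNumMapTasks());
        for (int i = 0; i < splits.length; i++) {
          String[] hosts = splits[i].getLocations();  // hostnames, or "localhost" on a local FS
          System.out.println("split " + i + " -> " + Arrays.toString(hosts));
        }
      }
    }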
The classes and methods involved in job initialization are shown in the following illustration.
2.3.2 How the TaskTracker executes a task
Task execution is actually initiated by the TaskTracker. The TaskTracker communicates with the JobTracker periodically (every 10 seconds by default; see the HEARTBEAT_INTERVAL variable defined in the MRConstants class), reporting the execution status of its own tasks and receiving instructions from the JobTracker. If it finds that it has a new task to run, it starts it at this point, during the TaskTracker's call to the JobTracker's heartbeat() method, which is made through a proxy interface implemented by the IPC layer (described in detail in the IPC section). This process is actually fairly involved, so the steps are walked through one by one below.
2.3.2.1 TaskTracker.run() connects to the JobTracker
During startup, the TaskTracker initializes a series of parameters and services (covered separately in another section) and then attempts to connect to the JobTracker service (which must implement the InterTrackerProtocol interface). If the connection is lost, it loops, repeatedly trying to reconnect to the JobTracker and re-initializing all its members and parameters; the details can be seen in the run() method.
2.3.2.2 TaskTracker.offerService() main loop
If the connection to the JobTracker succeeds, the TaskTracker calls the offerService() function and enters its main execution loop. In this loop it communicates with the JobTracker every 10 seconds, calling transmitHeartbeat() to obtain a HeartbeatResponse, then calling the HeartbeatResponse's getActions() function to obtain all the instructions the JobTracker has passed down, namely an array of TaskTrackerActions. It iterates over this array: if an entry is an instruction to start a new task, i.e., a LaunchTaskAction, it calls the startNewTask() function to run the new task; otherwise, for instructions such as KillJobAction or KillTaskAction, it adds the entry to the tasksToCleanup queue and hands it to a taskCleanupThread to handle.
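A simplified sketch of this dispatch loop is shown below; the action types are stand-ins for Hadoop's internal TaskTrackerAction subclasses, and transmitHeartbeat()/startNewTask() are placeholders rather than the real method bodies.

    import java.util.Queue;
    import java.util.concurrent.LinkedBlockingQueue;

    // Stand-ins for LaunchTaskAction and the kill/cleanup actions.
    interface TrackerAction {}
    class LaunchTask implements TrackerAction { /* which task to start */ }
    class CleanupAction implements TrackerAction { /* job or task to kill/clean */ }

    class HeartbeatLoop {
      private final Queue<CleanupAction> tasksToCleanup =
          new LinkedBlockingQueue<CleanupAction>();   // drained by a cleanup thread

      void offerService() throws InterruptedException {
        while (true) {
          TrackerAction[] actions = transmitHeartbeat();   // RPC round-trip to the JobTracker
          for (TrackerAction action : actions) {
            if (action instanceof LaunchTask) {
              startNewTask((LaunchTask) action);           // run the new task on this tracker
            } else {
              tasksToCleanup.add((CleanupAction) action);  // kill/cleanup handled elsewhere
            }
          }
          Thread.sleep(10 * 1000);                         // default heartbeat interval
        }
      }

      TrackerAction[] transmitHeartbeat() { return new TrackerAction[0]; } // placeholder
      void startNewTask(LaunchTask t) { /* localize the job, spawn a TaskRunner */ }
    }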
2.3.2.3 TaskTracker.transmitHeartbeat() gets the JobTracker's instructions
In transmitHeartbeat(), the TaskTracker creates a new TaskTrackerStatus object recording the execution state of its current tasks, then calls the JobTracker's heartbeat() method through the IPC interface, sending the status over and accepting new instructions, namely the returned TaskTrackerAction array. Before this call, the TaskTracker checks the number of tasks it is currently running and the space usage of the local disk; if it can accept a new task, it sets heartbeat()'s askForNewTask parameter to true. After the call succeeds, it updates the relevant statistics and so on.
2.3.2.4 TaskTracker.startNewTask() starts a new task
The main job of this function is to create a TaskTracker$TaskInProgress object to schedule and monitor the task and add it to the runningTasks queue. When that is done, it calls localizeJob() to really initialize the task and begin execution.
2.3.2.5 TaskTracker.localizeJob() initializes the job directory
The main work of this function is to initialize the working directory workDir, copy the job jar package from HDFS to the local file system, and call RunJar.unJar() to extract the package into the working directory. It then creates a RunningJob and calls the addTaskToJob() function to add it to the runningJobs monitoring queue. When that is done, it calls launchTaskForJob() to start the task.
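A hedged sketch of this localization step: copy job.jar out of HDFS and unpack it with RunJar.unJar(). The local paths here are hypothetical; the real TaskTracker derives them from mapred.local.dir.

    import java.io.File;

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.util.RunJar;

    public class LocalizeSketch {
      // jobId and the /tmp layout are assumptions for illustration.
      public static void localize(JobConf conf, String jobId) throws Exception {
        File workDir = new File("/tmp/mapred/local/" + jobId + "/work");
        workDir.mkdirs();

        // Copy job.jar out of HDFS, then unpack it into the working directory.
        Path hdfsJar = new Path(conf.getJar());
        File localJar = new File(workDir, "job.jar");
        FileSystem.get(conf).copyToLocalFile(hdfsJar, new Path(localJar.toString()));
        RunJar.unJar(localJar, workDir);
      }
    }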
2.3.2.6 TaskTracker.launchTaskForJob() launches the task
Starting the task is actually done by calling the TaskTracker$TaskInProgress launchTask() function.
2.3.2.7 TaskTracker$TaskInProgress.launchTask() executes the task
Before executing the task, it calls localizeTask() to update the JobConf file and write it to the local directory. It then creates a TaskRunner object by calling the task's createRunner() method, and calls the runner's start() method to launch an independent Java child process for the task.
2.3.2.8 Task.createRunner() creates the runner object
Task has two concrete implementations, MapTask and ReduceTask, used for map and reduce tasks respectively. A MapTask creates a MapTaskRunner to start its child process, and a ReduceTask creates a ReduceTaskRunner to start its own.
2.3.2.9 TaskRunner.start() launches the child process that actually executes the task
This is where the child process is really started and the task executed; the work is handled in the run() function. The execution process is complex: the main job is to set up the series of environment variables needed to start the Java child process, including the working directory workDir and the CLASSPATH environment variable (merging the TaskTracker's own environment with the path of the job jar). It then loads the job jar package and calls the runChild() method to start the child process, which is created through a ProcessBuilder, while the stdout/stderr/syslog output of the child process is redirected to the output log directory specified for the task; the actual output handling is implemented through the TaskLog class. There is one small problem here: the task child process can only output INFO-level logs, since that level is hard-coded in the run() function, but improving this would not be complicated.
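Below is a minimal sketch of launching such a child JVM with ProcessBuilder and redirecting its output into a log directory; the classpath, memory option, entry-point class and file names are assumptions for illustration, not TaskRunner's actual command line.

    import java.io.File;
    import java.util.Arrays;
    import java.util.List;

    public class ChildLauncher {
      // Builds and runs "java -classpath ... <entry point>" in workDir,
      // with stdout/stderr redirected into logDir, then waits for the exit code.
      public static int launch(String classpath, File workDir, File logDir) throws Exception {
        List<String> cmd = Arrays.asList(
            new File(System.getProperty("java.home"), "bin/java").toString(),
            "-classpath", classpath,
            "-Xmx200m",
            "org.example.ChildTaskMain");                 // stand-in for the task entry point

        ProcessBuilder builder = new ProcessBuilder(cmd);
        builder.directory(workDir);                        // the task's working directory
        builder.redirectOutput(new File(logDir, "stdout"));
        builder.redirectError(new File(logDir, "stderr"));
        return builder.start().waitFor();                  // exit code of the child JVM
      }
    }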
The classes and methods involved in job execution are shown in the following illustration.
This article is reproduced from: http://www.cnblogs.com/shipengzhi/articles/2487429.html