Hadoop API usage

Source: Internet
Author: User

Address: http://hi.baidu.com/befree2008wl/blog/item/dcbe864f37c9423caec3ab7b.html

 

Hadoop's APIs are divided into the following main packages:

org.apache.hadoop.conf defines the configuration API for system parameters and configuration files.

org.apache.hadoop.fs defines the abstract file system API.

org.apache.hadoop.dfs implements the Hadoop Distributed File System (HDFS) module.

org.apache.hadoop.io defines general I/O APIs for reading and writing data objects over networks, databases, and files.

org.apache.hadoop.ipc provides network server and client utilities; it encapsulates the basic asynchronous network I/O modules.

org.apache.hadoop.mapred implements the Hadoop distributed computing (MapReduce) module, including task distribution and scheduling.

org.apache.hadoop.metrics defines an API for performance statistics, used mainly by the mapred and dfs modules.

org.apache.hadoop.record defines the I/O API classes for records and a record description language translator, which simplifies serializing records in a language-neutral manner.

org.apache.hadoop.tools defines some common tools.

org.apache.hadoop.util defines some public utility APIs.
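
As a brief illustration of how the conf and fs packages fit together, here is a minimal sketch that loads the cluster configuration, then writes and reads back a small file; the path and the string written are purely illustrative.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class FsExample {
    public static void main(String[] args) throws Exception {
        // Load core-site.xml / hdfs-site.xml style settings (org.apache.hadoop.conf).
        Configuration conf = new Configuration();

        // Obtain the file system configured for this cluster (org.apache.hadoop.fs).
        FileSystem fs = FileSystem.get(conf);

        // Write a small file (the path is illustrative).
        Path file = new Path("/tmp/fs-example.txt");
        FSDataOutputStream out = fs.create(file);
        out.writeUTF("hello hadoop");
        out.close();

        // Read it back.
        FSDataInputStream in = fs.open(file);
        System.out.println(in.readUTF());
        in.close();
    }
}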


MapReduce Framework Structure

MapReduce is a distributed computing model for large-scale data processing. It was originally designed and implemented by Google engineers and later published by Google.

MapReduce is defined as a programming model for processing and generating large-scale data sets. You define a map function that processes a key/value pair and produces a batch of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same key. Many real-world tasks can be expressed in this model.
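
As a concrete instance of this model, here is the canonical word-count example written against the classic org.apache.hadoop.mapred API that this article describes; it is a minimal sketch rather than production code.

import java.io.IOException;
import java.util.Iterator;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reducer;
import org.apache.hadoop.mapred.Reporter;

public class WordCount {

    // map(): turn one input line into a batch of intermediate <word, 1> pairs.
    public static class Map extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            StringTokenizer it = new StringTokenizer(value.toString());
            while (it.hasMoreTokens()) {
                word.set(it.nextToken());
                output.collect(word, ONE);
            }
        }
    }

    // reduce(): merge all intermediate values that share the same key.
    public static class Reduce extends MapReduceBase
            implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output, Reporter reporter)
                throws IOException {
            int sum = 0;
            while (values.hasNext()) {
                sum += values.next().get();
            }
            output.collect(key, new IntWritable(sum));
        }
    }
}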

Hadoop's MapReduce framework is implemented on the same principle. The following describes the main components of the framework and how they relate to each other.

2.1 Overall Structure

2.1.1 Mapper and Reducer

The most basic components of a MapReduce application running on Hadoop are a Mapper class and a Reducer class, plus a driver program that creates the JobConf. Some applications also include a Combiner class, which is itself an implementation of Reducer.
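
A minimal driver for the word-count classes sketched above might look like the following; it assumes the input and output paths come from the command line and reuses the reducer as the combiner.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(WordCountDriver.class);
        conf.setJobName("wordcount");

        // Mapper, combiner, and reducer; the combiner reuses the reducer implementation.
        conf.setMapperClass(WordCount.Map.class);
        conf.setCombinerClass(WordCount.Reduce.class);
        conf.setReducerClass(WordCount.Reduce.class);

        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));

        // Packages the job, submits it to the JobTracker, and waits for completion.
        JobClient.runJob(conf);
    }
}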

2.1.2 JobTracker and TaskTracker

Jobs are scheduled by a single master service, the JobTracker, together with multiple slave services, the TaskTrackers, running on the cluster nodes. The master is responsible for scheduling each sub-task of a job on the slaves and monitoring them, re-running any task that fails; the slaves are responsible for actually executing each task. A TaskTracker must run on an HDFS DataNode, while the JobTracker need not; the JobTracker should normally be deployed on a separate machine.

2.1.3 JobClient

On the client side, each job uses the JobClient class to package the application and its configuration parameters into a jar file, store it in HDFS, and submit the path to the JobTracker master service. The master then creates each task (MapTask and ReduceTask) and distributes them to the various TaskTracker services for execution.

2.1.4 JobInProgress

After the JobClient submits a job, the JobTracker creates a JobInProgress to track and schedule the job and adds it to the job queue. Based on the input dataset defined in the submitted job jar (already decomposed into FileSplits), the JobInProgress creates a corresponding batch of TaskInProgress objects to monitor and schedule the MapTasks, and at the same time creates the specified number of TaskInProgress objects to monitor and schedule the ReduceTasks; by default there is one ReduceTask.

2.1.5 TaskInProgress

When the JobTracker launches a task, it does so through each TaskInProgress: the Task objects (MapTask and ReduceTask) are serialized and written to the corresponding TaskTracker service. After the TaskTracker receives a task, it creates a corresponding TaskInProgress of its own (an implementation distinct from the one used inside the JobTracker) to monitor and schedule the task. The concrete task process is started by running the TaskRunner object managed by this TaskInProgress. The TaskRunner automatically loads the job jar, sets up the environment variables, and starts an independent Java child process to execute the task, that is, a MapTask or a ReduceTask; the two do not necessarily run on the same TaskTracker.

2.1.6 MapTask and ReduceTask

A complete job automatically executes the Mapper, the Combiner (when one is specified in the JobConf), and the Reducer in sequence. The Mapper and Combiner are invoked by the MapTask, while the Reducer is invoked by the ReduceTask; the Combiner is in fact an implementation of the Reducer interface. The Mapper reads <key, value> pairs from the input dataset defined in the job jar and, after processing, produces temporary intermediate <key, value> pairs. If a Combiner is defined, the MapTask calls it to merge the values of the same key in the Mapper output and so shrink the output set. After all MapTasks have finished, the ReduceTask processes invoke the Reducer to produce the final result pairs. This process is described in detail in the next section. (A figure here shows the main components of the MapReduce framework and their relationships.)


2.2 Job Creation Process

2.2.1 JobClient.runJob(): start the job and split the input dataset

A MapReduce job uses the JobClient class to split the input dataset into a batch of small datasets according to the InputFormat implementation class defined in the JobConf, and one MapTask is created to process each small dataset. The JobClient uses the default FileInputFormat class, calling its getSplits() method to generate the small datasets. If a data file is judged splitable by isSplitable(), the large file is divided into small FileSplits; of course, a FileSplit only records the file's path in HDFS, the offset, and the split size. This information is packaged into the job file's jar and stored in HDFS, and the job file path is then submitted to the JobTracker for scheduling and execution.
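
As a small example of the isSplitable() hook mentioned above, the following hypothetical InputFormat (the class name is invented for illustration) forces each input file into a single FileSplit, so exactly one MapTask is created per file.

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.TextInputFormat;

// Hypothetical InputFormat that keeps each file in one FileSplit,
// so one MapTask is created per input file.
public class NonSplittableTextInputFormat extends TextInputFormat {
    @Override
    protected boolean isSplitable(FileSystem fs, Path file) {
        return false;
    }
}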

2.2.2 JobClient.submitJob(): submit the job to the JobTracker

The job file is submitted through the RPC module (described in detail in its own chapter). Broadly, the JobClient calls the JobTracker's submitJob() method through a proxy interface implemented by RPC, and the JobTracker must implement the JobSubmissionProtocol interface. Using the job file path it receives, the JobTracker creates the series of job-related objects (such as JobInProgress and TaskInProgress) that schedule and execute the job. After the job is created successfully, the JobTracker returns a JobStatus object to the JobClient recording the job's status information, such as execution time and the progress of its map and reduce tasks. Based on this JobStatus object, the JobClient creates a NetworkedJob RunningJob object, which periodically pulls execution statistics from the JobTracker to monitor the job and print progress to the user's console. (A figure here shows the classes and methods involved in the job creation process.)

2.3 Job Execution Process

As mentioned above, jobs are scheduled centrally by the JobTracker, and the concrete tasks are distributed to the TaskTracker nodes for execution. The following walks through the execution process in detail, starting from the JobTracker receiving the JobClient's submission request.

2.3.1 Job and task queue initialization by the JobTracker

2.3.1.1 JobTracker.submitJob(): receive the request

When the JobTracker receives a new job request (that is, its submitJob() function is called), it creates a JobInProgress object and uses it to manage and schedule the tasks. During construction, the JobInProgress initializes a series of task-related parameters, such as the location of the job jar (which is copied from HDFS to a temporary directory in the local file system), the map and reduce data, the job priority, and the objects that record statistical reports.

2.3.1.2 JobTracker.resortPriority(): add to the queue and sort by priority

After the JobInProgress is created, it is first added to the jobs queue: a Map member variable jobs manages all job objects, and a List member variable jobsByPriority maintains their priority order. The JobTracker then calls the resortPriority() function to sort the jobs by priority and then by submission time, which ensures that the earliest-submitted jobs at the highest priority are executed first.
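
The ordering itself can be illustrated with a plain comparator; the sketch below uses stand-in types rather than Hadoop's actual classes and simply encodes "higher priority first, then earlier submission time first".

import java.util.Comparator;

// JobInfo is a stand-in class for this sketch, not the actual Hadoop type.
class JobInfo {
    int priority;      // larger value = higher priority (assumption for this sketch)
    long submitTime;   // milliseconds since epoch
}

class JobOrdering implements Comparator<JobInfo> {
    public int compare(JobInfo a, JobInfo b) {
        if (a.priority != b.priority) {
            return b.priority - a.priority;              // higher priority sorts first
        }
        return Long.compare(a.submitTime, b.submitTime); // then earlier submission first
    }
}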

2.3.1.3 JobTracker.jobInitThread: notify the initialization thread

The JobTracker then adds the job to a queue awaiting initialization, a List member variable jobInitQueue. Calling notifyAll() on this member wakes up a thread, jobInitThread, that initializes jobs (the JobTracker maintains several internal threads over the jobs queue; their implementations all live in the JobTracker code and are detailed later). When jobInitThread receives the signal, it takes the job at the front of the queue, that is, the job with the highest priority, and calls the initTasks() function of its JobInProgress to perform the real initialization work.
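
The queue-plus-notifyAll() pattern described here is ordinary Java monitor usage; the following simplified sketch (with invented names, not Hadoop's) shows how enqueuing a job and waking the initialization thread fit together.

import java.util.LinkedList;
import java.util.List;

// Simplified sketch of the queue-and-worker pattern described above.
class JobInitQueue {
    private final List<String> pendingJobs = new LinkedList<String>();

    // Called on the submission path: enqueue the job and wake the init thread.
    public synchronized void enqueue(String jobId) {
        pendingJobs.add(jobId);
        notifyAll();
    }

    // Called by the init thread: block until a job is available, then take it.
    public synchronized String take() throws InterruptedException {
        while (pendingJobs.isEmpty()) {
            wait();
        }
        return pendingJobs.remove(0);
    }
}

class JobInitWorker extends Thread {
    private final JobInitQueue queue;
    JobInitWorker(JobInitQueue queue) { this.queue = queue; }

    public void run() {
        try {
            while (true) {
                String jobId = queue.take();
                // In the real JobTracker, this is where JobInProgress.initTasks() runs.
                System.out.println("initializing " + jobId);
            }
        } catch (InterruptedException e) {
            // shut down
        }
    }
}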

2.3.1.4 JobInProgress.initTasks(): initialize the TaskInProgress objects

Initializing a task is a little more involved. First, the JobInProgress creates the map monitoring objects. In the initTasks() function it calls the JobClient's readSplitFile() to obtain the list of RawSplits into which the input data has already been decomposed, and then creates a corresponding number of map execution management objects, TaskInProgress. In this process it also records the hosts of all the DataNodes holding the HDFS blocks that correspond to each RawSplit; this information comes from the FileSplit's getLocations(), obtained when the RawSplit was created, which in turn calls getFileCacheHints() of DistributedFileSystem (a detail explained in the HDFS module). Of course, if the data is stored in the local file system, that is, when LocalFileSystem is used, there is only one location: "localhost".

Second, the JobInProgress creates the reduce monitoring objects. This is comparatively simple: they are created according to the number of reduces specified in the JobConf, and by default only one reduce task is created. The TaskInProgress class also monitors and schedules the reduce tasks, but it is constructed differently: TaskInProgress creates a concrete MapTask or ReduceTask depending on its constructor parameters.

After that, the JobInProgress constructs a JobStatus, records that the job is now running, and calls JobHistory.JobInfo.logStarted() to log the start of execution. At this point the job initialization inside the JobTracker is complete; execution proceeds asynchronously elsewhere, as introduced next. (A figure here shows the classes and methods involved in the job initialization process.)

2.3.2 TaskTracker

Task execution is actually initiated by the TaskTracker. The TaskTracker communicates with the JobTracker at regular intervals (10 seconds by default; see the HEARTBEAT_INTERVAL variable defined in the MRConstants class), reporting the execution status of its own tasks and receiving commands from the JobTracker. If there is a new task to execute, it is started at this point, when the TaskTracker calls the JobTracker's heartbeat() method; at the lowest level this call goes through a proxy interface over the IPC layer (detailed in the IPC section).

This process is fairly involved; the following briefly describes each step.

2.3.2.1 TaskTracker.run(): connect to the JobTracker

During startup, the TaskTracker initializes a series of parameters and services (described in a separate section) and then tries to connect to the JobTracker service (which must implement the InterTrackerProtocol interface). If the connection is broken, it loops, retrying the connection to the JobTracker and re-initializing all members and parameters. For details of this process see the run() method.

2.3.2.2 TaskTracker.offerService(): main loop

Once the connection to the JobTracker service succeeds, the TaskTracker calls the offerService() function and enters its main execution loop. This loop communicates with the JobTracker every 10 seconds, calling transmitHeartbeat() to obtain a HeartbeatResponse. It then calls the HeartbeatResponse's getActions() function to obtain all the commands passed down by the JobTracker, namely an array of TaskTrackerAction objects, and traverses the array. If an action is a new-task instruction, that is, a LaunchTaskAction, it calls the startNewTask() function to execute the new task; otherwise, actions such as KillJobAction or KillTaskAction are added to the tasksToCleanup queue and handed to a taskCleanupThread for processing.
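
The shape of this loop can be sketched in plain Java as follows; all class and method names below are stand-ins for illustration, not the actual Hadoop types.

// Highly simplified shape of the offerService() loop described above.
interface TrackerAction {}
class LaunchTask implements TrackerAction { String taskId; }
class KillTask implements TrackerAction { String taskId; }

class HeartbeatLoop {
    static final long HEARTBEAT_INTERVAL_MS = 10000L; // 10-second cycle, as in the text

    void offerService() throws InterruptedException {
        while (true) {
            // Report our status and receive the JobTracker's commands.
            TrackerAction[] actions = transmitHeartbeat();
            for (TrackerAction action : actions) {
                if (action instanceof LaunchTask) {
                    startNewTask((LaunchTask) action);  // new task: start it
                } else {
                    queueForCleanup(action);            // kill-job / kill-task: hand to cleanup thread
                }
            }
            Thread.sleep(HEARTBEAT_INTERVAL_MS);
        }
    }

    // Placeholders standing in for the real heartbeat and task-handling code.
    TrackerAction[] transmitHeartbeat() { return new TrackerAction[0]; }
    void startNewTask(LaunchTask t) {}
    void queueForCleanup(TrackerAction a) {}
}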

2.3.2.3 TaskTracker.transmitHeartbeat(): get commands from the JobTracker

In transmitHeartbeat(), the TaskTracker creates a new TaskTrackerStatus object recording the execution status of its current tasks, then calls the JobTracker's heartbeat() method over the IPC interface and receives the new commands, that is, the returned TaskTrackerAction array. Before making this call, the TaskTracker checks the number of tasks it is currently executing and the space available on the local disk; if it can accept new tasks, it sets the askForNewTask parameter of heartbeat() to true. After the call succeeds it updates the related statistics.

2.3.2.4 TaskTracker.startNewTask(): start a new task

The main work of this function is to create a TaskTracker$TaskInProgress object to schedule and monitor the task and add it to the runningTasks queue. It then calls localizeJob() to initialize the task and begin execution.

2.3.2.5 TaskTracker.localizeJob(): initialize the job directory, etc.

The main work of this function is to initialize the working directory workDir, copy the job jar package from HDFS to the local file system, and call RunJar.unJar() to unpack it into the working directory. It then creates a RunningJob, calls the addTaskToJob() function to add it to the runningJobs monitoring queue, and finally calls launchTaskForJob() to execute the task.
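
A rough sketch of this localization step, assuming illustrative paths and using FileSystem.copyToLocalFile() together with RunJar.unJar(), might look like this.

import java.io.File;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.util.RunJar;

public class LocalizeJobSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // Illustrative paths; the real TaskTracker derives these from the job configuration.
        Path jarInHdfs = new Path("/tmp/hadoop/job_0001/job.jar");
        File workDir = new File("/tmp/mapred/local/job_0001/work");
        File localJar = new File(workDir, "job.jar");
        workDir.mkdirs();

        // Copy the job jar out of HDFS, then unpack it into the working directory.
        fs.copyToLocalFile(jarInHdfs, new Path(localJar.getAbsolutePath()));
        RunJar.unJar(localJar, workDir);
    }
}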

2.3.2.6 TaskTracker.launchTaskForJob()

Starting the task is actually done by calling the launchTask() function of TaskTracker$TaskInProgress.

2.3.2.7 TaskTracker$TaskInProgress.launchTask(): execute the task

Before executing the task, launchTask() calls localizeTask() to update the JobConf file and write it to the local directory. It then creates a TaskRunner object by calling the task's createRunner() method and calls its start() method, which finally launches the task's independent Java child process to execute the task.

2.3.2.8 Task.createRunner(): create the runner object

Task has two concrete implementations, MapTask and ReduceTask, used for map and reduce tasks respectively. MapTask creates a MapTaskRunner to start the task's child process, while ReduceTask creates a ReduceTaskRunner for the same purpose.

2.3.2.9 TaskRunner.start(): the child process actually executes the task

This is where the child process is actually started and the task executed; the work is done in the run() function. The execution process is complex: the main job is to prepare a series of environment settings for launching the Java child process, including setting the working directory workDir and building the classpath environment variable (which must combine the TaskTracker's own environment variables with the job jar's path). It then loads the job jar package and calls the runChild() method to start the child process, which is created through a ProcessBuilder; at the same time, the child process's stdout, stderr, and syslog output is redirected to the output log directory specified for the task, with the actual output handled by the TaskLog class. There is one small issue here: the task child process can only output INFO-level logs, because this level is hard-coded in the run() function, although improving this would not be complicated. (A figure here shows the classes and methods involved in the job execution process.)
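
To make the ProcessBuilder-based launch concrete, here is a simplified, self-contained sketch; the child main class, paths, and classpath are illustrative, not the real TaskTracker$Child wiring.

import java.io.File;
import java.util.ArrayList;
import java.util.List;

public class ChildLauncherSketch {
    public static void main(String[] args) throws Exception {
        // Illustrative values; the real TaskRunner assembles these from the
        // TaskTracker's environment and the job configuration.
        String classpath = System.getProperty("java.class.path") + File.pathSeparator
                + "/tmp/mapred/local/job_0001/work/job.jar";
        File workDir = new File("/tmp/mapred/local/job_0001/work");
        File logDir = new File("/tmp/mapred/local/job_0001/logs");
        logDir.mkdirs();

        List<String> cmd = new ArrayList<String>();
        cmd.add("java");
        cmd.add("-classpath");
        cmd.add(classpath);
        cmd.add("some.child.MainClass");   // stands in for the task child entry point

        ProcessBuilder builder = new ProcessBuilder(cmd);
        builder.directory(workDir);                          // working directory for the child
        builder.redirectErrorStream(true);                   // merge stderr into stdout
        builder.redirectOutput(new File(logDir, "stdout"));  // send output to the task log dir

        Process child = builder.start();
        int exitCode = child.waitFor();
        System.out.println("child exited with " + exitCode);
    }
}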


2.4 JobTracker and TaskTracker

As described above, JobTracker and TaskTracker are the two most basic services in the MapReduce framework; all other processes are scheduled and executed by them. The following describes the services and threads created inside each of them; the detailed processes are covered in later sections.

2.4.1 JobTracker services and threads

The JobTracker is one of the most important classes in the MapReduce framework; it schedules the execution of all jobs, and only one JobTracker application is configured in a Hadoop system. After the JobTracker starts, it initializes several services and several internal threads that maintain the job execution process and its results; they are briefly described below.

First, the JobTracker starts an interTrackerServer, bound to the port configured by the "mapred.job.tracker" parameter in the Configuration (port 8012 by default). It serves two purposes: one is to receive and handle heartbeat and other requests from the TaskTrackers, which means it must implement the InterTrackerProtocol interface and protocol; the other is to receive and handle JobClient requests such as submitJob and killJob, which means it must also implement the JobSubmissionProtocol interface and protocol.

Second, it starts an infoServer running StatusHttpServer, listening on port 50030 by default. This is a web service that lets users query the execution status of jobs through a web interface.

The JobTracker also starts several threads. The expireLaunchingTasks thread stops tasks that have not reported progress within the timeout period. The expireTrackers thread retires TaskTrackers that may have been lost, so TaskTrackers that have not reported for a long time are no longer assigned new tasks. The retireJobs thread clears out jobs that finished long ago but still sit in the queue. The jobInitThread thread initializes jobs, as described in the previous section. The taskCommitQueue thread schedules all processing related to a task's FileSystem operations and records the task's status.

2.4.2 TaskTracker services and threads

The TaskTracker is also one of the most important classes in the MapReduce framework. It runs on every DataNode and schedules the actual execution of tasks, and it too starts some services and threads internally.

The TaskTracker likewise starts a StatusHttpServer service to provide a web interface for querying task execution status. Next, it starts a taskReportServer service, which is provided to its child processes, that is, the MapTask or ReduceTask started by the TaskRunner, so that they can report their status back to it. The child process startup command is built in TaskRunner.run(), and the service address and port are passed on the command line to the TaskTracker$Child class; they are obtained by calling the TaskTracker's getTaskTrackerReportAddress(), which is recorded when the taskReportServer service is created. The TaskTracker also starts a mapEventsFetcherThread thread to fetch the output data of the map tasks.

2.5 Job Status Monitoring
