Hadoop Learning Notes: The Relationship Between MapReduce Tasks, NameNode, DataNode, JobTracker, and TaskTracker

1. Basic Concepts

In MapReduce, an application submitted for execution is called a job, and a unit of work carved out of a job to run on an individual compute node is called a task. In addition, the Hadoop Distributed File System (HDFS) handles data storage across the nodes and provides high-throughput reads and writes.

Hadoop uses a master/slave architecture for both distributed storage and distributed computation. On a fully configured cluster, getting the Hadoop "elephant" running means running a set of daemons across the cluster's machines. These daemons play different roles: NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker. The NameNode, Secondary NameNode, and JobTracker run on master nodes, while each slave node runs a DataNode and a TaskTracker, so that the tasks running on a slave can, as far as possible, process data stored on that same machine. One point about the master nodes deserves special mention: in a small cluster the Secondary NameNode may run on a slave node, while in a large cluster the NameNode and JobTracker are deployed on two separate servers.

2. NameNode

The NameNode manages the file system namespace. It maintains the file system tree and the metadata for every file and directory in that tree. This information is managed in two files: the namespace image (fsimage) and the edit log. Both are held in memory and are, of course, also persisted to the local disk. The NameNode also records, for each file, the DataNodes on which every block resides, but it does not persist this information, because it is rebuilt from the DataNodes' reports at system startup.

[Figure: abstract structure of the NameNode]

The client interacts with the NameNode and the DataNodes on behalf of the user to access the file system. The client exposes a series of file-system interfaces, so when programming against HDFS we do not need to know about the NameNode and DataNodes at all in order to implement the functionality we want.
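
As a minimal, hedged sketch of this idea in Java (using the standard org.apache.hadoop.fs API; the file path is hypothetical), the following snippet writes and then reads a file purely through the FileSystem abstraction, without the program ever naming a DataNode:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // resolves to the configured HDFS

            Path file = new Path("/tmp/hello.txt");     // hypothetical path
            FSDataOutputStream out = fs.create(file);   // NameNode chooses DataNodes for the blocks
            out.writeUTF("hello hdfs");
            out.close();

            FSDataInputStream in = fs.open(file);       // NameNode returns block locations,
            System.out.println(in.readUTF());           // data is streamed from the DataNodes
            in.close();
        }
    }

The point of the example is exactly what the paragraph above says: the program only talks to the FileSystem interface, and the NameNode/DataNode division of labor stays hidden behind it.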

2.1 NameNode Fault-Tolerance Mechanisms

HDFS cannot work without the NameNode. In fact, if the machine running the NameNode were destroyed, the files in the system would be lost completely, because there would be no way to reconstruct them from the blocks scattered across the DataNodes. The NameNode's fault tolerance is therefore very important, and Hadoop provides two mechanisms for it.

The first is to back up the persistent file-system metadata stored on the local disk. Hadoop can be configured so that the NameNode writes its persistent state to multiple file systems. These writes are synchronous and atomic. A common configuration is to write to a remotely mounted network file system in addition to the local disk.
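
A hedged illustration of this configuration: dfs.name.dir is the property name used by 0.20-era Hadoop releases, the directories below are hypothetical, and setting it from Java is only for demonstration, since it normally lives in hdfs-site.xml.

    import org.apache.hadoop.conf.Configuration;

    public class NameDirSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // A local directory plus a remotely mounted NFS directory (example paths).
            // The NameNode writes its namespace image and edit log to every listed directory.
            conf.set("dfs.name.dir", "/data/1/dfs/nn,/mnt/nfs/dfs/nn");
            System.out.println(conf.get("dfs.name.dir"));
        }
    }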

The second is to run a Secondary NameNode. Despite the name, the Secondary NameNode cannot act as a NameNode. Its primary role is to periodically merge the namespace image with the edit log, preventing the edit log from growing too large. The Secondary NameNode usually runs on a separate physical machine, because the merge operation takes a lot of CPU time and as much memory as the NameNode itself. It keeps a copy of the merged namespace image, which can be used if the NameNode ever goes down.

However, the Secondary NameNode's state always lags behind the primary NameNode's, so some data loss is almost inevitable when the NameNode fails. In that case, the usual practice is to combine this with the first mechanism: copy the NameNode metadata files from the remotely mounted network file system (NFS) to the Secondary NameNode, and then run the Secondary NameNode as the new primary NameNode.

3. DataNode

DataNodes are the worker nodes of the file system. They store and retrieve blocks as directed by clients or the NameNode, and they periodically send the NameNode a list of the blocks they are storing.
Every slave server in the cluster runs a DataNode daemon, which reads and writes HDFS data blocks to files on the local file system. When a client needs to read or write data, the NameNode tells it which DataNode should perform the specific read or write; the client then communicates directly with the daemon on that DataNode server to transfer the relevant data block.
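
The division of labor just described (the NameNode hands out block locations, the DataNodes serve the bytes) can be observed from a client. A minimal, hedged sketch, assuming the hypothetical /tmp/hello.txt file from the earlier example already exists:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class BlockLocationSketch {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            FileStatus status = fs.getFileStatus(new Path("/tmp/hello.txt")); // hypothetical path

            // The NameNode answers this query; the hosts returned are DataNodes.
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                for (String host : block.getHosts()) {
                    System.out.println("block at offset " + block.getOffset() + " stored on " + host);
                }
            }
        }
    }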

You might ask: with so many servers, how does Hadoop guarantee consistency? The answer is simple: HDFS files are written once and cannot be modified afterwards, so this consistency problem does not arise.

4. Secondary NameNode

As the name implies, this is the "second" NameNode. Its functions cover the following two aspects:

Backup: it keeps a backup of the NameNode's namespace state.

Merging log files: it periodically merges the image file (fsimage) with the operation log (edit log); in effect, it helps the NameNode by regularly folding the edit log back into the image file.
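
How often that merge happens is driven by configuration. A hedged sketch follows: fs.checkpoint.period and fs.checkpoint.size are the property names used by 0.20-era releases, the values are only examples, and in practice they would be set in the configuration files rather than in code.

    import org.apache.hadoop.conf.Configuration;

    public class CheckpointConfigSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // Merge fsimage and the edit log every hour (value in seconds), or sooner
            // if the edit log grows past 64 MB -- example values only.
            conf.set("fs.checkpoint.period", "3600");
            conf.set("fs.checkpoint.size", String.valueOf(64 * 1024 * 1024));
            System.out.println(conf.get("fs.checkpoint.period") + " s / "
                    + conf.get("fs.checkpoint.size") + " bytes");
        }
    }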

5. JobTracker

The JobTracker, as the name implies, tracks the execution of job tasks. After user code is submitted to the cluster, the JobTracker decides which files will be processed and assigns tasks to the various nodes. At the same time it monitors all running tasks; if a task fails, the JobTracker automatically restarts it, in most cases on a different, idle node. Each Hadoop cluster has only one JobTracker, which typically runs on a master node of the cluster. Here we describe it in more detail:

5.1 JobClient

Once we have configured a job, we can submit it to the JobTracker, and the JobTracker then schedules appropriate TaskTrackers to complete it. So what does MapReduce actually do during this process? That is the question discussed below; the focus here is on what the client does during the job-submission process. First, let's get a grip on the process as a whole.

In Hadoop, a job is abstracted by a Job object. For job submission, the first thing to introduce is the big player on the client side: JobClient, the client's actual worker. Besides completing some necessary work on its own, JobClient is also responsible for interacting with the JobTracker. Most of the client side of job submission is therefore done by JobClient, and its detailed submission process is roughly as follows:

After JobClient obtains the job ID assigned by the JobTracker, it creates a separate directory for the job under the JobTracker's system directory (on HDFS), named after the job ID. This directory contains files such as job.xml, job.jar, and job.split: job.xml records the job's detailed configuration, job.jar holds the user-defined map and reduce implementations, and job.split holds the input-split information for the job's tasks. Exactly how JobClient sets up the job's runtime environment and how it slices the job's input data deserve a closer look; the sketch below shows only the client-facing side.
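
A minimal, hedged sketch of driving this from the client side with the old org.apache.hadoop.mapred API: the input/output paths and the job name are hypothetical, and the JVM running it is assumed to have the cluster configuration on its classpath. With no explicit mapper or reducer it runs the identity classes; a real job would set them, and those user classes are what end up in job.jar.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class SubmitJobSketch {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(SubmitJobSketch.class); // serialized to job.xml in the staging directory
            conf.setJobName("identity-sketch");

            // Key/value types matching the default TextInputFormat plus identity mapper/reducer.
            conf.setOutputKeyClass(LongWritable.class);
            conf.setOutputValueClass(Text.class);

            FileInputFormat.setInputPaths(conf, new Path("/input"));   // split metadata ends up in job.split
            FileOutputFormat.setOutputPath(conf, new Path("/output")); // hypothetical paths

            // JobClient asks the JobTracker for a job ID, copies job.xml/job.jar/job.split
            // into the JobTracker's system directory on HDFS, submits the job ID,
            // then waits for completion, printing progress as it goes.
            JobClient.runJob(conf);
        }
    }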

5.2 JobTracker

The previous subsection covered what the client-side JobClient does to submit a job; now let's look at what the JobTracker does for that submission, which boils down to: 1. generate the job; 2. accept the job.

As mentioned, the client's JobClient saves all information relevant to the job into the JobTracker's system directory (on HDFS, of course). One of the biggest benefits is that the client does this work itself, which reduces the load on the server-side JobTracker. Now let's see how the JobTracker completes the acceptance of a client job. By the way, it is worth noting that when the client's JobClient formally submits the job to the JobTracker, it only passes the job ID: since all information related to the job already lives in the JobTracker's system directory, the JobTracker can locate the job directory from the job ID alone.

The JobTracker-side part of the submission can be briefly described as the following steps:

1. Create a JobInProgress object for the job

The JobInProgress object records the job's configuration information and how it is to be executed, specifically the map and reduce tasks the job is decomposed into. While creating the JobInProgress object, two main things happen: first, the job's job.xml and job.jar files are copied from the job directory to the JobTracker's local file system (job.xml -> */jobtracker/jobid.xml, job.jar -> */jobtracker/jobid.jar); second, a JobStatus object and the job's MapTask and ReduceTask queues are created to track the job's status information.

2. Check whether the client has permission to submit the job

The JobTracker actually delegates the check of whether the client has permission to submit the job to the QueueManager.

3. Check that the current MapReduce cluster meets the job's memory requirements

Before submitting a job, the client configures the memory requirements of the job's tasks according to the actual application, while the JobTracker, in order to increase overall throughput, limits how much memory a job's tasks may request. So when the job is submitted, the JobTracker needs to check whether the job's memory requirements fit within the JobTracker's configured limits; a hedged sketch of such a check follows.
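
A hedged sketch of the kind of check described above. The property names follow the memory-based scheduling settings of Hadoop 1.x-era releases and should be treated as assumptions here; the logic is only illustrative.

    import org.apache.hadoop.conf.Configuration;

    public class MemoryCheckSketch {
        // Compare the per-task memory a job requests against the cluster-wide limits.
        static boolean fitsClusterLimits(Configuration conf) {
            long requestedMapMb = conf.getLong("mapred.job.map.memory.mb", -1);
            long requestedReduceMb = conf.getLong("mapred.job.reduce.memory.mb", -1);
            long maxMapMb = conf.getLong("mapred.cluster.max.map.memory.mb", -1);
            long maxReduceMb = conf.getLong("mapred.cluster.max.reduce.memory.mb", -1);

            // In this sketch, -1 means "no limit configured".
            boolean mapOk = maxMapMb == -1 || requestedMapMb <= maxMapMb;
            boolean reduceOk = maxReduceMb == -1 || requestedReduceMb <= maxReduceMb;
            return mapOk && reduceOk;
        }

        public static void main(String[] args) {
            Configuration conf = new Configuration();
            conf.setLong("mapred.job.map.memory.mb", 1024);          // example request
            conf.setLong("mapred.cluster.max.map.memory.mb", 2048);  // example cluster limit
            System.out.println("job accepted: " + fitsClusterLimits(conf));
        }
    }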

Once the checks above pass, the JobTracker-side submission process is complete and can be summed up as: create the JobInProgress, verify permissions, and verify memory requirements.

6. TaskTracker

The TaskTracker works hand in hand with the DataNode that is responsible for storing the data, and its processing structure also follows the master/slave architecture: the JobTracker sits on the master node and coordinates the MapReduce work, while the TaskTrackers sit on the slave nodes and each manages its own tasks independently. Each TaskTracker is responsible for executing the specific tasks assigned to it, while the JobTracker is responsible for assigning them. Although each slave node has only one TaskTracker, a TaskTracker can spawn multiple Java virtual machines (JVMs) to run several map and reduce tasks in parallel. An important function of the TaskTracker is its regular interaction with the JobTracker: if the JobTracker does not receive a TaskTracker's report on time, it decides that the TaskTracker has crashed and assigns its tasks to other nodes for processing. The sketch below shows the configuration knobs behind these two behaviors.
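
A hedged sketch with 0.20-era property names; the values are examples only and would normally be set in mapred-site.xml rather than in code.

    import org.apache.hadoop.conf.Configuration;

    public class TaskTrackerConfigSketch {
        public static void main(String[] args) {
            Configuration conf = new Configuration();
            // How many map/reduce tasks one TaskTracker may run at the same time,
            // each in its own child JVM.
            conf.set("mapred.tasktracker.map.tasks.maximum", "4");
            conf.set("mapred.tasktracker.reduce.tasks.maximum", "2");
            // If the JobTracker hears nothing from a TaskTracker for this long (ms),
            // it treats the TaskTracker as lost and reschedules its tasks elsewhere.
            conf.set("mapred.tasktracker.expiry.interval", "600000");
            System.out.println("map slots: " + conf.get("mapred.tasktracker.map.tasks.maximum"));
        }
    }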

6.1 Internal Design and Implementation of the TaskTracker

Hadoop implements its MapReduce framework with a master/slave design: the JobTracker node, as the master, manages and schedules the jobs users submit, while the TaskTracker nodes, as workers, execute the map/reduce tasks assigned to them by the JobTracker. The whole cluster consists of one JobTracker node and a number of TaskTracker nodes, and the JobTracker node is of course also responsible for managing the TaskTracker nodes.

As a worker node, the TaskTracker not only has to interact frequently with the JobTracker node to receive tasks and execute them locally, but also has to interact with other TaskTracker nodes to cooperate on the same job. In the Hadoop-0.20.2.0 implementation, the design of the TaskTracker worker node therefore consists of three kinds of components: service components, management components, and worker components. The service components handle communication not only with other TaskTracker nodes but also with the JobTracker node; the management components manage the tasks, jobs, JVM instances, and memory on the node; and the worker components are responsible for scheduling the execution of map/reduce tasks.
