Hadoop: A Closer Look at the Roles of the Five Daemon Processes



1. What is the nature of a job?
2. What is the nature of a task?
3. Who manages the file system namespace, and what is the namespace for?
4. What are the roles of the namespace image file (fsimage) and the edit log file (edit log)?
5. The NameNode records which DataNodes hold each block of each file, but it does not persist this information. Why?
6. Does the client go through the NameNode when reading or writing data?
7. What is the relationship between the NameNode, the DataNodes, the namespace image, and the edit log?
8. When a task fails, how does the JobTracker handle it?
9. After the JobClient obtains the job ID assigned by the JobTracker, it creates a separate directory for the job, named after the job ID, under the JobTracker's system directory (on HDFS), containing files such as job.xml and job.jar. What are these two files for?
10. How does the JobTracker find this job directory?
11. Why does the JobTracker check memory before accepting a submitted job?
12. Why can each TaskTracker spawn multiple Java Virtual Machines (JVMs)?








Overview:



Hadoop is a software framework for distributed processing of large amounts of data. It implements Google's MapReduce programming model and framework, splitting an application into small units of work that can be executed on any node of the cluster. In MapReduce, an application that is ready to be submitted for execution is called a job, and a unit of work carved out of a job and run on a compute node is called a task. In addition, Hadoop's Distributed File System (HDFS) is responsible for storing data on each node and achieves high-throughput reading and writing of data.

Hadoop uses a master/slave architecture for both distributed storage and distributed computation. In a fully configured cluster, getting the Hadoop "elephant" running requires a series of daemon (background) programs running across the cluster. Different daemons play different roles: NameNode, DataNode, Secondary NameNode, JobTracker, and TaskTracker. The NameNode, Secondary NameNode, and JobTracker run on master nodes, while each slave node runs a DataNode and a TaskTracker, so that tasks run on the same server that holds their data and can process local data as directly as possible. One point about the master nodes deserves special mention: in a small cluster the Secondary NameNode may live on a slave node, while in a large cluster the NameNode and JobTracker are deployed on two separate servers.

We are already familiar with these five processes, but in practice we often run into problems with them, and to solve those problems we first need to understand their principles and roles.



1. NameNode Introduction

The NameNode manages the namespace of the file system. It maintains the file system tree (filesystem tree) and the metadata for every file and directory in that tree. This information is kept in two files, the namespace image file (fsimage) and the edit log file (edit log); it is held in memory and, of course, also persisted to the local disk. The NameNode also records which DataNodes hold the blocks of each file, but it does not persist that information, because it is rebuilt from the DataNodes when the system starts.

[Figure: the NameNode structure, abstracted]






The client interacts with the NameNode and the DataNodes on behalf of the user to access the file system. The client exposes a series of file system interfaces, so when we program against it we do not need to know anything about the NameNode and DataNodes in order to implement the functionality we want.
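As a minimal sketch of this idea (assuming a reachable HDFS cluster whose address is picked up from the usual configuration files; the path /tmp/hello.txt is purely illustrative), a client can write and read a file through the org.apache.hadoop.fs.FileSystem interface without ever naming a NameNode or a DataNode:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsClientSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // reads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // the NameNode address comes from the configuration

        Path file = new Path("/tmp/hello.txt");          // hypothetical path, for illustration only
        FSDataOutputStream out = fs.create(file, true);  // the NameNode allocates blocks, data flows to DataNodes
        out.writeBytes("hello hdfs\n");
        out.close();

        BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)));
        System.out.println(in.readLine());               // block locations are resolved behind the scenes
        in.close();
        fs.close();
    }
}

All of the NameNode and DataNode traffic described below happens underneath these few calls.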

1.1 NameNode fault-tolerance mechanism

HDFS cannot work without the NameNode. In fact, if the machine running the NameNode were destroyed, the files in the system would be lost completely, because there would be no way to reconstruct the files from the blocks scattered across the different DataNodes. The NameNode's fault tolerance is therefore very important, and Hadoop provides two mechanisms for it.
The first is to back up the file system metadata that the NameNode persists to the local disk. Hadoop can be configured so that the NameNode writes its persistent state to several file systems. These writes are synchronous and atomic. A common configuration is to write to a remotely mounted network file system as well as to the local disk.
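A minimal sketch of the shape of that configuration, using the dfs.name.dir property that also appears in section 3.1 (the two paths here are hypothetical, and in practice the value is set in hdfs-site.xml rather than in code):

import org.apache.hadoop.conf.Configuration;

public class NameDirConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        // One local directory plus one remotely mounted NFS directory;
        // the NameNode writes its fsimage and edit log to every listed directory.
        conf.set("dfs.name.dir", "/data/dfs/name,/mnt/nfs/dfs/name");
        System.out.println("NameNode metadata directories: " + conf.get("dfs.name.dir"));
    }
}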
The second is to run an auxiliary NameNode (Secondary NameNode). The Secondary NameNode cannot actually act as a NameNode; its primary role is to periodically merge the namespace image with the edit log, to keep the edit log from growing too large. The Secondary NameNode typically runs on a separate physical machine, because the merge operation takes a lot of CPU time and as much memory as the NameNode itself. It keeps a copy of the merged namespace image, which can be pressed into service if the NameNode ever goes down.
But the Secondary NameNode always lags behind the primary NameNode, so some data loss is unavoidable when the NameNode fails. In that case, the usual approach is to take the NameNode metadata files from the remotely mounted network file system (NFS) mentioned in the first mechanism, copy them to the Secondary NameNode, and then run the Secondary NameNode as the new primary NameNode.



----------------------------------------------------------------------------------------------------


2. DataNode Introduction

DataNodes are the working nodes of the file system. They store and retrieve blocks as directed by clients or by the NameNode, and they periodically send the NameNode the list of blocks they are storing. Every slave server in the cluster runs a DataNode daemon, which is responsible for reading and writing HDFS blocks to the local file system. When a client needs to read or write data, the NameNode tells it which DataNode to use for the specific read or write, and the client then communicates directly with the daemon on that DataNode server to read or write the relevant block.
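To make that flow concrete, here is a small sketch (assuming the classic FileSystem API; the file path is supplied by the caller) that asks the NameNode which DataNodes hold the blocks of a file; reading the actual bytes would then stream directly from those DataNodes:

import java.util.Arrays;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.BlockLocation;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class BlockLocationsSketch {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path file = new Path(args[0]);                    // file to inspect
        FileStatus status = fs.getFileStatus(file);
        // Metadata query answered by the NameNode: which hosts hold each block?
        BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
        for (BlockLocation block : blocks) {
            System.out.println("block at offset " + block.getOffset()
                               + " on DataNodes " + Arrays.toString(block.getHosts()));
        }
        fs.close();
    }
}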




----------------------------------------------------------------------------------------------------


3. Secondary NameNode Introduction

The Secondary NameNode is an auxiliary daemon used to monitor the state of HDFS. Like the NameNode, each cluster has one Secondary NameNode, deployed on its own server. Unlike the NameNode, the Secondary NameNode does not receive or record any real-time changes to HDFS; instead, it communicates with the NameNode in order to periodically save snapshots of the HDFS metadata. Because the NameNode is a single point of failure, the Secondary NameNode's snapshots help minimize downtime and data loss when the NameNode fails. At the same time, if the NameNode runs into trouble, the Secondary NameNode can be brought up as a standby NameNode in a timely manner.

3.1 The directory structure of the NameNode is as follows:

${dfs.name.dir}/current/VERSION
                       /edits
                       /fsimage
                       /fstime


3.2 The directory structure of the Secondary NameNode is as follows:

${fs.checkpoint.dir}/current/VERSION
                            /edits
                            /fsimage
                            /fstime
                    /previous.checkpoint/VERSION
                                        /edits
                                        /fsimage
                                        /fstime


In short, the Secondary NameNode's main job is to merge the namespace image (fsimage) with the edit log.

So what do these two files actually do? When a client performs a write operation, the NameNode first records it in the edit log (to me this file feels a bit like Oracle's online redo log files) and keeps a copy of the file system metadata in memory.

The namespace image (fsimage) file is a persisted checkpoint of the file system metadata. It is not updated after every write operation, because writing the fsimage is very slow (in that respect it is more like an Oracle datafile).

Because the edit log keeps growing, a NameNode restart would leave the NameNode in safe mode, and therefore unusable, for a long time, which runs counter to Hadoop's original design. The edit log therefore has to be merged periodically, but if the NameNode did this work itself it would consume a lot of its resources, so the Secondary NameNode exists to do the checkpointing of the image. The steps are as follows:
(1) The Secondary NameNode asks the NameNode to roll the edit log (that is, create a new edit log); new edits are recorded in the newly created edit log file.
(2) The fsimage and edits files are read from the NameNode to the Secondary NameNode via HTTP GET.
(3) The Secondary NameNode loads the fsimage into memory and then applies every operation in edits (much like an Oracle Data Guard standby applying redo logs), producing a new fsimage file; at this point the checkpoint has been created.
(4) The new fsimage file is sent back to the NameNode via HTTP POST.
(5) The NameNode replaces the old fsimage with the new fsimage file, lets the edit log created in step (1) replace the old edits file, and updates the checkpoint time of the fsimage.
That completes the whole process. What the Secondary NameNode does, then, is periodically merge the fsimage and edits files, so that a NameNode restart does not leave the system inaccessible for a long time.
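The cadence of this cycle is driven by a couple of configuration knobs. A minimal sketch of reading them (the property names and fallback values shown are the usual Hadoop 0.20.x ones as I recall them; in practice they live in the cluster's configuration files, not in code):

import org.apache.hadoop.conf.Configuration;

public class CheckpointConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        long period = conf.getLong("fs.checkpoint.period", 3600);      // seconds between checkpoints
        long size   = conf.getLong("fs.checkpoint.size", 67108864L);   // also checkpoint once edits reach this size (bytes)
        String dir  = conf.get("fs.checkpoint.dir", "/tmp/dfs/namesecondary"); // where the Secondary NameNode keeps its copies
        System.out.println("checkpoint every " + period + "s, or when edits exceed "
                           + size + " bytes, stored under " + dir);
    }
}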

----------------------------------------------------------------------------------------------------


4. JobTracker Introduction

The JobTracker daemon connects applications with Hadoop. After the user code is submitted to the cluster, the JobTracker decides which files will be processed and assigns the resulting tasks to nodes. At the same time it monitors all running tasks; once a task fails, the JobTracker automatically restarts it, in most cases on a different, less busy node. Each Hadoop cluster has exactly one JobTracker, which typically runs on the cluster's master node.

Let's describe this in more detail:


4.1 JobClient

After we have configured a job, we can submit it to the JobTracker, which then schedules suitable TaskTrackers to complete the job. So what does MapReduce do during this process? That is what this post and the next one discuss; this one mainly follows the client through the job submission process. Let's grasp the process as a whole first!



In Hadoop, a job is abstracted by the Job object, and before going further I first have to introduce a big player, JobClient, which is the part of the client that does the actual work. Besides completing some necessary work on its own, JobClient is also responsible for interacting with the JobTracker. Most of the client's side of job submission is therefore carried out by JobClient, and its detailed submission process is roughly as follows:



After JobClient obtains the job ID assigned by the JobTracker, it creates a separate directory for the job under the JobTracker's system directory (on HDFS); the directory is named after the job ID and contains files such as job.xml, job.jar, and job.split. The job.xml file records the job's detailed configuration information, job.jar holds the user-defined map and reduce code for the job, and job.split holds the job's input split information. The points worth elaborating here are how JobClient sets up the job's runtime environment and how it splits the job's input data.
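For reference, a minimal sketch of the client side using the classic org.apache.hadoop.mapred API (the identity mapper and reducer are used by default; the input and output paths are whatever the caller passes in). Behind the single runJob() call, JobClient obtains the job ID, populates the job directory with job.xml, job.jar, and job.split, and then submits the job:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;

public class SubmitJobSketch {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SubmitJobSketch.class); // this configuration becomes job.xml
        conf.setJobName("submit-sketch");
        // TextInputFormat is the default, so the identity map/reduce emits <LongWritable, Text>.
        conf.setOutputKeyClass(LongWritable.class);
        conf.setOutputValueClass(Text.class);
        FileInputFormat.setInputPaths(conf, new Path(args[0]));   // drives the job.split contents
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        // The jar containing this class is what ends up as job.jar in the job directory.
        JobClient.runJob(conf);   // get job ID, copy job files to HDFS, submit, wait for completion
    }
}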



4.2 JobTracker

The above covered the work the client-side JobClient does to submit a job; here we need to talk about what exactly the JobTracker does with the submitted job, which comes down to two things: first, creating the job (a JobInProgress); second, accepting the job submission.

As we all know, the client-side JobClient saves all of the job's related information to the JobTracker's system directory (on HDFS, of course), and one of the biggest benefits is that the client does this work itself, reducing the load on the server-side JobTracker. Let's now look at how the JobTracker completes the submission of a client job! Oh, by the way, I should mention that when the client-side JobClient formally submits a job to the JobTracker, it only hands over the job ID; because all of the job's information already sits in the JobTracker's system directory, the JobTracker can locate the job directory from the job ID alone.


[Figure: how the JobTracker handles a submitted job]



For the job submission process above, I will briefly describe each step:

1. Create the job's JobInProgress

The JobInProgress object records in detail the job's configuration information and how it is carried out, specifically the map and reduce tasks the job is decomposed into. During the creation of the JobInProgress object, two main things happen: first, the job's job.xml and job.jar files are copied from the job directory to the JobTracker's local file system (job.xml -> */jobtracker/jobid.xml, job.jar -> */jobtracker/jobid.jar); second, a JobStatus and the queues of the job's MapTasks and ReduceTasks are created to track the job's status information.



2. Check whether the client has permission to submit the job

The JobTracker actually hands the check of whether the client has permission to submit the job over to the QueueManager.



3. Check that the current MapReduce cluster meets the job's memory requirements

Before the client submits the job, it configures the memory requirements of the job's tasks according to the actual application; the JobTracker, in order to improve job throughput, limits how much memory a job's tasks may demand, so when a job is submitted, the JobTracker needs to check whether the job's memory requirements satisfy the JobTracker's settings.


Once the above process is complete, it can be summed up as follows:

[Figure: summary of the job submission process]

----------------------------------------------------------------------------------------------------

5. TaskTracker Introduction


TaskTrackers are co-located with the DataNodes that store the data, and their processing structure also follows the master/slave architecture: the JobTracker sits on the master node and directs the MapReduce work, while the TaskTrackers sit on the slave nodes and each manages its own tasks independently. Each TaskTracker is responsible for executing its specific tasks, while the JobTracker is responsible for assigning them. Although each slave node has only one TaskTracker, a TaskTracker can spawn multiple Java Virtual Machines (JVMs) to process multiple map and reduce tasks in parallel. An important function of the TaskTracker is to interact with the JobTracker: if the JobTracker does not receive a TaskTracker's reports on time, it decides that the TaskTracker has crashed and assigns its tasks to other nodes.
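How many task JVMs run in parallel is governed by the TaskTracker's map and reduce slots. A minimal sketch of the relevant settings (the property names and fallback values are the usual 0.20.x ones as I recall them; in practice they are set in mapred-site.xml, not in code):

import org.apache.hadoop.conf.Configuration;

public class TaskTrackerSlotsSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        int mapSlots    = conf.getInt("mapred.tasktracker.map.tasks.maximum", 2);    // concurrent map task JVMs
        int reduceSlots = conf.getInt("mapred.tasktracker.reduce.tasks.maximum", 2); // concurrent reduce task JVMs
        String childOpts = conf.get("mapred.child.java.opts", "-Xmx200m");           // JVM options for each task
        System.out.println(mapSlots + " map slots, " + reduceSlots
                           + " reduce slots, child JVM opts: " + childOpts);
    }
}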


5.1 Internal design and implementation of the TaskTracker

Hadoop uses a master-slave architecture to implement the MapReduce framework: the JobTracker node, as the master, manages and schedules the jobs submitted by users, while the TaskTracker nodes, as worker nodes, execute the map/reduce tasks assigned to them by the JobTracker. The whole cluster consists of one JobTracker node and several TaskTracker nodes, and of course the JobTracker node is also responsible for managing the TaskTracker nodes. In a previous series of posts I systematically described the internal design and implementation of the JobTracker node; in this post I will give a comprehensive overview of the internal design and implementation of the TaskTracker node.

As a worker node, the TaskTracker not only has to interact frequently with the JobTracker node to obtain job tasks and execute them locally, but also has to interact with other TaskTracker nodes to cooperate in completing the same job. Therefore, in the current Hadoop 0.20.2.0 implementation, the TaskTracker worker node is designed around three kinds of components: service components, management components, and worker components. The service components handle communication not only with other TaskTracker nodes but also with the JobTracker node; the management components manage the tasks, jobs, JVM instances, and memory on the node; and the worker components are responsible for scheduling the execution of map/reduce tasks. The detailed composition of these three kinds of components is as follows:

[Figure: composition of the TaskTracker's service, management, and worker components]

The following is a detailed description of these three types of components:

Service Components

The service components inside the TaskTracker node are not only used to provide services to other TaskTracker nodes and to clients, they are also responsible for requesting services from them; they mainly comprise three components: HttpServer, TaskReportServer, and JobClient.

1.HttpServer

The TaskTracker node uses an embedded Jetty web container to provide an HTTP service. This HTTP service serves two purposes: first, it offers clients a task-log query service; second, it provides data transfer, that is, reduce tasks fetch their own map output data through the HTTP service exposed by the TaskTracker node. As for the configuration parameters of this service, the cluster administrator can set the service address and port in the TaskTracker node's configuration file through the configuration item mapred.task.tracker.http.address. In addition, to allow the throughput of the service to be tuned flexibly, the administrator can also set the number of worker threads inside the HTTP service through the configuration item tasktracker.http.threads.
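As a minimal sketch, these are the two settings just mentioned, read the way a program would read them (the fallback values shown, 0.0.0.0:50060 and 40, are the usual 0.20.x defaults as I recall them):

import org.apache.hadoop.conf.Configuration;

public class TaskTrackerHttpConfigSketch {
    public static void main(String[] args) {
        Configuration conf = new Configuration();
        String httpAddress = conf.get("mapred.task.tracker.http.address", "0.0.0.0:50060"); // task-log queries and map output transfer
        int httpThreads    = conf.getInt("tasktracker.http.threads", 40);                   // worker threads inside the Jetty HTTP service
        System.out.println("TaskTracker HTTP service at " + httpAddress
                           + " with " + httpThreads + " worker threads");
    }
}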

2.Task Reporter

After the TaskTracker node receives a map/reduce task from the JobTracker node, it hands the task to a JVM instance to execute, and it needs to collect the execution progress of those tasks; this requires the task, while executing inside the JVM instance, to continuously report its current progress to the TaskTracker node. Although the TaskTracker node and the JVM instance run on the same machine, the inter-process communication between them is done through network I/O (the performance of this communication is not discussed here). That is, the TaskTracker node opens an internal port that provides a progress-reporting service specifically for its task instances. The service address can be set through the configuration item mapred.task.tracker.report.address, and the number of worker threads inside the service is twice the number of map/reduce slots on that TaskTracker node.

