The data processing framework in Hadoop 1.0 and 2.0: MapReduce

Source: Internet
Author: User
Tags: hadoop, mapreduce

1. MapReduce: the "map" and "reduce" programming model

Operating principle:

2. The implementation of MapReduce in Hadoop V1

Hadoop 1.0 refers to the Apache Hadoop 0.20.x and 1.x releases, or the CDH3 series. It consists mainly of HDFS and MapReduce, where MapReduce is an offline processing framework made up of a programming model (the old and new APIs), a runtime environment (JobTracker and TaskTracker), and a data processing engine (MapTask and ReduceTask).

2.1 Components of MapReduce in Hadoop V1
    • NameNode records how files are split into blocks and which DataNode nodes store those blocks. NameNode also holds the running state of the file system.
    • DataNode stores the split blocks.
    • Secondary NameNode helps NameNode collect state information about the running file system.
    • JobTracker is responsible for job execution when a job is submitted to the Hadoop cluster, and schedules multiple TaskTrackers.
    • TaskTracker runs individual map or reduce tasks.
MapReduce principle:

1. Distributed data storage: HDFS has one NameNode and N DataNodes, each an ordinary computer. The HDFS layer splits files into blocks and distributes them across different DataNodes. Each block can also be replicated several times on different DataNodes for fault tolerance and disaster recovery. NameNode is the core of HDFS: it records how many blocks each file is cut into, which DataNodes those blocks are scattered on, and the state of each DataNode.
2. Distributed parallel computing: Hadoop has one JobTracker acting as the master to schedule and manage the TaskTrackers. The JobTracker can run on any node in the cluster. TaskTrackers perform the tasks and must run on DataNodes, meaning each DataNode is both a storage node and a compute node. The JobTracker sends map and reduce tasks to idle TaskTrackers, lets them run, and monitors their progress. If a TaskTracker fails, the JobTracker transfers its tasks to another idle TaskTracker.
3. Local computing: the node that stores a piece of data performs the computation on that data, which reduces data transfer over the network and lowers bandwidth requirements. "Local computing" is one of the most effective ways to save network bandwidth.
4. Task granularity: when the raw big data is cut into small datasets, each dataset is typically no larger than one HDFS block (64 MB by default), so that a dataset sits on a single machine and local computing is easy. For M small datasets, M map tasks are started and distributed across N machines to run in parallel; the number of reduce tasks, R, is specified by the user.
5. Data partitioning (Partition): the output of each map task is divided into R parts by key range, where R is the predefined number of reduce tasks.
6. Data merging (Combine): before the data is partitioned, intermediate results can be merged, so that pairs sharing the same key are combined into a single <key, value> pair. The combiner is part of the map task and reduces data transfer traffic.
7. Reduce: the results of map tasks are kept as files on local disk. The location of each intermediate result file is reported to the JobTracker, which then tells the reduce tasks which DataNode to fetch the intermediate results from. Each reduce task must reach every map task node holding intermediate results that fall within its key range, and then execute the reduce function.
8. Task pipelining: with R reduce tasks there will be R final results. Sometimes these R results do not need to be merged into one, because they can serve as the input to another computation, starting another parallel computing task and thereby forming a task pipeline.

2.2 Resource management in MapReduce
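The map, combine, partition, and reduce steps above can be sketched end to end as a toy in-memory word count. This is an illustrative model with hypothetical function names, not the Hadoop API; there is no HDFS or JobTracker here, just the data flow:

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a <word, 1> pair for each word in the input split."""
    return [(word, 1) for word in line.split()]

def combine(pairs):
    """Combine: merge pairs with the same key inside one map task."""
    merged = defaultdict(int)
    for key, value in pairs:
        merged[key] += value
    return list(merged.items())

def partition(key, r):
    """Partition: assign a key to one of R reduce tasks by hash."""
    return hash(key) % r

def reduce_fn(key, values):
    """Reduce: sum the counts gathered from every map task."""
    return key, sum(values)

def run_job(splits, r=2):
    # One "map task" per input split, with a local combiner.
    partitions = [defaultdict(list) for _ in range(r)]
    for split in splits:
        for key, value in combine(map_fn(split)):
            partitions[partition(key, r)][key].append(value)
    # One "reduce task" per partition; results are merged here for display.
    result = {}
    for part in partitions:
        for key, values in part.items():
            key, total = reduce_fn(key, values)
            result[key] = total
    return result

print(run_job(["hello hadoop", "hello mapreduce hadoop"]))
```

Note how the combiner halves nothing in this tiny input but would shrink traffic substantially when a split repeats the same keys many times.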

In Hadoop V1, MapReduce also has resource management capabilities in addition to data processing.

http://dongxicheng.org/mapreduce-nextgen/hadoop-1-and-2-resource-manage/

Resource management in Hadoop 1.0 consists of two parts: a resource representation model and a resource allocation model. The representation model describes how resources are organized; Hadoop 1.0 organizes the resources on each node into slots. The allocation model decides how resources are assigned to each job/task; in Hadoop this part is handled by a pluggable scheduler.

Hadoop introduces the concept of a "slot" to represent the compute resources on each node. To simplify resource management, Hadoop divides each node's resources (CPU, memory, disk, and so on) into equal portions, each portion being one slot, and a task may occupy multiple slots depending on its actual needs. By introducing slots, Hadoop abstracts multidimensional resources into a single resource (the slot), simplifying resource management.

Further, a slot is equivalent to a "license" to run: a task gets the chance to run only after obtaining a license, which also means the number of slots on a node determines the maximum task concurrency allowed on that node. To distinguish the resources used by map tasks and reduce tasks, slots are divided into two kinds, map slots and reduce slots, usable only by map tasks and reduce tasks respectively. Hadoop cluster administrators can, according to each node's hardware configuration and application characteristics, set the number of map slots (via the parameter mapred.tasktracker.map.tasks.maximum) and reduce slots (via the parameter mapred.tasktracker.reduce.tasks.maximum).
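As a rough sketch of the slot model: the node configurations below are invented for the example, and only the two parameter names come from the text; the point is simply that slot totals bound cluster-wide task concurrency.

```python
# Hypothetical per-node configs mirroring the two TaskTracker parameters
# mapred.tasktracker.map.tasks.maximum / mapred.tasktracker.reduce.tasks.maximum.
nodes = [
    {"map_slots": 4, "reduce_slots": 2},  # a larger node
    {"map_slots": 2, "reduce_slots": 1},  # a smaller node
]

def max_concurrency(nodes):
    """Each slot is a run 'license': the totals cap concurrent map/reduce tasks."""
    map_total = sum(n["map_slots"] for n in nodes)
    reduce_total = sum(n["reduce_slots"] for n in nodes)
    return map_total, reduce_total

print(max_concurrency(nodes))  # (6, 3)
```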

Resource management in Hadoop 1.0 has several drawbacks:

(1) Static resource configuration. A static resource setting policy is used: each node is configured with a total number of available slots at startup, and that number cannot be modified dynamically once the node is running.

(2) Resources cannot be shared. Hadoop 1.0 divides slots into map slots and reduce slots and does not allow sharing between them. For a job, map slots may be scarce while reduce slots sit idle at the start, and once the map tasks complete, reduce slots become scarce while map slots sit idle. Clearly, this scheme of differentiating slot types lowers slot utilization to some extent.
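A small illustration of that utilization loss (the demand and slot counts here are invented for the example): during the map phase, pending map tasks cannot use idle reduce slots, whereas a shared pool could run them all.

```python
def busy_slots(map_demand, reduce_demand, map_slots, reduce_slots):
    """Slots actually in use when map and reduce slots cannot be shared."""
    return min(map_demand, map_slots) + min(reduce_demand, reduce_slots)

# Early in the job: 12 map tasks waiting, no reduce tasks yet.
print(busy_slots(12, 0, map_slots=8, reduce_slots=8))  # 8 of 16 slots busy
# With a shared pool of 16 generic slots, 12 map tasks could run at once.
print(min(12 + 0, 16))  # 12
```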

(3) Resource partitioning is too coarse-grained. Slot-based resource partitioning is still too coarse and often leaves node resource utilization too high or too low. For example, suppose the administrator defines one slot as 2 GB of memory and one CPU. If an application's task needs only 1 GB of memory, the remainder becomes a "resource fragment", lowering cluster resource utilization; conversely, if a task needs 3 GB of memory, it implicitly grabs resources from other tasks, causing resource preemption and potentially harming cluster stability.
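The two failure modes in the text's 2 GB-slot example can be put in one line of arithmetic (the function name and the one-slot assumption are illustrative):

```python
SLOT_MEM_GB = 2  # example from the text: the administrator fixes one slot = 2 GB

def slot_mismatch(task_mem_gb, slots_held=1):
    """Return (fragmented_gb, overcommitted_gb) for a task holding whole slots."""
    granted = slots_held * SLOT_MEM_GB
    return max(granted - task_mem_gb, 0), max(task_mem_gb - granted, 0)

print(slot_mismatch(1))  # (1, 0): half the slot sits idle -> a resource fragment
print(slot_mismatch(3))  # (0, 1): the task silently uses 1 GB it was never granted
```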

(4) No effective resource isolation mechanism. Hadoop 1.0 relies only on JVM-based resource isolation, which is still too coarse: many resources, such as CPU, cannot be isolated, so tasks on the same node can interfere with each other severely.

2.3 Limitations of the MapReduce architecture

The original MapReduce architecture is straightforward, and in its first few years it produced many successful cases and won broad industry support and recognition. But as the scale of distributed clusters and their workloads grew, the problems of the original framework gradually surfaced, mainly the following:
1. JobTracker is the central processing point of MapReduce and is a single point of failure.
2. JobTracker takes on too many responsibilities, resulting in excessive resource consumption. When there are very many MapReduce jobs, this causes large memory overhead and, potentially, increases the risk of JobTracker failure. This is why the industry generally concluded that the old Hadoop MapReduce could support at most about 4,000 nodes.
3. On the TaskTracker side, using the number of map/reduce tasks as the resource representation is too simple; it does not account for CPU or memory usage. If two tasks with large memory footprints are scheduled onto the same node, an OOM is likely.
4. On the TaskTracker side, resources are forcibly divided into map task slots and reduce task slots. If the system has only map tasks or only reduce tasks, resources are wasted; this is the cluster resource utilization problem mentioned earlier.
5. At the source-code level, the code is very hard to read, often because a single class does too many things and runs to more than 3,000 lines. Class responsibilities are unclear, which makes bug fixing and version maintenance harder.
6. From an operational point of view, the Hadoop MapReduce framework forces a system-level upgrade for any change, important or not, such as bug fixes, performance improvements, or new features. Worse, regardless of user preference, it forces every client of the distributed cluster to update at the same time, wasting a great deal of user time verifying that previous applications still work on the new version of Hadoop.

3. MapReduce in Hadoop 2.0

Hadoop 2.0 grew out of the problems of MRv1 (traditional Hadoop MapReduce) described above, such as:
    • limited scalability;
    • the JobTracker single point of failure;
    • difficulty supporting computation models other than MR;
    • multiple computing frameworks fighting each other and struggling to share data: for example, MR (offline computing), Storm (real-time computing), and Spark (in-memory computing) are hard to deploy on the same cluster, making data sharing difficult.
The fundamental idea of the Hadoop V2 refactoring is to separate the two main functions of the JobTracker into independent components: resource management and task scheduling/monitoring.
    • The new resource manager, YARN, globally manages the allocation of compute resources for all applications, while each application's ApplicationMaster handles the corresponding scheduling and coordination. An application is either a single traditional MapReduce job or a DAG (directed acyclic graph) of jobs. The ResourceManager, together with the NodeManager on each machine, manages user processes on that machine and organizes the computation.
    • The scheduling and monitoring of tasks remains the responsibility of MapReduce.
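The division of labor can be sketched as a toy model: a global ResourceManager hands out generic containers while a per-application ApplicationMaster schedules its own tasks into them. The class and method names here are illustrative, not the YARN API:

```python
class ResourceManager:
    """Global arbiter: hands out generic containers, knows nothing about jobs."""
    def __init__(self, total_containers):
        self.free = total_containers

    def allocate(self, n):
        granted = min(n, self.free)
        self.free -= granted
        return granted

class ApplicationMaster:
    """Per-application: asks the RM for containers, then schedules its own tasks."""
    def __init__(self, rm, tasks):
        self.rm, self.pending = rm, tasks

    def run(self):
        done = 0
        while self.pending:
            got = self.rm.allocate(self.pending)
            if got == 0:
                break  # in a real cluster the AM would wait for resources
            self.pending -= got
            done += got
            self.rm.free += got  # containers are returned as tasks finish
        return done

rm = ResourceManager(total_containers=3)
am = ApplicationMaster(rm, tasks=8)
print(am.run())  # 8: all tasks eventually run, at most 3 at a time
```

The key design point this models: the ResourceManager never sees map or reduce tasks, only container requests, which is what lets non-MapReduce frameworks share the same cluster.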
Traditional MapReduce scheduling on YARN:

