Data-Intensive Text Processing with MapReduce, Chapter 2: MapReduce Basics (2)

2.3 The Execution Framework

The great strength of MapReduce is that it separates the "what" from the "how" of parallel algorithm design: you only need to write the program, without worrying about how it is executed. The execution framework is what makes this separation possible: it handles almost all the details of the underlying execution, and it lets a MapReduce cluster scale from a few nodes to thousands. Specifically, it has the following responsibilities:

1) Scheduling

Each MapReduce job is divided into many tasks. Map tasks, for example, process blocks of input key-value pairs (called input splits in Hadoop), while reduce tasks process the intermediate-result key-value pairs. In MapReduce it is normal for thousands of tasks to be waiting for assignment, and for large jobs the number of pending tasks can even exceed the number of tasks that can run concurrently. This creates the need for scheduling, in two senses: 1. within a single job, when there are more tasks than can run at once, the framework must maintain a task queue and manage the execution order; 2. when multiple jobs from different users run simultaneously, shared access to resources must remain controllable, stable, and correct.
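The single-job case can be pictured with a toy sketch (this assumes a plain FIFO queue and a fixed number of task slots; real schedulers such as Hadoop's are far more elaborate, but the queuing idea is the same):

```python
from collections import deque

def schedule(tasks, slots):
    """Toy FIFO scheduler: at most `slots` tasks run concurrently.

    Returns the successive "waves" in which tasks are dispatched;
    tasks beyond the slot limit simply wait in the queue.
    """
    queue = deque(tasks)
    waves = []
    while queue:
        # Dispatch up to `slots` tasks at a time; the rest keep waiting.
        wave = [queue.popleft() for _ in range(min(slots, len(queue)))]
        waves.append(wave)
    return waves

# 7 map tasks, but only 3 task slots available at once
print(schedule([f"map-{i}" for i in range(7)], 3))
# → [['map-0', 'map-1', 'map-2'], ['map-3', 'map-4', 'map-5'], ['map-6']]
```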

Speculative execution is an optimization of task scheduling in MapReduce (implemented by both Google's MapReduce and Hadoop).

Speculative execution addresses the straggler problem: the running time of the map stage or the reduce stage is determined by its slowest task, and this slowest task is called a "straggler". Stragglers have many causes: hardware problems (a machine is faulty but still produces correct results, only slowly), or uneven data partitioning (some tasks end up small while others are large).

The speculative-execution solution is to run a copy of the same task on another machine; as soon as either copy finishes, the task is considered complete.
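The idea can be sketched in a few lines (an illustrative simulation only, not Hadoop's actual mechanism; the function and task names here are made up for the example):

```python
import concurrent.futures
import time

def run_speculatively(task, n_copies=2):
    """Run identical copies of `task` and return whichever result
    arrives first. Since the copies compute the same thing, the
    answer is the same no matter which copy "wins"."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=n_copies) as pool:
        futures = [pool.submit(task) for _ in range(n_copies)]
        done, _ = concurrent.futures.wait(
            futures, return_when=concurrent.futures.FIRST_COMPLETED)
        return next(iter(done)).result()

def possible_straggler():
    # Stand-in for a task that might run slowly on a degraded machine.
    time.sleep(0.01)
    return sum(range(100))

print(run_speculatively(possible_straggler))  # → 4950
```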

2) Data/code co-location

The essential difference between MapReduce and traditional parallel programming models is that it moves code to data: MapReduce ships the program to the nodes where the data resides and executes it there. The core idea is to exploit data locality and avoid moving data. Locality has several levels, determined by the cluster structure, the amount of data to be accessed, and network bandwidth/latency:

Data volume: single machine < rack < data center

Bandwidth/latency: same machine > same rack > within a data center > across data centers

3) Synchronization

This part concerns the ordering between the production of intermediate results by the mappers and their consumption by the reducers.

In general, whenever multiple accesses to a shared resource "meet", synchronization must be considered. In MapReduce, the shared resource is the set of intermediate results: the mappers produce them and the reducers consume them. Recall the MapReduce execution flow: after a mapper processes its input key-value pairs, it outputs intermediate key-value pairs; these are grouped and sorted by key and then fed to the reducers. Two facts hold in both MapReduce implementations:

  1. In MapReduce, no reduce task can start before all intermediate key-value pairs have been generated, grouped, and sorted;
  2. MapReduce intermediate key-value pairs are transmitted over the network (with m mappers and n reducers, up to m × n transfer operations are required).

However, waiting for all mappers to finish before transmitting any results would be very inefficient, so Hadoop applies a common optimization: the map computation and the transfer of intermediate key-value pairs are asynchronous and overlapped (intermediate results are shipped to the reducers while the maps are still running), while the mapper-reducer ordering remains strict (a reducer starts reducing only after all mappers have finished and all intermediate results have been transferred).
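The execution flow described above — map everything, then group and sort by key, and only then reduce — can be sketched as a minimal in-memory simulation (a conceptual model only; real frameworks spill to disk and move data over the network):

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    """Minimal in-memory sketch of the MapReduce execution model:
    all map output is produced first, then grouped and sorted by key,
    and only then are the reducers run — mirroring the barrier between
    the map and reduce stages."""
    # Map stage: each record may emit any number of (key, value) pairs.
    intermediate = []
    for record in records:
        intermediate.extend(mapper(record))
    # Shuffle/sort: order intermediate pairs so equal keys are adjacent.
    intermediate.sort(key=itemgetter(0))
    # Reduce stage: one reducer call per distinct key.
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(intermediate, key=itemgetter(0))]

# Word count, the canonical example
def wc_mapper(line):
    return [(word, 1) for word in line.split()]

def wc_reducer(word, counts):
    return (word, sum(counts))

print(map_reduce(["the cat", "the dog"], wc_mapper, wc_reducer))
# → [('cat', 1), ('dog', 1), ('the', 2)]
```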

4) Handling errors and faults

MapReduce runs on clusters built from large numbers of commodity PCs. In such an environment, failures of individual components are common rather than exceptional.

Hardware: disk faults, memory errors, data-center unavailability (planned: hardware upgrades; unplanned: network outages, power failures)

Software: bugs and errors

2.4 Partitioners and Combiners

The first three sections give a basic picture of MapReduce. Next come the partitioner and the combiner; with these two elements, the MapReduce programming model is essentially complete.

1) Partitioner

The partitioner's main job is to divide up the intermediate-result key space: it determines which reducer each intermediate key-value pair is assigned to.

The simplest approach is to hash the key and take the result modulo the number of reducers, using the result as the reducer number:

Assume there are r reducers, numbered 0, 1, …, r − 1. A pair (k, v) is assigned to reducer i, where i = hash(k) mod r.
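A minimal sketch of this hash-mod partitioner (CRC32 stands in for the hash function so that results are stable across runs; Hadoop's default uses the key's own hash code, not CRC32):

```python
import zlib

def partition(key, num_reducers):
    """Hash-mod partitioner sketch: reducer i = hash(k) mod r.

    Any deterministic hash works; the essential properties are that
    every key maps to a reducer in [0, r) and that the same key
    always maps to the same reducer."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers

# With r = 4 reducers, every key lands on exactly one of reducers 0..3.
keys = ["hadoop", "the", "mapreduce", "data"]
print({k: partition(k, 4) for k in keys})
```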

This scheme ensures an even distribution of keys (each reducer holds roughly the same number of distinct keys), but it does not ensure an even distribution of key-value pairs, so data skew can still occur:

For example, in word counting, suppose reducer1 receives all the ("hadoop", 1) pairs and reducer2 receives all the ("the", 1) pairs. Measured by keys, the two reducers hold the same number (one each), but since "hadoop" occurs far less frequently than "the", there are far fewer ("hadoop", 1) pairs than ("the", 1) pairs; reducer1 therefore receives very little data while reducer2 receives a great deal.

2) Combiner

The combiner is an optimization: after a mapper outputs its intermediate results and they have been grouped and sorted, the combiner performs a local merge before the data is sent out. For word counting, the counts produced by a mapper can be merged before transmission: if mapper1 produces ("C", 3) and ("C", 6), the combiner merges them into ("C", 9) and outputs that instead. The number of key-value pairs each mapper emits then equals the number of distinct words in the documents that mapper processed.
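For word count, the local merge can be sketched as follows (a simulation of the idea, not Hadoop's combiner API; in Hadoop the combiner is written with the same interface as a reducer):

```python
from collections import Counter

def combine(mapper_output):
    """Combiner sketch: locally merge one mapper's (word, count) pairs
    before they are sent over the network. For word count this is the
    same summation the reducer performs, applied to a single mapper's
    output."""
    merged = Counter()
    for key, value in mapper_output:
        merged[key] += value
    return sorted(merged.items())

# The pairs ('C', 3) and ('C', 6) from the text collapse into ('C', 9).
print(combine([("C", 3), ("C", 6), ("a", 1)]))
# → [('C', 9), ('a', 1)]
```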


Figure 2.4 The complete MapReduce model

(the word-count program with a combiner added)

In this sense, the combiner can be thought of as a "mini-reducer" whose input is limited to the output of a single mapper. Used judiciously, the combiner can improve an algorithm's efficiency. The complete MapReduce programming model is shown in Figure 2.4.
