Hadoop Parallel Computing Principles and Distributed Concurrent Programming


What we usually call a distributed system is a distributed software system, that is, a software system that supports distributed processing and executes tasks on multiple processors interconnected by a communication network. This includes distributed operating systems, distributed programming languages and their compilation (interpretation) systems, distributed file systems, distributed database systems, and so on. Hadoop is the file-system part of such a distributed software system: it implements a distributed file system and part of the functionality of a distributed database. The Hadoop Distributed File System (HDFS) stores and manages data efficiently across a cluster of computers, and Hadoop's parallel programming framework, MapReduce, makes it easy to run user-written parallel Hadoop applications. The following is a brief introduction to distributed concurrent programming with Hadoop:

Parallel application development on Hadoop is based on the MapReduce programming framework. The MapReduce programming model takes a set of input key/value pairs and produces a set of output key/value pairs. Users of the MapReduce library express the computation with two functions: map and reduce.

The user-defined map function receives an input key/value pair and produces a set of intermediate key/value pairs. MapReduce groups together all intermediate values associated with the same intermediate key and passes them to the reduce function.

The user-defined reduce function receives an intermediate key and the set of values associated with it, and merges these values into a smaller set of values. Typically, each call to reduce produces either zero or one output value. The intermediate values are supplied to the reduce function through an iterator, which makes it possible to handle collections of values too large to fit in memory.
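As a concrete illustration, the following is a minimal sketch of a map and a reduce function for the classic word-count task, written against Hadoop's org.apache.hadoop.mapreduce API; the class names TokenizerMapper and IntSumReducer are only illustrative.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // map: for each word in an input line, emit the intermediate pair (word, 1)
    class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // reduce: receive (word, [1, 1, ...]) through an iterator and merge the values into one count
    class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);   // each call yields exactly one output pair (word, count)
        }
    }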

The diagram below shows the MapReduce data flow. Put simply, a large dataset is decomposed into hundreds of small datasets, each of which (or a few of which together) is processed by one node in the cluster (typically an ordinary computer) to produce an intermediate result; these intermediate results are then combined by a large number of nodes to form the final result. The figure also shows the three main functions in a parallel program under the MapReduce framework: map, reduce, and main. Within this structure, all the user needs to do for a given task is write the map and reduce functions; a sketch of the main driver function follows the figure.


▲ MapReduce data flow diagram
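The main function mentioned above is the driver that configures and submits the job. Below is a minimal driver sketch for the word-count example, assuming the TokenizerMapper and IntSumReducer classes sketched earlier and using the standard org.apache.hadoop.mapreduce.Job API; the input and output paths are taken from the command line.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);    // user-defined map
            job.setReducerClass(IntSumReducer.class);     // user-defined reduce
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
            System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit the job and wait
        }
    }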

The MapReduce computing model is ideally suited to parallel execution on a large cluster of computers. Each map task and each reduce task in the figure above can run at the same time on its own compute node, so the computation can be carried out very efficiently. How is this parallel computation achieved? A brief explanation of the principles follows.

  1. Distributed Data Storage

The Hadoop Distributed File System (HDFS) consists of one name node (NameNode) and N data nodes (DataNode), each of which is an ordinary computer. In use, HDFS is very similar to the single-machine file systems we are familiar with: you can create directories; create, copy, and delete files; and view file contents. Under the hood, however, HDFS cuts each file into blocks and scatters those blocks across different DataNodes; each block can also be replicated several times onto different DataNodes to achieve fault tolerance and disaster recovery. The NameNode is the core of HDFS: it maintains data structures that record how many blocks each file is cut into, which DataNodes each block can be obtained from, the status of each DataNode, and other important information.
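To observe this block placement from client code, the hedged sketch below uses the standard org.apache.hadoop.fs API to print each block of a file together with the DataNodes that hold its replicas; the path /user/demo/input.txt is only a placeholder.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.BlockLocation;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ListBlocks {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);
            FileStatus status = fs.getFileStatus(new Path("/user/demo/input.txt")); // placeholder path
            BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
            for (BlockLocation block : blocks) {
                // each block is stored (and possibly replicated) on one or more DataNodes
                System.out.println("offset=" + block.getOffset()
                        + " length=" + block.getLength()
                        + " hosts=" + String.join(",", block.getHosts()));
            }
            fs.close();
        }
    }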

  2. Distributed Parallel Computing

Hadoop has a single JobTracker, which schedules and manages the TaskTrackers; the JobTracker can run on any computer in the cluster. The TaskTrackers are responsible for executing tasks and must run on DataNodes, which means a DataNode is both a data storage node and a compute node. The JobTracker distributes map tasks and reduce tasks to idle TaskTrackers, lets these tasks run in parallel, and monitors how they are running. If a TaskTracker fails, the JobTracker transfers the tasks it was responsible for to another idle TaskTracker and reruns them.

  3. Local Computing

The computer on which a piece of data is stored performs the computation on that data. This reduces the amount of data transmitted over the network and lowers the demand for network bandwidth. In a cluster-based distributed parallel system such as Hadoop, compute nodes can be added easily, so it can provide almost unlimited computing power; but because data must flow between different computers, network bandwidth becomes the bottleneck. "Local computing" is one of the most effective ways to save network bandwidth; the industry sums this up as "moving computation is cheaper than moving data."

  4. Granularity of Tasks

When cutting the original large dataset into small datasets, each small dataset is usually made smaller than or equal to the size of one block in HDFS (64 MB by default), which guarantees that a small dataset resides on a single computer and can easily be computed locally. If there are M small datasets to process, M map tasks are started; note that these M map tasks are distributed across N machines and run in parallel. The number of reduce tasks, R, can be specified by the user, as in the sketch below.
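As a hedged sketch of how these knobs could be set, the helper below caps the input split size and chooses R on a Job, using methods that exist on the standard FileInputFormat and Job classes; the concrete numbers are only examples.

    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class GranularitySettings {
        // Apply granularity settings to a job built as in the driver sketch above.
        static void configure(Job job) {
            // Cap each input split at one HDFS block (64 MB here) so that a map task's
            // input stays on a single machine and can be computed locally.
            FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
            // R, the number of reduce tasks, is chosen by the user.
            job.setNumReduceTasks(4);
        }
    }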

  5. Data Segmentation (Partition)

The intermediate results output by the map tasks are divided into R parts by key range (R is the predefined number of reduce tasks), usually with a hash function such as hash(key) mod R. This guarantees that keys in a given range are always handled by the same reduce task, which simplifies the reduce process.
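Such a partition function can be expressed with Hadoop's Partitioner class. The sketch below mirrors the hash(key) mod R behaviour of the default HashPartitioner for the word-count key/value types used earlier; it would be registered on the job with job.setPartitionerClass(WordPartitioner.class).

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Assigns each intermediate (word, count) pair to one of the R reduce tasks
    // using hash(key) mod R, like Hadoop's default HashPartitioner.
    class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Mask off the sign bit so the result always lies in [0, numReduceTasks).
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }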

  6. Data Consolidation (Combine)

Before the data is partitioned, the intermediate results can optionally be merged (Combine): key/value pairs in the intermediate results that share the same key are combined into one. The Combine step works much like reduce, and in many cases the reduce function can be used directly for it; however, Combine is part of the map task and is executed right after the map function. Combining reduces the number of key/value pairs in the intermediate results and therefore reduces network traffic, as shown in the small sketch below.
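For the word-count example, the sums computed by IntSumReducer are associative, so that class could plausibly serve as the combiner as well. Registering it is a single call on the driver's Job object (setCombinerClass is a standard Job method); the helper class here is only for illustration and assumes the IntSumReducer from the earlier sketch is visible.

    import org.apache.hadoop.mapreduce.Job;

    public class CombineSettings {
        // Reuse the reduce class as a map-side combiner: (word, 1) pairs with the same
        // key are summed on the map node before being partitioned and shuffled.
        static void enableCombiner(Job job) {
            job.setCombinerClass(IntSumReducer.class);
        }
    }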

  7. Reduce

After Combine and Partition are done, the intermediate results of a map task are stored as files on the local disk. The locations of these intermediate result files are reported to the master JobTracker, which then tells the reduce tasks which DataNodes to fetch their intermediate results from. Note that the intermediate results produced by all map tasks are divided into R parts by the same hash function on the key, and each of the R reduce tasks is responsible for one key interval. Each reduce task fetches the intermediate results that fall within its key interval from many map task nodes and then executes the reduce function to produce one final result file.

  8. Task Pipeline

With R reduce tasks there will be R final result files. In many cases these R files do not need to be merged into a single result, because they can serve directly as the input of another computing task; this is how a task pipeline is formed, as sketched below.
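As a hedged illustration of such a pipeline, the sketch below chains two jobs so that the output directory of the first job becomes the input directory of the second; the helpers buildFirstJob and buildSecondJob are hypothetical stand-ins for drivers configured as shown earlier.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class Pipeline {
        public static void main(String[] args) throws Exception {
            Path input = new Path(args[0]);
            Path intermediate = new Path(args[1]);   // the R result files written by the first job
            Path output = new Path(args[2]);

            Job first = buildFirstJob();             // hypothetical: configured like the driver above
            FileInputFormat.addInputPath(first, input);
            FileOutputFormat.setOutputPath(first, intermediate);
            if (!first.waitForCompletion(true)) System.exit(1);

            Job second = buildSecondJob();           // hypothetical: a follow-on MapReduce job
            FileInputFormat.addInputPath(second, intermediate);  // consume the R files directly
            FileOutputFormat.setOutputPath(second, output);
            System.exit(second.waitForCompletion(true) ? 0 : 1);
        }

        // Hypothetical helpers; in practice each would create and configure a Job
        // exactly as in the word-count driver sketch shown earlier.
        static Job buildFirstJob() throws Exception { return Job.getInstance(); }
        static Job buildSecondJob() throws Exception { return Job.getInstance(); }
    }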

The above is a brief introduction to the principles, workflow, program structure, and parallel computing implementation of the MapReduce programming model in Hadoop. The detailed flow of a MapReduce program, the programming interfaces, and example programs will be covered in subsequent articles.
