Hadoop in the Big Data Era (iii): Hadoop Data Flow (life cycle)


To understand Hadoop, you first need to understand its data flow, much as you would learn the life cycle of a servlet. Hadoop combines distributed storage (HDFS) with a distributed computing framework (MapReduce), and it has one very important characteristic: Hadoop moves the MapReduce computation to the machines that store the relevant pieces of data, rather than moving the data to the computation.


Terminology. A MapReduce job is the unit of work that the client wants performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs the job by dividing it into smaller tasks, of which there are two types: map tasks and reduce tasks.
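To make these three ingredients concrete, here is a minimal job driver sketch using the newer org.apache.hadoop.mapreduce API. The word-count mapper and reducer (Hadoop's built-in TokenCounterMapper and IntSumReducer) and the class name WordCountDriver are illustrative choices, not something prescribed by this article.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.map.TokenCounterMapper;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.reduce.IntSumReducer;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();                 // configuration information
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCountDriver.class);

        job.setMapperClass(TokenCounterMapper.class);             // the MapReduce program:
        job.setReducerClass(IntSumReducer.class);                 // map and reduce classes
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));     // input data
        FileOutputFormat.setOutputPath(job, new Path(args[1]));   // final output goes to HDFS

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```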


There are two types of nodes that control the job execution process: a JobTracker and a number of TaskTrackers. The JobTracker coordinates all the jobs running on the system by scheduling tasks to run on TaskTrackers. TaskTrackers run the tasks and send progress reports to the JobTracker, which keeps a record of the overall progress of each job. If a task fails, the JobTracker can reschedule it on a different TaskTracker node.
Input. Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits, or simply splits. Hadoop creates one map task for each split, and that task runs the user-defined map function over every record in the split.
For most jobs, a reasonable split size is the size of an HDFS block, 64 MB by default, although this default can be changed for the cluster. The split size has to be weighed against the tasks being run: if the splits are too small, the overhead of managing the splits and of creating the map tasks starts to dominate the total job execution time.
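As an illustration, the split size can also be constrained per job through the input format. The helper class below is hypothetical, and the 64 MB / 128 MB figures are only examples; FileInputFormat.setMinInputSplitSize and setMaxInputSplitSize are the relevant knobs in the newer API.

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeConfig {
    // Hypothetical helper: bound the input split size for a job.
    public static void configureSplits(Job job) {
        FileInputFormat.setMinInputSplitSize(job, 64L * 1024 * 1024);   // no split smaller than 64 MB
        FileInputFormat.setMaxInputSplitSize(job, 128L * 1024 * 1024);  // no split larger than 128 MB
    }
}
```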

Hadoop gets the best performance by running a map task on the node where its input data is stored; this is known as the data locality optimization. A block is the smallest unit of storage in HDFS, each block is replicated on several nodes, and a file is divided into blocks that are spread across many nodes. If the input split of a map task spanned multiple blocks, it is very unlikely that any single node would hold all of those blocks at once, so the map task would first have to copy the missing blocks to its node over the network before running the map function, which is clearly inefficient. This is why the optimal split size is the same as the block size.
Output. The map task writes its output to the local disk, not to HDFS. This is because the map output is an intermediate result: it is processed by reduce tasks to produce the final output, which is stored in HDFS. Once the job completes, the map output can be thrown away.
A reduce task does not have the data locality advantage: the input to a single reduce task normally comes from the output of all the map tasks. The output of a reduce task is normally stored in HDFS for reliability.
Data flow. The data flow of a job varies with the number of reduce tasks that are configured, though the overall shape is similar in each case. The number of reduce tasks is not governed by the size of the input; it is specified explicitly by the user.
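As a small illustration (reusing the job object from the driver sketch above, which is an assumption), the reduce count is simply a job setting:

```java
// The number of reduce tasks is chosen by the user, not derived from the input size.
job.setNumReduceTasks(1);    // single reduce task: one partition, one output file
// job.setNumReduceTasks(4); // multiple reduce tasks: map output split into 4 partitions
```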
 Single reduce task
Multiple reduce tasks. When there are multiple reduce tasks, each map task partitions its output, creating one partition for each reduce task. Partitioning can be controlled by a user-defined partitioning function, but the default partitioner (Partitioner) buckets keys with a hash function, as sketched below.
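For illustration, here is a minimal custom partitioner in the spirit of Hadoop's default HashPartitioner; the class name and the Text/IntWritable key and value types are assumptions made for the sketch.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Assigns each map output record to one of numPartitions buckets, one bucket per
// reduce task. This mirrors what the default hash-based partitioner does.
public class MyHashPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit so the result is non-negative, then bucket by key hash.
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

It would be registered with job.setPartitionerClass(MyHashPartitioner.class); records whose keys hash to the same bucket always end up at the same reducer.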
The data flow between the map tasks and the reduce tasks is known as the shuffle.
 
No reduce task. Of course, there are also situations where no reduce task is needed at all, because the processing is completely parallel; a sketch follows.
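A minimal sketch of that map-only case, again assuming the job object from the driver above:

```java
// Map-only job: with zero reduce tasks there is no shuffle, and each map task
// writes its output directly to HDFS through the configured output format.
job.setNumReduceTasks(0);
// No reducer (and no combiner) is set; only the map function runs.
```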
 
Combiner (merge function). While we are at it, a word about the combiner. Hadoop allows the user to specify a combiner function to be run on the map output; the output of the combiner then becomes the input to the reduce function. The combiner is purely an optimization: by running the combiner function (often the same code as the reduce function) locally on the map node after the map task finishes, the amount of data transferred over the network to the reducers is reduced.
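A sketch of wiring this up, assuming a sum-style reducer whose logic is safe to reuse as the combiner (such as the built-in IntSumReducer from the driver sketch above):

```java
// The combiner runs on the map side to pre-aggregate map output before it is
// sent across the network to the reducers. Reusing the reducer as the combiner
// is only valid when the reduce function is commutative and associative, as a sum is.
job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
```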
