Hadoop in the Big Data Era (III): Hadoop Data Stream (Lifecycle)

Last Update:2014-10-15 Source: Internet

Author: User

Hadoop in the Big Data Era (I): Hadoop Installation

Hadoop in the Big Data Era (II): Hadoop Script Parsing

To understand Hadoop, you first need to understand how data flows through it, much as learning servlets starts with the servlet lifecycle. Hadoop combines distributed storage (HDFS) with a distributed computing framework (MapReduce). It also has an important property: Hadoop moves the MapReduce computation to the machines that store the relevant part of the data.

Terms
A MapReduce job is a unit of work that the client wants to execute. It consists of the input data, the MapReduce program, and configuration information. Hadoop runs a job by dividing it into smaller tasks, of which there are two types: map tasks and reduce tasks.
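
To make these terms concrete, here is a minimal in-process sketch in plain Python, not the Hadoop API. The job is modeled as a dictionary holding input, program, and configuration; the names `run_job`, `map_fn`, `reduce_fn`, and the `num_reduce_tasks` key are assumptions made for illustration, and word counting is just a convenient example program.

```python
from collections import defaultdict

def map_fn(line):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in line.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: sum all counts collected for one key."""
    return word, sum(counts)

def run_job(job):
    """Toy runner: one map task per split, then a single reduce phase."""
    grouped = defaultdict(list)
    for split in job["input_splits"]:        # each split is handled by one map task
        for line in split:                   # the map function sees one record at a time
            for key, value in job["map"](line):
                grouped[key].append(value)
    # The reduce phase receives each key together with all of its values.
    return dict(job["reduce"](k, v) for k, v in sorted(grouped.items()))

# A job bundles input data, the program, and configuration (all names hypothetical).
job = {
    "input_splits": [["big data", "hadoop data"], ["hadoop stream"]],
    "map": map_fn,
    "reduce": reduce_fn,
    "config": {"num_reduce_tasks": 1},  # illustrative only; unused by this toy runner
}
result = run_job(job)
print(result)
```

In a real cluster the map tasks run on different machines and the grouping step happens over the network, but the division of labor between the two task types is the same.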

Two types of nodes control job execution: one jobtracker and a number of tasktrackers. The jobtracker coordinates all jobs running on the system by scheduling tasks to run on tasktrackers. Each tasktracker runs tasks and sends progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker node.
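
The rescheduling behavior can be sketched as follows. This is a simplified model, not the jobtracker's actual scheduling algorithm: the function `schedule`, the round-robin choice of tasktracker, and the `failing` set that simulates failed attempts are all assumptions made for illustration.

```python
def schedule(tasks, trackers, failing):
    """Assign each task to a tasktracker, retrying elsewhere when an attempt fails.

    `failing` is a set of (task, tracker) pairs whose attempts are simulated
    as failures. Returns a dict mapping each task to the tracker that ran it.
    """
    assignments = {}
    for i, task in enumerate(tasks):
        for attempt in range(len(trackers)):
            # Round-robin starting point, moving on to the next tracker after a failure.
            tracker = trackers[(i + attempt) % len(trackers)]
            if (task, tracker) in failing:
                continue  # the attempt failed; reschedule on another tasktracker
            assignments[task] = tracker
            break
        else:
            raise RuntimeError(f"task {task} failed on every tasktracker")
    return assignments

plan = schedule(
    tasks=["map-0", "map-1", "reduce-0"],
    trackers=["tt1", "tt2"],
    failing={("map-1", "tt2")},   # map-1 fails on tt2 and is rerun on tt1
)
print(plan)
```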

Input
Hadoop divides the input data of a MapReduce job into equal-sized pieces called input splits, or simply splits. Hadoop creates one map task for each split, and the task runs the user-defined map function on each record in the split.
For most jobs, a reasonable split size tends to be the size of one HDFS block (64 MB by default in Hadoop 1.x; later releases default to 128 MB), although this default can be changed for the cluster. The split size should be chosen based on the job being run. If splits are too small, the overhead of managing the splits and creating the map tasks begins to dominate the total job execution time.
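
A little arithmetic shows why very small splits hurt. The sketch below assumes a fixed setup cost of 1 second per map task and a processing rate of 64 MB per second; both numbers are invented for illustration and are not Hadoop measurements.

```python
import math

def num_splits(file_size_mb, split_size_mb):
    """Number of input splits (and therefore map tasks) for one file."""
    return math.ceil(file_size_mb / split_size_mb)

def overhead_fraction(file_size_mb, split_size_mb, setup_s=1.0, mb_per_s=64.0):
    """Share of total task time spent on per-task setup rather than real work.

    setup_s and mb_per_s are assumed values, chosen only to make the trend visible.
    """
    overhead = num_splits(file_size_mb, split_size_mb) * setup_s
    work = file_size_mb / mb_per_s
    return overhead / (overhead + work)

for split_mb in (64, 16, 1):
    print(split_mb, num_splits(1024, split_mb), round(overhead_fraction(1024, split_mb), 3))
```

Under these assumptions a 1 GB file with 64 MB splits needs 16 map tasks and spends half of its task time on setup, while 1 MB splits need 1024 map tasks and spend almost all of it on setup.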

Hadoop achieves the best performance when it runs a map task on a node that stores the task's input data in HDFS. This is called the data locality optimization. A block is the smallest unit of storage in HDFS, and each block is replicated on several nodes. The blocks of a file are spread across different nodes, so if the input split of a map task spanned more than one block, it is unlikely that any single node would hold all of those blocks. The map task would then have to copy the missing blocks over the network to its own node before running the map function, which is clearly less efficient than reading only local data. This is why the optimal split size is the same as the block size.
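
The effect of a split that spans two blocks can be sketched as follows. The function `pick_node`, the block IDs, and the node names are hypothetical; the real scheduler considers more factors (such as rack placement), so treat this only as an illustration of the locality argument.

```python
def pick_node(split_blocks, block_locations, free_nodes):
    """Choose the free node that stores the most blocks of the split.

    Returns (node, remote_blocks), where remote_blocks must be copied over
    the network before the map function can run. Assumes free_nodes is non-empty.
    """
    best_node, best_local = None, -1
    for node in sorted(free_nodes):  # sorted for a deterministic tie-break
        local = sum(1 for b in split_blocks if node in block_locations[b])
        if local > best_local:
            best_node, best_local = node, local
    remote = [b for b in split_blocks if best_node not in block_locations[b]]
    return best_node, remote

# Hypothetical layout: each block is replicated on two nodes.
block_locations = {"b1": {"n1", "n2"}, "b2": {"n3", "n4"}}
nodes = {"n1", "n2", "n3", "n4"}

print(pick_node(["b1"], block_locations, nodes))        # split == one block: fully local
print(pick_node(["b1", "b2"], block_locations, nodes))  # split spans two blocks: one is remote
```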

Output
The map task writes its output to the local disk, not to HDFS. This is because map output is an intermediate result: it is processed by the reduce tasks to produce the final output (which is stored in HDFS), and once the job is complete, the map output can be deleted.
Reduce tasks do not have the advantage of data locality: the input to a single reduce task normally comes from the output of all the map tasks. The output of the reduce tasks is usually stored in HDFS for reliability.

Data Stream
The data flow depends on the number of reduce tasks. The number of reduce tasks is not determined by the size of the input data; it is specified separately in the job configuration.

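Single reduce task
With a single reduce task, the output of every map task is transferred to that one reduce task, which merges it and produces the final output.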

Multiple reduce tasks
When there are multiple reduce tasks, each map task partitions its output, creating one partition for each reduce task. Partitioning can be controlled by a user-defined partition function; the default partitioner assigns records to partitions by hashing the key.
The data flow between map tasks and reduce tasks is called the shuffle.
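
Partitioning and the shuffle can be sketched in a few lines. Hadoop's default `HashPartitioner` computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`; the Python below mimics that formula with a hand-written equivalent of Java's `String.hashCode()` (valid for characters in the Basic Multilingual Plane). The helper names `java_string_hash`, `partition`, and `shuffle` are assumptions for this sketch, and the real shuffle also sorts and merges data, which is omitted here.

```python
def java_string_hash(s):
    """Mimic Java's String.hashCode() as a signed 32-bit integer."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - (1 << 32) if h >= (1 << 31) else h

def partition(key, num_reduce_tasks):
    """Pick the reduce task for a key, following the HashPartitioner formula."""
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks

def shuffle(map_outputs, num_reduce_tasks):
    """Route every (key, value) pair from every map task to its reduce task."""
    reduce_inputs = [dict() for _ in range(num_reduce_tasks)]
    for output in map_outputs:                 # one list of pairs per map task
        for key, value in output:
            target = reduce_inputs[partition(key, num_reduce_tasks)]
            target.setdefault(key, []).append(value)
    return reduce_inputs

map_outputs = [[("a", 1), ("b", 1)], [("a", 1)]]
print(shuffle(map_outputs, 2))
```

Because the partition depends only on the key, all values for a given key end up at the same reduce task, no matter which map task produced them.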

It is also possible to have zero reduce tasks. This is appropriate when the processing can be carried out entirely in parallel, so no shuffle is needed; in that case the map tasks write their output directly to HDFS.

Combiner (Merge Function)
Hadoop lets the user specify a combiner function to be run on the output of each map task. The output of the combiner becomes the input to the reduce function. The combiner is purely an optimization: it runs on the local machine after the map task finishes (it is often the same code as the reduce function) in order to reduce the amount of data transferred over the network. Because it is only an optimization, Hadoop does not guarantee how many times the combiner is called for a given map output, so the reduce function must produce the same result whether the combiner runs zero, one, or many times.
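
The saving can be seen in a small word-count sketch. The functions `combine` and `reduce_sum` and the sample data are made up for illustration; summing works as a combiner because addition is associative and commutative, whereas an operation such as computing a mean could not be reused this way without changes.

```python
from collections import defaultdict

def combine(map_output):
    """Combiner: locally sum the values per key for one map task's output."""
    totals = defaultdict(int)
    for key, value in map_output:
        totals[key] += value
    return sorted(totals.items())

def reduce_sum(pairs):
    """Reduce: sum the values per key across everything that was transferred."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Output of two map tasks (sample data).
map_outputs = [
    [("data", 1), ("data", 1), ("hadoop", 1)],
    [("data", 1), ("hadoop", 1), ("hadoop", 1)],
]

sent_without = [pair for out in map_outputs for pair in out]
sent_with = [pair for out in map_outputs for pair in combine(out)]

print(len(sent_without), len(sent_with))              # records sent over the network
print(reduce_sum(sent_without) == reduce_sum(sent_with))
```

The final counts are identical either way; only the number of records that cross the network changes.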

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
