Hadoop in the Big Data Era (III): Hadoop Data Flow (Lifecycle)


 

Hadoop in the Big Data Era (I): Hadoop Installation

Hadoop in the Big Data Era (II): Hadoop Script Parsing


 


To understand Hadoop, you first need to understand how data flows through it, much as you learn the servlet lifecycle to understand servlets. Hadoop is a distributed storage system (HDFS) combined with a distributed computing framework (MapReduce), and it has one more important characteristic: Hadoop moves the MapReduce computation to the machines that store the pieces of the data, rather than moving the data to the computation.



Terms
A MapReduce job is the unit of work that a client wants performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop runs a job by dividing it into smaller tasks, of which there are two types: map tasks and reduce tasks.
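For concreteness, here is a minimal sketch of how a client describes such a job with the org.apache.hadoop.mapreduce API. The class names (WordCountDriver, WordCountMapper, WordCountReducer) are illustrative, not part of Hadoop; the mapper and reducer themselves are sketched in the sections below.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            // A job bundles the input data, the MapReduce program, and configuration.
            // (On older, JobTracker-era releases, "new Job(conf, ...)" is used instead.)
            Job job = Job.getInstance(conf, "word count");
            job.setJarByClass(WordCountDriver.class);

            job.setMapperClass(WordCountMapper.class);    // run as map tasks
            job.setReducerClass(WordCountReducer.class);  // run as reduce tasks

            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);

            FileInputFormat.addInputPath(job, new Path(args[0]));    // input data
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // final output (HDFS)

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }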


Two types of nodes control the job execution process: a single jobtracker and a number of tasktrackers. The jobtracker coordinates all the jobs running on the system by scheduling tasks to run on tasktrackers. While running a task, a tasktracker sends progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker node.

Input
Hadoop divides the input to a MapReduce job into fixed-length pieces called input splits. Hadoop creates one map task for each split, and the task runs the user-defined map function on every record in the split.
For most jobs, a reasonable split size is the size of one HDFS block, 64 MB by default, although this default can be changed for the cluster. The split size should be chosen with the workload in mind: if splits are too small, the overhead of managing the splits and of creating the map tasks begins to dominate the total job execution time.
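As a sketch of what a user-defined map function looks like, the mapper below (the assumed WordCountMapper from the driver above) is invoked once per record in its split; with plain text input, a record is one line and the key is the line's byte offset in the file.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class WordCountMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Called once for each record in this map task's input split.
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);  // emit intermediate (key, value) pairs
                }
            }
        }
    }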

Hadoop does its best to run each map task on a node where the input data is stored, which gives the best performance; this is called the data locality optimization. A block is the smallest unit of data stored in HDFS, and each block is replicated on several nodes, but the blocks that make up a file are scattered across the cluster. If the input split for a map task spanned multiple blocks, it would therefore be unlikely that any single node held all of them at once, and the task would first have to copy the missing blocks over the network before it could run the map function, which is clearly inefficient. This is why the split size is normally matched to the block size.


Output
Map tasks write their output to the local disk, not to HDFS. This is because the map output is an intermediate result: it still has to be processed by the reduce tasks to produce the final output (which is stored in HDFS), and once the job is complete the map output can be thrown away.
Reduce tasks do not have the advantage of data locality: the input to a single reduce task normally comes from the output of all the map tasks. The output of the reduce tasks is stored in HDFS for reliability.
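A matching reduce function might look like the sketch below (again the assumed WordCountReducer): its input is the shuffled map output, grouped by key, and whatever it writes through the context ends up in HDFS.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCountReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable total = new IntWritable();

        @Override
        protected void reduce(Text key, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            // "counts" gathers the values for this key from every map task's output.
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            total.set(sum);
            context.write(key, total);  // final result, stored reliably in HDFS
        }
    }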


Data Flow
The shape of the data flow depends on the number of reduce tasks. The number of reduce tasks is not governed by the size of the input; it is specified independently in the job configuration.
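For example, in the driver sketched earlier the number of reduce tasks could be set explicitly (the equivalent configuration property name differs between Hadoop versions):

    // Request two reduce tasks regardless of how large the input is.
    job.setNumReduceTasks(2);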

Single reduce task


Multiple reduce tasks
With multiple reduce tasks, each map task partitions its output, creating one partition for every reduce task. Partitioning is controlled by a user-definable partition function; the default partitioner (HashPartitioner) buckets keys using a hash function.
The data flow between the map tasks and the reduce tasks is known as the shuffle.
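A user-defined partitioner is just a class that maps a (key, value) pair to a reduce task number. The sketch below (an assumed WordPartitioner) mirrors what the default HashPartitioner does and would be registered on the job with job.setPartitionerClass(WordPartitioner.class).

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class WordPartitioner extends Partitioner<Text, IntWritable> {
        @Override
        public int getPartition(Text key, IntWritable value, int numReduceTasks) {
            // Each map task writes this record into the partition belonging to
            // exactly one reduce task, chosen here by hashing the key.
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }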



Finally, it is also possible to have no reduce tasks at all. This is appropriate when the processing can be carried out entirely in parallel and no shuffle is needed.
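In that map-only case the job is configured with zero reduce tasks, and each map task writes its output straight to HDFS through the configured output format; a sketch:

    // No shuffle, no reduce phase: map output goes directly to the job's output path.
    job.setNumReduceTasks(0);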




Combiner (merge function)
While we are at it, a word about the combiner. Hadoop lets the user specify a combiner function to be run on the output of each map task; the combiner's output then becomes the input to the reduce function. The combiner is purely an optimization: it runs on the same node as the map task, right after the map task finishes (and is often just the reduce function applied locally), to cut down the amount of data transferred over the network.
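In the job configuration this is a one-line change; for an additive reduce function like the word-count sketch above, the reducer class itself can usually double as the combiner:

    // The combiner runs on the map-side output, on the same node as the map task,
    // before anything is sent across the network to the reducers.
    job.setCombinerClass(WordCountReducer.class);
    job.setReducerClass(WordCountReducer.class);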

