Hadoop in the Big Data Era (I): Hadoop Installation
Hadoop in the Big Data Era (II): Hadoop Script Parsing
To understand Hadoop, you first need to understand its data flow, much as understanding the servlet lifecycle is key to learning servlets. Hadoop is a distributed storage system (HDFS) combined with a distributed computing framework (MapReduce), but it also has an important property: Hadoop moves the MapReduce computation to the machines that store the relevant portions of the data.
A MapReduce job is a unit of work that the client wants to be performed: it consists of the input data, the MapReduce program, and configuration information. Hadoop divides a job into a number of small tasks for execution, of which there are two types: map tasks and reduce tasks.
Two types of nodes control the job execution process: one jobtracker and a number of tasktrackers. The jobtracker coordinates all jobs running on the system by scheduling tasks to run on tasktrackers. While running a task, a tasktracker sends progress reports to the jobtracker, which keeps a record of the overall progress of each job. If a task fails, the jobtracker can reschedule it on a different tasktracker node.
Hadoop divides the input to a MapReduce job into fixed-size pieces called input splits. Hadoop creates one map task for each split, and the task runs the user-defined map function on every record in the split.
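To make this concrete, here is a toy user-defined map function in plain Python (this is not Hadoop code; the record format and names are illustrative): for each input record (a line of text) it emits a (word, 1) pair per word, as a word-count mapper would.

```python
def map_word_count(record):
    """Toy user-defined map function: for one input record (a line of
    text), emit a (key, value) pair per word, with value 1."""
    return [(word, 1) for word in record.split()]

# A map task applies the map function to every record in its split:
split = ["the quick brown fox", "the lazy dog"]
pairs = [kv for line in split for kv in map_word_count(line)]
```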
For most jobs, a good split size tends to be the size of one HDFS block, which is 64 MB by default, although this default can be changed for the cluster. The right split size depends on the job being run: if splits are too small, the overhead of managing the splits and of creating the map tasks begins to dominate the total job execution time.
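The overhead argument can be sketched with some simple arithmetic (the numbers are illustrative, assuming the default 64 MB split size):

```python
MB = 1024 * 1024
input_size = 1024 * MB   # a 1 GB input file
split_size = 64 * MB     # default HDFS block / split size

# One map task per split (ceiling division):
num_map_tasks = -(-input_size // split_size)    # 16 map tasks

# With a much smaller split size, the number of map tasks -- and hence
# the per-task scheduling and startup overhead -- grows sharply:
small_split = 1 * MB
many_map_tasks = -(-input_size // small_split)  # 1024 map tasks
```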
Hadoop does its best to run a map task on a node that stores the input data, to achieve optimal performance. This is called the data locality optimization. A block is the smallest unit of data stored in HDFS, each block is replicated on several nodes (for backup), and the blocks that make up a file are distributed across many nodes. So if the input split of a map task spanned multiple blocks, it would be unlikely that any single node held all of those blocks at once; the map task would then have to copy the missing blocks over the network to its own node before running the map function, which is clearly inefficient. This is one reason a split size of one block works well.
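The locality preference can be sketched as follows (a pure illustration, not the real Hadoop scheduler): given the nodes holding replicas of a split's block, prefer one of them, and only fall back to a remote node when none is free.

```python
def pick_node(replica_nodes, free_nodes):
    """Toy data-locality choice: run the map task on a node that already
    stores the block if any such node is free; otherwise pick any free
    node and accept copying the block over the network."""
    for node in replica_nodes:
        if node in free_nodes:
            return node, "local"
    return next(iter(free_nodes)), "remote"

# "n2" holds a replica and is free, so the task runs locally there:
node, locality = pick_node(["n1", "n2", "n3"], {"n2", "n7"})
```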
Map tasks write their output to the local disk, not to HDFS. This is because the map output is an intermediate result: it is processed by the reduce tasks to produce the final output, which is stored in HDFS. Once the job is complete, the map output can be deleted.
Reduce tasks do not have the advantage of data locality: the input to a single reduce task normally comes from the output of all the map tasks. The output of the reduce tasks is normally stored in HDFS for reliability.
The data flow differs depending on the number of reduce tasks. The number of reduce tasks is not governed by the size of the input data; it is specified independently through configuration.
With a single reduce task, all map outputs are sent to the one reducer. With multiple reduce tasks, each map task creates one partition of its output for each reduce task. Partitioning is controlled by a user-defined partitioning function; the default partitioner buckets keys using a hash function.
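The default behavior can be sketched like this: Hadoop's default HashPartitioner sends a record to partition hash(key) mod number-of-reducers. The Java-style string hash below is a plain-Python stand-in, for illustration only.

```python
def java_string_hash(s):
    """Mimic Java's String.hashCode (h = 31*h + char), truncated to a
    signed 32-bit int, purely to illustrate hash partitioning."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

def partition(key, num_reduce_tasks):
    # Non-negative hash modulo the reducer count, in the spirit of
    # Hadoop's HashPartitioner.
    return (java_string_hash(key) & 0x7FFFFFFF) % num_reduce_tasks

p = partition("hadoop", 4)
```

Because every map task uses the same function, all records sharing a key land in the same partition, and therefore reach the same reducer.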
The data flow between the map tasks and the reduce tasks is called the shuffle.
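The shuffle can be pictured as a grouping step. Below is a toy in-memory version (real Hadoop sorts and merges partition files on disk, streaming them to the reducers over the network):

```python
from collections import defaultdict

def shuffle(map_outputs):
    """Toy shuffle: gather the (key, value) pairs emitted by all map
    tasks and group the values by key for the reduce phase."""
    grouped = defaultdict(list)
    for output in map_outputs:        # one list of pairs per map task
        for key, value in output:
            grouped[key].append(value)
    return dict(sorted(grouped.items()))  # reducers see keys in sorted order

# Two map tasks' outputs, merged and grouped by key:
map_outputs = [[("the", 1), ("fox", 1)], [("the", 1), ("dog", 1)]]
grouped = shuffle(map_outputs)
```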
It is also possible to have no reduce tasks at all. In that case there is no shuffle, the processing is entirely parallel, and each map task writes its output directly to HDFS.
Finally, a word about the combiner (merge function). Hadoop allows the user to specify a combiner function to be run on the output of each map task; the combiner's output then forms the input to the reduce function. The combiner is purely an optimization: it runs on the map node after the map task finishes (and often reuses the reduce function's logic) to cut down the amount of data transferred over the network to the reducers.
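A toy word count illustrates the effect (plain Python, not Hadoop code): the combiner applies the reduce logic, summing counts, locally on one map task's output, so fewer pairs cross the network.

```python
from collections import defaultdict

def combine(pairs):
    """Toy combiner: apply the reduce logic (here, summing counts per
    key) locally to one map task's output before it is shuffled."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

map_output = [("the", 1), ("dog", 1), ("the", 1), ("the", 1)]
combined = combine(map_output)   # 4 pairs shrink to 2: [("dog", 1), ("the", 3)]
```

Note that this only works because summation is associative and commutative; that is the general requirement for a function to be safely usable as a combiner.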
Hadoop in the Big Data Era (III): Hadoop Data Flow (Lifecycle)