Hadoop Learning - MapReduce Principle and Operation Process

Earlier we performed operations with HDFS and learned its principles and mechanisms. With a distributed file system in place, how do we process the files it stores? That is the job of Hadoop's second component: MapReduce.

MapReduce fully draws on the idea of divide and conquer: it splits a data-processing job into two steps, Map and Reduce. The user only needs to supply the data in the required key-value format and implement the map and reduce functions to perform distributed computing; the MapReduce framework encapsulates most of the remaining work, which greatly simplifies the process.

1 MapReduce programming ideas

MapReduce's design comes from the map and reduce operations in Lisp and other functional programming languages. Its smallest unit of data manipulation is the key-value pair. When using the MapReduce programming model, the first step is to convert the input data into key-value pairs. The map function takes key-value pairs as input and, after processing, produces new key-value pairs as intermediate results. The MapReduce framework automatically aggregates these intermediate results and distributes all values with the same key to the reduce function, which outputs its results as key-value pairs as well. Expressed roughly as type signatures:

map: (k1, v1) -> list(k2, v2)
reduce: (k2, list(v2)) -> list(k3, v3)

2 MapReduce operating environment
Similar to HDFS, the MapReduce computing framework is also a master-slave architecture. It is supported by two types of background processes: JobTracker and TaskTracker.
2.1 JobTracker
JobTracker plays the master role in the cluster. It is mainly responsible for two functions, task scheduling and cluster resource monitoring, but does not participate in the actual computation. A Hadoop cluster has only one JobTracker, which is therefore a potential single point of failure, so it should run on a relatively reliable node: a JobTracker failure causes all running jobs in the cluster to fail.

Similar to HDFS's NameNode and DataNode, each TaskTracker reports its current health and status to the JobTracker through a periodic heartbeat. The heartbeat includes information about its own computing resources, the resources currently occupied, and the status of its running tasks. Based on the heartbeats periodically sent by each TaskTracker, the JobTracker considers factors such as the TaskTracker's remaining resources, job priority, and job submission time, and assigns a suitable task to that TaskTracker.

2.2 TaskTracker
TaskTracker plays the slave role in the cluster. It is mainly responsible for reporting heartbeats and executing JobTracker commands. A cluster can have many TaskTrackers, but each node runs only one, and the TaskTracker and DataNode run on the same node, so a node is both a compute node and a storage node. Each TaskTracker periodically reports its state to the JobTracker, which receives the heartbeat and, based on it and the current state of running jobs, issues commands back to the TaskTracker. There are five main commands: start task, commit task, kill task, kill job, and reinitialize.

2.3 Client
The user submits the written MapReduce program to the JobTracker through the client.

3 MapReduce jobs and tasks
A MapReduce job is the smallest unit a user submits, while a Map/Reduce task is the smallest unit of MapReduce computation. When a user submits a MapReduce job to Hadoop, the JobTracker's job-decomposition module splits it into tasks, which are executed by the TaskTrackers. In the MapReduce computing framework there are two kinds of tasks: Map tasks and Reduce tasks.
4 MapReduce computing resource partitioning
The computation of a MapReduce job is carried out by TaskTrackers. When a user submits a job to Hadoop, the JobTracker splits it into multiple tasks and, based on heartbeat information, sends them to idle TaskTrackers. The number of tasks a TaskTracker can run concurrently is determined by its configured task slots. A slot is Hadoop's representation model for computing resources: Hadoop abstracts the multi-dimensional resources (CPU, memory, etc.) of each node into one-dimensional slots, turning a multi-dimensional resource-allocation problem into a one-dimensional slot-allocation problem. Because Map tasks and Reduce tasks require different computing resources in practice, Hadoop divides slots into Map slots and Reduce slots: Map tasks can only use Map slots, and Reduce tasks can only use Reduce slots.
Hadoop's resource management uses a static resource scheme: the number of Map slots and Reduce slots is configured per node (via mapred.tasktracker.map.tasks.maximum and mapred.tasktracker.reduce.tasks.maximum in mapred-site.xml) and cannot be changed dynamically once Hadoop is started.
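As an illustration, a mapred-site.xml fragment setting these two properties might look like the following (the values 4 and 2 are just example choices for this sketch, not recommendations):

```xml
<configuration>
  <!-- Maximum number of Map tasks this TaskTracker can run at once -->
  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>4</value>
  </property>
  <!-- Maximum number of Reduce tasks this TaskTracker can run at once -->
  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>2</value>
  </property>
</configuration>
```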

5 Limitations of MapReduce
From the characteristics of MapReduce, its advantages are obvious, but MapReduce also has its limitations; it is not a universal method for processing massive data. Its limitations are mainly the following:
MapReduce is slow to execute. An ordinary MapReduce job usually completes at the minute level; for complex jobs or larger data volumes it may take an hour or more. Fortunately, offline computing is far less latency-sensitive than OLTP, so MapReduce is not now, and will not be, the terminator of relational databases. The slowness of MapReduce is mainly due to disk I/O: MapReduce jobs are usually data-intensive, and large amounts of intermediate results must be written to disk and transmitted over the network, which consumes a lot of time.

MapReduce is too low-level. Compared to SQL, MapReduce is too low-level. For ordinary queries, the average person would not want to write a map function and reduce function. For users who are accustomed to relational databases or data analysts, writing map functions and reduce functions is undoubtedly a headache. Fortunately, the emergence of Hive has greatly improved this situation.

Not all algorithms can be implemented with MapReduce, because not all algorithms can be parallelized in this way. For example, training some machine learning models requires shared state or dependencies between parameters, which must be maintained and updated centrally; this does not fit the MapReduce model.
6 A Hello World example
This time the hello world is a word-count program. We will use this small program to understand how MapReduce is used.

Mapper class:

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
public class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    // An IntWritable constant used for counting: each word is emitted with the value 1
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(Object key, Text value, Context context) throws IOException, InterruptedException {
        // Hadoop reads the input line by line; the key is the offset of the line in the file.
        // To count words, the line is split on whitespace using StringTokenizer;
        // value.toString().split(" ") would also work.
        StringTokenizer itr = new StringTokenizer(value.toString());
        // Traverse every word in the line and emit (word, 1)
        while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, one);
        }
    }
}

The above is the map process.
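The tokenization step in the mapper can be tried in plain Java, independent of Hadoop. This small sketch (class and method names are my own) just shows that StringTokenizer splits a line on whitespace:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

public class TokenizeDemo {
    // Split a line into words on whitespace, as the mapper does
    public static List<String> tokenize(String line) {
        List<String> words = new ArrayList<>();
        StringTokenizer itr = new StringTokenizer(line);
        while (itr.hasMoreTokens()) {
            words.add(itr.nextToken());
        }
        return words;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("hello world big"));  // prints [hello, world, big]
    }
}
```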

Above we saw the data type IntWritable. In Hadoop, Text is the equivalent of the JDK's String, and IntWritable is the equivalent of the JDK's int. Hadoop has eight basic data types, all of which implement the WritableComparable interface:

BooleanWritable: standard Boolean value
ByteWritable: single-byte value
IntWritable: integer
FloatWritable: single-precision floating point
LongWritable: long integer
DoubleWritable: double-precision floating point
Text: text stored in UTF-8 format
NullWritable: used when the key or value in a <key, value> pair is empty

Next, write the reduce class.
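The original reduce class is not reproduced here, so the following is a minimal sketch consistent with the mapper above (the class and field names are my own choice). It sums the counts received for each word:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // The shuffle delivers all values for the same key together; add them up
        int sum = 0;
        for (IntWritable val : values) {
            sum += val.get();
        }
        result.set(sum);
        context.write(key, result);
    }
}
```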

Here is the main function.
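The original main function is also not reproduced, so here is a minimal driver sketch under the assumption that the mapper above is used together with a reducer named IntSumReducer (the combiner line is optional local aggregation):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional: pre-aggregate on the map side
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // args[0]: input path; args[1]: output path (must not exist yet)
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```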

Next, package the program into a jar. I use IDEA; the packaging steps are not listed here. The jar I built is named hadoop.jar. After building, upload the jar to the server.
With your hadoop service turned on, enter the following command:

#hadoop jar hadoop.jar cn.edu.hust.demo1.WordCount /user/input /user/output
Here cn.edu.hust.demo1.WordCount is the fully qualified name of the main class. The first path after it is the path to the word file; I uploaded it to /user/input. The second, /user/output, is the output path; it must not exist before executing the command, otherwise an error is reported.
After waiting a while, if the job completes without errors, the run has succeeded.

The content of the txt word file that I uploaded is:
hello world big
hello world xiaoming
the world is beautiful
xiaoming is my friend
After running the program, the word counts written to the output path are:
beautiful	1
big	1
friend	1
hello	2
is	2
my	1
the	1
world	3
xiaoming	2
This completes our word statistics.
Next, let's use this word-frequency example to look at how MapReduce actually runs.
7 In-depth understanding of the running process of MapReduce
As can be seen from the previous WordCount, a MapReduce job passes through five stages: input, map, combine, reduce, and output. The combine stage does not necessarily occur. The process of distributing the map's intermediate output to the reducers is called shuffle.
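To make the stages concrete, here is a small plain-Java simulation of map, shuffle, and reduce for word counting (no Hadoop involved; the class and method names are my own):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

public class WordCountLocal {
    public static Map<String, Integer> wordCount(List<String> lines) {
        // Map stage: emit (word, 1) for every word in every line
        List<Map.Entry<String, Integer>> intermediate = new ArrayList<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                intermediate.add(Map.entry(itr.nextToken(), 1));
            }
        }
        // Shuffle stage: group values by key (TreeMap also sorts keys, as Hadoop does)
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : intermediate) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        // Reduce stage: sum the grouped values for each key
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet()) {
            int sum = 0;
            for (int v : e.getValue()) {
                sum += v;
            }
            result.put(e.getKey(), sum);
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> lines = List.of("hello world big", "hello world xiaoming",
                "the world is beautiful", "xiaoming is my friend");
        System.out.println(wordCount(lines));
    }
}
```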
