Abstract: MapReduce is another core module of Hadoop. This article examines MapReduce from three angles: what MapReduce is, what MapReduce can do, and how MapReduce works.
Keywords: Hadoop, MapReduce, Distributed Processing
In the face of big data, storage and processing are like a person's two hands: both are indispensable. Hadoop is well suited to big data problems precisely because it couples a big data storage system, HDFS, with a big data processing system, MapReduce. For more information about HDFS, see the author's article on Hadoop HDFS. As for MapReduce, we will approach it through the following three questions.
Question 1: What is MapReduce?
Question 2: What can MapReduce do?
Question 3: How does MapReduce work?
For the first question, we can refer to the Apache Foundation's own introduction: "Hadoop MapReduce is a software framework for easily writing applications which process vast amounts of data (multi-terabyte data-sets) in-parallel on large clusters (thousands of nodes) of commodity hardware in a reliable, fault-tolerant manner." In other words, Hadoop MapReduce is a software framework on top of which applications can easily be written; these applications run on large clusters of thousands of commodity machines and process multi-terabyte data sets in parallel, reliably and fault-tolerantly. The definition contains these keywords: software framework, parallel processing, reliability and fault tolerance, large-scale clusters, and massive data sets. Put simply, then, MapReduce is a software framework; massive data is its "dish", and it "cooks" that dish in parallel, in a reliable and fault-tolerant manner, across a large cluster. Here I cannot help but marvel at the greatness of the idea: the magic of decomposition and the cleverness of recombination.
Once we know what MapReduce is, the second question almost answers itself. What can MapReduce do? In short, it supports big data processing, by which we mean value-oriented processing, mining, optimization, and similar operations on big data.
MapReduce is good at processing big data. Where does this capability come from? The answer lies in its design idea: "divide and conquer". The Mapper is responsible for dividing a complex task into several simple tasks. "Simple" carries three meanings here: first, the data or computing scale of each task is greatly reduced compared with the original task; second, by the principle of data locality, tasks are assigned to the nodes that already store the data they need; third, these small tasks can be computed in parallel, with almost no dependencies among them. The Reducer then summarizes the results of the map stage, as the sketch below illustrates. As for how many reducers are needed, you can set the mapred.reduce.tasks parameter in the mapred-site.xml configuration file according to the problem at hand; the default value is 1.
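To make the divide-and-conquer idea concrete, here is a minimal sketch of the classic word-count example, written against the org.apache.hadoop.mapreduce API: the Mapper breaks each line of input into independent (word, 1) pairs, and the Reducer sums the counts for each word. It is an illustration, not production code.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map phase: each call handles one line of input and emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sums the counts emitted for each word across all mappers.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }
}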
How does MapReduce handle big data? By means of MapReduce applications written to operate on it. But once such a program is written, how does it actually run? That is the third question: the working mechanism of MapReduce.
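Before turning to the mechanism, here is a hypothetical driver sketch showing how such an application is typically configured and submitted. The class names follow the word-count sketch above, the input and output paths are placeholders taken from the command line, and the Job constructor shown is the older, JobTracker-era (Hadoop 1.x) style.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Hadoop 1.x style; newer releases use Job.getInstance(conf, "word count").
    Job job = new Job(conf, "word count");
    job.setJarByClass(WordCountDriver.class);
    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setCombinerClass(WordCount.IntSumReducer.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // Equivalent, in code, to the mapred.reduce.tasks setting mentioned above.
    job.setNumReduceTasks(1);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}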
The workflow diagram (not reproduced here) shows the entire MapReduce process, which involves the following four independent entities.
Entity 1: the client, which submits MapReduce jobs.
Entity 2: the jobtracker, which coordinates the running of a job.
Entity 3: the tasktrackers, which run the tasks that the job has been divided into.
Entity 4: HDFS, which is used to share job files among the other entities.
Reviewing the MapReduce workflow, we can see that the whole process moves through the following steps in order.
Step 1: Job submission
Step 2: Job initialization
Step 3: Task assignment
Step 4: Task execution
Step 5: Progress and status updates (see the sketch after this list)
Step 6: Job completion
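The following sketch suggests how a client might observe some of these steps programmatically through the org.apache.hadoop.mapreduce.Job API: submit() hands the job to the framework (step 1), mapProgress() and reduceProgress() expose the status updates (step 5), and isComplete() signals job completion (step 6). The JobMonitor class and the polling interval are illustrative assumptions, not part of the framework.

import org.apache.hadoop.mapreduce.Job;

public class JobMonitor {
  // Submits an already configured job and polls its progress until it finishes.
  public static void submitAndWatch(Job job) throws Exception {
    job.submit();                         // Step 1: submit the job
    while (!job.isComplete()) {           // Step 6: completion check
      // Step 5: the framework's periodic progress and status updates
      System.out.printf("map %.0f%%  reduce %.0f%%%n",
          job.mapProgress() * 100, job.reduceProgress() * 100);
      Thread.sleep(5000);                 // illustrative polling interval
    }
    System.out.println("Job " + (job.isSuccessful() ? "succeeded" : "failed"));
  }
}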
For details of what happens at each step, see Chapter 6, "How MapReduce Works", in Hadoop: The Definitive Guide.
To handle big data with MapReduce, you must write MapReduce applications suited to your own needs. Developing programs on the MapReduce framework therefore calls for careful thought and constant practice.