MapReduce Principles and Examples in Hadoop

MapReduce is a programming model for data processing. It is simple to use, yet expressive enough to support parallel processing of very large datasets.
1. MapReduce in general
MapReduce processing is divided into two phases: map and reduce. The input and output of each phase are key-value pairs, and the key and value types can be chosen by the programmer. The map phase processes the split input data in parallel, its results are passed to the reduce phase, and the reduce function performs the final aggregation.
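In more abstract terms, the two functions can be described by the following type signatures, where K1/V1, K2/V2, and K3/V3 are the programmer-chosen key and value types and list(...) denotes a list of pairs:

map:    (K1, V1)        -> list(K2, V2)
reduce: (K2, list(V2))  -> list(K3, V3)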

For example, suppose we want to find the highest recorded temperature for each year from a large amount of historical data. NCDC (the National Climatic Data Center) has published historical weather records that include temperature and other measurements.

To use MapReduce to find the highest annual temperature, we use the offset of each line within the file as the key of the map input, and the line itself as the value.

Within each raw record, the year and the temperature appear at fixed character positions. The map function processes each line, extracts these two fields, and emits key-value pairs of the form (year, temperature) as the map output:
(1950, 0)
(1950, 22)
(1950, -11)
(1949, 111)
(1947, 78)

Some of the records are obviously dirty, so the map function is also a good place to filter and clean the data. Before the map output is transferred to reduce, the MapReduce framework sorts the key-value pairs and groups them by key (an optional combiner can even pre-compute the per-map maximum for each key), so the data received by reduce looks like this:

(1949, [111, 78])
(1950, [0, 22, -11])

If multiple map tasks run at the same time (which is usually the case), each map task sends its output in the above format to the reduce tasks as it finishes. This transfer of data from map to reduce is called the shuffle.

The entire MapReduce data flow is as follows:

[Figure: the MapReduce data flow through the map, shuffle, and reduce phases]

The three stages shown in the figure are the map, shuffle, and reduce processes. In Hadoop, map and reduce functions can be written in several languages, such as Java, Python, or Ruby (the latter two via Hadoop Streaming).

In a real distributed computation, this process is coordinated across the whole cluster. Assume we have five years (2011-2015) of weather data spread over three files: weather1.txt, weather2.txt, and weather3.txt, and a cluster of three machines, with 3 map task instances and 2 reduce task instances. When MapReduce actually runs as a job, the overall process looks like this:
[Figure: a MapReduce job with 3 map tasks and 2 reduce tasks running over the three input files on a 3-machine cluster]
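With two reduce tasks, every (year, temperature) pair emitted by the maps must be routed to one of the two reducers, and all pairs for the same year must go to the same reducer. Hadoop does this with a partitioner; the default HashPartitioner is essentially the following (shown here only to illustrate the routing, you normally do not need to write it yourself):

import org.apache.hadoop.mapreduce.Partitioner;

// Records with the same key (year) hash to the same reduce task,
// so each reducer sees all temperatures for the years assigned to it.
public class HashPartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}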

2. Examples and code implementation
The conceptual framework of MapReduce was proposed by Google, and Hadoop provides the classic open-source implementation, but the model is not unique to Hadoop. In the document database MongoDB, for example, you can write map-reduce jobs in JavaScript to process data stored in the database. Here we use Hadoop as the example.
Data preparation
First upload the local file to HDFS.
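For example, assuming the raw records are in a local file named sample.txt and the target HDFS directory is /input/ncdc (the same path queried below), the upload could look like this:

hadoop fs -mkdir -p hdfs://master:9000/input/ncdc
hadoop fs -put sample.txt hdfs://master:9000/input/ncdc/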
You can check the management interface to see if the upload was successful.
Check the data content:

hadoop fs -text hdfs://master:9000/input/ncdc/sample.txt
Writing Java code
Start from Hadoop's Mapper base class. In the new Hadoop API, Mapper is a class (in the old API it was an interface):


// The generic parameters define the key and value types of the map input and output.
public class Mapper<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {

  public Mapper() {
  }

  // Called once at the start of the map task, for setup work.
  protected void setup(Context context) throws IOException, InterruptedException {
    // empty implementation
  }

  // Default map logic: pass the input straight through (via a type cast).
  protected void map(KEYIN key, VALUEIN value,
                     Context context) throws IOException, InterruptedException {
    context.write((KEYOUT) key, (VALUEOUT) value);
  }

  // Called once after the task finishes, for cleanup; the counterpart of setup.
  protected void cleanup(Context context) throws IOException, InterruptedException {
    // empty implementation
  }

  // The framework drives a map task by calling run. Override it only when you
  // need finer control over the execution flow; normally this is not necessary.
  public void run(Context context) throws IOException, InterruptedException {
    // preparation
    setup(context);
    try {
      // Iterate over the records assigned to this task and call map for each one.
      while (context.nextKeyValue()) {
        map(context.getCurrentKey(), context.getCurrentValue(), context);
      }
    } finally {
      // clean up
      cleanup(context);
    }
  }
}

In our implementation we only override the map method and leave the others unchanged. The concrete implementation is as follows:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  // 9999 indicates a missing temperature reading
  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {

    // The value is the whole input line; the key (the line offset) is not needed here.
    String line = value.toString();
    // Extract the year
    String year = line.substring(15, 19);
    // Extract the temperature and the quality code
    int airTemperature = parseTemperature(line);
    String quality = line.substring(92, 93);

    // Filter out dirty records
    boolean isRecordClean = airTemperature != MISSING && quality.matches("[01459]");
    if (isRecordClean) {
      // Emit a (year, temperature) pair
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }

  private int parseTemperature(String line) {
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't like leading plus signs
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    return airTemperature;
  }
}

Our Reducer implementation simply finds the maximum temperature for each year:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = findMax(values);
    context.write(key, new IntWritable(maxValue));
  }

  // Scan all temperatures for this key (year) and keep the largest one.
  private static int findMax(Iterable<IntWritable> values) {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    return maxValue;
  }
}
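
A driver class is also needed to configure and submit the job. A minimal sketch might look like this (the class name MaxTemperature matches the one used in the run step below, the job name string is arbitrary, and the combiner line is an optional optimization):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {

  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    Job job = Job.getInstance(new Configuration(), "Max temperature");
    job.setJarByClass(MaxTemperature.class);

    // Input and output paths, set via the static FileInputFormat / FileOutputFormat methods.
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    // Optional: reuse the reducer as a combiner to pre-aggregate map output.
    job.setCombinerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}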

The input and output paths are set with the static methods of FileInputFormat and FileOutputFormat. The output directory must not exist before the job runs; this prevents an existing directory from being overwritten and its data lost, so if the directory is detected the job refuses to run.

Now package the project. If you use Eclipse, use the Export function; if you use Maven, simply run the package goal. Suppose the resulting jar is max-temp.jar. Upload the jar to one of the cluster machines, or to a client machine with Hadoop installed; here we assume it is placed in the /opt/job directory.
Run
First, put the job jar on the classpath so that it can be run.

The hadoop command automatically adds the path set in HADOOP_CLASSPATH, together with the Hadoop libraries themselves, to the classpath, and then starts a JVM that runs the main method of the MaxTemperature class.
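Concretely, with the jar in /opt/job, the run could look like this (the HDFS output directory name here is an arbitrary example and must not already exist):

export HADOOP_CLASSPATH=/opt/job/max-temp.jar
hadoop MaxTemperature hdfs://master:9000/input/ncdc/sample.txt hdfs://master:9000/output/ncdc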

The job log reports the run's statistics, such as the number of map tasks, the number of reduce tasks, and the counts of input and output records, which you can check against the actual data.
