MapReduce Tutorial (1): MapReduce Framework Development

Source: Internet
Author: User
Keywords: mapreduce, hadoop, writable interface
Tags: hadoop, mapreduce, writable interface, big data, storage, map function
1 MapReduce programming
1.1 Introduction to MapReduce
MapReduce is a programming model for parallel computation over large data sets (typically larger than 1 TB), designed to solve the problem of processing massive amounts of data.
MapReduce is divided into two parts:
1. Mapping applies the same operation to every item in a collection. For example, if you want to double every cell in a spreadsheet, applying that function to each cell individually is a mapping.
2. Reducing traverses the items in a collection and combines them into a single result. Summing a column of numbers in that spreadsheet is a reducing task.
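These two ideas can be sketched in plain Java (a standalone illustration of the concepts, not the Hadoop API; the class and method names are invented for this example):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class MapReduceIdea {
    // Mapping: apply the same operation (here, doubling) to each element.
    static List<Integer> mapDouble(List<Integer> cells) {
        return cells.stream().map(x -> x * 2).collect(Collectors.toList());
    }

    // Reducing: traverse the elements and combine them into one result (here, a sum).
    static int reduceSum(List<Integer> cells) {
        return cells.stream().reduce(0, Integer::sum);
    }

    public static void main(String[] args) {
        List<Integer> column = Arrays.asList(1, 2, 3, 4);
        System.out.println(mapDouble(column)); // [2, 4, 6, 8]
        System.out.println(reduceSum(column)); // 10
    }
}
```

The point MapReduce exploits is that the map step is independent per element, so it can run on many machines at once, while the reduce step combines the partial results.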

When you submit a job to the MapReduce framework, it first splits the job into several Map tasks and assigns them to different nodes for execution. Each Map task processes a part of the input data. When a Map task completes, it generates intermediate files that serve as input to the Reduce tasks.
The main goal of a Reduce task is to aggregate the outputs of the preceding Map tasks and write the result.
The great strength of MapReduce is that it lets programmers run their programs on a distributed system without having to write distributed parallel code themselves.

1.2 MapReduce Operation Principle

Everything starts with the user program at the top. The user program links against the MapReduce library and implements the two essential functions, map and reduce. The numbers in the figure mark the order in which the steps are executed.

1. The MapReduce library first divides the input file of the user program into M pieces (M is defined by the user), each typically 16 MB to 64 MB; these are split0 through split4 on the left side of the figure. It then uses fork to copy the user process to other machines in the cluster.

2. One of the copies of the user program is called the master; the rest are called workers. The master is responsible for scheduling and assigns jobs (Map jobs or Reduce jobs) to idle workers. The number of workers can also be specified by the user.

3. A worker assigned a Map job reads the input data of the corresponding split. The number of Map jobs is determined by M and corresponds one-to-one with the splits. The Map job parses key-value pairs out of the input data and passes each pair as arguments to the map function; the intermediate key-value pairs produced by the map function are buffered in memory.

4. The buffered intermediate key-value pairs are periodically written to local disk, divided into R partitions (R is defined by the user). Each partition will later correspond to one Reduce job. The locations of these intermediate key-value pairs are reported to the master, which is responsible for forwarding that information to the Reduce workers.
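The division into R partitions is typically done by hashing the key and taking the remainder modulo R. A self-contained sketch, mirroring the logic of Hadoop's default HashPartitioner (the class and method here are local stand-ins, not Hadoop classes):

```java
public class HashPartitionDemo {
    // Mask off the sign bit so the result is non-negative, then take the
    // remainder modulo the number of partitions (R, one per Reduce job).
    static int partition(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        int r = 3; // R, chosen here purely for illustration
        for (String key : new String[] {"Hello", "tom", "jerry"}) {
            System.out.println(key + " -> partition " + partition(key, r));
        }
    }
}
```

Because the same key always hashes to the same partition, every intermediate pair for a given key, no matter which Map job produced it, ends up at the same Reduce job.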

5. The master informs a worker assigned a Reduce job where the partitions it is responsible for are located (certainly more than one place: the intermediate key-value pairs produced by every Map job may be scattered across all R partitions). Once the Reduce worker has read all the intermediate key-value pairs it is responsible for, it sorts them so that pairs with the same key are grouped together. Because different keys can map to the same partition, i.e. the same Reduce job (there are far fewer partitions than keys), this sorting is necessary.

6. The Reduce worker traverses the sorted intermediate key-value pairs. For each unique key, it passes the key and the associated values to the reduce function, and the output produced by the reduce function is appended to the output file of that partition.

7. When all Map and Reduce jobs are completed, the master wakes up the real user program, and the MapReduce call returns to the user program's code.

8. After everything has run, the MapReduce output sits in R partitioned output files (one per Reduce job). Users usually do not need to merge these R files; instead they pass them as input to another MapReduce program.

Throughout the process, the input data comes from the underlying distributed file system (GFS), the intermediate data is placed on the local file system, and the final output is written back to the distributed file system (GFS).

Also note the difference between a Map/Reduce job and the map/reduce function: a Map job processes one input split and may call the map function many times, once per input key-value pair; a Reduce job processes the intermediate key-value pairs of one partition, calls the reduce function once for each distinct key, and ultimately corresponds to one output file.

1.3 Input and output
The Map/Reduce framework operates on <key, value> pairs. That is, the framework treats the input of a job as a set of <key, value> pairs and produces a set of <key, value> pairs as the output of the job; the types of the two sets may differ.

The framework needs to serialize the key and value classes, so these classes must implement the Writable interface. In addition, to let the framework perform sort operations, the key classes must implement the WritableComparable interface.

The input and output types for a Map/Reduce job are as follows:
(input) <k1, v1> -> map -> <k2, v2>-> combine -> <k2, v2> -> reduce -> <k3, v3> (output)
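This flow can be simulated in plain Java for a word-count job (a conceptual sketch, not the Hadoop API; the class and method names are invented): map turns each input line (k1 = offset, v1 = line) into <word, 1> pairs (k2, v2), and the grouped, sorted reduce step sums the counts into <word, total> (k3, v3).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class WordCountFlow {
    // map: one input line -> list of intermediate (word, "1") pairs.
    static List<String[]> map(String line) {
        List<String[]> out = new ArrayList<>();
        for (String word : line.split("\\s+")) {
            if (!word.isEmpty()) out.add(new String[] {word, "1"});
        }
        return out;
    }

    // Shuffle + reduce: group pairs by key (TreeMap keeps keys sorted,
    // as the framework does) and sum the counts for each key.
    static TreeMap<String, Integer> shuffleAndReduce(List<String[]> pairs) {
        TreeMap<String, Integer> grouped = new TreeMap<>();
        for (String[] kv : pairs) {
            grouped.merge(kv[0], Integer.parseInt(kv[1]), Integer::sum);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> intermediate = new ArrayList<>();
        for (String line : new String[] {"Hello tom", "Hello jerry"}) {
            intermediate.addAll(map(line)); // <k1,v1> -> map -> <k2,v2>
        }
        System.out.println(shuffleAndReduce(intermediate)); // {Hello=2, jerry=1, tom=1}
    }
}
```

The optional combine step in the pipeline above runs the same summing logic on each Map worker's local output before the shuffle, purely to cut the amount of intermediate data transferred.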

1.4 Writable interface
The Writable interface defines Hadoop's serialization protocol; objects that implement it can be serialized.

Defining a structured object in Hadoop requires implementing the Writable interface, so that the structured object can be serialized into a byte stream and the byte stream can be deserialized back into the structured object.

Writable wrappers for the basic Java types:

Java type            Writable implementation    Serialized size (bytes)
boolean              BooleanWritable            1
byte                 ByteWritable               1
int                  IntWritable                4
int (variable)       VIntWritable               1-5
float                FloatWritable              4
long                 LongWritable               8
long (variable)      VLongWritable              1-9
double               DoubleWritable             8
String               Text                       variable (UTF-8)
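Hadoop's Writable declares two methods, write(DataOutput) and readFields(DataInput). As a self-contained sketch of the round trip (defining a look-alike interface locally rather than importing Hadoop; the WordCountRecord class is a hypothetical example type):

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

public class WritableSketch {
    // Local stand-in with the same shape as Hadoop's org.apache.hadoop.io.Writable.
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    // A structured object: a (word, count) record.
    static class WordCountRecord implements Writable {
        String word = "";
        long count;

        public void write(DataOutput out) throws IOException {
            out.writeUTF(word);   // length-prefixed UTF-8 string
            out.writeLong(count); // fixed 8 bytes, like LongWritable
        }

        public void readFields(DataInput in) throws IOException {
            word = in.readUTF();
            count = in.readLong();
        }
    }

    // Serialize a record to a byte stream, deserialize into a fresh object,
    // and report the recovered fields.
    static String roundTrip(String word, long count) {
        try {
            WordCountRecord rec = new WordCountRecord();
            rec.word = word;
            rec.count = count;
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            rec.write(new DataOutputStream(bytes));

            WordCountRecord copy = new WordCountRecord();
            copy.readFields(new DataInputStream(
                    new ByteArrayInputStream(bytes.toByteArray())));
            return copy.word + "=" + copy.count;
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(roundTrip("Hello", 5)); // Hello=5
    }
}
```

The fixed-width binary layout (for example, always 8 bytes for a long) is what makes Writable serialization compact and cheap to parse compared with Java's built-in Serializable.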

2 MapReduce programming in practice

2.1 Preparing data

1. In the /home directory, create a new words.txt file with the following contents:
Hello tom
Hello jerry
Hello kitty
Hello world
Hello tom
2. Upload it to the /hadoop directory on the HDFS file server:
Execute the command: hadoop fs -put /home/words.txt /hadoop/words.txt
Execute the command: hadoop fs -cat /hadoop/words.txt

2.3 Generating the JAR package
1. Select the hdfs project -> right-click menu -> Export..., and in the dialog that pops up choose JAR file under Java:

Set the JAR name and export path, then click Next >:

Set the entry point of the program. When the settings are complete, click Finish:

This generates the wc.jar file, as shown below:

2.4 Executing the JAR package
1. Open Xftp and upload wc.jar from the D: drive to the Linux /home directory:

2. Execute the commands:
Switch directory command: cd /home/
Execute the JAR package command: hadoop jar wc.jar
3. View the execution results:
Execute the command: hadoop fs -ls /hadoop/wordsResult

Execute the command: hadoop fs -cat /hadoop/wordsResult/part-r-00000
