Hadoop Series 4: MapReduce advanced


1. mapper and reducer

MapReduce processes data in two stages: a map stage and a reduce stage. The two stages are carried out by a user-written map function and reduce function, also called the mapper and the reducer.

The key-value pair is the basic data structure of MapReduce: the data read and emitted by both the mapper and the reducer are key-value pairs. Keys and values can be basic types such as integers, floating-point numbers, strings, or raw bytes, or complex types of any form. Programmers can define the types they need themselves, or do so conveniently with Protocol Buffers, Thrift, or Avro. Part of designing a MapReduce algorithm is deciding on the key-value structure for a given dataset. For example, when a search engine collects and stores web pages, the key can be the URL and the value the content of the page. In some algorithms the key carries no practical meaning and can be safely ignored during processing.

In MapReduce, the programmer defines the mapper and reducer with the following signatures, where [...] denotes a list:

map: (k1, v1) --> [(k2, v2)]
reduce: (k2, [v2]) --> [(k3, v3)]

The data handed to MapReduce for processing can be stored in a distributed file system. The map operation is applied to every input key-value pair and produces some number of intermediate key-value pairs; the reduce operation is then applied to these intermediate pairs and emits the final key-value pairs. Between the mapper and the reducer there is also an implicit "group" operation on the intermediate key-value pairs: pairs with the same key are placed in the same group and sent to the same reducer, and the pairs delivered to each reducer are sorted by key. The results produced by the reducers are saved to the distributed file system as one or more files whose names end with the reducer number (for example, part-r-00000), whereas the intermediate key-value pairs produced by the mappers are not saved.
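To make these signatures concrete, below is a minimal word-count-style sketch using Hadoop's Java API. The class names and the counting logic are illustrative assumptions, not something this article prescribes.

// A minimal sketch of the (k1, v1) --> [(k2, v2)] / (k2, [v2]) --> [(k3, v3)] contract
// using Hadoop's Java API. The word-count logic is illustrative only.
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCountSketch {

  // map: (k1 = byte offset, v1 = line of text) --> [(k2 = word, v2 = 1)]
  public static class TokenMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) {
          word.set(token);
          context.write(word, ONE);   // emit an intermediate key-value pair
        }
      }
    }
  }

  // reduce: (k2 = word, [v2 = counts]) --> [(k3 = word, v3 = total)]
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));   // emit a final key-value pair
    }
  }
}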

 


When processing big data, MapReduce divides the data to be processed into multiple splits and starts a map task running the user-written map function for each split. These map tasks are scheduled by the MapReduce runtime environment onto one or more nodes of the cluster. Each mapper may emit many key-value pairs, called intermediate key-value pairs, which are stored temporarily until all mappers have finished. MapReduce then divides the intermediate key-value pairs into one or more groups; the rule is that all pairs with the same key must be sorted and placed in the same group, and one group can contain one or more keys and their associated values. The runtime environment starts a reduce task for each group, and these reduce tasks are likewise scheduled onto one or more nodes of the cluster. The assignment of intermediate key-value pairs to groups is actually carried out by a dedicated component called the Partitioner, which is discussed in more detail later.
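As a concrete illustration of that grouping step, here is a hedged sketch of a custom Partitioner in Hadoop's Java API. Hadoop's default is a hash-based partitioner; the first-letter scheme below is purely illustrative and not part of the original article.

// Decides which reduce task receives a given intermediate key-value pair.
// Hadoop uses HashPartitioner by default; this first-letter variant is illustrative only.
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
  @Override
  public int getPartition(Text key, IntWritable value, int numReduceTasks) {
    if (numReduceTasks == 0 || key.toString().isEmpty()) {
      return 0;
    }
    // All keys assigned the same partition number go to the same reduce task,
    // where they arrive grouped and sorted by key.
    return (Character.toLowerCase(key.toString().charAt(0)) & Integer.MAX_VALUE)
        % numReduceTasks;
  }
}

A job would opt into such a partitioner with job.setPartitionerClass(FirstLetterPartitioner.class).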

[Figure] MapReduce data flow with a single reduce task (image source: Hadoop: The Definitive Guide, 3rd Edition)

[Figure] MapReduce data flow with multiple reduce tasks (image source: Hadoop: The Definitive Guide, 3rd Edition)

A mapper or reducer can operate directly on the data it receives. When external resources are used, however, multiple mapper or reducer instances may contend for them, which inevitably hurts performance, so programmers must be aware of such contention and handle it appropriately. Second, the intermediate key-value pairs a mapper emits may have different types from the key-value pairs it receives; likewise, the key-value pairs a reducer outputs may differ in type from the intermediate pairs it receives. This can make programming and troubleshooting harder, but it is also one of the features that make MapReduce powerful.
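Because the intermediate types and the final output types are declared separately, a Hadoop job driver states both. The sketch below assumes the illustrative TokenMapper and SumReducer classes from the earlier example; here the two sets of types happen to coincide, but they are declared independently and could differ.

// Sketch of a driver declaring intermediate (map output) and final (reduce output) types.
// The mapper/reducer class names refer to the illustrative WordCountSketch above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class DriverSketch {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count sketch");
    job.setJarByClass(DriverSketch.class);

    job.setMapperClass(WordCountSketch.TokenMapper.class);
    job.setReducerClass(WordCountSketch.SumReducer.class);

    // Intermediate (mapper output) types; they may differ from the final types.
    job.setMapOutputKeyClass(Text.class);
    job.setMapOutputValueClass(IntWritable.class);

    // Final (reducer output) types.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}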

Besides the conventional two-stage MapReduce processing flow, there are some variations. For example, a job can have no reducer at all, in which case each mapper's output is written directly to disk as its own file; jobs with a reducer but no mapper, however, are not allowed. And even when no reducer is needed for any specific processing, a reducer can still be used to regroup and sort the mapper output, so that the job runs as a complete MapReduce job and produces its output in a different form.
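A hedged sketch of how such a reducer-less job is configured in the Java API: setting the number of reduce tasks to zero removes the shuffle and sort, and the mapper output becomes the job's final output.

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyConfigSketch {
  // With zero reduce tasks, each map task writes its output straight to the
  // file system (one output file per mapper) and no shuffle or sort takes place.
  public static void makeMapOnly(Job job) {
    job.setNumReduceTasks(0);
    // The mapper's output types are now the job's final output types.
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
  }
}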

[Figure] MapReduce jobs without reducers (image source: Hadoop: The Definitive Guide, 3rd Edition)

MapReduce jobs generally read and write their data through HDFS, but they can also use other data sources or data sinks that suit the application. For example, Google's MapReduce can read from and write to Bigtable, a non-relational database: a sparse, distributed, persistent, multi-dimensional sorted map designed to handle petabytes of data reliably across thousands of machines. HBase, a similar implementation in the Hadoop ecosystem, can serve as the data source and data sink for MapReduce jobs. These topics are covered in detail later.
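As a sketch of the HBase case, the snippet below wires an HBase table in as the input of a MapReduce job. The table name "webpages" and the row-counting mapper are hypothetical, chosen only to show the shape of the API.

// Sketch: an HBase table as the input source of a MapReduce job.
// The table name and the row-counting logic are illustrative assumptions.
import java.io.IOException;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.hbase.mapreduce.TableMapper;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;

public class HBaseSourceSketch {

  // A TableMapper receives (row key, Result) pairs from HBase instead of file splits.
  public static class RowCountMapper extends TableMapper<Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(ImmutableBytesWritable row, Result columns, Context context)
        throws IOException, InterruptedException {
      context.write(new Text(row.get()), ONE);  // emit the row key with a count of 1
    }
  }

  // Wires the HBase table in as the job's input; output could go to HDFS or back to HBase.
  public static void configure(Job job) throws IOException {
    Scan scan = new Scan();  // full-table scan here; real jobs usually restrict columns
    TableMapReduceUtil.initTableMapperJob(
        "webpages",              // hypothetical source table
        scan,
        RowCountMapper.class,
        Text.class,              // mapper output key type
        IntWritable.class,       // mapper output value type
        job);
  }
}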


References:
Data-Intensive Text Processing with MapReduce
Hadoop: The Definitive Guide, 3rd Edition
Apache Hadoop Documentation

This article is from the "Marco Education" blog. For more information, contact the author!
