Hadoop Developer Foundation Course: A First Look at MapReduce

Source: Internet
Author: User
Tags: hadoop, mapreduce

With Hadoop's rapid rise in China, MapReduce, the core of Hadoop, has gradually attracted developers' attention. Let's see how it works.

I. What is MapReduce?

MapReduce is a programming model for parallel processing of large data sets (larger than 1 TB). Its main ideas, "Map" and "Reduce", are borrowed from functional programming languages, along with features borrowed from vector programming languages. It makes it far more convenient for programmers to run their programs on a distributed system without knowing anything about distributed parallel programming. In current software implementations, the user specifies a map function, which maps a set of key-value pairs to a set of new intermediate key-value pairs, and a concurrent reduce function, which merges all intermediate values associated with the same intermediate key.

That is all rather abstract, so let's first look at where MapReduce sits in the Hadoop project.

[Figure: MapReduce in the Hadoop project (http://image.evget.com/images/article/2015/Mapreduce01.jpg)]

Hadoop is essentially an open-source implementation of Google's "three treasures": Hadoop MapReduce corresponds to Google MapReduce, HBase to Bigtable, and HDFS to GFS. HDFS (or GFS) provides efficient unstructured storage services to the layers above it; HBase (or Bigtable) is a distributed database that provides structured data services; and Hadoop MapReduce (or Google MapReduce) is a programming model for parallel computing, also used for job scheduling.

In a nutshell, MapReduce is a framework for splitting a large job into multiple small jobs (the big job and the small jobs should be essentially the same, differing only in scale). What the user needs to do is decide how many pieces to split into and define the job itself.

II. The map function and the reduce function

The map function and the reduce function are left for the user to implement; these two functions define the task itself.

Map function: accepts a key-value pair and produces a set of intermediate key-value pairs. The MapReduce framework passes all intermediate values associated with the same intermediate key produced by the map functions to one reduce function.

Reduce function: accepts a key together with a related set of values, and merges that set of values to produce a smaller set of values (usually zero or one value).

The core code of a MapReduce program that counts word frequencies is very brief; it mainly implements these two functions:

map(String key, String value):
    // key: document name
    // value: document contents
    for each word w in value:
        EmitIntermediate(w, "1");

reduce(String key, Iterator values):
    // key: a word
    // values: a list of counts
    int result = 0;
    for each v in values:
        result += ParseInt(v);
    Emit(AsString(result));

In the word-frequency example, the key the map function accepts is the document name and the value is the document's contents. Map scans the contents once; each time it encounters a word w, it produces an intermediate key-value pair <w, "1">, meaning "we have found another occurrence of word w". MapReduce passes key-value pairs with the same key (here, the word w) to the same reduce function, so the key the reduce function accepts is the word w and the values are a list of "1"s (this is the basic implementation, and it can be optimized), whose count equals the number of key-value pairs with key w. Adding these "1"s up yields the number of occurrences of word w. Finally, these word counts are written to a user-defined location in the underlying distributed storage system (GFS or HDFS).
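For readers who want to see the same word count in real Hadoop code rather than pseudocode, below is a sketch modeled on the classic WordCount example from the Hadoop documentation (the Hadoop 2.x org.apache.hadoop.mapreduce API). Everything between the two classes, including splitting, shuffling, sorting, and grouping, is supplied by the framework.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: for each word w in the document contents, emit <w, 1>.
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);   // intermediate pair <w, 1>
          }
        }
      }

      // Reduce: sum the 1s associated with each word w.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);   // final pair <w, total count>
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // map-side pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

The combiner line is one concrete form of the optimization hinted at above: it sums the "1"s on the map side before they cross the network to the reduce workers.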

III. How MapReduce works

[Figure: MapReduce execution overview (http://image.evget.com/images/article/2015/Mapreduce02.jpg)]

The figure above is the flow chart given in the original MapReduce paper. Everything starts with the user program at the top: the user program links in the MapReduce library and implements the most basic map and reduce functions. The order of execution in the figure is marked with numbers.

1. The MapReduce library first divides the input files of the user program into M pieces (M is user-defined), typically 16 MB to 64 MB each (shown on the left as split 0 through split 4); it then uses fork to copy the user process onto other machines in the cluster. (In Hadoop, M usually follows from the input split size; see the sizing sketch after this list.)

2. One copy of the user program is called the master and the rest are called workers. The master is responsible for scheduling: it assigns jobs (map jobs or reduce jobs) to idle workers. The number of workers can also be specified by the user.

3. A worker assigned a map job begins to read the input data of its split. The number of map jobs is determined by M and corresponds one-to-one with the splits. The map job extracts key-value pairs from the input data and passes each one as arguments to the map function; the intermediate key-value pairs produced by the map function are cached in memory.

4. The cached intermediate key-value pairs are periodically written to local disk and divided into R regions (R is user-defined); each region will later correspond to one reduce job. The locations of these intermediate key-value pairs are reported to the master, which is responsible for forwarding that information to the reduce workers. (See the partitioning sketch after this list.)

5. The master tells a worker assigned a reduce job where the partitions it is responsible for are located (certainly more than one place: the intermediate key-value pairs produced by every map job may fall into all R different partitions). Once the reduce worker has read all of the intermediate key-value pairs it is responsible for, it sorts them so that key-value pairs with the same key are clustered together. Sorting is necessary because different keys may map to the same partition, that is, to the same reduce job (a consequence of having only a few partitions).

6. The reduce worker traverses the sorted intermediate key-value pairs and, for each unique key, passes the key and the associated set of values to the reduce function; the output produced by the reduce function is appended to the output file of this partition. (Steps 5 and 6 are illustrated in the sort-and-scan sketch after this list.)

7. When all map and reduce jobs are complete, the master wakes up the genuine user program, and the MapReduce function call returns to the user program's code.
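A sketch for step 1: in Hadoop, M is usually not set directly; it falls out of how FileInputFormat carves the input into splits, which the user can influence through split-size bounds. The values below simply mirror the 16 MB to 64 MB range mentioned in step 1; mapper, reducer, and the rest of the job setup are omitted.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

    public class SplitSizing {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "split sizing sketch");
        // Bound each split to the 16 MB - 64 MB range; M is then roughly
        // ceil(totalInputBytes / splitSize), with one map job per split.
        FileInputFormat.setMinInputSplitSize(job, 16L * 1024 * 1024);
        FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
      }
    }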
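A sketch for step 4: assigning an intermediate key to one of the R regions is typically just a hash of the key modulo R. The toy method below mirrors what Hadoop's default HashPartitioner does; in Hadoop, the user fixes R with job.setNumReduceTasks(R).

    public class PartitionSketch {
      // Mask off the sign bit so the hash is non-negative, then bucket modulo R.
      static int regionFor(String key, int r) {
        return (key.hashCode() & Integer.MAX_VALUE) % r;
      }

      public static void main(String[] args) {
        int r = 4; // R regions, one per future reduce job
        for (String key : new String[] {"hello", "world", "mapreduce"}) {
          System.out.println(key + " -> region " + regionFor(key, r));
        }
      }
    }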
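And a toy, in-memory illustration (not Hadoop code) of steps 5 and 6: the reduce worker sorts the intermediate pairs it pulled so that equal keys become adjacent, then one linear scan feeds each distinct key and its run of values to a single reduce call.

    import java.util.ArrayList;
    import java.util.List;
    import java.util.Map;

    public class ReduceSideSketch {
      public static void main(String[] args) {
        // Intermediate pairs pulled from several map workers, in arrival order.
        List<Map.Entry<String, String>> pulled = new ArrayList<>(List.of(
            Map.entry("world", "1"), Map.entry("hello", "1"), Map.entry("hello", "1")));

        // Step 5: sort by key so pairs with the same key cluster together.
        pulled.sort(Map.Entry.comparingByKey());

        // Step 6: one pass; each distinct key gets exactly one "reduce" call.
        int i = 0;
        while (i < pulled.size()) {
          String key = pulled.get(i).getKey();
          int sum = 0;
          while (i < pulled.size() && pulled.get(i).getKey().equals(key)) {
            sum += Integer.parseInt(pulled.get(i).getValue()); // the reduce body
            i++;
          }
          System.out.println(key + "\t" + sum); // appended to this region's output
        }
      }
    }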

After everything has run, the MapReduce output sits in the R partitioned output files (one per reduce job). Users usually do not need to merge these R files; instead, they hand them as input to another MapReduce program. Throughout the process, the input data comes from the underlying distributed file system (GFS), the intermediate data is placed on the local file system, and the final output is written back to the underlying distributed file system (GFS). Also note the difference between a map/reduce job and the map/reduce function: a map job processes one input shard and may call the map function many times, once per input key-value pair; a reduce job processes the intermediate key-value pairs of one partition, calls the reduce function once for each distinct key, and ultimately corresponds to one output file.
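As a usage sketch of that last point: chaining is done by pointing the second job's input at the first job's output directory, whose R files are typically named part-r-00000 through part-r-0000(R-1) in Hadoop. Job setup details (mapper, reducer, types) are omitted here and the paths are illustrative.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class JobChain {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path input = new Path(args[0]);
        Path intermediate = new Path(args[1]); // first job's R output files land here
        Path output = new Path(args[2]);

        Job first = Job.getInstance(conf, "first pass");
        FileInputFormat.addInputPath(first, input);
        FileOutputFormat.setOutputPath(first, intermediate);
        if (!first.waitForCompletion(true)) System.exit(1);

        // No manual merge: the second job reads the whole directory of R files.
        Job second = Job.getInstance(conf, "second pass");
        FileInputFormat.addInputPath(second, intermediate);
        FileOutputFormat.setOutputPath(second, output);
        System.exit(second.waitForCompletion(true) ? 0 : 1);
      }
    }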
