MapReduce Programming (Introductory Article)

One. MapReduce programming model

A classic diagram illustrates the overall flow.

1. First of all, we have an input, and it holds a large amount of data.

2. After the split, the input becomes a number of shards, and each shard is handed to a map for processing.

3. When a map has finished, the TaskTracker copies and sorts its output, partitions it by output key, and merges the same partition from the different map outputs into the input of a single reduce.

4. The reducer processes its input and outputs the result. Records with the same key must all be processed by one reduce, and each reduce produces at least one output (MultipleOutputFormat can be extended to obtain multiple outputs).

5. Take a look at an example, as shown in the following diagram (from "Hadoop: The Definitive Guide"):

5.1 The input data may be a pile of text.
5.2 The mapper parses each row of data and extracts the valid fields as output. The example here extracts the temperature readings for each year from a log file, in order to finally compute the maximum temperature per year.
5.3 The output of the map is a sequence of key-value pairs.
5.4 Through the shuffle, the values corresponding to the same key are combined into an Iterator, which forms the input of the reduce.
5.5 The task of the reduce is to extract the highest temperature for each year and then output it.

Two. Mapper

1. A Mapper can optionally inherit from the base class MapReduceBase; that class just implements some of the required methods with empty bodies.
2. A Mapper must implement the Mapper interface (old API, up to version 0.20), a generic interface that takes the input and output key and value types, which are usually implementation classes of the Writable interface.
3. It must implement the map method. The method has four parameters: the first two are the input key and value; the third is an OutputCollector, used to collect the output; the fourth is a Reporter, used to report status, which can be used for debugging.
3.1 By default the input is one record per line, and each line's record is placed in the value.
3.2 Each call collects one key-value record; one key can correspond to multiple values, which appear in the reduce as an Iterator.
4. The configure method can be overridden to obtain the JobConf instance passed in when the job is launched; it can be used to interact with external resources.
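To make this concrete, here is a minimal sketch of such a mapper against the old (org.apache.hadoop.mapred) API, in the spirit of the book's max-temperature example. The class name and the tab-separated "year, temperature" record layout are assumptions for illustration, not the book's actual parsing code.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class MaxTemperatureMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      // input: one line per call (key = byte offset of the line, value = the line)
      public void map(LongWritable key, Text value,
                      OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        // assumed layout: "year<TAB>temperature"; real log data needs real parsing
        String[] fields = value.toString().split("\t");
        try {
          String year = fields[0];
          int temperature = Integer.parseInt(fields[1].trim());
          output.collect(new Text(year), new IntWritable(temperature));
        } catch (Exception e) {
          // count dirty records instead of failing the task
          reporter.incrCounter("MaxTemperature", "malformed", 1);
        }
      }
    }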
Three. Reducer

1. A Reducer can also optionally inherit from the base class MapReduceBase, which serves the same purpose as it does for a mapper.
2. A Reducer must implement the Reducer interface, which is also a generic interface whose type parameters have a meaning similar to those of Mapper.
3. It must implement the reduce method, which also has four parameters: the first is the input key; the second is an Iterator over the input values, which can be traversed like a list; the OutputCollector works the same as in map, collecting one key-value record per call; and the Reporter also works the same as in map.
4. In the new API (above 0.19.x), Hadoop has merged the last two parameters into a single context object, while of course remaining compatible with the old interface.
5. The configure method can be overridden, and works the same as in map.
6. The close method can be overridden to do some processing (clean-up) after the reduce has completed.
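A matching sketch of the reducer on the old API, showing the iterator traversal from point 3; MaxTemperatureReducer is an assumed name.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class MaxTemperatureReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      // all values for one year arrive as an Iterator; pick the maximum
      public void reduce(Text key, Iterator<IntWritable> values,
                         OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int maxValue = Integer.MIN_VALUE;
        while (values.hasNext()) {
          maxValue = Math.max(maxValue, values.next().get());
        }
        output.collect(key, new IntWritable(maxValue));
      }
    }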
Four. Combiner

1. The function of a combiner is to pre-aggregate the output of the map, producing initial merged results and reducing the computational pressure on the reduce.
2. A combiner is written the same way as a reduce: it is a Reducer implementation class.
3. When the reduce function f satisfies f(a, b) = f(f(a), f(b)), the combiner can be the same class as the reducer. For example, SUM(a,b,c,d,e,f,g) = SUM(SUM(a,b), SUM(c,d,e,f), SUM(g)), and likewise MAX, MIN, and so on. The MaxTemperatureReducer sketch above can therefore be registered as a combiner as well.
4. Writing a correct combiner can optimize the performance of the whole MapReduce program (especially when the reduce is the performance bottleneck).
5. A combiner may also be a different class from the reducer.

Five. Configuration

1. The value of a property added later overrides an earlier value of the property with the same name.
2. Properties defined as final (with the <final>true</final> tag in the property definition) are not overwritten by later definitions of the same name.
3. System properties have higher precedence than properties defined in resource files; that is, values of properties defined in a resource file are overridden by System.setProperty().
4. A system property must have a corresponding definition in a resource file in order to take effect.
5. Properties defined with the -D option have higher precedence than properties defined in resource files.

Six. Run Jobs

1. Set the inputs and output.
1.1 First check that the input exists (a missing input causes an error, so it is best to check it in the program).
1.2 Check whether the output already exists (an existing output also causes an error).
1.3 Develop the good habit: check first, then execute. (A driver sketch appears in the worked example of section Eight below.)
2. Set the mapper, reducer and combiner, using the Class object of each implementation class: Xxxx.class.
3. Set the InputFormat, OutputFormat and types.
3.1 There are two common input and output formats: one is text file and the other is sequence file. Simply understood, a text file organizes the data as text, while a sequence file organizes it in a binary form.
3.2 The type settings, based on the input and output data types, take the Class objects of the implementation classes of the various Writable interfaces.
4. Set the reduce count.
4.1 Set the reduce count to 0 when your data does not need a reduce phase.
4.2 The number of reduces is best set slightly below the number of currently available reduce slots, so that all reduces can run in one wave. (A slot can be understood as a compute unit, i.e. a resource.)

Seven. Some other details

1. ChainMapper can execute mappers in a chain; it is itself a Mapper implementation class and provides an addMapper method.
2. ChainReducer, like ChainMapper, can execute reducers in a chain; it is a Reducer implementation class.
3. To run multiple jobs one after another, call JobClient.runJob for each job in sequence.
4. By extending MultipleOutputFormat, one reduce can produce multiple outputs (and you can specify the file names, too).
5. The Partitioner interface is used to partition the output of the map; the data in the partition corresponding to the same key will be handed to the same reducer. It provides one interface method:
5.1 public int getPartition(K2 key, V2 value, int numReduceTasks)
5.2 You can define it yourself, partitioning by some specific aspect of the key, or by some characteristic of the value.
5.3 numReduceTasks is the configured number of reduces. The partition value returned should be kept smaller than this value, typically by taking a modulo (%).
6. The role of the Reporter:
6.1 reporter.incrCounter(key, amount): for example, when a calculation encounters some substandard dirty data, we can use a counter to record how much of it there was.
6.2 reporter.setStatus(status): this method can set a status message when we find that the job has hit some expected (wrong or correct) condition, for debugging.
6.3 reporter.progress(): reports the current running progress to the MapReduce framework. This progress can play the role of a heartbeat: if a task goes more than 10 minutes without reporting to the MapReduce framework, it will be killed. When your tasks take a long time, it is best to report status to MapReduce regularly.
7. By implementing the Writable interface, we can customize the key and value types and use them like POJOs, without having to parse them every time. If your custom type is used as a key, it must also implement the Comparable interface, for sorting. MapWritable is one example.
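A minimal sketch of a custom partitioner matching the getPartition signature in 5.1, assuming Text keys and IntWritable values; the hash-and-modulo scheme shown is just the common default choice, not the only one.

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class YearPartitioner implements Partitioner<Text, IntWritable> {

      public void configure(JobConf job) {
        // no per-job setup needed for this simple scheme
      }

      public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // the returned value must lie in [0, numReduceTasks), hence the modulo;
        // masking keeps the hash non-negative
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
      }
    }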

Eight. Actual combat: a simple example

1. Requirement: count a site's PV (page views) per day.

2. Data input: log data partitioned by day, with one log line representing one PV.

3. Data output: one line per day, in the form "date PV".

4. Writing the mapper

The mapper's job is simple: split each log line, take out the date, and collect one PV record for that date, with a value of one (1, since one record represents one PV), as sketched below.
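A minimal sketch on the old API; SitePvMapper is an assumed name, and the assumption that the log line begins with a yyyy-MM-dd date is purely for illustration.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class SitePvMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

      private static final IntWritable ONE = new IntWritable(1);

      public void map(LongWritable offset, Text line,
                      OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        // assumption: each log line starts with a "yyyy-MM-dd" date
        String date = line.toString().substring(0, 10);
        output.collect(new Text(date), ONE); // one log line = one PV
      }
    }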

5. Writing the reducer

The task of the reduce is to sum the log records for the same day (the same key), and then output the total keyed by day, as sketched below.
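A matching reducer sketch (SitePvReducer is an assumed name). Because a sum can be computed piecewise, the same class could also be registered as the combiner.

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class SitePvReducer extends MapReduceBase
        implements Reducer<Text, IntWritable, Text, IntWritable> {

      public void reduce(Text date, Iterator<IntWritable> ones,
                         OutputCollector<Text, IntWritable> output, Reporter reporter)
          throws IOException {
        int pv = 0;
        while (ones.hasNext()) {
          pv += ones.next().get(); // sum the 1s collected for this day
        }
        output.collect(date, new IntWritable(pv)); // output line: date <TAB> pv
      }
    }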

6. Setting up the environment and specifying the job (run)

6.1 Set the input path.

6.2 Set the output path.

6.3 Set the mapper/reducer classes, and the data formats and types for the input and output (a driver sketch covering 6.1-6.3 follows).
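A minimal driver sketch tying 6.1-6.3 together on the old API, reusing the SitePvMapper/SitePvReducer sketches above. The class name matches the command in 6.4, but everything inside it here is an assumption about how such a job would be wired up, not the author's actual code.

    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.TextInputFormat;
    import org.apache.hadoop.mapred.TextOutputFormat;

    public class SitePVSumSampleJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(SitePVSumSampleJob.class);
        conf.setJobName("site-pv-sum");

        Path in = new Path(args[0]);
        Path out = new Path(args[1]);

        // good habit (1.1-1.3): check first, then execute
        FileSystem fs = FileSystem.get(conf);
        if (!fs.exists(in) || fs.exists(out)) {
          System.err.println("input missing or output already exists");
          System.exit(1);
        }

        // 6.1 / 6.2: input and output paths
        FileInputFormat.setInputPaths(conf, in);
        FileOutputFormat.setOutputPath(conf, out);

        // 6.3: classes, formats and types (text in, text out)
        conf.setMapperClass(SitePvMapper.class);
        conf.setReducerClass(SitePvReducer.class);
        conf.setCombinerClass(SitePvReducer.class); // sum is safe to pre-aggregate
        conf.setInputFormat(TextInputFormat.class);
        conf.setOutputFormat(TextOutputFormat.class);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(IntWritable.class);

        // one wave of reduces: slightly below the 30 available slots (see 6.5)
        conf.setNumReduceTasks(29);

        JobClient.runJob(conf);
      }
    }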

6.4 Execute the command:

hadoop jar site-pv-job.jar org.jiacheo.SitePVSumSampleJob

6.5 View the current job progress in the Hadoop web UI.

As you can see, this input produces 14,292 maps and 29 reduces. The number of reduces is this small because my number of reduce slots is only 30; I set it to 29 so that even if one slot is lost, all the reduces can still finish in a single wave.

6.6 Calculation results.

The upper part shows the progress displayed by the Hadoop CLI client; the middle shows the input and output statistics displayed by the web tool. It can be seen that the input data totals 1.6 TB in size, with 6.96 billion records in total; that is, this data records 6.96 billion PVs for the site. The lower left corner shows that the execution time is rather long: 18 minutes 46 seconds. The reason it is this slow is not the reduce, but that my map slots are too few: only 300, against more than 10,000 maps in total, so it takes dozens of waves to finish the map phase; the bottleneck is therefore in the map. The lower right corner shows the resulting statistics; it can be seen that the site's overall PV shows an upward trend.

At this point, a simple map/reduce program has been written and run.
