Hadoop Practice: Hadoop MapReduce Advanced Programming

Part 1: Important Components

Combiner

• What is a Combiner?
  • The combine function merges the <key, value> pairs produced by a map function (many values under the same key) into a new <key2, value2> pair, and that new pair becomes the input to the reduce function. Its signature is the same as that of the reduce function.
  • This effectively reduces the volume of intermediate results and lowers the network transfer load.
• Under what conditions can a Combiner be used?
  • In scenarios where records can be aggregated and summarized, such as summation.
  • It cannot be used where partial aggregation distorts the result, such as computing an average.
• When does the Combiner run?
  • The combine function may run before or after the merge completes; this is controlled by the parameter min.num.spills.for.combine (default 3).
  • When a combiner is set on the job and there are at least three spill files, the combine function runs before the merge generates the result file.
  • This way, when the spill phase requires many merges and a large amount of data needs combining, less data is written to disk files and the disk read/write frequency drops, which can optimize the job.
  • The Combiner may not be executed at all; the framework takes the cluster load at the time into account.
• How to use a Combiner
  • Code example (old mapred API): extend MapReduceBase and implement Reducer:

    // requires org.apache.hadoop.mapred.* and org.apache.hadoop.io.Text
    public static class Combiner extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {
        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            // aggregation logic goes here
        }
    }

  • Register it when configuring the job: conf.setCombinerClass(Combiner.class)

Partitioner

• What is a Partitioner?
  • MapReduce partitions keys through the Partitioner so that data can be distributed to reduce tasks as needed.
• When to use a Partitioner?
  • When keys must be distributed in a particular way. For example, if a data file contains a province field and each province must produce its own output file.
• The framework default is HashPartitioner:

    public class HashPartitioner<K, V> extends Partitioner<K, V> {

        /** Use {@link Object#hashCode()} to partition. */
        public int getPartition(K key, V value, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }
    }

• How to use a Partitioner
  • Implement the Partitioner interface and override the getPartition() method.
  • Register it when configuring the job: conf.setPartitionerClass(MyPartitioner.class);
  • Partitioner skeleton (old mapred API):

    // requires org.apache.hadoop.mapred.* and org.apache.hadoop.io.Text
    public static class MyPartitioner implements Partitioner<Text, Text> {
        public void configure(JobConf job) { }
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            // custom distribution logic goes here
            return 0;
        }
    }

Partitioner requirement example

• Requirement description
  • The data files contain a province field.
  • Records for the same province must be sent to the same reduce task, so that each province generates its own output file.
• Data sample
  • 1 Liaoning, where the first field is a record number and the second is the province (municipalities directly under the central government count as provinces here).
• Steps
  • Implement a Partitioner and override getPartition(), splitting on the province field (see the sketch below).
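A minimal sketch of such a province partitioner, assuming the map function emits the province name as the key; ProvincePartitioner and its fixed province table are illustrative names and values, not from the original text:

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Hypothetical partitioner: records with the same province key land in the
    // same reduce task, so each province yields its own output file.
    public class ProvincePartitioner implements Partitioner<Text, Text> {

        private final Map<String, Integer> provinces = new HashMap<String, Integer>();

        @Override
        public void configure(JobConf job) {
            // Illustrative fixed table; a real job might load this from configuration.
            provinces.put("Liaoning", 0);
            provinces.put("Hubei", 1);
        }

        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            Integer partition = provinces.get(key.toString());
            if (partition != null && partition < numPartitions) {
                return partition;
            }
            // Unknown provinces fall back to the default hash scheme.
            return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

Setting the number of reduce tasks to the number of provinces then gives one output file per province.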

RecordReader

• What is a RecordReader?
  • It reads the <key, value> pairs within a split; the framework calls this class every time it reads a record.
  • It mainly processes the data after InputFormat has divided the input into splits.
• When to use a custom RecordReader?
  • When the input data needs special handling. For example, when the map input key should be the file path or name rather than the file offset.
• The system default is LineRecordReader.
  • It uses the offset of each line as the key passed to map and the content of the line as the value; the default delimiter is carriage return/line feed.

RecordReader requirement example

• Requirement
  • Change the <key, value> pair that map receives: the key becomes the path (or name) of the file, and the value becomes the content of the file.
• Steps
  • Rewrite InputFormat so that files are not split.
  • Rewrite RecordReader.
  • Use the custom components when configuring the job (see the sketch below).
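A minimal sketch of these steps, using the old mapred API that the rest of this article uses; WholeFileInputFormat and WholeFileRecordReader are illustrative names, and the whole file content is delivered as a Text value:

    import java.io.IOException;

    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileSplit;
    import org.apache.hadoop.mapred.InputSplit;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RecordReader;
    import org.apache.hadoop.mapred.Reporter;

    // Hypothetical InputFormat that treats each file as one unsplit record.
    public class WholeFileInputFormat extends FileInputFormat<Text, Text> {

        @Override
        protected boolean isSplitable(FileSystem fs, Path filename) {
            return false; // never split, so one map task sees the whole file
        }

        @Override
        public RecordReader<Text, Text> getRecordReader(InputSplit split,
                JobConf job, Reporter reporter) throws IOException {
            return new WholeFileRecordReader((FileSplit) split, job);
        }
    }

    class WholeFileRecordReader implements RecordReader<Text, Text> {

        private final FileSplit split;
        private final JobConf job;
        private boolean processed = false;

        WholeFileRecordReader(FileSplit split, JobConf job) {
            this.split = split;
            this.job = job;
        }

        @Override
        public boolean next(Text key, Text value) throws IOException {
            if (processed) {
                return false; // each file yields exactly one record
            }
            byte[] contents = new byte[(int) split.getLength()];
            Path file = split.getPath();
            FileSystem fs = file.getFileSystem(job);
            FSDataInputStream in = fs.open(file);
            try {
                IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            key.set(file.toString());                  // key = file path
            value.set(contents, 0, contents.length);   // value = file content
            processed = true;
            return true;
        }

        @Override
        public Text createKey() { return new Text(); }

        @Override
        public Text createValue() { return new Text(); }

        @Override
        public long getPos() { return processed ? split.getLength() : 0; }

        @Override
        public void close() { }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }
    }

The job would then be configured with conf.setInputFormat(WholeFileInputFormat.class).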

Part 2: Join Case Analysis

• The input is two files.
• File 1 content
  • Space-delimited fields: user name, mobile phone number, age
  • Example record: Tom 1314567890 14
• File 2 content
  • Space-delimited fields: mobile phone number, city
  • Example record: 13124567890 hubei
• The summary information to produce: user name, mobile phone number, age, city

Map-side Join

• Design concept
  • Use DistributedCache.addCacheFile() to add a (small) file to the local cache of every map task.
  • Read that file in the Map function and perform the join there (see the sketch below).
  • Output the joined result to reduce.
• Note
  • DistributedCache must be set up before the Job is created.
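A minimal sketch of the map side, assuming the small phone-to-city file was added in the driver with DistributedCache.addCacheFile(); the cache file path and field layout are illustrative:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.filecache.DistributedCache;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class MapJoinMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, Text> {

        private final Map<String, String> phoneToCity = new HashMap<String, String>();

        @Override
        public void configure(JobConf job) {
            try {
                // The driver added the file before job submission, e.g.:
                // DistributedCache.addCacheFile(new URI("/data/phone_city.txt"), conf);
                Path[] cacheFiles = DistributedCache.getLocalCacheFiles(job);
                BufferedReader reader = new BufferedReader(
                        new FileReader(cacheFiles[0].toString()));
                String line;
                while ((line = reader.readLine()) != null) {
                    String[] fields = line.split(" "); // phone city
                    phoneToCity.put(fields[0], fields[1]);
                }
                reader.close();
            } catch (IOException e) {
                throw new RuntimeException(e);
            }
        }

        @Override
        public void map(LongWritable key, Text value,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            String[] fields = value.toString().split(" "); // name phone age
            String city = phoneToCity.get(fields[1]);      // join on phone number
            if (city != null) {
                output.collect(new Text(fields[0]),
                        new Text(fields[1] + " " + fields[2] + " " + city));
            }
        }
    }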

Reduce-side Join

• Design concept
  • The map side reads all the files and adds an identifier to each output record to mark which file the data came from.
  • In reduce, records are separated according to that identifier (see the sketch below).
  • The join is performed on the key and the result is output directly.
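A minimal sketch of the reduce side, assuming the mappers emitted the phone number as the key and tagged values such as "U Tom 14" (user file) or "C hubei" (city file); the "U "/"C " tags are illustrative:

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.Iterator;
    import java.util.List;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reducer;
    import org.apache.hadoop.mapred.Reporter;

    public class JoinReducer extends MapReduceBase
            implements Reducer<Text, Text, Text, Text> {

        public void reduce(Text key, Iterator<Text> values,
                OutputCollector<Text, Text> output, Reporter reporter)
                throws IOException {
            List<String> users = new ArrayList<String>();
            String city = "";
            // Separate the records by the identifier added on the map side.
            while (values.hasNext()) {
                String v = values.next().toString();
                if (v.startsWith("U ")) {
                    users.add(v.substring(2));   // "name age"
                } else if (v.startsWith("C ")) {
                    city = v.substring(2);       // "city"
                }
            }
            // Join on the key (the phone number) and output directly.
            for (String user : users) {
                String[] nameAge = user.split(" ");
                output.collect(new Text(nameAge[0]),
                        new Text(key.toString() + " " + nameAge[1] + " " + city));
            }
        }
    }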

Part 3: Sorting

Common sorting

• MapReduce has a built-in sort of map output keys.
• Text objects are not suitable for sorting numbers: if the content is integers, they are ordered by their encoding, not numerically.
• In general, consider using IntWritable as the key instead (see the sketch below).

Partial sorting

• The output files are not ordered relative to one another (each file is sorted internally).
• If global sorting is not required, this is a good choice.
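To illustrate the IntWritable point above, a minimal sketch assuming each input line holds one integer; the mapper parses it into an IntWritable key so the shuffle compares numerically:

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class SortMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, IntWritable, NullWritable> {
        public void map(LongWritable key, Text value,
                OutputCollector<IntWritable, NullWritable> output, Reporter reporter)
                throws IOException {
            // Parse the line as an integer so keys compare numerically, not lexically.
            output.collect(new IntWritable(Integer.parseInt(value.toString().trim())),
                    NullWritable.get());
        }
    }

With more than one reduce task, the result is exactly the partial sorting described above: each output file is sorted, but there is no order across files.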

Global sorting

• Background
  • The Hadoop platform does not provide global data sorting out of the box, yet globally sorting data is a very common requirement in large-scale data processing.
  • The most intuitive way to sort a large amount of data with Hadoop is to send all the input, without any processing, to a single reduce task, let Hadoop's own shuffle mechanism sort everything, and have that reduce output the data directly.
  • Recall the basic step of quicksort: select one element of the data as the pivot, then place the elements greater than the pivot on one side and the elements less than it on the other.

Imagine that we have N sampled split points (these can be called a "scale"). They divide all the data into N + 1 parts. If these N + 1 parts are each sent to their own Reduce task, Hadoop sorts each part automatically and finally outputs N + 1 internally ordered files, and concatenating these N + 1 files head to tail yields a single sorted file. We can therefore summarize the steps for sorting a large amount of data with Hadoop:

1) Sample the data to be sorted.
2) Sort the samples to generate the scale.
3) In Map, determine between which two scale points each input record falls, and send it to the Reduce with the corresponding interval ID.
4) Each Reduce outputs the data it receives directly.

• Hadoop provides the Sampler interface, which returns a set of samples:

    public interface Sampler<K, V> {
        K[] getSample(InputFormat<K, V> inf, Job job)
                throws IOException, InterruptedException;
    }

• Hadoop also provides TotalOrderPartitioner, with which we can achieve global sorting (see the driver sketch below).
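A minimal driver sketch of these four steps, assuming the newer mapreduce API (in old 0.20 releases the equivalents live in org.apache.hadoop.mapred.lib) and an input format whose keys are Text; the sampler parameters are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.KeyValueTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
    import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

    public class TotalSortDriver {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "total sort");
            job.setNumReduceTasks(10); // 10 intervals -> 10 internally sorted files
            job.setInputFormatClass(KeyValueTextInputFormat.class); // Text keys
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Steps 1-2: sample the input and write the sorted split points
            // (the "scale") to the partition file.
            InputSampler.Sampler<Text, Text> sampler =
                    new InputSampler.RandomSampler<Text, Text>(0.1, 10000, 10);
            InputSampler.writePartitionFile(job, sampler);

            // Step 3: each map output key is compared with the split points and
            // routed to the reduce that owns the matching interval.
            job.setPartitionerClass(TotalOrderPartitioner.class);

            // Step 4: the reduces write out their (already sorted) data;
            // concatenating the outputs end to end gives one sorted file.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }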

Secondary sorting

• Background
  • MapReduce sorts by key by default.
  • Secondary sorting additionally pre-sorts the values delivered to each reduce call.
• Implementation
  • Rewrite the partitioner so that partitioning uses only the natural key; this completes the first sort.
  • Implement a WritableComparator with your own comparison logic to complete the second sort on the key.
• Principle
  • Data: key1 1, key2 2, key2 3, key3 4, key1 2
  • MapReduce can only sort keys, so for a secondary sort we must define our own composite key. Simply put, the value is folded into the key; after combining, the records become <key1 1> 1, <key2 2> 2, <key2 3> 3, <key3 4> 4, <key1 2> 2.
  • Next, with a custom sorting class and a custom grouping class (sketched below), the data becomes <key1 1> 1, <key1 2> 2, <key2 2> 2, <key2 3> 3, <key3 4> 4.
  • Output result: key1 1 2; key2 2 3; key3 4.
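A minimal sketch of these helpers, assuming the map emits a composite Text key of the form "naturalKey value" (e.g. "key1 1") as in the example above; all class names are illustrative:

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    // Partition on the natural key only, so all composite keys that share it
    // reach the same reduce task (the first sort).
    class NaturalKeyPartitioner implements Partitioner<Text, Text> {
        public void configure(JobConf job) { }
        public int getPartition(Text key, Text value, int numPartitions) {
            String natural = key.toString().split(" ")[0];
            return (natural.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Sort comparator: order by natural key first, then numerically by the
    // value folded into the key (the second sort).
    class CompositeKeyComparator extends WritableComparator {
        protected CompositeKeyComparator() { super(Text.class, true); }
        public int compare(WritableComparable a, WritableComparable b) {
            String[] x = a.toString().split(" ");
            String[] y = b.toString().split(" ");
            int cmp = x[0].compareTo(y[0]);
            return cmp != 0 ? cmp
                    : Integer.valueOf(x[1]).compareTo(Integer.valueOf(y[1]));
        }
    }

    // Grouping comparator: group reduce input by the natural key alone, so one
    // reduce() call sees all values of, say, key1, already in sorted order.
    class NaturalKeyGroupingComparator extends WritableComparator {
        protected NaturalKeyGroupingComparator() { super(Text.class, true); }
        public int compare(WritableComparable a, WritableComparable b) {
            return a.toString().split(" ")[0].compareTo(b.toString().split(" ")[0]);
        }
    }

These would be registered in the driver with conf.setPartitionerClass(NaturalKeyPartitioner.class), conf.setOutputKeyComparatorClass(CompositeKeyComparator.class), and conf.setOutputValueGroupingComparator(NaturalKeyGroupingComparator.class).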

Part 4: Counters

• What are counters?
  • Counters collect system information and job-run information; they are used to determine whether a job succeeded or failed, and they are more convenient for analysis than logs.
• Built-in counters
  • Hadoop's built-in counters record facts about job execution. They cover the MapReduce framework, the file system, and the job itself.
  • Counters are maintained by their associated tasks and periodically sent to the tasktracker, and then on to the jobtracker, so they can be aggregated globally.
  • The built-in job counters are actually maintained by the jobtracker and do not need to be transferred over the network.
  • Only when a job has completed successfully are the counter values complete and reliable.

Custom Java counters

• The MapReduce framework allows users to define their own counters.
• Counters are global to the job.
• Counters are grouped; a group can be defined by a Java enum type.
• How to use them (see the sketch below)
  • Versions up to 0.20.2 use Reporter.
  • Later versions use context.getCounter(groupName, counterName) to get and set counters.
• Dynamic counters
  • A so-called dynamic counter is one that is not defined by a Java enum.
  • Reporter's method for incrementing a dynamic counter is public void incrCounter(String group, String counter, long amount), taking the group name, counter name, and the amount to add.
• Some principles
  • When creating a counter, try to give it a readable name.
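A minimal sketch of both styles with the old Reporter API described above; the enum and the three-field record check are hypothetical:

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class CountingMapper extends MapReduceBase
            implements Mapper<LongWritable, Text, Text, LongWritable> {

        // Hypothetical counter group defined by a Java enum.
        enum RecordQuality { GOOD, MALFORMED }

        public void map(LongWritable key, Text value,
                OutputCollector<Text, LongWritable> output, Reporter reporter)
                throws IOException {
            if (value.toString().split(" ").length == 3) {
                reporter.incrCounter(RecordQuality.GOOD, 1);        // enum counter
            } else {
                // Dynamic counter: group and name are plain strings.
                reporter.incrCounter("RecordQuality", "MALFORMED", 1);
            }
        }
    }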

• Obtaining counters
  • Web UI
  • Command line: hadoop job -counter
  • Java API: retrieve the counters after the job has finished and the values have stabilized, using job.getCounters() (see the sketch below).
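A minimal sketch of the Java API route, reusing the hypothetical enum from the previous sketch; job configuration details are elided:

    import org.apache.hadoop.mapred.Counters;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.RunningJob;

    public class CounterReport {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(CounterReport.class);
            // ... input/output/mapper configuration elided ...
            RunningJob job = JobClient.runJob(conf); // blocks until completion
            Counters counters = job.getCounters();   // stable once the job is done
            long bad = counters.getCounter(CountingMapper.RecordQuality.MALFORMED);
            System.out.println("Malformed records: " + bad);
        }
    }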

Part 5: Merging Small Files

• Background
  • Hadoop is not suitable for processing large numbers of small files: their metadata occupies a large amount of memory.
• Solution
  • Read the file contents into a SequenceFile (see the sketch below).
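A minimal sketch of the solution, as a hypothetical standalone tool that packs every file in a directory into one SequenceFile keyed by file name; SmallFilePacker and the argument layout are illustrative:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;

    public class SmallFilePacker {
        public static void main(String[] args) throws IOException {
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);
            Path dir = new Path(args[0]);  // directory of small files
            Path out = new Path(args[1]);  // output SequenceFile

            SequenceFile.Writer writer = SequenceFile.createWriter(
                    fs, conf, out, Text.class, Text.class);
            try {
                for (FileStatus status : fs.listStatus(dir)) {
                    // Read each small file fully into memory.
                    byte[] contents = new byte[(int) status.getLen()];
                    FSDataInputStream in = fs.open(status.getPath());
                    try {
                        IOUtils.readFully(in, contents, 0, contents.length);
                    } finally {
                        IOUtils.closeStream(in);
                    }
                    Text value = new Text();
                    value.set(contents, 0, contents.length);
                    // key = file name, value = file content.
                    writer.append(new Text(status.getPath().getName()), value);
                }
            } finally {
                IOUtils.closeStream(writer);
            }
        }
    }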
