MapReduce Advanced Features


Counter

Counters are often a more convenient way to gather job statistics than inspecting the cluster logs,
so in many cases counter information is a more efficient diagnostic tool than log analysis.

User-defined counters

Hadoop's built-in counters are described in the MapReduce features chapter (Chapter 9) of Hadoop: The Definitive Guide;
for reasons of space they are not covered here.

MapReduce allows users to define custom counters in a program, named either with Java enums or with strings.
A job can define any number of counters. When an enum is used,
the name of the enum type becomes the counter group name, and each field of the enum becomes a counter name.
Counters are global: they are aggregated across all mappers and reducers, and a final result is produced when the job completes.

For example, given the following enum type:

enum Temperature {
    MISSING,
    MALFORMAT
}

In a MapReduce program, you can use these counters like this:

context.getCounter(Temperature.MISSING).increment(1);
context.getCounter(Temperature.MALFORMAT).increment(1);
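
As a fuller illustration, here is a minimal sketch of a mapper that increments these counters; the record format ("stationId,temperature") and the parsing logic are hypothetical:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    enum Temperature { MISSING, MALFORMAT }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Hypothetical record format: "stationId,temperature"
        String[] fields = value.toString().split(",");
        if (fields.length < 2 || fields[1].isEmpty()) {
            // Count records with no temperature field
            context.getCounter(Temperature.MISSING).increment(1);
            return;
        }
        try {
            int temperature = Integer.parseInt(fields[1]);
            context.write(new Text(fields[0]), new IntWritable(temperature));
        } catch (NumberFormatException e) {
            // Count records whose temperature field cannot be parsed
            context.getCounter(Temperature.MALFORMAT).increment(1);
        }
    }
}
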
Dynamic counters

Since an enum fixes all of its fields at compile time, there are cases where we want to name counters at runtime, based on names not known in advance.
For this, dynamic counters can be used:

context.getCounter("计数器组名","计数器名").increment(1);

Here the group and counter names can be obtained in any way, for example from dynamically retrieved field values.
In most cases, however, enums are sufficient, and they are more readable, easier to use, and type-safe.
It is therefore recommended to use enums whenever possible.

Getting counter values in code

In addition to viewing a job's counters through the web UI or the command line (the -counter option), you can also retrieve counter values programmatically:

String jobId = args[0];
Cluster cluster = new Cluster(getConf());
Job job = cluster.getJob(JobID.forName(jobId));
if (job == null) {
    System.err.printf("No job with ID %s found.%n", jobId);
    return -1;
}
if (!job.isComplete()) {
    System.err.printf("Job %s is not complete.%n", jobId);
    return -1;
}
Counters counters = job.getCounters();
// key code: look up the counter values
long missing = counters.findCounter(Temperature.MISSING).getValue();
long total = counters.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
Partial sort

A partial sort orders the data within each partition during the map phase.

For background, see the earlier article on custom sorting and grouping when submitting a Hadoop job.
MapReduce controls the sort order with the following precedence:

1. If the mapreduce.job.output.key.comparator.class property is set, or the setSortComparatorClass() method is called on the Job, an instance of that class is used for the sort
2. Otherwise, the key must be a subclass of WritableComparable, and a registered comparator for that key type is used if one exists
3. Otherwise, a RawComparator deserializes the byte streams into objects and calls the WritableComparable's compareTo() method

When we derive a custom data type from WritableComparable and override its compareTo() method, that setting is the last to be applied.
If you register a concrete RawComparator/WritableComparator implementation for the key type, it takes precedence, because it can compare the serialized byte arrays directly without deserializing them.
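
As a minimal sketch of this precedence (the key type here is hypothetical), the following WritableComparable provides an object-level compareTo() as the fallback, while the registered WritableComparator compares the serialized bytes directly and is therefore preferred:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class TemperatureKey implements WritableComparable<TemperatureKey> {
    private int temperature;

    public void set(int temperature) { this.temperature = temperature; }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeInt(temperature);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        temperature = in.readInt();
    }

    // Rule 3: object-level comparison, used only if no raw comparator is registered
    @Override
    public int compareTo(TemperatureKey other) {
        return Integer.compare(temperature, other.temperature);
    }

    // Rule 2: a registered comparator that compares the serialized bytes directly
    public static class Comparator extends WritableComparator {
        public Comparator() { super(TemperatureKey.class); }

        @Override
        public int compare(byte[] b1, int s1, int l1, byte[] b2, int s2, int l2) {
            // The key was written with writeInt(), so compare the leading 4 bytes
            int t1 = readInt(b1, s1);
            int t2 = readInt(b2, s2);
            return Integer.compare(t1, t2);
        }
    }

    static {
        // Registering the comparator makes Hadoop prefer byte-level comparison
        WritableComparator.define(TemperatureKey.class, new Comparator());
    }
}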

Full sort

The shuffle phase of MapReduce sorts only within individual partitions; that is the partial sort discussed above.
Within each partition the data is ordered, but taken as a whole the output is unordered.
How, then, can MapReduce produce globally ordered data?
The simplest approach is to use a single partition, but that forfeits MapReduce's parallel computation and forces all the data to be processed on one machine.

In fact, besides using a single partition, there is another way to achieve global order while still taking full advantage of MapReduce's parallel computing power,
though it requires some extra work.

Recall that in a partial sort the data within each partition is ordered, but viewed across partitions it is not.
What if we could ensure that the partitions themselves are ordered? For example, partition 1 holds keys 1-100, partition 2 holds keys 101-200, and so on.
Then the partitions are ordered relative to one another, and the data inside each partition is naturally ordered,
so the data as a whole is globally ordered.

One issue needs attention in this process: how do we ensure that the data is distributed evenly across partitions?
In a real dataset, the range 1-100 might contain 1,000 records while 101-200 contains only 50, which creates data skew.

To solve this problem, we usually need a deep understanding of the composition of the data.
With massive datasets, however, it is impossible to inspect every record,
so we turn to sampling.

The core idea of sampling is to examine only a small number of keys, obtain an approximate distribution of the keys, and build the partition boundaries from it.

Several samplers are built into Hadoop; they implement the following interface:

public interface Sampler<K, V> {
    K[] getSample(InputFormat<K, V> inf, Job job)
        throws IOException, InterruptedException;
}

However, the getSample() method is not normally called directly; it is invoked by InputSampler's writePartitionFile() method,
whose purpose is to create a SequenceFile storing the keys that define the partition boundaries:

public static <K, V> void writePartitionFile(Job job, Sampler<K, V> sampler)
    throws IOException, ClassNotFoundException, InterruptedException

The SequenceFile is then used by TotalOrderPartitioner to create the partitions for the job:

// Set the partitioner to TotalOrderPartitioner
job.setPartitionerClass(TotalOrderPartitioner.class);

// Use a random sampler: sampling rate 0.1, at most 10000 samples, at most 10 splits sampled;
// sampling stops as soon as any one of these limits is reached
InputSampler.Sampler<IntWritable, Text> sampler =
    new InputSampler.RandomSampler<IntWritable, Text>(0.1, 10000, 10);

// Use the sampler to create the SequenceFile that defines the partition keys
InputSampler.writePartitionFile(job, sampler);

// Get the SequenceFile's path and add it to the distributed cache so all tasks share it
String partitionFile = TotalOrderPartitioner.getPartitionFile(conf);
URI uri = new URI(partitionFile);
job.addCacheFile(uri);

The sampler runs on the client, so it downloads data from the cluster; take care that the amount downloaded is not too large, or sampling will run for a long time.
The number of reducer tasks, i.e. the number of partitions, is set through mapreduce.job.reduces; this determines how many evenly filled partitions are ultimately produced.

RandomSampler is the most general-purpose sampler. Besides it, there are others:

  • SplitSampler: samples only the first n records in each split rather than sampling from across the whole split, so it is not suitable for data that is already sorted
  • IntervalSampler: picks keys from a split at fixed intervals, so it is well suited to sorted data
Secondary sort

A secondary sort orders records by value as well as by key. This topic was already discussed in the serialization section of Hadoop I/O;
for a concrete example, refer to the article on custom sorting and grouping when submitting a Hadoop job.
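
Although the details are in the referenced article, the standard wiring looks roughly like this. This is only a sketch: CompositeKey, NaturalKeyPartitioner, NaturalKeyGroupingComparator, and CompositeKeyComparator are hypothetical classes the job would have to define, with the composite key holding the natural key plus the value to sort by:

// Map output key carries (natural key, value-to-sort-by)
job.setMapOutputKeyClass(CompositeKey.class);

// Partition and group on the natural key only, so all values for a key
// still arrive in a single reduce() call...
job.setPartitionerClass(NaturalKeyPartitioner.class);
job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class);

// ...but sort on the full composite key, so the values arrive in order
job.setSortComparatorClass(CompositeKeyComparator.class);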

Joins

How to perform a join with MapReduce depends on the size and structure of the datasets.
If one dataset is large and the other is small, the join can be done entirely with MapReduce's DistributedCache,
by distributing the small dataset to every node.

If both datasets are large, joins divide into map-side joins and reduce-side joins.

Map-side join

A map-side join performs the join before the data reaches the map function.
For this to work, the input data on the map side must satisfy:

1. Both datasets are divided into the same number of partitions
2. Both datasets are sorted by the same key

Since a map can take the outputs of several previously executed jobs as its input, these conditions can be met by arranging those earlier jobs appropriately.
In that case, the input data should satisfy:

1. The two jobs used the same number of reducers
2. The keys are the same, and the records for a given key are not split across partitions

Once these requirements are met, you can use the CompositeInputFormat class in the org.apache.hadoop.mapreduce.lib.join package to perform the join before the map function runs.
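
A minimal sketch of such a configuration follows; the input paths and the choice of KeyValueTextInputFormat are assumptions for illustration:

// CompositeInputFormat and TupleWritable live in org.apache.hadoop.mapreduce.lib.join
Configuration conf = new Configuration();
// Build an inner-join expression over two sorted, identically partitioned inputs
conf.set("mapreduce.join.expr", CompositeInputFormat.compose(
        "inner",                          // join type: "inner" or "outer"
        KeyValueTextInputFormat.class,    // format both inputs were written in
        new Path("/data/left"), new Path("/data/right")));

Job job = Job.getInstance(conf, "map-side join");
job.setInputFormatClass(CompositeInputFormat.class);
// Each map() call then receives a TupleWritable holding the matched records
// from both datasets for the same key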

Reduce-side join

A reduce-side join makes fewer demands on the data than a map-side join: it relies on the fact that records with the same key are shuffled into the same reducer (partition).
The reduce side can therefore perform the join naturally, but because the data must go through the shuffle, it is usually less efficient than a map-side join.

A reduce-side join can also take advantage of the secondary sort discussed earlier.
Sometimes a join requires the records of one dataset to reach the reduce function before those of another; we can then use a secondary sort on a tag attached to each record.
Records from the dataset that should arrive first are tagged 0, those from the other dataset are tagged 1, and sorting on the tag delivers the desired dataset to the reduce function first.
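
As a rough sketch of the tagging idea, a hypothetical composite key might carry the join key plus a source tag, with compareTo() ordering by join key first and tag second. A partitioner and grouping comparator that look only at the join key would then ensure both datasets' records for a key meet in one reduce() call, with tag-0 records first:

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

// Hypothetical composite key: (join key, source tag)
public class TaggedKey implements WritableComparable<TaggedKey> {
    private Text joinKey = new Text();
    private IntWritable tag = new IntWritable();

    public TaggedKey() {}

    public TaggedKey(String joinKey, int tag) {
        this.joinKey.set(joinKey);
        this.tag.set(tag);
    }

    public Text getJoinKey() { return joinKey; }

    @Override
    public void write(DataOutput out) throws IOException {
        joinKey.write(out);
        tag.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        joinKey.readFields(in);
        tag.readFields(in);
    }

    // Sort by join key first, then by tag, so tag-0 records arrive first
    @Override
    public int compareTo(TaggedKey other) {
        int cmp = joinKey.compareTo(other.joinKey);
        return cmp != 0 ? cmp : tag.compareTo(other.tag);
    }
}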

Side data distribution

So-called side data can be understood as read-only data that any task
in a MapReduce job may use during execution to help process the main dataset.

Using the job configuration

The various setter methods of the Configuration class make it convenient to set key-value pairs of basic data types.
In a task, the user can retrieve this information through the context's getConfiguration() method.
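
A minimal sketch (the property names are hypothetical): set values in the driver, then read them back in a task:

// In the driver: store small pieces of side data in the job configuration
Configuration conf = job.getConfiguration();
conf.set("myjob.greeting", "hello");
conf.setInt("myjob.max.records", 1000);

// In a mapper or reducer: read the values back through the task context
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    String greeting = context.getConfiguration().get("myjob.greeting");
    int maxRecords = context.getConfiguration().getInt("myjob.max.records", 100);
}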

This is sufficient for the many situations in which you only need to set a few properties,
but it has drawbacks:

  • It is suitable only for small pieces of data of property-like types
  • For complex objects, users must handle serialization and deserialization themselves
  • The entire configuration is read into memory every time it is read, whether a given setting is needed or not
DistributedCache

The distributed cache mechanism copies user-specified files to each node before the job runs, for tasks to use.
The cache size defaults to 10 GB per node and can be configured (in megabytes) through yarn.nodemanager.localizer.cache.target-size-mb.

For detailed usage, see the article DistributedCache in MapReduce.
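
A minimal sketch of the basic pattern (the file path and the "lookup" link name are hypothetical): register a cache file in the driver, then open the localized copy in a task's setup() method:

// In the driver: register the file; the "#lookup" fragment gives it a link name
job.addCacheFile(new URI("/data/lookup.txt#lookup"));

// In a mapper or reducer: the localized copy is linked into the task's working
// directory under the link name, so it can be read like a local file
@Override
protected void setup(Context context) throws IOException, InterruptedException {
    try (BufferedReader reader = new BufferedReader(new FileReader("lookup"))) {
        String line;
        while ((line = reader.readLine()) != null) {
            // parse each line of side data, e.g. into an in-memory lookup map
        }
    }
}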
