Counters
The counter view is often more convenient than digging through the cluster logs, so in some cases counter information is a more efficient way to gather job statistics than log analysis.
User-defined counters
A description of Hadoop's built-in counters can be found in the ninth chapter of Hadoop: The Definitive Guide (MapReduce Features); space does not permit repeating it here.
MapReduce allows users to define custom counters in a program, either with Java enums or with group/name strings. A job can define any number of counters. When an enum is used, the enum's name becomes the counter group name and each enum constant becomes a counter name. Counters are global: they are aggregated across all mappers and reducers and yield one total per counter at the end of the job.
For example, given the following enum:
enum Temperature { MISSING, MALFORMAT }
In a MapReduce program, the counters can then be used like this:
context.getCounter(Temperature.MISSING).increment(1);
context.getCounter(Temperature.MALFORMAT).increment(1);
Dynamic counters
Since an enum fixes all of its constants at compile time, in some cases we may want to name counters after values that are not known until runtime. For that, dynamic counters can be used:
context.getCounter("counterGroupName", "counterName").increment(1);
Here the group and counter names can be produced in any way, for example from dynamically obtained field values.
In most cases, however, enums are sufficient, and they are more readable, easier to use, and type-safe, so it is recommended to use enums whenever possible.
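Conceptually, each task keeps its own tallies and the framework sums them per counter at the end of the job. A minimal plain-Java sketch of that aggregation (no Hadoop dependency; the Temperature enum mirrors the example above, and the per-task maps stand in for each task's local counters):

```java
import java.util.EnumMap;
import java.util.List;
import java.util.Map;

public class CounterSketch {
    enum Temperature { MISSING, MALFORMAT }

    // Sum per-task counter maps into one job-wide view, as the framework does.
    static Map<Temperature, Long> aggregate(List<Map<Temperature, Long>> perTask) {
        Map<Temperature, Long> total = new EnumMap<>(Temperature.class);
        for (Map<Temperature, Long> task : perTask) {
            task.forEach((k, v) -> total.merge(k, v, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        Map<Temperature, Long> task1 = new EnumMap<>(Temperature.class);
        task1.put(Temperature.MISSING, 3L);
        Map<Temperature, Long> task2 = new EnumMap<>(Temperature.class);
        task2.put(Temperature.MISSING, 2L);
        task2.put(Temperature.MALFORMAT, 1L);
        Map<Temperature, Long> total = aggregate(List.of(task1, task2));
        System.out.println(total.get(Temperature.MISSING));   // 5
        System.out.println(total.get(Temperature.MALFORMAT)); // 1
    }
}
```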
Getting a counter's value in code
In addition to obtaining a job's counters through the web UI, the CLI, or the -counter option, the user can also read a counter's value programmatically:
String jobId = args[0];
Cluster cluster = new Cluster(getConf());
Job job = cluster.getJob(JobID.forName(jobId));
if (job == null) {
    System.err.printf("No job with ID %s found%n", jobId);
    return -1;
}
if (!job.isComplete()) {
    System.err.printf("Job %s is not complete%n", jobId);
    return -1;
}
Counters counters = job.getCounters();
// key code
long missing = counters.findCounter(Temperature.MISSING).getValue();
long total = counters.findCounter(TaskCounter.MAP_INPUT_RECORDS).getValue();
Sorting
Partial sort
A partial sort orders the data within each partition; it happens during the sort on the map side. See also: custom sorting and grouping when submitting a Hadoop job.
The sort order in MapReduce is determined by the following rules, in order of precedence:
1. If the mapreduce.job.output.key.comparator.class property is set, or setSortComparatorClass() has been called on the job, the specified class is used for sorting
2. Otherwise, the key must be a subclass of WritableComparable, and a comparator registered for that key type is used if one exists
3. Otherwise, RawComparator is used to deserialize the byte streams into objects, and the WritableComparable compareTo() method is called
If we derive a custom data type from WritableComparable and override its compareTo() method, that path is the last one consulted. If a concrete RawComparator/WritableComparator implementation is defined and registered, it takes precedence, since it can compare the serialized byte arrays directly.
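The appeal of a raw comparator is precisely that it orders serialized bytes without deserializing them. A plain-Java sketch (not the Hadoop API) showing that for non-negative ints encoded big-endian, unsigned byte-by-byte comparison gives the same order as comparing the decoded values:

```java
import java.nio.ByteBuffer;

public class RawCompareSketch {
    // Serialize an int as 4 big-endian bytes (the layout IntWritable uses).
    static byte[] encode(int v) {
        return ByteBuffer.allocate(4).putInt(v).array();
    }

    // Compare the serialized forms without decoding them.
    // For non-negative ints, unsigned lexicographic byte order matches
    // numeric order; negative values would need sign-bit handling.
    static int compareRaw(byte[] a, byte[] b) {
        for (int i = 0; i < 4; i++) {
            int x = a[i] & 0xff, y = b[i] & 0xff;
            if (x != y) return x < y ? -1 : 1;
        }
        return 0;
    }

    public static void main(String[] args) {
        System.out.println(compareRaw(encode(3), encode(200)));  // -1
        System.out.println(compareRaw(encode(200), encode(3)));  // 1
        System.out.println(compareRaw(encode(7), encode(7)));    // 0
    }
}
```

This is the trick a WritableComparator for a fixed-width key type exploits: no objects are allocated per comparison during the sort.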
Full sort
The shuffle phase of MapReduce sorts only within individual partitions, which is the partial sort discussed above. Within each partition the data is ordered, but taken as a whole the data is not. How can MapReduce be made to produce globally ordered data?
The simplest approach is to use a single partition, but this forfeits MapReduce's parallelism: all the data must be processed on one machine. In fact there is another way to achieve a global order while still exploiting MapReduce's parallel computing power, though it requires some extra work.
Consider that in a partial sort the data within each partition is ordered, but the partitions themselves are in no particular order. What if we could ensure the partitions are ordered too? For example, partition 1 holds the keys 1-100, partition 2 holds 101-200, and so on. Then the partitions are ordered relative to one another, the data inside each partition is ordered anyway, and a global order is achieved.
One issue must be handled along the way: how do we ensure the data is distributed evenly across the partitions? In a real scenario, the range 1-100 might contain 1000 records while 101-200 contains only 50, which creates data skew.
Solving this generally requires a deep understanding of the composition of the data, but with massive datasets it is impossible to inspect every record. Instead we can sample. The core idea of sampling is to look at only a small fraction of the keys, obtain an approximate distribution of the keys, and build the partition boundaries from it.
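The idea can be sketched in plain Java (this is an illustration, not Hadoop's InputSampler): keep each key with some probability, sort the sample, and pick evenly spaced split points as the partition boundaries:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class SamplingSketch {
    // Randomly keep each key with probability `freq`, then derive
    // numPartitions - 1 split points from the sorted sample.
    static List<Integer> splitPoints(List<Integer> keys, double freq,
                                     int numPartitions, long seed) {
        Random rnd = new Random(seed);
        List<Integer> sample = new ArrayList<>();
        for (Integer k : keys) {
            if (rnd.nextDouble() < freq) sample.add(k);
        }
        Collections.sort(sample);
        List<Integer> splits = new ArrayList<>();
        for (int i = 1; i < numPartitions; i++) {
            // evenly spaced quantiles of the sample
            splits.add(sample.get(i * sample.size() / numPartitions));
        }
        return splits;
    }

    public static void main(String[] args) {
        List<Integer> keys = new ArrayList<>();
        for (int i = 0; i < 10000; i++) keys.add(i);
        // 4 partitions: the three split points land near the quartiles
        System.out.println(splitPoints(keys, 0.1, 4, 42L));
    }
}
```

Keys below the first split point go to partition 0, keys between the first and second to partition 1, and so on, which is essentially what TotalOrderPartitioner does with the partition file described next.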
Several samplers are built into Hadoop, all implementing the following interface:
public interface Sampler<K, V> {
    K[] getSample(InputFormat<K, V> inf, Job job) throws IOException, InterruptedException;
}
This getSample() method is not normally called directly, however; it is invoked by InputSampler's writePartitionFile() method, whose goal is to create a SequenceFile storing the keys that define the partitions:
public static <K, V> void writePartitionFile(Job job, Sampler<K, V> sampler) throws IOException, ClassNotFoundException, InterruptedException
The SequenceFile is then used by TotalOrderPartitioner to create the partitions for the job:
// set the partitioner class to TotalOrderPartitioner
job.setPartitionerClass(TotalOrderPartitioner.class);
// random sampler: sampling probability 0.1, at most 10000 samples from at most 10 splits;
// sampling stops as soon as either limit is reached
InputSampler.Sampler<IntWritable, Text> sampler =
    new InputSampler.RandomSampler<IntWritable, Text>(0.1, 10000, 10);
// use the sampler to create the SequenceFile that defines the partition keys
InputSampler.writePartitionFile(job, sampler);
// add the SequenceFile to the distributed cache so it is shared with all tasks
String partitionFile = TotalOrderPartitioner.getPartitionFile(conf);
URI uri = new URI(partitionFile);
job.addCacheFile(uri);
The sampler runs on the client, so it downloads data from the cluster; take care that the amount downloaded is not so large that sampling runs for a long time. The number of reducer tasks, i.e. the number of partitions, should also be set, via mapreduce.job.reduces, to the number of evenly loaded partitions to be produced.
RandomSampler is the most general-purpose sampler. Besides it, there are others:
- SplitSampler: samples only the first n records of each split rather than sampling across the whole split, so it is not suitable for data that is already sorted
- IntervalSampler: picks keys from a split at fixed intervals, which makes it well suited to sorted data
Secondary sort
A secondary sort orders records by their value as well as their key. This was already discussed in the serialization section of Hadoop I/O; for a concrete example see: custom sorting and grouping when submitting a Hadoop job.
Join
How a join is performed with MapReduce depends on the size and structure of the datasets. If one dataset is large and the other small, the DistributedCache can simply be used to distribute the small dataset to every node. If both datasets are large, joins divide into map-side joins and reduce-side joins.
Map-side join
A map-side join executes before the data reaches the map function. For this to work, the map's input data must satisfy:
1. Both datasets are divided into the same number of partitions
2. Both datasets are sorted by the same key
Since a map can take the outputs of multiple previously executed jobs as its input, in that case the input data should satisfy:
1. The jobs that produced it used the same number of reducers
2. The keys are the same, and the output files are not splittable
Once the requirements for a map-side join are met, the CompositeInputFormat class in the org.apache.hadoop.mapreduce.lib.join package can be used to perform the join before the map function.
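What CompositeInputFormat does can be pictured as a merge join over inputs sorted on the same key. A simplified plain-Java sketch (hypothetical record shape, not the Hadoop API) joining two key-sorted lists in a single pass:

```java
import java.util.ArrayList;
import java.util.List;

public class MergeJoinSketch {
    record Rec(int key, String value) {}

    // Both inputs must already be sorted by key - the precondition for a
    // map-side join. Duplicate keys are not handled in this sketch.
    static List<String> join(List<Rec> left, List<Rec> right) {
        List<String> out = new ArrayList<>();
        int i = 0, j = 0;
        while (i < left.size() && j < right.size()) {
            int l = left.get(i).key(), r = right.get(j).key();
            if (l < r) i++;          // advance the side with the smaller key
            else if (l > r) j++;
            else {                   // matching keys: emit the joined record
                out.add(l + ":" + left.get(i).value() + "," + right.get(j).value());
                i++;
                j++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Rec> a = List.of(new Rec(1, "a1"), new Rec(2, "a2"), new Rec(4, "a4"));
        List<Rec> b = List.of(new Rec(2, "b2"), new Rec(3, "b3"), new Rec(4, "b4"));
        System.out.println(join(a, b)); // [2:a2,b2, 4:a4,b4]
    }
}
```

Because both sides are sorted, each input is read exactly once and nothing needs to be buffered, which is why the sortedness preconditions above matter.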
Reduce-side join
A reduce-side join places fewer demands on the data than a map-side join: it relies on the fact that records with the same key are shuffled into the same reducer (partition), so the reducer can perform a natural join. But because the data must go through the shuffle, it is often less efficient than a map-side join. A reduce-side join can also take advantage of the secondary sort discussed earlier.
Sometimes a join requires the records of one dataset to reach the reduce function before those of the other. We can then use a secondary sort to tag each value: the dataset that must arrive first is tagged 0 and the other is tagged 1, and sorting on this tag ensures the desired dataset reaches the reducer first.
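The tagging trick can be sketched in plain Java. In a real job the ordering is produced by the shuffle's secondary sort; here the values for one key are sorted in memory purely for illustration (hypothetical record shape, not the Hadoop API):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class TaggedJoinSketch {
    record Tagged(int tag, String value) {}

    // Simulates one reducer call: all values share the same key.
    // Sorting by tag puts the single tag-0 record (the dataset that must
    // arrive first, e.g. a dimension record) at the front, so it can be
    // held and joined against every following tag-1 record.
    static List<String> joinOneKey(List<Tagged> values) {
        List<Tagged> sorted = new ArrayList<>(values);
        sorted.sort(Comparator.comparingInt(Tagged::tag));
        String held = sorted.get(0).value();
        List<String> out = new ArrayList<>();
        for (int i = 1; i < sorted.size(); i++) {
            out.add(held + "|" + sorted.get(i).value());
        }
        return out;
    }

    public static void main(String[] args) {
        List<Tagged> values = List.of(
            new Tagged(1, "order-17"), new Tagged(0, "station-A"), new Tagged(1, "order-9"));
        System.out.println(joinOneKey(values)); // [station-A|order-17, station-A|order-9]
    }
}
```

The point of doing this with a secondary sort instead of buffering is that the reducer never has to hold more than the one leading record in memory, no matter how many tag-1 records follow.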
Side data distribution
Side data can be understood as read-only data that any task of a MapReduce job may use to assist in processing the main dataset.
Using the job configuration
The various setter methods of the Configuration class make it convenient to set key-value pairs of simple data types, and tasks can retrieve them through Context's getConfiguration() method. This is enough for the many situations where only a few properties need to be passed along.
But its drawbacks are:
- It is only suitable for small, property-like pieces of data
- For complex objects, users must handle serialization and deserialization themselves
- All settings are read into memory every time the configuration is read, whether they are needed or not
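The second drawback means a complex object has to be flattened into a string and parsed back by hand, with both sides agreeing on the format. A plain-Java sketch using java.util.Properties as a stand-in for the Hadoop Configuration (the "stopwords" key is a made-up example):

```java
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class SideConfSketch {
    // Pack a list into one delimited string property on the driver side...
    static void setList(Properties conf, String key, List<String> values) {
        conf.setProperty(key, String.join(",", values));
    }

    // ...and parse it back in the task; the delimiter is an ad-hoc contract,
    // which is exactly the fragility the drawback above refers to.
    static List<String> getList(Properties conf, String key) {
        return Arrays.asList(conf.getProperty(key, "").split(","));
    }

    public static void main(String[] args) {
        Properties conf = new Properties();
        setList(conf, "stopwords", List.of("a", "an", "the"));
        System.out.println(getList(conf, "stopwords")); // [a, an, the]
    }
}
```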
DistributedCache
The distributed cache mechanism copies user-supplied data to each node before the job runs. The cache size defaults to 10 GB and can be configured (in MB) via yarn.nodemanager.localizer.cache.target-size-mb.
For specific usage, see: DistributedCache in MapReduce
@ Little Black
MapReduce Advanced Features