Contents
- Built-in counters
- User-Defined Java counters
- User-Defined streaming counters
- Total sort
- Secondary sort
- Using the job Configuration
- Distributed cache
Counters
There are often things you would like to know about the data you are analyzing but that are peripheral to the analysis you are performing. For example, if you were counting invalid records and discovered that the proportion of invalid records in the whole dataset was very high, you might be prompted to check why so many records were being marked as invalid. Perhaps there is a bug in the part of the program that detects invalid records?
Counters are a useful channel for gathering statistics about the job: for quality control or for application-level statistics.
Put simply, counters implement a global count across the distributed nodes, for monitoring and recording analysis results.
This is easy to implement: each task counts independently and reports to the jobtracker, which aggregates the partial counts to obtain the final value. While the job is still running, the counter value may therefore be inaccurate, because there may be multiple attempts of the same task, and the intermediate result may count some records more than once. When the job ends, only one attempt's result is retained.
Built-in counters
Hadoop maintains some built-in counters for every job, which report various metrics for your job. For example, there are counters for the number of bytes and records processed, which allows you to confirm that the expected amount of input was consumed and the expected amount of output was produced.
Counters are divided into groups, and there are several groups of built-in counters, listed below:
MapReduce task counters
- org.apache.hadoop.mapred.Task$Counter (0.20)
- org.apache.hadoop.mapreduce.TaskCounter (post 0.20)
Filesystem counters
- FileSystemCounters (0.20)
- org.apache.hadoop.mapreduce.FileSystemCounter (post 0.20)
FileInputFormat counters
- org.apache.hadoop.mapred.FileInputFormat$Counter (0.20)
- org.apache.hadoop.mapreduce.lib.input.FileInputFormatCounter (post 0.20)
FileOutputFormat counters
- org.apache.hadoop.mapred.FileOutputFormat$Counter (0.20)
- org.apache.hadoop.mapreduce.lib.output.FileOutputFormatCounter (post 0.20)
Job counters
- org.apache.hadoop.mapred.JobInProgress$Counter (0.20)
- org.apache.hadoop.mapreduce.JobCounter (post 0.20)
As you can see, Hadoop itself relies on counters to monitor job status, which is why it defines so many built-in counters; in fact, most of the figures shown on the Hadoop management interface come from built-in counters.
User-Defined Java counters
MapReduce allows user code to define a set of counters, which are then incremented as desired in the mapper or reducer. Counters are defined by a Java enum, which serves to group related counters. A job may define an arbitrary number of enums, each with an arbitrary number of fields. The name of the enum is the group name, and the enum's fields are the counter names.
Counters are global: the MapReduce framework aggregates them across all maps and reduces to produce a grand total at the end of the job.
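For example, here is a minimal sketch of an enum counter incremented in a mapper; the enum and class names follow the book's MaxTemperature example, but the parsing logic is assumed:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  // The enum name is the counter group; its fields are the counter names.
  enum Temperature { MISSING, MALFORMED }

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    if (line.isEmpty()) {
      // Each task increments its own copy; the framework sums them globally.
      context.getCounter(Temperature.MISSING).increment(1);
      return;
    }
    // ... normal map logic goes here ...
  }
}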
Dynamic counters
The code makes use of a dynamic counter, one that isn't defined by a Java enum. Since a Java enum's fields are defined at compile time, you can't create new counters on the fly using enums. Here we want to count the distribution of temperature quality codes, and though the format specification defines the values that it can take, it is more convenient to use a dynamic counter to emit the values that it actually takes. The method we use on the Reporter object takes a group and counter name using string names:
public void incrCounter(String group, String counter, long amount)
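In the new (org.apache.hadoop.mapreduce) API there is no Reporter; the equivalent is the getCounter(String, String) method on the task context. A minimal sketch, where the group name and the qualityCode variable are illustrative:

// Inside map() or reduce(), assuming qualityCode is a String holding the
// quality code read from the current record:
context.getCounter("TemperatureQuality", qualityCode).increment(1);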
Retrieving counters
In addition to being available via the web UI and the command line (using hadoop job -counter), you can retrieve counter values using the Java API. You can do this while the job is running, although it is more usual to get counters at the end of a job run, when they are stable.
Counters counters = job.getCounters();
long missing = counters.getCounter(MaxTemperatureWithCounters.Temperature.MISSING);
long total = counters.findCounter("org.apache.hadoop.mapred.Task$Counter", "MAP_INPUT_RECORDS").getCounter();
User-Defined streaming counters
A Streaming MapReduce program can increment counters by sending a specially formatted line to the standard error stream, which is co-opted as a control channel in this case. The line must have the following format:
reporter:counter:group,counter,amount
This snippet in Python shows how to increment the "missing" counter in the "Temperature" group by one:
sys.stderr.write("reporter:counter:Temperature,missing,1\n")
Sorting
The ability to sort data is at the heart of MapReduce. Even if your application isn't concerned with sorting per se, it may be able to use the sorting stage that MapReduce provides to organize its data. In this section, we will examine different ways of sorting datasets and how you can control the sort order in MapReduce.
Sorting is at the core of Hadoop and is provided by the platform by default: the final output of a job is sorted by key.
So how does the system order the keys?
Controlling sort order
The sort order for keys is controlled by a RawComparator, which is found as follows:
1. If the property mapred.output.key.comparator.class is set, either explicitly or by calling setSortComparatorClass() on Job, then an instance of that class is used. (In the old API the equivalent method is setOutputKeyComparatorClass() on JobConf.)
2. Otherwise, keys must be a subclass of WritableComparable, and the registered comparator for the key class is used.
3. If there is no registered comparator, then a RawComparator is used that deserializes the byte streams being compared into objects and delegates to the WritableComparable's compareTo() method.
These rules reinforce why it's important to register optimized versions of RawComparators for your own custom Writable classes (which is covered in "Implementing a RawComparator for speed" on page 108), and also that it's straightforward to override the sort order by setting your own comparator (we do this in "Secondary sort" on page 276).
In other words: if the job's sort comparator class is set, it is used for sorting. Otherwise, the key must be a WritableComparable, and it is best to register a raw comparator to avoid deserializing the keys into objects for each comparison.
In the Hadoop 0.20.2 API, the job is represented by the org.apache.hadoop.mapreduce.Job class.
It has three relevant methods:
- Job.setSortComparatorClass(Class<? extends RawComparator> cls): defines the comparator that controls how the keys are sorted before they are passed to the Reducer.
- Job.setPartitionerClass(Class<? extends Partitioner> cls): sets the Partitioner for the job.
- Job.setGroupingComparatorClass(Class<? extends RawComparator> cls): defines the comparator that controls which keys are grouped together for a single call to Reducer.reduce().
If you do not use these three APIs, the system behaves as follows:
1. The key must be a WritableComparable, so its compareTo() method is used to sort the keys by default.
2. The default hash partitioner partitions records by key.
3. Grouping is by key by default, so the values for the same key form a list that is passed to the reducer.
Setting these user-defined classes replaces the default processing logic above; most of the techniques that follow work by overriding these three hooks to achieve different behavior.
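As a concrete illustration, a driver might register the three hooks like this (a minimal sketch; MyKeyComparator, MyPartitioner, and MyGroupingComparator are placeholder class names, not from the book):

// Inside the driver's run() method:
Job job = new Job(conf, "sort hooks example");
job.setSortComparatorClass(MyKeyComparator.class);          // how keys are ordered
job.setPartitionerClass(MyPartitioner.class);               // which reducer a key goes to
job.setGroupingComparatorClass(MyGroupingComparator.class); // which keys share one reduce() call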
Total sort
How can you produce a globally sorted file using Hadoop? The naive answer is to use a single partition. But this is incredibly inefficient for large files, since one machine has to process all of the output, so you are throwing away the benefits of the parallel architecture that MapReduce provides.
With the default setup, the output of each reducer is sorted, but there is no ordering across reducers, unless you use a single reducer, which is unrealistic when the data volume is large.
So what should we do?
The idea is simple: understand the data distribution, then write a custom partitioner that partitions the keys by range, so that concatenating the reducer outputs in partition order yields a globally sorted result. For example, for the weather dataset the temperature ranges and record proportions might look like this:
Temperature range        < -5.6 °C   [-5.6 °C, 13.9 °C)   [13.9 °C, 22.0 °C)   >= 22.0 °C
Proportion of records    29%         24%                  23%                  24%
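A minimal sketch of such a range partitioner follows; the cut points and the choice of an IntWritable temperature key (in tenths of a degree Celsius) are assumptions for illustration:

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class TemperatureRangePartitioner extends Partitioner<IntWritable, Text> {
  // Cut points taken from the sampled distribution above, in tenths of a degree.
  private static final int[] CUTS = { -56, 139, 220 };

  @Override
  public int getPartition(IntWritable key, Text value, int numPartitions) {
    int t = key.get();
    int p = 0;
    while (p < CUTS.length && t >= CUTS[p]) {
      p++;
    }
    // Assumes the job runs with CUTS.length + 1 reducers; clamp just in case.
    return Math.min(p, numPartitions - 1);
  }
}

In practice, Hadoop also ships a TotalOrderPartitioner together with an InputSampler that can compute the partition boundaries by sampling the input, so the cut points do not have to be chosen by hand.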
Secondary sort
The MapReduce framework sorts the records by key before they reach the reducers.
For any particular key, however, the values are not sorted. The order in which the values appear is not even stable from one run to the next, since they come from different map tasks, which may finish at different times from run to run.
Generally speaking, most MapReduce programs are written so as not to depend on the order in which the values appear to the reduce function. However, it is possible to impose an order on the values by sorting and grouping the keys in a particular way.
So if you need the values of the same key to arrive at the reducer in sorted order, what should you do?
To summarize, there is a recipe here to get the effect of sorting by value:
• Make the key a composite of the natural key and the natural value.
• The sort comparator should order by the composite key, that is, the natural key and the natural value.
• The partitioner and grouping comparator for the composite key should consider only the natural key for partitioning and grouping.
This is a typical example of implementing a feature by overriding the three functions mentioned above.
Instead of sorting by the key alone, we build a composite key of the natural key and the value, so that the value can take part in the sort.
Override the sort comparator class: sort by the natural key first, and by the value within the same natural key.
This seems to be all that is needed, but it is not that simple, because the values of the same natural key still have to end up in one list that is passed to the reducer.
Therefore, you must ensure that records with the same natural key are partitioned to the same reducer, which means overriding the partitioner class; otherwise the composite key would be used for partitioning.
You also need to override the grouping comparator class to ensure that grouping is based on the natural key instead of the composite key. A sketch of the three classes follows.
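A minimal sketch under the assumption that the composite key is a Text of the form "naturalKey\tvalue"; the class names are illustrative, each class would normally live in its own file, and the driver registers them with the three Job methods shown earlier:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

class NaturalKeyPartitioner extends Partitioner<Text, Text> {
  @Override
  public int getPartition(Text compositeKey, Text value, int numPartitions) {
    // Partition on the natural key only, so all records for one natural key
    // reach the same reducer.
    String naturalKey = compositeKey.toString().split("\t", 2)[0];
    return (naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
  }
}

class CompositeKeyComparator extends WritableComparator {
  protected CompositeKeyComparator() { super(Text.class, true); }
  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    // Sort by the natural key first, then by the value.
    String[] x = a.toString().split("\t", 2);
    String[] y = b.toString().split("\t", 2);
    int cmp = x[0].compareTo(y[0]);
    return cmp != 0 ? cmp : x[1].compareTo(y[1]);
  }
}

class NaturalKeyGroupingComparator extends WritableComparator {
  protected NaturalKeyGroupingComparator() { super(Text.class, true); }
  @Override
  public int compare(WritableComparable a, WritableComparable b) {
    // Group on the natural key only, so one reduce() call sees all values
    // for that natural key, already sorted by value.
    return a.toString().split("\t", 2)[0].compareTo(b.toString().split("\t", 2)[0]);
  }
}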
Joins
MapReduce can perform joins between large datasets, but writing the code to do joins from scratch is fairly involved. Rather than writing MapReduce programs, you might consider using a higher-level framework such as Pig, Hive, or Cascading, in which join operations are a core part of the implementation.
For joins, I suggest using Pig or Hive. We can still discuss the idea here, but there is rarely a need to write joins by hand.
Suppose we join two files. If one of them is relatively small, it can be loaded into memory: put the small file into the distributed cache, have each mapper load it into memory, and join it against the large file as records stream through.
If both files are large, things are more involved.
First, both files are used as mapper inputs, as follows, and the join value is emitted as the map output key.
MultipleInputs.addInputPath(job, ncdcInputPath, TextInputFormat.class, JoinRecordMapper.class);
MultipleInputs.addInputPath(job, stationInputPath, TextInputFormat.class, JoinStationMapper.class);
FileOutputFormat.setOutputPath(job, outputPath);
As you can imagine, in the reducer all the values for a given key arrive in one list, and the reducer has to work out which values came from which input file.
The remaining problem is best explained with an example. Suppose we join the following two tables:
Station ID      Station name
011990-99999    sihccajavri

Station ID      Other information
011990-99999    0067011990999991950051507004+68750...
011990-99999    0043011990999991950051512004+68750...
011990-99999    0043011990999991950051518004+68750...
The station ID is the key for both tables, and the data flowing from the mappers to the reducer looks like this:
011990-99999, [0067011990999991950051507004+68750..., 0043011990999991950051512004+68750..., 0043011990999991950051518004+68750..., sihccajavri]
Having sihccajavri at the end of the value list is the worst case; in fact, you cannot predict where it will appear in the list, because the MapReduce framework does not sort the values by default.
The desired result of the join looks like this:
Station ID      Station name    Other information
011990-99999    sihccajavri     0067011990999991950051507004+68750...
011990-99999    sihccajavri     0043011990999991950051512004+68750...
011990-99999    sihccajavri     0043011990999991950051518004+68750...
Producing it therefore requires traversing the list to locate sihccajavri first and then joining it with the other values; every value seen before sihccajavri is found must be buffered in memory.
This is fine for small data, but for big data a single station ID may have a very large number of records, so the in-memory buffer may blow up.
So the solution is to ensure that sihccajavri is always the first value we read.
How do we ensure that? With a secondary sort: use a composite key of (station ID, document ID), then sort on the composite key, but partition and group on the station ID alone. Giving the station file a smaller document ID than the weather records guarantees the station name comes first; a sketch of the two mappers follows.
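A minimal sketch of what the two mappers named earlier might emit, assuming tab-separated station metadata, a fixed-offset station ID in the weather records, and a composite key packed into a Text as "stationId\ttag" (tag 0 for the station file, 1 for weather records); these implementations are illustrative, not the book's:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

class JoinStationMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String[] fields = value.toString().split("\t");   // assumed: stationId \t stationName
    context.write(new Text(fields[0] + "\t0"), new Text(fields[1]));
  }
}

class JoinRecordMapper extends Mapper<LongWritable, Text, Text, Text> {
  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String stationId = line.substring(4, 15);          // assumed field offsets
    context.write(new Text(stationId + "\t1"), new Text(line));
  }
}

With the composite-key comparator, partitioner, and grouping comparator from the secondary-sort section applied, the station name arrives as the first value in each reduce() call and can be joined against the remaining values without buffering.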
Side Data Distribution
Side data can be defined as extra read-only data needed by a job to process the main dataset. The challenge is to make side data available to all the map or reduce tasks (which are spread across the cluster) in a convenient and efficient fashion.
Using the job Configuration
You can set arbitrary key-value pairs in the job configuration using the various setter methods on Configuration (or JobConf in the old MapReduce API). This is very useful if you need to pass a small piece of metadata to your tasks.
If you only need to pass a few parameters and a small amount of data, the job configuration is enough. A sketch follows.
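A minimal sketch (the property name my.side.data is illustrative):

// In the driver, before submitting the job:
job.getConfiguration().set("my.side.data", "some-small-value");

// In the mapper or reducer (new API), typically in setup():
String sideData = context.getConfiguration().get("my.side.data");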
Distributed cache
Rather than serializing side data in the job configuration, it is preferable to distribute datasets using Hadoop's distributed cache mechanism. This provides a service for copying files and archives to the task nodes in time for the tasks to use them when they run. To save network bandwidth, files are normally copied to any particular node only once per job.
If the data is large, such as a corpus, you need to use the distributed cache.
How it works
When you launch a job, Hadoop copies the files specified by the -files, -archives, and -libjars options to the jobtracker's filesystem (normally HDFS).
Then, before a task is run, the tasktracker copies the files from the jobtracker's filesystem to a local disk (the cache) so the task can access the files. The files are said to be localized at this point.
From the task's point of view, the files are just there (and it doesn't care that they came from HDFS). In addition, files specified by -libjars are added to the task's classpath before it is launched.
Of the three options, -files is the most straightforward; archives passed with -archives are automatically unarchived on the task node, while -libjars automatically adds the JARs to the task's Java classpath.
When the job is launched, these files are copied to HDFS, and then the tasktracker copies them to its local disk before the task runs, so they are easy to access; reading such a localized file in a task looks like the sketch below.
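A minimal sketch (the file name stations.txt and its tab-separated layout are assumptions); files shipped with -files are localized into the task's working directory, so they can be opened by name:

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class StationAwareMapper extends Mapper<LongWritable, Text, Text, Text> {
  private Map<String, String> stationNames = new HashMap<String, String>();

  @Override
  protected void setup(Context context) throws IOException {
    // The cached file appears in the task's working directory under its own name.
    BufferedReader in = new BufferedReader(new FileReader("stations.txt"));
    String line;
    while ((line = in.readLine()) != null) {
      String[] fields = line.split("\t");
      stationNames.put(fields[0], fields[1]);
    }
    in.close();
  }
  // map() can now look up stationNames without touching HDFS.
}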
The tasktracker also maintains a reference count for the number of tasks using each file in the cache. Before a task runs, the file's reference count is incremented by one; after the task has run, the count is decremented by one. Only when the count reaches zero is the file eligible for deletion, since no tasks are using it. Files are deleted to make room for a new file when the cache exceeds a certain size, 10 GB by default. The cache size may be changed by setting the configuration property local.cache.size, which is measured in bytes.
In short, the reference count ensures that these files are deleted only when no task is using them.