1. Concept
2. References
Improve the MapReduce Job Efficiency, Hadoop Notes II (use a combiner as much as possible): http://sishuo(k).com/forum/blogpost/list/5829.html
Hadoop Learning Notes 8: Combiner and Custom Combiner: http://www.tuicool.com/articles/qazujav
Hadoop In-Depth Learning: Combiner (with an averaging scenario): http://blog.csdn.net/cnbird2008/article/details/23788233
Declaring a combiner function
Many MapReduce programs are limited by the available bandwidth on the cluster, so it pays to minimize the intermediate data that must be transferred between the map and reduce tasks. Hadoop allows you to declare a combiner function to process the map output; the combiner's output then becomes the input to the reduce function.
1. Background: In the MapReduce model, the reduce function mostly computes aggregates such as per-category totals, maxima, and minima. For such operations, it is worth running a combiner over the map output first, which reduces the network transfer load and lightens the burden on the reduce tasks. The combiner runs on each node and affects only the output of the local map, so it must not change the final result.
What is a combiner function?
“Many MapReduce jobs are limited by the bandwidth available on the cluster, so it pays to minimize the data transferred between map and reduce tasks. Hadoop allows the user to specify a combiner function to be run on the map output—the combiner function’s output forms the input to the reduce function.” (Hadoop: The Definitive Guide)
As we all know, the Hadoop framework uses a Mapper to process data into individual <key, value> pairs, transmits them across the network, and then sorts and processes them with a Reducer to produce the final output.
In the above process, we can see at least two performance bottlenecks:
If we have 1 billion data records, the Mapper will generate 1 billion key-value pairs to be transmitted across the network; but if we only need the maximum value of the data, the Mapper clearly only needs to output the largest value it has seen locally. This not only reduces the network pressure but also greatly improves program efficiency.
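To make this concrete, here is a minimal sketch (hypothetical class name, Hadoop new API) of a max-finding Reducer that can double as the combiner, so each map task forwards only its local maximum per key:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// Computes the maximum of all values seen for a key. Max is associative
// and commutative, so the same class can serve both as the combiner
// (local maximum per map task) and as the reducer (global maximum).
public class MaxReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        int max = Integer.MIN_VALUE;
        for (IntWritable v : values) {
            max = Math.max(max, v.get());
        }
        context.write(key, new IntWritable(max));
    }
}

Registered via job.setCombinerClass(MaxReducer.class), the mapper's billion records collapse to one local maximum per key before anything crosses the network.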
The bandwidth available on the cluster limits many MapReduce jobs, so the most important thing is to minimize the data transferred between the map and reduce tasks. Hadoop allows users to specify a merge function, usually called a combiner, for the output of the map task; like the mapper and reducer, it is user-defined. The output of the merge function serves as the input to the reduce function.
When the comment on the setCombinerClass line is removed, the Combine function is executed, and the job counters show it:

[main] INFO org.apache.hadoop.mapreduce.Job - Counters:
    File System Counters ...
    Map-Reduce Framework
        Map input records=6
        Map output records=…
        ...
        Input split bytes=192
        Combine input records=…
        Combine output records=9
        ...
        Reduce input records=9
        Reduce output records=7
        Spilled Records=…
        ...
        Virtual memory (bytes) snapshot=0
        Total committed heap usage (bytes)=457912320
    File Input Format Counters ...

Note that Combine output records=9 matches Reduce input records=9: the combiner has already merged the map output before the shuffle.
1. Combiner: The Combiner is an optimization in MapReduce. Each map task can generate a large amount of local output; the combiner's job is to merge the map-side output first, reducing the volume of data transferred between the map and reduce nodes and improving network I/O performance. A combiner can be set only if the operation satisfies the associative law. The role of the combiner is purely an optimization; it must not change the final result.
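The canonical illustration is WordCount, where the sum-reducer satisfies the associative law and can therefore be reused as the combiner. A self-contained sketch in the standard Hadoop new-API shape:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Emits (word, 1) for every token in the input line.
    public static class TokenizerMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            for (String token : line.toString().split("\\s+")) {
                if (!token.isEmpty()) {
                    word.set(token);
                    context.write(word, ONE);
                }
            }
        }
    }

    // Sums counts; addition is associative, so this class is safe to use
    // as the combiner as well as the reducer.
    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // the one-line combiner opt-in
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

Commenting out the setCombinerClass line and rerunning is exactly the experiment whose counters are shown above.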
Partitioner programming: data that shares some common characteristic is written to the same file. Sorting and grouping: when sorting in the map and reduce phases, only K2 is compared; V2 does not participate in the comparison. If you want V2 to be sorted as well, you need to assemble K2 and V2 into a new class that serves as K2, so that it participates in the comparison. To customize the sort order, have the sorted object implement the WritableComparable interface and define the ordering in its compareTo method, as sketched below.
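A minimal sketch of such a composite key (class and field names are hypothetical):

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.WritableComparable;

// Bundles the original K2 (word) with V2 (count) so that the value takes
// part in the sort; the ordering lives in compareTo.
public class WordCountPair implements WritableComparable<WordCountPair> {
    private String word;
    private long count;

    public WordCountPair() { }                    // Hadoop needs a no-arg constructor

    public WordCountPair(String word, long count) {
        this.word = word;
        this.count = count;
    }

    @Override
    public void write(DataOutput out) throws IOException {
        out.writeUTF(word);
        out.writeLong(count);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        word = in.readUTF();
        count = in.readLong();
    }

    @Override
    public int compareTo(WordCountPair other) {
        int cmp = word.compareTo(other.word);     // primary sort on the old K2
        if (cmp != 0) return cmp;
        return Long.compare(count, other.count);  // secondary sort on the old V2
    }
    // For real use, also override hashCode()/equals(); the default
    // HashPartitioner routes keys by hashCode().
}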
It is also worth mentioning Snappy, a compression algorithm developed and open-sourced by Google, which Cloudera officially and strongly advocates for use in MapReduce. Its characteristics: at a compression ratio similar to LZO, compression and decompression performance is greatly improved; however, it is not splittable as MapReduce input.
Extended content:
Cloudera official blog introduction to Snappy:
http://blog.cloudera.
program
(7) -combiner: user-defined combiner program (must be implemented in Java)
(8) -D: specifies job properties (formerly -jobconf), for example:
    1) mapred.map.tasks: number of map tasks
    2) mapred.reduce.tasks: number of reduce tasks
    3) stream.map.input.field.separator / stream.map.output.field.separator: field separator for map task input/output; the default delimiter is \t
    4) stream.num.map.output.key.fields: specifies how many fields of the map output make up the key
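A sketch of how these streaming options might be combined on one command line (the JAR path, HDFS paths, and script names are placeholders that vary by installation):

% hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
    -D mapred.reduce.tasks=2 \
    -input /user/hadoop/input \
    -output /user/hadoop/output \
    -mapper my_mapper.py \
    -reducer my_reducer.py \
    -combiner org.example.MyCombiner

Note that, per item (7), the -combiner argument names a Java class, while the mapper and reducer can be arbitrary executables.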
Chapter 2: MapReduce Introduction. An ideal split size is usually the size of one HDFS block. Hadoop performs best when the node executing a map task is the same node that stores its input data (the data locality optimization, which avoids transferring data over the network).
MapReduce process summary: a row of data is read from the file and processed by the map function, which returns key-value pairs; the system sorts the map results; if there are multiple values for the same key, they are grouped together and passed to the reduce function.
Now we again use a Unix pipeline to simulate the entire MapReduce process:
% cat input/ncdc/sample.txt | ch02/src/main/ruby/max_temperature_map.rb | \
  sort | ch02/src/main/ruby/max_temperature_reduce.rb
1949    111
1950    22
As you can see, this output is the same as that of the Java version. Now let us run it with Hadoop. Because the hadoop command does not support a Streaming option, you must specify the Streaming JAR file along with the jar option.
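The invocation then looks roughly like this (a sketch following the book's example; the exact Streaming JAR path depends on the Hadoop version and install location):

% hadoop jar $HADOOP_INSTALL/contrib/streaming/hadoop-*-streaming.jar \
    -input input/ncdc/sample.txt \
    -output output \
    -mapper ch02/src/main/ruby/max_temperature_map.rb \
    -reducer ch02/src/main/ruby/max_temperature_reduce.rb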
Install error: Failed to execute goal org.apache.maven.plugins:maven-antrun-plugin:1.7:run (site) on project hadoop-hdfs: An Ant BuildException has occured: input file /usr/local/hadoop-2.6.0-stable/hadoop-2.6.0-src/hadoop-hdfs-project/hadoop-hdfs/target/findbugsXml.xml
components and relationships of the Map/Reduce framework.

2.1 Overall Structure
2.1.1 Mapper and Reducer
The most basic components of a MapReduce application running on Hadoop include a Mapper class and a Reducer class, as well as a driver program that creates the JobConf; some applications also include a Combiner class, which is itself an implementation of Reducer.

2.1.2 JobTracker and TaskTracker
They are all scheduled by a single JobTracker running on the master node, while the TaskTrackers on the cluster's worker nodes execute the individual map and reduce tasks.
nodes may still be performing several more map tasks, but they also begin exchanging the intermediate outputs of the map tasks, moving them to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.
- Sort
Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before being presented to the reducer.
percentage of the buffer reserved for map output record boundaries; the rest of the cache is used to save the data itself.
• io.sort.spill.percent (default 0.80): buffer-usage threshold at which the map starts the spill operation
• io.sort.factor (default 10): maximum number of streams merged simultaneously during a merge operation
• min.num.spills.for.combine (default 3): minimum number of spill files for the combiner function to run during the merge
• mapred.compress.map.output (default false): whether to compress the map output
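For illustration, these knobs could also be set programmatically in a driver; a sketch using the classic MRv1 property names listed above (values are the defaults, except that compression is switched on):

import org.apache.hadoop.conf.Configuration;

public class SpillTuning {
    public static Configuration tuned() {
        Configuration conf = new Configuration();
        // Start spilling to disk once the sort buffer is 80% full.
        conf.setFloat("io.sort.spill.percent", 0.80f);
        // Merge at most 10 spill streams at a time.
        conf.setInt("io.sort.factor", 10);
        // Run the combiner during the merge only if at least 3 spills exist.
        conf.setInt("min.num.spills.for.combine", 3);
        // Compress map output to cut shuffle traffic (e.g., with Snappy, see above).
        conf.setBoolean("mapred.compress.map.output", true);
        return conf;
    }
}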
expensive operation, and the Combiner class can act as an optimizer to reduce the amount of data moved between tasks. The Combiner class is by no means required, but you should consider using one when you absolutely have to squeeze performance out of your MapReduce jobs.
In the last article, we built a simple MapReduce job using C#. But Hadoop is a Java-based platform, so how do we use a .NET language to program for it? One answer is Hadoop Streaming, which can run any executable as the mapper or reducer.
Therefore, we need a custom Partitioner to choose the reducer for each record according to our own requirements. Writing a custom Partitioner is simple: define a class that extends the Partitioner class and overrides its getPartition method, then specify it by calling the job's setPartitionerClass. The map results are distributed to the reducers via the partition; the mapper's results may first be sent to a combiner for merging, as sketched below.
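A minimal sketch of the pattern just described (class name and routing rule are hypothetical):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends keys beginning with a..m to reducer 0 and the rest to reducer 1,
// instead of relying on the default hash partitioning.
public class FirstLetterPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String s = key.toString();
        int bucket = (!s.isEmpty() && Character.toLowerCase(s.charAt(0)) <= 'm') ? 0 : 1;
        return bucket % numPartitions;  // stays valid even with a single reducer
    }
}

It would be registered in the driver with job.setPartitionerClass(FirstLetterPartitioner.class), together with job.setNumReduceTasks(2) so that both partitions actually exist.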