1. JVM Reuse
JVM reuse does not mean that two or more tasks of the same job run in the same JVM at the same time; it means that n tasks run in the same JVM sequentially, eliminating the cost of shutting down and restarting the JVM between tasks. The value of n is set with the mapreduce.job.jvm.numtasks property (default 1) in Hadoop's mapred-site.xml. It can also be set in a Hive session: set mapred.job.reuse.jvm.num.tasks=10; (default 1).
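A minimal sketch of the mapred-site.xml entry described above (the value 10 is an arbitrary example, matching the Hive setting):

```xml
<property>
  <name>mapreduce.job.jvm.numtasks</name>
  <!-- run up to 10 tasks sequentially in one JVM; -1 means no limit -->
  <value>10</value>
</property>
```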
A TaskTracker can run at most a fixed number of concurrent tasks, controlled by the mapreduce.tasktracker.map.tasks.maximum and mapreduce.tasktracker.reduce.tasks.maximum settings in mapred-site.xml. These are TaskTracker daemon settings, so setting them on the job-client side, e.g. via -D mapreduce.tasktracker.map.tasks.maximum=number on the command line or conf.set("mapreduce.tasktracker.map.tasks.maximum", "number") in code, is invalid and has no effect.
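A sketch of the corresponding mapred-site.xml entries on the TaskTracker node (the values 4 and 2 are illustrative, not recommendations):

```xml
<property>
  <name>mapreduce.tasktracker.map.tasks.maximum</name>
  <!-- at most 4 concurrent map tasks on this TaskTracker -->
  <value>4</value>
</property>
<property>
  <name>mapreduce.tasktracker.reduce.tasks.maximum</name>
  <!-- at most 2 concurrent reduce tasks on this TaskTracker -->
  <value>2</value>
</property>
```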
What factors affect the execution efficiency of a job?
Number of mappers: try to cut the input data into an integer multiple of the block size. If there are many small files, consider CombineFileInputFormat.
Number of reducers: for best performance, the number of reducers should be slightly smaller than the number of reducer task slots in the cluster.
Combiner use: make full use of the combine function to reduce the amount of data passed between map and reduce; the combiner runs after the map phase.
Intermediate compression: compressing the map output reduces the amount of data transferred to the reducers: conf.setCompressMapOutput(true) and conf.setMapOutputCompressorClass(GzipCodec.class).
Custom Writable: if you use a custom Writable object or a custom comparator, make sure you have implemented RawComparator.
Adjust the shuffle parameters: MapReduce's shuffle phase exposes memory-management parameters that can be tuned to compensate for poor performance.
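To illustrate the RawComparator point above: the benefit comes from comparing serialized keys directly on their bytes during the sort, instead of deserializing two key objects per comparison. A minimal self-contained sketch of that idea, using a big-endian int as a hypothetical stand-in for a custom Writable key (this is not Hadoop's API, just the underlying technique):

```java
/**
 * Sketch of the idea behind Hadoop's RawComparator: compare two
 * serialized keys directly from their byte representation, with no
 * object allocation and no deserialization step.
 */
public class RawIntComparator {
    // Compare two big-endian ints stored in byte arrays.
    public static int compare(byte[] b1, int off1, byte[] b2, int off2) {
        // Decode each int straight from its four bytes.
        int v1 = ((b1[off1] & 0xff) << 24) | ((b1[off1 + 1] & 0xff) << 16)
               | ((b1[off1 + 2] & 0xff) << 8) | (b1[off1 + 3] & 0xff);
        int v2 = ((b2[off2] & 0xff) << 24) | ((b2[off2 + 1] & 0xff) << 16)
               | ((b2[off2 + 2] & 0xff) << 8) | (b2[off2 + 3] & 0xff);
        return Integer.compare(v1, v2);
    }

    public static void main(String[] args) {
        byte[] a = java.nio.ByteBuffer.allocate(4).putInt(7).array();
        byte[] b = java.nio.ByteBuffer.allocate(4).putInt(42).array();
        System.out.println(compare(a, 0, b, 0)); // negative: 7 < 42
    }
}
```

In Hadoop itself, the same pattern is what an implementation of RawComparator's byte-level compare method provides for a custom key type, so the sort in the shuffle never has to instantiate the keys.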
Hadoop Performance Tuning