The number of reducers determines the number of intermediate or final (landed) files; the size of those files is independent of the block size.
1. What determines the reduce count
The reduce count greatly affects task execution efficiency. If it is not specified, Hive estimates the number of reducers from the following two settings:
Parameter 1: hive.exec.reducers.bytes.per.reducer (the amount of data processed by each reduce task; default 1000^3 = 1 GB)
Parameter 2: hive.exec.reducers.max (the maximum number of reducers per job; default 999)
Formula for the number of reducers:
N = min(Parameter 2, total input data size / Parameter 1)
That is, if the total reduce input (the map output) does not exceed 1 GB, only one reduce task is run;
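The estimate can be sketched in Python. This is a hedged sketch of the formula above, not Hive's actual source code; rounding the quotient up is an assumption, but it is consistent with the "over 9 GB gives 10 reducers" case that follows.

```python
import math

# Sketch of Hive's reducer estimate from the two parameters above.
# Defaults mirror hive.exec.reducers.bytes.per.reducer (1000^3 bytes)
# and hive.exec.reducers.max (999). Ceiling division is an assumption.
def estimate_reducers(total_input_bytes,
                      bytes_per_reducer=1_000_000_000,
                      max_reducers=999):
    n = math.ceil(total_input_bytes / bytes_per_reducer)
    return max(1, min(max_reducers, n))

print(estimate_reducers(500_000_000))   # input under 1 GB -> 1 reducer
print(estimate_reducers(9 * 1024**3))   # ~9.66 GB -> 10 reducers
```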
Case:
select pt,count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
File storage location: /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04
The total size is over 9 GB, so this statement uses 10 reducers.
Best practice: it is usually worth specifying the reducer count manually. Since the output of the map stage is usually much smaller than its input, resetting hive.exec.reducers.max is worthwhile even when the reducer count itself is not set. Based on Hadoop experience, you can set the reducer count to 0.95 * the number of TaskTrackers in the cluster.
2. Problems caused by too many or too few reducers
Too many reducers:
1) Many small files are generated (the final output files are written by the reducers, and the size of a reduce output file is unrelated to the configured block size); if these small files become the input of the next job, too many small files have to be merged;
2) Starting and initializing the reducers also costs time and resources;
3) There are as many output files as there are reducers;
Too few reducers:
1) Longer execution time;
2) Possible data skew (for example, 500 maps feeding only 1 reduce);
3. Adjusting the reduce count. Method 1: adjust the value of hive.exec.reducers.bytes.per.reducer;
Case:
set hive.exec.reducers.bytes.per.reducer = 500000000; (500 MB)
select pt,count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
This time there are 20 reducers.
Method 2: set mapred.reduce.tasks directly.
In Hive, its default value is -1 (Hive decides automatically);
Case:
When the result set of one job is shared by multiple downstream jobs, setting only one reducer gives that job a single output file; every downstream job that consumes it can then use only one map, which is very slow. By raising the shared job's reduce count you get multiple output files, so the downstream jobs can be processed in parallel by multiple maps. In a real environment, the reduce count needs to be set according to the actual data flow;
How do I check the data volume of a file?
hadoop fs -du <filename>
Set an appropriate reduce count based on the actual data volume;
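As a small sketch, the sizes reported by `hadoop fs -du` can be summed with a script. The sample output below is a hypothetical illustration (older Hadoop versions print one "<bytes>  <path>" line per file); the file paths are taken from the earlier case in this article.

```python
# Hypothetical output of `hadoop fs -du <dir>`: one "<bytes>  <path>" line per file.
sample_du_output = """\
3221225472  /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04/part-00000
3221225472  /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04/part-00001
3221225472  /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04/part-00002
"""

def total_bytes(du_output):
    # The first column of each line is the file size in bytes.
    return sum(int(line.split()[0]) for line in du_output.strip().splitlines())

print(total_bytes(sample_du_output))  # total input size in bytes (~9 GB here)
```

With the total in hand, you can judge whether the default 1 GB per reducer gives a reasonable reduce count or whether it should be overridden.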
When the result set of one job is shared by multiple jobs, or one job's output is referenced by several subsequent jobs, you can increase this parameter (the reduce count) to increase the number of maps in the downstream jobs.
If a job's output is not used by any subsequent job after execution, it matters little whether the reduce count is specified manually.
Case:
set mapred.reduce.tasks = 15;
select pt,count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
This time there are 15 reducers.
4. Common scenarios with only one reduce
Quite often you will find that no matter how large the data volume is, and no matter whether you have set parameters to adjust the reduce count, the task always has only one reduce task (one symptom of data skew). The common causes:
1) The input data volume is smaller than hive.exec.reducers.bytes.per.reducer;
2) The aggregation has no group by (with group by, rows are distributed to different reducers by key);
For example, select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
is written as select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';
3) order by is used (it forces a global sort of the whole job through a single reducer, which performs poorly);
5. Summary of the reduce count
When setting the reduce count, also keep these two principles in mind:
1) Use an appropriate number of reducers for large data volumes;
2) Have each reduce task process an appropriate amount of data;