1. In big data scenarios, a large volume of data is not what we fear; data skew is. Knowing how to avoid skew, and which operations tend to cause it, is therefore critical. When the data volume is large, use count(distinct) with caution, because count(distinct) is prone to skew.
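As an illustration (a sketch only, using a hypothetical table user_log with a column user_id), a common way to sidestep the skew of a bare count(distinct) is to deduplicate with group by first and count afterwards:
-- prone to skew: the distinct values are funneled through very few reduce tasks
select count(distinct user_id) from user_log;
-- common rewrite: group by spreads the deduplication across many reduce tasks,
-- and the outer count runs over a much smaller intermediate result
select count(1) from (select user_id from user_log group by user_id) t;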
2. Set a reasonable number of map and reduce tasks
Map stage optimization
mapred.min.split.size: refers to the minimum split unit size of the data; the default value of min is 1B
mapred.max.split.size: refers to the maximum split unit size of the data; the default value of max is 256MB
By adjusting max, you can adjust the number of maps. Decreasing max can increase the number of maps, and increasing max can reduce the number of maps.
Note that setting the parameter mapred.map.tasks directly has no effect here.
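For instance (an illustrative value, not from the original text), lowering the maximum split size makes the input get cut into more, smaller splits and therefore more maps:
set mapred.max.split.size=64000000;  -- ~64M per split; since this is smaller than the 128M block size, more maps are launched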
Examples:
a) Suppose the input directory contains a single file a of size 780M. Hadoop will split it into 7 blocks (six 128M blocks and one 12M block), producing 7 maps.
b) Suppose the input directory contains 3 files a, b, and c of sizes 10M, 20M, and 130M. Hadoop will split them into 4 blocks (10M, 20M, 128M, 2M), producing 4 maps.
In other words, a file larger than the block size (128M) is split, while a file smaller than the block size is treated as a single block.
This is really the small-file problem: if a task has many small files (far smaller than the 128M block size), each small file is also treated as a block and handled by its own map task,
and the time a map task spends starting up and initializing is far longer than its actual processing time, which wastes a lot of resources.
Moreover, the number of maps that can run simultaneously is limited. So the next question: if every map processes a block close to 128M, can you sit back and relax?
Not necessarily. Take a 127M file that would normally be handled by a single map: if it has only one or two small fields but tens of millions of records,
and the map-side logic is at all complex, doing all of it with one map task is certainly time-consuming.
How do we solve this?
We take two complementary approaches: reducing the number of maps and increasing the number of maps.
Reduce the number of maps
Assume an SQL task of the form: select ... from popt_tbaccountcopy_mes where pt = '2012-07-04';
The task's input directory, /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04,
contains 194 files, many of them far smaller than 128M, with a total size of about 9G. Normal execution uses 194 map tasks.
Total computing resources consumed by Map: SLOTS_MILLIS_MAPS = 623,020
I use the following settings to merge small files before the map stage runs, which reduces the number of maps:
set mapred.max.split.size=100000000;
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
Executing the same statement now uses 74 map tasks, and the computing resources consumed by the map stage drop to SLOTS_MILLIS_MAPS = 333,500.
For such a simple SQL task the execution time is probably similar, but roughly half of the computing resources are saved.
Roughly speaking, 100000000 is about 100M. Setting hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat tells Hive to merge small files before execution, and the first three parameters determine the size of the merged splits: anything larger than the 128M block size is split at 128M; pieces smaller than 128M but larger than 100M are split at 100M; and whatever is left under 100M (the original small files plus the leftovers from splitting the large files) is merged together, finally producing 74 splits.
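To make the rule concrete, here is a small illustration with hypothetical file sizes, following the splitting rule just described:
a 250M file -> one 128M split + one 100M split + a 22M remainder
the 22M remainder and any files under 100M -> merged together into combined splits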
Increase the number of maps
How do we increase the number of maps appropriately?
When the input files are large, the task logic is complex, and the map stage runs very slowly, consider increasing the number of maps
to reduce the amount of data each map processes and thereby improve execution efficiency.
Suppose there is such a task:
Select data_desc,
count(1),
count(distinct id),
sum(case when …),
sum(case when ...),
sum(…)
from a group by data_desc
If table a has only a single 120M file, but that file contains tens of millions of records, completing the task with one map
is bound to be time-consuming. In this case we should consider splitting the file into several pieces,
so that multiple map tasks can work on it.
set mapred.reduce.tasks=10;
create table a_1 as
select * from a
distribute by rand(123);
This randomly distributes the records of table a into a table a_1 made up of 10 files. Using a_1 in place of a in the SQL above,
the task is then completed with 10 map tasks,
each processing around 12M of data (several million records), which is considerably more efficient.
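The earlier aggregation would then read from a_1 instead of a; a trimmed-down sketch (the sum(case when ...) expressions from the original query are omitted here):
select data_desc,
       count(1),
       count(distinct id)
  from a_1
 group by data_desc;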
These two techniques may seem contradictory: one merges small files, while the other splits a large file into smaller ones.
The point to focus on is the same in both cases:
make each map task handle an appropriate amount of data.
Reduce stage optimization
The number of reduce tasks has a large influence on the performance of the whole job. If it is set too high, many small files are generated,
which puts some pressure on the NameNode,
and the overall job runtime may not even decrease; if it is set too low, each reduce task has to process more data,
which can easily lead to OOM errors.
If the mapred.reduce.tasks/mapreduce.job.reduces parameter is set, Hive uses its value directly as the number of reduce tasks;
if it is not set (i.e., it is -1), Hive
estimates the number of reduce tasks from the size of the input files.
This estimate may not be very accurate, because the input to the reduce stage is the output of the map stage, and the map output may be smaller than the original input;
the most accurate estimate would therefore be based on the map output.
1. How Hive determines the number of reduce tasks:
The number of reduce tasks greatly affects task execution efficiency. When it is not specified explicitly,
Hive estimates it based on the following two settings:
hive.exec.reducers.bytes.per.reducer (the amount of data processed by each reduce task, the default is 1000^3=1G)
hive.exec.reducers.max (the maximum number of reduce tasks for a job, default is 999)
The formula for the number of reducers is simple: N = min(parameter 2, total input data size / parameter 1).
That is, if the total input to the reduce stage (i.e., the map output) does not exceed 1G, there will be only one reduce task.
For example: select pt,count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
/group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04 has a total size of just over 9G,
so this query uses 10 reduce tasks.
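Plugging the numbers into the formula above (assuming the default 1G per reducer and the default maximum of 999):
N = min(999, ceil(9G+ / 1G)) = min(999, 10) = 10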
2. Adjusting the number of reduce tasks, method one:
adjust the value of the hive.exec.reducers.bytes.per.reducer parameter;
set hive.exec.reducers.bytes.per.reducer=500000000; (500M)
select pt,count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; this time there are 20 reduce tasks
3. Adjusting the number of reduce tasks, method two:
set mapred.reduce.tasks = 15;
select pt,count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; this time there are 15 reduce tasks
4. More reduce tasks is not necessarily better
Like map tasks, starting and initializing reduce tasks consumes time and resources;
in addition, a job produces as many output files as it has reducers, so if many small files are generated and then fed to the next task as input,
the small-file problem appears all over again.
5. When is there only one reduce task?
You will often find that no matter how large the data volume is, and no matter what parameters you set to adjust the number of reduce tasks,
the task always ends up with a single reduce task.
Besides the case where the data volume is smaller than hive.exec.reducers.bytes.per.reducer, the following situations also cause this:
Aggregation without group by: for example, select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
written instead as select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';
This is very common; try to rewrite such queries whenever possible.
Order by (a possible workaround is sketched below)
Cartesian product
In these cases I usually have no good solution other than working around and avoiding them, because these operations are global,
so Hadoop has to complete them with a single reduce task.
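For the order by case, when a total global ordering is not strictly required, one common workaround (a sketch reusing table a and its columns from the earlier example) is to replace order by with distribute by plus sort by, so that several reduce tasks can each sort their own share of the data:
-- global sort: forces everything through a single reduce task
select id, data_desc from a order by id;
-- per-reducer sort: each reduce task sorts only the partition of data it receives
set mapred.reduce.tasks=10;
select id, data_desc from a distribute by data_desc sort by id;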
Similarly, when setting the number of reducers, the same two principles apply: use enough reduce tasks for large data volumes,
and let each reduce task process an appropriate amount of data.
Merge small files
We know that a large number of small files can easily become a bottleneck on the storage side, putting pressure on HDFS and hurting processing efficiency.
This effect can be eliminated by merging the Map and Reduce output files.
The parameters used to set the merge properties are:
Whether to merge Map output files: hive.merge.mapfiles=true (default is true)
Whether to merge Reduce output files: hive.merge.mapredfiles=false (default is false)
The size of the merged file: hive.merge.size.per.task=256*1000*1000 (default value is 256000000)
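For example, to also merge the files produced at the end of a map-reduce job (a short sketch using only the parameters listed above):
set hive.merge.mapfiles=true;           -- merge small files produced by map-only jobs (the default)
set hive.merge.mapredfiles=true;        -- also merge small files produced by map-reduce jobs
set hive.merge.size.per.task=256000000; -- target size (~256M) of the merged files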