First, controlling the number of map tasks in a Hive job:
1. Typically, a job spawns one or more map tasks based on its input directory.
The main determinants are: the total number of input files, the size of the input files, and the file block size configured for the cluster (currently 128M; it can be viewed in Hive with the command set dfs.block.size; and cannot be modified from within Hive);
2. For example:
a) Assuming the input directory contains one file a of size 780M, Hadoop splits file a into 7 blocks (six 128M blocks and one 12M block), producing 7 map tasks.
b) Assuming the input directory contains 3 files a, b, c of sizes 10M, 20M, 130M, Hadoop splits them into 4 blocks (10M, 20M, 128M, 2M), producing 4 map tasks.
That is, if a file is larger than the block size (128M) it is split, and if it is smaller than the block size it is treated as a single block.
3. Is it true that the more map tasks, the better?
The answer is no. If a task has many small files (much smaller than the 128M block size), each small file is treated as a block and handled by its own map task, and the time spent starting and initializing a map task is far greater than its actual processing time, which wastes a great deal of resources. Moreover, the number of map tasks that can run concurrently is limited.
4. Is it enough to make sure that each map handles a file block close to 128M?
Not necessarily. For example, a 127M file would normally be handled by a single map, but if this file has only one or two small columns yet tens of millions of records, and the map processing logic is fairly complex, doing it with a single map task is certainly time-consuming.
For questions 3 and 4 above, we need two approaches: reducing the number of maps and increasing the number of maps.
How to merge small files to reduce the number of maps:
Suppose a SQL task:
select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';
The task's inputdir is /group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04,
which contains 194 files in total, many of them far smaller than 128M, with a combined size of 9G. Executed normally, this would use 194 map tasks.
Total compute resources consumed by the maps: SLOTS_MILLIS_MAPS = 623,020
The number of maps can be reduced by merging small files before the map phase, using the following settings:
set mapred.max.split.size=100000000;
set mapred.min.split.size.per.node=100000000;
set mapred.min.split.size.per.rack=100000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
Then executing the same statement uses 74 map tasks, and the maps consume SLOTS_MILLIS_MAPS = 333,500 of compute resources.
For this simple SQL task, the execution time is probably about the same, but roughly half of the compute resources are saved.
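As a rough check on that claim, using the two counter values above:
333,500 / 623,020 ≈ 0.54, i.e. a little over half of the original map slot-milliseconds.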
Roughly speaking, 100000000 means 100M. set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; tells Hive to merge small files before execution,
and the three preceding parameters determine the size of the merged splits: pieces larger than the 128M block size are split off at 128M; pieces smaller than 128M but larger than 100M are split off at 100M; and whatever is smaller than 100M (including the small files and the leftover tails of the split large files)
is merged together, yielding 74 splits in total.
How to reasonably increase the number of maps:
When the input files are large, the task logic is complex, and map execution is slow, you can consider increasing the number of maps so that each map processes less data, thereby improving the efficiency of task execution.
Suppose there is a task like this:
select data_desc,
       count(1),
       count(distinct id),
       sum(case when ...),
       sum(case when ...),
       sum(...)
from a group by data_desc;
If table a has only one file, 120M in size, but it contains tens of millions of records, completing this task with a single map is certainly time-consuming. In this case, we should consider splitting this one file reasonably into several,
so that multiple map tasks can be used.
set mapred.reduce.tasks=10;
create table a_1 as
select * from a
distribute by rand(123);
This randomly scatters the records of table a into table a_1, which contains 10 files; then, replacing table a with a_1 in the SQL above, the job will use 10 map tasks.
Each map task then handles a bit more than 12M of data (several million records), which is certainly much more efficient.
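For clarity, a sketch of what the rewritten query might look like once a_1 replaces a (the elided sum expressions are left as in the original):
select data_desc,
       count(1),
       count(distinct id),
       sum(case when ...),
       sum(case when ...),
       sum(...)
from a_1 group by data_desc;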
These two approaches may look contradictory: one merges small files, while the other splits a large file into smaller ones. This is exactly the point to pay attention to:
according to the actual situation, controlling the number of maps should follow two principles: give a large volume of data an appropriate number of maps, and give each individual map task an appropriate amount of data to process.
Second, controlling the number of reduce tasks in a Hive job:
1. How Hive determines the number of reduces on its own:
The reduce count setting greatly affects task execution efficiency. If the number of reduces is not specified, Hive estimates one based on the following two settings:
hive.exec.reducers.bytes.per.reducer (the amount of data processed per reduce task, default 1000^3 = 1G)
hive.exec.reducers.max (the maximum number of reduces per task, default 999)
The formula for the number of reducers is simply N = min(parameter 2, total input data size / parameter 1).
That is, if the total input to the reduce phase (the map output) does not exceed 1G, there will be only one reduce task.
For example: select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt;
/group/p_sdo_data/p_sdo_data_etl/pt/popt_tbaccountcopy_mes/pt=2012-07-04 has a total size of 9G, so this statement uses 10 reduces.
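A quick sanity check with the defaults, assuming 9G here means 9 x 1024^3 ≈ 9.66 x 10^9 bytes and that the result is rounded up:
N = min(999, ceil(9.66e9 / 1e9)) = min(999, 10) = 10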
2. Method one for adjusting the number of reduces:
adjust the value of the hive.exec.reducers.bytes.per.reducer parameter:
set hive.exec.reducers.bytes.per.reducer=500000000; (500M)
select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; this time there are 20 reduces.
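The same rough calculation as before, under the same assumption about what 9G means:
N = min(999, ceil(9.66e9 / 5e8)) = min(999, 20) = 20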
3. Method two for adjusting the number of reduces:
set mapred.reduce.tasks=15;
select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; this time there are 15 reduces.
4. More reduces is not necessarily better;
as with maps, starting and initializing reduces also consumes time and resources;
in addition, there will be as many output files as there are reduces. If many small files are generated and they become the input of the next task, you run into the too-many-small-files problem again.
5. Under what circumstances is there only one reduce?
Very often you will find that, no matter how much data the task has and no matter whether you have set parameters to adjust the number of reduces, the task has always had just one reduce task. Besides the case where the input data volume is smaller than the hive.exec.reducers.bytes.per.reducer value, there are these other reasons:
a) The aggregation has no group by, e.g. writing select pt, count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04' group by pt; as select count(1) from popt_tbaccountcopy_mes where pt = '2012-07-04';
This is very common, and I hope you will try to rewrite such queries (one possible rewrite is sketched after this list).
b) order by is used;
c) there is a Cartesian product;
In these cases, other than finding a way to work around and avoid them, I have no good solution for now, because these operations are all global, so Hadoop has to complete them with a single reduce;
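For case a), a sketch of one way such a count might be rewritten so that the heavy aggregation runs with a group by; the outer sum wrapper is my own addition and is not spelled out above:
select sum(cnt)
from (
  select pt, count(1) as cnt
  from popt_tbaccountcopy_mes
  where pt = '2012-07-04'
  group by pt
) t;
The inner query does the per-key counting, and the cheap outer sum just collapses the partial counts into a single total.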
Similarly, these two principles need to be considered when setting the number of reduces: give a large volume of data an appropriate number of reduces, and give each individual reduce task an appropriate amount of data to process.
Pending study:
The number of maps is usually determined by the DFS block size of the Hadoop cluster, that is, by the total number of blocks in the input files. The normal degree of map parallelism is roughly 10 to 100 per node; for jobs that consume little CPU, the number of maps can be set to around 300. However, since every Hadoop task takes time to initialize, it is reasonable for each map to run for at least one minute.
The actual input splitting works like this: by default, InputFormat splits the input according to the cluster's DFS block size, and each split is processed by one map task. The user can still customize this with the mapred.min.split.size parameter in the job-submission client. Another important parameter is mapred.map.tasks; setting it is only a hint at the number of maps, and it takes effect only when the number of map tasks determined by InputFormat is smaller than the mapred.map.tasks value. Similarly, the number of map tasks can be set manually with JobConf's conf.setNumMapTasks(int num) method; this can be used to increase the number of map tasks, but it cannot set the count lower than the value Hadoop derives from splitting the input data. Of course, to improve cluster concurrency, you can configure a default number of maps; when the user's map count is small or smaller than the automatically derived split count, the larger default value can be used, thereby improving the efficiency of the Hadoop cluster as a whole.
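A minimal sketch of what tuning these knobs from a Hive session might look like; the concrete values are illustrative assumptions, not recommendations from the text above:
-- lower bound on the split size used when the job is submitted
set mapred.min.split.size=128000000;
-- a hint for the number of maps; per the note above, it only takes effect when InputFormat would otherwise produce fewer splits
set mapred.map.tasks=300;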