Most Hive learners and trainers approach performance optimization from the angle of syntax tricks and parameter tuning, rather than from a fundamental rethinking of Hive performance. The reasons for this are:
1. Historical reasons and stereotypes: most people first learn SQL on a single-node database, where performance tuning mainly means SQL syntax and parameter adjustment;
2. Hive's core performance problems are usually produced by extremely large datasets, for example 10-billion-record datasets with thousands of Hive jobs processed per day;
Point 2 above is what this part on Hive performance tuning aims to resolve thoroughly;
To genuinely and significantly solve Hive's real performance problems in real enterprises, we must consider what limits Hive performance. In order of priority:
The first and most important thing: strategic architecture
Solve the problem of overly frequent I/O on massive data; this involves table structure, data reuse, and the way tables are partitioned.
Supplement: 1. In massive data, some data is used at high frequency and some is rarely used; if they can be separated into different tables, efficiency improves greatly. Many jobs may have common ground: compute the shared intermediate results first and retain them so subsequent jobs can reuse them; likewise, basic underlying aggregations can be computed first so that upper-layer applications use the results directly instead of recomputing them every time;
2. Reasonable use of static and dynamic partitioned tables avoids full scans of the data and makes more sensible use of compute resources (see the sketch after this list);
3. A one-stop solution to data skew;
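A minimal sketch of point 2, assuming a hypothetical raw table web_logs(user_id, url, dt); a date-partitioned copy lets queries read only the partitions they actually need:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

create table web_logs_p (user_id string, url string)
partitioned by (dt string);

-- dynamic partition insert: each distinct dt value becomes its own partition
insert overwrite table web_logs_p partition (dt)
select user_id, url, dt from web_logs;

-- the partition filter prunes the scan to a single partition
select count(*) from web_logs_p where dt = '2016-01-01';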
The second most important thing: the engine and the physical level; much of this is unknown even to everyday Hive users!
Optimize from the perspective of Hive syntax and job internals, which requires being very proficient in MapReduce and in how Hive is translated into MapReduce;
The third most important thing: some key parameters;
In the final analysis, Hive performance optimization is mainly about how to use CPU, memory, and I/O as fully and effectively as possible;
Mapper tuning behind Hive:
1. If the number of mappers is too large, a large number of small files is produced; and since each mapper runs in its own JVM, creating, initializing, and shutting down all those JVMs consumes a lot of hardware resources;
If the number of mappers is too small, concurrency is too low, the job runs too long, and the distributed hardware resources are not fully used;
2. What determines the number of mappers?
The number of input files;
The size of the input files;
The configuration parameters;
By default: for example, for an 800M file with a block size of 128M, the number of mappers is 7; 6 mappers each process 128M and 1 mapper processes 32M. Or suppose a directory contains three files of 5M, 10M, and 150M:
4 mappers are generated, processing 5M, 10M, 128M, and 22M respectively;
To reduce the number of mappers, the small files must be merged; these small files may come straight from the data source, or they may be small files produced by reducers;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
set mapred.max.split.size=256000000;
set mapred.min.split.size.per.node=128000000;
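To sketch the effect, assuming the three files from the example above land in one combined split:

-- with CombineHiveInputFormat and a 256M max split size, the three files
-- (5M + 10M + 150M = 165M <= 256M) can be packed into a single split,
-- i.e. roughly 1 mapper instead of 4 (actual packing also depends on
-- node and rack locality)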
Increasing the number of mappers is typically done by controlling the number of reducers of the previous job in the Hive SQL, for example when a join of multiple tables is broken into multiple jobs;
set mapred.map.tasks=2;
set hive.merge.mapfiles=true;
set hive.merge.mapredfiles=true;
set hive.merge.size.per.task=256000000;
For example, suppose we have 5 files of 300M each. Under the configuration above, 10 mappers are produced: 5 mappers each process 256M of data and the other 5 each process 44M. The problem is that the large mappers cause data skew among the mappers.
How to solve it: set mapred.map.tasks=6. By the MapReduce split mechanism, the data is divided among 6 mappers, each processing min(1500M / 6, 256M) = 250M;
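A minimal sketch of that scenario, using the file sizes from the example:

-- 5 files x 300M = 1500M of input in total
set mapred.max.split.size=256000000;
set mapred.map.tasks=6;
-- split size = min(1500M / 6, 256M) = 250M, giving 6 evenly loaded mappers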
Reducer tuning behind Hive:
1. If the number of reducers is too large, there will be many small files, since each reducer produces one file; if these small files are the next job's input, they will need to be merged, and again initializing and destroying the reducer JVMs consumes a lot of hardware;
If the number of reducers is too small, the reduce phase takes longer, and data skew may also appear;
2. How do we control the number of reducers?
set hive.exec.reducers.bytes.per.reducer=1000000000;
set hive.exec.reducers.max=999;
Number of reducers = min(999, total reducer input bytes / 1G);
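A worked instance under these settings: if the reducers receive about 9G of input in total, Hive picks min(999, 9G / 1G) = 9 reducers (a rough sketch; the byte count Hive uses is its own estimate of the input size).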
set mapred.reduce.tasks=10; the default is -1, which lets Hive decide automatically. If the current reducers' output is large and will be consumed by several subsequent jobs, how do we set the parameters? Generally the parameters need to be adjusted, e.g., by raising the reducer count so downstream jobs get more files to read in parallel;
Under what circumstances is there only one reducer? When you aggregate without a GROUP BY, or use ORDER BY; and of course, if the reducer's input data is smaller than the default 1G, there will also be only one reducer;
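A minimal sketch of the ORDER BY case, assuming a hypothetical table sales(region, amount); ORDER BY needs a global total order and therefore a single reducer, while SORT BY only orders within each reducer:

-- forced down to 1 reducer: a global order needs one reducer
select region, amount from sales order by amount desc;

-- many reducers: each reducer's output is sorted, but there is no global order
select region, amount from sales distribute by region sort by amount desc;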
1. What Hive fears most in distributed computation is data skew. This follows from the nature of distributed systems: a distributed system is fast precisely because work is assigned to different nodes that compute together; a badly skewed key concentrates the work on one node and defeats that;
By the way, the ability to handle data skew is one of the core competencies of Hadoop and Spark engineers;
2. Causes of data skew in Hive:
The data is distributed unevenly across the nodes;
Some keys may be especially large during a join;
A key may occur with especially high frequency during a GROUP BY;
COUNT(DISTINCT) has the potential to skew data because internally it performs a GROUP BY first;
3. For joins, we want the join keys to be well dispersed; if one key carries an especially large data volume, data skew and OOM may occur. A core point: in a small-table-join-large-table, put the small table on the left; it is the side loaded into memory in the reduce stage, which reduces the risk of OOM;
4. Large table join large table: data skew arises, for example, from NULL values in the join key. The general solution is to scatter the NULL values, for example with random numbers (see the sketch below); if the skew is serious, this approach can at least double the speed;
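A minimal sketch of scattering NULL join keys, assuming hypothetical tables logs(user_id, url) and users(user_id, name); NULL user_ids get unique random keys so they spread across reducers instead of piling onto one:

select a.url, b.name
from logs a
left outer join users b
on (case when a.user_id is null
         then concat('skew_', rand())
         else a.user_id end) = b.user_id;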
5. MapJoin: for a small table joined with a (super) large table, you can use the MapJoin hint to load the small table entirely into mapper-side memory: /*+ MAPJOIN(table_name) */;
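A minimal sketch of the hint, with hypothetical tables dim_city (small) and fact_orders (large); the join completes on the mapper side with no shuffle:

select /*+ MAPJOIN(c) */ c.city_name, o.order_id
from dim_city c
join fact_orders o on c.city_id = o.city_id;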
6. When a small table joins a (super) large table, Hive can perform the MapJoin automatically; for that, set hive.auto.convert.join=true. Hive decides from the size of the smaller table whether to MapJoin when executing the join:
set hive.mapjoin.smalltable.filesize=128000000;
set hive.mapjoin.cache.numrows=100000;
The above parameters can be adjusted according to the actual memory of the machines; they have a vital influence on performance, because a MapJoin involves no shuffle;
How much of the mapper-side JVM memory can we use for the MapJoin?
set hive.mapjoin.followby.gby.localtask.max.memory.usage=0.8;
set hive.mapjoin.localtask.max.memory.usage=0.9;
7. For GROUP BY, we can enable partial aggregation on the mapper side, with the final global aggregation done on the reducer side:
set hive.map.aggr=true;
set hive.groupby.mapaggr.checkinterval=100000;
set hive.groupby.skewindata=true internally generates two jobs: the first job scatters the skewed keys by its own algorithm, aggregates, and retains the partial results; the second job completes the full GROUP BY operation. This produces a mapper-reducer-reducer structure.
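A minimal sketch combining these settings, assuming a hypothetical table page_views(page_id) with a heavily skewed page_id:

set hive.map.aggr=true;
set hive.groupby.skewindata=true;

-- runs as two jobs: the first spreads skewed page_ids randomly across
-- reducers for partial counts, the second merges the partial counts
select page_id, count(*) as pv
from page_views
group by page_id;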
8. COUNT(DISTINCT): if one value of a field is especially frequent (for example NULL or the empty string), it is easy to produce data skew. The idea for solving it:
Filter out the skewed value, e.g. NULL, in the query statement, and add 1 to the result to account for it;
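A minimal sketch, assuming a hypothetical table logs(user_id) where the empty string is the skewed value:

-- exclude the skewed value from the distinct count, then add 1 back for it
select count(distinct user_id) + 1
from logs
where user_id <> '';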
9. Cartesian product: if a join has no ON condition, or the ON condition is invalid, a single reducer will carry out the Cartesian product operation;
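A minimal sketch of the trap, with hypothetical tables t1 and t2:

-- no ON condition: the full Cartesian product lands on one reducer; avoid this
select a.id, b.id from t1 a join t2 b;

-- with a valid equality ON condition the work is spread across reducers
select a.id, b.id from t1 a join t2 b on a.id = b.id;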