Hive query caveats and optimization tips

Source: Internet
Author: User
Tags: shuffle

Hive is a tool that parses strings conforming to SQL syntax and generates MapReduce jobs that can be executed on Hadoop. SQL written for Hive should be designed to exploit the characteristics of distributed computing as much as possible, which differs from a traditional relational database,

so we need to shed some of the habits formed while developing against relational databases.

Basic principles:

1: Filter data as early as possible: reduce the amount of data flowing through each stage, partition tables, and select only the columns that are actually needed.

select ... from A
join B on A.key = B.key
where A.userid > 10
  and B.userid < 10
  and A.dt = '20120417'
  and B.dt = '20120417';

Should be rewritten as:

select ... from (
  select ... from A
  where dt = '20120417'
    and userid > 10
) a
join (
  select ... from B
  where dt = '20120417'
    and userid < 10
) b
on a.key = b.key;

2: Match the computation of historical data to how it will be used (that is, choose optimization methods according to purpose); for historical tables, both computation and access should go through partitions.

3: Keep statements as atomic as possible; avoid packing complex logic into a single SQL statement.

You can use intermediate tables to break complex logic into steps, as sketched below.
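For example, a minimal sketch (all table and column names here are hypothetical) of staging complex logic through an intermediate table:

create table tmp_user_counts as
select userid, count(*) as cnt
from logs
where dt = '20120417'
group by userid;  -- step 1: materialize the heavy aggregation once

select u.username, t.cnt
from tmp_user_counts t
join users u on t.userid = u.userid;  -- step 2: a simple join against the staged result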

4: For join operations, be careful to put the small table on the left-hand side of the join (at TCL, many queries currently put the small table on the right-hand side).

Failure to do so can cause significant disk and memory consumption.

5: If a union all combines more than two branches, or the data volume in each branch is large, split it into multiple insert into statements; in actual testing this improved execution time by about 50%.

insert overwrite table tablename partition (dt = ...)
select ... from (
  select ... from A
  union all
  select ... from B
  union all
  select ... from C
) r
where ...;

Can be rewritten as:

insert into table tablename partition (dt = ...)
select ... from A
where ...;

insert into table tablename partition (dt = ...)
select ... from B
where ...;

insert into table tablename partition (dt = ...)
select ... from C
where ...;

6: Understand the characteristics of the data itself before writing SQL; wherever there is a join or group by, watch for data skew.

If data skew occurs, consider settings such as the following:

set hive.exec.reducers.max=200;
set mapred.reduce.tasks=...; -- increase the number of reducers
set hive.groupby.mapaggr.checkinterval=100000; -- if a group by key has more records than this, it is split; tune the value to your data volume
set hive.groupby.skewindata=true; -- set to true if skew appears during the group by
set hive.skewjoin.key=100000; -- if a join key has more records than this, it is split; tune the value to your data volume
set hive.optimize.skewjoin=true; -- set to true if skew appears during the join

1: Let the server do as much useful work as possible. For example:

(1) Make a single job do as much as possible: what one job can accomplish should not be split into two. In general, have an upstream task do a little extra work so that several downstream tasks can reuse its output. This is closely tied to model design; a good data model is particularly important.

(2) Set a reasonable number of reducers.

Too few reducers fail to exploit Hadoop's parallel computing power; too many cause a flood of small files. You know your data volume and resource situation best, so find the compromise point.

(3) Use the hive.exec.parallel parameter to control whether independent jobs within the same SQL statement may run concurrently, increasing job-level parallelism.
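A minimal sketch of enabling this for a session (the thread number caps how many jobs run at once; the value is illustrative):

set hive.exec.parallel=true;  -- allow independent stages of one query to run concurrently
set hive.exec.parallel.thread.number=8;  -- at most 8 jobs in flight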

2: Let the server do as little work as possible and take the optimal path, with minimum resource consumption as the goal.

For example:

(1) Pay attention to how joins are used.

If one of the tables is small, use a map join; otherwise use a normal reduce join. Note that Hive loads the data of the tables listed earlier in the join into memory, so putting the smaller table before the larger one consumes less memory.
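A minimal sketch of forcing a map join via the hint syntax, assuming a small dimension table dim_city and a large fact table fact_visits (both hypothetical):

select /*+ MAPJOIN(d) */ f.visitid, d.cityname
from dim_city d
join fact_visits f on d.cityid = f.cityid;  -- d is replicated to every map; no reduce phase for the join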

(2) Pay attention to the small-files problem.

There are two common ways of handling this in Hive.

The first is to use CombineFileInputFormat (Hive's CombineHiveInputFormat) to pack multiple small files into a single input split, reducing the number of map tasks:

set mapred.max.split.size=256000000;
set mapred.min.split.size.per.node=256000000;
set mapred.min.split.size.per.rack=256000000;
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;

The second is to set Hive parameters that launch an additional MR job to merge the small output files:

hive.merge.mapredfiles = false -- whether to merge reduce output files; default false
hive.merge.size.per.task = 256*1000*1000 -- target size of the merged files
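A minimal sketch of turning output merging on for a session (values are illustrative):

set hive.merge.mapfiles=true;  -- merge files produced by map-only jobs (default true)
set hive.merge.mapredfiles=true;  -- also merge reduce output files (default false)
set hive.merge.size.per.task=256000000;  -- target size of the merged files, in bytes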

(3) Pay attention to data skew.

Two approaches are common in Hive.

First, hive.groupby.skewindata=true makes Hive generate two MR jobs; the map output of the first job is randomly distributed to the reducers, which pre-aggregate it, mitigating the skew caused by some keys having far more records than others.

Second, hive.map.aggr=true (the default) runs combiner-style aggregation on the map side. If the map's rows are essentially all distinct, aggregating achieves nothing and the combiner is just overhead. Hive is fairly thoughtful here: guided by hive.groupby.mapaggr.checkinterval=100000 (the default) and hive.map.aggr.hash.min.reduction=0.5 (the default), it pre-aggregates the first 100,000 rows, and if (rows remaining after aggregation) / 100000 > 0.5, it stops aggregating.

(4) Use multi insert and union all.

Multi insert suits inserting into different tables at different processing granularities from the same source table: the source table is scanned only once; the number of jobs is unchanged, but the number of source-table scans is reduced.
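A minimal sketch of the multi-insert syntax, computing two granularities from one scan of a hypothetical source table src_logs:

from src_logs s
insert overwrite table user_stats partition (dt = '20120417')
  select s.userid, count(*) group by s.userid
insert overwrite table city_stats partition (dt = '20120417')
  select s.city, count(*) group by s.city;  -- one scan of src_logs feeds both tables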

Union all is also useful: it can reduce the number of table scans and the number of jobs. Typically, queries with different logic or different conditions are pre-combined with union all and then aggregated with a single group by. A union all over different tables behaves like multiple inputs; a union all over the same table is roughly equivalent to a single map emitting several records per pass.
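A minimal sketch of the union-all-then-group-by pattern (table names are hypothetical):

select key, sum(cnt) as total
from (
  select key, count(*) as cnt from A where dt = '20120417' group by key
  union all
  select key, count(*) as cnt from B where dt = '20120417' group by key
) t
group by key;  -- one unified aggregation over both branches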

(5) Tune parameter settings.

Cluster parameters are numerous and varied. Specific parameters can be set for a specific job, for example JVM reuse, or the number of reduce copy threads (useful for jobs whose maps finish quickly and produce large output).

If the tasks are many and each is short, for instance finishing within a minute, reduce the number of tasks to cut task-initialization overhead; enabling JVM reuse also trims per-task startup cost.
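A minimal sketch of the two knobs mentioned above, using the classic MRv1 parameter names (values are illustrative):

set mapred.job.reuse.jvm.num.tasks=8;  -- one JVM runs up to 8 tasks before exiting
set mapred.reduce.parallel.copies=10;  -- more threads fetching map output (default 5)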

-----------------------------------------------------------

I. Controlling the number of maps and reduces in Hive

An SQL query in Hive is turned into an execution plan, and the plan is executed as MapReduce; the numbers of maps and reduces affect execution efficiency and should be chosen with the data size and the cluster in mind.

Besides controlling the number of jobs Hive generates, you should also control the numbers of maps and reduces.

1. The number of maps is usually related to the split size; an earlier post, "how the numbers of maps and reduces are determined", describes this.

The default hive.input.format in Hive is org.apache.hadoop.hive.ql.io.CombineHiveInputFormat, and for CombineHiveInputFormat the number of map inputs is determined by three settings:

mapred.min.split.size.per.node -- the minimum split size on one node

mapred.min.split.size.per.rack -- the minimum split size under one switch (rack)

mapred.max.split.size -- the maximum size of one split

The main idea is to divide the large files in the input directory into multiple map inputs and to merge the small files into a single map input. Concretely there are three steps:

a. For each file in the input directory: if its length exceeds mapred.max.split.size, cut it into multiple splits in units of blocks (one split is the input of one map); each such split is at least mapred.max.split.size long and, being made of whole blocks, also longer than the block size. If the remaining tail of the file is longer than mapred.min.split.size.per.node, it becomes a split of its own; otherwise it is retained for now.

b. What remains are short fragments. Merge the fragments within each rack: whenever the accumulated length exceeds mapred.max.split.size, emit a split. Finally, if the rack's remaining fragment is longer than mapred.min.split.size.per.rack, emit it as a split; otherwise retain it for now.

c. Merge the fragments left over across racks: whenever the length exceeds mapred.max.split.size, emit a split; whatever remains, regardless of length, is merged into one final split.

Example: mapred.max.split.size=1000

mapred.min.split.size.per.node=300

mapred.min.split.size.per.rack=100

There are five files in the input directory: three under rack1 with lengths 2050, 1499 and 10, and two under rack2 with lengths 1010 and 80. The block size is 500.

After the first step, five splits are generated: 1000, 1000, 1000, 499, 1000. The remaining fragments are 50 and 10 under rack1, and 10 and 80 under rack2.

Since neither rack's fragments add up to more than 100, the splits and fragments are unchanged after the second step.

In the third step, the four fragments are merged into one split of length 150.

To reduce the number of maps, increase mapred.max.split.size; to increase the number of maps, decrease it.
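For example, a minimal sketch that roughly halves the number of maps by doubling the limits used in the example above:

set mapred.max.split.size=2000;
set mapred.min.split.size.per.node=600;
set mapred.min.split.size.per.rack=200;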

Characteristics: a block serves as input for at most one map; a file may consist of multiple blocks, and those blocks may end up in different map inputs; one map may handle multiple blocks, possibly from multiple files.

2. Number of reduces

When you run SQL in Hive, output like the following may be printed:

Number of reduce tasks not specified. Estimated from input data size: 1
In order to change the average load for a reducer (in bytes):
  set hive.exec.reducers.bytes.per.reducer=<number>
In order to limit the maximum number of reducers:
  set hive.exec.reducers.max=<number>
In order to set a constant number of reducers:
  set mapred.reduce.tasks=<number>
The number of reduces is determined by the following three parameters:

mapred.reduce.tasks (forces a specific number of reduces)

hive.exec.reducers.bytes.per.reducer (amount of data processed per reduce task; default 1000^3 = 1G)

hive.exec.reducers.max (maximum number of reduces per job; default 999)

The formula for the number of reducers is simply N = min(hive.exec.reducers.max, total input data size / hive.exec.reducers.bytes.per.reducer).
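A quick worked example under the defaults above: with 15 GB of input, N = min(999, 15G / 1G) = 15 reducers. To override the estimate entirely:

set mapred.reduce.tasks=15;  -- forces exactly 15 reducers, skipping the estimate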

Scenarios with only one reduce: a) an aggregation without group by; b) order by; c) a Cartesian product.

II. Join and group optimization

For a common join operation, rows are shuffled on the map side to a particular reduce according to the hash of the join key, and the join is done on the reduce side: the tables on the left of the join are cached in memory while the rightmost table is traversed, joining one row at a time. So when you write a join, put the table with the most data on the right side. When the data volume is large and the keys are unevenly distributed, a huge number of rows for one key are shuffled to a single reduce, and the data is skewed.
For group by operations, Hive aggregates first on the map side and then finishes on the reduce side; this is the default. The relevant parameters:

hive.map.aggr = true -- whether to aggregate on the map side; default true
hive.groupby.mapaggr.checkinterval = 100000 -- number of rows after which map-side aggregation is checked

III. Data skew

Both join and group by operations can suffer from data skew. A few common ways to address it:

1. hive.groupby.skewindata = true addresses skew on the group by key; the query plan then contains two MR jobs. In the first job, map output is randomly distributed to the reducers, each of which does a partial aggregation and emits the result, so rows with the same group by key may land on different reducers, balancing the load. The second MR job distributes the pre-aggregated results to the reducers by group by key (this time guaranteeing that identical keys reach the same reduce) and completes the final aggregation.

2. Write filter conditions into the join (filter on the map side so that only qualifying rows are output), reducing the number of rows joined.

3. Use a map join: there is no reduce step, and the join happens on the map side (each map caches all of the small table's data). Note that full/right outer joins cannot be performed this way.

4. For count(distinct), the group by field plus the counted field form the map-side key; if many identical keys exist, skew appears here too.
5. Skew also arises when a large number of join keys are NULL: all NULL keys hash to the same reduce. The best fix is to distinguish NULL keys from non-NULL keys and not perform the join on the NULL keys at all.
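A common sketch of that trick (table and column names are hypothetical): give each NULL key a random value so those rows spread across reducers instead of piling onto one; the manufactured keys match nothing on the other side, so the join result is unchanged:

select a.*, b.username
from log a
left outer join users b
  on (case when a.user_id is null
      then concat('null_', rand())  -- random, non-matching key for NULL rows
      else a.user_id end) = b.user_id;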
Of course, some Hive operations have no data-skew problem at all; aggregations such as sum and count are already partially aggregated on the map side, so relatively little data reaches the reducers.

IV. Merging small files

A large number of small files means an excessive file count, which puts pressure on HDFS and noticeably hurts Hive's processing efficiency. You can merge the files produced by map and reduce:

hive.merge.mapfiles = true -- whether to merge map output files; default true
hive.merge.mapredfiles = false -- whether to merge reduce output files; default false
hive.merge.size.per.task = 256*1000*1000 -- target size of the merged files

V. in/exists (not in/not exists)

Hive implements the in operation through a left semi join; one restriction is that the right-hand table's columns may appear only in the join condition.

VI. Partition pruning

By constraining partitions in the query conditions you limit the range of data scanned, which can greatly improve query efficiency.

VII. Sorting

order by sorts globally with only one reduce, so it is inefficient. Use sort by instead, usually combined with distribute by, in which case the distribute by key acts as the reduce partition key.
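Two minimal sketches of the last rewrites above, with hypothetical table and column names: the in-to-left-semi-join rewrite from section V, and the scalable sort from section VII:

-- section V: instead of  select * from A where A.key in (select key from B)
select A.*
from A
left semi join B on A.key = B.key;  -- B's columns may appear only in the on clause

-- section VII: global order by uses one reducer; distribute by + sort by scales out
select * from logs distribute by userid sort by ts;  -- per-reducer sort, grouped by userid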

Transferred from: http://blog.csdn.net/joe_007/article/details/8987422
