Hive (5): Hive Optimization

Last Update:2015-03-18 Source: Internet

Author: User


Hive applies different optimizations to different queries, and these optimizations can be controlled through configuration. This article introduces some of the optimization strategies and the options that control them.

Column pruning

When reading data, Hive reads only the columns that the query needs and ignores the others. For example, consider the query:

SELECT a, b FROM T WHERE e < 10;

Here T contains five columns (a, b, c, d, e). Columns c and d are ignored, and only columns a, b, and e are read.

This option defaults to true: hive.optimize.cp = true
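The effect of column pruning on the example query can be sketched in Python. This is only an illustration of the idea, not Hive's implementation; the function name `columns_to_read` and its arguments are hypothetical.

```python
# Illustrative sketch only; not how Hive implements column pruning.
ALL_COLUMNS = ["a", "b", "c", "d", "e"]  # columns of table T from the example


def columns_to_read(select_cols, predicate_cols, table_cols=ALL_COLUMNS):
    """Return the table columns a scan must read, in table order."""
    # A column is needed if it is selected or used in the WHERE clause.
    needed = set(select_cols) | set(predicate_cols)
    return [c for c in table_cols if c in needed]


# SELECT a, b FROM T WHERE e < 10;
needed = columns_to_read(select_cols=["a", "b"], predicate_cols=["e"])
print(needed)  # ['a', 'b', 'e']
```

Column e is read even though it is not in the SELECT list, because the WHERE clause refers to it.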

Partition pruning

Hive avoids reading unnecessary partitions during a query. For example, consider the following queries:

SELECT * FROM (SELECT c1, COUNT(1) FROM T GROUP BY c1) subq WHERE subq.prtn = 100;
SELECT * FROM T1 JOIN (SELECT * FROM T2) subq ON (T1.c1 = subq.c2) WHERE subq.prtn = 100;

The condition subq.prtn = 100 is taken into account inside the subquery, which reduces the number of partitions read.

This option defaults to true: hive.optimize.pruner = true
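As a rough illustration of why this helps, the Python sketch below models a table as a mapping from partition value to rows and skips partitions that the filter rules out. The data layout and the `scan` function are assumptions made for the example, not Hive internals.

```python
# Hypothetical layout: value of the partition column prtn -> rows in that partition.
partitions = {
    99: [{"c1": "x"}, {"c1": "y"}],
    100: [{"c1": "x"}, {"c1": "x"}, {"c1": "z"}],
    101: [{"c1": "y"}],
}


def scan(partitions, prtn_filter=None):
    """Return (rows, partitions_read), skipping partitions the filter rules out."""
    rows, read = [], []
    for prtn, part_rows in partitions.items():
        if prtn_filter is not None and not prtn_filter(prtn):
            continue  # pruned: this partition is never opened
        read.append(prtn)
        rows.extend(part_rows)
    return rows, read


_, read_all = scan(partitions)
rows, read_pruned = scan(partitions, prtn_filter=lambda p: p == 100)
print(read_all)     # [99, 100, 101]
print(read_pruned)  # [100]
```

With the filter applied at scan time, only one of the three partitions is read.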

Join

A rule of thumb for queries that contain a JOIN: place the table or subquery with fewer rows on the left side of the join operator. In the reduce phase of a join, the contents of the table on the left side of the join operator are loaded into memory, so putting the smaller table on the left effectively reduces the chance of out-of-memory (OOM) errors.
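The memory argument can be sketched in Python for the rows of a single join key. This is a simplified model under the assumption stated above (left side buffered, right side streamed); `reduce_side_join` is a hypothetical helper, not a Hive API.

```python
def reduce_side_join(left_rows, right_rows):
    """Join the rows of ONE key: buffer the left side, stream the right side."""
    buffered = list(left_rows)       # held in memory for the whole join
    joined = []                      # collected here only to keep the sketch short
    for right in right_rows:         # streamed one row at a time
        for left in buffered:
            joined.append((left, right))
    return joined, len(buffered)


small = ["u1", "u2"]
large = [f"pv{i}" for i in range(1000)]

out1, buf1 = reduce_side_join(small, large)  # small table on the left
out2, buf2 = reduce_side_join(large, small)  # large table on the left
print(len(out1), buf1)  # 2000 2
print(len(out2), buf2)  # 2000 1000
```

Both orders produce the same number of joined rows, but the buffer holds 2 rows in the first case and 1000 in the second.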

When a single statement contains multiple joins that use the same join key, for example:

INSERT OVERWRITE TABLE pv_users SELECT pv.pageid, u.age FROM page_view pv JOIN user u ON (pv.userid = u.userid) JOIN newuser x ON (u.userid = x.userid);

If the join key is the same, then no matter how many tables are involved, the joins are merged into a single MapReduce task rather than n tasks. The same holds for OUTER JOIN.

If the join conditions are not the same, for example:

INSERT OVERWRITE TABLE pv_users SELECT pv.pageid, u.age FROM page_view pv JOIN user u ON (pv.userid = u.userid) JOIN newuser x ON (u.age = x.age);

The number of MapReduce tasks corresponds to the number of join operations, and the query above is equivalent to the following queries:

INSERT OVERWRITE TABLE tmptable SELECT * FROM page_view pv JOIN user u ON (pv.userid = u.userid);
INSERT OVERWRITE TABLE pv_users SELECT x.pageid, x.age FROM tmptable x JOIN newuser y ON (x.age = y.age);

Map Join

A join can be completed in the map phase, with no reduce phase needed, as long as the required data is accessible during the map phase. For example, the query:

INSERT OVERWRITE TABLE pv_users SELECT /*+ MAPJOIN(pv) */ pv.pageid, u.age FROM page_view pv JOIN user u ON (pv.userid = u.userid);

This query can complete the join in the map phase, as illustrated by the figure in the original article.
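The idea behind a map-side join can also be sketched in Python: the hinted table (pv in the query above) is loaded into an in-memory hash table, and each mapper probes it while streaming the other table, so no shuffle or reduce step is needed. The sample rows and the helper names `build_hash_table` and `map_join` are assumptions for illustration, not Hive internals.

```python
from collections import defaultdict


def build_hash_table(small_table, key):
    """Index the in-memory table by join key."""
    table = defaultdict(list)
    for row in small_table:
        table[row[key]].append(row)
    return table


def map_join(streamed_rows, hash_table, key):
    """Probe the hash table for each streamed row; runs entirely in the 'mapper'."""
    for row in streamed_rows:
        for match in hash_table.get(row[key], []):
            yield {**match, **row}


# Hypothetical sample data for page_view (hinted, kept in memory) and user (streamed).
page_view = [
    {"pageid": 1, "userid": 10},
    {"pageid": 2, "userid": 11},
    {"pageid": 3, "userid": 10},
]
users = [
    {"userid": 10, "age": 25},
    {"userid": 11, "age": 32},
    {"userid": 12, "age": 40},
]

pv_hash = build_hash_table(page_view, "userid")
result = [(r["pageid"], r["age"]) for r in map_join(users, pv_hash, "userid")]
print(result)  # [(1, 25), (3, 25), (2, 32)]
```

The user with userid 12 has no page views and therefore produces no output row, as expected for an inner join.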

The related parameters are:

hive.join.emit.interval = 1000 (how many rows in the right-most join operand Hive should buffer before emitting the join result)
hive.mapjoin.size.key = 10000
hive.mapjoin.cache.numrows = 10000

Group By

Map-side partial aggregation:

Not all aggregation needs to be done on the reduce side. Many aggregations can first be partially computed on the map side, with the final result produced on the reduce side.

The parameters for hash-based aggregation include:

hive.map.aggr = true (whether to aggregate on the map side; defaults to true)
hive.groupby.mapaggr.checkinterval = 100000 (number of rows on the map side over which aggregation is performed)

Load balancing when the data is skewed:

hive.groupby.skewindata = false

When this option is set to true, the generated query plan has two MR jobs. In the first MR job, the map output is distributed randomly among the reducers; each reducer performs a partial aggregation and outputs the result. Rows with the same GROUP BY key may therefore go to different reducers, which balances the load. The second MR job then distributes the pre-aggregated results to the reducers by GROUP BY key (this guarantees that rows with the same GROUP BY key reach the same reducer) and completes the final aggregation.

Merging small files

Too many files put pressure on HDFS and hurt processing efficiency. Merging the map and reduce output files eliminates this effect:

hive.merge.mapfiles = true (whether to merge map output files; defaults to true)
hive.merge.mapredfiles = false (whether to merge reduce output files; defaults to false)
hive.merge.size.per.task = 256*1000*1000 (size of the merged files)

This article is an English version of an article originally published in Chinese on aliyun.com and is provided for information purposes only.
