Hive optimization Summary

Source: Internet
Author: User
Tags subq

1. When hive executes join memory overflow, you can modify the hive configuration file hive-site.xml, increase the memory, as shown below: mapred. Child. java. opts-xmx 1024 m

2. the path to hive Default table creation can also be configured in the hive-site.xml, as follows: hive. metaStore. warehouse. dir value>/user/hive/warehouse description> location of default database for the warehouse

3. When performing the join operation, try to put a small table in front of the table. If a large table is in front of the table, it may cause an error due to memory overflow.

4. to operate a partition table, you need to filter the partitions (for example, DS = $ yday ). Especially in the join operation, partition filtering (such as DS = $ yday) needs to be put into the on statement or subquery. It cannot be placed in the WHERE clause after the on clause. This will scan all the tables before determining the partition. That is to say, the program will first execute the join operation to execute the final where operation.

5. In the join operation, only one mapreduce operation is executed when the same field is joined consecutively. Select * from a left Outer Join B on. T = B. T left Outer Join C on. T = C. t; Recommended select * from a left Outer Join B on. T = B. T left Outer Join C on B. T = C. t; inefficient

6. when a large table and a small table are joined, mapjoin is used to read small tables into memory for join, select/* + mapjoin (a) */. c1, B. c2, B. c3 from a join B on. c4 = B. c4;

7. By setting hive. Merge. mapfiles, you can disable hive Optimization for table scanning, but it sometimes improves the efficiency. The default value is true. It can be set as needed: only select statements or mapjoin statements are recommended.

8. alter table a set serdeproperties ('serialization. null. format' = ''); The \ n string does not appear in the result table, instead of an empty string.

Column pruning)

When reading data, read only the columns required for the query, and ignore other columns. For example, for query:

SELECT a,b FROM T WHERE e < 10;

T contains five columns (A, B, C, D, E), column C, and column D are ignored, and only columns A, B, and E are read.

This option is true by default:Hive. Optimize. CP = true

Partition pruning)

Reduce Unnecessary partitions in the query process. For example, for the following queries:

SELECT * FROM (SELECT c1, COUNT(1) FROM T GROUP BY c1) subq WHERE subq.prtn = 100; SELECT * FROM T1 JOIN (SELECT * FROM T2) subq ON (T1.c1=subq.c2) WHERE subq.prtn = 100;

Subq. prtn = 100 is considered in the subquery to reduce the number of read partitions.

This option is true by default:Hive. Optimize. pruner = true

Join

When using a query statement with a join operation, you should place a table or subquery with fewer entries on the left of the join operator. The reason is that in the reduce stage of the join operation, the content of the table on the left of the join operator is loaded into the memory, and the table with fewer entries is placed on the left, which can effectively reduce the chance of OOM errors.

If multiple join conditions are the same in a statement, such as query:

INSERT OVERWRITE TABLE pv_users SELECT pv.pageid, u.age FROM page_view p JOIN user u ON (pv.userid = u.userid) JOIN newuser x ON (u.userid = x.userid);
  • If the join key is the same, no matter how many tables are there, it will be merged into a map-Reduce
  • A map-Reduce task, not 'n'
  • The same is true for outer join.

If the join conditions are different, for example:

INSERT OVERWRITE TABLE pv_users SELECT pv.pageid, u.age FROM page_view p JOIN user u ON (pv.userid = u.userid) JOIN newuser x on (u.age = x.age);

The number of Map-reduce tasks corresponds to the number of join operations. The preceding query is equivalent to the following query:

INSERT OVERWRITE TABLE tmptable SELECT * FROM page_view p JOIN user u ON (pv.userid = u.userid); INSERT OVERWRITE TABLE pv_users SELECT x.pageid, x.age FROM tmptable x JOIN newuser y ON (x.age = y.age);
Map join

The join operation is completed in the map stage and does not require reduce. The precondition is that the required data can be accessed during the map process. For example, query:

INSERT OVERWRITE TABLE pv_users SELECT /*+ MAPJOIN(pv) */ pv.pageid, u.age FROM page_view pv JOIN user u ON (pv.userid = u.userid);

Join can be completed in the map stage ,:

Related parameters:

  • Hive. Join. emit. interval = 1000.How many rows in the right-most join operand hive shocould buffer before emitting the join result.
  • Hive. mapjoin. Size. Key = 10000
  • Hive. mapjoin. cache. numrows = 10000
Group
  • Partial aggregation on map:

    • Not all aggregation operations need to be completed at the reduce end. Many aggregation operations can be partially aggregated at the map end, and finally the final result is obtained at the reduce end.
    • Based on Hash
    • Parameters include:
      • Hive. Map. aggr = trueWhether to perform aggregation on the map end. The default value is true.
      • Hive. groupby. mapaggr. checkinterval = 100000Number of entries for aggregation on the map end
  • Load Balancing when data is skewed
    • Hive. groupby. skewindata = false
    • When the option is set to true, the generated query plan has two Mr Jobs. In the first Mr job, the map output result set is randomly distributed to reduce. Each reduce performs partial aggregation and outputs the result, in this way, the same group by key may be distributed to different reduce tasks to achieve load balancing; the second Mr job is then distributed to reduce according to the pre-processed data results according to the group by key (this process can ensure that the same group by key is distributed to the same reduce ), finally, the final aggregation operation is completed.
Merge small files

An excessive number of files puts pressure on HDFS and affects processing efficiency. You can merge the map and reduce result files to eliminate this impact:

  • Hive. Merge. mapfiles = trueWhether to match and map the output file. The default value is true.
  • Hive. Merge. mapredfiles = falseWhether to merge reduce output files. The default value is false.
  • Hive. Merge. Size. Per. Task = 256*1000*1000Size of the merged File

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.