Hive Optimization Tips: How to Write HQL


I. Hive join Optimization
 
1. Try to place the small table on the left side of the join. The version we use here, hive-0.12.0, converts this automatically: the small table is loaded into memory and a map-side join is executed (with good performance). This behavior is controlled by the parameters hive.auto.convert.join = true and hive.smalltable.filesize = 25000000L (25 MB by default). If your small table is around that size, you can adjust this parameter so that a map-side join is performed instead of a reduce-side join. You can also declare the map join explicitly, which is especially suitable for joining a small table to a large table:

SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key;
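For example, the automatic conversion can be enabled and its threshold tuned like this (a minimal sketch using the parameter names given above; the 50 MB value is purely illustrative, not a recommendation):

set hive.auto.convert.join = true;   -- convert eligible joins to map-side joins

set hive.smalltable.filesize = 50000000;   -- treat tables up to ~50 MB as "small"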
 
2. Pay attention to joins on partitioned tables, for example:
 
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key) WHERE a.dt = '2014-08-07' AND b.dt = '2014-08-07';
 
Because Hive performs the join first and applies the WHERE clause afterwards, if a record in table a has no match in table b, every column from table b comes back NULL, including the dt column; the b.dt predicate then filters those rows out, and the LEFT OUTER JOIN loses its intended semantics. Solution: move the partition conditions into the ON clause:
 
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key AND a.dt = '2014-08-07' AND b.dt = '2014-08-07');
 
3. How to write an EXISTS/IN clause?
 
Hive does not support subqueries in the WHERE clause, so the EXISTS/IN clauses commonly used in SQL must be rewritten. The rewrite is relatively simple. Consider the following SQL query:
 
SELECT a.key, a.value FROM a WHERE a.key IN (SELECT b.key FROM b);
 
It can be rewritten as:
 
SELECT a.key, a.value FROM a LEFT OUTER JOIN b ON (a.key = b.key) WHERE b.key IS NOT NULL;
 
A more efficient implementation uses LEFT SEMI JOIN (note that only columns from the left table may appear in the SELECT list):
 
SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON (a.key = b.key);
 
4. Hive joins support only equality conditions (equijoins); non-equijoins are not supported.
 
5. Use map joins judiciously. The typical occasion: joining a small table A to a large table (see the MAPJOIN hint in point 1).

 
II. Reasonably set the number of map and reduce tasks
 
1. How to merge small files and reduce the number of maps?
 
If a table spawns a very large number of map tasks, it is probably because the table consists of many small files. You can merge them with the following settings:
 
set mapred.max.split.size = 100000000;   -- 100 MB

set mapred.min.split.size.per.node = 100000000;

set mapred.min.split.size.per.rack = 100000000;

set hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;   -- merge small files
 
2. How to increase the number of maps?
 
If table a consists of a single 120 MB file containing tens of millions of records, you can use multiple map tasks to process it by first spreading the data across several files:
 
set mapred.reduce.tasks = 10;

create table a_1 as

select * from a

distribute by rand(123);   -- records from table a are distributed randomly across the 10 files of a_1 (one per reducer)
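Subsequent queries can then read from a_1 instead of a, so up to ten map tasks run in parallel. A minimal usage sketch with the tables above:

select count(1) from a_1;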
 
3. How does Hive determine the number of reducers? It is derived from the following parameters:

hive.exec.reducers.bytes.per.reducer (the data volume processed by each reduce task; the default is 1000^3 = 1 GB)

hive.exec.reducers.max (the maximum number of reduce tasks; the default is 999)

The formula for the number of reducers is simple: N = min(hive.exec.reducers.max, total input size / hive.exec.reducers.bytes.per.reducer)
 
That is, if the total reduce input (the map output) does not exceed 1 GB, only one reduce task is used. To change this, adjust the following parameters:
 
set hive.exec.reducers.bytes.per.reducer = 500000000;   -- 500 MB

set mapred.reduce.tasks = 15;
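As a worked example (with a hypothetical 9 GB of map output): under the defaults, N = min(999, 9 GB / 1 GB) = 9 reducers; lowering hive.exec.reducers.bytes.per.reducer to 500 MB gives N = min(999, 9 GB / 0.5 GB) = 18. Setting mapred.reduce.tasks explicitly bypasses the formula and forces that exact number.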
 
III. How to design and use buckets
 
Bucketing computes a hash of a specified column and splits the data by that hash value. The objective is parallelism, and each bucket corresponds to one file. For example, to distribute the user column into 32 buckets, first compute the hash of the user column; rows whose hash value is 0 go to the HDFS directory /wh/pvs/dt=2014-08-01/ctry=US/part-00000, and rows whose hash value is 20 go to /wh/pvs/dt=2014-08-01/ctry=US/part-00020.
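As a quick illustration (hypothetical value, assuming Hive's default hash for integer columns, which is the value itself): a user id of 215 lands in bucket 215 mod 32 = 23, that is, file part-00023.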
 
Use case: bucket the table on a column such as the user ID, for example:
 
CREATE TABLE weblog (user_id INT, url STRING, source_ip STRING)

> PARTITIONED BY (dt STRING)

> CLUSTERED BY (user_id) INTO 96 BUCKETS;   -- after partitioning by date, rows are placed into 96 buckets by user_id

When inserting data:
 
hive> SET hive.enforce.bucketing = true;

hive> FROM raw_logs

> INSERT OVERWRITE TABLE weblog

> PARTITION (dt = '2017-02-25')

> SELECT user_id, url, source_ip WHERE dt = '2017-02-25';
 
IV. Count(distinct)
 
When COUNT(DISTINCT) has to process many records, set the following two parameters:
 
hive> set hive.map.aggr = true;

hive> set hive.groupby.skewindata = true;

hive> select count(distinct gid) from cookie_label_summary where i_date = 20130924;
 
V. Group By
 
Group By performs its aggregation on the reduce side, which raises two concerns (heavy map output traffic and data skew), addressed by the following settings:
 
Map-side aggregation (partial computation done early): hive.map.aggr = true, with the check interval set by hive.groupby.mapaggr.checkinterval
 
Balanced (skew-tolerant) processing: hive.groupby.skewindata
 
This targets data skew. When it is set to true, the query is split into two MapReduce jobs: in the first job, map output is distributed to reducers randomly, and each reducer produces partial aggregates for whatever keys it receives; the second job then distributes those intermediate results by the actual Group By key and combines them into the final result.
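Putting the pieces together, a minimal sketch (reusing the cookie_label_summary table from section IV; the checkinterval value shown is Hive's documented default):

set hive.map.aggr = true;

set hive.groupby.mapaggr.checkinterval = 100000;   -- rows processed on the map side between checks of the in-memory aggregation

set hive.groupby.skewindata = true;

select gid, count(1) from cookie_label_summary group by gid;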
 
VI. Order by, Sort by, Distribute by, Cluster by
 
1. Order by vs. Sort by: ORDER BY produces a total ordering and runs with only one reducer. When hive.mapred.mode = strict, an ORDER BY must be accompanied by a LIMIT, otherwise an error is returned. SORT BY only guarantees ordering within each reducer.
 
2. Distribute by with Sort by: DISTRIBUTE BY routes map output rows to reducers based on the specified columns, so it is commonly combined with SORT BY to send all rows with the same value of a field to the same reducer and sort them there.
 
3. CLUSTER BY is shorthand for DISTRIBUTE BY + SORT BY on the same columns.
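A minimal sketch of that equivalence (assuming a hypothetical table a with columns key and val):

select key, val from a distribute by key sort by key;   -- explicit form

select key, val from a cluster by key;   -- shorthand for the line above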
