Hive Optimization Tips: How to Write HQL


I. Hive join Optimization
 
1. Try to place the small table on the left side of the join. The version we use here, hive-0.12.0, converts this automatically: the small table is loaded into memory and a map-side join is executed (with good performance). This behavior is controlled by the parameters hive.auto.convert.join = true and hive.smalltable.filesize = 25000000L (25 MB by default). If your small table is around that size, you can adjust this parameter so that a map-side join is performed instead of a reduce-side join. You can also declare the map join explicitly, which is especially suitable for joining a small table to a large table:

SELECT /*+ MAPJOIN(b) */ a.key, a.value FROM a JOIN b ON a.key = b.key;
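For example, the automatic conversion can be enabled and its threshold tuned like this (a minimal sketch using the parameter names given above; the 50 MB value is purely illustrative, not a recommendation):

set hive.auto.convert.join = true;   -- convert eligible joins to map-side joins

set hive.smalltable.filesize = 50000000;   -- treat tables up to ~50 MB as "small"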
 
2. Pay attention to joins on partitioned tables, for example:
 
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key) WHERE a.dt = '2014-08-07' AND b.dt = '2014-08-07';
 
Because Hive performs the join first and applies the WHERE clause afterwards, if a record in table a has no match in table b, every column from table b comes back NULL, including the dt column; the b.dt predicate then filters those rows out, and the LEFT OUTER JOIN loses its intended semantics. Solution: move the partition conditions into the ON clause:
 
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key AND a.dt = '2014-08-07' AND b.dt = '2014-08-07');
 
3. How to write an EXISTS/IN clause?
 
Hive does not support subqueries in the WHERE clause, so the EXISTS/IN clauses commonly used in SQL must be rewritten. The rewrite is relatively simple. Consider the following SQL query:
 
SELECT a.key, a.value FROM a WHERE a.key IN (SELECT b.key FROM b);
 
It can be rewritten as:
 
SELECT a.key, a.value FROM a LEFT OUTER JOIN b ON (a.key = b.key) WHERE b.key IS NOT NULL;
 
A more efficient implementation uses LEFT SEMI JOIN (note that only columns from the left table may appear in the SELECT list):
 
SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON (a.key = b.key);
 
4. Hive joins support only equality conditions (equijoins); non-equijoins are not supported.
 
5. Use map joins judiciously. The typical occasion: joining a small table A to a large table (see the MAPJOIN hint in point 1).

 
II. Reasonably set the number of map and reduce tasks
 
1. How to merge small files and reduce the number of maps?
 
If a table spawns a very large number of map tasks, it is probably because the table consists of many small files. You can merge them with the following settings:
 
set mapred.max.split.size = 100000000;   -- 100 MB

set mapred.min.split.size.per.node = 100000000;

set mapred.min.split.size.per.rack = 100000000;

set hive.input.format = org.apache.hadoop.hive.ql.io.CombineHiveInputFormat;   -- merge small files
 
2. How to increase the number of maps?
 
If table a consists of a single 120 MB file containing tens of millions of records, you can use multiple map tasks to process it by first spreading the data across several files:
 
set mapred.reduce.tasks = 10;

create table a_1 as

select * from a

distribute by rand(123);   -- records from table a are distributed randomly across the 10 files of a_1 (one per reducer)
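Subsequent queries can then read from a_1 instead of a, so up to ten map tasks run in parallel. A minimal usage sketch with the tables above:

select count(1) from a_1;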
 
3. How does Hive determine the number of reducers? It is derived from the following parameters:

hive.exec.reducers.bytes.per.reducer (the data volume processed by each reduce task; the default is 1000^3 = 1 GB)

hive.exec.reducers.max (the maximum number of reduce tasks; the default is 999)

The formula for the number of reducers is simple: N = min(hive.exec.reducers.max, total input size / hive.exec.reducers.bytes.per.reducer)
 
That is, if the total reduce input (the map output) does not exceed 1 GB, only one reduce task is used. To change this, adjust the following parameters:
 
set hive.exec.reducers.bytes.per.reducer = 500000000;   -- 500 MB

set mapred.reduce.tasks = 15;
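As a worked example (with a hypothetical 9 GB of map output): under the defaults, N = min(999, 9 GB / 1 GB) = 9 reducers; lowering hive.exec.reducers.bytes.per.reducer to 500 MB gives N = min(999, 9 GB / 0.5 GB) = 18. Setting mapred.reduce.tasks explicitly bypasses the formula and forces that exact number.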
 
III. How to design and use buckets
 
Bucketing computes a hash of a specified column and splits the data by that hash value. The objective is parallelism, and each bucket corresponds to one file. For example, to distribute the user column into 32 buckets, first compute the hash of the user column; rows whose hash value is 0 go to the HDFS directory /wh/pvs/dt=2014-08-01/ctry=US/part-00000, and rows whose hash value is 20 go to /wh/pvs/dt=2014-08-01/ctry=US/part-00020.
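As a quick illustration (hypothetical value, assuming Hive's default hash for integer columns, which is the value itself): a user id of 215 lands in bucket 215 mod 32 = 23, that is, file part-00023.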
 
Use case: bucket the table on a column such as the user ID, for example:
 
CREATE TABLE weblog (user_id INT, url STRING, source_ip STRING)

> PARTITIONED BY (dt STRING)

> CLUSTERED BY (user_id) INTO 96 BUCKETS;   -- after partitioning by date, rows are placed into 96 buckets by user_id

When inserting data:
 
hive> SET hive.enforce.bucketing = true;

hive> FROM raw_logs

> INSERT OVERWRITE TABLE weblog

> PARTITION (dt = '2017-02-25')

> SELECT user_id, url, source_ip WHERE dt = '2017-02-25';
 
IV. Count(distinct)
 
When COUNT(DISTINCT) has to process many records, set the following two parameters:
 
hive> set hive.map.aggr = true;

hive> set hive.groupby.skewindata = true;

hive> select count(distinct gid) from cookie_label_summary where i_date = 20130924;
 
V. Group By
 
Group By performs its aggregation on the reduce side, which raises two concerns (heavy map output traffic and data skew), addressed by the following settings:
 
Map-side aggregation (partial computation done early): hive.map.aggr = true, with the check interval set by hive.groupby.mapaggr.checkinterval
 
Balanced (skew-tolerant) processing: hive.groupby.skewindata
 
This targets data skew. When it is set to true, the query is split into two MapReduce jobs: in the first job, map output is distributed to reducers randomly, and each reducer produces partial aggregates for whatever keys it receives; the second job then distributes those intermediate results by the actual Group By key and combines them into the final result.
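Putting the pieces together, a minimal sketch (reusing the cookie_label_summary table from section IV; the checkinterval value shown is Hive's documented default):

set hive.map.aggr = true;

set hive.groupby.mapaggr.checkinterval = 100000;   -- rows processed on the map side between checks of the in-memory aggregation

set hive.groupby.skewindata = true;

select gid, count(1) from cookie_label_summary group by gid;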
 
VI. Order by, Sort by, Distribute by, Cluster by
 
1. Order by vs. Sort by: ORDER BY produces a total ordering and runs with only one reducer. When hive.mapred.mode = strict, an ORDER BY must be accompanied by a LIMIT, otherwise an error is returned. SORT BY only guarantees ordering within each reducer.
 
2. Distribute by with Sort by: DISTRIBUTE BY routes map output rows to reducers based on the specified columns, so it is commonly combined with SORT BY to send all rows with the same value of a field to the same reducer and sort them there.
 
3. CLUSTER BY is shorthand for DISTRIBUTE BY + SORT BY on the same columns.
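A minimal sketch of that equivalence (assuming a hypothetical table a with columns key and val):

select key, val from a distribute by key sort by key;   -- explicit form

select key, val from a cluster by key;   -- shorthand for the line above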
