Five tips for writing a Hive program


Using Hive, you can write complex MapReduce query logic quickly and efficiently. In some cases, however, a Hive job can become very inefficient, or even fail to produce a result at all, because the author is unfamiliar with the characteristics of the data or does not follow Hive's optimization conventions. Writing a "good" Hive program still requires a deep understanding of how Hive executes queries.

Some of the best-known optimization conventions include writing the large table on the right side of a join, and preferring UDFs over TRANSFORM, and so on. Below are five performance- and logic-related issues that will help you write better Hive programs.

Full ordering

Hive's SORT BY is deliberately different in syntax from the ORDER BY of traditional databases, to emphasize the difference between the two: SORT BY only sorts within a single reducer. Consider the following table definition:

create table if not exists t_order (
    id int,          -- order id
    sale_id int,     -- sales id
    customer_id int, -- customer id
    product_id int,  -- product id
    amount int       -- quantity
) partitioned by (ds string);

Query all sales records in a table and sort by sales ID and quantity:

set mapred.reduce.tasks=2;

select sale_id, amount from t_order
sort by sale_id, amount;

This query may produce an unexpected ordering. The data distributed to the two specified reducers might be (each sorted within itself):

Reducer1:

sale_id | amount
0       | 100
1       | 30
1       | 50
2       | 20

Reducer2:

sale_id | amount
0       | 110
0       | 120
3       | 50
4       | 20

Because the above query has no reduce key, Hive generates a random number to use as the reduce key, so input records are distributed to the reducers at random. To guarantee that no sale_id is split across reducers, use the DISTRIBUTE BY keyword to specify sale_id as the distribution key. The revised HQL is as follows:

set mapred.reduce.tasks=2;

select sale_id, amount from t_order
distribute by sale_id
sort by sale_id, amount;

This guarantees that all records with the same sales ID go to the same reducer and are sorted correctly within it, but the sales IDs themselves are still not in global order, because Hive distributes the data with Hadoop's default HashPartitioner.

This brings us to the full-ordering problem. There are two ways to solve it:

1.) Do not distribute data and use a single reducer:

set mapred.reduce.tasks=1;

The disadvantage of this approach is that the reduce side becomes a performance bottleneck, and on large data volumes it generally fails to produce a result. In practice, however, this is still the most commonly used method, because a sorted query is usually issued to fetch the top N results, so a LIMIT clause can drastically reduce the amount of data: with LIMIT n, the number of records transferred to the (single) reduce side is cut to n × (number of map tasks).
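For example, a minimal sketch against the t_order table above (the LIMIT value of 100 is illustrative): with a single reducer, SORT BY degenerates into a total order, and each map task forwards at most 100 records to the reduce side.

set mapred.reduce.tasks=1;

select sale_id, amount from t_order
sort by sale_id, amount
limit 100;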

2.) Modify the partitioner. This method can achieve a full ordering. Here you can use Hadoop's own TotalOrderPartitioner (from Yahoo!'s TeraSort project), a partitioner that supports distributing range-partitioned, ordered data across reducers. It requires a SequenceFile that specifies the key ranges for distribution. If we have already generated this file (stored at /tmp/range_key_list, split for 100 reducers), the query can be rewritten as:

set mapred.reduce.tasks=100;
set hive.mapred.partitioner=org.apache.hadoop.mapred.lib.TotalOrderPartitioner;
set total.order.partitioner.path=/tmp/range_key_list;

select sale_id, amount from t_order
cluster by sale_id
sort by amount;

There are several ways to generate this range file (for example, Hadoop's own o.a.h.mapreduce.lib.partition.InputSampler tool). Here is a method using Hive itself, assuming a t_sale table ordered by id:

create table if not exists t_sale (
    id int,
    name string,
    loc string
);

The range file for distribution by sale_id can then be generated as follows:

create external table range_keys (sale_id int)
row format serde 'org.apache.hadoop.hive.serde2.binarysortable.BinarySortableSerDe'
stored as
    inputformat 'org.apache.hadoop.mapred.TextInputFormat'
    outputformat 'org.apache.hadoop.hive.ql.io.HiveNullValueSequenceFileOutputFormat'
location '/tmp/range_key_list';

insert overwrite table range_keys
select distinct sale_id
from t_sale tablesample (bucket $ on rand()) s
sort by sale_id;

The generated file (under the /tmp/range_key_list directory) lets TotalOrderPartitioner distribute the reduce-side data in sale_id order. The main issue to consider when building the range file is how evenly it balances the data distribution, which relies on a deep understanding of the data.

How to do a Cartesian product?

When Hive is set to strict mode (hive.mapred.mode=strict), Cartesian products are not allowed in HQL statements. This hints that Hive's support for Cartesian products is weak: because no join key can be found, Hive can only use a single reducer to complete the Cartesian product.

Of course, you can use the LIMIT approach described above to reduce the amount of data a table contributes to the join. But when the query genuinely needs Cartesian-product semantics, it is usually a large table joined against a small table, and the result is still very large (too large for a single machine to handle); in that case MAPJOIN is the best solution.

MAPJOIN, as the name suggests, performs the join on the map side. This requires that one or more of the tables in the join be small enough to be read fully into memory.

To use MAPJOIN, add the /*+ MAPJOIN(tablelist) */ hint after the SELECT keyword of the query/subquery, telling the optimizer to use a map join (the current Hive optimizer does not choose MAPJOIN automatically). Here tablelist can be one table or a comma-separated list of tables; the tables in tablelist are read into memory, so the small table(s) should be listed there.
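As a minimal sketch, assume a hypothetical dimension table t_rate(rate double) that holds only a handful of rows (and hive.mapred.mode left non-strict). The large-by-small Cartesian product can then be computed entirely on the map side:

select /*+ mapjoin(r) */ o.sale_id, o.amount * r.rate
from t_order o
join t_rate r;  -- no join condition: a Cartesian product, with t_rate held in memory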

PS: some users report that MAPJOIN can hit unknown bugs in subqueries. When a large table and a small table need a Cartesian product, the way to avoid the single-reducer Cartesian product is to introduce an artificial join key: expand the small table by a column of join-key values, replicating each of its rows once per key value, and add a join-key column to the large table filled with random values from the same range. A sketch of this trick is shown below.
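A hedged sketch of the trick, assuming hypothetical tables big_t and small_t and a replication factor of 10 (LATERAL VIEW explode() is used here only for brevity and requires a reasonably recent Hive):

select b.*, s.*
from
    (select big_t.*, cast(rand() * 10 as int) as jk   -- random key 0..9 per large-table row
     from big_t) b
join
    (select small_t.*, k as jk                        -- each small-table row copied 10 times
     from small_t
     lateral view explode(array(0,1,2,3,4,5,6,7,8,9)) kt as k) s
on b.jk = s.jk;

Every large-table row matches exactly one copy of each small-table row, so the result is still the full Cartesian product, but the join work is now spread across reducers by jk.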

How to write an EXISTS / IN clause?

Hive does not support subqueries in the WHERE clause, so SQL's common EXISTS / IN clauses need to be rewritten. The rewrite is relatively straightforward. Consider the following SQL query:

select a.key, a.value
from a
where a.key in (select b.key from b);

can be rewritten as

select a.key, a.value
from a
left outer join b on (a.key = b.key)
where b.key is not null;

A more efficient implementation is to rewrite it with LEFT SEMI JOIN:

select a.key, a.value
from a
left semi join b on (a.key = b.key);

LEFT SEMI JOIN is a feature introduced in Hive 0.5.0.

How does Hive determine the number of reducers?

In a Hadoop MapReduce job, the number of reducers greatly affects execution efficiency, which makes how Hive decides on the reducer count a key question. Unfortunately, Hive's estimation mechanism is weak: if the number of reducers is not specified, Hive guesses it based on the following two settings:

1. hive.exec.reducers.bytes.per.reducer (default 1000^3, i.e. about 1 GB)

2. hive.exec.reducers.max (default 999)

The formula for calculating the number of reducers is simple:

N = min(parameter 2, total input data size / parameter 1)

In general, you should set the number of reducers manually. Since the map stage's output is usually much smaller than its input, resetting parameter 2 is worthwhile even if you do not set the reducer count explicitly. Based on Hadoop experience, parameter 2 can be set to 0.95 × (number of TaskTrackers in the cluster); a minimal sketch follows.
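For example (the numbers are illustrative, assuming a cluster of 42 TaskTrackers):

-- either fix the reducer count for a query explicitly...
set mapred.reduce.tasks=40;

-- ...or cap Hive's guess: 0.95 * 42 TaskTrackers is roughly 40
set hive.exec.reducers.max=40;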

Merge MapReduce Operations

Multi-group by

Multi-GROUP BY is a very useful Hive feature that makes it convenient to reuse intermediate results. For example:

from (
    select a.status, b.school, b.gender
    from status_updates a
    join profiles b on (a.userid = b.userid and a.ds = '2009-03-20')
) subq1
insert overwrite table gender_summary partition (ds = '2009-03-20')
    select subq1.gender, count(1) group by subq1.gender
insert overwrite table school_summary partition (ds = '2009-03-20')
    select subq1.school, count(1) group by subq1.school;

The above statement uses the multi-GROUP BY feature to group the data twice in succession, each time with a different GROUP BY key. This saves one MapReduce job compared with running two separate queries.

Multi-distinct

Multi-distinct is another "multi-xxx" feature, developed by Taobao, that allows multiple DISTINCT aggregations in the same query/subquery, likewise reducing the number of MapReduce jobs; a sketch follows.
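A minimal sketch of what multi-distinct enables, using the t_order table defined earlier (this assumes a Hive build carrying the multi-distinct patch; stock Hive of this era rejects multiple DISTINCTs over different columns in one query):

select ds, count(distinct sale_id), count(distinct customer_id)
from t_order
group by ds;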
