Hive_6. Data aggregation--Group by & Grouping_sets & RollUp & CUBE & has

Source: Internet
Author: User
Tags spark rdd

Today I'd like to introduce you to some of the advanced operations in Hive-data aggregation. This is mainly based on the following three sections to introduce the common aggregation in hive: Advanced aggregation based on Group by for basic aggregate functions-GROUPING SETS & ROLLUP and CUBE aggregation conditions-having 1. Based on Basic aggregate functions for Group by

Data aggregation collects and expresses more information in the form of a data rollup based on specific conditions. Hive provides a number of built-in aggregate functions, such as Max, MIN, Avg, and so on. Hive also supports advanced aggregation: GROUPING SETS, ROLLUP, CUBE, parse function [analytic functions], and windowing.
Hive's basic built-in aggregate functions typically use the GROUP BY clause. If it is not specified by the GROUP BY clause, by default it aggregates the entire table. In addition to aggregate functions, all other select columns must also be included in the GROUP BY clause (except for analytic functions). Here are a few examples of using the built-in aggregate functions:
Note: For window functions & partition Table functions, refer to the SQL Windowing project: http://blog.csdn.net/mike_h/article/details/50245995 aggregates without a group by field:

<span style= "FONT-SIZE:12PX;" >jdbc:hive2://> SELECT Count (*) as row_cnt from employee;
+----------+
| row_cnt |
+----------+
| 5 |
+----------+
1 row selected (60.709 seconds</span>
To aggregate the GROUP by field:
<span style= "FONT-SIZE:12PX;" >jdbc:hive2://> SELECT Sex_age.sex, COUNT (*) as row_cnt ... .> from employee .... 
> GROUP by Sex_age.sex;
+--------------+----------+
| sex_age.sex | row_cnt
| +--------------+----------+
| Female | 2 |
| Male | 3 |
+--------------+----------+
2 rows selected (100.565 seconds) </span>

The--select field name must be included in the Group by field
<span style= "FONT-SIZE:12PX;" >jdbc:hive2://> SELECT name, Sex_age.sex, COUNT (*) as row_cnt ... 
.> from employee GROUP by Sex_age. sex;
Error:error while compiling statement:FAILED:SemanticException [Error 10025]: line 1:7 Expression don't in GROUP by key ' Name ' (state=42000,code=10025) </span>

If we have to SELECT Some fields that are not in GROUP by, we have two methods:
using [analytic functions], after introduction, completely avoiding the use of the GROUP BY clause, which will be introduced later, uses the Collect_set function, which returns a set of objects and eliminates duplicate elements

Method 2 uses the following:--statistics on the number of persons per gender based on gender and data sampling

<span style= "FONT-SIZE:12PX;" >jdbc:hive2://> SELECT sex_age.sex,
... .> collect_set (sex_age.age) [0] as Random_age, 
... .> count (*) as row_cnt ... 
.> from employee GROUP by Sex_age.sex;
+--------------+-------------+----------+
| sex_age.sex | random_age | row_cnt
| +--------------+-------------+----------+
| Female | 27 | 2 |
| Male | 35 | 3 |
+--------------+-------------+----------+
2 rows selected (48.15 seconds) </span>
In a SELECTThere can be different aggregate functions in the statement. Of course, you can also use other functions, such as nested conditional functions. Nested aggregate functions are not supported, however. For more information, refer to the following example:
To invoke multiple aggregate functions in a SELECT statement:
<span style= "FONT-SIZE:12PX;" >jdbc:hive2://> SELECT Sex_age.sex, AVG (sex_age.age) as Avg_age, 
... .> count (*) as row_cnt 
. . . . . . . > From employee GROUP by Sex_age.sex; 
+--------------+---------------------+----------+
| sex_age.sex | avg_age | row_cnt
| +--------------+---------------------+----------+
| Female | 42.0 | 2 |
| Male | 31.666666666666668 | 3 |
+--------------+---------------------+----------+
2 rows selected (98.857 seconds) </span>
Use the aggregate function with case:
<span style= "FONT-SIZE:12PX;"  >jdbc:hive2://> SELECT sum (case when sex_age.sex = ' Male ' ... 
.> THEN sex_age.age ELSE 0 end)/
. . . . . . . > count (case when sex_age.sex = ' Male ' THEN 1 
... .> ELSE NULL end) as Male_age_avg from employee;
+---------------------+
| male_age_avg |
+---------------------+
| 31.666666666666668 |
+---------------------+
1 row selected (38.415 seconds) </span>
Use aggregate functions with coalesce and IF:
<span style= "FONT-SIZE:12PX;" >jdbc:hive2://> SELECT  
... .> sum (COALESCE (sex_age.age,0)) as Age_sum,      --Returns the first Null in a set of data Value, NULL if all is null ...
.> sum (if (sex_age.sex = ' Female ', sex_age.age,0))--False Then 0
... .> as female_age_sum from employee;
+----------+---------------+
| age_sum | female_age_sum|
+----------+---------------+
| 179
| +----------+---------------+
1 row selected (42.137 seconds) </span>
Nested aggregate functions are not allowed, as follows: (The operation of the aggregate function is based on the initial granularity of the operation, similar to the Spark RDD)
<span style= "FONT-SIZE:12PX;" >jdbc:hive2://> SELECT AVG (COUNT (*)) as row_cnt ...
.> from employee;
Error:error while compiling statement:FAILED:SemanticException [Error 10128]: line 1:11 don't yet supported place for UDA F ' Count ' (state=42000,code=10128) </span>

Aggregate functions can use the DISTINCT keyword to return a unique aggregate value.

<pre name= "code" class= "plain" ><span style= "font-size:12px;" >jdbc:hive2://> SELECT count (DISTINCT sex_age.sex) as sex_uni_cnt,
... .> count (DISTINCT name) as Name_uni_cnt ... 
.> from employee;     
+--------------+---------------+
| sex_uni_cnt |  name_uni_cnt |
+--------------+---------------+
| 2            | 5             |
+--------------+---------------+
1 row selected (35.935 seconds) </span>

Note:

When we put COUNT and DISTINCT together, Hive usually ignores the reducer number setting (for example, mapred.reduce.tasks = 20) and uses only one reducer at this time. In the case of a large amount of data, a single reducer will obviously become a performance bottleneck. Of course, the compromise solution is to use subqueries:

--only single reducer are triggered throughout the process
SELECT count (distinct sex_age.sex) as sex_uni_cnt from employee;

Select COUNT (*) as sex_uni_cnt from (select distinct sex_age.sex from employee) A;

In this case, the first step uses multiple reducer to implement the DISTINCT query, allowing the data to obtain a unique value, and the mapper output to the count phase of the partition will also decrease, resulting in reducer no load pressure.

When you use Hive to process data, you may experience null aggregation fields, in which case the second row will be ignored if a single null field is included in the row. To avoid this, we can use coalesce to assign a default value to a Null field. Specifically implemented as follows:


<span style= "FONT-SIZE:12PX;"  >jdbc:hive2://> CREATE TABLE t as select *
from ... .> (select Employee_id-99 as Val1, 
...) . .> (employee_id-98) as Val2 from employee_hr ... 
.> WHERE employee_id <= ... .>.
UNION all
... .> SELECT null VAL1, 2 as Val2 from employee_hr ... 
.> WHERE employee_id = 100 ) A;
No rows affected (0.138 seconds) </span>

--Check the rows that create the table
<span style= "FONT-SIZE:12PX;" >jdbc:hive2://> SELECT * from T;
+---------+---------+
| t.val1 |  T.VAL2 |
+---------+---------+
| 1       | 2       |
| NULL    | 2       |
| | 2       | 3       |
+---------+---------+
3 rows selected (0.069 seconds) </span>

--In the Sum (VAL1+VAL2) operation, the 2nd line is ignored (NULL, 2)
<span style= "FONT-SIZE:12PX;" >jdbc:hive2://> SELECT sum (val1), sum (val1+val2) ... 
.> from T;                   
+------+------+
| _c0 |  _C1 |
+------+------+
| 3    | 8    |
+------+------+
1 row selected (57.775 seconds)

jdbc:hive2://> SELECT sum (COALESCE (val1,0)), 
... .> sum (COALESCE (val1,0) +val2) from T;
+------+------+
| _c0 |  _C1 |
+------+------+
| 3    |
+------+------+
1 row selected (69.967 seconds) </span>

The Hive.map.aggr property is used to control aggregations in the map task, and its default value is False, and if set to true, Hive will directly perform the first phase of aggregation directly in the map task, although it can improve performance but also consumes a lot of memory.

Jdbc:hive2://> SET hive.map.aggr=true;
No rows affected (0.002 seconds)

2. Advanced Polymerization--GROUPING SETS

The hive provides GROUPING SETS keyword implementations for multiple group by operations on the same dataset. In fact, GROUPING SETS integrates all the processing operations at one stage of the Job, and is obviously more efficient than the GROUP by and UNION all multistage operations. If the grouping SETS () argument is empty, it will be aggregated collectively. The following example will introduce the equivalence of grouping SETS. For a better understanding, we can interpret the GROUPING SETS as an external implementation of union all and an internal implementation of each group by in union all.

<span style= "FONT-SIZE:12PX;" >select name, work_place[0] as Main_place, COUNT (employee_id) as emp_id_cnt from employee_id GROUP by name, Work_plac
E[0] GROUPING SETS ((name, work_place[0)); 

SELECT name, work_place[0] as Main_place, COUNT (employee_id) as emp_id_cnt from employee_id GROUP by name, Work_place[0] SELECT name, work_place[0] as Main_place, COUNT (employee_id) as emp_id_cnt from employee_id GROUP by name, work_place[
0] GROUPING SETS (name, work_place[0]); | |  Select name, null as Main_place, COUNT (employee_id) as emp_id_cnt from employee_id GROUP by name UNION all SELECT NULL as

Name, Work_place[0] as Main_place, COUNT (employee_id) as emp_id_cnt from employee_id GROUP by work_place[0];  
SELECT name, work_place[0] as Main_place, COUNT (employee_id) as emp_id_cnt from employee_id GROUP by name, Work_place[0]
GROUPING SETS ((name, work_place[0), name); SELECT name, work_place[0] as Main_place, COUNT (employee_id) as emp_id_cnt from employee_id GROUP by NamE, work_place[0] UNION all SELECT name, NULL as Main_place, Count (employee_id) as emp_id_cnt from employee_id GROUP by NA

Me 
SELECT name, work_place[0] as Main_place, COUNT (employee_id) as emp_id_cnt from employee_id GROUP by name, Work_place[0]
GROUPING SETS ((name, work_place[0)), name, Work_place[0], (); 
SELECT name, work_place[0] as Main_place, COUNT (employee_id) as emp_id_cnt from employee_id GROUP by name, Work_place[0] UNION ALL SELECT name, NULL as Main_place, Count (employee_id) as emp_id_cnt from employee_id GROUP by name UNION ALL SEL
ECT NULL as name, Work_place[0] as Main_place, COUNT (employee_id) as emp_id_cnt from employee_id GROUP by work_place[0] UNION all SELECT null as name, NULL

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.