Hive Advanced Query (group BY, order BY, join, etc.)

Last Update:2015-07-19 Source: Internet

Author: User

Developer on Alibaba Coud: Build your first app with APIs, SDKs, and tutorials on the Alibaba Cloud. Read more ＞

Query operations

groupbyorderbyjoinbybybyunion all

implementation of the underlying
mapreduce

Common aggregation Operations

Count counts

count(*)    所有值不全为NULL时，加1操作count(1)    不管有没有值，只要有这条记录，值就加1count(col)  col列里面的值为null，值不会加1，这个列里面的值不为NULL，才加1

Sum sum
sum(可转成数字的值）返回bigint

AVG Averaging
avg（可转成数字的值）返回double

Distinct number of different values
count(distinct col)

Order by

Sort by some fields
Sample Example

select col1,other...from tablewhere conditioorder by col1,col2 [asc|desc]

Note
Order by can have more than one column sorted by default, sorted by dictionary
Order BY is a global sort
The order by requires the reduce operation, and only one reduce, regardless of configuration. when the amount of data is very large, use caution.

Execution Process

Reads the data from the table, executes the where condition, makes a combination key with the value of the Col1,col2 column, other column values as value, and then passes the data to the same reduce, according to the sort order required.

Group by

Grouped by the values of some fields, with the same values put together.

Sample Example

select col1 [,col2] ,count(1),sel_expr（聚合操作）from tablewhere condition         -->Map端执行group by col1 [,col2]   -->Reduce端执行[having]                -->Reduce端执行

Note
A non-aggregate column after select must appear in GROUP by
In addition to the normal column in the Select is some aggregation operations
Group by can be followed by an expression, such as substr (COL)

features
Use the reduce operation, limited by the number of reduce , set the reduce parameter mapred.reduce.tasks
The output file is the same as the number of reduce, and the file size is related to the amount of data that reduce.

problem
Heavy Network Load
Data skew, the optimization parameter hive.groupby.skewindata is true, an optimizer is launched to avoid data skew.

Execution Process

The data is read from the table, the Where condition is executed, the col1 column is grouped, the content of the Col column is the key, the other column values are uploaded to reduce as value, and the aggregation operation and the having filter are performed on the reduce side.

Setmapred.reduce.tasks=5;Select* fromTabOrderOrder  byChASCNumdesc;Setmapred.reduce.tasks=3;SelectCH, COUNT (1) asNum fromTabOrderGroup  byChSetHive.groupby.skewindata =true;SelectCH, COUNT (1) asNum fromTabOrderGroup  byCh Having count (1) >2;SelectCol fromTableNameGroup  byCol <==>Select distinctCol fromTableName

Join table Join

Two table m,n is connected by an on condition, and one record in M and one record in n form a new record.

Join equivalent join (inner join), when only one value exists simultaneously in M and N.

left outer joinLeft outer joins, the values in the left table are output regardless of whether they exist in B or not, and the values in the right table are output only if they exist in the left table.

right outer joinand the left outer join opposite.

left semi joinSimilar exists . That is, find the data in table A, whether it exists in table B, and find out the existing data.

Mapjoin: completes the join operation on the map side, does not need to use reduce, the memory-based join, belongs to the optimization operation.

select  M.col as  col1, m.col2 as  col2, n.col3 as  col3 from  (select    Col1,col2 from , Test where  ... (map side Execution)) m (left table) [left  outer  | Right  outer  | left  semi] join  N (right table) on  m.col=n.colwhere  condition (reduced side execution) Span class= "Hljs-keyword" >set  hive.optimize.skewjoin=true ;

Read the data execution where condition, grouped by Col column, the contents of Col column as key, other columns as value, upload to reduce, perform connection operation and where filtering on the reduce side.

 Create table m (ch string,num string)row format delimited fields terminated  by c6> ' \ t ' lines  terminated by ' \ n ' stored as textfile;   load data local inpath '/liguodong/hivedata/m '  into table m; Create table N (ch string,num string)row format delimited fields terminated  by c5> ' \ t ' lines  terminated by ' \ n ' stored as textfile;   load data local inpath '/liguodong/hivedata/n '  into table n; Select *  from M; Select *  from N;Internal connection Select s.ch,s.num,t.num from (select ch,num  from M) sjoin(Select ch,num  from N) t on s.ch=t.ch;Left outer connection Select s.ch,s.num,t.num from (select ch,num  from M) s left outer Join (Select ch,num  from N) t on s.ch=t.ch;  Right outer connection Select s.ch,s.num,t.num from (select ch,num  from M) s Right outer  Join(select ch,num  from N) t on s.ch=t.ch;

Data Output comparison

select s.ch,s.num from(select ch,num from m)sleft semi join(select ch,num from n)ton s.ch=t.ch;运行结果：A       1C       5C       3

Mapjoin

Mapjoin (map side join)
Load the small table into memory on the map side, then read the large table, and the small table in memory to complete the connection operation. The distributed cache technology is used.

Advantages
Reduce resource consumption of the cluster (reduce is relatively scarce).
Reduces reduce operations and accelerates program execution.
Reduce Network load.

Disadvantages
Takes up part of the memory, so the tables that are loaded into memory cannot be too large because each compute node is loaded once.
Generate more small files.

Execution Process

Reads data from a large table and executes a where condition. The small table is loaded into memory, and every data read in a large table is compared to the small table data in memory.

The first way, the automatic way
Configure the following parameters
hive** automatically * * Based on SQL, choose to use common join or map join

set hive.auto.convert.join=true;hive.mapjoin.smalltable.filesize默认值是25mb

The second way, manually specifying

/*+mapjoin(n)*/ m.col, m.col2, n.col3 from mjoin n on m.col=n.col;

Note: /*+mapjoin(n)*/ You cannot omit it, just replace the table name n value.

Briefly summarize theusage scenarios for the map join :
1, the association operation has a table very small
2, not equivalent to the link operation

select c.city,p.province from(select province,city from city)cjoin(select province from province)pon c.province=p.province;mapjoin手动方式select /*+mapjoin(p)*/ c.city,p.province from(select province,city from city)cjoin(select province from province)pon c.province=p.province;

Compare the comparison time between the two.

Hive Bucket Join
For each table or partition, hive can be further organized into buckets, which means that buckets are more granular data range divisions.
Hive is a bucket for a column.
Hive uses a hash of the column values, divided by the number of buckets, to determine which bucket the record is stored in.
Benefits
Get higher query processing efficiency.
Make sampling (sampling) more efficient.

create table bucketed_user(id  int,name  string)clustered by (id) sorted by (name) into 4 bucketsrow format delimited fields terminated by ‘\t‘ stored as textfile;set hive.enforce.bucketing=true;

Use of sub-barrels

selectfrom1outof2on id)

Bucket Join

set  hive.optimize  .bucketmapjoin  =true  set  hive.optimize  .bucketmapjoin  .sortedmerge  =true set  hive.input  .format  =org.apache  .hadoop  .hive   .io

Connect two tables that divide buckets on the same column (including connection columns), and can be implemented efficiently using map-side joins (map side joins). such as join operations.
For join operations, two tables have an identical column if both tables are stabbed. Then the bucket with the same column values will be saved as a JOIN operation, which can greatly reduce the amount of data in the join.
For a map-side connection, two tables divide the buckets in the same way. Processing the mapper of a bucket in the left table knows that the matching rows in the right table are within the corresponding bucket. Therefore, mapper only needs to get the bucket (which is only a small portion of the data stored in the right table) to connect.
This optimization method does not necessarily require that two tables must be the same number of buckets, the number of buckets in two tables is a multiple relationship can also.

distribute by, sort by

Distribute Scatter data
Distribute by col– scattered data to different reduce according to Col column.

Sort sorts
Sort by col– sorting data according to Col column

select col1,col2 from Mdistribute by col1sort by col1 asc,col2 desc

The combination of both ensures that the output of each reduce is orderly.

Distribute by vs group by
Data is divided by key value
Both use the reduce operation
* * The only difference is that **distribute by simply disperses data, and group by brings together data from the same key, and subsequent must be an aggregation operation.

order by vs. sort by
Order BY is a global sort
Sort by simply ensures that the data output on each reduce is orderly. If there is only one reduce, the order by function is the same.

Execution Process

Reads the data from the table and executes the where condition.
Set the reduce number to 3, take the value of the distribute by column as the key, the other column values as value, and then pass the data to different reduce according to the key value, and then sort by field.

Application Scenarios
File size of map output is not uniform
Reduce output file size is uneven
Too many small files
File is super large

Put a large file in some small file Set mapred.reduce.tasks=5;-The following city will be output to five files Insert overwrite table cityselsct time, country,province,city  from Infodistribute  by province;Put some small files in a large file Set mapred.reduce.tasks=1;--The following province will be output to a large file Insert overwrite table province partition (dt=' 20150719 ') SELSCT time , Country,province  from city distribute  by country;Note: Province is a partitioned table.

Cluster by

Gather data with the same values and sort them together.
Effect equivalent to distribute by Col order by Col
Cluster by Col <==> distribute by Col order by Col

Union All

Data from multiple tables is combined into a single table, and hive does not support union

select col from(select a as col from t1union allselect b as col from t2)tmp

Execution Process

Reads the data from the table and executes the where condition. Merge into the same table.

UNION ALL must meet the following requirements
Field names are the same
Same as field type
The same number of fields
Child tables cannot have aliases
If you need to query data from a merged table, the merged table must have an alias

select * from (select * from munion allselect * from n)temp;如果两张表的字段名不一样，要将一个表修改别名同另一个表的字段名一样。select * from (select col1,col2 from munion allselect col1,col3 as col2 from n)temp;

Hive advanced Query (group BY, order BY, join, etc.)

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

Get Started for Free

Sales Support

1 on 1 presale consultation

Chat Contact Sales
After-Sales Support

24/7 Technical Support 6 Free Tickets per Quarter Faster Response

Open a Ticket
Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.

Learn More

Hive Advanced Query (group BY, order BY, join, etc.)

Contact Us

A Free Trial That Lets You Build Big!

Sales Support

After-Sales Support