viii. Query Statement select for hive
In all database systems, the SELECT statement is the most used, but also the most complex piece, the query in hive Select support syntax is certainly more complex, this article only try to introduce. 8.1 Basic Query Syntax
The Select base syntax in hive is basically consistent with the standard SQL syntax, which supports where, DISTINCT, GROUP by, ORDER by, have, LIMIT, subquery, and so on;
The syntax is as follows:
[With Commontableexpression (, commontableexpression) *] SELECT [All | DISTINCT] select_expr, select_expr, ... From Table_reference [WHERE where_condition] [GROUP by col_list] [CLUSTER by col_list | [Distribute by col_list] [SORT by Col_list]] [LIMIT number]
Here are some of the more specific query syntax in hive. 8.2 ORDER by and sort by
The order by is used for global ordering, which is to globally sort all the specified sort keys, using the ORDER BY query statement, and finally using a reduce task to complete the global ordering.
Sort by is used for intra-partition sorting, which is the sort within each reduce task.
Look at the following example:
The original table data is:
Hive> select * from lxw1234_com; OK 5 3 6 2 9 8 1
Using ORDER BY
Hive> SELECT * from lxw1234_com the order by ID; 1 2 3 5 6 8 9
Use Sort by
Hive> set mapred.reduce.tasks=2;
Hive> SELECT * from lxw1234_com sort by ID;
2
5
6
9
1
3
8
Set 2 reduce, from the results can be seen, each reduce within a sort.
If the reduce number is 1, then the result of order by and sort by is the same:
Hive> set Mapred.reduce.tasks=1; Hive> SELECT * from lxw1234_com sort by ID; 1 2 3 5 6 8 9
In a real business environment, most of our requirements need to be done using order by global ordering. 8.3 Distribute by and cluster by
Distribute by: Divides the data by the specified field or expression and outputs it to the corresponding reduce or file.
Cluster by: In addition to the ability to combine distribute by, it also features sort by sorting.
# #distribute by
Hive> set mapred.reduce.tasks=2; Hive>insert Overwrite LOCAL directory '/tmp/lxw1234/' SELECT ID from the lxw1234_com distribute by ID;
After execution, the 000000_0 and 000001_0 two files are generated in the local/tmp/lxw1234 directory:
Cat 000000_0
8
2
6
Cat 000001_0 # #id%2=1 Records
1
9
3
5
# #cluster by
Hive> set mapred.reduce.tasks=2; hive> INSERT Overwrite LOCAL directory '/tmp/lxw1234/' SELECT ID from the lxw1234_com CLUSTER by ID; Cat 000000_0 2 6 8 Cat 000001_0 1 3 5 9
Note: After using cluster by, the IDs in each file are sorted, and distribute by IS not. 8.4 Sub-query
Subqueries are basically consistent with subquery syntax and usage in standard SQL, and it is important to note that if a select query is made from a subquery in hive, the subquery must set an alias.
Select col from (select A+b as col from T1) T2
Also, starting with Hive0.13, subqueries are supported in the WHERE clause, such as:
SELECT * from A WHERE a.a in (select Foo from B); Select A from T1 where EXISTS (select B from T2 where T1. X = T2. Y
There is also a syntax for querying subqueries as a table, called Common table Expression (CTE):
With Q1 as (SELECT * from src where key= ' 5 '), Q2 as (select * from src s2 where key = ' 4 ') SELECT *