009-hadoop Hive SQL Syntax 4-DQL operations: Data Query SQL

Source: Internet
Author: User
Tags joins

1. Basic Select Operation


SELECT [ALL | DISTINCT] select_expr, select_expr, ...
FROM table_reference
[WHERE where_condition]
[GROUP BY col_list [HAVING condition]]
[CLUSTER BY col_list
  | [DISTRIBUTE BY col_list] [SORT BY | ORDER BY col_list]
]
[LIMIT number]
• Use the ALL and DISTINCT options to control how duplicate records are handled. The default is ALL, which returns every record; DISTINCT removes duplicate records
WHERE conditions
• The WHERE condition works much as it does in traditional SQL
• AND and OR are currently supported; version 0.9 adds BETWEEN, IN, and NOT IN
• EXISTS and NOT EXISTS are not supported
ORDER BY versus SORT BY
• ORDER BY performs a global sort and therefore uses only one reduce task
• SORT BY sorts only locally, within each reducer
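
The difference between a local and a global sort can be sketched with a toy model in Python. This is not Hive code: the `sort_by` and `order_by` functions are hypothetical, and the round-robin distribution of rows to reducers is an assumption made only for illustration.

```python
# Toy model: rows are spread across reducers, and each reducer sorts its share.
# The round-robin split below is an illustrative assumption, not Hive's behavior.

def sort_by(rows, num_reducers):
    """Each reducer sorts only its own share; the outputs are concatenated."""
    shares = [rows[i::num_reducers] for i in range(num_reducers)]
    return [value for share in shares for value in sorted(share)]

def order_by(rows):
    """A single reducer sees every row, so the result is globally sorted."""
    return sorted(rows)

amounts = [30, 10, 50, 20, 40, 60]
print(sort_by(amounts, num_reducers=2))  # sorted within each reducer only
print(order_by(amounts))                 # globally sorted
print(sort_by(amounts, num_reducers=1) == order_by(amounts))
```

With a single reducer the two results coincide, which is why a local sort with one reduce task yields a total order.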

Limit

• LIMIT restricts the number of records returned
SELECT * FROM t1 LIMIT 5
• LIMIT can be used to implement top-K queries
• The following statements query the 5 sales reps with the largest sales amounts. Setting the number of reduce tasks to 1 makes the SORT BY result a global ordering.
SET mapred.reduce.tasks = 1
SELECT * FROM test SORT BY amount DESC LIMIT 5
Regex Column Specification
The SELECT statement can use a regular expression to select columns. The following statement queries all columns except ds and hr:
SELECT `(ds|hr)?+.+` FROM test


For example, to query with a filter condition:
hive> SELECT a.foo FROM invites a WHERE a.ds='<DATE>';


To write query results to an HDFS directory:
hive> INSERT OVERWRITE DIRECTORY '/tmp/hdfs_out' SELECT a.* FROM invites a WHERE a.ds='<DATE>';


To write query results to a local directory:
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;


More examples that write query results to tables, HDFS directories, and local directories:
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a;
hive> INSERT OVERWRITE TABLE events SELECT a.* FROM profiles a WHERE a.key <;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/reg_3' SELECT a.* FROM events a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_4' SELECT a.invites, a.pokes FROM profiles a;
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT COUNT(1) FROM invites a WHERE a.ds='<DATE>';
hive> INSERT OVERWRITE DIRECTORY '/tmp/reg_5' SELECT a.foo, a.bar FROM invites a;
hive> INSERT OVERWRITE LOCAL DIRECTORY '/tmp/sum' SELECT SUM(a.pc) FROM pc1 a;


Insert the aggregated results of one table into another table:
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, COUNT(1) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, COUNT(1) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
JOIN
hive> FROM pokes t1 JOIN invites t2 ON (t1.bar = t2.bar) INSERT OVERWRITE TABLE events SELECT t1.bar, t1.foo, t2.foo;


Insert data from one table into multiple tables and directories (multi-table insert):
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key <
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= AND src.key <
INSERT OVERWRITE TABLE dest3 PARTITION (ds='2008-04-08', hr= ') SELECT src.key WHERE src.key >= AND src.key <
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >=;
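
The idea behind a statement of this shape, one scan of src feeding several destinations, can be sketched in plain Python. This is only a model: `multi_insert` is a hypothetical helper, and the key ranges used below are made-up values, not the bounds of the statement above.

```python
def multi_insert(src_rows, routes):
    """Scan src_rows once and append each row to every destination it matches."""
    outputs = {name: [] for name in routes}
    for row in src_rows:                          # single pass over the source
        for name, predicate in routes.items():
            if predicate(row):
                outputs[name].append(row)
    return outputs

src = [{"key": 5, "value": "a"}, {"key": 15, "value": "b"}, {"key": 25, "value": "c"}]

# Hypothetical key ranges, chosen only for this illustration.
routes = {
    "dest1": lambda r: r["key"] < 10,
    "dest2": lambda r: 10 <= r["key"] < 20,
    "dest3": lambda r: r["key"] >= 20,
}

result = multi_insert(src, routes)
for name, rows in result.items():
    print(name, rows)
```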


Stream rows through an external script with TRANSFORM:
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT TRANSFORM(a.foo, a.bar) AS (oof, rab) USING '/bin/cat' WHERE a.ds > '2008-08-09';
This streams the data in the map phase through the script /bin/cat (like Hadoop streaming). Similarly, streaming can be used on the reduce side (see the Hive Tutorial for examples).
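
A script other than /bin/cat could be used in the same position. The following is a minimal sketch of such a script in Python, assuming the default behavior in which rows arrive on standard input as tab-separated lines and are read back from standard output in the same form. The function names and the field-reversing logic are hypothetical and chosen only for illustration.

```python
import io
import sys

def transform_line(line):
    """Reverse the characters of each tab-separated field in one input row."""
    fields = line.rstrip("\n").split("\t")
    return "\t".join(field[::-1] for field in fields)

def main(stdin=sys.stdin, stdout=sys.stdout):
    # One row per line comes in on stdin; transformed rows go out on stdout.
    for line in stdin:
        stdout.write(transform_line(line) + "\n")

# Demonstration with in-memory streams; a real script would simply call main().
demo_in = io.StringIO("1\tabc\n2\txyz\n")
demo_out = io.StringIO()
main(demo_in, demo_out)
print(demo_out.getvalue(), end="")
```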




2. Partition-based Queries

• A general SELECT query scans the entire table. If the table was created with the PARTITIONED BY clause, a query can take advantage of partition pruning (input pruning) and scan only the partitions it needs
• In Hive's current implementation, partition pruning is enabled only if the partition predicate appears in the WHERE clause closest to the FROM clause
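
The effect of partition pruning can be illustrated with a toy model in Python. The table contents, the `ds` values, and the `query` function are all made up for this sketch; it only shows that a predicate on the partition column lets the reader skip whole partitions.

```python
# Toy model of a partitioned table: one list of rows per partition value.
table = {
    "2008-08-08": [{"foo": 1}, {"foo": 2}],
    "2008-08-09": [{"foo": 3}],
    "2008-08-10": [{"foo": 4}, {"foo": 5}],
}

def query(table, ds=None):
    """Return (rows, number of partitions read)."""
    if ds is not None:
        # Predicate on the partition column: read only the matching partition.
        partitions = [ds] if ds in table else []
    else:
        # No partition predicate: every partition must be scanned.
        partitions = list(table)
    rows = [row for p in partitions for row in table[p]]
    return rows, len(partitions)

print(query(table, ds="2008-08-09"))  # one partition read
print(query(table))                   # full table scan
```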

3. Join

Syntax
join_table:
    table_reference JOIN table_factor [join_condition]
  | table_reference {LEFT|RIGHT|FULL} [OUTER] JOIN table_reference join_condition
  | table_reference LEFT SEMI JOIN table_reference join_condition


table_reference:
    table_factor
  | join_table


table_factor:
    tbl_name [alias]
  | table_subquery alias
  | ( table_references )


join_condition:
    ON equality_expression ( AND equality_expression )*


equality_expression:
    expression = expression
Hive only supports equality joins, outer joins, and left semi joins. Hive does not support non-equality join conditions because they are very difficult to convert to map/reduce jobs

• The LEFT, RIGHT, and FULL OUTER keywords control how records with no match on the other side of the join are handled
• LEFT SEMI JOIN is a more efficient implementation of IN/EXISTS subqueries
• In a join, each map/reduce task works as follows: the reducer caches the rows of every table in the join sequence except the last, then streams the last table through, serializing the results to the file system
• In practice, the largest table should therefore be written last
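
The buffering behavior described above can be sketched in Python. This is a simplified model, not Hive's implementation: `reduce_side_join` is a hypothetical function that holds one input in memory and reads the other row by row, which is why the input that is streamed can be arbitrarily large.

```python
from collections import defaultdict

def reduce_side_join(buffered, streamed):
    """Join two sequences of (key, value) pairs on key.

    `buffered` is held in memory (standing in for every table but the last);
    `streamed` is consumed one row at a time (standing in for the last table).
    """
    cache = defaultdict(list)
    for key, value in buffered:
        cache[key].append(value)
    for key, value in streamed:          # rows from `streamed` are never stored
        for cached_value in cache.get(key, []):
            yield key, cached_value, value

small = [(1, "a"), (2, "b")]
large = ((k % 3, f"row{k}") for k in range(6))   # a generator, read lazily
joined = list(reduce_side_join(small, large))
print(joined)
```

Memory use in this model grows with the size of `small` only, which mirrors the advice to put the largest table last.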


There are several key points to note when writing a join query

• Only equality joins are supported
SELECT a.* FROM a JOIN b ON (a.id = b.id)
SELECT a.* FROM a JOIN b
ON (a.id = b.id AND a.department = b.department)
• You can join more than 2 tables, such as
SELECT a.val, b.val, c.val FROM a JOIN b
ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

• If every join in the query uses the same join key for a table, the joins are converted to a single map/reduce job. The query above uses b.key1 in the first join and b.key2 in the second, so it is converted to two jobs
LEFT, RIGHT, and FULL OUTER


Example
SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key=b.key)

• To restrict the output of a join, write the filter in the WHERE clause; to restrict the rows that take part in the join, write it in the ON clause
• This is easy to get wrong when the tables are partitioned
SELECT c.val, d.val FROM c LEFT OUTER JOIN d ON (c.key=d.key)
WHERE c.ds='2009-07-07' AND d.ds='2009-07-07'
• If no matching record for a row of c is found in d, all columns of d in the join output are NULL, including the ds column. The WHERE clause therefore filters out every row for which no d record matched the c join key. In this case the reference to d in the WHERE clause makes the LEFT OUTER irrelevant: the result is the same as that of an inner join
• Solution
SELECT c.val, d.val FROM c LEFT OUTER JOIN d
ON (c.key=d.key AND d.ds='2009-07-07' AND c.ds='2009-07-07')
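
The same pitfall can be reproduced outside Hive. The sketch below uses Python's built-in sqlite3 module as a stand-in, since this behavior of outer joins is standard SQL; the table contents are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE c (key INTEGER, val TEXT, ds TEXT);
    CREATE TABLE d (key INTEGER, val TEXT, ds TEXT);
    INSERT INTO c VALUES (1, 'c1', '2009-07-07'), (2, 'c2', '2009-07-07');
    INSERT INTO d VALUES (1, 'd1', '2009-07-07');
""")

# Filter on d in the WHERE clause: the unmatched row of c has d.ds = NULL
# and is dropped, so the outer join behaves like an inner join.
where_filter = conn.execute("""
    SELECT c.val, d.val FROM c LEFT OUTER JOIN d ON (c.key = d.key)
    WHERE c.ds = '2009-07-07' AND d.ds = '2009-07-07'
    ORDER BY c.val
""").fetchall()

# Filter in the ON clause: the unmatched row of c is kept, padded with NULL.
on_filter = conn.execute("""
    SELECT c.val, d.val FROM c LEFT OUTER JOIN d
    ON (c.key = d.key AND d.ds = '2009-07-07' AND c.ds = '2009-07-07')
    ORDER BY c.val
""").fetchall()

print(where_filter)
print(on_filter)
```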


LEFT SEMI JOIN
The restriction of LEFT SEMI JOIN is that the table on the right-hand side of the join may only be referenced in the ON clause, not in the WHERE clause, the SELECT clause, or elsewhere

SELECT a.key, a.value
FROM a
WHERE a.key IN
(SELECT b.key
FROM b);
can be rewritten as:
SELECT a.key, a.value
FROM a LEFT SEMI JOIN b ON (a.key = b.key)
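
The semantics shared by the IN subquery and the semi join can be modeled in a few lines of Python. Both functions below are hypothetical illustrations: a semi join returns each left-hand row at most once, whereas an ordinary inner join repeats it for every matching right-hand row.

```python
def left_semi_join(left, right_keys):
    """Keep each left row whose key appears in right_keys, at most once."""
    wanted = set(right_keys)
    return [row for row in left if row[0] in wanted]

def inner_join(left, right_keys):
    """Emit a left row once per matching right key (duplicates multiply)."""
    return [row for row in left for k in right_keys if row[0] == k]

a = [(1, "x"), (2, "y"), (3, "z")]
b_keys = [1, 1, 3]          # key 1 appears twice on the right-hand side

print(left_semi_join(a, b_keys))  # each matching row of a exactly once
print(inner_join(a, b_keys))      # row with key 1 appears twice
```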


UNION ALL
• Merges the results of multiple SELECT statements; the columns returned by each SELECT must be consistent

select_statement UNION ALL select_statement UNION ALL select_statement ...
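
As a small illustration, the sketch below runs a UNION ALL with Python's built-in sqlite3 module as a stand-in for Hive; the tables t1 and t2 and their contents are made up. Both SELECT statements return the same columns, and duplicate rows are kept.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t1 (key INTEGER, value TEXT);
    CREATE TABLE t2 (key INTEGER, value TEXT);
    INSERT INTO t1 VALUES (1, 'a'), (2, 'b');
    INSERT INTO t2 VALUES (2, 'b'), (3, 'c');
""")

# UNION ALL concatenates the two result sets without removing duplicates.
rows = conn.execute("""
    SELECT key, value FROM t1
    UNION ALL
    SELECT key, value FROM t2
    ORDER BY key
""").fetchall()

print(rows)
```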
