Hive RegEx insert join group CLI

Source: Internet
Author: User
1. Insert
In an insert statement, the FROM clause can be placed either after the SELECT clause or before the INSERT clause. The following two statements are equivalent:
hive> FROM invites a INSERT OVERWRITE TABLE events SELECT a.bar, count(*) WHERE a.foo > 0 GROUP BY a.bar;
hive> INSERT OVERWRITE TABLE events SELECT a.bar, count(*) FROM invites a WHERE a.foo > 0 GROUP BY a.bar;
2. Export to a local file
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/local_out' SELECT a.* FROM pokes a;
A single source can be inserted into multiple target tables or target files at the same time. A multi-target insert can be written as a single statement:
FROM src
INSERT OVERWRITE TABLE dest1 SELECT src.* WHERE src.key < 100
INSERT OVERWRITE TABLE dest2 SELECT src.key, src.value WHERE src.key >= 100 AND src.key < 200
INSERT OVERWRITE TABLE dest3 PARTITION (ds = '1970-08-22', hr = '12') SELECT src.key WHERE src.key >= 200 AND src.key < 300
INSERT OVERWRITE LOCAL DIRECTORY '/tmp/dest4.out' SELECT src.value WHERE src.key >= 300;
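To illustrate what a multi-target insert does conceptually, here is a minimal Python sketch (not Hive code). It scans a source once and routes each row to every destination whose predicate matches. The function name multi_insert, the sample rows, the key ranges, and the in-memory lists standing in for tables and files are all assumptions made for illustration.

```python
def multi_insert(src_rows, targets):
    """Scan src_rows once; append each projected row to every target whose predicate holds.

    targets maps a destination name to a (predicate, projection) pair.
    """
    out = {name: [] for name in targets}
    for row in src_rows:  # single pass over the source
        for name, (pred, proj) in targets.items():
            if pred(row):
                out[name].append(proj(row))
    return out


# Hypothetical sample data standing in for the src table.
src = [{"key": k, "value": f"val_{k}"} for k in (50, 150, 250, 350)]

result = multi_insert(src, {
    "dest1": (lambda r: r["key"] < 100, lambda r: (r["key"], r["value"])),
    "dest2": (lambda r: 100 <= r["key"] < 200, lambda r: (r["key"], r["value"])),
    "dest3": (lambda r: 200 <= r["key"] < 300, lambda r: (r["key"],)),
    "dest4.out": (lambda r: r["key"] >= 300, lambda r: (r["value"],)),
})
print(result)
```

The point of the sketch is that the source is read only once, however many destinations there are.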
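Run a script (two methods):
$HIVE_HOME/bin/hive -f /home/my/hive-script.sql
$HIVE_HOME/bin/hive -i /home/my/hive-init.sql
The -f option executes the script file; the -i option runs the file as an initialization script before entering the interactive shell.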
3. Hive CLI
hive> set i=32;
hive> set i;
hive> SELECT a.* FROM xiaojun a;
hive> !ls;
hive> dfs -ls;
Example (a value stored with set is referenced in a query through variable substitution):
hive> set i='2017.61.99.14.128160791368.5';
hive> SELECT count(*) FROM c02_clickstat_fatdt1 WHERE cookie_id = ${hiveconf:i};
4. RegEx Column
The SELECT statement can use a regular expression to choose columns. The following statement queries all columns except ds and hr:
SELECT `(ds|hr)?+.+` FROM sales
In Hive 0.13.0 and later, this requires hive.support.quoted.identifiers to be set to none.
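To see which column names such a pattern keeps, here is a small Python sketch. It assumes that the regular expression must match the whole column name. Because Python versions before 3.11 lack the possessive quantifier ?+, the sketch uses a negative lookahead as an assumed equivalent. The column list and the helper name select_columns are hypothetical.

```python
import re

# Lookahead rewrite of (ds|hr)?+.+ ; assumed equivalent for whole-name matching.
PATTERN = re.compile(r"(?!(?:ds|hr)$).+")


def select_columns(columns, pattern=PATTERN):
    """Return the column names that the pattern matches in full."""
    return [c for c in columns if pattern.fullmatch(c)]


# Hypothetical column list for the sales table.
columns = ["id", "amount", "ds", "hr", "dst", "hrs"]
selected = select_columns(columns)
print(selected)  # ['id', 'amount', 'dst', 'hrs']
```

Only the exact names ds and hr are excluded; names that merely start with those letters, such as dst, are still selected.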
 
5. Sort by syntax:
The sort order depends on the column type. Numeric columns are sorted in numeric order; string columns are sorted in lexicographic order.
colOrder: (ASC | DESC)
sortBy: SORT BY colName colOrder? (',' colName colOrder?)*
query: SELECT expression (',' expression)* FROM src sortBy
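The difference between numeric and lexicographic order is easy to overlook when numbers are stored in a string column. The Python sketch below uses made-up key values to show how the same values sort under each column type; it only illustrates the ordering rule and is not Hive code.

```python
keys = [10, 9, 100, 2]

numeric_order = sorted(keys)                  # numeric column, ASC
string_order = sorted(str(k) for k in keys)   # same values stored as strings, ASC
descending = sorted(keys, reverse=True)       # numeric column, DESC

print(numeric_order)  # [2, 9, 10, 100]
print(string_order)   # ['10', '100', '2', '9']
print(descending)     # [100, 10, 9, 2]
```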
6. Group
Advanced features:
The output of an aggregation can be sent to multiple tables, or even to Hadoop DFS files (which can then be manipulated with HDFS utilities). For example, we can count distinct users by gender and, in the same statement, count distinct users by age:
FROM pv_users
INSERT OVERWRITE TABLE pv_gender_sum
SELECT pv_users.gender, count(DISTINCT pv_users.userid)
GROUP BY pv_users.gender
INSERT OVERWRITE DIRECTORY '/user/facebook/tmp/pv_age_sum'
SELECT pv_users.age, count(DISTINCT pv_users.userid)
GROUP BY pv_users.age;
 
hive.map.aggr controls how aggregation is performed. The default value is true. When it is true, Hive performs the first-level aggregation directly in the map task. This usually provides better efficiency, but may require more memory to run successfully.
set hive.map.aggr=true;
SELECT count(*) FROM table2;
PS: this setting may improve efficiency in specific scenarios, but it can also turn out much slower than simply setting it to false.
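The effect of map-side aggregation can be sketched in plain Python. This is a conceptual model only, not Hive's implementation: the function names and the two input splits are assumptions. With partial aggregation each map task emits one pair per distinct key, so fewer records reach the reducer, at the cost of holding a table of counts in the map task's memory.

```python
from collections import Counter


def map_with_aggr(split):
    """Map task with partial aggregation: emit one (key, count) pair per distinct key."""
    return list(Counter(split).items())


def map_without_aggr(split):
    """Map task without aggregation: emit one (key, 1) pair per input row."""
    return [(key, 1) for key in split]


def reduce_counts(pairs):
    """Reduce step: sum the counts received for each key."""
    totals = Counter()
    for key, count in pairs:
        totals[key] += count
    return dict(totals)


# Two hypothetical input splits, each handled by one map task.
splits = [["a", "b", "a", "a"], ["b", "b", "c"]]

shuffled_aggr = [p for s in splits for p in map_with_aggr(s)]
shuffled_plain = [p for s in splits for p in map_without_aggr(s)]

print(len(shuffled_aggr), len(shuffled_plain))  # 4 7
print(reduce_counts(shuffled_aggr) == reduce_counts(shuffled_plain))  # True
```

Both paths produce the same totals; only the number of records shuffled to the reducer differs.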
7. Join
Hive supports only equality joins, outer joins, and left/right joins. Hive does not support non-equality joins, because they are difficult to convert to map/reduce tasks. In addition, Hive supports joining more than two tables.
For example:
SELECT a.* FROM a JOIN b ON (a.id = b.id)
SELECT a.* FROM a JOIN b
ON (a.id = b.id AND a.department = b.department)
are both valid; however:
SELECT a.* FROM a JOIN b ON (a.id <> b.id)
is not allowed.
a. More than two tables can be joined in one query.
For example:
SELECT a.val, b.val, c.val FROM a JOIN b
ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
If every join in the query uses the same join key, the query is converted to a single map/reduce task. For example:
SELECT a.val, b.val, c.val FROM a JOIN b
ON (a.key = b.key1) JOIN c
ON (c.key = b.key1)
is converted to a single map/reduce task, because only b.key1 is used as the join key.
SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1)
JOIN c ON (c.key = b.key2)
This join is converted into two map/reduce tasks, because b.key1 is used in the first join condition and b.key2 in the second.
b. Logic of each map/reduce task during a join:
The reducer caches the records of every table in the join sequence except the last one, and streams the last table through while serializing the results to the file system. This implementation helps reduce memory usage on the reduce side. In practice, the largest table should be written last (otherwise a large amount of memory is wasted on caching). For example:
SELECT a.val, b.val, c.val FROM a
JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)
All tables use the same join key (one map/reduce task). The reduce side caches the records of tables a and b, and computes the join result each time a record of table c is read. Similarly, for:
SELECT a.val, b.val, c.val FROM a
JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)
two map/reduce tasks are used. The first task caches table a and streams table b; the second task caches the result of the first task and streams table c.
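The buffering strategy described in this section can be sketched in Python. This is a simplified model of one reducer call for a single join key, not Hive's actual code; the function name reduce_join and the sample rows are assumptions. All tables except the last are held in memory, while the last table is consumed one row at a time.

```python
from itertools import product


def reduce_join(buffered_tables, streamed_rows):
    """Join rows that share one key: buffer all tables but the last, stream the last.

    buffered_tables: list of row lists, one per table except the last (held in memory).
    streamed_rows:   iterable of rows from the last table (read one at a time).
    """
    for last_row in streamed_rows:  # the last table is never materialized as a whole
        for combo in product(*buffered_tables):
            yield (*combo, last_row)


# Hypothetical rows for a single join key, as one reducer call would see them.
a_rows = ["a1", "a2"]
b_rows = ["b1"]
c_rows = iter(["c1", "c2", "c3"])  # the largest table goes last, so it is streamed

joined = list(reduce_join([a_rows, b_rows], c_rows))
print(len(joined))  # 6
print(joined[0])    # ('a1', 'b1', 'c1')
```

Memory use grows with the size of the buffered tables only, which is why placing the largest table last keeps the reducer's footprint small.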
c. The LEFT, RIGHT, and FULL OUTER keywords control how join records without a match are handled.
For example:
SELECT a.val, b.val FROM a LEFT OUTER
JOIN b ON (a.key = b.key)
Each record in table a produces an output record. The output is a.val, b.val when a b.key equals a.key, and a.val, NULL when no matching b.key exists. The phrase "FROM a LEFT OUTER JOIN b" should be read as a single line: table a is to the left of table b, so all records of table a are retained; "a RIGHT OUTER JOIN b" retains all records of table b. The outer join semantics follow the standard SQL specification.
A join is evaluated before the WHERE clause. To restrict the output of a join, put the condition in the WHERE clause; otherwise, put it in the join clause. Partitioned tables are a common source of confusion here:
SELECT a.val, b.val FROM a
LEFT OUTER JOIN b ON (a.key = b.key)
WHERE a.ds = '2017-08-22' AND b.ds = '2017-08-22'
This joins table a to table b with an outer join and lists a.val and b.val; other columns can be used as filter conditions in the WHERE clause. However, as described above, when no record in table b matches a record in table a, every column of b is NULL in the output, including the ds column. The WHERE clause therefore filters out all join output records for which table b had no matching key, and the LEFT OUTER part of the join becomes irrelevant. The solution is to move the conditions into the join clause:
SELECT a.val, b.val FROM a LEFT OUTER JOIN b
ON (a.key = b.key AND
b.ds = '2017-08-22' AND
a.ds = '2017-08-22')
The rows are filtered in the join stage, so the problem above does not occur. The same logic applies to RIGHT and FULL joins.
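The difference between filtering in WHERE and filtering in the join clause can be reproduced with any engine that follows standard outer join semantics. The sketch below uses Python's built-in sqlite3 module as a stand-in for Hive, which is an assumption about equivalent behaviour; the table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (key INTEGER, val TEXT, ds TEXT);
    CREATE TABLE b (key INTEGER, val TEXT, ds TEXT);
    INSERT INTO a VALUES (1, 'a1', '2017-08-22'), (2, 'a2', '2017-08-22');
    INSERT INTO b VALUES (1, 'b1', '2017-08-22');
""")

# Filter on b.ds in WHERE: the unmatched row of a has b.ds = NULL and is dropped.
where_rows = conn.execute("""
    SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key)
    WHERE a.ds = '2017-08-22' AND b.ds = '2017-08-22'
    ORDER BY a.key
""").fetchall()

# Same filters inside ON: every row of a is kept.
on_rows = conn.execute("""
    SELECT a.val, b.val FROM a LEFT OUTER JOIN b
    ON (a.key = b.key AND b.ds = '2017-08-22' AND a.ds = '2017-08-22')
    ORDER BY a.key
""").fetchall()

print(where_rows)  # [('a1', 'b1')]
print(on_rows)     # [('a1', 'b1'), ('a2', None)]
```

The row with key 2 has no partner in b. It survives only when the ds conditions are part of the join clause.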
Joins are not commutative. Joins are left-associative, regardless of whether they are LEFT or RIGHT joins.
SELECT a.val1, a.val2, b.val, c.val
FROM a
JOIN b ON (a.key = b.key)
LEFT OUTER JOIN c ON (a.key = c.key)
Table a is first joined to table b, discarding every record in a or b that has no matching key in the other table; the intermediate result is then joined to table c. The outcome is not obvious when a key exists in both table a and table c but not in table b: the whole record (including a.val1, a.val2, and a.key) is dropped in the a JOIN b step, so when the result is joined with table c, c.val for that key does not appear, because no remaining a.key matches c.key. With RIGHT OUTER JOIN in place of LEFT, the result would instead be NULL, NULL, NULL, c.val.
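The left-associative evaluation order can be checked with a small experiment. As a stand-in for Hive, the sketch below uses Python's built-in sqlite3 module, assuming both follow standard SQL join semantics; the rows are invented so that key 2 exists in tables a and c but not in b.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (key INTEGER, val1 TEXT, val2 TEXT);
    CREATE TABLE b (key INTEGER, val TEXT);
    CREATE TABLE c (key INTEGER, val TEXT);
    INSERT INTO a VALUES (1, 'a1x', 'a1y'), (2, 'a2x', 'a2y');
    INSERT INTO b VALUES (1, 'b1');              -- key 2 is missing from b
    INSERT INTO c VALUES (1, 'c1'), (2, 'c2');
""")

rows = conn.execute("""
    SELECT a.val1, a.val2, b.val, c.val
    FROM a
    JOIN b ON (a.key = b.key)
    LEFT OUTER JOIN c ON (a.key = c.key)
""").fetchall()

print(rows)  # [('a1x', 'a1y', 'b1', 'c1')]
```

Key 2 is removed by the inner join with b before c is considered, so c2 never appears in the output even though a and c share that key.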
