Hive in Depth

Source: Internet
Author: User

Hive QL:

Create Table

Creates a table with the specified name. An exception is thrown if a table of the same name already exists; the user can suppress the exception with the IF NOT EXISTS option.

The EXTERNAL keyword lets the user create an external table and specify a path (LOCATION) to the actual data. When Hive creates an internal (managed) table, it moves the data into the path that the data warehouse points to; when an external table is created, Hive only records the path where the data resides and makes no change to the data's location. When an internal table is dropped, its metadata and data are deleted together; dropping an external table deletes only the metadata and leaves the data intact.
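As an illustration (the table name, columns, and path here are hypothetical), an external table that merely records the location of existing data could be created like this:

  CREATE EXTERNAL TABLE IF NOT EXISTS page_view (viewTime INT, userid BIGINT, page_url STRING)
  ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
  STORED AS TEXTFILE
  LOCATION '/path/to/existing/data';

Dropping this table later would remove only its metadata; the files under /path/to/existing/data stay in place.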

LIKE allows the user to copy an existing table's structure without copying its data.
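For example (assuming an existing table named page_view), the following creates an empty table with the same column and partition definitions:

  CREATE TABLE page_view_copy LIKE page_view;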

A partitioned table is created with the PARTITIONED BY clause. A table can have one or more partition columns, and each partition value is stored under its own directory. Tables and partitions can also be bucketed with CLUSTERED BY on a column, distributing the rows into a fixed number of buckets, and the data within each bucket can be ordered with SORTED BY. This can improve performance for certain applications.
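A sketch combining these clauses (the table name, columns, and bucket count are illustrative):

  CREATE TABLE page_view (viewTime INT, userid BIGINT, page_url STRING)
  PARTITIONED BY (dt STRING, country STRING)
  CLUSTERED BY (userid) SORTED BY (viewTime) INTO 32 BUCKETS;

Each (dt, country) value pair gets its own partition directory; within a partition, rows are hashed on userid into 32 bucket files, each sorted by viewTime.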

Drop Table

Dropping an internal table deletes the table's metadata and data together. Dropping an external table deletes only the metadata and preserves the data.
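For example:

  DROP TABLE page_view;

Whether the data files are removed along with the metadata depends on whether page_view was created as an internal or an external table.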

Alter Table

The ALTER TABLE statement allows the user to change the structure of an existing table. Users can add columns or partitions, change the SerDe, set table and SerDe properties, and rename the table itself.

ADD PARTITION

Users can use ALTER TABLE ... ADD PARTITION to add partitions to a table. Quote the partition values when they are strings:

  ALTER TABLE page_view ADD
    PARTITION (dt='2008-08-08', country='US') LOCATION '/path/to/us/part080808'
    PARTITION (dt='2008-08-09', country='US') LOCATION '/path/to/us/part080809';

DROP PARTITION

Users can use ALTER TABLE ... DROP PARTITION to delete a partition. The partition's metadata and data are deleted together:

ALTER TABLE page_view DROP PARTITION (dt='2008-08-08', country='US');

Change Column name/type/position/comment

ALTER TABLE table_name CHANGE [COLUMN] col_old_name col_new_name column_type [COMMENT col_comment] [FIRST | AFTER column_name]

This command allows the user to modify a column's name, data type, comment, or position.

For example:

CREATE TABLE test_change (a int, b int, c int);

ALTER TABLE test_change CHANGE a a1 INT; renames column a to a1.

ALTER TABLE test_change CHANGE a a1 STRING AFTER b; renames column a to a1, changes its type to STRING, and places it after column b. The new table structure is: b int, a1 string, c int.

ALTER TABLE test_change CHANGE b b1 INT FIRST; renames column b to b1 and places it in the first position. The new table structure is: b1 int, a1 string, c int.

Note: changing a column modifies only Hive's metadata and does not change the actual data. The user must ensure that the metadata definition remains consistent with the actual data layout.

Add/replace Columns

ALTER TABLE table_name ADD | REPLACE COLUMNS (col_name data_type [COMMENT col_comment], ...)

ADD COLUMNS lets the user append new columns after the existing columns but before the partition columns.

REPLACE COLUMNS removes the existing columns and adds the new set of columns. This is only possible for tables that use a native SerDe (DynamicSerDe or MetadataTypedColumnsetSerDe).
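For example (using the test_change table from above), ADD COLUMNS appends a new column d, while REPLACE COLUMNS redefines the column list entirely:

  ALTER TABLE test_change ADD COLUMNS (d STRING COMMENT 'new column');
  ALTER TABLE test_change REPLACE COLUMNS (a INT, b INT);

After the REPLACE statement the table's metadata describes only columns a and b.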

Loading files into table

When data is loaded into a table, no transformation is applied to it. The LOAD operation simply copies/moves the data files to the location corresponding to the Hive table.

Syntax:

LOAD DATA [LOCAL] INPATH 'filepath' [OVERWRITE] INTO TABLE tablename [PARTITION (partcol1=val1, partcol2=val2 ...)]

Synopsis:

The Load operation is simply a copy/move operation that moves the data file to the location corresponding to the Hive table.

  • The FilePath can be:
    • Relative paths, for example: project/data1
    • Absolute paths, for example: /user/hive/project/data1
    • A full URI including the scheme, for example: hdfs://namenode:9000/user/hive/project/data1
  • The load target can be a table or a partition. If the table is partitioned, you must specify the target partition by name.
  • filepath can refer to a file (in this case, Hive moves the file into the table's corresponding directory) or a directory (in which case, Hive moves all the files in that directory into the table's corresponding directory).
  • If LOCAL is specified, then:
    • The load command looks for filepath in the local file system. If it is a relative path, it is interpreted relative to the current user's current working directory. The user can also specify a full URI for a local file, for example: file:///user/hive/project/data1.
    • The load command copies the files at filepath into the target file system (determined by the table's location property), and the copied data files are moved into the table's data location.
  • If the LOCAL keyword is not specified and filepath points to a full URI, Hive uses that URI directly. Otherwise:
    • If no scheme or authority is given, Hive uses the scheme and authority defined by fs.default.name in the Hadoop configuration file, which specifies the URI of the NameNode.
    • If the path is not absolute, Hive interprets it relative to /user/.
    • Hive moves the files specified in filepath into the path of the table (or partition).
  • If the OVERWRITE keyword is used, the existing content (if any) of the target table (or partition) is deleted, and then the contents of the file/directory pointed to by filepath are added to the table/partition.
  • If the target table (or partition) already contains files and a file name conflicts with one in filepath, the existing file is replaced by the new one.
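Putting these rules together, a hypothetical load of a local file into a single partition (with OVERWRITE clearing any previous contents of that partition) would look like:

  LOAD DATA LOCAL INPATH '/tmp/pv_2008-08-08_us.txt'
  OVERWRITE INTO TABLE page_view
  PARTITION (dt='2008-08-08', country='US');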

SELECT

Hive does not support IN, EXISTS, or subqueries in the WHERE clause.

Partition-based queries

An ordinary SELECT query scans the entire table (unless it is a sampling query). But if a table was created with the PARTITIONED BY clause, a query can take advantage of partition pruning (input pruning) and scan only the partitions it cares about.
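For instance, if a table page_views is partitioned on dt, a filter on the partition column lets Hive scan only the matching partition directories instead of the whole table (the table name and dates here are illustrative):

  SELECT page_views.* FROM page_views
  WHERE page_views.dt >= '2008-03-01' AND page_views.dt <= '2008-03-31';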

Having Clause

Hive currently does not support the HAVING clause, but a HAVING clause can be rewritten as a subquery. For example:

SELECT col1 FROM t1 GROUP BY col1 HAVING SUM(col2) > 10

can be expressed with the following query:

SELECT col1 FROM (SELECT col1, SUM(col2) AS col2sum FROM t1 GROUP BY col1) t2 WHERE t2.col2sum > 10

Join

Hive supports only equality joins, OUTER JOINs, and LEFT SEMI JOINs. Hive does not support non-equality join conditions, because they are very difficult to convert into map/reduce tasks. Hive also supports joins of more than two tables.

If every table in a join uses the same join column (b.key1 in the example below), the join is converted into a single map/reduce task. For example:

  SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)

The following join is converted into two map/reduce tasks, because b.key1 is used in the first join condition and b.key2 in the second:

SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

When executing a join, the logic of each map/reduce task is as follows: the reducer caches the records of every table in the join sequence except the last one, then streams through the last table, serializing the join results to the file system. This implementation helps reduce memory usage on the reduce side. In practice, the largest table should therefore be written last (otherwise the caching wastes a lot of memory). For example:

  SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key1)

All tables use the same join key, so the join is computed in a single map/reduce task. The reduce side caches the records of tables a and b, and then computes the join result each time a record of table c is obtained. In the similar query below, two map/reduce tasks are used: the first caches table a and streams through table b; the second caches the result of the first map/reduce task and streams through table c:

  SELECT a.val, b.val, c.val FROM a JOIN b ON (a.key = b.key1) JOIN c ON (c.key = b.key2)

Joins happen before the WHERE clause. To limit the output of a join, the filter should go in the WHERE clause or in the JOIN clause. A confusing case arises with partitioned tables:

  SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key) WHERE a.ds='2009-07-07' AND b.ds='2009-07-07'

This joins table a to table b (LEFT OUTER JOIN), listing a.val and b.val; other columns can be used as filter conditions in the WHERE clause. However, as mentioned earlier, if a record in table a has no match in table b, all of table b's columns come out as NULL, including the ds column. The WHERE condition on b.ds then filters out exactly those rows, so the LEFT OUTER makes no difference to the result. The workaround is to put the partition filters in the ON clause of the OUTER JOIN:

SELECT a.val, b.val FROM a LEFT OUTER JOIN b ON (a.key = b.key AND b.ds='2009-07-07' AND a.ds='2009-07-07')

The result of this query is pre-filtered during the join phase, so the problem above does not arise. The same logic applies to RIGHT and FULL joins.

LEFT SEMI JOIN is a more efficient implementation of the IN/EXISTS subquery. Hive does not currently implement IN/EXISTS subqueries, so such statements can be rewritten with LEFT SEMI JOIN. The restriction is that the table on the right side of the join may appear only in the filter of the ON clause, not in the WHERE clause, the SELECT clause, or anywhere else.

  SELECT a.key, a.value FROM a WHERE a.key IN (SELECT b.key FROM b);

Can be rewritten as:

SELECT a.key, a.val FROM a LEFT SEMI JOIN b ON (a.key = b.key)


Reference: http://www.oschina.net/question/12_7945

