First, Concept introduction
Hive does not have a dedicated data storage format and does not index the data, and users can organize the tables in hive very freely, simply by telling the column and row separators in the hive data when creating the table, and hive can parse the data
The table in hive is conceptually similar to the table in the database, and each table has a corresponding directory store data in hive. For example, a table PVS, its path in HDFS is:/wh/pvs, where WH is the directory of the data warehouse specified by ${hive.metastore.warehouse.dir} in Hive-site.xml, all table data (excluding Ex ternal Table) are stored in this directory.
Partition corresponds to a dense index of Partition columns in the database, but Partition in Hive are organized differently from the database. In Hive, a Partition in a table corresponds to a directory below the table, and all Partition data is stored in the corresponding directory. For example: PVs table contains DS and city two Partition, then corresponds to ds = 20090801, ctry = US HDFS subdirectory is:/wh/pvs/ds=20090801/ctry=us; corresponds to ds = 20090801 , ctry = The HDFS subdirectory for the CA;/wh/pvs/ds=20090801/ctry=ca
Buckets calculates the hash for the specified column, slicing the data according to the hash value, in order to parallel each Bucket corresponding to a file. Spread the user column to 32 buckets, first calculating the value of the user column hash, corresponding to a hash value of 0 of the HDFS directory is:/wh/pvs/ds=20090801/ctry=us/part-00000;hash value of 20 HD The FS directory is:/wh/pvs/ds=20090801/ctry=us/part-00020
External Table points to data that already exists in HDFS, you can create a Partition. It is the same as Table in the metadata organization, while the actual data is stored in a large difference. Table creation and data loading process (both processes can be completed in the same statement), during the loading of data, the actual data will be moved to the Data Warehouse directory, and then the data pair access will be done directly in the Data Warehouse directory. When you delete a table, the data and metadata in the table are deleted at the same time.
- External table has only one process, loading the data and creating the table at the same time (create External table ...). Location), the actual data is stored in the HDFS path specified behind the location and is not moved to the Data Warehouse directory. When you delete a External table, only the metadata is deleted, and the data in the table is not actually deleted.
Second, basic Operation command
Simple creation of tables
CREATE TABLE table_name ( ID int, dtdontquery string, name string)
to create a partitioned table
CREATE TABLE table_name ( ID int, dtdontquery string, name string) partitioned by (date String
A table can have one or more partitions, and each partition exists in a folder as a separate directory under the table folder.
The partition is present in the table structure as a field and can be viewed by the describe table command, but the field does not hold the actual data content, only the representation of the partition.
In a hive select query, the entire table content is generally scanned, and it consumes a lot of time to do unnecessary work. Sometimes you only need to scan a subset of the data in the table, so the partition concept is introduced when the table is built. A Partition in the table corresponds to a directory under the table, Partition is the auxiliary query, narrow the query scope, speed up the retrieval speed of the data and manage the data according to certain specifications and conditions.
A typical default creation table
CREATE TABLE Page_view ( viewtime INT, userid BIGINT, page_url string, referrer_url string, IP STRING COMMENT ' IP Address of the User ' COMMENT ' the Page view table ' partitioned by (DT string, country string) R OW FORMAT delimited fields TERMINATED by ' \001 ' COLLECTION ITEMS TERMINATED by ' \002 ' MAP KEYS TERMINATED by ' \003 ' STORED as textfile;
This creates a table Page_view, a comment for a table, a comment for a field IP, and a partition with two columns, DT and country, respectively.
The [ROW FORMAT delimited] keyword is used to set the column delimiter that is supported when the table is created when data is loaded. Different columns are separated by a ' \001 ', and the Elements of the collection (for example, Array,map) are delimited with ' \002 ', and the key and value in the map are split with ' \003 '.
The [STORED as File_format] keyword is used to set the data type of the load data, by default, Textfile, if the file data is plain text, using [STORED as Textfile], and then copied locally directly to HDFs, Hive can identify data directly.
commonly used to create tables
Login ( userid BIGINT, IP STRING, time BIGINT) partitioned by (DT string) ROW FORMAT Delimited '\ t' STORED as textfile;
Create an external table
If the data already exists on the '/user/hadoop/warehouse/page_view ' of HDFs, if you want to create the table, pointing to the path, you need to create an external table:
CREATE EXTERNAL TABLE page_view ( viewtime INT, userid BIGINT, page_url string, Referrer_url string, IP string COMMENT ' IP Address of the User ', country STRING COMMENT ' country of Origination ') COMMENT ' This is the Staging Page view table ' ROW FORMAT delimited fields TERMINATED by ' \054 ' STORED as textfile location '/user/hadoop/wareh Ouse/page_view ';
Create the table, there is a specified external is an external table, not specified is an internal table, the internal table will drop the data from HDFs, and the external table will not be deleted.
External tables and internal tables can have partitions, and if a partition is specified, after the external table is built, modify the table to add partitions.
If the external table has partitions, it can also load data, overwrite the partition data, but the external table deletes the partition, the data for the corresponding partition is not removed from HDFs, and the internal table deletes the partition data.
Specify database Creation table
If you do not specify a database, hive creates the table under the default database, assuming that there is a hive database mydb, to create the table to MyDB, as follows:
CREATE TABLE mydb.pokes (foo int,bar STRING) or use mydb; --point the current database to Mydbcreate TABLE pokes (foo int,bar STRING);
Duplicate table Structure
CREATE TABLE empty_table_name like table_name;
Create an empty table based on table_name empty_table_name,empty_table_name there is no data.
CREATE-TABLE-AS-SELECTT (CTAS)
The table created by CTAs is atomic, which means that the table will not be visible to other users until all the query results have been completed.
CTAs The only limitation is the target table, which cannot be a partitioned table or an external table.
The Simple Way
CREATE TABLE New_key_value_store as SELECT (key% 1024x768) New_key, concat (key, value) Key_value_pair from Key_value_stor E
A complex way
CREATE TABLE new_key_value_store ROW FORMAT SERDE "Org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe" STORED as Rcfile asselect (key% 1024x768) New_key, concat (key, value) Key_value_pairfrom Key_value_storesort by New_key, Key_ Value_pair;
Delete a table
DROP table Table_name;drop table IF EXISTS table_name;
Deleting a table removes the metadata and data from the table, and the data on HDFs is moved if trash is configured. The Trash/current directory.
When you delete an external table, the data in the table is not deleted.
TRUNCATE TABLE
TRUNCATE TABLE table_name; TRUNCATE TABLE table_name PARTITION (dt= ' 20080808 ');
Deleting all rows from a table or table partition, without specifying a partition, truncates all partitions in the table, or can specify multiple partitions at a time, truncating multiple partitions.
Load data related knowledge: http://blog.csdn.net/wacthamu/article/details/40744217
Partition Related knowledge: http://www.2cto.com/kf/201210/160777.html
Copyright NOTICE: This article for Bo Master original article, without Bo Master permission not reproduced.
Understanding of tables, external tables, partitions, and buckets in hive