Chao Wu Teacher's course--hive execution statement

Source: Internet
Author: User

Why Choose Hive?

(1) Hadoop-based computing/scaling capabilities for big data
(2) Support SQL like query language
(3) Unified meta-data management
(4) Simple programming


One: The data type of hive
(1) Basic data types
Tinyint/smallint/int/bigint
Float/double
Boolean
String
(2) Complex data types
Array/map/struct
No date/datetime.

(3) data storage for Hive
Data storage for hive is based on Hadoop HDFS
Hive does not have a dedicated data storage format
The storage structure mainly includes: database, file, table, view
Hive can load this file directly (textfile) by default, and also supports sequence file
When you create a table, you specify the column and row separators for hive data, and hive parses the data

(4) Data Model for hive-database

Database similar to traditional databases
Default database "Default"
After using the #hive command, do not use the Hive>use < database name, the system default database. Hive> Use default can be used explicitly;
Create a new library
Hive > CREATE Database test_dw;
(5) Data Model for hive-table
Table Internal Tables
Partition partition Table
External Table External Tables
Bucket table

(6) Data model for hive-internal table

is conceptually similar to the Table in the database
Each Table has a corresponding directory store data in Hive. For example, a table test whose path in HDFS is:/warehouse/test.

Warehouse is the directory of the data warehouse specified in Hive-site.xml by ${hive.metastore.warehouse.dir}
All table data (not including External table) is stored in this directory.
When you delete a table, both metadata and data are deleted

Create a data file Inner_table.dat
Create a table
Hive>create table Inner_table (key string);
Loading data
hive>Load Data local inpath '/root/inner_table.dat ' into table inner_table;
View data
SELECT * FROM Inner_table
Select COUNT (*) from inner_table
drop table Inner_table

(7) Hive data Model-partition table

partition corresponds to the Partition column Dense index
in Hive, a Partition in the table corresponds to a directory under the table, All Partition data is stored in the corresponding directory. /span>
For example: test table contains date and city two Partition,
corresponds to date= 20130201, city = BJ's HDFs subdirectory is:
/WAREHOUSE/TEST/DATE=20130201/CITY=BJ
corresponds to date=20130202, City=sh's HDFs subdirectory is;
/ Warehouse/test/date=20130202/city=sh

 create TABLE tmp_table #表名 (title  string   double  Span style= "color: #000000;" >,quantity bigint,have_invoice bigint) COMMENT     Note: XXX   "  #表注释 partitioned by (PT STRING) # Partition table field (if your file is very large, you can quickly filter out data by partition field using partitioned tables) ROW FORMAT delimited fields TERMINATED by   '  
' *tmp* ' ; #支持模糊查询SHOW partitions tmp_table; #查看表有哪些分区DESCRIBE tmp_table; #查看表结构
Create a data file Partition_table.dat create TABLE partition_table (Rectimestring, MSISDNstring) Partitioned by (daytimestring, Citystring) row format delimited terminated by'\ t'stored as textfile; load data to partition load database local Inpath'/home/partition_table.dat'into table partition_table partition (daytime='2013-02-01', city='BJ'); View DataSelect*From partition_tableSelectCOUNT (*) from Partition_table drop table partition_table
ALTER TABLE partition_table Add partition (daytime='2013-02-04', city='BJ  partition_table drop partition (Daytime=' ) via load data 2013-02-04', city='bj') metadata, data file deleted, but directory daytime=  04 isstill here.

(8) Data model for hive-bucket table

A bucket table is a hash of the data, which is then stored in a different file.

  CREATE TABLE created tables bucket_table ( id  Span style= "color: #0000ff;" >string ) Clustered by (id ) into 4  Span style= "color: #000000;"        > buckets;    Load Data set hive.enforce.bucketing  = true  ; Insert INTO table bucket_table  select          Name from Stu; Insert Overwrite table bucket_table  select   name from Stu; when data is loaded into a bucket table, the field is hashed and then modeled with the number of buckets. Put the data in the corresponding file. Sample query  select  * from bucket_table tablesample (bucket 1  out of 4  on id ); 

(9) Data Model for hive-External table

Point to data that already exists in HDFS , and you can create a Partition
It is the same as the internal table in the metadata organization, while the actual data storage has a large difference
The internal table creation process and the data loading process (both processes can be completed in the same statement), in the process of loading data, the actual data will be moved to the Data Warehouse directory;

Subsequent access to the data will be done directly in the Data Warehouse directory. When you delete a table, the data and metadata in the table are deleted at the same time
The external table has only one procedure, loading the data and creating the table at the same time , and does not move to the Data Warehouse directory, just to establish a link with the external data.

When you delete an external table, only the link is deleted

CREATE EXTERNAL TABLE page_view (viewtime INT, UserID BIGINT, Page_url string, Referrer_url string, IP STRING COMMENT'IP Address of the User', country STRING COMMENT'Country of origination ') COMMENT'This is the Staging Page view table'ROW FORMAT Delimited TERMINATED by' -'LINES TERMINATED by' A'STORED as Textfile location'Hdfs://centos:9000/user/data/staging/page_view';

Create a data file External_table.dat create a table hive>create External Table External_table1 (keystring) ROW FORMAT Delimited TERMINATED by'\ t'Location'/home/external'; Create a directory in HDFs/home/External#hadoop FS-put/home/external_table.dat/home/external Load Data Inpath'/home/external_table1.dat'into TABLE external_table1; View DataSelect*From external_tableSelectCOUNT (*) from external_table drop table external_table

View Actions

Select * from T1;

Operation of the table

table modified ALTER TABLE Target_tab add columns (cols,string) Table Delete drop tables

Import data

When the data is loaded into the table, no transformations are made to the data. The Load operation simply copies the data /'filepath' [OVERWRITE] into    TABLE tablename    [ PARTITION (Partcol1=val1, partcol2=val2 ...)] Import one hive table into another built-in hive table insert OVERWRITE table tablename [PARTITION (Partcol1=val1, partcol2=val2 ...)] Select_ Statement from From_statementctascreate [EXTERNAL] TABLE [IF not EXISTS] table_name     (col_name data_type, ...)    ..... As SELECT ... Example: Create TABLE  new_external_testasselect * from External_table1;

Inquire

Select| DISTINCT] select_expr, select_expr, ...    From Table_reference     [WHERE where_condition]     [GROUP by col_list]     | [Distribute by col_list] [SORT by col_list] | [ORDER by col_list]]    [LIMIT number]
partition-based queries the general SELECT query is a full table scan. But if it is a partitioned table, the query can take advantage of the feature of partition pruning (input pruning), similar to "partitioned Index", which scans only one part of the table that it cares about. The current implementation of Hive is that only the partition assertion (partitioned by) appears in the WHERE clause closest to the FROM clause to enable partition pruning. For example, if the Page_views table (by talent area) is usedDatecolumn partition, the following statement reads only the partition as ' --Geneva- on' data. SELECT page_views.* FROM Page_views WHERE page_views.Date>='2013-03-01'and Page_views.Date<='2013-03-01'Limit Clause limits can restrict the number of records queried. the results of the query are randomly selected. The following query statement randomly queries 5 records from the T1 table: SELECT* FROM T1 LIMIT5Top N Query the following query statement to query the largest sales record5a sales representative. SET Mapred.reduce.tasks=1SELECT * FROM Sales SORT by amount DESC LIMIT5
Table Connection


Import the AC information table hive> CREATE table Acinfo (namestring, ACIPstring)row format delimited terminated by '\ t'stored as textfile;hive> Load Data Local Inpath'/home/acinfo/ac.dat'into table acinfo; inner ConnectionSelectB.name,a.* from Dim_ac AJoinAcinfo b on (A.AC=B.ACIP) limitTen; left outer connectionSelectB.name,a.* from Dim_ac a left outerJoinAcinfo B on A.ac=b.acip limitTen;

Java Client

Hive Remote service starts #hive--service hiveserver >/dev/NULL  2>/dev/NULL&Java Client-related code class.forname ("Org.apache.hadoop.hive.jdbc.HiveDriver"); Connection Con= Drivermanager.getconnection ("Jdbc:hive://192.168.1.102:10000/wlan_dw","",""); Statement stmt=con.createstatement (); String Querysql="SELECT * from Wlan_dw.dim_m ORDER by flux desc limit"; ResultSet Res=stmt.executequery (Querysql);  while(Res.next ()) {System.out.println (res.getstring (1) +"\ t"+res.getlong (2)+"\ t"+res.getlong (3)+"\ t"+res.getlong (4)+"\ t"+res.getlong (5));}

Udf
1, the UDF function can be directly applied to the SELECT statement, the query structure to do the format processing, and then output the content.
2, write the UDF function when you need to pay attention to a few points:
A) custom UDFs need to inherit Org.apache.hadoop.hive.ql.UDF.
b) The Evaluate function needs to be implemented, and the Evaluate function supports overloading.
4. Steps
A) package the program on the target machine;
b) Enter the hive client and add the jar package:hive> addjar/run/jar/udf_test.jar;
c) Create temporary functions: hive>create temporary function add_example as ' hive.udf.Add ';
d) Query the HQL statement:
SELECT Add_example (8, 9) from scores;
SELECT add_example (Scores.math, Scores.art) from scores;
SELECT Add_example (6, 7, 8, 6.8) from scores;
e) Destroy temporary functions:hive> DROP temporary function add_example;
Note: UDF can only implement one-in-one-out operation, if you need to implement a multi-step out, you need to implement UDAF

Two: The execution statement of Hive

1. Internal Tables
Creating tables: Create TABLEt1(id int);
upload data for the created table: theLOAD data LOCAL inpath '/root/id ' into table T1;
You can also upload data using the Hadoop FS put Id/id method
separate two columns for the Created table: CREATE TABLE t2 (id int, name string) ROW FORMAT delimited fields TERMINATED by ' \ t ';
2. Partition Table
CREATE TABLE t3 (id int) partitioned by (day int);
LOAD DATA LOCAL inpath '/root/id ' into TABLE T1 PARTITION (day=22);
3. Bucket table
CREATE table t4 (id int) clustered by (ID) into 4 buckets;
Set hive.enforce.bucketing = true;
INSERT INTO table T4 Select id from T3;
4. External Tables
Create external Table T5 (id int) location '/external ';

Chao Wu Teacher's course--hive execution statement

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.