Alex's Hadoop Beginner Tutorial: Part 10, Getting Started with Hive
Install Hive
Many tutorials introduce concepts first. I prefer to install first and then introduce the concepts through examples, so let's install Hive first.
First confirm that the corresponding yum repository is installed. If not, install the CDH yum repository as described in this tutorial: http://blog.csdn.net/nsrainbow/article/details/36629339
What is Hive?
Hive provides a way for you to query data with SQL. But it is best not to use Hive for real-time queries: Hive is very slow, because it works by converting SQL statements into one or more MapReduce jobs. The official documentation says Hive is suited to high-latency, batch-oriented scenarios.
For example, you can query
hive> select * from h_employee;
OK
1	1	peter
2	2	paul
Time taken: 9.289 seconds, Fetched: 2 row(s)
Note that this h_employee is not necessarily an ordinary database table.
Metastore
All tables created in Hive are called metastore tables. These tables do not actually store the data; they define the mapping between the real data and Hive, much like the meta information of tables in a traditional database, hence the name metastore. The actual storage can be defined in four modes:
- Internal table (default)
- Partition table
- Bucket table
- External table
For example, this is a statement that creates an internal table:
CREATE TABLE worker(id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';
This statement creates a worker internal table. The internal table is the default type, so the storage mode does not need to be specified, and commas are used as field separators for storage.
Basic Data Types supported by table creation statements
Tinyint/smallint/int/bigint
Float/double
Boolean
String
Complex data types
Array/Map/Struct
There is no date/datetime type.
Where does the created table live? In /user/hive/warehouse. You can view the created tables' locations through hdfs:
$ hdfs dfs -ls /user/hive/warehouse
Found 11 items
drwxrwxrwt   - root     supergroup          0 2014-12-02 14:42 /user/hive/warehouse/h_employee
drwxrwxrwt   - root     supergroup          0 2014-12-02 14:42 /user/hive/warehouse/h_employee2
drwxrwxrwt   - wlsuser  supergroup          0 2014-12-04 17:21 /user/hive/warehouse/h_employee_export
drwxrwxrwt   - root     supergroup          0 2014-08-18 09:20 /user/hive/warehouse/h_http_access_logs
drwxrwxrwt   - root     supergroup          0 2014-06-30 10:15 /user/hive/warehouse/hbase_apache_access_log
drwxrwxrwt   - username supergroup          0 2014-06-27 17:48 /user/hive/warehouse/hbase_table_1
drwxrwxrwt   - username supergroup          0 2014-06-30 09:21 /user/hive/warehouse/hbase_table_2
drwxrwxrwt   - username supergroup          0 2014-06-30 09:43 /user/hive/warehouse/hive_apache_accesslog
drwxrwxrwt   - root     supergroup          0 2014-12-02 15:12 /user/hive/warehouse/hive_employee
Each folder corresponds to one metastore table. Let's go through the table types, starting with internal tables.
CREATE TABLE workers(id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';
With this statement, an internal table named workers is created, with a comma as the separator ('\054' is the octal ASCII code for a comma).
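The '\054' in the delimiter clause is an octal escape. A quick check in Python (just an illustration, nothing Hive-specific) confirms that octal 54 is the comma:

```python
# Octal 054 is decimal 44, which is the ASCII code for a comma.
# This is why FIELDS TERMINATED BY '\054' means "comma-separated".
assert 0o54 == 44
print(chr(0o54))       # prints: ,
print("\054" == ",")   # Python reads the same octal escape: True
```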
We can use show tables; to list the tables. In fact, many Hive statements imitate MySQL; when you don't know a statement, try the MySQL equivalent. The exception is limit, which is explained later.
hive> show tables;
OK
h_employee
h_employee2
h_employee_export
h_http_access_logs
hive_employee
workers
Time taken: 0.371 seconds, Fetched: 6 row(s)
After the creation, let's try inserting some data. Note that Hive does not support single-row insert statements; data must be loaded in batches. So don't expect statements such as insert into workers values (1, 'jack') to work. Hive supports two ways of inserting data:
- Read data from a file
- Insert from a select (the result of a query on another table)
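As a sketch, the two methods look like this in HiveQL (the h_employee2 in the second statement is just a hypothetical source table with compatible columns):

```sql
-- Method 1: load rows from a file (local or HDFS)
LOAD DATA LOCAL INPATH '/home/alex/workers.csv' INTO TABLE workers;

-- Method 2: insert the result of a SELECT on another table
-- (h_employee2 is a hypothetical source table with compatible columns)
INSERT OVERWRITE TABLE workers
SELECT id, name FROM h_employee2;
```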
Here I read data from a file. First create a file named workers.csv:
$ cat workers.csv
1,jack
2,terry
3,michael
Use LOAD DATA to import the data into the Hive table:
hive> LOAD DATA LOCAL INPATH '/home/alex/workers.csv' INTO TABLE workers;
Copying data from file:/home/alex/workers.csv
Copying file: file:/home/alex/workers.csv
Loading data to table default.workers
Table default.workers stats: [num_partitions: 0, num_files: 1, num_rows: 0, total_size: 25, raw_data_size: 0]
OK
Time taken: 0.655 seconds
Note:
- Do not forget the LOCAL keyword when loading from the local disk. The difference between load data local inpath and load data inpath is that the former reads the source file from your local disk, while the latter reads it from HDFS.
- If OVERWRITE is added, the table is cleared before the import, for example: load data local inpath '/home/alex/workers.csv' overwrite into table workers;
Query data
hive> select * from workers;
OK
1	jack
2	terry
3	michael
Time taken: 0.177 seconds, Fetched: 3 row(s)
Let's take a look at how the imported internal table is stored in Hive.
# hdfs dfs -ls /user/hive/warehouse/workers/
Found 1 items
-rwxrwxrwt   2 root supergroup         25 2014-12-08 15:23 /user/hive/warehouse/workers/workers.csv
It turns out the file is simply copied over intact! That's how crude it is! Let's try adding another file, workers2.txt (I intentionally changed the extension, but Hive does not look at extensions):
# cat workers2.txt
4,peter
5,kate
6,ted
Import
hive> LOAD DATA LOCAL INPATH '/home/alex/workers2.txt' INTO TABLE workers;
Copying data from file:/home/alex/workers2.txt
Copying file: file:/home/alex/workers2.txt
Loading data to table default.workers
Table default.workers stats: [num_partitions: 0, num_files: 2, num_rows: 0, total_size: 46, raw_data_size: 0]
OK
Time taken: 0.79 seconds
Check the file storage structure.
# hdfs dfs -ls /user/hive/warehouse/workers/
Found 2 items
-rwxrwxrwt   2 root supergroup         25 2014-12-08 15:23 /user/hive/warehouse/workers/workers.csv
-rwxrwxrwt   2 root supergroup         21 2014-12-08 15:29 /user/hive/warehouse/workers/workers2.txt
There is now an extra workers2.txt. Query it with SQL:
hive> select * from workers;
OK
1	jack
2	terry
3	michael
4	peter
5	kate
6	ted
Time taken: 0.144 seconds, Fetched: 6 row(s)
Partition tables are used to speed up queries. For example, if you have a large amount of data but your application scenario is a daily report built from that data, you can partition the data by day; the report then only needs to load the partition for that day. Let's create a partition table:
create table partition_employee(id int, name string)
partitioned by(daytime string)
row format delimited fields terminated by '\054';
Notice that daytime is a partition attribute, not one of the table's columns. First, create two test data files corresponding to two days of data:
# cat 2014-05-05
22,kitty
33,lily
# cat 2014-05-06
14,sami
45,micky
Import into the partition table:
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-05-05' INTO TABLE partition_employee partition(daytime='2014-05-05');
Copying data from file:/home/alex/2014-05-05
Copying file: file:/home/alex/2014-05-05
Loading data to table default.partition_employee partition (daytime=2014-05-05)
Partition default.partition_employee{daytime=2014-05-05} stats: [num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]
Table default.partition_employee stats: [num_partitions: 1, num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]
OK
Time taken: 1.154 seconds
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-05-06' INTO TABLE partition_employee partition(daytime='2014-05-06');
Copying data from file:/home/alex/2014-05-06
Copying file: file:/home/alex/2014-05-06
Loading data to table default.partition_employee partition (daytime=2014-05-06)
Partition default.partition_employee{daytime=2014-05-06} stats: [num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]
Table default.partition_employee stats: [num_partitions: 2, num_files: 2, num_rows: 0, total_size: 42, raw_data_size: 0]
OK
Time taken: 0.763 seconds
Partitions are specified through partition during import.
You can restrict a query to a specific partition:
hive> select * from partition_employee where daytime='2014-05-05';
OK
22	kitty	2014-05-05
33	lily	2014-05-05
Time taken: 0.173 seconds, Fetched: 2 row(s)
The query statement needs no special syntax; Hive automatically determines whether the where clause contains partition fields. You can also use operators such as greater-than or less-than:
hive> select * from partition_employee where daytime>='2014-05-05';
OK
22	kitty	2014-05-05
33	lily	2014-05-05
14	sami	2014-05-06
45	micky	2014-05-06
Time taken: 0.273 seconds, Fetched: 4 row(s)
Let's look at the storage structure.
# hdfs dfs -ls /user/hive/warehouse/partition_employee
Found 2 items
drwxrwxrwt   - root supergroup          0 2014-12-08 15:57 /user/hive/warehouse/partition_employee/daytime=2014-05-05
drwxrwxrwt   - root supergroup          0 2014-12-08 15:57 /user/hive/warehouse/partition_employee/daytime=2014-05-06
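The layout above can be sketched in a few lines of Python (a toy illustration, not Hive source code): every partition key/value pair becomes a key=value subdirectory under the table's warehouse directory, and with several partition keys the directories simply nest.

```python
# Toy sketch (not Hive internals): how partitioned data lands in the warehouse.
WAREHOUSE = "/user/hive/warehouse"

def partition_path(table, partitions):
    """Build the HDFS path for one partition, Hive-style (key=value dirs)."""
    parts = "/".join(f"{k}={v}" for k, v in partitions)
    return f"{WAREHOUSE}/{table}/{parts}"

print(partition_path("partition_employee", [("daytime", "2014-05-05")]))
# /user/hive/warehouse/partition_employee/daytime=2014-05-05

# With two partition keys, the directories nest:
print(partition_path("p_student", [("daytime", "2014-09-09"), ("country", "CN")]))
# /user/hive/warehouse/p_student/daytime=2014-09-09/country=CN
```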
Let's try a table partitioned on two dimensions:
create table p_student(id int, name string)
partitioned by(daytime string, country string)
row format delimited fields terminated by '\054';
Prepare some data:
# cat 2014-09-09-CN
1,tammy
2,eric
# cat 2014-09-10-CN
3,paul
4,jolly
# cat 2014-09-10-EN
44,ivan
66,billy
Import into Hive:
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-09-CN' INTO TABLE p_student partition(daytime='2014-09-09',country='CN');
Copying data from file:/home/alex/2014-09-09-CN
Copying file: file:/home/alex/2014-09-09-CN
Loading data to table default.p_student partition (daytime=2014-09-09, country=CN)
Partition default.p_student{daytime=2014-09-09, country=CN} stats: [num_files: 1, num_rows: 0, total_size: 19, raw_data_size: 0]
Table default.p_student stats: [num_partitions: 1, num_files: 1, num_rows: 0, total_size: 19, raw_data_size: 0]
OK
Time taken: 0.736 seconds
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-10-CN' INTO TABLE p_student partition(daytime='2014-09-10',country='CN');
Copying data from file:/home/alex/2014-09-10-CN
Copying file: file:/home/alex/2014-09-10-CN
Loading data to table default.p_student partition (daytime=2014-09-10, country=CN)
Partition default.p_student{daytime=2014-09-10, country=CN} stats: [num_files: 1, num_rows: 0, total_size: 19, raw_data_size: 0]
Table default.p_student stats: [num_partitions: 2, num_files: 2, num_rows: 0, total_size: 38, raw_data_size: 0]
OK
Time taken: 0.691 seconds
hive> LOAD DATA LOCAL INPATH '/home/alex/2014-09-10-EN' INTO TABLE p_student partition(daytime='2014-09-10',country='EN');
Copying data from file:/home/alex/2014-09-10-EN
Copying file: file:/home/alex/2014-09-10-EN
Loading data to table default.p_student partition (daytime=2014-09-10, country=EN)
Partition default.p_student{daytime=2014-09-10, country=EN} stats: [num_files: 1, num_rows: 0, total_size: 21, raw_data_size: 0]
Table default.p_student stats: [num_partitions: 3, num_files: 3, num_rows: 0, total_size: 59, raw_data_size: 0]
OK
Time taken: 0.622 seconds
Look at the storage structure
# hdfs dfs -ls /user/hive/warehouse/p_student
Found 2 items
drwxr-xr-x   - root supergroup          0 2014-12-08 16:10 /user/hive/warehouse/p_student/daytime=2014-09-09
drwxr-xr-x   - root supergroup          0 2014-12-08 16:10 /user/hive/warehouse/p_student/daytime=2014-09-10
# hdfs dfs -ls /user/hive/warehouse/p_student/daytime=2014-09-09
Found 1 items
drwxr-xr-x   - root supergroup          0 2014-12-08 16:10 /user/hive/warehouse/p_student/daytime=2014-09-09/country=CN
Query data
hive> select * from p_student;
OK
1	tammy	2014-09-09	CN
2	eric	2014-09-09	CN
3	paul	2014-09-10	CN
4	jolly	2014-09-10	CN
44	ivan	2014-09-10	EN
66	billy	2014-09-10	EN
Time taken: 0.228 seconds, Fetched: 6 row(s)
hive> select * from p_student where daytime='2014-09-10' and country='EN';
OK
44	ivan	2014-09-10	EN
66	billy	2014-09-10	EN
Time taken: 0.224 seconds, Fetched: 2 row(s)
A bucket table distributes rows into different buckets based on the hash value of a field. (When classifying objects, people habitually sort them into labeled buckets, hence the name.) Bucket tables are used mainly for sampling and analysis.
The following example is adapted from the tutorial on the official website. Because partitioning and bucketing can be used at the same time, this example uses both features.
CREATE TABLE b_student(id INT, name STRING)
PARTITIONED BY(dt STRING, country STRING)
CLUSTERED BY(id) SORTED BY(name) INTO 4 BUCKETS
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\054';
It means that rows are distributed into 4 buckets by hashing the id column and are sorted by name within each bucket. I will skip the import process; this is the imported data:
hive> select * from b_student;
OK
1	tammy	2014-09-09	CN
2	eric	2014-09-09	CN
3	paul	2014-09-10	CN
4	jolly	2014-09-10	CN
34	allen	2014-09-11	EN
Time taken: 0.727 seconds, Fetched: 5 row(s)
Sample one of the four buckets:
hive> select * from b_student tablesample(bucket 1 out of 4 on id);
Total MapReduce jobs = 1
Launching Job 1 out of 1
Number of reduce tasks is set to 0 since there's no reduce operator
Starting Job = job_1406097234796_0041, Tracking URL = http://hadoop01:8088/proxy/application_1406097234796_0041/
Kill Command = /usr/lib/hadoop/bin/hadoop job -kill job_1406097234796_0041
Hadoop job information for Stage-1: number of mappers: 1; number of reducers: 0
2014-12-08 17:35:56,995 Stage-1 map = 0%,  reduce = 0%
2014-12-08 17:36:06,783 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.9 sec
2014-12-08 17:36:07,845 Stage-1 map = 100%,  reduce = 0%, Cumulative CPU 2.9 sec
MapReduce Total cumulative CPU time: 2 seconds 900 msec
Ended Job = job_1406097234796_0041
MapReduce Jobs Launched:
Job 0: Map: 1   Cumulative CPU: 2.9 sec   HDFS Read: 482 HDFS Write: 22 SUCCESS
Total MapReduce CPU Time Spent: 2 seconds 900 msec
OK
4	jolly	2014-09-10	CN
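Why did the sample return exactly the row with id 4? A toy Python model helps (this is an illustration, not Hive internals; it assumes that for an integer column the hash is the value itself, which matches the sample output above): bucket 1 out of 4 keeps the rows whose id is divisible by 4.

```python
# Toy model of Hive bucketing (not Hive internals).
# Assumption: for an integer column the hash is the value itself,
# so a row lands in bucket (id % NUM_BUCKETS).
NUM_BUCKETS = 4
rows = [(1, "tammy"), (2, "eric"), (3, "paul"), (4, "jolly"), (34, "allen")]

def bucket_of(row_id):
    return row_id % NUM_BUCKETS

# tablesample(bucket 1 out of 4 on id) reads the first bucket,
# i.e. rows where id % 4 == 0:
sampled = [r for r in rows if bucket_of(r[0]) == 0]
print(sampled)  # [(4, 'jolly')]
```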
External tables are not stored by Hive itself; the data can live elsewhere, for example in HBase, and Hive only keeps a mapping to it. I will use HBase as the example.
Create an HBase table named employee first:
hbase(main):005:0> create 'employee','info'
0 row(s) in 0.4740 seconds
=> Hbase::Table - employee
hbase(main):006:0> put 'employee',1,'info:id',1
0 row(s) in 0.2080 seconds
hbase(main):008:0> scan 'employee'
ROW             COLUMN+CELL
 1              column=info:id, timestamp=1417591291730, value=1
1 row(s) in 0.0610 seconds
hbase(main):009:0> put 'employee',1,'info:name','peter'
0 row(s) in 0.0220 seconds
hbase(main):010:0> scan 'employee'
ROW             COLUMN+CELL
 1              column=info:id, timestamp=1417591291730, value=1
 1              column=info:name, timestamp=1417591321072, value=peter
1 row(s) in 0.0450 seconds
hbase(main):011:0> put 'employee',2,'info:id',2
0 row(s) in 0.0370 seconds
hbase(main):012:0> put 'employee',2,'info:name','paul'
0 row(s) in 0.0180 seconds
hbase(main):013:0> scan 'employee'
ROW             COLUMN+CELL
 1              column=info:id, timestamp=1417591291730, value=1
 1              column=info:name, timestamp=1417591321072, value=peter
 2              column=info:id, timestamp=1417591500179, value=2
 2              column=info:name, timestamp=1417591512075, value=paul
2 row(s) in 0.0440 seconds
Create the external table that maps to it:
hive> CREATE EXTERNAL TABLE h_employee(key int, id int, name string)
    > STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
    > WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key, info:id,info:name")
    > TBLPROPERTIES ("hbase.table.name" = "employee");
OK
Time taken: 0.324 seconds
hive> select * from h_employee;
OK
1	1	peter
2	2	paul
Time taken: 1.129 seconds, Fetched: 2 row(s)
For query syntax, refer to the official manual: https://cwiki.apache.org/confluence/display/Hive/Tutorial . I will only mention a few points that may seem strange, such as using limit to fetch the first N rows:
hive> select * from h_employee limit 1;
OK
1	1	peter
Time taken: 0.284 seconds, Fetched: 1 row(s)
However, a starting offset (as in MySQL's limit offset, count) is not supported.
Class dismissed!