Reprint; please cite the source: https://blog.csdn.net/l1028386804/article/details/80173778
I. Hive Overview

1. Why use Hive
The birth of the Hadoop ecosystem brought the promise of fast, efficient big-data processing, but it requires writing MapReduce or Spark jobs, a high barrier to entry that demands mastery of a programming language such as Java or Scala.
Meanwhile, we have long been accustomed to traditional relational databases, and Structured Query Language (SQL) is relatively easy to learn. If a relational-database-style schema could be applied to files in Hadoop, data could be queried and manipulated with a SQL-like language. Thus Hive was born.
Hive provides a SQL dialect called the Hive Query Language (HQL) for querying data stored in a Hadoop cluster. Hive is roughly the counterpart of MySQL: where MySQL's underlying storage engine is InnoDB, Hive's engine is Hadoop's MapReduce (or Spark). Hive converts most queries into MapReduce or Spark tasks, cleverly combining the traditional SQL language with the Hadoop ecosystem and making it easy for people who only know SQL to write data-analysis jobs.

2. Hive application scenarios
Let us first look at the characteristics of Hive; from these we can determine its application scenarios.
First, Hive is not a complete database; it depends on, and is restricted by, HDFS. The biggest limitation is that Hive does not support record-level UPDATE, INSERT, or DELETE operations.
Second, Hadoop is a batch-oriented system whose tasks take a long time to start, so Hive queries suffer from high latency. A query that completes in seconds in a traditional database can take far longer to execute in Hive.
Third, Hive does not support transactions.
In summary, Hive does not support OLTP (On-Line Transaction Processing) and is closer to an OLAP (On-Line Analytical Processing) tool, and only "close", because Hive's latency cannot satisfy the "online" part of OLAP. Hive is therefore best suited to data-warehouse applications that do not require fast responses: static analysis of large related data sets and data mining, from which decisions or reports are then produced.
So what if users need OLTP-style access to large-scale data? In that case a NoSQL database such as HBase should be chosen; such databases are characterized by fast random access and can satisfy real-time query requirements.
II. Data Types and File Formats

1. Basic data types
The basic data types in Hive correspond one-to-one with types in Java: STRING, for example, corresponds to Java's String, and BIGINT corresponds to Java's long.

| Data Type | Description | Example |
| --- | --- | --- |
| TINYINT | 1-byte signed integer | 20 |
| SMALLINT | 2-byte signed integer | 20 |
| INT | 4-byte signed integer | 20 |
| BIGINT | 8-byte signed integer | 20 |
| BOOLEAN | Boolean type, TRUE or FALSE | TRUE |
| FLOAT | Single-precision floating-point number | 3.1415926 |
| DOUBLE | Double-precision floating-point number | 3.1415926 |
| STRING | A sequence of characters; a character set can be specified; single or double quotation marks may be used | 'Hello world', "Hello worlds" |
| TIMESTAMP | Integer, floating-point number, or string | 1327882394 (Unix epoch seconds); 1327882394.123456789 (Unix epoch seconds followed by nanoseconds); '2012-02-03 12:34:56' (JDBC-compatible java.sql.Timestamp format) |
| BINARY | Byte array | Can contain any bytes |

2. Collection data types
1. STRUCT

struct('John', 'Doe')

The STRUCT type is similar to an "object" in Java; its fields are accessed with dot notation. For example, if a column name has the type STRUCT<first:STRING, last:STRING>, its first field can be referenced as name.first.
2. MAP

map('first', 'John', 'last', 'Doe')

A MAP is a collection of key-value tuples whose values are accessed with array notation (e.g. name['key']). For example, if a column name is a MAP holding the pairs ('first', 'John') and ('last', 'Doe'), the last name can be obtained as name['last'].
3. ARRAY

array('John', 'Doe')

An ARRAY is an ordered list of elements of the same type, indexed from zero; in this example name[1] returns 'Doe'.
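To make the three collection types concrete, here is a minimal sketch of how they might be declared and accessed in HiveQL. The table and column names (people, phones, and so on) are illustrative and do not come from the original post:

```sql
-- Hypothetical table using all three collection types
CREATE TABLE IF NOT EXISTS people (
  name          STRUCT<first:STRING, last:STRING>,
  phones        MAP<STRING, STRING>,
  subordinates  ARRAY<STRING>
);

-- Dot notation for STRUCT, ['key'] for MAP, [index] (zero-based) for ARRAY
SELECT name.first,
       phones['home'],
       subordinates[0]
FROM people;
```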
3. Hive file formats

A traditional relational database hides its storage format from the outside world; in Hive, the format in which files are stored can be customized.

III. Common HQL Operations
Next we will walk through the basics and common operations on databases, tables, partitions, and so on in Hive; these are what we use most often.

1. Database operations
1. View all databases
show databases;
2. Create a database
CREATE DATABASE IF NOT EXISTS lyz;
Hive databases are stored under the top-level directory specified by the property hive.metastore.warehouse.dir, whose default value is /user/hive/warehouse. Hive creates a directory for each database, and the tables in a database are stored as subdirectories of the database directory. So when we create the lyz database, Hive creates the directory /user/hive/warehouse/lyz.db, as shown in the figure:
One exception is tables in the default database, because that database does not get a directory of its own.
The user can also modify this default location by using the following command:
CREATE DATABASE IF NOT EXISTS lyz LOCATION '/myown/dir';
3. View database Information
DESC DATABASE lyz;
4. Deleting a database
DROP DATABASE IF EXISTS lyz;
By default, Hive will not let users delete a database that still contains tables. The user must either drop the tables first, or append the keyword CASCADE to the command, which tells Hive to drop the tables in the database before dropping the database itself:
DROP DATABASE IF EXISTS lyz CASCADE;
5. Using the database
You must switch to a database if you want to access its tables directly; otherwise every table name needs a prefix, such as lyz.table:
USE lyz;
2. Table Operation
1. Create a table
CREATE TABLE IF NOT EXISTS lyz.employsalary (
  name          STRING COMMENT 'Employee name',
  salary        FLOAT COMMENT 'Employee salary',
  subordinates  ARRAY<STRING> COMMENT 'Names of subordinates',
  deductions    MAP<STRING, FLOAT>
                COMMENT 'Keys are deductions names, values are percentages',
  address       STRUCT<street:STRING, city:STRING, state:STRING>
                COMMENT 'Home address')
COMMENT 'Description of the table'
PARTITIONED BY (year STRING, month STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
  COLLECTION ITEMS TERMINATED BY '\073'
  MAP KEYS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/lyz.db/employsalary'
TBLPROPERTIES ('creator' = 'me', 'created_at' = '2018-05-02 00:00:00');
First, if the current database is not the target database, the table name must be qualified with the database name, the lyz in the example.
PARTITIONED BY is the partitioning clause, which is detailed later.
The clauses after ROW FORMAT specify how the contents of the files mentioned earlier are split into fields, collection items, map keys, and lines.
STORED AS TEXTFILE means the files are plain text, containing letters, digits, and characters (including international character sets), with each line being a separate record.
LOCATION '/user/hive/warehouse/lyz.db/employsalary' defines where the table is stored. It can be omitted, in which case this default location /user/hive/warehouse/lyz.db/employsalary is used.
TBLPROPERTIES records descriptive metadata about the table.
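These properties can be read back later with SHOW TBLPROPERTIES (assuming the current database is lyz):

```sql
-- List all table properties, including creator and created_at set above
SHOW TBLPROPERTIES employsalary;
```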
Note that the table created here is a managed table, sometimes called an internal table. For this kind of table, Hive controls the life cycle of the data: when a managed table is dropped, Hive deletes the table's data as well, which is often not what we want. We therefore typically create an external table instead, by adding the EXTERNAL keyword:
CREATE EXTERNAL TABLE IF NOT EXISTS lyz.employsalary ...
When Hive drops an external table, it does not delete the table's data; only the table's metadata is removed.
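As a side note, an existing managed table can also be flipped to external (and back) by changing the EXTERNAL table property; a sketch, which assumes a Hive version where this property toggle is supported:

```sql
-- Mark the table as external so a later DROP TABLE keeps the HDFS data
ALTER TABLE lyz.employsalary SET TBLPROPERTIES ('EXTERNAL' = 'TRUE');
```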
2. View table information
DESC FORMATTED employsalary_copy;
3. Copying a table
CREATE EXTERNAL TABLE IF NOT EXISTS lyz.employsalary_copy
LIKE lyz.employsalary
LOCATION '/user/hive/warehouse/lyz.db/employsalary_copy';
4. Delete a table
drop table if exists employsalary;
5. Inserting data into a table through a query statement
As mentioned earlier, Hive has no row-level inserts, updates, or deletes, so how is data loaded into a table? One way is to place files directly under the table's directory; another is to query an existing table and insert the result into a new table, effectively extracting data from the original table into the new one.
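The first approach (placing files under the table directory) is usually done with the LOAD DATA statement rather than raw HDFS commands; a minimal sketch, where the HDFS file path is an assumption for illustration:

```sql
-- Move a delimited text file from HDFS into a specific partition of the table
LOAD DATA INPATH '/tmp/employsalary_201804.txt'
OVERWRITE INTO TABLE employsalary
PARTITION (year = '2018', month = '04');
```

The second approach, inserting via a query, looks like this: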
INSERT OVERWRITE TABLE employ
PARTITION (year = '2018', month = '04')
SELECT es.name, es.salary FROM employsalary es
WHERE es.year = '2018' AND es.month = '04';
Because the OVERWRITE keyword is used here, the previous contents of the partition are replaced; if you do not want them overwritten, drop the keyword or use the INTO keyword instead.
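For example, the appending variant with INTO keeps whatever the partition already contains:

```sql
-- Append to the partition: existing data in (2018, 04) is preserved
INSERT INTO TABLE employ
PARTITION (year = '2018', month = '04')
SELECT es.name, es.salary FROM employsalary es
WHERE es.year = '2018' AND es.month = '04';
```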
There is still a problem: if the target table has many partitions, every execution of such a statement scans the table employsalary once, which is very wasteful. Hive provides another insert syntax that scans the input data only once and then splits it several ways. The following example imports three months of data into the table employ while scanning the source table employsalary a single time:
FROM employsalary es
INSERT OVERWRITE TABLE employ
  PARTITION (year = '2018', month = '01')
  SELECT es.name, es.salary WHERE es.year = '2018' AND es.month = '01'
INSERT OVERWRITE TABLE employ
  PARTITION (year = '2018', month = '02')
  SELECT es.name, es.salary WHERE es.year = '2018' AND es.month = '02'
INSERT OVERWRITE TABLE employ
  PARTITION (year = '2018', month = '03')
  SELECT es.name, es.salary WHERE es.year = '2018' AND es.month = '03';
3. Partition Operation
The benefit of partitioning is self-evident: when the query's WHERE clause includes partition filters, query performance improves greatly. Partitions are not mandatory, but because Hadoop deals with very large data sets, you should add partitions to tables whenever possible. Partition columns are also ordinary columns of the table and can be queried with SELECT.

1. Adding partitions
ALTER TABLE employsalary_copy ADD PARTITION (year = '2018', month = '05')
LOCATION '/user/hive/warehouse/lyz.db/employsalary_copy/2018/05';
2. View partitions
SHOW PARTITIONS employsalary_copy;
3. Modifying partitions
ALTER TABLE employsalary_copy PARTITION (year = '2018', month = '05')
SET LOCATION '/user/hive/warehouse/lyz.db/employsalary_copy/2018/05';
4. Deleting a partition
ALTER TABLE employsalary_copy DROP IF EXISTS PARTITION (year = '2018', month = '05');
For managed tables, even when a partition was added with an ALTER TABLE ... ADD PARTITION statement, the data in the partition is deleted along with its metadata when the partition is dropped. For external tables, the data in the partition is not deleted.
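Finally, to see partition filtering pay off, simply reference the partition columns in the WHERE clause; Hive then reads only the matching partition directories instead of scanning the whole table:

```sql
-- Only the partition (year='2018', month='05') is scanned, not the full table
SELECT name, salary
FROM employsalary_copy
WHERE year = '2018' AND month = '05';
```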