Hive: Basic Operations


For reprints, please credit the source: https://blog.csdn.net/l1028386804/article/details/80173778
I. Hive Overview

1. Why Use Hive

The birth of the Hadoop ecosystem brought the promise of fast, efficient big-data processing, but using it requires writing MapReduce or Spark jobs, which sets a high bar for entry: you must master a programming language such as Java or Scala.
Meanwhile, we have long been accustomed to traditional relational databases, and Structured Query Language (SQL) is comparatively easy to learn. If a schema like that of a relational database could be laid over data in the Hadoop file system, users could query and manipulate that data with a SQL-like language. Out of this need, Hive was born.
Hive provides a SQL dialect called the Hive Query Language (HQL) to query data stored in a Hadoop cluster. Hive is roughly analogous to MySQL: where MySQL's underlying storage engine is InnoDB, Hive's engine is Hadoop's MapReduce (or Spark). Hive converts most queries into MapReduce or Spark jobs, cleverly combining the traditional SQL language with the Hadoop ecosystem and making it easy for people who know only SQL to write data-analysis tasks.

2. Hive Application Scenarios

Let us first look at Hive's characteristics; from these, its application scenarios can be determined.
First, Hive is not a complete database; it depends on, and is constrained by, HDFS. The biggest limitation is that Hive does not support record-level UPDATE, INSERT, or DELETE operations.
Second, Hadoop is a batch-oriented system in which starting a job takes considerable time, so Hive query latency is high. Queries that a traditional database answers in seconds can take a long time to execute in Hive.
Third, Hive does not support transactions.
In summary, Hive does not support OLTP (Online Transaction Processing) and is closer to an OLAP (Online Analytical Processing) tool. Even then it is only close to OLAP, because Hive's latency does not satisfy the "online" part of OLAP. Hive is therefore best suited to data-warehouse applications that do not require fast responses: analysis of large volumes of relatively static data, data mining, and the production of decisions or reports.

So what if users need OLTP-style access to large-scale data? In that case we should choose a NoSQL database such as HBase, whose fast random reads can meet real-time query requirements.

II. Data Types and File Formats

1. Basic Data Types

The basic data types in Hive correspond one-to-one with types in Java; for example, STRING corresponds to Java's String, and BIGINT corresponds to Java's long.

Data Type   Description                                      Example
TINYINT     1-byte signed integer                            20
SMALLINT    2-byte signed integer                            20
INT         4-byte signed integer                            20
BIGINT      8-byte signed integer                            20
BOOLEAN     Boolean, TRUE or FALSE                           TRUE
FLOAT       Single-precision floating-point number           3.1415926
DOUBLE      Double-precision floating-point number           3.1415926
STRING      Character sequence; a character set may be       'Hello world', 'Hello worlds'
            specified; single or double quotes may be used
TIMESTAMP   Integer, floating-point number, or string        1327882394 (Unix epoch seconds);
                                                             1327882394.123456789 (epoch seconds
                                                             followed by nanoseconds);
                                                             '2012-02-03 12:34:56' (JDBC-compliant
                                                             java.sql.Timestamp format)
BINARY      Byte array; may contain any bytes

2. Collection Data Types

1. STRUCT

STRUCT('John', 'Doe')

The STRUCT data type is similar to an "object" in Java; element contents are accessed with dot notation. For example, if a column name has the data type STRUCT {first STRING; last STRING}, then the first element can be referenced as name.first.

2.MAP

MAP('first', 'John', 'last', 'Doe')

A MAP is a collection of key-value tuples whose elements are accessed with array notation (e.g. ['key']). For example, if a column name has the data type MAP, where the key-value pairs are 'first'-'John' and 'last'-'Doe', then the last name can be obtained with name['last'].

3.ARRAY

ARRAY('John', 'Doe')

An ARRAY is an ordered sequence of elements of the same type, referenced by zero-based index; for example, name[1] returns 'Doe'.
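To make the three collection types concrete, here is a small sketch of a table definition and query. The table contacts and its columns are illustrative names invented for this example, not from the original:

```sql
-- Hypothetical table illustrating Hive's collection types.
CREATE TABLE IF NOT EXISTS contacts (
  name      STRUCT<first: STRING, last: STRING>,
  phones    MAP<STRING, STRING>,
  nicknames ARRAY<STRING>
);

-- Dot notation for STRUCT, ['key'] for MAP, [index] (zero-based) for ARRAY:
SELECT name.first,       -- first element of the struct
       phones['home'],   -- value stored under the 'home' key
       nicknames[0]      -- first element of the array
FROM contacts;
```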
3. Hive File Formats

Traditional relational databases keep their file formats hidden from the outside; in Hive, the storage format of a file is user-definable.

III. Common HQL Operations

Next we cover the basics of, and common operations on, databases, tables, partitions, and so on in Hive; these are what we use most often.

1. Database Operations

1. View all databases

show databases;

2. Create a database

CREATE DATABASE IF NOT EXISTS lyz;
Hive databases live under the top-level directory specified by the property hive.metastore.warehouse.dir, whose default value is /user/hive/warehouse. Hive creates a directory for each database, and the tables in a database are stored as subdirectories of the database directory. When we create the lyz database, Hive creates the directory /user/hive/warehouse/lyz.db.



One exception is tables in the default database, since that database does not have its own directory.

The user can also modify this default location by using the following command:

CREATE DATABASE IF NOT EXISTS lyz LOCATION '/myown/dir';
3. View database Information
DESCRIBE DATABASE lyz;

4. Deleting a database

DROP DATABASE IF EXISTS lyz;
By default, Hive will not let users drop a database that contains tables. The user must either drop the tables before dropping the database, or append the keyword CASCADE to the command, which tells Hive to drop the database's tables first:

DROP DATABASE IF EXISTS lyz CASCADE;
5. Using the database
To access the tables in a database directly, you must first switch to that database; otherwise table names need the database prefix, e.g. lyz.table:
USE lyz;
2. Table Operations

1. Create a table

CREATE TABLE IF NOT EXISTS lyz.employsalary (
  name         STRING COMMENT 'Employee name',
  salary       FLOAT COMMENT 'Employee salary',
  subordinates ARRAY<STRING> COMMENT 'Names of subordinates',
  deductions   MAP<STRING, FLOAT>
               COMMENT 'Keys are deduction names, values are percentages',
  address      STRUCT<street:STRING, city:STRING, state:STRING>
               COMMENT 'Home address')
COMMENT 'Description of the table'
PARTITIONED BY (year STRING, month STRING)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '|'
  COLLECTION ITEMS TERMINATED BY '\073'
  MAP KEYS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/hive/warehouse/lyz.db/employsalary'
TBLPROPERTIES ('creator' = 'Me', 'created_at' = '2018-05-02 00:00:00');

First, if the current database is not the target database, the table name must be prefixed with the database name, i.e. the lyz in the example.
PARTITIONED BY is the partitioning clause, which is detailed later.
The clauses after ROW FORMAT define how the contents of the underlying file, mentioned earlier, are split into fields.
STORED AS TEXTFILE means the data is stored as plain text, with each line a separate record.
LOCATION '/user/hive/warehouse/lyz.db/employsalary' defines the table's storage location; it can also be omitted, in which case this default location /user/hive/warehouse/lyz.db/employsalary is used.
TBLPROPERTIES records descriptive metadata about the table.
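As an illustration, a raw text line for this table, using the delimiters declared above ('|' between fields, ';' (octal \073) between collection items, ',' between map keys and values), might look like the invented record below. Note that the partition columns year and month do not appear in the file; partition values are encoded in the directory path:

```
John Doe|100000.0|Mary Smith;Todd Jones|Federal Taxes,0.2;State Taxes,0.05|1 Michigan Ave.,Chicago,IL
```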

Note that the table we have created at this point is a managed table, sometimes called an internal table. For such tables Hive controls the life cycle of the data: when a managed table is dropped, Hive also deletes the table's data, which is often not what we want. We therefore usually create an external table by adding the EXTERNAL keyword:

CREATE EXTERNAL TABLE IF NOT EXISTS lyz.employsalary ...

When Hive drops an external table, it does not delete the table's data; only the table's metadata is deleted.

2. View table information

DESCRIBE FORMATTED employsalary_copy;

3. Copying a table
CREATE EXTERNAL TABLE IF NOT EXISTS lyz.employsalary_copy
LIKE lyz.employsalary
LOCATION '/user/hive/warehouse/lyz.db/employsalary_copy';
4. Delete a table
DROP TABLE IF EXISTS employsalary;

5. Inserting data into a table through a query statement

As mentioned earlier, Hive has no row-level inserts, updates, or deletes, so how is data loaded into a table? One way is to place files directly under the table's directory; another is to query an existing table and insert the resulting data into a new table, effectively extracting data from the original table into the new one.
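The first approach, placing files under the table's directory, is usually done with the LOAD DATA statement; a sketch, where the input path is hypothetical:

```sql
-- Move a file from HDFS into the table's partition directory.
-- (Use LOAD DATA LOCAL INPATH to load from the local filesystem instead.)
LOAD DATA INPATH '/tmp/salaries_201804.txt'
OVERWRITE INTO TABLE lyz.employsalary
PARTITION (year = '2018', month = '04');
```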

INSERT OVERWRITE TABLE employ
PARTITION (year='2018', month='04')
SELECT es.name, es.salary FROM employsalary es
WHERE es.year='2018' AND es.month='04';
The OVERWRITE keyword means the previous contents of the partition will be overwritten; if you do not want that, drop the keyword or use the INTO keyword instead.
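For instance, an appending variant of the statement above might look like this (a sketch, assuming the same employ and employsalary tables):

```sql
-- INTO appends to the partition instead of replacing its contents.
INSERT INTO TABLE employ
PARTITION (year='2018', month='04')
SELECT es.name, es.salary FROM employsalary es
WHERE es.year='2018' AND es.month='04';
```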
There is another problem: if the table has many partitions, each execution of such a statement scans the entire employsalary table, which is very wasteful. Hive provides an alternative INSERT syntax that scans the input data only once and then routes it in several directions. The following example imports three months of data into the table employ while scanning the source table employsalary only once:

FROM employsalary es
INSERT OVERWRITE TABLE employ
    PARTITION (year='2018', month='01')
    SELECT es.name, es.salary WHERE es.year='2018' AND es.month='01'
INSERT OVERWRITE TABLE employ
    PARTITION (year='2018', month='02')
    SELECT es.name, es.salary WHERE es.year='2018' AND es.month='02'
INSERT OVERWRITE TABLE employ
    PARTITION (year='2018', month='03')
    SELECT es.name, es.salary WHERE es.year='2018' AND es.month='03';
3. Partition Operations

The benefit of partitioning is self-evident: when a query's WHERE clause includes partition-filter conditions, query performance improves greatly. Partitions are not mandatory, but since Hadoop deals with large volumes of data, you should add partitions to tables wherever possible. Partition fields are also fields of the table and can be queried with SELECT statements.

1. Adding Partitions

ALTER TABLE employsalary_copy ADD PARTITION (year='2018', month='05')
LOCATION '/user/hive/warehouse/lyz.db/employsalary_copy/2018/05';
2. View partitions
SHOW PARTITIONS employsalary_copy;

3. Modifying partitions

ALTER TABLE employsalary_copy PARTITION (year='2018', month='05')
SET LOCATION '/user/hive/warehouse/lyz.db/employsalary_copy/2018/05';
4. Deleting a partition
ALTER TABLE employsalary_copy DROP IF EXISTS PARTITION (year='2018', month='05');
For managed tables, even when the partition was added with an ALTER TABLE ... ADD PARTITION statement, dropping the partition deletes the partition's data along with its metadata. For external tables, the data in the partition is not deleted.
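As noted above, including the partition columns in the WHERE clause lets Hive prune partitions instead of scanning the whole table; a sketch against the employsalary table defined earlier:

```sql
-- Because year and month are partition columns, Hive reads only the
-- year=2018/month=05 partition directory rather than scanning every partition.
SELECT name, salary
FROM lyz.employsalary
WHERE year = '2018' AND month = '05';
```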

