Hive Basics for Offline Processing

Introduction
Hive was contributed to the Hadoop open source community by Facebook.
It lets OLAP analysts run data analysis with simple SQL statements, without needing a programming background. The principle is simple: a HiveQL statement is first parsed and compiled into an executable MapReduce plan, which is then handed to Hadoop for execution.
Because Hadoop is a batch system, Hive is also high-latency. It does not provide data sorting or query caching, nor online transactions or record-level updates.
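To see the plan Hive generates for a statement, you can prefix it with EXPLAIN. A minimal illustration, assuming a user table like the one used in the examples below:

EXPLAIN
SELECT name, count(*) FROM user GROUP BY name;
-- prints the stage plan (the MapReduce stages) that will be submitted to Hadoop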

The main components of Hive:
Driver: the core component. It contains the compiler, optimizer, and executor, and turns SQL into MapReduce tasks.
MetaStore: stores Hive's metadata. Metadata is generally kept in a relational database, most commonly MySQL, or in Derby, but the latter is only suitable for simple tests.
CLI: the command line interface; the responsible class is CliDriver in hive-cli.
Thrift Server: provides a JDBC interface for scalable and cross-language services.
Hive Web Interface (HWI): in short, a browser UI for operating Hive.

I find HWI the most convenient and easy to use.
Briefly, configuring HWI:
1. Download the src package. The Hive release we normally download is the bin package, so it is best to also download the src package of the matching version. Note that it is best to use a 1.x version; in version 2 and above there is no hwi jar in hive/lib (personal test result).
Package all files under the web directory of the src package into a WAR file, and configure hwi-site.xml for HWI (see the sketch after this list).
2. Download Ant and configure it.
3. Add the dependency packages: they come from Ant, from Tomcat, and tools.jar from the JDK.
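A minimal hwi-site.xml sketch. The property names are the standard HWI settings; the host, port, and WAR path are assumptions and must match your own build:

<configuration>
  <property>
    <name>hive.hwi.listen.host</name>
    <value>0.0.0.0</value> <!-- address HWI listens on (assumed) -->
  </property>
  <property>
    <name>hive.hwi.listen.port</name>
    <value>9999</value> <!-- web UI port (assumed) -->
  </property>
  <property>
    <name>hive.hwi.war.file</name>
    <value>lib/hive-hwi.war</value> <!-- the WAR packaged above, relative to HIVE_HOME (assumed) -->
  </property>
</configuration>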

DDL operation
Table creation is mainly divided into internal (managed) tables and external tables.
An internal table stores the data itself: each load pulls the whole file into the table's directory, and dropping the table also deletes the data.
An external table only stores a reference to an external location, so dropping it does not delete the source files.

Points to note when creating a table:
1. The order of clauses. For example, when creating an external table, LOCATION is the last clause; if you put it in the middle you will get an EOF / missing-token parse error. The order is: table name and columns, partition clause, bucket (sort) clause, row format and file type, LOCATION. See the example after this list.
2. The scope of the path. An imported file path is generally hdfs://localhost:9000/user/root + 'path', i.e. the prefix is already implied (the prefix is configurable). The LOCATION path has no such prefix: if the directory is hdfs://localhost:9000/user/external, you need to write /user/external.
3. The file format and delimiters must be declared, because if Hive cannot recognize the format of the data you load, every column will be NULL. FIELDS TERMINATED BY '' LINES TERMINATED BY '\n' means columns are separated by the character given in FIELDS TERMINATED BY and rows are separated by '\n'.
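A sketch of an external table that follows this clause order (the table name, columns, delimiter, and directory are hypothetical):

CREATE EXTERNAL TABLE ext_user (
  id INT,
  name STRING
)
PARTITIONED BY (dt STRING)        -- optional partition clause comes after the columns
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'       -- must match the delimiter used in the files
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/external';        -- LOCATION last, without the hdfs://host:port prefix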

Modify the table
Rename
ALTER TABLE old_name RENAME TO new_name
Add column
ALTER TABLE table_name ADD COLUMNS (new_col INT COMMENT 'note the quotes')
Change table attributes
ALTER TABLE table_name SET FILEFORMAT TEXTFILE (change the storage format)
ALTER TABLE table_name SET SERDEPROPERTIES ('field.delim'='\t') (change the field delimiter)
ALTER TABLE table_name SET TBLPROPERTIES ('EXTERNAL'='TRUE') (convert an internal table to an external table)

Partition operation
Add partition
ALTER TABLE table_name ADD PARTITION (name=value)
If the table does not have a partition column with this name, you get: ValidationFailureSemanticException table is not partitioned but partition spec exists: {id=1}
The point is that the table is not a partitioned table, so you must declare the partition columns with PARTITIONED BY when creating the table before you can add partitions later (see the sketch below).
ALTER TABLE table_name DROP PARTITION (name=value) (Delete a partition)
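A sketch of declaring the partition column at creation time and then adding and dropping a partition (the table, column, and value are hypothetical):

CREATE TABLE logs (msg STRING)
PARTITIONED BY (dt STRING);                          -- declares the partition column

ALTER TABLE logs ADD PARTITION (dt='2019-01-01');    -- now ADD PARTITION succeeds
ALTER TABLE logs DROP PARTITION (dt='2019-01-01');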

Delete table
DROP TABLE table_name
TRUNCATE TABLE my_table (deletes only the data in the table, leaving the table structure unchanged)
TRUNCATE cannot be used on an external table, because Hive does not manage the external table's data; only its metadata is in the metastore.

DML operation
Importing data into a table
LOAD DATA [LOCAL] INPATH '' [OVERWRITE] INTO TABLE table_name
If the file is already in HDFS, LOCAL is not needed. OVERWRITE INTO overwrites the target table or partition, affecting only the data of that partition; to append instead, omit OVERWRITE.
Hive moves the files from the source directory into the table's directory; if a file name conflicts, the old file is replaced by the new one.
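A sketch of both variants (the file paths, tables, and partition value are hypothetical):

-- load a local file and append to the table
LOAD DATA LOCAL INPATH '/tmp/user.txt' INTO TABLE user;

-- load a file already in HDFS and overwrite a single partition
LOAD DATA INPATH '/user/root/logs.txt' OVERWRITE INTO TABLE logs PARTITION (dt='2019-01-01');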

Insert query results
INSERT OVERWRITE TABLE table_name [PARTITION (...)] SELECT ...
A query result can be inserted into a table or a table partition. OVERWRITE forces the existing data to be replaced, and the output format and serialization are determined by the table's metadata. Hive also supports multi-table insert, which reduces the number of scans of the source table (see the sketch below).
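A multi-table insert sketch: the source table is scanned once and feeds two targets (all table and column names are hypothetical):

FROM user
INSERT OVERWRITE TABLE user_names SELECT name WHERE id < 100
INSERT OVERWRITE TABLE user_name_counts SELECT name, count(*) GROUP BY name;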

Write query results to the file system
INSERT OVERWRITE [LOCAL] DIRECTORY dir SELECT...
dir can be a full URI. If no scheme or authority is given, Hive uses the fs.default.name configured for Hadoop. For example, my default dir prefix is hdfs://localhost:9000/user/root.
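A short sketch writing a query result to the local file system (the directory and query are hypothetical):

INSERT OVERWRITE LOCAL DIRECTORY '/tmp/user_export'
SELECT * FROM user WHERE id < 100;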

SQL operations
SELECT ... WHERE is not much different from ordinary SQL statements.
Such as select * from user where id=1

Duplicate handling
By default, select name from user returns every name, including duplicates.
To filter out the duplicates, use select distinct name from user.

Using regular expressions
https://www.cnblogs.com/Allen-rg/p/9323506.html
Regular expressions can be used to filter data.
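For example, Hive's RLIKE operator filters rows with a Java regular expression (the table, column, and pattern are hypothetical):

-- keep only names that start with 'Zhang'
SELECT * FROM user WHERE name RLIKE '^Zhang.*';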

Query based on partition
select * from table_name where table_name.partition_col < 100
If table_name is partitioned by partition_col, this query does not perform a global scan; it first locates the partitions that meet the condition and then queries only those.

Aggregation
GROUP BY produces aggregate operations,
such as select name from user group by name.
This groups the rows by name, but in the process the values of other columns may become multi-valued: if there are two rows with name=Zhangsan, the id column for that group has two values. Hive does not accept this the way some relational databases do; the id must be reduced to a single value with an aggregate such as max or count (see the sketch below).
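A sketch of the point above (the user table and columns are hypothetical):

-- fails: id is neither grouped nor aggregated
-- SELECT name, id FROM user GROUP BY name;

-- works: id is reduced to one value per group
SELECT name, count(id) AS cnt, max(id) AS max_id
FROM user
GROUP BY name;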

Join operations
Partitioning
A table is divided into partitions according to the partition field. Partitioning makes queries that only touch part of the data faster. Tables or partitions can additionally be divided into buckets, which adds extra structure to the data that can be exploited for efficient queries; for example, the user table can be bucketed by id.

The partition field can be a date or a timestamp. When querying by that field, there is of course no global scan (very similar to a MySQL index). Partitions can also be multi-dimensional, which establishes a directory hierarchy, for example partitioning by day and then by hour: the top-level directory is the day, and below each day are the hours.
Dynamic partitioning is turned on with:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
Default value: strict
Description: strict prevents every partition field from being dynamic; at least one partition field must be specified statically, to avoid creating a huge number of partitions.
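A dynamic-partition insert sketch: the partition value is taken from the last column of the SELECT instead of being written literally (the tables and columns are hypothetical):

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

INSERT OVERWRITE TABLE logs PARTITION (dt)    -- dt is filled in per row
SELECT msg, dt FROM staging_logs;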

The reason for using buckets is efficient querying and efficient sampling. The specified field is hashed and the remainder decides which bucket a row goes into. In a map-side join the two sides do not have to have the same number of buckets, only a multiple of each other. The system also provides hive.enforce.bucketing=true to hand the bucketing over to Hive, and set hive.enforce.sorting=true turns on forced sorting, so data inserted into the table is sorted; the default is false.
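A bucketed-table sketch (the names and bucket count are hypothetical):

set hive.enforce.bucketing=true;                -- let Hive create the buckets on insert

CREATE TABLE user_bucketed (id INT, name STRING)
CLUSTERED BY (id) INTO 4 BUCKETS;               -- hash(id) % 4 decides the bucket

INSERT OVERWRITE TABLE user_bucketed
SELECT id, name FROM user;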

We find that the concept of buckets is exactly the concept of MapReduce partitions: physically, each bucket is a file in the table (or partition) directory, and the number of buckets (output files) produced by a job equals the number of reduce tasks.
A partitioned table is a different concept: a partition is a folder in the data warehouse directory, and data files are placed under each folder. Queries can select files through the folder, but the folder itself has nothing to do with the content of the data files.
Buckets are divided according to a value in the data content: a large file is hashed into small files.
These small files can be sorted individually. If another table is divided into small files by the same rule, then when the two tables are joined there is no need to scan the entire tables; only the data in matching buckets has to be compared, which of course greatly improves efficiency.
Similarly, when sampling data there is no need to scan the whole file; you only need to extract part of the data from each bucket by the same rule.
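A sampling sketch using TABLESAMPLE on a bucketed table (it assumes the user_bucketed table bucketed by id from the sketch above; the bucket numbers are illustrative):

-- read roughly a quarter of the data: the first bucket out of four
SELECT * FROM user_bucketed TABLESAMPLE (BUCKET 1 OUT OF 4 ON id);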