"Go" HiveQL: Learning about data definition

Source: Internet
Author: User

Databases in 1.Hive:

It is a table directory or namespace that avoids table naming conflicts, and we typically use databases to organize production tables into logical groups.

Basic command:

(1) Create a database (if the database does not exist):

Create database if not exists time;

(2) View all the databases contained in hive:

show databases;

Note: Hive creates a directory for each database, and the table in the database is stored as a subdirectory of the database directory, with an exception to the table in the default database. Tables in the default directory are stored directly under Hive.metastore.warehouse.dir.

We can set the property hive.cli.print.current.db=true to let the CLI of hive display indicate which database is currently working under:

As above, first in the default directory, using the use time, after the command switch to the time database.

2. Storage of data in hive:

Divided into meta-data, "real" data and logs:

(1) Metadata: By default, the hive metadata is saved in the built-in Derby database, allowing only one session connection and only for simple testing. To support multi-user

, I need a separate meta-database, I use MySQL as a meta-database, the Hive inside of the MySQL provides a good support.

We can go into MySQL to view the metadata for the table in hive:

The default metadata tables are:

The main related tables are:

For example: We view the TBLs table for hive metadata information in MySQL:

As a control, we look at whether there are log tables and Hive_test tables in hive:

Note: There are three modes during the installation of hive:

Built-in mode: metadata remains in the inline derby mode, allowing only one session to connect

Local Standalone mode: Install MySQL locally and put the metadata inside MySQL

Remote mode: Metadata is placed in the remote MySQL database

(2) "real" data: exists on HDFs.

(3) Log: There is a local/tmp/${user_name}/hive.log (for example:/tmp/hadoop/hive.log is the author's storage path)

3. Tables in hive: divided into internal tables (also called management tables) and external tables.

(1) If you create an internal table, the data is moved to the path that the data warehouse points to. When the internal table is deleted, the metadata and data of the internal table are deleted together;

(2) If an external table is created, only the path of the data is recorded, and no changes are made to the location of the data (that is, the data already exists on HDFS, and the newly created table points to it). External tables only delete metadata and do not delete data. So the outer table is relative to

More secure, the data organization is more flexible and facilitates the sharing of source data.

4. Partitioning the table:

(1) In a hive select query, the entire table content is generally scanned, which consumes a lot of time to do unnecessary work. Sometimes you just need to scan a subset of the data in the table

This built-in table introduces the partition concept-faster queries for improved performance!

(2) A partitioned table refers to the partition space of the partition specified when the table was created. If you need to create a partitioned table, you need to call the optional parameter when creating the table

Partitioned by.

Eg:create Table Employees (name string, salary string) partitioned by (country string, state string);

At this point we just declare that the Employees table has a partition structure, but we have not created the partition, we can create the partition by loading the data, or we can add, modify, and delete the partition through the ALTER command. The following author uses the first method to create a partition and upload data:

Eg:load data local inpath '/home/hadoop/extend/data_employees ' into table employees partition (country= ' US ', state= ' CA ') ;

Hive creates the corresponding directory for this partition and uploads the file to this directory.

Note: Hive does not verify that the data loaded by the user and the schema of the table match. However, Hive verifies that the file format is consistent with the table structure definition, that is, if the storage format defined when the table is created is Sequencefile, the loaded file must also be in the Sequencefile format.

"Go" HiveQL: Learning about data definition

Related Article

Contact Us

The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion; products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the content of the page makes you feel confusing, please write us an email, we will handle the problem within 5 days after receiving your email.

If you find any instances of plagiarism from the community, please send an email to: info-contact@alibabacloud.com and provide relevant evidence. A staff member will contact you within 5 working days.

A Free Trial That Lets You Build Big!

Start building with 50+ products and up to 12 months usage for Elastic Compute Service

  • Sales Support

    1 on 1 presale consultation

  • After-Sales Support

    24/7 Technical Support 6 Free Tickets per Quarter Faster Response

  • Alibaba Cloud offers highly flexible support services tailored to meet your exact needs.