Hive Technology in the Big Data Era: Hive Data Types and Data Models

In the previous article, I walked through a simple Hive example: creating a table named test and loading data into it. These operations resemble relational database operations, and we often compare Hive with relational databases because many Hive concepts have close analogues there.

Like a relational database, Hive has tables and partitions; together these structures are known as the Hive data model. This article introduces Hive's data types, its data model, and its file storage formats, all of which can be understood by analogy with relational database concepts.

First, let's look at Hive's data types.

Hive supports two categories of data types: atomic types and complex types.

The atomic data types include numeric, Boolean, and string, as shown in the following table:

Basic data types:

Type     | Description                                      | Example
---------|--------------------------------------------------|--------------
tinyint  | 1-byte (8-bit) signed integer                    | 1
smallint | 2-byte (16-bit) signed integer                   | 1
int      | 4-byte (32-bit) signed integer                   | 1
bigint   | 8-byte (64-bit) signed integer                   | 1
float    | 4-byte (32-bit) single-precision floating point  | 1.0
double   | 8-byte (64-bit) double-precision floating point  | 1.0
boolean  | true/false                                       | true
string   | character string                                 | 'xia', "xia"

From the table above, we can see that Hive does not support a date type. In Hive, dates are represented as strings, and common date-format conversions are performed through custom functions.

Hive is developed in Java, and Hive's basic data types correspond one-to-one with Java's primitive types, except for string. The signed integer types tinyint, smallint, int, and bigint are equivalent to Java's byte, short, int, and long respectively: 1-byte, 2-byte, 4-byte, and 8-byte signed integers. Hive's float and double floating-point types correspond to Java's float and double, and Hive's boolean is equivalent to Java's boolean.

Hive's string type is equivalent to a database varchar: a variable-length string, except that you cannot declare the maximum number of characters it can store. In theory, it can hold up to 2 GB of characters.

Hive supports conversion between basic types. A narrower type can be converted implicitly to a wider one: tinyint, smallint, and int can all be converted to float, and all integer types, float, and string can be converted to double. These rules mirror Java's type widening, which is natural since Hive is written in Java. Converting a wider type to a narrower one is also supported, but it requires an explicit conversion with Hive's built-in cast function.
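As a quick illustration, here is a minimal sketch of these conversions; the literal values are arbitrary:

SELECT 1 + 1.0;                 -- int is widened to double implicitly: 2.0
SELECT CAST('12.5' AS DOUBLE);  -- a string can be converted to double
SELECT CAST(2.9 AS INT);        -- narrowing double to int requires an explicit cast and truncates: 2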

Complex data types include array, map, and struct, as shown in the following table:

Complex data types:

Type   | Description                                                                                                                         | Example
-------|-------------------------------------------------------------------------------------------------------------------------------------|---------------------
array  | An ordered collection of fields; all fields must be of the same type                                                               | array(1, 2)
map    | An unordered collection of key/value pairs. Keys must be atomic types; values may be of any type. All keys in one map must share a single type, and all values must share a single type | map('A', 1, 'B', 2)
struct | A collection of named fields; the field types may differ                                                                            | struct('A', 1, 1.0)

Let's take a look at creating a Hive table that uses complex data types:

 
CREATE TABLE complex (
    col1 ARRAY<INT>,
    col2 MAP<STRING, INT>,
    col3 STRUCT<a: STRING, b: INT, c: DOUBLE>
);

Query statement:

 
SELECT col1[0], col2['B'], col3.c FROM complex;
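As a side note, here is a hedged sketch of how a delimited text file could be mapped onto these complex types. The table name complex_demo, the delimiters, and the sample line are illustrative assumptions, not part of the original example:

CREATE TABLE complex_demo (
    col1 ARRAY<INT>,
    col2 MAP<STRING, INT>,
    col3 STRUCT<a: STRING, b: INT, c: DOUBLE>
)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY '\t'           -- tab separates the three columns
    COLLECTION ITEMS TERMINATED BY ','  -- ',' separates array elements, struct fields, and map entries
    MAP KEYS TERMINATED BY ':';         -- ':' separates each map key from its value

-- A matching line in the input file would look like:
-- 1,2<TAB>A:1,B:2<TAB>x,1,1.0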

Next, let's look at the Hive data model. The Hive data model consists of databases, tables, partitions, and buckets. I will discuss these four structures one by one.

1. Database: equivalent to a namespace in a relational database; it isolates user tables and database applications into separate databases or schemas. This model is supported in Hive 0.6.0 and later. Hive provides statements such as CREATE DATABASE dbname, USE dbname, and DROP DATABASE dbname.
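In runnable form (dbname is a placeholder name):

CREATE DATABASE dbname;   -- create the namespace
USE dbname;               -- make it the current database
DROP DATABASE dbname;     -- remove it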

2. Table: A Hive table logically consists of the stored data plus the metadata that describes the layout of that data. The data itself lives in a distributed file system such as HDFS, while the metadata is stored in a relational database. When we create a Hive table and have not yet loaded any data into it, the table exists in the distributed file system (e.g. HDFS) as an empty folder (directory). Hive has two kinds of tables: managed tables, whose data files are stored inside the Hive data warehouse, and external tables, whose data files may live in the distributed file system outside the Hive data warehouse (or inside it). (Note: the Hive data warehouse is simply a directory on HDFS; it is the default path for storing Hive data files, it can be configured in the Hive configuration file, and its location is recorded in the metastore.)

The following is an example of creating a managed table:

 

 
CREATE TABLE tuoguan_tbl (flied STRING);
LOAD DATA LOCAL INPATH '/home/hadoop/test.txt' INTO TABLE tuoguan_tbl;

And here is an example of creating an external table:

 
CREATE EXTERNAL TABLE external_tbl (flied STRING)
LOCATION '/home/hadoop/external_table';
LOAD DATA LOCAL INPATH '/home/hadoop/test.txt' INTO TABLE external_tbl;

We can see that the keyword EXTERNAL must be added when creating an external table, and the LOCATION clause specifies where the data files are stored. If you omit LOCATION, the data files are placed in the Hive data warehouse, just as with a managed table.

The difference between the two table types shows up with the DROP command, Hive's command for deleting a table. When a managed table is dropped, both its metadata and its stored data are deleted; when an external table is dropped, only the entry in the metastore is deleted, not the stored data. One more note on the LOAD command: Hive does not validate data against the table schema when loading; it simply moves the files into place. If the source file has the wrong format, this is only discovered at query time, when the malformed fields show up as NULL.
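To make the contrast concrete, here is a minimal sketch using the two tables created above:

DROP TABLE tuoguan_tbl;     -- managed table: metadata and the data files are both deleted
DROP TABLE external_tbl;    -- external table: only the metastore entry is deleted; the files remain on HDFS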

3. Partition: Partitioning in Hive roughly divides a table's data according to the values of a "partition column." In storage, each partition is a subdirectory under the table's main directory (recall that a Hive table is physically a folder), and the subdirectory is named after the partition column we define. People without hands-on experience may assume the partition column is a field inside the table; in fact it is not a table field but an independent column, by whose values we organize the table's data files. Partitions exist to speed up queries over slices of the data: a query that filters on a partition column value does not need a full table scan. Here is a partitioning example:

Create a partition:

 
CREATE TABLE logs (ts BIGINT, line STRING)
PARTITIONED BY (dt STRING, country STRING);

 

Load data:

 
LOAD DATA LOCAL INPATH '/home/hadoop/PAR/file01.txt' INTO TABLE logs PARTITION (dt = '2013-06-02', country = 'cn');

 

The actual storage path in the hive Data Warehouse is as follows:

 
/user/hive/warehouse/logs/dt=2013-06-02/country=cn/file1.txt
/user/hive/warehouse/logs/dt=2013-06-02/country=cn/file2.txt
/user/hive/warehouse/logs/dt=2013-06-02/country=us/file3.txt
/user/hive/warehouse/logs/dt=2013-06-02/country=us/file4.txt

 

We can see that the logs directory contains the subdirectory dt=2013-06-02, which in turn contains the subdirectories country=cn and country=us.

Query operation:

 
SELECT ts, dt, line FROM logs WHERE country = 'cn';

With this filter, the query scans only file1.txt and file2.txt.
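Incidentally, Hive can list a table's partitions with the SHOW PARTITIONS statement; against the logs table above, the output would look roughly like this:

SHOW PARTITIONS logs;
-- dt=2013-06-02/country=cn
-- dt=2013-06-02/country=us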

4. Bucket: Tables and partitions, described above, both split data at the directory level; buckets split the data files themselves. A bucketed table splits the source data files into multiple files according to a fixed rule. To use buckets, we must first tell Hive to enforce bucketing, with the following command:

 
SET hive.enforce.bucketing = true;

The following example is from a Cnblogs post that I am referencing:

Example: create a temporary table student_tmp and check its contents:

hive> DESC student_tmp;
OK
id          int
age         int
name        string
stat_date   string
Time taken: 0.106 seconds

hive> SELECT * FROM student_tmp;
OK
1    20    zxm        20120801
2    21    ljz        20120801
3    19    cds        20120801
4    18    mac        20120801
5    22    android    20120801
6    23    symbian    20120801
7    25    wp         20120801
Time taken: 0.123 seconds

Create the student table:

hive> CREATE TABLE student (id INT, age INT, name STRING)
    > PARTITIONED BY (stat_date STRING)
    > CLUSTERED BY (id) SORTED BY (age) INTO 2 BUCKETS
    > ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Set the environment variable:

hive> SET hive.enforce.bucketing = true;

Insert data:

hive> FROM student_tmp
    > INSERT OVERWRITE TABLE student PARTITION (stat_date = "20120802")
    > SELECT id, age, name WHERE stat_date = "20120801" SORT BY age;

View the file directory:

$ hadoop fs -ls /user/hive/warehouse/student/stat_date=20120802/
Found 2 items
-rw-r--r--   1 work supergroup   31   /user/hive/warehouse/student/stat_date=20120802/000000_0
-rw-r--r--   1 work supergroup   39   /user/hive/warehouse/student/stat_date=20120802/000001_0

Physically, each bucket is a file in the table (or partition) directory. Hive hashes the value of the bucketing column and takes the remainder modulo the number of buckets (2 in the example above) to decide which file a row lands in. Because the hash of an integer is the integer itself, the id values hash to themselves, and the remainders modulo 2 are just 0 and 1; this is why the generated files carry the suffixes 000000_0 and 000001_0. The corresponding metadata is recorded in the metastore.
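To see the assignment rule itself, here is a small sketch using Hive's built-in hash() and pmod() functions to compute which of the two buckets each id would land in (assuming the 2-bucket student table above):

SELECT id, pmod(hash(id), 2) AS bucket FROM student_tmp;
-- For an integer column, hash(id) = id, so even ids map to bucket 0 (file 000000_0)
-- and odd ids map to bucket 1 (file 000001_0).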

As for Hive buckets, my personal view is that unless you have a special scenario or special queries, you do not need to use them; that is, there is no need to enable the bucketing configuration. Buckets help in only a limited set of scenarios: one is map-side joins, which I will cover in a later article; the other is sampling. Here is a sampling example from the same blog post:

View sampled data:

hive> SELECT * FROM student TABLESAMPLE(BUCKET 1 OUT OF 2 ON id);
Total MapReduce jobs = 1
Launching Job 1 out of 1
.......
OK
4    18    mac        20120802
2    21    ljz        20120802
6    23    symbian    20120802
Time taken: 20.608 seconds

TABLESAMPLE is Hive's sampling clause. Its syntax is TABLESAMPLE(BUCKET x OUT OF y), where y must be a multiple or a factor of the table's total number of buckets. Hive determines the sampling ratio from y: for example, if the table is divided into 64 buckets, then y = 32 extracts the data of 2 buckets (64/32 = 2), while y = 128 extracts the data of half a bucket (64/128 = 1/2). x indicates the bucket at which extraction starts. For example, if the table has 32 buckets in total, TABLESAMPLE(BUCKET 3 OUT OF 16) extracts the data of 2 buckets (32/16 = 2): the 3rd bucket and the 19th bucket (3 + 16 = 19).
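One related note: sampling does not strictly require a bucketed table. Hive also accepts TABLESAMPLE with ON rand(), which samples any table at the cost of a full scan; a minimal sketch, assuming the student_tmp table above:

SELECT * FROM student_tmp TABLESAMPLE(BUCKET 1 OUT OF 2 ON rand());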

I'll stop writing here for today; I may not have time to write after work tomorrow. This post's planned content is not yet complete (the section on Hive storage formats is still unwritten), and since this chapter's material is important and key to understanding Hive, please study it carefully. Tomorrow I will try to write the article on Hive storage formats. The day after tomorrow I will present Hive-related technology to our technical department, and writing these posts is my rehearsal.

Finally, I want to share my view of big data technology. I believe big data is an epoch-making technology: the future of Internet technology and of cloud computing. Its continued development will change not only data processing but the entire ecosystem of Internet technology, including the tools and languages we use. I am glad to be experiencing this great transformation, and I want to embrace the age that is arriving.
