Preliminary knowledge of Hive


Hive is a data warehouse built on top of Hadoop.

1) Data computation is handled by MapReduce

2) Data storage is handled by HDFS

Understanding Hive

Hive is a data warehouse analysis system built on Hadoop. It provides a rich SQL-like query capability for analyzing data stored in the Hadoop Distributed File System: structured data files are mapped onto database tables, and SQL statements are converted into MapReduce jobs to run. Users write their own SQL to query and analyze whatever they need; this SQL dialect is called Hive SQL (HQL), so users unfamiliar with MapReduce can easily use a SQL-like language to query, summarize, and analyze data. At its core, the work is still done by MapReduce jobs.
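As a hedged illustration of how a query becomes a MapReduce job, the sketch below assumes a hypothetical table page_views; Hive's EXPLAIN statement prints the stages (including the map/reduce stages) generated for a simple aggregation.

-- Hypothetical table: page_views(user_id STRING, url STRING, view_time STRING)
-- EXPLAIN prints the execution plan, i.e. the MapReduce stages Hive generates for the query
EXPLAIN
SELECT url, COUNT(*) AS pv
FROM page_views
GROUP BY url;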

Common Application Scenarios for Hive

1. Log analysis

1) Computing a website's PV (page views) and UV (unique visitors) over a given time period (see the sketch after this list)

2) Analyzing data from different dimensions

2. Offline analysis of massive amounts of structured data
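A minimal sketch of the PV/UV case, assuming a hypothetical log table web_log with a dt partition column, a cookie_id column identifying visitors, and a url column; PV is the row count and UV the number of distinct visitors.

-- Hypothetical table: web_log(cookie_id STRING, url STRING, ...) partitioned by dt STRING
-- PV = total page views, UV = distinct visitors, per day over a given period
SELECT dt,
       COUNT(*)                  AS pv,
       COUNT(DISTINCT cookie_id) AS uv
FROM web_log
WHERE dt BETWEEN '2016-01-01' AND '2016-01-07'
GROUP BY dt;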

The benefits of Hive

1. Simple and easy to get started with

2. Designed for very large datasets, with the compute and scaling capabilities to match

3. Provides unified metadata management

The disadvantages of Hive

1. Hive's HQL has limited expressive power

1) Iterative algorithms, such as PageRank, cannot be expressed

2) Neither can many data mining algorithms, such as K-means

2. Hive's efficiency is relatively low

1) The MapReduce jobs Hive generates automatically are usually not intelligent enough

2) Hive is relatively difficult to tune

3) Hive offers relatively limited control

Basic framework of Hive

Components of Hive

1. User interface

CLI, JDBC/ODBC, WebUI

2. Metadata storage (Metastore)

The Derby database by default; in real-world use, a MySQL database

3. Driver

Interpreter, compiler, optimizer, executor

4. Hadoop Distributed Cluster

Uses MapReduce for distributed computation and HDFS for distributed storage

How Hive Works

MapReduce developers can write their own Mapper and Reducer and plug them in so that Hive can perform more complex data analysis. Hive's SQL dialect differs slightly from the SQL of relational databases, but it supports most statements (such as DDL and DML) as well as common aggregation functions, join queries, conditional queries, and so on.
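As a hedged example of the statement types mentioned above (join, conditional, and aggregation queries), the query below assumes two hypothetical tables, orders and users.

-- Hypothetical tables: orders(user_id INT, amount DOUBLE, status STRING)
--                      users(user_id INT, city STRING)
-- A join, a conditional filter, and an aggregation in a single HiveQL statement
SELECT u.city, SUM(o.amount) AS total_amount
FROM orders o
JOIN users u ON o.user_id = u.user_id
WHERE o.status = 'PAID'
GROUP BY u.city
HAVING SUM(o.amount) > 1000;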

Hive is not suited to online transaction processing, nor does it provide real-time query functionality. It works best for batch jobs over large amounts of immutable data. Hive is characterized by scalability (devices can be added to the Hadoop cluster dynamically), extensibility, fault tolerance, and loose coupling with input formats. The entry point of Hive is the Driver: a submitted SQL statement is first handed to the Driver, which calls the Compiler to parse and compile it; the resulting MapReduce job is then executed and the results are returned.

Hive Data Types

Hive provides basic data types as well as complex data types that are not available in the Java language. The following describes both kinds of Hive data types and the conversions between them.

Hive does not support a date type; dates in Hive are represented as strings, and commonly used date format conversions are performed with built-in or custom functions.
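For example, assuming a hypothetical table access_log with a STRING column dt_str holding values such as '2016-01-15 10:30:00', Hive's built-in functions cover the common conversions (a sketch, not an exhaustive list).

-- Dates are stored as strings; built-in functions handle the usual conversions
SELECT to_date(dt_str)                                      AS day_part,       -- '2016-01-15'
       unix_timestamp(dt_str)                               AS epoch_seconds,  -- seconds since epoch
       from_unixtime(unix_timestamp(dt_str), 'yyyy/MM/dd')  AS reformatted,    -- custom format
       datediff('2016-01-31', to_date(dt_str))              AS days_remaining  -- difference in days
FROM access_log;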

Hive is developed in Java, and Hive's basic data types correspond one to one with Java's basic data types, except for the string type. The signed integer types TINYINT, SMALLINT, INT, and BIGINT are equivalent to Java's byte, short, int, and long primitive types, i.e. 1-byte, 2-byte, 4-byte, and 8-byte signed integers, respectively. Hive's floating-point types, FLOAT and DOUBLE, correspond to Java's float and double primitive types, and Hive's BOOLEAN type corresponds to Java's boolean.

Hive's STRING type is comparable to a database VARCHAR type: it is a variable-length string, but you cannot declare how many characters it may hold; in theory it can store up to 2 GB of characters.
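A minimal sketch of a table definition using the primitive types described above (the table name and columns are hypothetical):

-- Each column uses one of Hive's primitive types, mirroring the Java mapping above
CREATE TABLE user_profile (
  id        BIGINT,   -- Java long, 8 bytes
  age       TINYINT,  -- Java byte, 1 byte
  height    SMALLINT, -- Java short, 2 bytes
  score     INT,      -- Java int, 4 bytes
  weight    FLOAT,    -- Java float
  balance   DOUBLE,   -- Java double
  is_active BOOLEAN,  -- Java boolean
  name      STRING    -- variable-length string
);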

Complex data types

Hive has three complex data types: ARRAY, MAP, and STRUCT. ARRAY and MAP are similar to arrays and maps in Java, while STRUCT is similar to a struct in C, encapsulating a collection of named fields. Complex types allow arbitrary levels of nesting.

A declaration of a complex data type must use angle brackets to indicate the types of the data fields inside it. The statement below defines three columns, each with a different complex data type.

CREATE TABLE complex (
  col1 ARRAY<INT>,
  col2 MAP<STRING, INT>,
  col3 STRUCT<a:STRING, b:INT, c:DOUBLE>
);
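A hedged sketch of how such columns are accessed in queries: array elements by index, map values by key, and struct fields with dot notation.

-- Accessing the complex columns of the table defined above
SELECT col1[0]      AS first_element,   -- ARRAY: indices start at 0
       col2['key1'] AS value_for_key1,  -- MAP: look up by key (hypothetical key name)
       col3.a       AS struct_field_a   -- STRUCT: dot notation
FROM complex;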

Type conversions

Hive's atomic data types can be implicitly converted, similar to type conversion in Java. For example, if an expression expects an INT, a TINYINT value is automatically converted to INT. Hive will not convert in the reverse direction, however: if an expression expects TINYINT, an INT value is not automatically converted to TINYINT, and an error is returned unless the CAST operation is used.

The implicit type conversion rules are as follows.

1. Any integer type can be implicitly converted to a wider type; for example, TINYINT can be converted to INT, and INT can be converted to BIGINT.

2. All integer types, FLOAT, and STRING can be implicitly converted to DOUBLE.

3. TINYINT, SMALLINT, and INT can all be converted to FLOAT.

4. The BOOLEAN type cannot be converted to any other type.

You can use the CAST operation to convert data types explicitly. For example, CAST('1' AS INT) converts the string '1' to the integer 1; if the conversion fails, as in CAST('X' AS INT), the expression returns NULL.
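A short sketch of explicit casts using inline literals (recent Hive versions allow a SELECT without a FROM clause; on older versions, select the same expressions from an existing table):

-- Explicit conversions with CAST; an unparseable value yields NULL rather than an error
SELECT CAST('1' AS INT)    AS ok_int,     -- 1
       CAST('X' AS INT)    AS bad_int,    -- NULL
       CAST(1   AS DOUBLE) AS widened,    -- 1.0
       CAST(3.7 AS INT)    AS truncated;  -- 3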

Hive Installation and Deployment - Lab Environment

Hive itself is quite simple: it is a single server rather than a distributed system, so it can be deployed on a single node. This setup is suitable for a test environment.

Hive Installation and Deployment - Real World
