Basic Hive learning documents and tutorials

Source: Internet
Author: User

Abstract:

Hive is a data warehouse infrastructure built on top of Hadoop. It provides a series of tools for extracting, transforming, and loading (ETL) data, and a mechanism for storing, querying, and analyzing large-scale data stored in Hadoop. Hive defines a simple SQL-like query language, called QL, that lets users familiar with SQL query the data. The same language also lets developers familiar with MapReduce plug in custom mappers and reducers to handle complex analysis tasks that the built-in operators cannot express.

Directory:

  • HIVE Structure
  • HIVE metabase
  • DERBY
  • MySQL
  • HIVE Data Storage
  • Other HIVE operations
  • Basic HIVE operations
  • Create Table
  • Alter Table
  • Insert
  • Inserting data into Hive tables from queries
  • Writing data into filesystem from queries
  • Hive Select
  • Group By
  • Order / Sort
  • Hive Join
  • HIVE parameter settings
  • HIVE UDF
  • Basic functions
  • UDTF
  • Explode
  • HIVE MAPREDUCE
  • Notes for using HIVE
  • Insert
  • Optimization
  • HADOOP computing framework features
  • Common Optimization Methods
  • Full sorting
  • Example 1
  • Example 2
  • JOIN
  • JOIN Principle
  • Map Join
  • HIVE FAQ
  • Common Reference Path

1. HIVE Structure


    1.1 HIVE Architecture

    The Hive structure can be divided into the following parts:

    · User interfaces: including CLI, Client, and WUI

    · Metadata store: usually a relational database such as MySQL or Derby

    · Interpreter, compiler, optimizer, and executor

    · Hadoop: Uses HDFS for storage and MapReduce for computing

    1. There are three main user interfaces: CLI, Client, and WUI. The most common is the CLI; starting the CLI also starts a local copy of Hive. The Client is the Hive client, which connects to a Hive Server: when starting in client mode, you must specify the node where the Hive Server runs and start the Hive Server on that node. The WUI accesses Hive through a browser.
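    As a sketch, the three interfaces map onto the classic Hive launch commands below (hostnames and ports are placeholders; exact service names can vary between Hive releases):

```shell
# Interactive CLI: starts a local Hive session.
hive

# Client/server mode: start Hive Server on the chosen node first...
hive --service hiveserver

# ...then connect a CLI client to that node.
hive -h hiveserver-host -p 10000

# Web interface (WUI): start the Hive Web Interface service,
# then browse to http://<node>:9999/hwi
hive --service hwi
```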

    2. Hive stores its metadata in a relational database, such as MySQL or Derby. Hive metadata includes the table name, the table's columns and partitions and their attributes, the table's properties (for example, whether it is an external table), and the directory where the table's data lives.
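    For example, pointing the metastore at MySQL instead of the default embedded Derby is done in hive-site.xml. The property names below are the standard metastore settings; the host, database name, and credentials are placeholders:

```xml
<!-- hive-site.xml: store Hive metadata in MySQL instead of embedded Derby -->
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://metastore-host:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive_password</value>
</property>
```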

    3. The interpreter, compiler, and optimizer carry an HQL statement through lexical analysis, syntax analysis, compilation, optimization, and query plan generation. The generated query plan is stored in HDFS and later executed by MapReduce.

    4. Hive data is stored in HDFS, and most queries are executed as MapReduce jobs (queries such as select * from tbl are an exception: they do not generate a MapReduce task).
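    To illustrate the distinction (the table and column names below are made up for the example):

```sql
-- Served by reading the table's HDFS files directly:
-- no MapReduce job is launched.
SELECT * FROM page_view;

-- Launches a MapReduce job: the WHERE filter and GROUP BY
-- aggregation are compiled into map and reduce stages.
SELECT country, COUNT(1)
FROM page_view
WHERE dt = '2012-01-01'
GROUP BY country;
```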

    1.2 Hive's Relationship with Hadoop

    Hive is built on Hadoop:

    · Hive completes interpretation, optimization, and generation of query plans for query statements in HQL.

    · All data is stored in Hadoop.

    · The query plan is converted to MapReduce tasks and executed in Hadoop (some queries do not have MR tasks, such as select * from table)

    Hadoop and Hive are both coded in UTF-8.

    1.3 Similarities and Differences Between Hive and Conventional Relational Databases

                          Hive           RDBMS
    Query language        HQL            SQL
    Data storage          HDFS           Raw device or local FS
    Index                 None           Yes
    Execution             MapReduce      Executor
    Execution latency     High           Low
    Data processing scale Large          Small

    1. Query language. SQL is widely used in data warehouses, so HQL was designed as a SQL-like query language. Developers familiar with SQL can easily use Hive for development.

    2. Data storage location. Hive is built on Hadoop, and all Hive data is stored in HDFS. Databases store data on raw block devices or in local file systems.

    3. Data format. Hive does not define a special data format; the user specifies it. A custom data format needs three attributes: the column separator (usually space, "\t", or "\x001"), the row separator ("\n"), and the method for reading file data (Hive has three default file formats: TextFile, SequenceFile, and RCFile). During loading, user data does not need to be converted into a Hive-defined format, so Hive never modifies the data itself; it only copies or moves the files into the corresponding HDFS directory. In a database, by contrast, each storage engine defines its own data format and all data is stored according to that organization, so loading data into a database is time-consuming.
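    As a minimal sketch of what this means for a TextFile table, the snippet below parses one line using Hive's default column separator '\x01' (Ctrl-A) and '\n' row separator; the sample data and the helper name are invented for the example:

```python
def parse_hive_row(line, n_cols):
    """Split one TextFile line on the default Hive field delimiter '\\x01'."""
    fields = line.rstrip("\n").split("\x01")
    # Pad missing trailing columns with None, mirroring Hive's NULL padding.
    fields += [None] * (n_cols - len(fields))
    return fields[:n_cols]

row = parse_hive_row("1\x01alice\x01beijing\n", 4)
print(row)  # ['1', 'alice', 'beijing', None]
```

    Because loading is just a file copy or move, reading the raw HDFS file back with such a parser recovers exactly the bytes that were loaded.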

    4. Data update. Because Hive is designed for data warehouse applications, whose content is written once and read many times, Hive does not support rewriting or appending to existing data: all data is fixed at load time. Data in a database usually needs frequent modification, so you can add rows with INSERT ... VALUES and modify them with UPDATE ... SET.
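    In practice, "updating" a Hive table therefore means reloading or rewriting a whole table or partition, along these lines (table names and paths are illustrative):

```sql
-- Load time fixes the data: LOAD DATA only copies/moves files
-- into the table's HDFS directory.
LOAD DATA LOCAL INPATH '/tmp/page_view.txt'
OVERWRITE INTO TABLE page_view PARTITION (dt = '2012-01-01');

-- "Updates" are expressed by rewriting an entire partition
-- from another table, not by modifying rows in place.
INSERT OVERWRITE TABLE page_view PARTITION (dt = '2012-01-01')
SELECT * FROM staging_page_view;
```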

    5. Index. As noted above, Hive does not process or even scan the data during loading, so it does not build indexes on any keys in the data. To access particular values that satisfy a condition, Hive must brute-force scan all the data, so access latency is high. Thanks to MapReduce, however, Hive can access the data in parallel, so even without indexes it still has an advantage on large data volumes. A database usually builds indexes on one or more columns, so for accessing a small amount of data under specific conditions it achieves high efficiency and low latency. This high data-access latency means Hive is not suitable for online data queries.

    6. Execution. In Hive, most queries are executed through Hadoop's MapReduce (queries such as select * from tbl do not require MapReduce). Databases usually have their own execution engines.
