Hive (recommended from a Hadoop technology blog)

Source: Internet
Author: User

What exactly is Hive?

Hive was originally created at Facebook, in response to the need to manage, and run machine learning over, the massive social-network data it generates every day. So how is Hive actually defined? The Hive project's wiki puts it as follows:

The Apache Hive data warehouse software facilitates querying and managing large datasets residing in distributed storage. Built on top of Apache Hadoop, it provides the following features:

A range of tools that make it easy to extract/transform/load (ETL) data;

A mechanism for storing, querying, and analyzing large-scale data stored in HDFS (or HBase);

Query execution via MapReduce. (Not every query needs a MapReduce job; for example, SELECT * FROM xxx does not, and since Hive 0.11 simple queries such as SELECT a, b FROM xxx can also skip MapReduce when the fetch task is enabled via configuration; see "Hive: simple queries enable the fetch task without a MapReduce job".)

From this definition we can see that Hive is a data warehouse architecture built on top of the Hadoop file system, used to analyze and manage data stored in HDFS. So how do we analyze and manage that data?

Hive defines a SQL-like query language known as HQL, so users familiar with SQL can query the data directly. At the same time, the language allows developers familiar with MapReduce to plug in custom mappers and reducers for complex analytical work that the built-in ones cannot handle. Hive also lets users write their own user-defined functions for use in queries. There are three kinds: user-defined functions (UDF), user-defined aggregation functions (UDAF), and user-defined table-generating functions (UDTF).
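As a small sketch of what HQL looks like in practice (the table, columns, JAR path, and UDF class here are hypothetical, chosen only for illustration):

```sql
-- An ordinary SQL-like query; on classic Hive this compiles to a MapReduce job.
SELECT city, COUNT(*) AS cnt
FROM wyp
GROUP BY city;

-- Registering and using a custom UDF (JAR path and class name are assumed):
ADD JAR /path/to/my-udfs.jar;
CREATE TEMPORARY FUNCTION my_lower AS 'com.example.hive.udf.MyLower';
SELECT my_lower(name) FROM wyp;
```

A UDF such as my_lower maps one input row to one output value; a UDAF (like the built-in COUNT) aggregates many rows into one; a UDTF expands one row into many.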

Today, Hive is a successful Apache project that many organizations use as a general-purpose, scalable data-processing platform.

Of course, there are big differences between Hive and a traditional relational database. Hive compiles each submitted task into a MapReduce execution plan, and starting a MapReduce job is a high-latency operation: every submission and execution costs significant time. This means Hive is only suited to high-latency (batch) applications; if you need low latency, consider HBase instead. Likewise, because its design goals differ, Hive currently does not support transactions; table data cannot be modified in place (no row-level update, delete, or insert; you can only append data or re-import it); and although Hive does support indexes on columns, they do not speed up its queries. If you want faster Hive queries, learn to use partitions and buckets instead.

How Hive Stores Data

Hive data falls into two parts: table data and metadata. Table data is the data a Hive table holds; metadata stores the table's name, its columns and partitions with their attributes, its table-level properties (whether it is an external table, etc.), the directory where its data lives, and so on. Each part is described below.

Hive Data Storage

Hive is built on the Hadoop distributed file system, and its data is stored in HDFS. Hive has no specialized data storage format of its own and does not index the data; you only need to tell Hive which column delimiter and row delimiter the table's data uses, and Hive can parse it. Importing data into a Hive table therefore simply moves the data into the table's directory (if the data is already on HDFS; if it is on the local file system, it is copied into the table's directory).
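For example (a minimal sketch; the table layout and file paths are assumptions), declaring the delimiters and importing a file might look like this:

```sql
-- Only the delimiters are declared; the data stays as plain files in HDFS.
CREATE TABLE wyp (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY '\t'
  LINES TERMINATED BY '\n';

-- LOCAL copies the file from the local file system into the table's directory;
-- without LOCAL, a file already on HDFS is simply moved there.
LOAD DATA LOCAL INPATH '/tmp/wyp.txt' INTO TABLE wyp;
```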

Hive includes the following data models: tables, external tables, partitions, and buckets.

Table: a table in Hive is conceptually similar to a table in a relational database, and each table has a corresponding directory in HDFS that stores its data. That directory is set by the hive.metastore.warehouse.dir property in the ${HIVE_HOME}/conf/hive-site.xml configuration file; its default value is /user/hive/warehouse (a directory on HDFS), and we can change it to suit our environment. If I have a table wyp, then the directory /user/hive/warehouse/wyp is created in HDFS (assuming hive.metastore.warehouse.dir is configured as /user/hive/warehouse), and all of the wyp table's data is stored in that directory. The exception is external tables.

External table: an external table is similar to an ordinary table, except that its data does not have to live in the table's warehouse directory. The advantage is that dropping an external table deletes only its metadata, not the data it points to; dropping an ordinary table deletes all of the table's data, metadata included.
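A sketch of an external table definition (the table name and HDFS LOCATION below are assumed for illustration):

```sql
-- The data lives at the given LOCATION, outside the warehouse directory.
CREATE EXTERNAL TABLE wyp_ext (
  id   INT,
  name STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
LOCATION '/data/wyp_ext';

-- DROP TABLE wyp_ext;  -- removes only the metadata; files under /data/wyp_ext remain.
```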

Partition: in Hive, each partition of a table corresponds to a subdirectory of the table's directory, and all of a partition's data is stored in that subdirectory. For example, if the wyp table has two partition columns, dt and city, then the directory for the partition dt=20131218, city=BJ is /user/hive/warehouse/wyp/dt=20131218/city=BJ, and all data belonging to that partition is stored there.
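A sketch of such a partitioned table (table name reused from the example above; the input path is assumed):

```sql
-- dt and city become directory levels under the table's directory,
-- not columns stored in the data files themselves.
CREATE TABLE wyp_part (
  id   INT,
  name STRING
)
PARTITIONED BY (dt STRING, city STRING);

-- This load places the file under
-- /user/hive/warehouse/wyp_part/dt=20131218/city=BJ
LOAD DATA LOCAL INPATH '/tmp/wyp_bj.txt'
INTO TABLE wyp_part PARTITION (dt = '20131218', city = 'BJ');
```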

Bucket: a bucketed table computes a hash over a specified column and splits the data by that hash value, to enable parallelism; each bucket corresponds to one file (note the contrast with partitions, which correspond to directories). For example, if the wyp table's id column is scattered into 16 buckets, the hash of each id value is computed first; data whose hash is 0 modulo 16 is stored in the HDFS file /user/hive/warehouse/wyp/part-00000, and data whose hash is 2 modulo 16 is stored in /user/hive/warehouse/wyp/part-00002.
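A sketch of a bucketed table (names assumed; the SET option applies to older Hive versions, where bucketing had to be enforced explicitly at load time):

```sql
-- Rows are assigned to files part-00000 ... part-00015 by hash(id) mod 16.
CREATE TABLE wyp_bucket (
  id   INT,
  name STRING
)
CLUSTERED BY (id) INTO 16 BUCKETS;

-- On older Hive versions, enforce bucketing when populating the table:
SET hive.enforce.bucketing = true;
INSERT OVERWRITE TABLE wyp_bucket SELECT id, name FROM wyp;
```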

Hive Data Abstraction Structure Diagram

As the diagram shows, tables sit under a database; a table can be divided into partitions, buckets, skewed data, and normal data; and a partition can itself be further bucketed.

Hive Metadata

Hive metadata includes the table's name, its columns and partitions with their attributes, its table-level properties (whether it is an external table, etc.), the table's data directory, and so on. Because Hive metadata must be updated and modified constantly, while files in HDFS are written once and read many times, storing Hive metadata in HDFS is clearly unsuitable. Hive therefore currently stores its metadata in a relational database, such as MySQL or Derby. We can configure where Hive stores its metadata as follows:

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive_hdp?characterEncoding=UTF-8&amp;createDatabaseIfNotExist=true</value>
  <description>JDBC connect string for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
  <description>Driver class name for a JDBC metastore</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
  <description>Username to use against the metastore database</description>
</property>

<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>123456</value>
  <description>Password to use against the metastore database</description>
</property>

Of course, you also need to copy the database's JDBC driver JAR into the ${HIVE_HOME}/lib directory; only then can Hive store its metadata in the corresponding database.
