Hadoop (VII) -- Sub-project Hive


In earlier posts we introduced the two basic pillars of the Hadoop project, HDFS and MapReduce, and then the sub-project Pig: a client that sits on top of MapReduce and processes data in HDFS with a SQL-like, data-flow-oriented language. That largely satisfies programmers who do not know Java and do not want to write MapReduce jobs, but it is still awkward for data analysts, DBAs, and others who have been doing data analysis on relational databases such as Oracle. Another Hadoop sub-project, Hive, solves this problem.


First, an outline of this post:


One, the Hive concept: Hive is a Hadoop-based data warehousing tool that maps structured data files to database tables and provides simple SQL query functionality; it converts SQL statements into MapReduce jobs to run. The advantage is a low learning cost: simple MapReduce statistics can be produced quickly through SQL-like statements (Hive QL), which makes Hive well suited to statistical analysis of a data warehouse without developing dedicated MapReduce applications. You can think of it as a mapper from SQL to MapReduce.
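For instance (a hedged illustration, not from the original post; the table name is made up), an aggregation like the following is turned by Hive into a MapReduce job whose map phase emits (name, 1) pairs and whose reduce phase sums them, so no Java needs to be written:

SELECT name, count(*) FROM some_table GROUP BY name;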


Two, Hive installation: first, understand that Hive organizes the data in HDFS into tables by assigning structure to the HDFS data; this structural information (for example, the table schema) is called Hive metadata and is stored in the metastore. Depending on where the metastore is stored, the installation can take one of three modes:


1, embedded mode: the metastore service and the Hive service run in the same JVM, with an embedded Derby database instance backed by local disk. This installation is the simplest and is suitable for learning, but it allows only one session connection at a time.


2, local standalone mode: metadata is stored in a separate database (MySQL is a popular choice); the metastore service connects to a locally installed MySQL database, and multi-session, multi-user connections are supported.


3, remote mode: metadata is placed in a remote MySQL database, so that one or more metastore services and the Hive service run in different processes.

OK, among the Hadoop sub-projects Hive is relatively simple to install. Here is the embedded-mode installation:


A, download and unzip to the user directory:

tar xzf ./apache-hive-1.2.1-bin.tar.gz

The extracted directory layout is similar to that of the other sub-projects and is not described again here.



B, set environment variables:

export HIVE_HOME=/home/ljh/apache-hive-1.2.1-bin
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:$HIVE_HOME/bin

C, configuration file settings:

C.1, hive-env.sh

Copy the template: cp hive-env.sh.template hive-env.sh

Set HADOOP_HOME: HADOOP_HOME=/home/ljh/hadoop-1.2.1

Set the Hive configuration directory: export HIVE_CONF_DIR=/home/ljh/apache-hive-1.2.1-bin/conf

C.2, hive-site.xml

Copy the template: cp hive-default.xml.template hive-site.xml

Note: embedded mode needs no configuration here. For local standalone mode or remote mode, the MySQL connection and related parameters must be configured; the individual parameters are easy to look up via Baidu or Google.
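For reference, a minimal sketch of the MySQL-related properties typically set inside the <configuration> element of hive-site.xml for local standalone or remote mode (the host, database name, user, and password below are placeholders, not values from this post; the MySQL JDBC driver jar must also be placed in $HIVE_HOME/lib):

<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/hive_metastore?createDatabaseIfNotExist=true</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>hive</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>hive_password</value>
</property>

In remote mode the Hive service additionally points at the metastore service through hive.metastore.uris (for example thrift://<metastore-host>:9083).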

D, start hive:

Simply run ./hive from the bin directory (or just hive, since $HIVE_HOME/bin is on the PATH).

For the other installation modes, refer to:

http://sishuok.com/forum/blogPost/list/6221.html

http://blog.csdn.net/xqj198404/article/details/9109715


Three, common SQL statements for operating Hive. Here we only cover creating tables, deleting them, and adding data (deletes and updates are in effect implemented as insert operations). For Hive's table characteristics and data types, see this post: http://blog.csdn.net/chenxingzhen001/article/details/20901045


1, create a table:

CREATE TABLE test (id STRING, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

2, load a data file into the table:

LOAD DATA LOCAL INPATH './examples/files/test.txt' OVERWRITE INTO TABLE test;

3, insert query results into a table, i.e. filter data from one Hive table and insert it into another:

INSERT OVERWRITE TABLE test2 SELECT id, name FROM test WHERE id IS NOT NULL;

4, query: SELECT id, name FROM test;


5, table join: SELECT test.id, test1.name FROM test JOIN test1 ON (test.id = test1.id);

These are only simple operations; for Hive's full SQL syntax, refer to this article, which is very well written and comprehensive:

http://www.cnblogs.com/HondaHsu/p/4346354.html


Four, Hive architecture (summarizing the classic architecture diagram):


4.1, the basic composition:

• User interfaces, including the CLI, JDBC/ODBC, and the Web UI

• Metadata store, typically kept in a relational database such as MySQL or Derby

• Interpreter, compiler, optimizer, executor

• Hadoop: HDFS is used for storage and MapReduce for computation

4.2, the basic functions of each component:

a, there are three main user interfaces: the CLI, JDBC/ODBC, and the Web UI.

The CLI is the shell command line.

JDBC/ODBC is Hive's Java client interface; it is used much like JDBC against a traditional database, operating Hive from Java code.

The Web GUI accesses Hive through a browser.
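As a hedged sketch of the JDBC route (not from the original post; it assumes a HiveServer2 endpoint is running on localhost:10000, the hive-jdbc driver is on the classpath, and the user name and table are placeholders):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

// Minimal sketch: query Hive over JDBC, much like a traditional relational database.
public class HiveJdbcDemo {
    public static void main(String[] args) throws Exception {
        Class.forName("org.apache.hive.jdbc.HiveDriver");   // HiveServer2 JDBC driver
        Connection conn = DriverManager.getConnection(
                "jdbc:hive2://localhost:10000/default", "ljh", "");
        Statement stmt = conn.createStatement();
        ResultSet rs = stmt.executeQuery("SELECT id, name FROM test");
        while (rs.next()) {
            System.out.println(rs.getString(1) + "\t" + rs.getString(2));
        }
        rs.close();
        stmt.close();
        conn.close();
    }
}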

B, Hive stores its metadata in a database. The metadata includes table names, the tables' columns and partitions and their properties, table properties (whether a table is external, etc.), the directories where table data lives, and so on.


Metastore

The metastore is a system catalog that holds the metadata for the tables stored in Hive;


The metastore is what distinguishes Hive from other similar systems when it is used like a traditional database solution (such as Oracle or DB2);


The metastore contains the following parts:

• Database: the namespace for tables. The default database is named 'default';

• Table: the metadata of a table contains information such as its list of columns and their types, owner, storage, and SerDe information;

• Partition: each partition can have its own columns, SerDe, and storage; this feature is used to support schema evolution in Hive (see the sketch after this list).
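To make partitions concrete, here is a hedged HiveQL sketch (the table, columns, partition value, and file path are made up for illustration, not taken from this post); each partition value becomes its own sub-directory under the table's HDFS directory and is tracked in the metastore:

CREATE TABLE logs (id STRING, msg STRING)
PARTITIONED BY (dt STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '|'
STORED AS TEXTFILE;

LOAD DATA LOCAL INPATH './examples/files/log-2015-01-01.txt'
INTO TABLE logs PARTITION (dt='2015-01-01');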

C, the interpreter, compiler, and optimizer carry out lexical analysis, parsing, compilation, optimization, and query-plan generation for an HQL statement. The generated query plan is stored in HDFS and is then executed as MapReduce calls.


Compiler

The driver invokes the compiler to handle the HiveQL string, which may be a DDL, DML, or query statement;


• The compiler converts the string into a plan:

• For DDL statements the plan consists only of metadata operations, and for LOAD statements it consists only of HDFS operations;

• For inserts and queries, the plan consists of a directed acyclic graph (DAG) of map-reduce tasks.

D, Hive data is stored in HDFS, and most queries are completed by MapReduce (simple queries such as SELECT * FROM table do not generate MapReduce tasks).
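A quick illustration of this point, using the test table from section Three (behaviour hedged as what this Hive version typically does):

SELECT * FROM test;                   -- plain fetch: Hive reads the files directly, no MapReduce job
SELECT count(*) FROM test;            -- aggregation: Hive compiles and submits a MapReduce job
EXPLAIN SELECT count(*) FROM test;    -- shows the generated plan (stages) without executing it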




Five, user-defined functions (UDFs). Think of them as the utility classes of a Java project: we write our own functions and then call them directly when writing Hive QL, which makes queries more convenient. UDFs are written in Java, the language Hive itself is written in. Hive has three types of UDFs:


1, (regular) UDF: operates on a single data row and produces a single data row as output; most functions (such as mathematical and string functions) fall into this category, and it is the most common type.


2, UDAF (user-defined aggregate function): accepts multiple input data rows and produces one output data row, like COUNT and MAX.


3, UDTF (user-defined table-generating function): operates on a single row of data and produces multiple rows of data, i.e. a table, as output.

As for writing a UDF, the steps are: extend the UDF class and override the evaluate() method with your custom logic --> package it as a JAR file --> register and use it in Hive.
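A minimal sketch of such a UDF, assuming the classic org.apache.hadoop.hive.ql.exec.UDF base class available in Hive 1.2.1 (the class name, jar path, and function name are made up for illustration):

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A regular UDF: one value in, one value out (upper-cases a string).
public class ToUpperUDF extends UDF {
    public Text evaluate(Text input) {
        if (input == null) {
            return null;           // pass nulls through unchanged
        }
        return new Text(input.toString().toUpperCase());
    }
}

And a hedged usage flow inside Hive after packaging the class as a jar:

ADD JAR /home/ljh/udf/to-upper-udf.jar;
CREATE TEMPORARY FUNCTION to_upper AS 'ToUpperUDF';
SELECT to_upper(name) FROM test;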
