Execute SQL statements using Hive or Impala to manipulate data stored in HBase


    • Execute SQL statements using Hive or Impala to manipulate data stored in HBase
      • 0. Abstract
      • I. Basic environment
      • II. Data stored in HBase: using Hive to execute SQL statements
        • Ⅰ. Creating a Hive external table
        • Ⅱ. Reading from HBase
        • Ⅲ. Writing to HBase
      • III. Data stored in HBase: using Impala to execute SQL statements
        • Ⅰ. Reading from HBase
        • Ⅱ. Writing to HBase
      • IV. Summary

0. Abstract

Hive is a Hadoop-based data warehousing tool that maps structured data files onto database tables and provides a simple SQL query capability by translating SQL statements into MapReduce jobs.
HBase (Hadoop Database) is a highly reliable, high-performance, column-oriented, scalable, distributed storage system. On its own it can only fetch data by rowkey and does not support SQL queries.
Therefore, if Hive can read data from HBase, we can combine HBase storage with Hive's SQL query capability and perform more complex SQL query operations.
Impala provides interactive SQL for querying data stored in HDFS and HBase directly. Impala uses the same unified storage platform as Hive, as well as the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Hue Beeswax), so it offers a familiar, unified platform for both batch and real-time queries.

Goal: support multi-table joins and other more complex SQL query operations over data stored in HBase.

I. Basic environment

CentOS-6.5
JDK-1.7
hive: hive-0.13.1+cdh5.3.6+397
impala: impala-2.1.5+cdh5.3.6+0
spark: spark-1.2.0+cdh5.3.6+379
elasticsearch: elasticsearch-1.7.1
elasticsearch-hadoop-hive: elasticsearch-hadoop-hive-2.1.1

II. Data stored in HBase: using Hive to execute SQL statements

Whether you use Hive, Impala, or Spark, the first step is to describe the table structure of the data to the engine.
Hive's built-in data types fall into two main categories: (1) primitive data types and (2) complex data types. The primitive data types are: TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP, DECIMAL, CHAR, VARCHAR, and DATE.

Ⅰ. Creating a Hive external table
    1. First HBase, then Hive
       There is already a table named hbaseanqi in HBase, with a column family named angel and the columns name, visible, cdecimal, cdate, and position in that column family.
       We need to create a table in Hive named hive_hbase_anqi with five columns: id of type BIGINT, name of type STRING, visible of type BOOLEAN, cdecimal of type DECIMAL(12,2), and cdate of type TIMESTAMP.
       The id, name, visible, cdecimal, and cdate columns of the Hive table come from the rowkey and the angel:name, angel:visible, angel:cdecimal, and angel:cdate columns of the hbaseanqi table in HBase (angel is the column family name).
       Then execute the following statement to create the external table in Hive:

 
  CREATE EXTERNAL TABLE default.hive_hbase_anqi(
    id BIGINT,
    name STRING,
    visible BOOLEAN,
    cdecimal DECIMAL(12,2),
    cdate TIMESTAMP)
  ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
  STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
  WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,angel:name,angel:visible,angel:cdecimal,angel:cdate")
  TBLPROPERTIES ("hbase.table.name" = "hbaseanqi");

As you can see, the column mapping between Hive and HBase is specified through the WITH SERDEPROPERTIES clause. After executing the statement, you can see a properly structured table in Hive, and in Hue you can see the data that lives in HBase.
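
To double-check that the mapping took effect, one quick way (a small addition, not from the original article) is to inspect the table definition from the Hive shell; the output should include the HBaseStorageHandler and the serde properties set above:

    -- Inspect the external table definition; the storage handler and
    -- the hbase.columns.mapping set above should appear in the output.
    DESCRIBE FORMATTED default.hive_hbase_anqi;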

Note:
All field names in the Hive table we created are lowercase. This is the same as with es-hive; for more information, see "Execute SQL statements using Hive or Impala to manipulate data stored in Elasticsearch".

    2. First Hive, then HBase
       Hive needs to fetch some information from HBase while executing the statement above, so if the corresponding table does not exist in HBase, the statement throws an exception directly: HBase table hbaseanqi doesn't exist while the table is declared as an external table.
       Therefore, when the HBase table is absent, you cannot create the corresponding Hive external table.
       This is not the case when associating Hive with Elasticsearch: Hive can still create the table successfully even when the corresponding index/type does not yet exist in Elasticsearch. For details, see "Execute SQL statements using Hive or Impala to manipulate data stored in Elasticsearch".
Ⅱ. Reading from HBase

We chose a few representative SQL queries to verify that Hive can correctly query the data stored in HBase.

  1. Summing DECIMAL data
     SELECT SUM(cdecimal) FROM default.hive_hbase_anqi;
     It is verified that the correct sum is returned.

  2. Sorting TIMESTAMP data
     SELECT * FROM default.hive_hbase_anqi ORDER BY cdate LIMIT 5;
     It is verified that the timestamp values are sorted correctly.

  3. Joining a plain table with the external table (an equivalent explicit JOIN form is sketched after this list)
     SELECT * FROM hive_hbase_anqi, sample_08 WHERE hive_hbase_anqi.id = sample_08.salary;
     Here sample_08 is a plain Hive table, while hive_hbase_anqi is the Hive external table whose data is stored in HBase.
     It is verified that, by executing the statement above, a plain Hive table can be joined correctly with the HBase-backed external table.
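
As a small illustration (not from the original article), the implicit join in item 3 can also be written with explicit JOIN ... ON syntax, which both Hive and Impala accept; the table and column names are the same ones used above:

    -- Explicit JOIN form of the query in item 3
    SELECT *
    FROM hive_hbase_anqi h
    JOIN sample_08 s ON h.id = s.salary;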

Ⅲ. Writing to HBase

INSERT OVERWRITE TABLE hive_hbase_anqi SELECT salary, dtime, sdecimal, sname, svisible FROM sample_09;
This imports the data from the plain Hive table sample_09 into the Hive external table hive_hbase_anqi; checking in HBase confirms that the data is imported correctly.
Hive has four ways to import data: loading data from the local file system into a Hive table, loading data from HDFS into a Hive table, inserting the result of a query on other tables into a Hive table, and creating a new table directly from the result of a query on other tables (a rough sketch of all four follows below).
Of these, writing to HBase is not particularly convenient.
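
For reference, a rough sketch of those four import paths in HiveQL follows; the file paths and the sample_07/sample_copy names are assumptions for illustration, not from the original article. Note that only the INSERT ... SELECT style applies to an HBase-backed external table such as hive_hbase_anqi, since LOAD DATA is not supported for tables backed by a storage handler:

    -- 1. Load from the local file system into a plain Hive table (path and table assumed)
    LOAD DATA LOCAL INPATH '/tmp/sample.txt' INTO TABLE sample_07;
    -- 2. Load from HDFS into a plain Hive table (path assumed)
    LOAD DATA INPATH '/user/hive/sample.txt' INTO TABLE sample_07;
    -- 3. Insert the result of a query on other tables (works for the HBase-backed table)
    INSERT INTO TABLE hive_hbase_anqi SELECT salary, dtime, sdecimal, sname, svisible FROM sample_09;
    -- 4. Create a new table directly from a query result (CTAS; table name assumed)
    CREATE TABLE sample_copy AS SELECT * FROM sample_09;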

III. Data stored in HBase: using Impala to execute SQL statements

Executing INVALIDATE METADATA or REFRESH table_name in Impala allows Impala to pick up a table that has been created or changed in Hive.
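
For example, a minimal sketch (the table name is the one used above): run this in impala-shell after creating or changing the table in Hive, before querying it:

    -- Make Impala pick up the table created in Hive,
    -- or refresh its metadata after the data changes.
    INVALIDATE METADATA default.hive_hbase_anqi;
    REFRESH default.hive_hbase_anqi;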

Ⅰ. Reading from HBase

Impala can read the table created in Hive directly. When executing complex SQL statements, Impala is significantly faster than Hive.
As before, we take the representative SQL statements from section II and verify that Impala returns results quickly and correctly, and that the results match those produced by Hive.
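
As a minimal sketch, these are simply the statements from section II, re-run unchanged through impala-shell:

    -- Same representative queries as in section II, executed via Impala
    SELECT SUM(cdecimal) FROM default.hive_hbase_anqi;
    SELECT * FROM default.hive_hbase_anqi ORDER BY cdate LIMIT 5;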

Ⅱ. Writing to HBase

Impala's INSERT statements differ noticeably from Hive's; Impala's syntax is closer to standard SQL.

    1. Adding a single row
       INSERT INTO hive_hbase_anqi VALUES (0, '1999-09-09', 8.88, 'name', true);
       Executing this statement adds one row of data to HBase. Note the order of the values: the id comes first, and the remaining columns follow sorted by column name (cdate, cdecimal, name, visible).
    2. Importing from another HBase-backed table (a quick verification query is sketched after this list)
       INSERT INTO hive_hbase_anqi (id, name, cdate) SELECT id, name, udate FROM hive_hbase_xi;
       Executing this statement adds a batch of rows from the hive_hbase_xi table into HBase.
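
As a small follow-up sketch (not from the original article), either insert can be spot-checked by querying the external table again:

    -- Spot-check that the newly written rows are visible
    SELECT id, name, cdecimal, cdate FROM hive_hbase_anqi ORDER BY id LIMIT 10;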
IV. Summary

When data is stored in HBase, using Impala or Hive to execute SQL statements against it works well. Both Impala and Hive return correct results, but Impala is more convenient and efficient than Hive, and it achieves the goal stated at the beginning.

Author: @Angel Wang

