Execute SQL statements using Hive or Impala to manipulate data stored in HBase
Hive
Impala
HBase
HiveQL
Big Data
- Execute SQL statements using Hive or Impala to manipulate data stored in HBase
- 0. Abstract
- I. Basic environment
- II. Data stored in HBase, using Hive to execute SQL statements
- Ⅰ. Creating the Hive external table
- Ⅱ. Reading from HBase
- Ⅲ. Writing to HBase
- III. Data stored in HBase, using Impala to execute SQL statements
- Ⅰ. Reading from HBase
- Ⅱ. Writing to HBase
- IV. Summary
0. Abstract
Hive is a Hadoop-based data warehousing tool that maps structured data files to database tables and provides a simple SQL query capability, translating SQL statements into MapReduce jobs.
HBase (Hadoop Database) is a highly reliable, high-performance, column-oriented, scalable, distributed storage system; its data can only be fetched by rowkey, and it does not support SQL queries on its own.
Therefore, if Hive can read the data stored in HBase, combining it with Hive's SQL query capability makes more complex SQL query operations possible.
Impala provides interactive SQL for querying data stored in HDFS and HBase directly. Besides using the same unified storage platform as Hive, Impala also uses the same metadata, SQL syntax (Hive SQL), ODBC driver, and user interface (Hue Beeswax). Impala thus provides a familiar, unified platform for batch and real-time queries.
Goal: support multi-table joins on HBase data and other more complex SQL query operations.
I. Basic environment
CentOS-6.5
JDK-1.7
hive: hive-0.13.1+cdh5.3.6+397, impala: impala-2.1.5+cdh5.3.6+0, spark: spark-1.2.0+cdh5.3.6+379,
elasticsearch: elasticsearch-1.7.1, elasticsearch-hadoop-hive: elasticsearch-hadoop-hive-2.1.1
II. Data stored in HBase, using Hive to execute SQL statements
Whether you use Hive, Impala, or Spark, the first step is to tell it the table structure of the data.
Hive's built-in data types fall into two main categories: (1) primitive data types and (2) complex data types. The primitive data types are: TINYINT, SMALLINT, INT, BIGINT, BOOLEAN, FLOAT, DOUBLE, STRING, BINARY, TIMESTAMP, DECIMAL, CHAR, VARCHAR, DATE.
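As an aside, here is a minimal sketch (the table name hive_types_demo is hypothetical, not part of this setup) showing primitive and complex types side by side:
CREATE TABLE default.hive_types_demo (
  id BIGINT,                              -- primitive type
  price DECIMAL(12,2),                    -- primitive type
  tags ARRAY<STRING>,                     -- complex type
  attrs MAP<STRING,STRING>,               -- complex type
  address STRUCT<city:STRING, zip:INT>    -- complex type
);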
Ⅰ. Creating the Hive external table
- Mapping the HBase table to a Hive table
There is already a table named hbaseanqi in HBase, with a column family named angel containing five columns: name, visible, cdecimal, cdate, and position.
We need to create a table in Hive named hive_hbase_anqi with five columns: id of type BIGINT, name of type STRING, visible of type BOOLEAN, cdecimal of type DECIMAL(12,2), and cdate of type TIMESTAMP.
The id, name, visible, cdecimal, and cdate columns of the Hive table come from the rowkey and the angel:name, angel:visible, angel:cdecimal, and angel:cdate columns of the hbaseanqi table in HBase (angel is the column family name).
Then execute the following statement to create the external table in Hive:
CREATE EXTERNAL TABLE default.hive_hbase_anqi(
id BIGINT,
name STRING,
visible BOOLEAN,
cdecimal DECIMAL(12,2),
cdate TIMESTAMP)
ROW FORMAT SERDE 'org.apache.hadoop.hive.hbase.HBaseSerDe'
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,angel:name,angel:visible,angel:cdecimal,angel:cdate")
TBLPROPERTIES ("hbase.table.name" = "hbaseanqi");
As you can see, the column mapping between Hive and HBase is specified through the WITH SERDEPROPERTIES clause. After executing the statement above, you can see a properly structured table in Hive, and in Hue you can see the data that lives in HBase.
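As a quick sanity check (assuming the table was created exactly as above), you can inspect the mapping and read a few rows:
-- Show the storage handler and the SERDEPROPERTIES used for the mapping
DESCRIBE FORMATTED default.hive_hbase_anqi;
-- Read a few rows straight from HBase
SELECT * FROM default.hive_hbase_anqi LIMIT 10;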
Note:
All the fields in the Hive table we created are lowercase. This is the same as with the Elasticsearch-Hive integration; for details, see the companion article: Execute SQL statements using Hive or Impala to manipulate data stored in Elasticsearch.
- HBase first, then Hive
Because Hive needs to fetch some information from HBase when executing the statement above, if the corresponding table does not exist in HBase, Hive throws an exception directly: HBase table hbaseanqi doesn't exist while the table is declared as an external table.
Therefore, without the HBase table, you cannot create the corresponding Hive table.
This is not the case when associating Hive with Elasticsearch: Hive can still create the table successfully when the corresponding index/type does not exist in Elasticsearch. For details, see: Execute SQL statements using Hive or Impala to manipulate data stored in Elasticsearch.
Ⅱ. Reading from HBase
We chose a few representative SQL queries to verify that Hive can correctly query the data stored in HBase.
- Summing DECIMAL-type data
SELECT SUM(cdecimal) FROM default.hive_hbase_anqi;
We verified that the exact sum is returned correctly.
- Sorting TIMESTAMP-type data
SELECT * FROM default.hive_hbase_anqi ORDER BY cdate LIMIT 5;
We verified that the TIMESTAMP values are sorted correctly.
- Joining a regular table with the external table
SELECT * FROM hive_hbase_anqi, sample_08 WHERE hive_hbase_anqi.id = sample_08.salary;
Here sample_08 is a regular Hive table, while hive_hbase_anqi is the Hive external table whose data is stored in HBase.
We verified that, by executing the statement above, a regular Hive table can be joined correctly with the HBase-backed external table.
Ⅲ. Writing to HBase
INSERT OVERWRITE TABLE hive_hbase_anqi SELECT salary, dtime, sdecimal, sname, svisible FROM sample_09;
This imports the data from the regular Hive table sample_09 into the Hive external table hive_hbase_anqi; checking in HBase shows that the data was imported correctly.
Hive has four data import methods: importing data from the local file system into a Hive table, importing data from HDFS into a Hive table, inserting the result of a query on another table into a Hive table, and creating a new table from the result of a query on another table (sketched below).
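A hedged sketch of these four methods (the file paths and the sample_09_copy table are hypothetical; only the third form applies to the HBase-backed external table):
-- 1. From the local file system into a Hive table (hypothetical path)
LOAD DATA LOCAL INPATH '/tmp/sample_09.txt' INTO TABLE sample_09;
-- 2. From HDFS into a Hive table (hypothetical path)
LOAD DATA INPATH '/user/hive/data/sample_09.txt' INTO TABLE sample_09;
-- 3. Insert the result of a query on another table (the form used above)
INSERT OVERWRITE TABLE hive_hbase_anqi SELECT salary, dtime, sdecimal, sname, svisible FROM sample_09;
-- 4. Create a new table from a query (CTAS); not supported for HBase-backed tables
CREATE TABLE sample_09_copy AS SELECT * FROM sample_09;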
So writing to HBase from Hive is not very convenient: only the INSERT ... SELECT style of import applies to the HBase-backed table.
III. Data stored in HBase, using Impala to execute SQL statements
Executing INVALIDATE METADATA or REFRESH table_name in Impala lets Impala recognize a table that has been newly created or changed in Hive.
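For example, a minimal sketch using the table created above:
-- Make a table newly created in Hive visible to Impala
INVALIDATE METADATA default.hive_hbase_anqi;
-- Pick up subsequent metadata or data changes for a table Impala already knows about
REFRESH default.hive_hbase_anqi;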
Ⅰ. Reading from HBase
Impala can directly read the table created in Hive. When executing complex SQL statements, Impala is significantly faster than Hive.
Similarly, we picked the representative SQL queries from section II above to verify that Impala returns the results quickly and correctly, and that the results match those of the Hive SQL.
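For instance, the same representative queries from section II can be run unchanged in Impala (a sketch assuming the metadata has been refreshed as above):
SELECT SUM(cdecimal) FROM default.hive_hbase_anqi;
SELECT * FROM default.hive_hbase_anqi ORDER BY cdate LIMIT 5;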
Ⅱ. Writing to HBase
Impala's INSERT statements differ noticeably from Hive's; Impala is closer to standard SQL.
- Inserting a single row
INSERT INTO hive_hbase_anqi VALUES (0, '1999-09-09', 8.88, 'name', true);
By executing this statement, you add one row of data to HBase. Note the order of the values: the id (the rowkey) comes first, and the remaining columns follow in column-name order (cdate, cdecimal, name, visible).
- Inserting from another HBase-backed table
INSERT INTO hive_hbase_anqi (id, name, cdate) SELECT id, name, udate FROM hive_hbase_xi;
By executing this statement, you add a batch of data from the hive_hbase_xi table into HBase.
IV. Summary
When data is stored in HBase, using Impala or Hive to execute SQL statements is a good approach. Both Impala and Hive produce correct results, but Impala is more convenient and efficient than Hive. This achieves the goal set out above.
author @ Angel Wang