Integration of Impala and HBase

We know that HBase is a column-oriented NoSQL database that supports flexible data storage; conceptually, it is one big table. In some applications, a well-designed RowKey lets you store and access massive amounts of data quickly. However, for complex query and statistics requirements, implementing them directly on the HBase APIs performs very poorly; alternatively, you can write MapReduce programs for the query analysis, but that inherits the latency of MapReduce.
Integrating Impala with HBase brings the following benefits:

  • We can use familiar SQL statements. As with a traditional relational database, it is easy to express complex queries and statistical analysis in SQL.
  • Impala's query statistics and analysis is much faster than native MapReduce or Hive.

To integrate Impala with HBase, you need to map the HBase RowKey and columns to table fields in Impala. Impala uses the Hive Metastore to store its metadata and, like Hive, integrates with HBase through EXTERNAL tables.

Preparations

First, we need to make the following preparations:

  • Install and configure Hadoop cluster (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_4_4.html)
  • Install and configure HBase cluster (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_20.html)
  • Install and configure Hive (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_18.html)
  • Install and configure Impala (http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_noncm_installation.html?scroll=noncm_installation)

For installation and configuration of related systems, refer to relevant documents and materials.
The following describes how to integrate Impala with HBase, using the table test_info as an example:

Integration Process

  • Create a table in HBase

First, we use HBase Shell to create a table, as shown below:

create 'test_info', 'info'

The table name is test_info, and it has a single column family named info. We plan to store three columns in this column family: info:user_type, info:gender, and info:birthday; the fourth field, user_id, will be carried by the RowKey itself.
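As a quick illustration, a sample row can be written and read back from the HBase shell (the row key and values here are made up for illustration only):

```
put 'test_info', 'u0001', 'info:user_type', '3'
put 'test_info', 'u0001', 'info:gender', 'M'
put 'test_info', 'u0001', 'info:birthday', '1990-01-01'
get 'test_info', 'u0001'
```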

  • Create an External table in Hive

Create an External table. The corresponding DDL is as follows:

CREATE EXTERNAL TABLE sho.test_info (
    user_id string,
    user_type tinyint,
    gender string,
    birthday string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:user_type,info:gender,info:birthday")
TBLPROPERTIES ("hbase.table.name" = "test_info");

In the preceding DDL statement, the WITH SERDEPROPERTIES option specifies the mapping between the Hive external table fields and the HBase columns: ":key" corresponds to the RowKey in HBase and is mapped to the field named "user_id", while the rest correspond to column names in the column family info. Finally, TBLPROPERTIES specifies the name of the HBase table being mapped.
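Before moving on to Impala, the mapping can be sanity-checked from the Hive CLI (the LIMIT value is arbitrary):

```sql
SELECT * FROM sho.test_info LIMIT 10;
```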

  • Synchronize metadata in Impala

Impala shares the Hive Metastore. To synchronize metadata, run the following command in Impala Shell:

INVALIDATE METADATA;
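If only the newly mapped table has changed, recent Impala versions also accept a table-level form that avoids reloading metadata for every table:

```sql
INVALIDATE METADATA test_info;
```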

Then, you can view the structure of the table mapped to HBase:

DESC test_info;

Table structure:

user_id      string
user_type    tinyint
gender       string
birthday     string

Through the above three steps, we have completed the integration configuration of Impala and HBase.

Verify Integration

Next, we verify that the above configuration takes effect.
We simulate a client inserting data into the HBase table, which can be done through the native HBase API or through HBase Thrift. Here we use the HBase Thrift interface; for details, see the HBase Thrift client Java API practice.
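As a minimal sketch of what such a Thrift-based client could look like, here is a Python version using the third-party happybase client; the host/port, helper names, and sample values are our assumptions, not from the original article:

```python
# Sketch: insert rows into the HBase table test_info over Thrift.
# Assumes `pip install happybase` and an HBase Thrift server (default port 9090).

def build_row(user_id, user_type, gender, birthday):
    """Map one record onto the HBase row key and the 'info' column family.

    The row key carries user_id, matching the ":key" mapping in the Hive DDL;
    the remaining fields become qualifiers in the 'info' column family.
    """
    row_key = user_id.encode("utf-8")
    data = {
        b"info:user_type": str(user_type).encode("utf-8"),
        b"info:gender": gender.encode("utf-8"),
        b"info:birthday": birthday.encode("utf-8"),
    }
    return row_key, data


def insert_rows(records, host="localhost", port=9090):
    """Write (user_id, user_type, gender, birthday) tuples via happybase."""
    import happybase  # imported here so build_row stays dependency-free

    connection = happybase.Connection(host, port)
    table = connection.table("test_info")
    # Batch the puts for throughput instead of one RPC per row.
    with table.batch(batch_size=1000) as batch:
        for record in records:
            row_key, data = build_row(*record)
            batch.put(row_key, data)
    connection.close()
```

The pure `build_row` helper keeps the key/column mapping in one place, so it can be unit-tested without a running cluster.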
Then we can use Impala Shell for query and analysis. We insert 20,000,000 (20 million) records into the example table created above and run a statistical analysis example. The SQL statement is as follows:

SELECT user_type, COUNT(user_id) AS cnt FROM test_info WHERE gender='M' GROUP BY user_type ORDER BY cnt DESC LIMIT 10;
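Note that the aggregation above scans the whole HBase table through Impala. Point lookups on the mapped RowKey column, by contrast, can be pushed down to HBase as direct key lookups; a hypothetical example (the key value is made up):

```sql
SELECT user_type, gender, birthday FROM test_info WHERE user_id = 'u0001';
```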


The Hadoop cluster on which the preceding query ran has three DataNodes, and the total execution time of the statistical SQL statement above was 88.13 s. My Hadoop cluster configuration is relatively low-end: two of the nodes have dual-core CPUs and the other has four cores; memory is sufficient, about 10 GB, but many other programs share these nodes, such as database servers and a SOLR cluster. With better hardware and some optimization, a statistical analysis over 20,000,000 (20 million) records should return results within about 5 seconds.
Because the test data is randomly generated, with gender taking the values 'M' and 'F' and user_type taking values 1 to 10, the data is distributed evenly across the groups.

Reference

  • http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_impala_hbase.html
  • http://shiyanjun.cn/archives/111.html

Original article address: Impala and HBase integration practices, thanks to the original author for sharing.
