We know that HBase is a column-oriented NoSQL database that supports flexible data storage; conceptually, it is one large table. In many applications, a well-designed RowKey lets you store and access massive amounts of data quickly. However, for complex query and statistics requirements, implementing them directly on the HBase APIs performs very poorly, and implementing them as MapReduce programs for query analysis inherits the high latency of MapReduce.
Integrating Impala with HBase brings the following benefits:
- We can use familiar SQL statements. As with a traditional relational database, it is easy to write SQL for complex queries and statistical analysis.
- Impala performs query statistics and analysis much faster than native MapReduce or Hive.
To integrate Impala with HBase, you need to map the RowKey and columns of HBase to the fields of an Impala table. Impala uses the Hive Metastore to store metadata, so, as with Hive, the integration with HBase is implemented through an EXTERNAL table.
Preparations
First, we need to make the following preparations:
- Install and configure Hadoop cluster (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_4_4.html)
- Install and configure HBase cluster (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_20.html)
- Install and configure Hive (http://www.cloudera.com/content/cloudera-content/cloudera-docs/CDH4/latest/CDH4-Installation-Guide/cdh4ig_topic_18.html)
- Install and configure Impala (http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_noncm_installation.html?scroll=noncm_installation)
For installation and configuration of related systems, refer to relevant documents and materials.
The following describes how to integrate Impala with HBase, using the example table test_info:
Integration Process
- Create a table in HBase
First, we use the HBase shell to create the table, as shown below:
create 'test_info', 'info'
The table name is test_info and it has a single column family named info. We plan to store four pieces of user data: user_id, user_type, gender, and birthday. The user_id serves as the RowKey, and the remaining three are stored in the info column family as info:user_type, info:gender, and info:birthday.
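To illustrate the intended layout, a single user row could be written from the HBase shell like this (the RowKey 'u_0000001' and the values are made up for this example):

put 'test_info', 'u_0000001', 'info:user_type', '3'
put 'test_info', 'u_0000001', 'info:gender', 'M'
put 'test_info', 'u_0000001', 'info:birthday', '1990-01-01'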
- Create an External table in Hive
Create an External table. The corresponding DDL is as follows:
CREATE EXTERNAL TABLE sho.test_info(
  user_id string,
  user_type tinyint,
  gender string,
  birthday string)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ("hbase.columns.mapping" = ":key,info:user_type,info:gender,info:birthday")
TBLPROPERTIES("hbase.table.name" = "test_info");
In the preceding DDL statement, the WITH SERDEPROPERTIES option specifies the mapping between the Hive external table fields and the HBase columns: ":key" corresponds to the RowKey in HBase and is exposed as the field "user_id"; the rest are column names in the column family info. Note that the hbase.columns.mapping string is whitespace-sensitive, so do not put spaces between its entries. Finally, TBLPROPERTIES specifies the name of the HBase table to map.
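To make the mapping concrete, here is how the example row written from the HBase shell above would surface through the external table (the values are the hypothetical ones used earlier):

HBase cell                        Mapped field
RowKey = u_0000001                user_id   = 'u_0000001'
info:user_type = 3                user_type = 3
info:gender = M                   gender    = 'M'
info:birthday = 1990-01-01        birthday  = '1990-01-01'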
- Synchronize metadata in Impala
Impala shares the Hive Metastore. To synchronize metadata, run the following command in Impala Shell:
INVALIDATE METADATA;
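Note that INVALIDATE METADATA with no arguments reloads the metadata of all databases and tables, which can be expensive on a large Metastore. Impala releases that support table-level invalidation also accept a table name, for example:

INVALIDATE METADATA test_info;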
Then, you can view the structure of the table mapped to HBase:
DESC test_info;
The table structure, as defined by the DDL above, should look like this:

user_id     string
user_type   tinyint
gender      string
birthday    string
Through the above three steps, we have completed the integration configuration of Impala and HBase.
Verify Integration
Next, we verify in practice that the above configuration works.
We simulate a client inserting data into the HBase table, which can be done through the HBase API or the HBase Thrift interface. Here we use the HBase Thrift interface; for details, see the HBase Thrift Client Java API practice article.
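The Thrift-based client is covered in the referenced article. As a simpler alternative for generating test rows, the following is a minimal sketch using HBase's native Java client (the 0.94-era API shipped with CDH4); the class name, RowKey, and values are made up for this example:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class TestInfoLoader {
    public static void main(String[] args) throws Exception {
        // Picks up hbase-site.xml (ZooKeeper quorum, etc.) from the classpath.
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "test_info");
        try {
            // The RowKey becomes the user_id field of the mapped Impala table.
            Put put = new Put(Bytes.toBytes("u_0000001"));
            // Values are written as strings, matching the default (non-binary)
            // encoding assumed by the hbase.columns.mapping above.
            put.add(Bytes.toBytes("info"), Bytes.toBytes("user_type"), Bytes.toBytes("3"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("gender"), Bytes.toBytes("M"));
            put.add(Bytes.toBytes("info"), Bytes.toBytes("birthday"), Bytes.toBytes("1990-01-01"));
            table.put(put);
        } finally {
            table.close();
        }
    }
}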
Then, we can use Impala Shell for query and analysis. We inserted 20,000,000 (20 million) records into the example table created above and ran a statistical analysis. The SQL statement is as follows:
SELECT user_type, COUNT(user_id) AS cnt FROM test_info WHERE gender='M' GROUP BY user_type ORDER BY cnt DESC LIMIT 10;
(The original article shows a screenshot of the query result here.)
The Hadoop cluster running the preceding query has three DataNodes, and the total execution time of the statistical SQL statement above was 88.13s. My Hadoop cluster configuration is fairly low-end: two nodes have dual-core CPUs and the other has a 4-core CPU; memory is sufficient, about 10 GB; and many other programs share these nodes, such as database servers and a Solr cluster. With better hardware and some optimization, a statistical analysis over 20,000,000 (20 million) records should return results within about 5 seconds.
Because the test data was randomly generated, with gender values of 'M' and 'F' and user_type values from 1 to 10, the data is evenly distributed across the groups.
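If you want to see how Impala executes such a query against HBase, for example whether the WHERE predicate is applied during the HBase scan, you can prefix the statement with EXPLAIN in Impala Shell:

EXPLAIN SELECT user_type, COUNT(user_id) AS cnt FROM test_info WHERE gender='M' GROUP BY user_type ORDER BY cnt DESC LIMIT 10;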
Reference
- http://www.cloudera.com/content/cloudera-content/cloudera-docs/Impala/latest/Installing-and-Using-Impala/ciiu_impala_hbase.html
- http://shiyanjun.cn/archives/111.html
Original article: Impala and HBase Integration Practices; thanks to the original author for sharing.