HBase Quick Data Import: BulkLoad

Apache HBase is a distributed, column-oriented, open-source database that provides random, real-time access to big data. But how can we import data into HBase efficiently? HBase supports several import methods. The most direct is to use TableOutputFormat as the output format of a MapReduce job, or to use the standard client API. However, these approaches are not very efficient.

BulkLoad uses a MapReduce job to write table data in HBase's internal data format, and then imports the generated StoreFiles directly into a running cluster. Compared with going through the HBase API, importing data with BulkLoad consumes less CPU and network bandwidth.

The BulkLoad process consists of three steps:

1. Extract data from the data source (usually text files or another database) and upload it to HDFS

This step is outside HBase's scope. Whatever the data source is, you only need to upload the data to HDFS before proceeding to the next step.
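As a sketch, the upload can be done with the HDFS Java API; the local and HDFS paths below are placeholders, not part of the original article:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: copy a local text file into HDFS so the MapReduce
// job in step 2 can read it. Both paths are hypothetical examples.
public class UploadToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("/tmp/temperature.txt"),
                             new Path("/user/hbase/input/temperature.txt"));
        fs.close();
    }
}
```

The same result can of course be achieved with `hdfs dfs -put` from the command line.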

2. Prepare the data with a MapReduce job

This step requires a MapReduce job. In most cases you only need to write the Map function yourself; the Reduce function is provided by HBase. The job uses the rowkey (row key) as the output key, and a KeyValue, Put, or Delete as the output value. The job must use HFileOutputFormat2 to generate HBase data files. For the import to be efficient, HFileOutputFormat2 has to be configured so that each output file fits within a single region. To achieve this, the job uses Hadoop's TotalOrderPartitioner class to partition the output according to the table's key ranges. The configureIncrementalLoad() method of HFileOutputFormat2 sets all of this up automatically.

3. Tell the RegionServers where the data is and import it

This step is the simplest. You generally use LoadIncrementalHFiles (better known as the completebulkload tool): pass it the file location on HDFS, and it will have each RegionServer import the data into the corresponding region.
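The HFileLoader helper invoked from the driver code below is not included in this copy of the article. A minimal sketch of what it might look like, using the LoadIncrementalHFiles API of the HBase 1.x client (the class name and method signature are assumptions taken from the driver's call site):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

// Hypothetical helper matching the HFileLoader.doBulkLoad(outputPath, tableName)
// call in BulkLoadDriver; not shown in the original article.
public class HFileLoader {
    public static void doBulkLoad(String pathToHFiles, String tableName) {
        try {
            Configuration conf = HBaseConfiguration.create();
            Connection connection = ConnectionFactory.createConnection(conf);
            TableName name = TableName.valueOf(tableName);
            Table table = connection.getTable(name);
            Admin admin = connection.getAdmin();
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            // Moves the generated HFiles into the regions that own their key ranges.
            loader.doBulkLoad(new Path(pathToHFiles), admin, table,
                              connection.getRegionLocator(name));
            table.close();
            connection.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```

Equivalently, the stock completebulkload tool can be run from the command line against the output directory.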

The figure below gives a simple and clear overview of the entire process.

[Figure: overview of the BulkLoad pipeline, from "How-to: Use HBase Bulk Loading, and Why"]

Note: before running BulkLoad, create an empty table in HBase with the same name and column families as the ones used in the program.
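A sketch of creating that empty table with the HBase 1.x Admin API, using the table name and column families from the driver below (the class name is an arbitrary choice):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

// Sketch: create the empty "temperature" table with the two column
// families the program expects, if it does not exist yet.
public class CreateTemperatureTable {
    public static void main(String[] args) throws Exception {
        Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
        Admin admin = connection.getAdmin();
        TableName name = TableName.valueOf("temperature");
        if (!admin.tableExists(name)) {
            HTableDescriptor desc = new HTableDescriptor(name);
            desc.addFamily(new HColumnDescriptor("date"));
            desc.addFamily(new HColumnDescriptor("tempPerHour"));
            admin.createTable(desc);
        }
        admin.close();
        connection.close();
    }
}
```

The same can be done in one line from the HBase shell.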

The Java implementation is as follows:

BulkLoadDriver.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Created by shaobo on 15-6-9.
 */
public class BulkLoadDriver extends Configured implements Tool {
    private static final String DATA_SEPARATOR = "\\s+";
    private static final String TABLE_NAME = "temperature";       // table name
    private static final String COLUMN_FAMILY_1 = "date";         // column family 1
    private static final String COLUMN_FAMILY_2 = "tempPerHour";  // column family 2

    public static void main(String[] args) {
        try {
            int response = ToolRunner.run(HBaseConfiguration.create(), new BulkLoadDriver(), args);
            if (response == 0) {
                System.out.println("Job is successfully completed...");
            } else {
                System.out.println("Job failed...");
            }
        } catch (Exception exception) {
            exception.printStackTrace();
        }
    }

    public int run(String[] args) throws Exception {
        String outputPath = args[1];
        /**
         * Set job parameters
         */
        Configuration configuration = getConf();
        configuration.set("data.separator", DATA_SEPARATOR);
        configuration.set("hbase.table.name", TABLE_NAME);
        configuration.set("COLUMN_FAMILY_1", COLUMN_FAMILY_1);
        configuration.set("COLUMN_FAMILY_2", COLUMN_FAMILY_2);
        Job job = Job.getInstance(configuration, "Bulk Loading HBase Table: " + TABLE_NAME);
        job.setJarByClass(BulkLoadDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class); // map output key class
        job.setMapOutputValueClass(Put.class);                  // map output value class
        job.setMapperClass(BulkLoadMapper.class);               // the Map function
        FileInputFormat.addInputPaths(job, args[0]);            // input path
        FileSystem fs = FileSystem.get(configuration);
        Path output = new Path(outputPath);
        if (fs.exists(output)) {
            fs.delete(output, true); // delete the output path if it already exists
        }
        FileOutputFormat.setOutputPath(job, output); // output path
        Connection connection = ConnectionFactory.createConnection(configuration);
        TableName tableName = TableName.valueOf(TABLE_NAME);
        HFileOutputFormat2.configureIncrementalLoad(job, connection.getTable(tableName), connection.getRegionLocator(tableName));
        job.waitForCompletion(true);
        if (job.isSuccessful()) {
            HFileLoader.doBulkLoad(outputPath, TABLE_NAME); // import the generated HFiles
            return 0;
        } else {
            return 1;
        }
    }

}

BulkLoadMapper.java
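The mapper source is cut off in this copy of the article. A minimal sketch consistent with the driver above (whitespace-separated input, first field used as the rowkey, column families `date` and `tempPerHour`) might look like the following; the input field layout and the `value` qualifier are assumptions, since the original file format is not shown:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: assumes each input line is "<rowkey> <date> <temperature>",
// matching the whitespace separator configured in BulkLoadDriver.
public class BulkLoadMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\\s+");
        if (fields.length < 3) {
            return; // skip malformed lines
        }
        byte[] rowKey = Bytes.toBytes(fields[0]);
        Put put = new Put(rowKey);
        // Assumed qualifier name "value" in each column family.
        put.addColumn(Bytes.toBytes("date"), Bytes.toBytes("value"), Bytes.toBytes(fields[1]));
        put.addColumn(Bytes.toBytes("tempPerHour"), Bytes.toBytes("value"), Bytes.toBytes(fields[2]));
        context.write(new ImmutableBytesWritable(rowKey), put);
    }
}
```

The map output key/value types (ImmutableBytesWritable, Put) match those set in the driver, as required by configureIncrementalLoad().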
