HBase Quick Data Import: BulkLoad

Apache HBase is a distributed, column-oriented, open-source database that provides random, real-time access to big data. But how can we import data into HBase efficiently? HBase supports several import methods. The most direct is to use TableOutputFormat as the output format of a MapReduce job, or to use the standard client API. However, these approaches are not very efficient.

BulkLoad uses a MapReduce job to write table data in HBase's internal data format, and then imports the generated StoreFiles directly into a running cluster. Compared with going through the HBase API, importing data with BulkLoad consumes less CPU and network bandwidth.

The BulkLoad process consists of three steps:

1. Extract data from the data source (usually text files or another database) and upload it to HDFS

This step is outside HBase's scope. Whatever the data source is, you only need to upload the data to HDFS before proceeding to the next step.
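As a sketch, the upload can be done with the HDFS Java API; the local and HDFS paths below are placeholders, not part of the original article:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Minimal sketch: copy a local text file into HDFS so the MapReduce
// job in step 2 can read it. Both paths are hypothetical examples.
public class UploadToHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration(); // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);
        fs.copyFromLocalFile(new Path("/tmp/temperature.txt"),
                             new Path("/user/hbase/input/temperature.txt"));
        fs.close();
    }
}
```

The same result can of course be achieved with `hdfs dfs -put` from the command line.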

2. Prepare the data with a MapReduce job

This step requires a MapReduce job. In most cases you only need to write the Map function yourself; the Reduce function is provided by HBase. The job uses the rowkey (row key) as the output key, and a KeyValue, Put, or Delete as the output value. The job must use HFileOutputFormat2 to generate HBase data files. For the import to be efficient, HFileOutputFormat2 has to be configured so that each output file fits within a single region. To achieve this, the job uses Hadoop's TotalOrderPartitioner class to partition the output according to the table's key ranges. The configureIncrementalLoad() method of HFileOutputFormat2 sets all of this up automatically.

3. Tell the RegionServers where the data is and import it

This step is the simplest. You generally use LoadIncrementalHFiles (better known as the completebulkload tool): pass it the file location on HDFS, and it will have each RegionServer import the data into the corresponding region.
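The HFileLoader helper invoked from the driver code below is not included in this copy of the article. A minimal sketch of what it might look like, using the LoadIncrementalHFiles API of the HBase 1.x client (the class name and method signature are assumptions taken from the driver's call site):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

// Hypothetical helper matching the HFileLoader.doBulkLoad(outputPath, tableName)
// call in BulkLoadDriver; not shown in the original article.
public class HFileLoader {
    public static void doBulkLoad(String pathToHFiles, String tableName) {
        try {
            Configuration conf = HBaseConfiguration.create();
            Connection connection = ConnectionFactory.createConnection(conf);
            TableName name = TableName.valueOf(tableName);
            Table table = connection.getTable(name);
            Admin admin = connection.getAdmin();
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            // Moves the generated HFiles into the regions that own their key ranges.
            loader.doBulkLoad(new Path(pathToHFiles), admin, table,
                              connection.getRegionLocator(name));
            table.close();
            connection.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
```

Equivalently, the stock completebulkload tool can be run from the command line against the output directory.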

The figure below gives a simple and clear overview of the entire process.

[Figure: overview of the BulkLoad pipeline, from "How-to: Use HBase Bulk Loading, and Why"]

Note: before running BulkLoad, create an empty table in HBase with the same name and column families as the ones used in the program.
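A sketch of creating that empty table with the HBase 1.x Admin API, using the table name and column families from the driver below (the class name is an arbitrary choice):

```java
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

// Sketch: create the empty "temperature" table with the two column
// families the program expects, if it does not exist yet.
public class CreateTemperatureTable {
    public static void main(String[] args) throws Exception {
        Connection connection = ConnectionFactory.createConnection(HBaseConfiguration.create());
        Admin admin = connection.getAdmin();
        TableName name = TableName.valueOf("temperature");
        if (!admin.tableExists(name)) {
            HTableDescriptor desc = new HTableDescriptor(name);
            desc.addFamily(new HColumnDescriptor("date"));
            desc.addFamily(new HColumnDescriptor("tempPerHour"));
            admin.createTable(desc);
        }
        admin.close();
        connection.close();
    }
}
```

The same can be done in one line from the HBase shell.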

The Java implementation is as follows:

BulkLoadDriver.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Created by shaobo on 15-6-9.
 */
public class BulkLoadDriver extends Configured implements Tool {
    private static final String DATA_SEPARATOR = "\\s+";
    private static final String TABLE_NAME = "temperature";       // table name
    private static final String COLUMN_FAMILY_1 = "date";         // column family 1
    private static final String COLUMN_FAMILY_2 = "tempPerHour";  // column family 2

    public static void main(String[] args) {
        try {
            int response = ToolRunner.run(HBaseConfiguration.create(), new BulkLoadDriver(), args);
            if (response == 0) {
                System.out.println("Job is successfully completed...");
            } else {
                System.out.println("Job failed...");
            }
        } catch (Exception exception) {
            exception.printStackTrace();
        }
    }

    public int run(String[] args) throws Exception {
        String outputPath = args[1];
        /**
         * Set job parameters
         */
        Configuration configuration = getConf();
        configuration.set("data.separator", DATA_SEPARATOR);
        configuration.set("hbase.table.name", TABLE_NAME);
        configuration.set("COLUMN_FAMILY_1", COLUMN_FAMILY_1);
        configuration.set("COLUMN_FAMILY_2", COLUMN_FAMILY_2);
        Job job = Job.getInstance(configuration, "Bulk Loading HBase Table: " + TABLE_NAME);
        job.setJarByClass(BulkLoadDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class); // map output key class
        job.setMapOutputValueClass(Put.class);                  // map output value class
        job.setMapperClass(BulkLoadMapper.class);               // the Map function
        FileInputFormat.addInputPaths(job, args[0]);            // input path
        FileSystem fs = FileSystem.get(configuration);
        Path output = new Path(outputPath);
        if (fs.exists(output)) {
            fs.delete(output, true); // delete the output path if it already exists
        }
        FileOutputFormat.setOutputPath(job, output); // output path
        Connection connection = ConnectionFactory.createConnection(configuration);
        TableName tableName = TableName.valueOf(TABLE_NAME);
        HFileOutputFormat2.configureIncrementalLoad(job, connection.getTable(tableName), connection.getRegionLocator(tableName));
        job.waitForCompletion(true);
        if (job.isSuccessful()) {
            HFileLoader.doBulkLoad(outputPath, TABLE_NAME); // import the generated HFiles
            return 0;
        } else {
            return 1;
        }
    }

}

BulkLoadMapper.java
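The mapper source is cut off in this copy of the article. A minimal sketch consistent with the driver above (whitespace-separated input, first field used as the rowkey, column families `date` and `tempPerHour`) might look like the following; the input field layout and the `value` qualifier are assumptions, since the original file format is not shown:

```java
import java.io.IOException;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Hypothetical mapper: assumes each input line is "<rowkey> <date> <temperature>",
// matching the whitespace separator configured in BulkLoadDriver.
public class BulkLoadMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] fields = value.toString().split("\\s+");
        if (fields.length < 3) {
            return; // skip malformed lines
        }
        byte[] rowKey = Bytes.toBytes(fields[0]);
        Put put = new Put(rowKey);
        // Assumed qualifier name "value" in each column family.
        put.addColumn(Bytes.toBytes("date"), Bytes.toBytes("value"), Bytes.toBytes(fields[1]));
        put.addColumn(Bytes.toBytes("tempPerHour"), Bytes.toBytes("value"), Bytes.toBytes(fields[2]));
        context.write(new ImmutableBytesWritable(rowKey), put);
    }
}
```

The map output key/value types (ImmutableBytesWritable, Put) match those set in the driver, as required by configureIncrementalLoad().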
