HBase quick data import: BulkLoad

Apache HBase is a distributed, column-oriented, open-source database that gives us random, real-time access to big data. But how do we import data into HBase efficiently? HBase offers several import methods. The most direct is to use TableOutputFormat as the output of a MapReduce job, or to use the standard client API. However, these methods are not very efficient.

BulkLoad uses a MapReduce job to write out table data in HBase's internal data format and then loads the generated StoreFiles directly into the cluster. Compared with using the HBase API, importing data with BulkLoad consumes less CPU and network bandwidth.

The BulkLoad process consists of three steps:

1. Extract the data from the data source (usually text files or another database) and upload it to HDFS

This step is outside the scope of HBase. Whatever the data source is, you only need to upload the data to HDFS before proceeding to the next step.
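For example, here is a minimal sketch of that upload using the Hadoop FileSystem API; the class name and both paths are placeholders, not taken from the original program:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadToHdfs {
    public static void main(String[] args) throws Exception {
        // Picks up the cluster configuration (core-site.xml, etc.) from the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Copy the local source file into HDFS; both paths are placeholders.
        fs.copyFromLocalFile(new Path("/local/path/temperature.txt"),
                             new Path("/user/input/temperature.txt"));
        fs.close();
    }
}

The same result can of course be achieved from the command line with hdfs dfs -put.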

2. Prepare the data using a MapReduce job

This step requires a MapReduce job, and in most cases we need to write the map function ourselves; the reduce function does not need to be written, since HBase provides it. The job uses the rowkey as the output key and a KeyValue, Put, or Delete as the output value. The job must use HFileOutputFormat2 to generate the HBase data files. For the import to be efficient, HFileOutputFormat2 has to be configured so that each output file fits within a single region. To achieve this, the job uses Hadoop's TotalOrderPartitioner class to partition the output according to the table's key ranges. The configureIncrementalLoad() method of HFileOutputFormat2 does all of this automatically.

3. Tell the RegionServers where the data is and import it

This step is the simplest. You generally use LoadIncrementalHFiles (better known as the completebulkload tool): pass it the location of the files on HDFS, and it will have the RegionServers import the data into the corresponding regions.

A simple and clear illustration of the entire process:

(Image from "How-to: Use HBase Bulk Loading, and Why")

Note: Before running BulkLoad, create an empty table in HBase with the same name and column families as those used in the program.
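As a minimal sketch (assuming the HBase 1.x client API), the table could be created programmatically as shown below; the table and column-family names match the constants used in the code that follows, but this snippet is not part of the original program:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateTemperatureTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            // Table "temperature" with column families "date" and "tempPerHour",
            // matching TABLE_NAME, COLUMN_FAMILY_1, and COLUMN_FAMILY_2 below.
            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("temperature"));
            table.addFamily(new HColumnDescriptor("date"));
            table.addFamily(new HColumnDescriptor("tempPerHour"));
            if (!admin.tableExists(table.getTableName())) {
                admin.createTable(table);
            }
        }
    }
}

Alternatively, the table can be created from the HBase shell with: create 'temperature', 'date', 'tempPerHour'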

The Java implementation is as follows:

BulkLoadDriver.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Created by shaobo on 15-6-9.
 */
public class BulkLoadDriver extends Configured implements Tool {
    private static final String DATA_SEPERATOR = "\\s+";
    private static final String TABLE_NAME = "temperature";       // table name
    private static final String COLUMN_FAMILY_1 = "date";         // column family 1
    private static final String COLUMN_FAMILY_2 = "tempPerHour";  // column family 2

    public static void main(String[] args) {
        try {
            int response = ToolRunner.run(HBaseConfiguration.create(), new BulkLoadDriver(), args);
            if (response == 0) {
                System.out.println("Job is successfully completed...");
            } else {
                System.out.println("Job failed...");
            }
        } catch (Exception exception) {
            exception.printStackTrace();
        }
    }

    public int run(String[] args) throws Exception {
        String outputPath = args[1];
        /**
         * Set job parameters
         */
        Configuration configuration = getConf();
        configuration.set("data.seperator", DATA_SEPERATOR);
        configuration.set("hbase.table.name", TABLE_NAME);
        configuration.set("COLUMN_FAMILY_1", COLUMN_FAMILY_1);
        configuration.set("COLUMN_FAMILY_2", COLUMN_FAMILY_2);
        Job job = Job.getInstance(configuration, "Bulk Loading HBase Table: " + TABLE_NAME);
        job.setJarByClass(BulkLoadDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);   // map output key class
        job.setMapOutputValueClass(Put.class);                    // map output value class
        job.setMapperClass(BulkLoadMapper.class);                 // map function
        FileInputFormat.addInputPaths(job, args[0]);              // input path
        FileSystem fs = FileSystem.get(configuration);
        Path output = new Path(outputPath);
        if (fs.exists(output)) {
            fs.delete(output, true);                              // delete the output path if it already exists
        }
        FileOutputFormat.setOutputPath(job, output);              // output path
        Connection connection = ConnectionFactory.createConnection(configuration);
        TableName tableName = TableName.valueOf(TABLE_NAME);
        HFileOutputFormat2.configureIncrementalLoad(job, connection.getTable(tableName), connection.getRegionLocator(tableName));
        job.waitForCompletion(true);
        if (job.isSuccessful()) {
            HFileLoader.doBulkLoad(outputPath, TABLE_NAME);       // import data
            return 0;
        } else {
            return 1;
        }
    }

}

BulkLoadMapper.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Created by shaobo on 15-6-9.
 */
public class BulkLoadMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    private String hbaseTable;
    private String dataSeperator;
    private String columnFamily1;
    private String columnFamily2;

    public void setup(Context context) {
        Configuration configuration = context.getConfiguration();  // get job parameters
        hbaseTable = configuration.get("hbase.table.name");
        dataSeperator = configuration.get("data.seperator");
        columnFamily1 = configuration.get("COLUMN_FAMILY_1");
        columnFamily2 = configuration.get("COLUMN_FAMILY_2");
    }

    public void map(LongWritable key, Text value, Context context) {
        try {
            String[] values = value.toString().split(dataSeperator);
            ImmutableBytesWritable rowKey = new ImmutableBytesWritable(values[0].getBytes());
            Put put = new Put(Bytes.toBytes(values[0]));
            put.addColumn(Bytes.toBytes(columnFamily1), Bytes.toBytes("month"), Bytes.toBytes(values[1]));
            put.addColumn(Bytes.toBytes(columnFamily1), Bytes.toBytes("day"), Bytes.toBytes(values[2]));
            for (int i = 3; i < values.length; ++i) {
                put.addColumn(Bytes.toBytes(columnFamily2), Bytes.toBytes("hour:" + i), Bytes.toBytes(values[i]));
            }
            context.write(rowKey, put);
        } catch (Exception exception) {
            exception.printStackTrace();
        }

    }

}

HFileLoader.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

/**
 * Created by shaobo on 15-6-9.
 */
public class HFileLoader {
    public static void doBulkLoad(String pathToHFile, String tableName) {
        try {
            Configuration configuration = new Configuration();
            HBaseConfiguration.addHbaseResources(configuration);
            LoadIncrementalHFiles loadFiles = new LoadIncrementalHFiles(configuration);
            HTable hTable = new HTable(configuration, tableName);  // specify the table name
            loadFiles.doBulkLoad(new Path(pathToHFile), hTable);   // import data
            System.out.println("Bulk Load Completed..");
        } catch (Exception exception) {
            exception.printStackTrace();
        }

    }

}

Compile and package the program, then submit it to Hadoop to run:

HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar BulkLoad.jar inputpath outputpath

For the usage of the above command, refer to 44. HBase, MapReduce, and the CLASSPATH.

Job running status:

15/06/14 14:31:07 INFO mapreduce.HFileOutputFormat2: Looking up current regions for table temperature (the table name)
15/06/14 14:31:07 INFO mapreduce.HFileOutputFormat2: Sorting 1 reduce partitions to match current region count
15/06/14 14:31:07 INFO mapreduce.HFileOutputFormat2: Writing partition information to /home/shaobo/hadoop/tmp/partitions_5d464f1e-d412-4dbe-bb98-367f8431bdc9
15/06/14 14:31:07 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
15/06/14 14:31:07 INFO compress.CodecPool: Got brand-new compressor [.deflate]
15/06/14 14:31:08 INFO mapreduce.HFileOutputFormat2: Incremental table temperature (the table name) output configured.
15/06/14 14:31:08 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
15/06/14 14:31:15 INFO input.FileInputFormat: Total input paths to process: 2
15/06/14 14:31:15 INFO mapreduce.JobSubmitter: number of splits: 2
15/06/14 14:31:16 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1434262360688_0002
15/06/14 14:31:17 INFO impl.YarnClientImpl: Submitted application application_1434262360688_0002
15/06/14 14:31:17 INFO mapreduce.Job: The url to track the job: http://shaobo-ThinkPad-E420:8088/proxy/application_1434262360688_0002/
15/06/14 14:31:17 INFO mapreduce.Job: Running job: job_1434262360688_0002
15/06/14 14:31:28 INFO mapreduce.Job: Job job_1434262360688_0002 running in uber mode: false
15/06/14 14:31:28 INFO mapreduce.Job: map 0% reduce 0%
15/06/14 14:32:24 INFO mapreduce.Job: map 49% reduce 0%
15/06/14 14:32:37 INFO mapreduce.Job: map 67% reduce 0%
15/06/14 14:32:43 INFO mapreduce.Job: map 100% reduce 0%
15/06/14 14:33:39 INFO mapreduce.Job: map 100% reduce 67%
15/06/14 14:33:42 INFO mapreduce.Job: map 100% reduce 70%
15/06/14 14:33:45 INFO mapreduce.Job: map 100% reduce 88%
15/06/14 14:33:48 INFO mapreduce.Job: map 100% reduce 100%
15/06/14 14:33:52 INFO mapreduce.Job: Job job_1434262360688_0002 completed successfully
...
...
...
15/06/14 14:34:02 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://localhost:9000/user/output/_SUCCESS
15/06/14 14:34:03 INFO hfile.CacheConfig: disabled
15/06/14 14:34:03 INFO hfile.CacheConfig: disabled
15/06/14 14:34:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://localhost:9000/user/output/date/login first=AQW00061705 last=USW00094910
15/06/14 14:34:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://localhost:9000/user/output/tempPerHour/export first=AQW00061705 last=USW00094910
Bulk Load Completed..
Job is successfully completed...

The third step of the BulkLoad process can also be carried out from the command line after the MapReduce job has generated the HBase data files; it does not have to be written together with the MapReduce program:

$ hadoop jar hbase-server-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] outputpath tablename

If the following exception occurs during job submission:

15/06/16 11:41:06 INFO mapreduce.Job: Job job_1434420992867_0003 failed with state FAILED due to: Application failed 2 times due to AM Container for processing exited with exitCode: -1000
For more detailed output, check application tracking page: http://cdh1:8088/proxy/application_1434420992867_0003/ Then, click on links to logs of each attempt.
Diagnostics: Rename cannot overwrite non empty destination directory /data/yarn/nm/usercache/hdfs/filecache/16
java.io.IOException: Rename cannot overwrite non empty destination directory /data/yarn/nm/usercache/hdfs/filecache/16
        at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716)
        at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228)
        at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659)
        at org.apache.hadoop.fs.FileContext.rename(FileContext.java:909)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:364)
        at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
        at java.util.concurrent.FutureTask.run(FutureTask.java:262)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)

Failing this attempt. Failing the application.
15/06/16 11:41:06 INFO mapreduce.Job: Counters: 0

The fix is to delete the files under /data/yarn/nm/usercache/hdfs/filecache on the cdh2 and cdh3 nodes. See http://stackoverflow.com/questions/30857413/hadoop-complains-about-attempting-to-overwrite-nonempty-destination-directory
