HBase Quick Data Import: BulkLoad
Apache HBase is a distributed, column-oriented, open-source database that gives us random, real-time access to big data. But how do we load data into HBase efficiently? HBase offers several import methods. The most straightforward ones are to use TableOutputFormat as the output format of a MapReduce job, or to use the standard client API, but neither is very efficient.
BulkLoad instead uses a MapReduce job to write table data in HBase's internal data format and then loads the generated StoreFiles directly into the cluster. Compared with going through the HBase API, importing data with BulkLoad uses less CPU and network bandwidth.
The BulkLoad process consists of three steps:
1. Extract data from the data source (usually text files or another database) and upload it to HDFS
This step is outside HBase's scope. Whatever the data source is, you just need to get the data onto HDFS before moving on to the next step (see the sketch after this list).
2. Prepare the data with a MapReduce job
This step requires a MapReduce job, and in most cases you need to write the map function yourself; the reduce function is provided by HBase and does not need to be written. The job uses the row key as the output key and a KeyValue, Put, or Delete as the output value, and it must use HFileOutputFormat2 to generate the HBase data files. For the import to be efficient, HFileOutputFormat2 has to be configured so that each output file falls within a single region; to achieve this, the job uses Hadoop's TotalOrderPartitioner to partition the output according to the table's key ranges. The configureIncrementalLoad() method of HFileOutputFormat2 sets all of this up automatically.
3. Tell the RegionServers where the data files are and import them
This step is the simplest. You usually just hand the HDFS location of the generated files to LoadIncrementalHFiles (better known as the completebulkload tool), and it asks the RegionServers to move each file into the appropriate region.
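For step 1, the upload can be done with the hdfs command line (hdfs dfs -put ...) or programmatically. Below is a minimal sketch in Java; the local file name weather_data.txt and the HDFS directory /user/shaobo/input are made-up examples, so adjust them to your own environment.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class UploadToHdfs {
    public static void main(String[] args) throws Exception {
        // Reads core-site.xml / hdfs-site.xml from the classpath to locate the cluster
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        // Copy the local source file (hypothetical name) into an HDFS input directory (hypothetical path)
        fs.copyFromLocalFile(new Path("weather_data.txt"),
                new Path("/user/shaobo/input/weather_data.txt"));
        fs.close();
    }
}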
The figure in "How-to: Use HBase Bulk Loading, and Why" gives a simple, clear illustration of the entire process.
Note: Before running the BulkLoad, create an empty HBase table whose name and column families match what the program expects.
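For example, the table used below could be created ahead of time in the HBase shell with create 'temperature', 'date', 'tempPerHour', or with the Java admin API. A minimal sketch of the latter, using the same table and column-family names as the program below, might look like this:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class CreateTemperatureTable {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {
            HTableDescriptor desc = new HTableDescriptor(TableName.valueOf("temperature"));
            desc.addFamily(new HColumnDescriptor("date"));        // column family 1
            desc.addFamily(new HColumnDescriptor("tempPerHour")); // column family 2
            if (!admin.tableExists(desc.getTableName())) {
                admin.createTable(desc);                          // create the empty table
            }
        }
    }
}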
The Java implementation is as follows:
BulkLoadDriver.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Created by shaobo on 15-6-9.
 */
public class BulkLoadDriver extends Configured implements Tool {
    private static final String DATA_SEPERATOR = "\\s+";
    private static final String TABLE_NAME = "temperature";      // table name
    private static final String COLUMN_FAMILY_1 = "date";        // column family 1
    private static final String COLUMN_FAMILY_2 = "tempPerHour"; // column family 2

    public static void main(String[] args) {
        try {
            int response = ToolRunner.run(HBaseConfiguration.create(), new BulkLoadDriver(), args);
            if (response == 0) {
                System.out.println("Job is successfully completed...");
            } else {
                System.out.println("Job failed...");
            }
        } catch (Exception exception) {
            exception.printStackTrace();
        }
    }

    public int run(String[] args) throws Exception {
        String outputPath = args[1];
        /**
         * Set job parameters
         */
        Configuration configuration = getConf();
        configuration.set("data.seperator", DATA_SEPERATOR);
        configuration.set("hbase.table.name", TABLE_NAME);
        configuration.set("COLUMN_FAMILY_1", COLUMN_FAMILY_1);
        configuration.set("COLUMN_FAMILY_2", COLUMN_FAMILY_2);
        Job job = Job.getInstance(configuration, "Bulk Loading HBase Table: " + TABLE_NAME);
        job.setJarByClass(BulkLoadDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class);  // map output key class
        job.setMapOutputValueClass(Put.class);                   // map output value class
        job.setMapperClass(BulkLoadMapper.class);                // map function
        FileInputFormat.addInputPaths(job, args[0]);             // input path
        FileSystem fs = FileSystem.get(configuration);
        Path output = new Path(outputPath);
        if (fs.exists(output)) {
            fs.delete(output, true);                             // delete the output path if it already exists
        }
        FileOutputFormat.setOutputPath(job, output);             // output path
        Connection connection = ConnectionFactory.createConnection(configuration);
        TableName tableName = TableName.valueOf(TABLE_NAME);
        HFileOutputFormat2.configureIncrementalLoad(job, connection.getTable(tableName), connection.getRegionLocator(tableName));
        job.waitForCompletion(true);
        if (job.isSuccessful()) {
            HFileLoader.doBulkLoad(outputPath, TABLE_NAME);      // load the generated HFiles
            return 0;
        } else {
            return 1;
        }
    }
}
BulkLoadMapper.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Created by shaobo on 15-6-9.
 */
public class BulkLoadMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    private String hbaseTable;
    private String dataSeperator;
    private String columnFamily1;
    private String columnFamily2;

    public void setup(Context context) {
        Configuration configuration = context.getConfiguration(); // get job parameters
        hbaseTable = configuration.get("hbase.table.name");
        dataSeperator = configuration.get("data.seperator");
        columnFamily1 = configuration.get("COLUMN_FAMILY_1");
        columnFamily2 = configuration.get("COLUMN_FAMILY_2");
    }

    public void map(LongWritable key, Text value, Context context) {
        try {
            String[] values = value.toString().split(dataSeperator);
            ImmutableBytesWritable rowKey = new ImmutableBytesWritable(values[0].getBytes());
            Put put = new Put(Bytes.toBytes(values[0]));
            put.addColumn(Bytes.toBytes(columnFamily1), Bytes.toBytes("month"), Bytes.toBytes(values[1]));
            put.addColumn(Bytes.toBytes(columnFamily1), Bytes.toBytes("day"), Bytes.toBytes(values[2]));
            for (int i = 3; i < values.length; ++i) {
                put.addColumn(Bytes.toBytes(columnFamily2), Bytes.toBytes("hour:" + i), Bytes.toBytes(values[i]));
            }
            context.write(rowKey, put);
        } catch (Exception exception) {
            exception.printStackTrace();
        }
    }
}
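The mapper assumes whitespace-separated text records whose first field is the station id (used as the row key), followed by the month, the day, and then one temperature reading per hour. A hypothetical input line (the station id AQW00061705 appears in the job log below; the numeric values are made up for illustration) would look like:
AQW00061705 06 14 24.1 23.8 23.5 23.3 24.0 25.2
Each such line becomes a single Put with date:month and date:day cells, plus one tempPerHour:hour:N cell for every remaining field.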
HFileLoader.java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

/**
 * Created by shaobo on 15-6-9.
 */
public class HFileLoader {
    public static void doBulkLoad(String pathToHFile, String tableName) {
        try {
            Configuration configuration = new Configuration();
            HBaseConfiguration.addHbaseResources(configuration);
            LoadIncrementalHFiles loadHFiles = new LoadIncrementalHFiles(configuration);
            HTable hTable = new HTable(configuration, tableName); // the target table
            loadHFiles.doBulkLoad(new Path(pathToHFile), hTable); // load the HFiles
            System.out.println("Bulk Load Completed..");
        } catch (Exception exception) {
            exception.printStackTrace();
        }
    }
}
Compile and package the program, then submit it to Hadoop to run:
HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar BulkLoad.jar inputpath outputpath
For details on this command, refer to "HBase, MapReduce, and the CLASSPATH" in the HBase Reference Guide.
Job running status:
15/06/14 14:31:07 INFO mapreduce.HFileOutputFormat2: Looking up current regions for table temperature
15/06/14 14:31:07 INFO mapreduce.HFileOutputFormat2: Sorting 1 reduce partitions to match current region count
15/06/14 14:31:07 INFO mapreduce.HFileOutputFormat2: Writing partition information to /home/shaobo/hadoop/tmp/partitions_5d464f1e-d412-4dbe-bb98-367f8431bdc9
15/06/14 14:31:07 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
15/06/14 14:31:07 INFO compress.CodecPool: Got brand-new compressor [.deflate]
15/06/14 14:31:08 INFO mapreduce.HFileOutputFormat2: Incremental table temperature output configured.
15/06/14 14:31:08 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
15/06/14 14:31:15 INFO input.FileInputFormat: Total input paths to process: 2
15/06/14 14:31:15 INFO mapreduce.JobSubmitter: number of splits: 2
15/06/14 14:31:16 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1434262360688_0002
15/06/14 14:31:17 INFO impl.YarnClientImpl: Submitted application application_1434262360688_0002
15/06/14 14:31:17 INFO mapreduce.Job: The url to track the job: http://shaobo-ThinkPad-E420:8088/proxy/application_1434262360688_0002/
15/06/14 14:31:17 INFO mapreduce.Job: Running job: job_1434262360688_0002
15/06/14 14:31:28 INFO mapreduce.Job: Job job_1434262360688_0002 running in uber mode: false
15/06/14 14:31:28 INFO mapreduce.Job: map 0% reduce 0%
15/06/14 14:32:24 INFO mapreduce.Job: map 49% reduce 0%
15/06/14 14:32:37 INFO mapreduce.Job: map 67% reduce 0%
15/06/14 14:32:43 INFO mapreduce.Job: map 100% reduce 0%
15/06/14 14:33:39 INFO mapreduce.Job: map 100% reduce 67%
15/06/14 14:33:42 INFO mapreduce.Job: map 100% reduce 70%
15/06/14 14:33:45 INFO mapreduce.Job: map 100% reduce 88%
15/06/14 14:33:48 INFO mapreduce.Job: map 100% reduce 100%
15/06/14 14:33:52 INFO mapreduce.Job: Job job_1434262360688_0002 completed successfully
...
...
...
15/06/14 14:34:02 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://localhost:9000/user/output/_SUCCESS
15/06/14 14:34:03 INFO hfile.CacheConfig: disabled
15/06/14 14:34:03 INFO hfile.CacheConfig: disabled
15/06/14 14:34:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://localhost:9000/user/output/date/login first=AQW00061705 last=USW00094910
15/06/14 14:34:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://localhost:9000/user/output/tempPerHour/export first=AQW00061705 last=USW00094910
Bulk Load Completed ..
Job is successfully completed...
The third step of the BulkLoad process can also be run from the command line after the MapReduce job has generated the HBase data files; it does not have to be coded together with the MapReduce job:
$ hadoop jar hbase-server-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] outputpath tablename
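Once the load has finished (via either route), a quick scan is an easy sanity check that the rows actually landed in the table. A minimal sketch, assuming the temperature table from above and the standard client API:
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;

public class ScanTemperature {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("temperature"))) {
            Scan scan = new Scan();
            scan.setCaching(100); // fetch rows from the server in batches of 100
            try (ResultScanner scanner = table.getScanner(scan)) {
                int count = 0;
                for (Result result : scanner) {
                    count++;
                    if (count <= 5) {
                        System.out.println(result); // print the first few rows
                    }
                }
                System.out.println("Total rows: " + count);
            }
        }
    }
}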
If the following exception occurs during job submission:
15/06/16 11:41:06 INFO mapreduce.Job: Job job_1434420992867_0003 failed with state FAILED due to: Application failed 2 times due to AM Container exited with exitCode: -1000
For more detailed output, check application tracking page: http://cdh1:8088/proxy/application_1434420992867_0003/ Then, click on links to logs of each attempt.
Diagnostics: Rename cannot overwrite non empty destination directory /data/yarn/nm/usercache/hdfs/filecache/16
java.io.IOException: Rename cannot overwrite non empty destination directory /data/yarn/nm/usercache/hdfs/filecache/16
	at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716)
	at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228)
	at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659)
	at org.apache.hadoop.fs.FileContext.rename(FileContext.java:909)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:364)
	at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
	at java.util.concurrent.FutureTask.run(FutureTask.java:262)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:745)
Failing this attempt. Failing the application.
15/06/16 11:41:06 INFO mapreduce.Job: Counters: 0
The fix is to delete the files under /data/yarn/nm/usercache/hdfs/filecache on the NodeManager machines (cdh2 and cdh3 in this cluster). See http://stackoverflow.com/questions/30857413/hadoop-complains-about-attempting-to-overwrite-nonempty-destination-directory