Fast Data Import into HBase with BulkLoad

Last updated: 2015-08-08 Source: Internet

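Apache HBase is a distributed, column-oriented, open-source database that gives us random, real-time access to big data. But how do we get data into HBase efficiently? HBase offers several ways to import data. The most direct are to use TableOutputFormat as the output of a MapReduce job, or to use the standard client API, but neither is very efficient.

Bulkload uses a MapReduce job to write table data in HBase's internal data format, then loads the generated StoreFiles directly into the cluster. Compared with going through the HBase API, importing data with Bulkload uses less CPU and network resources.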

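The Bulkload process consists of three main steps:

1. Extract the data from its source (usually text files or another database) and upload it to HDFS

This step is outside HBase's scope. Whatever the data source is, the data only needs to be on HDFS before the next step starts.

2. Prepare the data with a MapReduce job

This step needs a MapReduce job, and in most cases we have to write the Map function ourselves. The Reduce function is supplied by HBase, so we don't need to worry about it. The job must emit the rowkey (row key) as the output key and a KeyValue, Put, or Delete as the output value. It must use HFileOutputFormat2 to produce the HBase data files. For the import to be efficient, HFileOutputFormat2 has to be configured so that each output file fits within a single region. To achieve this, the job uses Hadoop's TotalOrderPartitioner class to partition the output according to the table's key ranges. The configureIncrementalLoad() method of HFileOutputFormat2 sets all of this up automatically.

3. Tell the RegionServers where the data is and load it

This is the simplest step. It usually means running LoadIncrementalHFiles (better known as the completebulkload tool) and passing it the location of the files on HDFS. It then loads the data into the appropriate regions through the RegionServers.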
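To make the partitioning in step 2 more concrete, here is a minimal sketch of the idea: given the sorted start keys of a table's regions, each row key is routed to the partition (and therefore the output file) of the region whose key range contains it. This is not Hadoop's TotalOrderPartitioner, only an illustration of the same principle. The class name RegionPartitionSketch, its methods, and the split points "H" and "R" are assumptions made up for this example. Row keys are compared lexicographically as unsigned bytes, which is how HBase orders them.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch only: routes row keys to regions by split point.
// It does not use or reproduce Hadoop's TotalOrderPartitioner.
class RegionPartitionSketch {

    // Encode a string row key as UTF-8 bytes.
    static byte[] bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Lexicographic comparison that treats each byte as unsigned.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int x = a[i] & 0xff;
            int y = b[i] & 0xff;
            if (x != y) {
                return x - y;
            }
        }
        return a.length - b.length;
    }

    // splitPoints holds the sorted start keys of regions 1..n-1
    // (region 0 implicitly starts at the empty key).
    // Returns the index of the region whose range contains rowKey.
    static int partitionFor(byte[] rowKey, List<byte[]> splitPoints) {
        int lo = 0;
        int hi = splitPoints.size();
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compareUnsigned(rowKey, splitPoints.get(mid)) >= 0) {
                lo = mid + 1; // rowKey is at or past this region's start key
            } else {
                hi = mid;
            }
        }
        return lo;
    }

    public static void main(String[] args) {
        // Hypothetical split points: three regions [..H), [H..R), [R..)
        List<byte[]> splits = Arrays.asList(bytes("H"), bytes("R"));
        for (String key : new String[] {"AQW00061705", "H", "USW00094910"}) {
            System.out.println(key + " -> region " + partitionFor(bytes(key), splits));
        }
    }
}
```

Because every key in one partition falls inside one region's range, each reducer can write files that belong to exactly one region, which is what lets the final load step hand files over without splitting them.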

[Figure: a simple, clear overview of the whole bulk load process. Image from "How-to: Use HBase Bulk Loading, and Why".]

Note: before running BulkLoad, create an empty table in HBase with the same name and the same structure as the one used in the program.

A Java implementation follows:

BulkLoadDriver.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.conf.Configured;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.RegionLocator;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.Tool;
import org.apache.hadoop.util.ToolRunner;

/**
 * Created by shaobo on 15-6-9.
 */
public class BulkLoadDriver extends Configured implements Tool {
    private static final String DATA_SEPERATOR = "\\s+";
    private static final String TABLE_NAME = "temperature"; // table name
    private static final String COLUMN_FAMILY_1 = "date"; // column family 1
    private static final String COLUMN_FAMILY_2 = "tempPerHour"; // column family 2

    public static void main(String[] args) {
        try {
            int response = ToolRunner.run(HBaseConfiguration.create(), new BulkLoadDriver(), args);
            if (response == 0) {
                System.out.println("Job is successfully completed...");
            } else {
                System.out.println("Job failed...");
            }
        } catch (Exception exception) {
            exception.printStackTrace();
        }
    }

    @Override
    public int run(String[] args) throws Exception {
        String outputPath = args[1];
        // Set the job parameters; the mapper reads them back in setup()
        Configuration configuration = getConf();
        configuration.set("data.seperator", DATA_SEPERATOR);
        configuration.set("hbase.table.name", TABLE_NAME);
        configuration.set("COLUMN_FAMILY_1", COLUMN_FAMILY_1);
        configuration.set("COLUMN_FAMILY_2", COLUMN_FAMILY_2);
        Job job = Job.getInstance(configuration, "Bulk Loading HBase Table::" + TABLE_NAME);
        job.setJarByClass(BulkLoadDriver.class);
        job.setInputFormatClass(TextInputFormat.class);
        job.setMapOutputKeyClass(ImmutableBytesWritable.class); // map output key class
        job.setMapOutputValueClass(Put.class); // map output value class
        job.setMapperClass(BulkLoadMapper.class); // Map function
        FileInputFormat.addInputPaths(job, args[0]); // input path
        FileSystem fs = FileSystem.get(configuration);
        Path output = new Path(outputPath);
        if (fs.exists(output)) {
            fs.delete(output, true); // delete the output path if it already exists
        }
        FileOutputFormat.setOutputPath(job, output); // output path
        TableName tableName = TableName.valueOf(TABLE_NAME);
        // The connection is only needed while the job is being configured, so close it afterwards
        try (Connection connection = ConnectionFactory.createConnection(configuration);
             Table table = connection.getTable(tableName);
             RegionLocator regionLocator = connection.getRegionLocator(tableName)) {
            // Sets the reducer, the partitioner, and the output format to match the table's regions
            HFileOutputFormat2.configureIncrementalLoad(job, table, regionLocator);
        }
        job.waitForCompletion(true);
        if (job.isSuccessful()) {
            HFileLoader.doBulkLoad(outputPath, TABLE_NAME); // load the generated HFiles
            return 0;
        } else {
            return 1;
        }
    }

}

BulkLoadMapper.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Created by shaobo on 15-6-9.
 */
public class BulkLoadMapper extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
    private String hbaseTable;
    private String dataSeperator;
    private String columnFamily1;
    private String columnFamily2;

    @Override
    public void setup(Context context) {
        Configuration configuration = context.getConfiguration(); // read the job parameters
        hbaseTable = configuration.get("hbase.table.name");
        dataSeperator = configuration.get("data.seperator");
        columnFamily1 = configuration.get("COLUMN_FAMILY_1");
        columnFamily2 = configuration.get("COLUMN_FAMILY_2");
    }

    @Override
    public void map(LongWritable key, Text value, Context context) {
        try {
            // Field 0 is the row key, fields 1 and 2 are month and day, the rest are hourly values
            String[] values = value.toString().split(dataSeperator);
            // Use Bytes.toBytes for the key as well, so key and Put share the same encoding
            byte[] row = Bytes.toBytes(values[0]);
            ImmutableBytesWritable rowKey = new ImmutableBytesWritable(row);
            Put put = new Put(row);
            put.addColumn(Bytes.toBytes(columnFamily1), Bytes.toBytes("month"), Bytes.toBytes(values[1]));
            put.addColumn(Bytes.toBytes(columnFamily1), Bytes.toBytes("day"), Bytes.toBytes(values[2]));
            for (int i = 3; i < values.length; ++i) {
                put.addColumn(Bytes.toBytes(columnFamily2), Bytes.toBytes("hour : " + i), Bytes.toBytes(values[i]));
            }
            context.write(rowKey, put);
        } catch (Exception exception) {
            exception.printStackTrace();
        }

    }

}
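To see what the mapper above produces for one input line, the following sketch repeats its parsing logic with plain Java collections instead of HBase classes. The class name TemperatureLineSketch, the "family:qualifier" key notation, and the sample line with its values are assumptions for illustration only. The trim() call is also an addition that the mapper does not have. Note that, as in the mapper, the hourly qualifiers use the field index, so the first one is "hour : 3".

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: mirrors the mapper's line parsing without HBase.
class TemperatureLineSketch {
    // Same separator the driver passes to the mapper: one or more whitespace characters.
    static final String SEPARATOR = "\\s+";

    // The first field of the line is used as the row key.
    static String rowKey(String line) {
        return line.trim().split(SEPARATOR)[0];
    }

    // Returns the cells of one row as "family:qualifier" -> value, in insertion order.
    static Map<String, String> cells(String line) {
        // trim() is an addition of this sketch, guarding against leading whitespace
        String[] values = line.trim().split(SEPARATOR);
        if (values.length < 3) {
            throw new IllegalArgumentException("expected at least row key, month and day: " + line);
        }
        Map<String, String> cells = new LinkedHashMap<>();
        cells.put("date:month", values[1]);
        cells.put("date:day", values[2]);
        // Remaining fields go to the second column family, one qualifier per field index
        for (int i = 3; i < values.length; ++i) {
            cells.put("tempPerHour:hour : " + i, values[i]);
        }
        return cells;
    }

    public static void main(String[] args) {
        String line = "AQW00061705 1 1 25.0 24.5"; // hypothetical sample line
        System.out.println(rowKey(line) + " -> " + cells(line));
    }
}
```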

HFileLoader.java

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

/**
 * Created by shaobo on 15-6-9.
 */
public class HFileLoader {
    public static void doBulkLoad(String pathToHFile, String tableName) {
        Configuration configuration = new Configuration();
        HBaseConfiguration.addHbaseResources(configuration);
        // try-with-resources closes the table handle once the load has finished
        try (HTable hTable = new HTable(configuration, tableName)) { // target table
            LoadIncrementalHFiles loadFfiles = new LoadIncrementalHFiles(configuration);
            loadFfiles.doBulkLoad(new Path(pathToHFile), hTable); // load the HFiles into the table
            System.out.println("Bulk Load Completed..");
        } catch (Exception exception) {
            exception.printStackTrace();
        }

    }

}

Compile and package the program, then submit it to Hadoop:

HADOOP_CLASSPATH=$(hbase mapredcp):/path/to/hbase/conf hadoop jar BulkLoad.jar inputpath outputpath

For the usage of the command above, see section 44, "HBase, MapReduce, and the CLASSPATH", of the HBase Reference Guide.

Job output:

15/06/14 14:31:07 INFO mapreduce.HFileOutputFormat2: Looking up current regions for table temperature (table name)
15/06/14 14:31:07 INFO mapreduce.HFileOutputFormat2: Configuring 1 reduce partitions to match current region count
15/06/14 14:31:07 INFO mapreduce.HFileOutputFormat2: Writing partition information to /home/shaobo/hadoop/tmp/partitions_5d464f1e-d412-4dbe-bb98-367f8431bdc9
15/06/14 14:31:07 INFO zlib.ZlibFactory: Successfully loaded & initialized native-zlib library
15/06/14 14:31:07 INFO compress.CodecPool: Got brand-new compressor [.deflate]
15/06/14 14:31:08 INFO mapreduce.HFileOutputFormat2: Incremental table temperature (table name) output configured.
15/06/14 14:31:08 INFO client.RMProxy: Connecting to ResourceManager at localhost/127.0.0.1:8032
15/06/14 14:31:15 INFO input.FileInputFormat: Total input paths to process : 2
15/06/14 14:31:15 INFO mapreduce.JobSubmitter: number of splits:2
15/06/14 14:31:16 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1434262360688_0002
15/06/14 14:31:17 INFO impl.YarnClientImpl: Submitted application application_1434262360688_0002
15/06/14 14:31:17 INFO mapreduce.Job: The url to track the job: http://shaobo-ThinkPad-E420:8088/proxy/application_1434262360688_0002/
15/06/14 14:31:17 INFO mapreduce.Job: Running job: job_1434262360688_0002
15/06/14 14:31:28 INFO mapreduce.Job: Job job_1434262360688_0002 running in uber mode : false
15/06/14 14:31:28 INFO mapreduce.Job: map 0% reduce 0%
15/06/14 14:32:24 INFO mapreduce.Job: map 49% reduce 0%
15/06/14 14:32:37 INFO mapreduce.Job: map 67% reduce 0%
15/06/14 14:32:43 INFO mapreduce.Job: map 100% reduce 0%
15/06/14 14:33:39 INFO mapreduce.Job: map 100% reduce 67%
15/06/14 14:33:42 INFO mapreduce.Job: map 100% reduce 70%
15/06/14 14:33:45 INFO mapreduce.Job: map 100% reduce 88%
15/06/14 14:33:48 INFO mapreduce.Job: map 100% reduce 100%
15/06/14 14:33:52 INFO mapreduce.Job: Job job_1434262360688_0002 completed successfully
...
...
...
15/06/14 14:34:02 WARN mapreduce.LoadIncrementalHFiles: Skipping non-directory hdfs://localhost:9000/user/output/_SUCCESS
15/06/14 14:34:03 INFO hfile.CacheConfig: CacheConfig:disabled
15/06/14 14:34:03 INFO hfile.CacheConfig: CacheConfig:disabled
15/06/14 14:34:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://localhost:9000/user/output/date/c64cd2524fba48738bab26630d550b61 first=AQW00061705 last=USW00094910
15/06/14 14:34:07 INFO mapreduce.LoadIncrementalHFiles: Trying to load hfile=hdfs://localhost:9000/user/output/tempPerHour/43af29456913444795a820544691eb3d first=AQW00061705 last=USW00094910
Bulk Load Completed..
Job is successfully completed...

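The third step of the BulkLoad process can also be run from the command line after the MapReduce job has generated the HBase data files. It does not have to be written into the same program as the MapReduce job.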

$ hadoop jar hbase-server-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] outputpath tablename

If submitting the job produces the following exception:

15/06/16 11:41:06 INFO mapreduce.Job: Job job_1434420992867_0003 failed with state FAILED due to: Application application_1434420992867_0003 failed 2 times due to AM Container for appattempt_1434420992867_0003_000002 exited with exitCode: -1000
For more detailed output, check application tracking page:http://cdh1:8088/proxy/application_1434420992867_0003/Then, click on links to logs of each attempt.
Diagnostics: Rename cannot overwrite non empty destination directory /data/yarn/nm/usercache/hdfs/filecache/16
java.io.IOException: Rename cannot overwrite non empty destination directory /data/yarn/nm/usercache/hdfs/filecache/16
at org.apache.hadoop.fs.AbstractFileSystem.renameInternal(AbstractFileSystem.java:716)
at org.apache.hadoop.fs.FilterFs.renameInternal(FilterFs.java:228)
at org.apache.hadoop.fs.AbstractFileSystem.rename(AbstractFileSystem.java:659)
at org.apache.hadoop.fs.FileContext.rename(FileContext.java:909)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:364)
at org.apache.hadoop.yarn.util.FSDownload.call(FSDownload.java:60)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:471)
at java.util.concurrent.FutureTask.run(FutureTask.java:262)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)

Failing this attempt. Failing the application.
15/06/16 11:41:06 INFO mapreduce.Job: Counters: 0

Deleting the files under /data/yarn/nm/usercache/hdfs/filecache on the cdh2 and cdh3 machines resolves it. See http://stackoverflow.com/questions/30857413/hadoop-complains-about-attempting-to-overwrite-nonempty-destination-directory
