ImportTsv: the HBase data import tool


I. Overview

HBase officially provides two MapReduce-based batch data import tools: bulk load and ImportTsv. You can refer to another article for bulk load; this one focuses on ImportTsv.

HBase users usually write data through the HBase client API. However, importing a large volume of data in one shot through that path can tie up a large share of RegionServer resources and degrade queries against other tables hosted on the same RegionServers. This article therefore parses the ImportTsv data import tool at the source-code level to explore how to load data into HBase efficiently.
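For reference, a minimal sketch of this usual client-side write path is shown below, against the HBase 0.98 client API used throughout this article (the table name "t1", column family "f", and the row/qualifier/value literals are placeholders for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    // "t1" and column family "f" are placeholder names.
    HTable table = new HTable(conf, "t1");
    try {
      Put put = new Put(Bytes.toBytes("row1"));
      put.add(Bytes.toBytes("f"), Bytes.toBytes("q1"), Bytes.toBytes("v1"));
      table.put(put); // by default this is one RPC per Put
    } finally {
      table.close();
    }
  }
}

Every such Put lands on a RegionServer, which is exactly why a massive one-off import through this path can crowd out other workloads.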

II. Introduction to ImportTsv

ImportTsv is a command-line tool shipped with HBase. With a single command it imports data files stored on HDFS that use a custom delimiter (tab by default) into an HBase table, which makes it very useful for loading large amounts of data. It supports two ways of writing into a table (a concrete input example follows the two descriptions):

The first writes the data into the table through TableOutputFormat, i.e. via Put calls;

The second generates files in HFile format; a follow-up command, completebulkload, then moves those files into the HBase table's directory, making the data available to client queries.
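To make the input format concrete, here is a hypothetical tab-separated file with two rows, together with a column specification that could map it onto a table (HBASE_ROW_KEY marks the field used as the row key; the family and qualifier names f:c1 and f:c2 are made-up examples):

row1	value1	value2
row2	value3	value4

Such a file could be imported with a mapping like -Dimporttsv.columns=HBASE_ROW_KEY,f:c1,f:c2, as in the commands of section IV below.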

III. Source code parsing

The analysis below is based on HBase 0.98.1 as shipped in CDH5. The ImportTsv entry class is org.apache.hadoop.hbase.mapreduce.ImportTsv.

String hfileOutPath = conf.get(BULK_OUTPUT_CONF_KEY);
String[] columns = conf.getStrings(COLUMNS_CONF_KEY);
if (hfileOutPath != null) {
  if (!admin.tableExists(tableName)) {
    LOG.warn(format("Table '%s' does not exist.", tableName));
    // TODO: this is backwards. Instead of depending on the existence of a table,
    // create a sane splits file for HFileOutputFormat based on data sampling.
    createTable(admin, tableName, columns);
  }
  HTable table = new HTable(conf, tableName);
  job.setReducerClass(PutSortReducer.class);
  Path outputDir = new Path(hfileOutPath);
  FileOutputFormat.setOutputPath(job, outputDir);
  job.setMapOutputKeyClass(ImmutableBytesWritable.class);
  if (mapperClass.equals(TsvImporterTextMapper.class)) {
    job.setMapOutputValueClass(Text.class);
    job.setReducerClass(TextSortReducer.class);
  } else {
    job.setMapOutputValueClass(Put.class);
    job.setCombinerClass(PutCombiner.class);
  }
  HFileOutputFormat.configureIncrementalLoad(job, table);
} else {
  if (mapperClass.equals(TsvImporterTextMapper.class)) {
    usage(TsvImporterTextMapper.class.toString()
        + " should not be used for non bulkloading case. use "
        + TsvImporterMapper.class.toString()
        + " or custom mapper whose value type is Put.");
    System.exit(-1);
  }
  // No reducers. Just write straight to table. Call initTableReducerJob
  // to set up the TableOutputFormat.
  TableMapReduceUtil.initTableReducerJob(tableName, null, job);
  job.setNumReduceTasks(0);
}

This snippet is taken from the ImportTsv.createSubmittableJob method. Whether the BULK_OUTPUT_CONF_KEY parameter (importtsv.bulk.output) is set directly determines how the ImportTsv MapReduce job writes into HBase.

If it is non-null and the user has not supplied a custom mapper implementation (via the importtsv.mapper.class parameter), PutSortReducer is used to sort the Put objects. If each row of the input contains many columns, this sort can consume a large amount of Reducer memory.
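The heart of PutSortReducer can be condensed into the sketch below (simplified from the 0.98 source; the real class also spills its output in multiple rounds once a configurable memory threshold is exceeded, which is omitted here). It makes the memory concern visible: every KeyValue of a row is buffered and sorted in memory before being emitted.

import java.io.IOException;
import java.util.List;
import java.util.TreeSet;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.KeyValueUtil;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.mapreduce.Reducer;

public class PutSortSketch
    extends Reducer<ImmutableBytesWritable, Put, ImmutableBytesWritable, KeyValue> {
  @Override
  protected void reduce(ImmutableBytesWritable row, Iterable<Put> puts, Context context)
      throws IOException, InterruptedException {
    // All KeyValues of one row are collected into an in-memory sorted set;
    // a row with very many columns therefore inflates Reducer memory.
    TreeSet<KeyValue> sorted = new TreeSet<KeyValue>(KeyValue.COMPARATOR);
    for (Put put : puts) {
      for (List<Cell> cells : put.getFamilyCellMap().values()) {
        for (Cell cell : cells) {
          sorted.add(KeyValueUtil.ensureKeyValue(cell));
        }
      }
    }
    // HFileOutputFormat requires KeyValues in sorted order.
    for (KeyValue kv : sorted) {
      context.write(row, kv);
    }
  }
}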

Configuration conf = job.getConfiguration();
HBaseConfiguration.merge(conf, HBaseConfiguration.create(conf));
job.setOutputFormatClass(TableOutputFormat.class);

If it is null, TableMapReduceUtil.initTableReducerJob is called to set up TableOutputFormat as the job's output (the snippet above is from that method). In this mode no Reducer is required: the mapper's OutputFormat calls the Put API to submit data to the RegionServers in batches (equivalent to executing HBase Puts in parallel).
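In other words, the non-bulk path behaves much like the hand-rolled batched Put client sketched below (an approximation for illustration: the table name "t1", family "f", and buffer size are assumptions; in 0.98, TableOutputFormat similarly wraps an HTable with auto-flush disabled so Puts are shipped in buffered batches):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class BatchedPuts {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "t1");
    try {
      table.setAutoFlush(false, true);           // buffer Puts client-side
      table.setWriteBufferSize(8 * 1024 * 1024); // flush roughly every 8 MB
      for (int i = 0; i < 100000; i++) {
        Put put = new Put(Bytes.toBytes("row" + i));
        put.add(Bytes.toBytes("f"), Bytes.toBytes("q"), Bytes.toBytes("v" + i));
        table.put(put); // queued locally, sent to RegionServers in batches
      }
      table.flushCommits(); // push any remaining buffered Puts
    } finally {
      table.close();
    }
  }
}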

IV. Practice

1. Upload data with the Put API of TableOutputFormat, without bulk-loading

$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c <tablename> <hdfs-inputdir>

2. Use bulk-loading to generate StoreFiles (HFile)

Step 1. Generate Hfile

$ bin/hbase org.apache.hadoop.hbase.mapreduce.ImportTsv -Dimporttsv.columns=a,b,c -Dimporttsv.bulk.output=hdfs://storefile-outputdir <tablename> <hdfs-data-inputdir>

Step 2. Complete Import

$ bin/hbase org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles <hdfs://storefileoutput> <tablename>
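If the load needs to be driven from Java rather than the shell, the same completebulkload step can be performed with the LoadIncrementalHFiles API, roughly as follows (the output path and table name are placeholders matching Step 1):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    HTable table = new HTable(conf, "mytable"); // placeholder table name
    try {
      LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
      // Moves the generated HFiles into the table's region directories.
      loader.doBulkLoad(new Path("hdfs://storefile-outputdir"), table);
    } finally {
      table.close();
    }
  }
}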

V. Summary

When using ImportTsv, pay close attention to the importtsv.bulk.output parameter. The bulk output mode is generally friendlier to the RegionServers: loading data this way consumes almost none of their computing resources, because the HFile files are simply moved within HDFS, after which the HMaster is notified to bring the affected region(s) of the table online on the RegionServers.

