Bulk Load: HBase Data Import Best Practices


I. Overview

HBase itself provides several ways to import data, two of which are most common:

1. Use the TableOutputFormat provided by HBase to write data into HBase from a MapReduce job.

2. Use the HBase native client API to issue puts directly.

Both approaches require frequent communication with the RegionServers that store the data, so loading a large volume of data in one shot is resource-intensive and far from the most efficient option. Anyone familiar with HBase internals knows that HBase stores its data on HDFS in the HFile file format. A more efficient and convenient approach is to generate HFiles directly using the "bulk loading" method, for which HBase provides the HFileOutputFormat class.
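For contrast, here is a minimal sketch of the client-API approach (way 2), using the pre-1.x-era HTable API that the example in section IV also uses; the table name, column family, and values are borrowed from that example for illustration. Every Put here is an RPC handled by a RegionServer, which is exactly the overhead bulk loading avoids:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

public class PutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Table and column family names taken from the example below
        HTable table = new HTable(conf, "hfiletable");
        try {
            Put put = new Put(Bytes.toBytes("key1"));
            // Each Put travels through the RegionServer hosting the row
            put.add(Bytes.toBytes("fm1"), Bytes.toBytes("col1"), Bytes.toBytes("value1"));
            table.put(put);
        } finally {
            table.close();
        }
    }
}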


II. Bulk Load Fundamentals

Bulk load processing consists of two main steps:

1. Prepare the data files

The first step of a bulk load is to run a MapReduce job that uses HFileOutputFormat to produce HBase's own data file format: StoreFile. HFileOutputFormat arranges the output so that each HFile fits within a single region: it uses the TotalOrderPartitioner class to partition the map output into disjoint key ranges, each corresponding to a region of the target HBase table.

2. Import into the HBase table

The second step uses the completebulkload tool to hand the files produced in the first step to the RegionServers responsible for the corresponding regions. The tool moves each file into the region's storage directory on HDFS; once this completes, the data becomes visible to clients.

If region boundaries change between the time the bulk load output is prepared and the time the import completes, the completebulkload tool automatically splits the data files along the new boundaries, but this re-splitting is inefficient and not a best practice. Users should therefore minimize the delay between generating the files and importing them into the cluster, especially when other clients are simultaneously loading data into the same table by other means.


Attention:

The completebulkload step simply imports the output of ImportTsv or HFileOutputFormat into a table, using a command similar to the following:

hadoop jar hbase-VERSION.jar completebulkload [-c /path/to/hbase/config/hbase-site.xml] /user/todd/myoutput mytable

The command completes quickly, importing the HFiles under /user/todd/myoutput into the table mytable. Note: if the target table does not exist, the tool creates it automatically.


III. Notes on Generating HFiles

1. In the final output, whether it comes from a map or a reduce phase, the key and value types must be <ImmutableBytesWritable, KeyValue> or <ImmutableBytesWritable, Put>.

2. When the final output value type is KeyValue or Put, the corresponding sort reducer is KeyValueSortReducer or PutSortReducer, respectively.

3. In the MapReduce job, call job.setOutputFormatClass(HFileOutputFormat.class). Note that HFileOutputFormat is only suited to organizing a single column family into HFiles at a time.

4. In the MapReduce job, calling HFileOutputFormat.configureIncrementalLoad(job, table) configures the job automatically. SimpleTotalOrderPartitioner first sorts the keys and then divides them among the reducers, ensuring that each reducer receives a distinct min-to-max key range with no overlap between reducers. This is required because keys within a region must be in strictly sorted order when the data is loaded into HBase. (The full example in section IV below uses the Put variant.)

5. The HFiles ultimately produced by the MapReduce job are stored on HDFS, with one subdirectory per column family under the output path. Loading the HFiles into HBase is equivalent to moving them into HBase's regions, so afterwards the column-family subdirectories under the output path are left empty.


IV. Example

1. Create a table

create 'hfiletable', 'fm1', 'fm2'

2. Prepare the raw data

The input file is tab-separated, one cell per line (rowkey, family:qualifier, value; this is the format the mapper below parses):

key1	fm1:col1	value1
key1	fm1:col2	value2
key1	fm2:col1	value3
key4	fm1:col1	value4

3. MapReduce job to generate HFiles and load them into HBase

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FsShell;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

import java.io.IOException;

public class BulkLoad {

    static Logger logger = LoggerFactory.getLogger(BulkLoad.class);

    public static class BulkLoadMap
            extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Input lines are tab-separated: rowkey <TAB> family:qualifier <TAB> value
            String[] valueStrSplit = value.toString().split("\t");
            String hKey = valueStrSplit[0];
            String family = valueStrSplit[1].split(":")[0];
            String column = valueStrSplit[1].split(":")[1];
            String hValue = valueStrSplit[2];

            final byte[] rowKey = Bytes.toBytes(hKey);
            final ImmutableBytesWritable hKeyWritable = new ImmutableBytesWritable(rowKey);
            Put hPut = new Put(rowKey);
            byte[] cell = Bytes.toBytes(hValue);
            hPut.add(Bytes.toBytes(family), Bytes.toBytes(column), cell);
            context.write(hKeyWritable, hPut);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        String inputPath = args[0];
        String outputPath = args[1];
        HTable hTable = null;
        try {
            Job job = Job.getInstance(conf, "ExampleRead");
            job.setJarByClass(BulkLoad.class);
            job.setMapperClass(BulkLoad.BulkLoadMap.class);
            job.setMapOutputKeyClass(ImmutableBytesWritable.class);
            job.setMapOutputValueClass(Put.class);

            // Speculation
            job.setSpeculativeExecution(false);
            job.setReduceSpeculativeExecution(false);

            // In/out format
            // job.setInputFormatClass(WholeFileInputFormat.class);
            job.setOutputFormatClass(HFileOutputFormat2.class);

            FileInputFormat.setInputPaths(job, inputPath);
            FileOutputFormat.setOutputPath(job, new Path(outputPath));

            hTable = new HTable(conf, args[2]);
            // Auto-configures partitioner, sort reducer, and HFile output for the table's regions
            HFileOutputFormat2.configureIncrementalLoad(job, hTable);

            if (job.waitForCompletion(true)) {
                FsShell shell = new FsShell(conf);
                try {
                    shell.run(new String[]{"-chmod", "-R", "777", args[1]});
                } catch (Exception e) {
                    logger.error("Couldn't change the file permissions", e);
                    throw new IOException(e);
                }
                // Load generated HFiles into the table
                LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
                loader.doBulkLoad(new Path(outputPath), hTable);
            } else {
                System.out.println("Loading failed.");
                System.exit(1);
            }
        } catch (IllegalArgumentException e) {
            // No region server, or the table does not exist
            e.printStackTrace();
        } finally {
            if (hTable != null) {
                hTable.close();
            }
        }
    }
}
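The job takes three arguments: the raw-data input path, the HFile output path, and the target table name. A typical invocation (the jar name and HDFS paths here are placeholders) looks like:

hadoop jar bulkload.jar BulkLoad /user/hadoop/input /user/hadoop/hfile_output hfiletable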

4. View data

hbase(main):003:0> scan 'hfiletable'
ROW          COLUMN+CELL
 key2        column=fm1:col1, timestamp=1437794332921, value=value1
 key2        column=fm1:col2, timestamp=1437794332921, value=value2
 key2        column=fm2:col1, timestamp=1437794332921, value=value3
 key3        column=fm1:col1, timestamp=1437794332921, value=value4
2 row(s) in 0.1910 seconds

V. Summary

Although the ImportTsv tool covers most scenarios, users sometimes want to generate data programmatically or import data in other formats. For example, ImportTsv requires the column layout of each row to be fixed before the import; if the columns of our data depend on the data content itself, ImportTsv cannot meet the need and the tool must be adapted. See the Javadoc of ImportTsv.java and HFileOutputFormat for details.
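For reference, ImportTsv can itself pre-generate HFiles for completebulkload via its bulk-output option. A typical invocation looks roughly like the following; the column spec and paths are placeholders:

hadoop jar hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,fm1:col1 -Dimporttsv.bulk.output=/user/todd/myoutput mytable /user/todd/input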

The completebulkload step can also be driven programmatically; see the LoadIncrementalHFiles class.
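A minimal sketch of that programmatic route, using the same pre-1.x-era API as the example above (the HFile directory and table name are the placeholders from section II):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class CompleteBulkLoadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Directory of generated HFiles, one subdirectory per column family
        Path hfileDir = new Path("/user/todd/myoutput");
        HTable table = new HTable(conf, "mytable");
        try {
            // Moves the HFiles into the table's regions, same as the completebulkload tool
            LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
            loader.doBulkLoad(hfileDir, table);
        } finally {
            table.close();
        }
    }
}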






