How to Load Large Data into HBase with MapReduce Jobs


I. Overview

There are many ways to load data into an HBase table. The simplest is to invoke TableOutputFormat from a MapReduce job, or to write the data through the client API. But these are not the most efficient approaches.

This document describes how to bulk load large data into HBase: a MapReduce job writes the data out as files in HBase's internal organization format, and those files are then loaded into a running cluster. (In other words, HFiles are generated first and then loaded into HBase.)

II. Steps for Loading Large Data

Bulk loading consists of two steps:

1. Preparing the data with a MapReduce job

First, a MapReduce job uses HFileOutputFormat to generate files in HBase's data file format. Files in this format follow HBase's internal file organization, so they can be written into the cluster very efficiently.

For this method to be efficient, HFileOutputFormat must be configured so that each output HFile fits within a single region. To achieve this, the MapReduce job uses Hadoop's TotalOrderPartitioner class, partitioning the output so that each partition corresponds to one region of the table.

Conveniently, HFileOutputFormat provides a method, configureIncrementalLoad(), which automatically sets up a TotalOrderPartitioner based on the table's current region boundaries.
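As a minimal sketch, a preparation job might be wired up as follows. The mapper, table name, column family, and paths are placeholders, and the API shown is the older org.apache.hadoop.hbase.mapreduce.HFileOutputFormat of this article's era; newer HBase releases use HFileOutputFormat2 instead.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadPrepare {

  // Hypothetical mapper: turns one TSV line ("rowkey<TAB>value") into a
  // KeyValue in column family "d", qualifier "c1".
  static class TsvToKeyValueMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      byte[] row = Bytes.toBytes(fields[0]);
      KeyValue kv = new KeyValue(row, Bytes.toBytes("d"),
          Bytes.toBytes("c1"), Bytes.toBytes(fields[1]));
      context.write(new ImmutableBytesWritable(row), kv);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "bulk-load-prepare");
    job.setJarByClass(BulkLoadPrepare.class);

    // The mapper must emit the row key plus a KeyValue for each cell.
    job.setMapperClass(TsvToKeyValueMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // configureIncrementalLoad() sets the output format to HFileOutputFormat,
    // installs a TotalOrderPartitioner keyed on the table's current region
    // boundaries, and chooses the reducer that sorts the KeyValues.
    HTable table = new HTable(conf, "mytable");
    HFileOutputFormat.configureIncrementalLoad(job, table);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}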

2. Data loading process

Once the data has been prepared with HFileOutputFormat, it is loaded into the cluster using a command-line tool. The tool traverses the prepared data files, determines the region each file belongs to, connects to the corresponding region server, and moves each HFile into that region's storage directory, making the data available to users.

If the region boundaries change between the time the data is prepared and the time it is loaded, HBase automatically splits the data files to fit the new boundaries. This splitting is inefficient, especially while other clients are writing data at the same time. So take care to minimize the time between creating the data files and loading them into the cluster.

3. Using importtsv to prepare data for bulk loading

HBase ships with an importtsv command-line tool, invoked as hadoop jar /path/to/hbase-VERSION.jar importtsv. Running it with no arguments prints the following help:

Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>

Imports the given input directory of TSV data into the specified table.

The column names of the TSV data must be specified using the -Dimporttsv.columns option.

This option takes the form of comma-separated column names, where each column name is either a simple column family, or a columnfamily:qualifier.

The special column name HBASE_ROW_KEY is used to designate that this column should be used as the row key for each imported record.

You must specify exactly one column to be the row key.

In order to prepare data for a bulk data load, pass the option:

-Dimporttsv.bulk.output=/path/for/output

Other options that may be specified with -D include:

-Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
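For example, to prepare HFiles from TSV files under an input directory, a bulk-output run might look like the following (the table name mytable and the output path match the completebulkload example below; the column family d, qualifier c1, and the input path /user/todd/input are illustrative):

$ hadoop jar hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1 -Dimporttsv.bulk.output=/user/todd/myoutput mytable /user/todd/input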

4. Using completebulkload to load the data

After preparing the data with importtsv, use completebulkload to load it into the running cluster.

completebulkload takes the same output path that was passed to importtsv, plus the name of the table. For example:

$ hadoop jar hbase-VERSION.jar completebulkload /user/todd/myoutput mytable

This command executes very quickly; once it finishes, the new data is visible in the cluster.

5. Advanced Usage

Although the importtsv command is useful, in many cases users need to write their own code, or import data in other formats.

To do this, study the ImportTsv.java source code and read the Javadoc for HFileOutputFormat.

To perform the bulk load itself from code, look at the LoadIncrementalHFiles class.
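As a sketch of the programmatic equivalent of the completebulkload command above, assuming the older org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles API of this article's era (the table name and output path are the same illustrative placeholders used earlier):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadComplete {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    HTable table = new HTable(conf, "mytable");
    // Moves each prepared HFile under the output path into the region
    // that owns its key range.
    loader.doBulkLoad(new Path("/user/todd/myoutput"), table);
    table.close();
  }
}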
