How to Load Large Data into HBase with MapReduce Jobs


I. Overview

There are many ways to load data into an HBase table. The simplest is to invoke TableOutputFormat from a MapReduce job, or to write the data through the client API. But these are not the most efficient approaches.

This document describes how to bulk load large data into HBase: a MapReduce job writes the data out as files in HBase's internal organization format, and those files are then loaded into a running cluster. (In other words, HFiles are generated first and then loaded into HBase.)

II. Steps for Loading Large Data

Bulk loading consists of two steps:

1. Preparing the data with a MapReduce job

First, a MapReduce job uses HFileOutputFormat to generate files in HBase's data file format. Files in this format follow HBase's internal file organization, so they can be written into the cluster very efficiently.

For this method to be efficient, HFileOutputFormat must be configured so that each output HFile fits within a single region. To achieve this, the MapReduce job uses Hadoop's TotalOrderPartitioner class, partitioning the output so that each partition corresponds to one region of the table.

Conveniently, HFileOutputFormat provides a method, configureIncrementalLoad(), which automatically sets up a TotalOrderPartitioner based on the table's current region boundaries.
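As a minimal sketch, a preparation job might be wired up as follows. The mapper, table name, column family, and paths are placeholders, and the API shown is the older org.apache.hadoop.hbase.mapreduce.HFileOutputFormat of this article's era; newer HBase releases use HFileOutputFormat2 instead.

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.KeyValue;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat;
import org.apache.hadoop.hbase.util.Bytes;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BulkLoadPrepare {

  // Hypothetical mapper: turns one TSV line ("rowkey<TAB>value") into a
  // KeyValue in column family "d", qualifier "c1".
  static class TsvToKeyValueMapper
      extends Mapper<LongWritable, Text, ImmutableBytesWritable, KeyValue> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
        throws IOException, InterruptedException {
      String[] fields = line.toString().split("\t");
      byte[] row = Bytes.toBytes(fields[0]);
      KeyValue kv = new KeyValue(row, Bytes.toBytes("d"),
          Bytes.toBytes("c1"), Bytes.toBytes(fields[1]));
      context.write(new ImmutableBytesWritable(row), kv);
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = new Job(conf, "bulk-load-prepare");
    job.setJarByClass(BulkLoadPrepare.class);

    // The mapper must emit the row key plus a KeyValue for each cell.
    job.setMapperClass(TsvToKeyValueMapper.class);
    job.setMapOutputKeyClass(ImmutableBytesWritable.class);
    job.setMapOutputValueClass(KeyValue.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // configureIncrementalLoad() sets the output format to HFileOutputFormat,
    // installs a TotalOrderPartitioner keyed on the table's current region
    // boundaries, and chooses the reducer that sorts the KeyValues.
    HTable table = new HTable(conf, "mytable");
    HFileOutputFormat.configureIncrementalLoad(job, table);

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}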

2. Data loading process

Once the data has been prepared with HFileOutputFormat, it is loaded into the cluster using a command-line tool. The tool traverses the prepared data files, determines the region each file belongs to, connects to the corresponding region server, and moves each HFile into that region's storage directory, making the data available to users.

If the region boundaries change between the time the data is prepared and the time it is loaded, HBase automatically splits the data files to fit the new boundaries. This splitting is inefficient, especially while other clients are writing data at the same time. So take care to minimize the time between creating the data files and loading them into the cluster.

3. Using importtsv to prepare data for bulk loading

HBase ships with an importtsv command-line tool, invoked as hadoop jar /path/to/hbase-VERSION.jar importtsv. Running it with no arguments prints the following help:

Usage: importtsv -Dimporttsv.columns=a,b,c <tablename> <inputdir>

Imports the given input directory of TSV data into the specified table.

The column names of the TSV data must be specified using the -Dimporttsv.columns option.

This option takes the form of comma-separated column names, where each column name is either a simple column family, or a columnfamily:qualifier.

The special column name HBASE_ROW_KEY is used to designate that this column should be used as the row key for each imported record.

You must specify exactly one column to be the row key.

In order to prepare data for a bulk data load, pass the option:

-Dimporttsv.bulk.output=/path/for/output

Other options that may be specified with -D include:

-Dimporttsv.skip.bad.lines=false - fail if encountering an invalid line
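For example, to prepare HFiles from TSV files under an input directory, a bulk-output run might look like the following (the table name mytable and the output path match the completebulkload example below; the column family d, qualifier c1, and the input path /user/todd/input are illustrative):

$ hadoop jar hbase-VERSION.jar importtsv -Dimporttsv.columns=HBASE_ROW_KEY,d:c1 -Dimporttsv.bulk.output=/user/todd/myoutput mytable /user/todd/input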

4. Using completebulkload to load the data

After preparing the data with importtsv, use completebulkload to load it into the running cluster.

completebulkload takes the same output path that was passed to importtsv, plus the name of the table. For example:

$ hadoop jar hbase-VERSION.jar completebulkload /user/todd/myoutput mytable

This command executes very quickly; once it finishes, the new data is visible in the cluster.

5. Advanced Usage

Although the importtsv command is useful, in many cases users need to write their own code, or import data in other formats.

To do this, study the ImportTsv.java source code and read the Javadoc for HFileOutputFormat.

To perform the bulk load itself from code, look at the LoadIncrementalHFiles class.
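As a sketch of the programmatic equivalent of the completebulkload command above, assuming the older org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles API of this article's era (the table name and output path are the same illustrative placeholders used earlier):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.mapreduce.LoadIncrementalHFiles;

public class BulkLoadComplete {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    LoadIncrementalHFiles loader = new LoadIncrementalHFiles(conf);
    HTable table = new HTable(conf, "mytable");
    // Moves each prepared HFile under the output path into the region
    // that owns its key range.
    loader.doBulkLoad(new Path("/user/todd/myoutput"), table);
    table.close();
  }
}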
