HBase Incremental Backup

APIs involved in this article:

Hadoop/HDFS: http://hadoop.apache.org/common/docs/current/api/

HBase: http://hbase.apache.org/apidocs/overview-summary.html


I. Overview

The Export and Import tools that ship with HBase are used.

Export: org.apache.hadoop.hbase.mapreduce.Export

Import: org.apache.hadoop.hbase.mapreduce.Import

The package these two classes live in shows that Export and Import are essentially MapReduce jobs.

The Javadoc of the two tools states it plainly:

Export an HBase table. Writes content to sequence files up in HDFS. Use Import to read it back in again.

That is, Export dumps an HBase table as HDFS sequence files. But Export by itself is just an export tool; how do we build a backup function on top of it?

II. Functional Experiment

The test process involved a lot of data; only the important conclusions are given here:

1. Export works on one table at a time. To back up a full database, you have to run it n times, once per table.

2. The shell invocation of Export takes roughly this form:

./hbase org.apache.hadoop.hbase.mapreduce.Export <table name> <backup path> [<versions> [<start timestamp> [<end timestamp>]]]

i.e. the tool's usage string: Export [-D <property=value>]* <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]

The bracketed arguments are optional, for example:

./hbase org.apache.hadoop.hbase.mapreduce.Export 'contenttbl' /home/codeevoship/contentBackup20120920 1 123456789

This backs up the contenttbl table to the /home/codeevoship/contentBackup20120920 directory (the last-level directory must be created by Export itself, so it must not already exist). The number of versions is 1, and the backup covers every record that has had a put operation from timestamp 123456789 up to the current time.

Note: why only records with put operations? During the backup, all records whose timestamps are greater than or equal to 123456789 are scanned and exported. If a record was deleted, it is already gone from the table and cannot be picked up by the scan.

3. If no timestamps are specified, the complete current contents of the table are backed up.

III. Implementation Details

1. How are deletes handled with incremental backups?

Because a timestamp-based Export can only capture Put operations, if I delete an existing record within the time range covered by a backup (an incremental package), that deleted record will reappear in my table once the backups are restored.

Therefore, I replace all delete operations with Put operations:

A. An "invalid" flag is added to each row of data. To delete a record, a Put writes the flag to 1.

B. For a single-row query, after the record is retrieved by rowKey, check the flag to determine whether the record has been "deleted" and hence whether to return it. For multi-row queries (scan), use a column value filter to drop all records whose flag is set to 1. (See my previous post on HBase conditional queries.) A sketch of both operations follows.
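
A minimal sketch of this soft-delete scheme, assuming the 0.9x-era HBase client API; the column layout (a meta:deleted flag) and all names are illustrative, not from the original article:

import java.io.IOException;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.ResultScanner;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.filter.CompareFilter.CompareOp;
import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
import org.apache.hadoop.hbase.util.Bytes;

public class SoftDelete {
    private static final byte[] FAMILY = Bytes.toBytes("meta");
    private static final byte[] DELETED = Bytes.toBytes("deleted");
    private static final byte[] ONE = Bytes.toBytes("1");

    // "Delete" a row by writing the flag with a Put, so the operation
    // carries a timestamp and is picked up by a timestamp-based Export.
    public static void softDelete(HTable table, byte[] rowKey) throws IOException {
        Put put = new Put(rowKey);
        put.add(FAMILY, DELETED, ONE);
        table.put(put);
    }

    // Scan that hides soft-deleted rows with a column value filter.
    public static ResultScanner scanLive(HTable table) throws IOException {
        SingleColumnValueFilter filter =
                new SingleColumnValueFilter(FAMILY, DELETED, CompareOp.NOT_EQUAL, ONE);
        filter.setFilterIfMissing(false); // keep rows that never received the flag column
        Scan scan = new Scan();
        scan.setFilter(filter);
        return table.getScanner(scan);
    }
}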

2. Will data that is added while the backup is running affect the accuracy of the backup content?

No. You can pass an end timestamp less than or equal to the current time (the last optional argument in the usage above) to pin down exactly the range of data to be backed up.

3. How do I back up data to another machine?

A. Export supports backing up to an address. The simplest way is to mount the remote storage into a local directory and then use the local path.

B. When calling through the API, a path written as file:///home/codeevoship/backup refers to the local file system, while one written as /home/codeevoship refers to a path on HDFS. In shell calls it is the other way around.
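
As an aside, a small sketch of how the standard Hadoop FileSystem API resolves the two path forms; the paths are the ones from the example above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;

public class PathSchemes {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path onHdfs = new Path("/home/codeevoship/backup");          // no scheme: resolved against the default FS (HDFS)
        Path onLocal = new Path("file:///home/codeevoship/backup");  // explicit scheme: local file system
        System.out.println(onHdfs.getFileSystem(conf).getUri());     // e.g. hdfs://namenode:9000
        System.out.println(onLocal.getFileSystem(conf).getUri());    // file:///
    }
}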

4. How do I call Export through the API?

It is a MapReduce Job:

Create a Job instance using the static method the Export class provides (createSubmittableJob), then run it with Job.submit() (asynchronous) or Job.waitForCompletion(boolean verbose) (synchronous).
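
A minimal driver sketch along those lines, assuming the Export.createSubmittableJob(Configuration, String[]) entry point of the 0.9x-era API; the table name, path, and timestamp are taken from the shell example above. Import can be driven the same way for restores:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.Export;
import org.apache.hadoop.mapreduce.Job;

public class ExportDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Same positional arguments as in the shell:
        // <tablename> <outputdir> [<versions> [<starttime> [<endtime>]]]
        Job job = Export.createSubmittableJob(conf, new String[] {
                "contenttbl",
                "file:///home/codeevoship/contentBackup20120920",
                "1",
                "123456789"
        });
        boolean ok = job.waitForCompletion(true); // synchronous; job.submit() would return immediately
        System.exit(ok ? 0 : 1);
    }
}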

IV. Other Solutions

1. Replication or DistCp at the HDFS layer

2. HBase cluster replication
