Hadoop + HBase cluster data migration
Data migration or backup is a requirement that every company will face sooner or later. The official HBase website provides several solutions for HBase data migration. We recommend using Hadoop distcp: it is well suited to migrating large volumes of data and to migrating between clusters of different versions.
Versions:
Hadoop 2.7.1
HBase 0.98.12
Today, while using Hadoop distcp to migrate HBase data between clusters of the same version, I ran into the following problem:
The error shows that the source file size is inconsistent with the target file size, and the cause was not obvious. Searching the Internet for similar errors turned up no identical cases; the closest hits pointed to a CRC file checksum mismatch, which again comes down to inconsistent file sizes. After three retries the error was still the same.
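Before going further, it can help to confirm which files actually differ by comparing sizes and checksums on both clusters. A minimal sketch follows; the cluster addresses and HFile path are placeholders, not values taken from the real error, and HDFS checksums are only comparable when both clusters use the same block size:
# Hypothetical file path; substitute the path reported in the distcp error.
FILE=/hbase/data/default/ETLDB/some-region/cf/some-hfile
# Compare sizes on the source and target clusters.
hadoop fs -ls hdfs://10.0.0.100:8020$FILE
hadoop fs -ls hdfs://10.0.0.101:8020$FILE
# Compare checksums; only meaningful if the block sizes are identical.
hadoop fs -checksum hdfs://10.0.0.100:8020$FILE
hadoop fs -checksum hdfs://10.0.0.101:8020$FILE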
So I tried to find the answer in the official Hadoop documentation, and in the distcp documentation I found an -update parameter. What does it mean? It means that during a re-copy, if the size, block size, or checksum of a source file differs from that of the corresponding target file, the source file forcibly overwrites the target file. Do not use it carelessly: be cautious, because it can change the target path layout.
For example:
Assume that the data of cluster A is to be migrated to cluster B, and the HBase directory structure is the same on both clusters:
The data migration directory of cluster A is as follows:
/Data/01/
/Data/01/B
/Data/01/c
/Data/01/d
/Data/01/e
Ideally, after migration the directory structure on cluster B is the same as on cluster A:
/Data/01/
/Data/01/B
/Data/01/c
/Data/01/d
/Data/01/e
However, after -update is used, it is likely to end up with the following directory structure:
/Data/01
/Data/
/Data/B
/Data/c
/Data/d
/Data/e
The -update documentation already describes this behavior: with this option, distcp copies the contents of the source directory into the target directory rather than the source directory itself, so only the file data is guaranteed to arrive unchanged, not the directory layout. Although the directories end up misplaced, the data itself is correct. There is a simple tip to avoid the problem: if you already know a job will run into this situation, spell out the full target path in advance, so you do not have to move files into the correct directories by hand afterwards. For example, my original migration command was as follows:
hadoop distcp hdfs://10.0.0.100:8020/hbase/data/default/ETLDB hdfs://10.0.0.101:8020/hbase/data/default
With this command the data is migrated to the correct place. But if -update is used, the following path should be used instead. Note that the table name is appended to the target path; if that directory does not exist on the target, distcp creates it, so the region directories end up under the table directory as expected:
hadoop distcp -update hdfs://10.0.0.100:8020/hbase/data/default/ETLDB hdfs://10.0.0.101:8020/hbase/data/default/ETLDB
Imagine that your HBase table has more than 10,000 regions: that would mean moving 10,000 misplaced directories back to the correct locations. Writing a script could automate this, but it takes time, and who can guarantee the script itself is bug-free? So fixing things up after the fact is not recommended.
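For completeness, such a cleanup script would look roughly like the sketch below. It assumes the example paths used earlier and that ETLDB is the only table whose region directories spilled into the parent directory; it is only an illustration, not something taken from the actual migration:
# Move region directories that spilled into /hbase/data/default back under
# the table directory /hbase/data/default/ETLDB (assumes nothing else is there).
for dir in $(hdfs dfs -ls /hbase/data/default | awk '{print $8}' | grep -v '/ETLDB$'); do
  hdfs dfs -mv "$dir" /hbase/data/default/ETLDB/
done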
After the migration is complete, start the HBase cluster service and execute the following two commands to restore the metadata; otherwise the HBase cluster will not recognize the newly migrated table:
./hbase hbck -fix
./hbase hbck -repairHoles
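To check that the table is visible again, a quick verification from the HBase shell might look like this (the table name ETLDB is taken from the example above):
echo "list 'ETLDB'" | ./hbase shell
echo "scan 'ETLDB', {LIMIT => 5}" | ./hbase shell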
Summary:
(1) If a problem occurs that you cannot explain, first search Google for a similar exception. If nothing turns up, read the distcp parameter documentation on the official website, and make sure the documentation version matches your Hadoop version; otherwise some parameters may be obsolete or unsupported.
(2) If an IO exception such as "xxx file not exist" occurs when distcp copies a large directory, try reducing the number of files or directories copied in one job, as shown in the sketch below. If it still fails, go back to method (1) to track down the problem. In most cases, copying a smaller set of directories is much less likely to run into trouble.
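As a sketch of point (2), the copy can be split into one distcp job per sub-directory instead of one job for the whole table. The hostnames and paths below are the ones from the earlier example and are assumptions:
SRC=hdfs://10.0.0.100:8020/hbase/data/default/ETLDB
DST=hdfs://10.0.0.101:8020/hbase/data/default/ETLDB
# Run one distcp job per sub-directory of the table; without -update, each
# source sub-directory is created as a child of DST, preserving the layout.
for child in $(hdfs dfs -ls "$SRC" | awk '{print $NF}' | grep 'ETLDB/'); do
  hadoop distcp "$SRC/$(basename "$child")" "$DST"
done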
Reference:
http://hadoop.apache.org/docs/r2.7.1/hadoop-distcp/DistCp.html