HDFS block loss exceeding the threshold causes the NameNode to enter Safe Mode (SafeMode): solution


Background and symptom

Due to insufficient disk space, insufficient memory, an unexpected power-off, or similar causes, DataNode data blocks are lost and the NameNode writes logs similar to the following:

Caused by: org.apache.hadoop.hdfs.server.namenode.SafeModeException: Log not rolled. Name node is in safe mode.
The reported blocks 632758 needs additional 5114 blocks to reach the threshold 0.9990 of total blocks 638510.
The number of live datanodes 3 has reached the minimum number 0. Safe mode will be turned off automatically once the thresholds have been reached.
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.checkNameNodeSafeMode(FSNamesystem.java:1209)
        ... more

Cause analysis

Because of the power loss and insufficient memory, the DataNodes lost more blocks than the configured loss percentage allows, so the NameNode automatically entered safe mode.
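Before changing anything, you can confirm the current safe-mode state and the configured thresholds from any node with the HDFS client installed (a minimal sketch; the two configuration keys are the standard Hadoop 2.x names and are assumed to apply to this cluster):

hdfs dfsadmin -safemode get                                   # is the NameNode in safe mode?
hdfs getconf -confKey dfs.namenode.safemode.threshold-pct     # block-report threshold, 0.999 by default (the 0.9990 in the log)
hdfs getconf -confKey dfs.namenode.safemode.min.datanodes     # minimum live DataNodes (the "minimum number 0" in the log)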

Solution

Install the HDFS client and execute the following commands (a consolidated sketch follows the note below):

Step 1: Leave safe mode: hadoop dfsadmin -safemode leave

Step 2: Run a health check and delete the corrupted blocks: hdfs fsck / -delete

Note: this approach causes data loss, because the files with damaged blocks are deleted.
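Put together, the recovery looks roughly like this (a minimal sketch of the two steps above, plus a final check that is not part of the original instructions):

hdfs dfsadmin -safemode leave      # step 1: leave safe mode
hdfs fsck / -delete                # step 2: delete the files with corrupted blocks (data loss)
hdfs fsck /                        # verify that the file system reports HEALTHY afterwards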

HBase fails to start after an unexpected power-off

A machine in the cluster was unexpectedly power-cycled, after which HBase could not start normally and threw a reflective invocation exception. An insert or merge (compaction) operation was probably interrupted halfway, leaving some data files in an incomplete format, or leaving some of their blocks incomplete on HDFS.

After searching for related information online, we suspected that a WAL (HLog) file containing uncommitted edits had been written only halfway and was therefore incomplete, so we tried setting hbase.hlog.split.skip.errors to true.

Explanation of the function of this parameter:

When a RegionServer restarts after a crash, there is a replay process: the WAL files under /hbase/WALs are replayed and their edits are merged into each region. If an error occurs during replay and this parameter is false, the exception is propagated outward and the replay fails; if it is true, the problematic log is skipped instead.
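For reference, the change that was tried is a single property in hbase-site.xml (a sketch; restart HBase after editing the file so the setting takes effect):

<property>
  <name>hbase.hlog.split.skip.errors</name>
  <value>true</value>
</property>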

Unfortunately, the HBase cluster still failed to start correctly after this parameter was changed.

Looking for other causes, we watched the HBase master web UI on port 60010 while HBase started and found that some regions hit a FAILED_OPEN error: the its007-meta table has 200 regions in total, and only 199 opened successfully.

That suggested the data files of the failed region were probably in a bad format, so we first checked whether its files on HDFS were healthy.

Sure enough, the Hadoop NameNode web UI on port 50070 reported that the file system had two corrupted blocks and showed the specific paths.
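The same check can be done from the command line instead of the 50070 web UI (a sketch; the table path is the one used later in this article):

hdfs fsck / -list-corruptfileblocks                                   # list every file that has a corrupt block
hdfs fsck /hbase/data/default/its007-meta -files -blocks -locations   # inspect just the affected table's directory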

(Related article: the HBase directory tree on HDFS.)

Workaround:

1. Run hadoop fsck / -files to check the HDFS files (the full command sequence is sketched after this list).

2. A corrupted file was found under the /hbase/oldWALs directory. Run hadoop fsck / -delete to purge the corrupted files.

3. Run hbase hbck -details to view an overview of HBase consistency; it showed that one region of table its007-meta failed to load.

4. Run hbase hbck -fixMeta to try to repair the system metadata table.

5. Run hbase hbck -fix to try to repair the region inconsistencies.

6. Run hbase hbck -details again; the problem was still not fixed and the region still failed to load.

Therefore, we moved the offending file out of the region directory, parking it temporarily under the HDFS root directory:

hadoop fs -mv /hbase/data/default/its007-meta/fe6463cba743a87e99f9d8577276bada/meta/9a853fdbe13046fca194051cb9f69f9b /

Here fe6463cba743a87e99f9d8577276bada is the region name, and 9a853fdbe13046fca194051cb9f69f9b is the corrupted HFile, about 800 KB in size (note: a region can contain more than one HFile).

7. Run hbase hbck -fix again on the previously failed region; the repair completed and the corrupted HFile was discarded.
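The whole sequence, put together (a sketch assembled from the steps above; the its007-meta path and the region/HFile names are specific to this cluster):

hadoop fsck / -files              # step 1: check the HDFS files
hadoop fsck / -delete             # step 2: purge the corrupted file under /hbase/oldWALs
hbase hbck -details               # step 3: locate the region that fails to load
hbase hbck -fixMeta               # step 4: try to repair the system metadata table
hbase hbck -fix                   # step 5: try to repair region inconsistencies
hbase hbck -details               # step 6: still broken, so move the bad HFile aside
hadoop fs -mv /hbase/data/default/its007-meta/fe6463cba743a87e99f9d8577276bada/meta/9a853fdbe13046fca194051cb9f69f9b /
hbase hbck -fix                   # step 7: repair completes once the bad HFile is out of the way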

Summary:

In total, two HBase files on HDFS were damaged. (Related article: concepts behind writing files to HDFS.)

One was under /hbase/oldWALs, which stores HLog (WAL) files that are no longer needed. An obsolete HLog was being moved from the WALs directory into oldWALs when the power failed halfway through the write, leaving the HDFS file with a corrupted block.

The other was the corrupted HFile under the region. At about 800 KB it is fairly small, so it was probably being flushed from the MemStore to an HFile when the write was cut off halfway, leaving the file with a corrupted block on HDFS.
