1. Description of the problem
For various reasons, a Cloudera CDH cluster that had already been deployed needed to be redeployed. After redeployment, Cloudera Manager's defaults for dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns are 1 hour and 1,000,000 respectively. As soon as either condition is met, the SecondaryNameNode should perform a checkpoint. Instead, the following problem occurred:
ERROR: The health test result for NAME_NODE_HA_CHECKPOINT_AGE has become bad: The filesystem checkpoint is 4 hour(s) old. This is 401.25% of the configured checkpoint period of 1 hour(s). Critical threshold: 400.00%. 2,793 transactions have occurred since the last filesystem checkpoint. This is 0.28% of the configured checkpoint transaction target of 1,000,000.
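The percentages in this alert follow directly from the configured values. Below is a minimal sketch of that arithmetic in Python. The function name, the age of 4.0125 hours (back-computed from 401.25% of 1 hour) and the ">=" comparison against the critical threshold are assumptions for illustration, not Cloudera Manager's actual implementation.

```python
def checkpoint_health(age_hours, period_hours, txns_since, txns_target,
                      critical_pct=400.0):
    """Return (age_pct, txn_pct, is_critical) for a checkpoint age check.

    Hypothetical helper: it only reproduces the arithmetic in the alert text.
    """
    age_pct = age_hours / period_hours * 100.0   # checkpoint age vs. period
    txn_pct = txns_since / txns_target * 100.0   # transactions vs. target
    # Assumption: the alert turns bad once the age reaches the threshold.
    return age_pct, txn_pct, age_pct >= critical_pct


# Values taken from the alert above; 4.0125 h is inferred from 401.25%.
age_pct, txn_pct, critical = checkpoint_health(4.0125, 1, 2793, 1_000_000)
print(f"{age_pct:.2f}% of period, {txn_pct:.2f}% of txn target, "
      f"critical={critical}")
```

Note that the transaction count is far below its target (0.28%), so the alert is driven entirely by the age of the last checkpoint.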
A preliminary analysis suggested that the SecondaryNameNode had not performed a checkpoint, so we looked at the SecondaryNameNode log and found the real error:
ERROR: Exception in doCheckpoint java.io.IOException: Inconsistent checkpoint fields
This shows how important it is to read the log of the role that is actually failing in order to pinpoint the real error.
So what is the connection between the two errors? The first one (the health alert) only tells you that no checkpoint has been completed for too long. The second one (from the SecondaryNameNode log) tells you why: the checkpoint was attempted, but it failed and was never carried out.
2. Background knowledge needed before solving the problem
Before solving the problem, we need to introduce the role and importance of the checkpoint.
(1) Checkpoint
What is a checkpoint: a checkpoint is triggered on the SecondaryNameNode by the hdfs-site.xml parameters dfs.namenode.checkpoint.period and dfs.namenode.checkpoint.txns. As soon as either of these two conditions is met, the SecondaryNameNode performs a checkpoint.
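The "either condition" rule can be sketched as a simple predicate. This is an illustration in Python, not Hadoop's actual code: the function and argument names are hypothetical, and the defaults are the 1 hour (3600 seconds) and 1,000,000 transactions mentioned above.

```python
def should_checkpoint(seconds_since_last, txns_since_last,
                      period_seconds=3600, txn_threshold=1_000_000):
    """Return True when either checkpoint trigger has been reached.

    period_seconds stands in for dfs.namenode.checkpoint.period and
    txn_threshold for dfs.namenode.checkpoint.txns (hypothetical names).
    """
    period_elapsed = seconds_since_last >= period_seconds
    txns_reached = txns_since_last >= txn_threshold
    return period_elapsed or txns_reached


print(should_checkpoint(4000, 2793))       # period elapsed
print(should_checkpoint(120, 1_500_000))   # transaction target reached
print(should_checkpoint(120, 2793))        # neither condition met
```

In the situation described in section 1, the time condition had been met four times over, which is why the missing checkpoint pointed to a failure rather than to an unmet trigger.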
(2) The contents of the checkpoint:
To perform a checkpoint, the SecondaryNameNode first reads the fsimage from the NameNode, replays the operations recorded in the NameNode's edit log on top of it, and generates a new fsimage file, which it then sends back to the NameNode. Note: if the edit log contains no records, no checkpoint is performed even when a checkpoint condition is reached, because nothing has changed.
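The merge step can be pictured with a toy model. The sketch below is a deliberate simplification in Python: it treats the fsimage as a set of paths and the edit log as a list of (operation, path) records, which is an assumption made purely for illustration and not the real on-disk format of either file.

```python
def apply_edits(fsimage, edits):
    """Replay edit log records onto a copy of the namespace snapshot.

    Toy model: fsimage is a set of paths, edits is a list of
    (operation, path) tuples. Returns None when there is nothing to
    replay, mirroring the note that no checkpoint happens without edits.
    """
    if not edits:
        return None  # nothing changed, so no new image is produced
    new_image = set(fsimage)  # never modify the old image in place
    for op, path in edits:
        if op == "create":
            new_image.add(path)
        elif op == "delete":
            new_image.discard(path)
        else:
            raise ValueError(f"unknown operation: {op}")
    return new_image


old_image = {"/data/a", "/data/b"}
edit_log = [("create", "/data/c"), ("delete", "/data/a")]
print(sorted(apply_edits(old_image, edit_log)))
```

After the merge, the new image already contains the effect of every logged operation, so those edit log records no longer need to be replayed.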
(3) The function of the checkpoint:
The SecondaryNameNode's checkpoint reduces the startup time of the NameNode, because the NameNode has fewer edit log records to replay when it starts.
3. Solution to the problem
The underlying error shows that the problem is essentially a version mismatch: when CDH was reinstalled, data from the previous CDH installation was retained, so the old checkpoint data no longer matched the new NameNode and the SecondaryNameNode could not complete the checkpoint. The solution is to delete the old data, that is, the directory the SecondaryNameNode uses for checkpoints. This is the path given by the hdfs-site.xml parameter fs.checkpoint.dir or dfs.namenode.checkpoint.dir.
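Before deleting anything, the mismatch can be confirmed by comparing the storage metadata on both sides. The sketch below assumes that each storage directory contains a current/VERSION file in key=value form with fields such as namespaceID and clusterID. The list of compared fields is an assumption about what the consistency check looks at, and all IDs in the sample data are made up.

```python
# Assumption: these are the fields whose mismatch makes a checkpoint fail.
CHECKED_FIELDS = ("layoutVersion", "namespaceID", "cTime",
                  "clusterID", "blockpoolID")


def parse_version(text):
    """Parse the key=value lines of a VERSION file into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip()
    return fields


def mismatched_fields(namenode_version, checkpoint_version):
    """Return the names of checked fields that differ between the two."""
    nn = parse_version(namenode_version)
    cp = parse_version(checkpoint_version)
    return [f for f in CHECKED_FIELDS if nn.get(f) != cp.get(f)]


# Made-up sample contents: a new NameNode vs. a leftover checkpoint dir.
new_namenode = """#sample
namespaceID=111111
clusterID=cluster-new
cTime=0
blockpoolID=BP-new
layoutVersion=-60
"""
old_checkpoint = """#sample
namespaceID=999999
clusterID=cluster-old
cTime=0
blockpoolID=BP-old
layoutVersion=-60
"""
print(mismatched_fields(new_namenode, old_checkpoint))
```

If the list is non-empty, the checkpoint directory still belongs to the previous installation, which matches the cause described above.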
After deletion, restart the cluster.
Checkpoint problem in HDFS