Recently, on a production Hadoop cluster, tasks submitted to the cluster stayed stuck in the ACCEPTED state for a long time and could not be allocated resources. After digging through logs and checking service states, the cause turned out to be a NameNode active/standby failover: the previous active NameNode had gone down for reasons unknown, triggering the switchover right at the business peak. Some data blocks had not been synchronized (the logs suggested the NameNode could not communicate with the JournalNode cluster, possibly due to a network anomaly, but there was no conclusive log evidence). After the failover, the newly active NameNode was forced into safe mode and never left it, so subsequent jobs found it harder and harder to obtain resources. The cluster itself was intact, yet the business had all but collapsed.
The root cause of the downtime still had to be analyzed, but restoring the business was the more urgent need, so getting the NameNode out of safe mode became the immediate priority.
After some searching, I found a useful blog post: "Hadoop safe mode explained and configured".
In simple terms, while HDFS is in safe mode it only serves read-only metadata operations; creating files, deleting files, and other writes are rejected. The NameNode is also busy checking large numbers of data blocks, so resource allocation and job submission take far longer than expected.
Therefore, when the business urgently needs to recover, you can try lowering the following two parameters so that safe mode ends as soon as possible:
dfs.namenode.replication.min -- the minimum number of replicas a block must have to count as satisfied
dfs.namenode.safemode.threshold-pct -- the fraction of the cluster's blocks that must meet that minimum replication before safe mode can exit
<property>
  <name>dfs.namenode.replication.min</name>
  <value>1</value>
  <description>Minimal block replication.</description>
</property>

<property>
  <name>dfs.namenode.safemode.threshold-pct</name>
  <value>0.999f</value>
  <description>
    Specifies the percentage of blocks that should satisfy the minimal
    replication requirement defined by dfs.namenode.replication.min.
    Values less than or equal to 0 mean not to wait for any particular
    percentage of blocks before exiting safe mode. Values greater than 1
    will make safe mode permanent.
  </description>
</property>
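Before lowering these values, it is worth checking how far the NameNode actually is from the exit threshold. A sketch using standard HDFS admin commands (they must be run against a live cluster, typically as the HDFS superuser; the path `/` is just an example scope):

```shell
# Ask the NameNode whether it is currently in safe mode
hdfs dfsadmin -safemode get

# Summarize block health for the namespace: total, under-replicated,
# missing and corrupt blocks are listed at the end of the report
hdfs fsck /

# Cluster-wide view of live/dead DataNodes and remaining capacity
hdfs dfsadmin -report
```

If `fsck` shows only a small number of under-replicated or missing blocks, lowering dfs.namenode.safemode.threshold-pct slightly is usually enough; if many blocks are missing, forcing safe mode off will surface read errors on those files.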
If you want to stop some anomalous blocks from being verified over and over, you can set dfs.namenode.safemode.threshold-pct to 0 (or a negative value) and restart the NameNode, in which case it will never enter safe mode; alternatively, manually leave safe mode with the following command:
hdfs dfsadmin -safemode leave
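For reference, `-safemode` accepts several subcommands besides `leave`; `wait` is handy in recovery scripts because it blocks until the NameNode exits safe mode on its own:

```shell
hdfs dfsadmin -safemode get    # print the current safe-mode state
hdfs dfsadmin -safemode enter  # force the NameNode into safe mode
hdfs dfsadmin -safemode leave  # force the NameNode out of safe mode
hdfs dfsadmin -safemode wait   # block until safe mode is exited
```

(On very old Hadoop 1.x installs the equivalent invocation is `hadoop dfsadmin -safemode ...`.)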