Half an hour in the Hadoop emergency room: dynamically adjusting the log level
- Technology
- Big Data snail
2015.02.03
This article uses a real problem encountered in production to share a simple but useful lesson with fellow data practitioners: when you hit a problem like this online, stay calm; the more anxious you get, the more likely you are to make things worse. As the saying goes, the impatient never get to eat hot tofu (haste makes waste).
Urgent
On Tuesday, the Hadoop cluster at a friend's company became unavailable, and stayed down from 9 a.m. until noon. The business side pressed harder and harder, hoping for recovery as soon as possible, or at least recovery to some usable point in time. Anyone who has operated an online service will understand the mood, especially when you have no leads at all and can only panic.
Understanding the symptoms
My friend contacted me and described the specific symptom: during startup, the NameNode kept printing the following log:
I had not seen this situation before. I asked which version they were running: 2.4.0. Since the log lines were at INFO level, I judged that the data itself was probably fine.
Searching for information
For this kind of problem, the usual move is to Google it together with the Hadoop JIRA.
I opened the first link and searched for the keyword: does not belong to any file.
A quick scan turned up https://issues.apache.org/jira/browse/HDFS-7503, which describes a similar phenomenon.
The gist: if the cluster is restarted right after a large number of files have been deleted, the NameNode stays in safe mode for a long time because it prints a log line for every block being freed, leaving the NameNode unavailable for the duration. According to the issue, the fix landed in versions 2.6.1 and 1.3.0.
Confirming the information
I confirmed with my friend that, indeed, a large batch deletion had been run the day before, removing more than 7 million files. That matched the situation described in the JIRA. A rough estimate: if the NameNode prints 100 INFO lines per second, printing more than 7 million lines would take about a day. The most straightforward fix was to lower the log level.
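The back-of-the-envelope estimate above is easy to check; the 100 lines per second rate is the article's assumption, not a measured figure:

```python
# Rough estimate of how long the NameNode would keep printing
# one INFO line per freed block (throughput is an assumed 100 lines/s).
deleted_files = 7_000_000    # "more than 700W" (7 million) deleted files
lines_per_second = 100       # assumed logging throughput

hours = deleted_files / lines_per_second / 3600
print(f"{hours:.1f} hours")  # roughly a day of solid logging
```

Even if the real throughput were several times higher, the NameNode would still be unusable for hours, which is why lowering the log level was the right lever.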
Action: dynamically adjusting the log level
To lower the NameNode's log level without a restart, open http://{your_namenode_ip}:50070/logLevel
I checked the source code, found the fully qualified name of the class that prints this log, and entered
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager. The page showed its level as INFO, so I set it to WARN and watched the NameNode's latest log: no change. After waiting a while it was still printing; the problem was not resolved.
My conclusion was that I had tuned the wrong logger category, so I went back to the source code that prints this log:
The log is printed through blockLog, a logger field in BlockManager, and its logger category is actually BlockStateChange! You can also see this in the printed log lines themselves.
I entered BlockStateChange in the Log field, WARN in the Level field, and clicked the "Set Log Level" button.
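As an aside, the logLevel page is backed by a plain servlet, so during an incident the same change can be scripted instead of clicked through the form. A hedged sketch, assuming the `log` and `level` query parameters that the page's form submits (the host placeholder is kept from the article):

```python
# Build the URL for the NameNode's logLevel servlet; fetching it with
# urllib.request.urlopen(url) against a live NameNode applies the change.
from urllib.parse import urlencode

def loglevel_url(host, logger, level=None):
    """URL to query a logger's level, or to set it if `level` is given."""
    params = {"log": logger}
    if level:
        params["level"] = level
    return f"http://{host}:50070/logLevel?{urlencode(params)}"

url = loglevel_url("{your_namenode_ip}", "BlockStateChange", "WARN")
print(url)
```

Building the URL for BlockStateChange versus the BlockManager class name makes the difference between the two categories from the failed first attempt easy to see.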
Checking the NameNode log again, the flood stopped immediately, while other messages were still being printed, confirming that the change had taken effect. After 2-3 minutes the log was back to normal.
We tested uploading and downloading data and verified that the metrics were normal: the cluster was available again. The whole repair took half an hour.
When you hit a problem like this online, stay calm; the more anxious you get, the more likely you are to make things worse!
(Source: Civilian Big Data)