This incident happened in our test environment rather than production, which makes it a comparatively cheap but still valuable lesson.
Background: the company runs a Hadoop cluster for building search indexes. A PHP user called the index-building program and found that the MapReduce job would not start, throwing an IOException. The exact exception text was not recorded, but it roughly said that the disk was full, so a file could not be created.
I reproduced the environment below. After receiving the report, the first step was to check disk usage on the CentOS machine. Execute df -h to view the current usage:
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/VolGroup-lv_root
                       11G  8.7G  1.3G 100% /
tmpfs                 1.9G     0  1.9G   0% /dev/shm
/dev/sda1             485M   37M  423M   8% /boot
The root partition was at 100%, leaving no space for the temporary files Hadoop needs when starting a job, which produced the exception described at the beginning.
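A simple watchdog could have flagged this before jobs started failing. A minimal sketch; the 90% threshold and the mount point are my assumptions, not part of the original incident:

```shell
#!/bin/sh
# Warn when a mount point's usage exceeds a threshold.
# (Sketch: threshold and mount point are assumptions; tune for your setup.)
check_disk() {
  mount=$1
  limit=$2
  # df -P guarantees one line per filesystem; field 5 is Use% (e.g. "87%")
  pct=$(df -P "$mount" | awk 'NR==2 {gsub("%","",$5); print $5}')
  if [ "$pct" -ge "$limit" ]; then
    echo "WARN: $mount at ${pct}% (limit ${limit}%)"
    return 1
  fi
  return 0
}

check_disk / 90 || echo "disk nearly full, clean up before submitting jobs"
```

Run from cron, a one-line warning like this is far cheaper than debugging a failed MapReduce submission after the fact.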
Once the cause is known, the fix is straightforward: inspect file usage on the system and delete a few large, irrelevant files. Since this was a test box, and production machines usually mount much larger disks, this kind of failure should be rare online.
Execute the command ll -h to view the sizes of files and directories. In my tests this was not very informative, so I used the following command instead. du -sh * shows the disk usage of each entry:
[search@bjdevfse02 ~]$ du -sh *
4.0K beginzk.sh
4.0K clearhadoop.sh
0 hadoop
95M hadoop-1.2.1
214M hadoop-2.2.0
152K hadoopconf
345M hadoop-dd
4.0K Script
0 solr
188M solr-4.3.0
52M solr-4.3.1
704K solrconf
4.0K stopzk.sh
4.0K synconf.sh
36K tmp
0 zk
8.0K zkconf
39M zkdata
40M zookeeper-3.4.5
4.0K zookeeper.out
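When a du listing is long, sorting it makes the biggest consumers obvious at a glance. A small sketch using GNU sort's -h flag, which compares human-readable sizes like the ones above:

```shell
# List the ten largest entries in the current directory, biggest first.
# sort -h compares human-readable sizes (K, M, G) numerically; -r reverses.
du -sh * | sort -rh | head -n 10
```

On the listing above this would surface hadoop-dd (345M) and hadoop-2.2.0 (214M) at the top, the obvious candidates when reclaiming space.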
After deleting several files, disk usage dropped enough for an MR job to start. But when the job was run again, it failed once more: the logs showed that Hadoop had entered safe mode because the disk had filled up, so the job submission was rejected.
Knowing the cause, exit safe mode by executing the following command:
hadoop dfsadmin -safemode leave
After the MR job was submitted again, it ran normally.
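Rather than forcing an exit blindly, it helps to check the safe-mode status first and only leave once the disk problem is actually fixed. A sketch; safemode_is_off is a hypothetical helper, and the "Safe mode is ON/OFF" wording is what the Hadoop 1.x dfsadmin tool prints:

```shell
#!/bin/sh
# Returns 0 when the given dfsadmin status line says safe mode is OFF.
safemode_is_off() {
  echo "$1" | grep -q "Safe mode is OFF"
}

# Query the NameNode; if the command fails, conservatively assume ON.
status=$(hadoop dfsadmin -safemode get 2>/dev/null || echo "Safe mode is ON")
if safemode_is_off "$status"; then
  echo "HDFS writable, safe to submit jobs"
else
  echo "HDFS in safe mode; free disk space first, then run:"
  echo "  hadoop dfsadmin -safemode leave"
fi
```

Leaving safe mode without fixing the underlying disk shortage just postpones the next failure, so the check-then-leave order matters.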
Summary:
1. When you hit a problem, the first reaction should be to preserve as much of the original information as possible, especially the exception text, to make analysis easier. Some output may never reach a log, or the log may be too large to search conveniently; take a photo with your phone or copy and paste the text.
2. Use the exception message to pinpoint the cause as directly as possible. If you cannot locate it, also review what has changed on the system over the last few days, then rule out the possibilities one by one.
3. After solving the problem, record what happened, the elimination steps you took, and the lessons learned. Finally, share the write-up with your team or colleagues, so similar incidents can be avoided in the future, or quickly recovered from by following the document if they do recur.