Problem Description
The department's Hadoop cluster has been running for one month. Today it needed some tweaking, and it suddenly turned out that Hadoop would not shut down properly.
Hadoop version: 2.6.0
The details are as follows:
[root@master ~]# stop-dfs.sh
Stopping namenodes on [master]
master: no namenode to stop
slave2: no datanode to stop
slave1: no datanode to stop
...
Problem Cause
Running jps shows that the NameNode, DataNode, and the other processes are all still running normally. How baffling.
Since the stop script is what reported the error, the stop script is where to look. So I started reading the hadoop-daemon.sh script file, and there I found the cause of the problem.
First, locate where the error message is printed, in the last few lines of the file:
if [ -f $pid ]; then
  TARGET_PID=`cat $pid`
  if kill -0 $TARGET_PID > /dev/null 2>&1; then
    echo stopping $command
    kill $TARGET_PID
    sleep $HADOOP_STOP_TIMEOUT
    if kill -0 $TARGET_PID > /dev/null 2>&1; then
      echo "$command did not stop gracefully after $HADOOP_STOP_TIMEOUT seconds: killing with kill -9"
      kill -9 $TARGET_PID
    fi
  else
    echo no $command to stop
  fi
  rm -f $pid
else
  echo no $command to stop
fi
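A quick note on the kill -0 that appears twice in the script above: with signal 0 nothing is actually sent; the call only tests whether a process with that PID exists. A tiny sketch (the PID is just an example):

# Exit status is 0 if the process exists and can be signalled, non-zero otherwise
kill -0 32169 && echo "process 32169 is alive"

That exit status is exactly what the if statements rely on.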
There is a lot of code here, but only one part matters for our problem:
if [ -f $pid ]; then
  # ... many lines omitted ...
else
  echo no $command to stop
fi
Clearly, if the PID file does not exist, the script prints: no xxx to stop
So what is this PID file, and why doesn't it exist? Find the declaration of the pid variable, on line 107 of the script file:
pid=$HADOOP_PID_DIR/hadoop-$HADOOP_IDENT_STRING-$command.pid  # line 107
Then look for where the HADOOP_PID_DIR variable is declared.
First, a key line in the script's comment header:
#   HADOOP_PID_DIR   The pid files are stored. /tmp by default.
So the HADOOP_PID_DIR variable holds the directory where the PID files are stored, and the default is /tmp, set by the following code:
If ["$HADOOP _pid_dir" = ""]; Then //97~99 line
hadoop_pid_dir=/tmp
fi
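To make the path from line 107 concrete, here is how the pieces expand on this cluster (a quick sketch; HADOOP_IDENT_STRING defaults to the user running the daemons, root here):

HADOOP_PID_DIR=/tmp          # the default we just found
HADOOP_IDENT_STRING=root     # defaults to the user starting the daemon
command=namenode
echo "$HADOOP_PID_DIR/hadoop-$HADOOP_IDENT_STRING-$command.pid"
# prints: /tmp/hadoop-root-namenode.pid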
So what exactly is this PID file? When Hadoop starts, each daemon writes its process ID into a file, so that when stop-dfs.sh runs later it can read the PID back and use it to shut the process down.
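The write side lives in the start branch of the same script. A minimal sketch of the idea (simplified; the real hadoop-daemon.sh also handles logging, niceness, and several daemon types):

# Launch the daemon in the background, then record its PID
nohup "$hdfsScript" --config "$HADOOP_CONF_DIR" "$command" "$@" > "$log" 2>&1 < /dev/null &
echo $! > "$pid"   # $! holds the PID of the last background process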
Now the cause of the problem is clear: the hadoop-*.pid files in the /tmp directory cannot be found.
Solving the Problem
Let's see what else is in the /tmp directory:
[root@slave1 ~]# ll /tmp/
srwxr-x---  1 root root    0 Mar 13:39 aegis-<Guid(5A2C30A2-A87D-490A-9281-6765EDAD7CBA)>
drwxr-xr-x  2 root root 4096 Apr 13:55 hsperfdata_root
srwxr-x---  1 root root    0 Mar 13:39 qtsingleapp-aegisg-46d2-0
srwxrwxrwx  1 root root    0 Mar 13:39 qtsingleapp-aegiss-a5d2-0
Well, we have everything except what we need.
We know /tmp is a temporary directory, and the system periodically cleans out the files in it. Obviously, keeping the PID files there is not reliable: the PID files had not been accessed for a long time, so they were cleaned up!
Since Hadoop no longer knows which processes it needs to shut down, we can only stop them manually.
First use ps -ef to find the PIDs of the NameNode, DataNode, and the other daemons, then kill each one with kill -9.
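A sketch of the manual cleanup (PIDs on your machines will of course differ):

# List the Hadoop daemons and their PIDs; jps is the quickest way
jps
# or, with ps:
ps -ef | grep -i -e namenode -e datanode | grep -v grep

# Then kill each daemon by its PID, for example:
kill -9 32169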
Restart Hadoop and look at the /tmp directory again; a few new files have appeared:
[root@master ~]# ll /tmp
-rw-r--r-- 1 root root    6 Apr 10 13:39 hadoop-root-namenode.pid
-rw-r--r-- 1 root root    6 Apr 10 13:39 hadoop-root-secondarynamenode.pid
-rw-r--r-- 1 root root    6 Apr 10 13:55 yarn-root-resourcemanager.pid
drwxr-xr-x 4 root root 4096 Apr 10 14:52 jetty_0_0_0_0_50070_hdfs____w2cu08
drwxr-xr-x 4 root root 4096 Apr 10 14:52 jetty_0_0_0_0_50090_secondary____y6aanv
drwxr-xr-x 5 root root 4096 Apr 10 15:02 jetty_master_8088_cluster____i4ls4w
The first three files are the ones holding the PIDs; the three jetty_xxx directories are temporary directories for Hadoop's web applications and are not our concern.
Open one of the PID files and take a look:
[root@master tmp]# cat hadoop-root-namenode.pid
32169
Quite simple: it holds the PID of the NameNode process, and this is the file the PID is read from when the NameNode process is shut down.
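In other words, the stop branch we read earlier boils down to this one-liner (a sketch, using the file above):

# Read the saved PID and signal the process
kill "$(cat /tmp/hadoop-root-namenode.pid)"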
And with that, the problem is happily solved.
However, now that we know /tmp is not a safe place for the PID files, leaving things as they are would just be lazy.
To change the directory the PID files are stored in, simply add one line to the hadoop-daemon.sh script:
HADOOP_PID_DIR=/root/hadoop/pid  # line 25
Remember to shut Hadoop down before making this change, or you won't be able to stop it afterwards. In the same way, yarn-daemon.sh needs the matching change:
YARN_PID_DIR=/root/hadoop/pid
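Putting the fix together, a minimal sketch (/root/hadoop/pid is just this article's choice; any directory outside /tmp will do):

# in hadoop-daemon.sh (near the top):
HADOOP_PID_DIR=/root/hadoop/pid

# in yarn-daemon.sh (YARN reads its own variable):
YARN_PID_DIR=/root/hadoop/pid

# create the directory up front, in case the scripts do not create it for you
mkdir -p /root/hadoop/pid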
Then run start-dfs.sh and start-yarn.sh to start Hadoop, and check the /root/hadoop/pid directory:
[root@master pid]# ll
-rw-r--r-- 1 root root 5 Apr 10 14:52 hadoop-root-namenode.pid
-rw-r--r-- 1 root root 5 Apr 10 14:52 hadoop-root-secondarynamenode.pid
-rw-r--r-- 1 root root 5 Apr 10 15:02 yarn-root-resourcemanager.pid
Well, no need to ever worry about the no xxx to stop warning showing up again.
Cleanup Policy for the /tmp Directory
Besides changing where the PID files are saved, there is one other solution I can think of: just don't let the operating system delete the PID files stored in /tmp. So let's take a look at how the operating system cleans the /tmp directory.
Before running into this problem I had never paid any attention to the /tmp directory; a quick Baidu search turned up the answer.
Let's first look at an important command:
tmpwatch
The tmpwatch command removes temporary files that have not been used for a given period; you specify the age threshold, which is in hours by default.
Common options:
-m or --mtime: judge by the file's modification time
-c or --ctime: judge by the file's inode change time
-M or --dirmtime: judge by the directory's modification time
-x or --exclude=path: exclude a path
-X or --exclude-pattern=pattern: exclude paths matching a pattern
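A couple of usage sketches (the paths and ages are examples only):

# Delete files under /tmp not accessed for 240 hours (10 days)
/usr/sbin/tmpwatch 240 /tmp
# The age can also carry a unit suffix: d for days, h for hours
/usr/sbin/tmpwatch 10d /tmp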
As a temporary directory, /tmp is cleaned by the system every day by default: a daily cron job runs the /etc/cron.daily/tmpwatch script, which uses the tmpwatch command to apply the cleanup policy. Let's look at the script's contents:
/etc/cron.daily/tmpwatch
#!/bin/sh
flags=-umc
/usr/sbin/tmpwatch "$flags" -x /tmp/.X11-unix -x /tmp/.XIM-unix \
    -x /tmp/.font-unix -x /tmp/.ICE-unix -x /tmp/.Test-unix \
    -X '/tmp/hsperfdata_*' 10d /tmp
/usr/sbin/tmpwatch "$flags" 30d /var/tmp
for d in /var/{cache/man,catman}/{cat?,X11R6/cat?,local/cat?}; do
    if [ -d "$d" ]; then
        /usr/sbin/tmpwatch "$flags" -f 30d "$d"
    fi
done
The first tmpwatch invocation (the command spanning three continued lines) is the statement that sets the cleanup policy for the /tmp directory. -x and -X exclude files or directories from the cleanup, and 10d means files that have not been accessed in the last 10 days get deleted (in some versions this is written as 240, meaning 240 hours, i.e. 10 days).
OK: unvisited for 10 days and gone. The Hadoop cluster had been running for weeks, so of course the PID files were nowhere to be found.
Did you notice the pattern that invocation excludes, '/tmp/hsperfdata_*'? When we solved the problem above and first listed the /tmp directory, there was an entry matching exactly this pattern: hsperfdata_root.
So, to keep the system from deleting the PID files, we just follow the same pattern and add one more exclusion to the tmpwatch script:
-X '/tmp/*.pid'
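With that change, the first tmpwatch invocation in /etc/cron.daily/tmpwatch would look like this (a sketch of the edited lines):

/usr/sbin/tmpwatch "$flags" -x /tmp/.X11-unix -x /tmp/.XIM-unix \
    -x /tmp/.font-unix -x /tmp/.ICE-unix -x /tmp/.Test-unix \
    -X '/tmp/hsperfdata_*' -X '/tmp/*.pid' 10d /tmp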
However, /tmp is still a temporary directory, and important files have no business living there, so the first solution is the one I recommend.