Let's talk about the usage scenario first, a service starts an agent on each server, and the user will execute the task on the specified machine and return the result to the web page. The execution of tasks is user-defined scripts, which are usually shells or python, and will continue to spawn child processes and grandchildren, until execution is completed or it is killed after a timeout.
Recently I found that there are often tasks that have been running, but in fact they have been killed after a timeout, and the output is not written to the system, and the execution of the task is not visible. Log in to the machine and find that the script execution process has been killed, but a subscript is stuck at a certain http call. Our company's network is too heavy to be seen directly, there are still cases that are not accessible on the internal network, and there are many public network machines. Let's look at this script. By default, python requests have no timeout ...
To summarize the phenomenon: the agent uses go cmd to start the child process, and the child process will also start the grandchild process, which is blocked for some reason. At this time, if the child process is killed by the agent kill due to timeout, the agent is still in the wait state.
Reproduce case
The agent uses exec.CommandContext to start the task, sets the ctx timeout to 30s, writes the result to bytes.Buffer, and prints it. A simplified example is as follows:
func main () {
ctx, cancelFn: = context.WithTimeout (context.Background (), time.Second * 30)
defer cancelFn ()
cmd: = exec.CommandContext (ctx, "/root/dongzerun/sleep.sh")
var b bytes.Buffer
cmd.Stdout = & b // Spoiler, the pit is here
cmd.Stderr = & b
cmd.Start ()
cmd.Wait ()
fmt.Println ("recive:", b.String ())
}
This is sleep.sh, which simulates a child process
#! / bin / sh
echo "in sleep"
sh /root/dongzerun/sleep1.sh
This is sleep1.sh simulation grandchild, sleep 1000 is blocked here
#! / bin / sh
sleep 1000
phenomenon
Start the test go program and see ps axjf | less
ppid pid pgid
2468 32690 32690 32690? -1 Ss 0 0:00 \ _ sshd: root @ pts / 6
32690 32818 32818 32818 pts / 6 28746 Ss 0 0:00 | \ _ -bash
32818 28531 28531 32818 pts / 6 28746 S 0 0:00 | \ _ strace ./wait
28531 28543 28531 32818 pts / 6 28746 Sl 0 0:00 | | \ _ ./wait
28543 28559 28531 32818 pts / 6 28746 S 0 0:00 | | \ _ / bin / sh /root/dongzerun/sleep.sh
28559 28560 28531 32818 pts / 6 28746 S 0 0:00 | | \ _ sh /root/dongzerun/sleep1.sh
28560 28563 28531 32818 pts / 6 28746 S 0 0:00 | | \ _ sleep 1000
Waited for 30s, see through ps axjf | less
2468 32690 32690 32690? -1 Ss 0 0:00 \ _ sshd: root @ pts / 6
32690 32818 32818 32818 pts / 6 36192 Ss 0 0:00 | \ _ -bash
32818 28531 28531 32818 pts / 6 36192 S 0 0:00 | \ _ strace ./wait
28531 28543 28531 32818 pts / 6 36192 Sl 0 0:00 | | \ _ ./wait
1 28560 28531 32818 pts / 6 36192 S 0 0:00 sh /root/dongzerun/sleep1.sh
28560 28563 28531 32818 pts / 6 36192 S 0 0:00 \ _ sleep 1000
Through the above case, you can see that sleep1.sh becomes an orphan process and is claimed by init 1, but 28543 wait does not exit, so what is he doing? ? ?
analysis
Using strace to view wait programs
epoll_ctl (4, EPOLL_CTL_DEL, 6, {0, {u32 = 0, u64 = 0}}) = 0
close (6) = 0
futex (0xc420054938, FUTEX_WAKE, 1) = 1
waitid (P_PID, 28559, {si_signo = SIGCHLD, si_code = CLD_KILLED, si_pid = 28559, si_status = SIGKILL, si_utime = 0, si_stime = 0}, WEXITEDWNOWAIT, NULL) = 0
Stuck here for about 30s
--- SIGCHLD {si_signo = SIGCHLD, si_code = CLD_KILLED, si_pid = 28559, si_status = SIGKILL, si_utime = 0, si_stime = 0} ---
rt_sigreturn () = 0
futex (0x9a0378, FUTEX_WAKE, 1) = 1
futex (0x9a02b0, FUTEX_WAKE, 1) = 1
wait4 (28559, [{WIFSIGNALED (s) && WTERMSIG (s) == SIGKILL}], 0, {ru_utime = {0, 0}, ru_stime = {0, 0}, ...}) = 28559
futex (0x9a0b78, FUTEX_WAIT, 0, NULL
From the go source code, you can see that when go exec wait, it will first execute waitid, block here, and then wait4 again to wait for the final exit result. Don't quite understand why wait twice ... But in the end stuck on futex, what resources are looking like waiting? ? ?
Open golang pprof, run the program again, and pprof
go func () {
err: = http.ListenAndServe (": 6060", nil)
if err! = nil {
fmt.Printf ("failed to start pprof monitor:% s", err)
}
} ()
curl http://127.0.0.1:6060/debug/pprof/goroutine?debug=2
goroutine 1 [chan receive]:
os / exec. (* Cmd) .Wait (0xc42017a000, 0x7c3d40, 0x0)
/usr/local/go/src/os/exec/exec.go:454 + 0x135
main.main ()
/root/dongzerun/wait.go:32 + 0x167
The program didn't exit, and it was inconceivably stuck at exec.go: 454 lines of code. Check the source code of go1.9.6:
// Wait releases any resources associated with the Cmd.
func (c * Cmd) Wait () error {
...
state, err: = c.Process.Wait ()
if c.waitDone! = nil {
close (c.waitDone)
}
c.ProcessState = state
var copyError error
for range c.goroutine {
// stuck here
if err: = <-c.errch; err! = nil && copyError == nil {
copyError = err
}
}
c.closeDescriptors (c.closeAfterWait)
...
return copyError
}
Through source code analysis, the program wait stuck at <-c.errch to get chan data. So how is errch generated?
Looking at the cmd.Start source code, go organizes cmd.Stdin, cmd.Stdout, cmd.Stderr into * os.File, and writes them to the array childFiles in turn. This array index corresponds to the 0, 1, and 2 text description of the child process. Character, that is, the standard input, output, and error of the child process.
type F func (* Cmd) (* os.File, error)
for _, setupFd: = range [] F ((* Cmd) .stdin, (* Cmd) .stdout, (* Cmd) .stderr) {
fd, err: = setupFd (c)
if err! = nil {
c.closeDescriptors (c.closeAfterStart)
c.closeDescriptors (c.closeAfterWait)
return err
}
c.childFiles = append (c.childFiles, fd)
}
c.childFiles = append (c.childFiles, c.ExtraFiles ...)
var err error
c.Process, err = os.StartProcess (c.Path, c.argv (), & os.ProcAttr {
Dir: c. Dir,
Files: c.childFiles,
Env: dedupEnv (c.envv ()),
Sys: c.SysProcAttr,
})
When executing setupFd, there will be a key operation, open the pipe pipe, and encapsulate an anonymous func. The function is to write the output of the child process to the pipe or the pipe data to the child process standard input, and finally close the pipe. This anonymous Function finally executes at Start
c.closeAfterStart = append (c.closeAfterStart, pr)
c.closeAfterWait = append (c.closeAfterWait, pw)
c.goroutine = append (c.goroutine, func () error {
_, err: = io.Copy (pw, c.Stdin)
if skip: = skipStdinCopyError; skip! = nil && skip (err) {
err = nil
}
if err1: = pw.Close (); err == nil {
err = err1
}
return err
})
return pr, nil
}
Re-run the test case and use lsof to see which resources the process has opened
root @ nb1963: ~ / dongzerun # ps aux grep wait
root 4531 0.0 0.0 122 180 6520 pts / 6 Sl 17:24 0:00 ./wait
root 4726 0.0 0.0 10484 2144 pts / 6 S + 17:24 0:00 grep --color = auto wait
root @ nb1963: ~ / dongzerun #
root @ nb1963: ~ / dongzerun # ps aux | grep sleep
root 4543 0.0 0.0 4456 688 pts / 6 S 17:24 0:00 / bin / sh /root/dongzerun/sleep.sh
root 4548 0.0 0.0 4456 760 pts / 6 S 17:24 0:00 sh /root/dongzerun/sleep1.sh
root 4550 0.0 0.0 5928 748 pts / 6 S 17:24 0:00 sleep 1000
root 4784 0.0 0.0 10480 2188 pts / 6 S + 17:24 0:00 grep --color = auto sleep
root @ nb1963: ~ / dongzerun #
root @ nb1963: ~ / dongzerun # lsof -p 4531
COMMAND PID USER FD TYPE DEVICE SIZE / OFF NODE NAME
wait 4531 root 0w CHR 1,3 0t0 1029 / dev / null
wait 4531 root 1w REG 8,1 94371 4991345 /root/dongzerun/nohup.out
wait 4531 root 2w REG 8,1 94371 4991345 /root/dongzerun/nohup.out
wait 4531 root 3u IPv6 2005568215 0t0 TCP *: 6060 (LISTEN)
wait 4531 root 4u 0000 0,10 0 9076 anon_inode
wait 4531 root 5r FIFO 0,9 0t0 2005473170 pipe
root @ nb1963: ~ / dongzerun # lsof -p 4543
COMMAND PID USER FD TYPE DEVICE SIZE / OFF NODE NAME
sleep.sh 4543 root 0r CHR 1,3 0t0 1029 / dev / null
sleep.sh 4543 root 1w FIFO 0,9 0t0 2005473170 pipe
sleep.sh 4543 root 2w FIFO 0,9 0t0 2005473170 pipe
sleep.sh 4543 root 10r REG 8,1 55 4993949 /root/dongzerun/sleep.sh
root @ nb1963: ~ / dongzerun # lsof -p 4550
COMMAND PID USER FD TYPE DEVICE SIZE / OFF NODE NAME
sleep 4550 root mem REG 8,1 1607664 9179617 / usr / lib / locale / locale-archive
sleep 4550 root 0r CHR 1,3 0t0 1029 / dev / null
sleep 4550 root 1w FIFO 0,9 0t0 2005473170 pipe
sleep 4550 root 2w FIFO 0,9 0t0 2005473170 pipe
Summary of reasons
The grandchild process, by default, inherits the file descriptor opened by the parent process, that is, the pipe of node 2005473170. When the parent process is killed -9, it will clean up the resources and close the open file, but close just decrements the reference count by 1. In fact, the Sun process still has the pipe open. Looking back at the agent code
c.goroutine = append (c.goroutine, func () error {
_, err: = io.Copy (pw, c.Stdin)
if skip: = skipStdinCopyError; skip! = nil && skip (err) {
err = nil
}
if err1: = pw.Close (); err == nil {
err = err1
}
return err
})
Then after the execution of the child process is completed, go cmd executes the anonymous function's io.Copy to read the child process output data. There is never data to read and there is no timeout. Blocked here.
solution
When the cause is found, the solution is also available.
When the child process starts the grandchild process, the CloseOnEec flag is added, but it is not realistic. It depends on the output log of the grandchild
io.Copy rewrite, increase timeout call, theoretically feasible, but to change the source
When timeout kill, not only kill the child process, but kill the process group
In the end, the solution 3 was adopted. The simplified code is as follows. There are two main changes:
SysProcAttr configures Setpgid so that child and grandchildren have independent process group ids, that is, pids of child processes
syscall.Kill (-cmd.Process.Pid, syscall.SIGKILL) Specify the process group when killing a process
func Run (instance string, env map [string] string) bool {
var (
cmd * exec.Cmd
proc * Process
sysProcAttr * syscall.SysProcAttr
)
t: = time.Now ()
sysProcAttr = & syscall.SysProcAttr {
Setpgid: true, // make the child process have its own pgid, which is equivalent to the child process's pid
Credential: & syscall.Credential {
Uid: uint32 (uid),
Gid: uint32 (gid),
},
}
waitChan: = make (chan struct (), 1)
defer close (waitChan)
// Kill the process group over timeout or exit normally
go func () {
select {
case <-ctx.Done ():
log.Warnf ("timeout kill job% s-% s% s ppid:% d", j.Group, j.ID, j.Name, cmd.Process.Pid)
syscall.Kill (-cmd.Process.Pid, syscall.SIGKILL)
case <-waitChan:
}
} ()
The content source of this page is from Internet, which doesn't represent Alibaba Cloud's opinion;
products and services mentioned on that page don't have any relationship with Alibaba Cloud. If the
content of the page makes you feel confusing, please write us an email, we will handle the problem
within 5 days after receiving your email.
If you find any instances of plagiarism from the community, please send an email to:
info-contact@alibabacloud.com
and provide relevant evidence. A staff member will contact you within 5 working days.