While using Goreman to manage an entire TiDB cluster, I found that Goreman's run stop command did not reliably kill the TiDB components. At first I assumed our own code was mishandling the relevant signals, but it turned out that TiDB never received the signal at all, so the problem had to be somewhere else. Browsing the Goreman source, I found that it starts the TiDB programs through /bin/sh -c, which looked like it could be related.
First, write two simple programs: one starts the process directly, and the other starts it through /bin/sh -c. For simplicity, both just sleep for a long time, and we kill the started process after 10 seconds.
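import (
	"os/exec"
	"time"
)

// child starts sleep directly and kills it after 10 seconds.
func child() {
	cmd := exec.Command("sleep", "600")
	time.AfterFunc(10*time.Second, func() {
		cmd.Process.Kill()
	})
	cmd.Run()
}

// grand_child starts sleep through /bin/sh -c and kills only the shell.
func grand_child() {
	cmd := exec.Command("/bin/sh", "-c", "sleep 1000")
	time.AfterFunc(10*time.Second, func() {
		cmd.Process.Kill()
	})
	cmd.Run()
}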
After starting, the process list shows three related processes. Process 31126 (sleep 1000) is a child of 31124 (/bin/sh):
 PPID   PID  PGID   SID TTY   TPGID STAT  UID  TIME COMMAND
31119 31124 31119 30890 pts/0 31119 S+   1000  0:00 /bin/sh -c sleep 1000
31119 31125 31119 30890 pts/0 31119 S+   1000  0:00 sleep 600
31124 31126 31119 30890 pts/0 31119 S+   1000  0:00 sleep 1000
After 10 seconds, 31126 is still there: killing 31124 directly does not kill its child process. The parent of 31126 is now 1, which means 31126 has become an orphan process and has been adopted by the init process.
 PPID   PID  PGID   SID TTY   TPGID STAT  UID  TIME COMMAND
    1 31126 31119 30890 pts/0 30890 S    1000  0:00 sleep 1000
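The orphaning described above can be reproduced without watching the process list by hand. The following is a minimal sketch, assuming Linux or another Unix-like system with /bin/sh and sleep available. The helper name orphanSurvives and the shell script are made up for illustration: the script starts sleep in the background and prints its PID ($!), so the program knows which grandchild to check after it kills only the shell.

```go
package main

import (
	"bufio"
	"fmt"
	"os/exec"
	"strconv"
	"strings"
	"syscall"
)

// orphanSurvives (hypothetical helper) starts a shell that spawns a
// background sleep, kills only the shell, and reports whether the sleep
// process is still running afterwards.
func orphanSurvives() (bool, error) {
	// The shell prints the PID of its background child ($!) and then waits.
	cmd := exec.Command("/bin/sh", "-c", "sleep 30 & echo $!; wait")
	out, err := cmd.StdoutPipe()
	if err != nil {
		return false, err
	}
	if err := cmd.Start(); err != nil {
		return false, err
	}
	line, err := bufio.NewReader(out).ReadString('\n')
	if err != nil {
		return false, err
	}
	pid, err := strconv.Atoi(strings.TrimSpace(line))
	if err != nil {
		return false, err
	}

	cmd.Process.Kill() // SIGKILL is delivered to the shell only
	cmd.Wait()         // reap the shell; the error only reports the kill

	// Signal 0 delivers nothing; it only checks that the PID still exists.
	alive := syscall.Kill(pid, 0) == nil

	syscall.Kill(pid, syscall.SIGKILL) // clean up the orphan we created
	return alive, nil
}

func main() {
	alive, err := orphanSurvives()
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("grandchild still alive after killing the shell:", alive)
}
```

Under these assumptions the program should report that the grandchild is still alive, which matches the orphaned sleep 1000 seen in the process list.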
To solve this, we need to kill the whole process group rather than just the parent process. On Linux this is done with kill -- -PID, where the negative argument is treated as a process group ID. To avoid killing ourselves as well, /bin/sh must be placed in a new process group when it is created: in the output above every PGID is 31119, so running kill -- -PID against that group would kill all the related processes, including our own program.
In Go, we set Setpgid: true to explicitly create a new process group, as follows:
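func grand_child() {
	cmd := exec.Command("/bin/sh", "-c", "sleep 1000")
	// Put /bin/sh (and the children it spawns) in a new process group.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	time.AfterFunc(10*time.Second, func() {
		// A negative PID sends the signal to the whole process group.
		syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL)
	})
	cmd.Run()
}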
Once it starts, you can see that the /bin/sh process now runs in a new process group:
PPID  PID PGID  SID TTY   TPGID STAT  UID  TIME COMMAND
4517 4522 4522 3374 pts/0  4517 S    1000  0:00 /bin/sh -c sleep 1000
4522 4524 4522 3374 pts/0  4517 S    1000  0:00 sleep 1000
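The effect of Setpgid on the PGID column can also be checked from Go itself. This is a rough sketch, assuming a Unix-like system where syscall.Getpgid and syscall.Getpgrp are available. The helper name childGroup is hypothetical, and a plain sleep is used as the child because the process group behaves the same way for any command.

```go
package main

import (
	"fmt"
	"os/exec"
	"syscall"
)

// childGroup (hypothetical helper) starts a short-lived child and returns
// its PID, its process group ID, and the caller's own process group ID.
func childGroup(setpgid bool) (pid, pgid, parentPgid int, err error) {
	cmd := exec.Command("sleep", "5")
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: setpgid}
	if err = cmd.Start(); err != nil {
		return 0, 0, 0, err
	}
	// Always kill and reap the child before returning.
	defer func() {
		cmd.Process.Kill()
		cmd.Wait()
	}()

	pid = cmd.Process.Pid
	pgid, err = syscall.Getpgid(pid)
	return pid, pgid, syscall.Getpgrp(), err
}

func main() {
	for _, set := range []bool{false, true} {
		pid, pgid, parent, err := childGroup(set)
		if err != nil {
			fmt.Println("error:", err)
			return
		}
		fmt.Printf("Setpgid=%v: child pid=%d pgid=%d, our pgid=%d\n",
			set, pid, pgid, parent)
	}
}
```

Without Setpgid the child reports the same PGID as the program that started it, so a group kill would also hit the caller. With Setpgid: true the child's PGID equals its own PID, which is why a negative PID can safely target only that group.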
Ten seconds after the program starts, all of the related processes are killed.
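To confirm programmatically that the whole group is gone, one option is to let the shell and its descendants inherit the write end of a pipe: the read end only reaches EOF once every process holding the write end has exited. The sketch below assumes a Unix-like system. The helper name killGroupAndWait, the 200 ms delay, and the 5 second timeout are illustrative choices, not part of Goreman or of the programs above.

```go
package main

import (
	"fmt"
	"io"
	"os"
	"os/exec"
	"syscall"
	"time"
)

// killGroupAndWait (hypothetical helper) runs script under /bin/sh in a new
// process group, kills the whole group, and reports whether every process
// that inherited the pipe's write end has exited.
func killGroupAndWait(script string) (bool, error) {
	r, w, err := os.Pipe()
	if err != nil {
		return false, err
	}
	defer r.Close()

	cmd := exec.Command("/bin/sh", "-c", script)
	cmd.Stdout = w // the shell and its descendants inherit the write end
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	err = cmd.Start()
	w.Close() // drop our copy so only the child processes hold it
	if err != nil {
		return false, err
	}

	time.Sleep(200 * time.Millisecond) // give the shell time to spawn its child

	// The new group's ID equals the shell's PID; a negative PID targets the group.
	if err := syscall.Kill(-cmd.Process.Pid, syscall.SIGKILL); err != nil {
		return false, err
	}
	cmd.Wait() // reap the shell; the error only reports the kill

	eof := make(chan bool, 1)
	go func() {
		_, err := r.Read(make([]byte, 1))
		eof <- err == io.EOF // EOF means every write end is closed
	}()
	select {
	case ok := <-eof:
		return ok, nil
	case <-time.After(5 * time.Second):
		return false, nil // something still holds the pipe open
	}
}

func main() {
	// "; true" keeps shells that exec the last command from replacing
	// themselves with sleep, so a real grandchild exists.
	gone, err := killGroupAndWait("sleep 1000; true")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println("all processes in the group exited:", gone)
}
```

The pipe is used instead of probing PIDs because a killed grandchild can linger briefly as a zombie until init reaps it, while its file descriptors are closed as soon as it dies.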
Now that we understand how to kill grandchild processes, Goreman's problem is easy to see. Goreman actually already contains code that kills the process group, but it is missing the most critical piece: Setpgid: true. I submitted a PR to fix this.