The problem of a Go program's system thread count skyrocketing

I recently fixed a problem where a Go program's system thread count skyrocketed. The thread count stayed at twenty to thirty thousand, and sometimes even higher, which obviously does not fit Go's concurrency model. The huge thread count was first noticed because the program suddenly crashed: the program sets a maximum number of threads, so creating too many threads makes it crash.

The program in question is Swarm, which is very popular right now. It acts as a client that connects to tens of thousands of Docker daemons, keeps a long-lived connection to each of them, and periodically reads data from these Docker daemons. After some investigation, I found that the number of long connections and the number of threads were basically the same, a phenomenon I completely could not understand at first.

My initial suspicion was DNS queries going through the cgo resolver, causing threads to be created continuously. But reading the code, I could confirm that connections are established by IP address, not by domain name. To rule out the cgo DNS resolver entirely, I forced the pure Go resolver, but the thread count did not drop. Forcing the pure Go DNS resolver only requires setting the following environment variable:

export GODEBUG=netdns=go
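As an aside, newer Go versions (1.8 and later, as far as I know) also let you request the pure-Go resolver in code instead of through the environment; a minimal sketch:

package main

import (
	"context"
	"fmt"
	"net"
)

func main() {
	// PreferGo asks the net package to use the pure-Go resolver for this
	// Resolver instance, similar in spirit to GODEBUG=netdns=go.
	r := &net.Resolver{PreferGo: true}
	addrs, err := r.LookupHost(context.Background(), "example.com")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(addrs)
}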

Next I continued down the cgo path, looking for places in the code that call C code through cgo and the like, but found no cgo calls at all. Finally I force-disabled cgo altogether, and that did not help either.
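For reference, cgo can be force-disabled at build time with the standard CGO_ENABLED variable (the exact build invocation for swarm may differ):

CGO_ENABLED=0 go build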

I also analyzed the pprof data and found nothing suspicious. pprof actually has a threadcreate profile, which sounds like exactly the right tool, but for some reason all I ever get out of it is the number of threads, with no stack information describing where they were created. If you know the right way to use pprof/threadcreate, please tell me, thank you.
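For reference, this is roughly how I was pulling the threadcreate profile (a minimal sketch; whether the missing stacks are my mistake or a limitation of the profile, I still don't know):

package main

import (
	"fmt"
	"os"
	"runtime/pprof"
)

func dumpThreadCreateProfile() {
	p := pprof.Lookup("threadcreate")
	fmt.Println("threads:", p.Count())
	// debug=1 requests the human-readable form, which is supposed to include
	// the stack recorded when each thread was created.
	p.WriteTo(os.Stdout, 1)
}

func main() {
	dumpThreadCreateProfile()
}

The same profile is also exposed over HTTP at /debug/pprof/threadcreate?debug=1 when net/http/pprof is imported.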

With nothing else to go on, I tried using the runtime.Stack() method to print the call stack at the point where the goroutine scheduler creates a thread. Because the array used to store the stack escapes to the heap, this attempt did not succeed; presumably I was holding it wrong. Finally, I simply turned on the scheduler state tracing, which showed that many threads had indeed been created, and that these threads were not idle but actually working. I still could not understand why so many threads were being created.
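(For reference, this is what ordinary use of runtime.Stack looks like; it does not help inside the runtime's thread-creation path precisely because of the heap escape mentioned above.)

package main

import (
	"fmt"
	"runtime"
)

func dumpAllStacks() {
	// The buffer is allocated on the heap, which is fine in user code but not
	// inside the scheduler's thread-creation path.
	buf := make([]byte, 1<<20)
	n := runtime.Stack(buf, true) // true: include every goroutine's stack
	fmt.Printf("%s\n", buf[:n])
}

func main() {
	dumpAllStacks()
}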

To turn on scheduler state tracing:

export GODEBUG=scheddetail=1,schedtrace=1000

This prints the scheduler state to standard error every 1,000 milliseconds.

After a colleague suggested using pstack to see what each thread was actually doing, it turned out that most of the threads were sitting in a read system call. The thread stack looked like this:

Tracing one of these threads with strace showed the following:

You can see that this thread has been blocked on the read call for a long time. Of course I did not believe my eyes, and went further to verify that fd 16 really was a network connection. In Go's world, network I/O is not supposed to block a thread; otherwise nothing would work. At this point I seriously suspected that strace was tracing incorrectly, so I added logging to the Go net library to confirm whether the read system call ever returned, recompiled Go and swarm, and deployed it. The result: when no more data arrives, the read system call does not return; that is, it blocks. Here is an on-line thread stack after adding the logs:

With Start/Stop log lines added before and after the read system call, you can clearly see that the last read blocked. At this point I could explain why the thread count skyrocketed and roughly matched the connection count. But the Go net library definitely sets the non-blocking attribute on every connection, and these connections ended up blocking anyway. Either this is a kernel bug, or the connection's attribute gets destroyed somewhere along the way. Naturally I preferred to believe the latter, and suddenly remembered the custom Dial() method in the code, which sets a TCP option on the connection.

// TCP_USER_TIMEOUT is a relatively new feature to detect dead peer from sender side.
// Linux supports it since kernel 2.6.37. It's among Golang experimental under
// golang.org/x/sys/unix but it doesn't support all Linux platforms yet.
// We explicitly define it here until it becomes official in golang.
// TODO: replace it with proper package when TCP_USER_TIMEOUT is supported in golang.
const tcpUserTimeout = 0x12

syscall.SetsockoptInt(int(f.Fd()), syscall.IPPROTO_TCP, tcpUserTimeout, msecs)

After removing this TCP option, recompiling and deploying, everything was indeed fine, and the thread count dropped back to a normal few dozen.
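Incidentally, a more direct way to check whether a connection's descriptor is still non-blocking (instead of inferring it from strace) is to read its file status flags. A minimal sketch using golang.org/x/sys/unix; the fd number 16 is only illustrative, taken from the strace session above:

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// isNonblocking reports whether O_NONBLOCK is set on the given descriptor.
func isNonblocking(fd int) (bool, error) {
	flags, err := unix.FcntlInt(uintptr(fd), unix.F_GETFL, 0)
	if err != nil {
		return false, err
	}
	return flags&unix.O_NONBLOCK != 0, nil
}

func main() {
	nb, err := isNonblocking(16)
	fmt.Println("nonblocking:", nb, "err:", err)
}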

The TCP_USER_TIMEOUT option was introduced in kernel 2.6.37; perhaps the kernel we run does not support it, and forcing the setting caused problems. Why setting TCP_USER_TIMEOUT would turn a connection from non-blocking into blocking was worth further scrutiny. Note: this conclusion turned out to be not quite right; the update follows.

The real reason

The snippet above is the code that sets the TCP_USER_TIMEOUT option. Earlier I had simply commented out the call to this function and rashly concluded that setting this option was the problem. In the end, what actually caused the blocking was the conn.File() call. The underlying code looks like this:

// File sets the underlying os.File to blocking mode and returns a copy.
// It is the caller's responsibility to close f when finished.
// Closing c does not affect f, and closing f does not affect c.
//
// The returned os.File's file descriptor is different from the connection's.
// Attempting to change properties of the original using this duplicate
// may or may not have the desired effect.
func (c *conn) File() (f *os.File, err error) {
	f, err = c.fd.dup()
	if err != nil {
		err = &OpError{Op: "file", Net: c.fd.net, Source: c.fd.laddr, Addr: c.fd.raddr, Err: err}
	}
	return
}

func (fd *netFD) dup() (f *os.File, err error) {
	ns, err := dupCloseOnExec(fd.sysfd)
	if err != nil {
		return nil, err
	}

	// We want blocking mode for the new fd, hence the double negative.
	// This also puts the old fd into blocking mode, meaning that
	// I/O will block the thread instead of letting us use the epoll server.
	// Everything will still work, just with more threads.
	if err = syscall.SetNonblock(ns, false); err != nil {
		return nil, os.NewSyscallError("setnonblock", err)
	}

	return os.NewFile(uintptr(ns), fd.name()), nil
}

You don't even need to read the code; the comments alone make it clear: duplicating the fd through File() also puts the original descriptor into blocking mode. So the way the TCP option was being set here was simply wrong.
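For what it's worth, on Go versions that have it (1.9 and later), (*net.TCPConn).SyscallConn gives access to the connection's own descriptor without duplicating it or switching it to blocking mode, so a socket option can be set safely. A sketch of how the custom Dial() code could set TCP_USER_TIMEOUT this way, reusing the tcpUserTimeout constant and msecs value from the earlier snippet; the address and timeout below are purely illustrative:

package main

import (
	"net"
	"syscall"
)

const tcpUserTimeout = 0x12 // TCP_USER_TIMEOUT, same value as in the snippet above

// setUserTimeout sets TCP_USER_TIMEOUT on the connection's own descriptor.
// Unlike conn.File(), RawConn.Control does not dup the fd and does not put
// it into blocking mode, so the netpoller keeps handling the connection.
func setUserTimeout(conn *net.TCPConn, msecs int) error {
	rc, err := conn.SyscallConn()
	if err != nil {
		return err
	}
	var sockErr error
	if err := rc.Control(func(fd uintptr) {
		sockErr = syscall.SetsockoptInt(int(fd), syscall.IPPROTO_TCP, tcpUserTimeout, msecs)
	}); err != nil {
		return err
	}
	return sockErr
}

func main() {
	conn, err := net.Dial("tcp", "127.0.0.1:2375") // illustrative address
	if err != nil {
		return
	}
	defer conn.Close()
	if tcp, ok := conn.(*net.TCPConn); ok {
		_ = setUserTimeout(tcp, 30000) // 30s, illustrative value
	}
}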


A long time ago, I found the following comment in the Swarm code:

// Swarm runnable threads could be large when the number of nodes is large
// or under request bursts. Most threads are occupied by network connections.
// Increase max thread count from 10k default to 50k to accommodate it.
const maxThreadCount int = 50 * 1000
debug.SetMaxThreads(maxThreadCount)

So the huge thread count had apparently been noticed before, but it was assumed that a large number of connections and concurrent requests simply need that many threads. Judging from this, there may not have been a deep understanding of Go's concurrency model and event-driven I/O on Linux.

Troubleshooting this problem took me a lot of time, mainly because what I observed completely contradicted my own understanding of how this "world" works, and I found it hard to believe it was a Go bug (indeed it was not). Many times I did not know where to start, and could only keep going back into the runtime code to look for the conditions under which threads are created.



Some of my previously written blog posts: www.skoo.me
