The problem of a Go program's system thread count skyrocketing

I recently fixed a problem where a Go program's system thread count skyrocketed. The thread count stayed at twenty to thirty thousand, and sometimes even higher, which obviously does not fit Go's concurrency model. The huge thread count was first noticed because the program suddenly crashed: the program sets a maximum number of threads, so creating too many threads makes it crash.

The program in question is Swarm, which is very popular right now. It acts as a client that connects to tens of thousands of Docker daemons, keeps a long-lived connection to each of them, and periodically reads data from these Docker daemons. After some investigation, I found that the number of long connections and the number of threads were basically the same, a phenomenon I completely could not understand at first.

My initial suspicion was DNS queries going through the cgo resolver, causing threads to be created continuously. But reading the code, I could confirm that connections are established by IP address, not by domain name. To rule out the cgo DNS resolver entirely, I forced the pure Go resolver, but the thread count did not drop. Forcing the pure Go DNS resolver only requires setting the following environment variable:

export GODEBUG=netdns=go
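As an aside, newer Go versions (1.8 and later, as far as I know) also let you request the pure-Go resolver in code instead of through the environment; a minimal sketch:

package main

import (
	"context"
	"fmt"
	"net"
)

func main() {
	// PreferGo asks the net package to use the pure-Go resolver for this
	// Resolver instance, similar in spirit to GODEBUG=netdns=go.
	r := &net.Resolver{PreferGo: true}
	addrs, err := r.LookupHost(context.Background(), "example.com")
	if err != nil {
		fmt.Println("lookup failed:", err)
		return
	}
	fmt.Println(addrs)
}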

Next I continued down the cgo path, looking for places in the code that call C code through cgo and the like, but found no cgo calls at all. Finally I force-disabled cgo altogether, and that did not help either.
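For reference, cgo can be force-disabled at build time with the standard CGO_ENABLED variable (the exact build invocation for swarm may differ):

CGO_ENABLED=0 go build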

I also analyzed the pprof data and found nothing suspicious. pprof actually has a threadcreate profile, which sounds like exactly the right tool, but for some reason all I ever get out of it is the number of threads, with no stack information describing where they were created. If you know the right way to use pprof/threadcreate, please tell me, thank you.
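For reference, this is roughly how I was pulling the threadcreate profile (a minimal sketch; whether the missing stacks are my mistake or a limitation of the profile, I still don't know):

package main

import (
	"fmt"
	"os"
	"runtime/pprof"
)

func dumpThreadCreateProfile() {
	p := pprof.Lookup("threadcreate")
	fmt.Println("threads:", p.Count())
	// debug=1 requests the human-readable form, which is supposed to include
	// the stack recorded when each thread was created.
	p.WriteTo(os.Stdout, 1)
}

func main() {
	dumpThreadCreateProfile()
}

The same profile is also exposed over HTTP at /debug/pprof/threadcreate?debug=1 when net/http/pprof is imported.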

With nothing else to go on, I tried using the runtime.Stack() method to print the call stack at the point where the goroutine scheduler creates a thread. Because the array used to store the stack escapes to the heap, this attempt did not succeed; presumably I was holding it wrong. Finally, I simply turned on the scheduler state tracing, which showed that many threads had indeed been created, and that these threads were not idle but actually working. I still could not understand why so many threads were being created.
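(For reference, this is what ordinary use of runtime.Stack looks like; it does not help inside the runtime's thread-creation path precisely because of the heap escape mentioned above.)

package main

import (
	"fmt"
	"runtime"
)

func dumpAllStacks() {
	// The buffer is allocated on the heap, which is fine in user code but not
	// inside the scheduler's thread-creation path.
	buf := make([]byte, 1<<20)
	n := runtime.Stack(buf, true) // true: include every goroutine's stack
	fmt.Printf("%s\n", buf[:n])
}

func main() {
	dumpAllStacks()
}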

To turn on scheduler state tracing:

export GODEBUG=scheddetail=1,schedtrace=1000

This prints the scheduler state to standard error every 1,000 milliseconds.

After a colleague suggested using pstack to see what each thread was actually doing, it turned out that most of the threads were sitting in a read system call. The thread stack looked like this:

Tracing one of these threads with strace showed the following:

You can see that this thread has been blocked on the read call for a long time. Of course I did not believe my eyes, and went further to verify that fd 16 really was a network connection. In Go's world, network I/O is not supposed to block a thread; otherwise nothing would work. At this point I seriously suspected that strace was tracing incorrectly, so I added logging to the Go net library to confirm whether the read system call ever returned, recompiled Go and swarm, and deployed it. The result: when no more data arrives, the read system call does not return; that is, it blocks. Here is an on-line thread stack after adding the logs:

With Start/Stop log lines added before and after the read system call, you can clearly see that the last read blocked. At this point I could explain why the thread count skyrocketed and roughly matched the connection count. But the Go net library definitely sets the non-blocking attribute on every connection, and these connections ended up blocking anyway. Either this is a kernel bug, or the connection's attribute gets destroyed somewhere along the way. Naturally I preferred to believe the latter, and suddenly remembered the custom Dial() method in the code, which sets a TCP option on the connection.

// TCP_USER_TIMEOUT is a relatively new feature to detect dead peer from sender side.
// Linux supports it since kernel 2.6.37. It's among Golang experimental under
// golang.org/x/sys/unix but it doesn't support all Linux platforms yet.
// We explicitly define it here until it becomes official in golang.
// TODO: replace it with proper package when TCP_USER_TIMEOUT is supported in golang.
const tcpUserTimeout = 0x12

syscall.SetsockoptInt(int(f.Fd()), syscall.IPPROTO_TCP, tcpUserTimeout, msecs)

After removing this TCP option, recompiling and deploying, everything was indeed fine, and the thread count dropped back to a normal few dozen.
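Incidentally, a more direct way to check whether a connection's descriptor is still non-blocking (instead of inferring it from strace) is to read its file status flags. A minimal sketch using golang.org/x/sys/unix; the fd number 16 is only illustrative, taken from the strace session above:

package main

import (
	"fmt"

	"golang.org/x/sys/unix"
)

// isNonblocking reports whether O_NONBLOCK is set on the given descriptor.
func isNonblocking(fd int) (bool, error) {
	flags, err := unix.FcntlInt(uintptr(fd), unix.F_GETFL, 0)
	if err != nil {
		return false, err
	}
	return flags&unix.O_NONBLOCK != 0, nil
}

func main() {
	nb, err := isNonblocking(16)
	fmt.Println("nonblocking:", nb, "err:", err)
}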

The TCP_USER_TIMEOUT option was introduced in kernel 2.6.37; perhaps the kernel we run does not support it, and forcing the setting caused problems. Why setting TCP_USER_TIMEOUT would turn a connection from non-blocking into blocking was worth further scrutiny. Note: this conclusion turned out to be not quite right; the update follows.

The real reason

The snippet above is the code that sets the TCP_USER_TIMEOUT option. Earlier I had simply commented out the call to this function and rashly concluded that setting this option was the problem. In the end, what actually caused the blocking was the conn.File() call. The underlying code looks like this:

// File sets the underlying os.File to blocking mode and returns a copy.
// It is the caller's responsibility to close f when finished.
// Closing c does not affect f, and closing f does not affect c.
//
// The returned os.File's file descriptor is different from the connection's.
// Attempting to change properties of the original using this duplicate
// may or may not have the desired effect.
func (c *conn) File() (f *os.File, err error) {
	f, err = c.fd.dup()
	if err != nil {
		err = &OpError{Op: "file", Net: c.fd.net, Source: c.fd.laddr, Addr: c.fd.raddr, Err: err}
	}
	return
}

func (fd *netFD) dup() (f *os.File, err error) {
	ns, err := dupCloseOnExec(fd.sysfd)
	if err != nil {
		return nil, err
	}

	// We want blocking mode for the new fd, hence the double negative.
	// This also puts the old fd into blocking mode, meaning that
	// I/O will block the thread instead of letting us use the epoll server.
	// Everything will still work, just with more threads.
	if err = syscall.SetNonblock(ns, false); err != nil {
		return nil, os.NewSyscallError("setnonblock", err)
	}

	return os.NewFile(uintptr(ns), fd.name()), nil
}

You don't even need to read the code; the comments alone make it clear: duplicating the fd through File() also puts the original descriptor into blocking mode. So the way the TCP option was being set here was simply wrong.
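For what it's worth, on Go versions that have it (1.9 and later), (*net.TCPConn).SyscallConn gives access to the connection's own descriptor without duplicating it or switching it to blocking mode, so a socket option can be set safely. A sketch of how the custom Dial() code could set TCP_USER_TIMEOUT this way, reusing the tcpUserTimeout constant and msecs value from the earlier snippet; the address and timeout below are purely illustrative:

package main

import (
	"net"
	"syscall"
)

const tcpUserTimeout = 0x12 // TCP_USER_TIMEOUT, same value as in the snippet above

// setUserTimeout sets TCP_USER_TIMEOUT on the connection's own descriptor.
// Unlike conn.File(), RawConn.Control does not dup the fd and does not put
// it into blocking mode, so the netpoller keeps handling the connection.
func setUserTimeout(conn *net.TCPConn, msecs int) error {
	rc, err := conn.SyscallConn()
	if err != nil {
		return err
	}
	var sockErr error
	if err := rc.Control(func(fd uintptr) {
		sockErr = syscall.SetsockoptInt(int(fd), syscall.IPPROTO_TCP, tcpUserTimeout, msecs)
	}); err != nil {
		return err
	}
	return sockErr
}

func main() {
	conn, err := net.Dial("tcp", "127.0.0.1:2375") // illustrative address
	if err != nil {
		return
	}
	defer conn.Close()
	if tcp, ok := conn.(*net.TCPConn); ok {
		_ = setUserTimeout(tcp, 30000) // 30s, illustrative value
	}
}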


A long time ago, I found the following comment in the Swarm code:

// Swarm runnable threads could be large when the number of nodes is large
// or under request bursts. Most threads are occupied by network connections.
// Increase max thread count from 10k default to 50k to accommodate it.
const maxThreadCount int = 50 * 1000
debug.SetMaxThreads(maxThreadCount)

So the huge thread count had apparently been noticed before, but it was assumed that a large number of connections and concurrent requests simply need that many threads. Judging from this, there may not have been a deep understanding of Go's concurrency model and event-driven I/O on Linux.

Troubleshooting this problem took me a lot of time, mainly because what I observed completely contradicted my own understanding of how this "world" works, and I found it hard to believe it was a Go bug (indeed it was not). Many times I did not know where to start, and could only keep going back into the runtime code to look for the conditions under which threads are created.



Some of my previously written blog posts: www.skoo.me
