This post is a post-mortem of a Go production system crash. We'll use the logs to drill down into the guts of Go's network source code and find a leak that affected a core part of our stack at Timehop. The root cause was the use of custom transports without explicit timeouts for requests and DNS lookups, combined with running an older version of glibc under Linux.
TL;DR
- Always use explicit timeouts, especially when using custom transports
- If you're running under Linux, upgrade glibc to version 2.20
"Your Timehop Day was ready"
One of the core elements of the Timehop experience is the morning push notification you get on your phone:
Behind that notification is a complex set of server-side tasks we call prepares. They ensure that everything is ready to go when you check the app in the morning.
For the sake of not venturing too far outside the scope of this post, I'll skip a few details on what actually happens in these prepares. Here's what matters:
- At the time of writing, we run over one million prepares a day
- Each prepare takes, on average, around 300ms
- Each prepare is highly concurrent and network I/O intensive (what @kevrone wrote here should give you a hint of the scale of how much data we deal with)
- Each machine runs prepares concurrently
- We have 4 c3.large machines
- We use DataDog as our primary monitoring tool for our AWS CloudFormation stacks
Everything is fine (for a while)
As we went over the one-million-prepares-a-day mark, we started observing DataDog metrics dying off even though the prepare jobs were still being executed (we keep an eye on queue sizes with a not-so-redundant, alternate method).
At first this was so infrequent that we disregarded its importance; maybe it was just something wrong with the Chef recipe that restarted the DD agent, or with the DD agent itself. Either way, we'd just kill that machine (which was effectively working, just not reporting DD stats) and CloudFormation would bring another one up. Voilà, "fixed".
Then it started happening more frequently:
A sudden and rather severe memory leak; all the while, the prepares were still being performed.
I jumped on the DD logs:
2015-03-14 08:23:33 UTC | ERROR | dd.collector |
collector(agent.py:354) | Uncaught error running the Agent
Error: [Errno] Cannot allocate memory
Then onto our own Go logs:
fatal error: out of memory (stackcacherefill)
runtime stack:
runtime.throw(0xbb7bc0)
    /usr/local/go/src/pkg/runtime/panic.c:520 +0x69
stackcacherefill()
    /usr/local/go/src/pkg/runtime/stack.c:52 +0x94
Oops.
Tracing the leak
As the Go program crashed, it dumped its runtime stack onto the logs, all 4.5 million lines of it.
After spending a couple of hours tracing and accounting for every goroutine and function call we had written, the only thing that really stood out was this:
goroutine 281550311 [semacquire, minutes]:
sync.runtime_Semacquire(0xc2267cadc0)
    /usr/local/go/src/pkg/runtime/sema.goc:199 +0x30
sync.(*WaitGroup).Wait(0xc2337c8f80)
    /usr/local/go/src/pkg/sync/waitgroup.go:129 +0x14b
net.(*singleflight).Do(0xbbdc50, 0xc22f602e40, 0x20, ...)
    /usr/local/go/src/pkg/net/singleflight.go:37 +0x127
net.lookupIPMerge(0xc22f602e40, 0x20, 0x0, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/pkg/net/lookup.go:42 +0xae
net.func·025()
    /usr/local/go/src/pkg/net/lookup.go:80 +0x3e
created by net.lookupIPDeadline
    /usr/local/go/src/pkg/net/lookup.go:82 +0x2d8
The number of occurrences of this pattern, as well as the length of time the goroutines had been stuck (~30 minutes), struck me as odd, so I counted the occurrences...
$ grep 'net.lookupIPDeadline' crash.log | wc -l
420563
Woah. That's a lot of stuck lookups.
But they have a deadline and clean up after themselves, right?
Nope. Here's lookupIPDeadline:
// lookupIPDeadline looks up a hostname with a deadline.
func lookupIPDeadline(host string, deadline time.Time) (addrs []IPAddr, err error) {
    if deadline.IsZero() {
        return lookupIPMerge(host)
    }

    // We could push the deadline down into the name resolution
    // functions. However, the most commonly used implementation
    // calls getaddrinfo, which has no timeout.

    timeout := deadline.Sub(time.Now())
    if timeout <= 0 {
        return nil, errTimeout
    }
    t := time.NewTimer(timeout)
    defer t.Stop()

    ch := lookupGroup.DoChan(host, func() (interface{}, error) {
        return lookupIP(host)
    })

    select {
    case <-t.C:
        // The DNS lookup timed out for some reason. Force
        // future requests to start the DNS lookup again
        // rather than waiting for the current lookup to
        // complete. See issue 8602.
        lookupGroup.Forget(host)
        return nil, errTimeout

    case r := <-ch:
        return lookupIPReturn(r.v, r.err, r.shared)
    }
}
A few very interesting things:
- If no deadline is specified, there is no guarantee of this function ever returning
- When a deadline is specified, the timeout is a Go-land timeout, not a lower-level (OS) timeout
- There's that issue #8602 mention, which we'll get back to later on
I was kind of baffled that there wasn't any sort of lower-level timeout while performing a DNS lookup... So, following through to lookupIPMerge:
func lookupIPMerge(host string) (addrs []IPAddr, err error) {
    addrsi, err, shared := lookupGroup.Do(host, func() (interface{}, error) {
        return lookupIP(host)
    })
    // ...
Now, the lookupIP function is platform-dependent. In this case, it's defined in lookup_unix.go (the binary runs on a Linux machine):
func lookupIP(host string) (addrs []IPAddr, err error) {
    addrs, err, ok := cgoLookupIP(host)
    if !ok {
        addrs, err = goLookupIP(host)
    }
    return
}
cgoLookupIP (which defers to cgoLookupIPCNAME) is defined in cgo_unix.go.
Nowhere along this path do you see a possible timeout being raised from below. Looking at the documentation for getaddrinfo, there isn't a single mention of "timeout".
Basically, if the lookupIP(host) call running inside that lookupGroup's DoChan() hangs (which apparently it can, since getaddrinfo is what ultimately gets called), it'll take a goroutine with it. Forever. Bye.
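To make that failure mode concrete, here's a minimal, self-contained sketch of the same pattern. blockingLookup and lookupWithDeadline are hypothetical stand-ins I made up for illustration; this is not the actual net package code:

package main

import (
    "errors"
    "fmt"
    "runtime"
    "time"
)

// blockingLookup is a stand-in for a lookupIP call that is stuck inside
// getaddrinfo: it never returns.
func blockingLookup(host string) error {
    block := make(chan struct{})
    <-block
    return nil
}

// lookupWithDeadline mirrors the shape of lookupIPDeadline: the caller gets
// a timeout error back, but nothing ever stops the goroutine doing the work.
func lookupWithDeadline(host string, timeout time.Duration) error {
    ch := make(chan error, 1)
    go func() { ch <- blockingLookup(host) }()

    t := time.NewTimer(timeout)
    defer t.Stop()

    select {
    case <-t.C:
        return errors.New("lookup timed out")
    case err := <-ch:
        return err
    }
}

func main() {
    for i := 0; i < 5; i++ {
        _ = lookupWithDeadline("example.com", 10*time.Millisecond)
    }
    // Every call returned a timeout, yet the five worker goroutines are
    // still parked in blockingLookup and will never go away.
    fmt.Println("goroutines still alive:", runtime.NumGoroutine())
}

Multiply that pattern by hundreds of thousands of stuck lookups and you get the 4.5-million-line crash dump above.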
The only way to truly fix this is to push the timeouts down the stack. I went back to issue #8602 and, on the fix commit, noticed this TODO:
TODO(bradfitz): consider pushing the deadline down into the
name resolution functions. But that involves fixing it for
the native Go resolver, cgo, Windows, etc.
...and that it had been replaced by this:
We could push the deadline down into the name resolution
functions. However, the most commonly used implementation
calls getaddrinfo, which has no timeout.
You read this right:
getaddrinfo,
which has no timeout.
At the end of the day, the only thing preventing a real leak is getaddrinfo's implementation having some sort of hard limit on DNS lookup timeouts and eventually returning.
While this struck me as a severe oversight at first, understanding that there is no portable, cross-platform solution for pushing the timeouts down the stack made me realize that I'd probably also end up erring on the side of trusting the underlying implementation to return in a timely fashion. It's either that or re-invent the wheel and end up writing a resolver.
In our case, this was a bug in glibc's getaddrinfo, fixed in v2.20, so the only real solution was to update glibc.
To check which version of glibc you're running on your (Linux) system:
$ ldd --version
ldd (Ubuntu EGLIBC 2.19-0ubuntu6) 2.19
Now that the real problem was solved, it was time to go back and make things a little better...
Use explicit timeouts across the board
When you do something like this...
transport := &http.Transport{}
...wherever this transport ends up being used, it could hang for an indeterminate period of time performing lookups and/or the TLS handshake.
For unencrypted connections, the Transport instance will use whichever function is assigned to its Dial field to establish the connection. If no function is assigned, it'll default to net.Dial, which creates a temporary Dialer that may or may not have a timeout; keyword here being may.
And since you'll most likely be using this transport with an http.Client, I'd also recommend setting a cap on the client's Timeout field. Just keep in mind that it's a hard, global timeout on the entire request cycle: dial, send the request and read the response.
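Here's a minimal sketch of that cap; the 5-second value and the example URL are purely illustrative, not recommendations:

client := &http.Client{
    // Hard, global cap on the entire request cycle: dial, TLS handshake,
    // sending the request and reading the response.
    Timeout: 5 * time.Second,
}
resp, err := client.Get("https://example.com/")
if err != nil {
    // A slow dial, handshake, or response all surface here as a timeout.
    log.Fatal(err)
}
defer resp.Body.Close()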
I traced the DNS lookups back up to our code and found this:
client := &http.Client{
    Transport: &http.Transport{},
}
There it is, that second line. The quick & dirty fix would be to simply replace &http.Transport{} with http.DefaultTransport, but you'll want to consider something a bit more explicit and/or aggressive for production systems:
timeout := 2 * time.Second // Rather aggressive; the exact value here is illustrative
client := &http.Client{
    Transport: &http.Transport{
        Proxy: http.ProxyFromEnvironment,
        Dial: (&net.Dialer{
            Timeout:   timeout,
            KeepAlive: timeout, // field name assumed; the original set another duration here
        }).Dial,
        TLSHandshakeTimeout: timeout,
    },
}
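For reference, this is roughly what http.DefaultTransport ships with in the Go versions around the time of writing, which is why http.DefaultClient gets a pass in the conclusion below:

var DefaultTransport RoundTripper = &Transport{
    Proxy: ProxyFromEnvironment,
    Dial: (&net.Dialer{
        Timeout:   30 * time.Second,
        KeepAlive: 30 * time.Second,
    }).Dial,
    TLSHandshakeTimeout: 10 * time.Second,
}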
Conclusion
- Use explicit timeouts everywhere; take the driver's seat
- http.DefaultClient is relatively safe, since it uses http.DefaultTransport, which has a dial timeout set (it won't have request timeouts though, so tread carefully)
- Make sure your http.Client's Dial function has timeouts, and set TLSHandshakeTimeout
- Update to glibc 2.20 or higher