Golang: custom transports and timeouts


This post is a post-mortem of a Go production system crash. We'll use the logs to drill down into the guts of Go's network source code to find a leak that affected a core part of our stack at Timehop. The root cause was the use of custom transports without explicit timeouts for requests and DNS lookups, combined with an older version of glibc under Linux.

TL;DR

    • Always use explicit timeouts, especially when using custom transports
    • If you're running under Linux, upgrade glibc to version 2.20

"Your Timehop Day was ready"

One of the core elements of the Timehop experience is the morning push notification you get on your phone:

Behind that notification is a complex set of server-side tasks we call prepares. They ensure that everything is ready to go when you check the app in the morning.

For the sake of not venturing too far outside the scope of this post, I'll skip the details of what actually happens in these prepares; here's what matters:

    1. At the time of writing, we run over one million prepares a day
    2. Each prepare takes, on average, around 300ms
    3. Each prepare is highly concurrent and network I/O intensive (what @kevrone wrote here should give you a hint of the scale of the data we deal with)
    4. Each machine runs many prepares concurrently
    5. We have 4 c3.large machines
    6. We use DataDog as our primary monitoring tool for our AWS CloudFormation stacks

Everything is fine (for a while)

As we went over ten million prepares a day, we started observing DataDog metrics dying off even though the prepare jobs were still being executed (we keep an eye on the queue sizes with a not-so-redundant, alternate method).

At first this was so infrequent that we disregarded its importance; maybe it was just something wrong with the Chef recipe that restarted the DD agent, or with the DD agent itself. Either way, we'd just kill that machine (which was effectively working, just not reporting DD stats) and CloudFormation would bring another one up. Voilà, "fixed".

Then it started happening more frequently:

A sudden and rather severe memory leak, all the while the prepares were still being performed.

I jumped on the DD logs:

2015-03-14 08:23:33 UTC | ERROR | dd.collector |
collector(agent.py:354) | Uncaught error running the Agent
Error: [Errno 12] Cannot allocate memory

Then onto our own Go logs:

fatal error: out of memory (stackcacherefill)

runtime stack:
runtime.throw(0xbb7bc0)
    /usr/local/go/src/pkg/runtime/panic.c:520 +0x69
stackcacherefill()
    /usr/local/go/src/pkg/runtime/stack.c:52 +0x94

Oops.

Tracing the leak

As the Go program crashed, it dumped its runtime stack onto the logs: all 4.5 million lines of it.

After spending a couple of hours tracing and accounting for every goroutine and function call we had written, the only thing that really stood out was this:

goroutine 281550311 [semacquire, 30 minutes]:
sync.runtime_Semacquire(0xc2267cadc0)
    /usr/local/go/src/pkg/runtime/sema.goc:199 +0x30
sync.(*WaitGroup).Wait(0xc2337c8f80)
    /usr/local/go/src/pkg/sync/waitgroup.go:129 +0x14b
net.(*singleflight).Do(0xbbdc50, 0xc22f602e40, 0x20, ...)
    /usr/local/go/src/pkg/net/singleflight.go:37 +0x127
net.lookupIPMerge(0xc22f602e40, 0x20, 0x0, 0x0, 0x0, 0x0, 0x0)
    /usr/local/go/src/pkg/net/lookup.go:42 +0xae
net.func·025()
    /usr/local/go/src/pkg/net/lookup.go:80 +0x3e
created by net.lookupIPDeadline
    /usr/local/go/src/pkg/net/lookup.go:82 +0x2d8

The number of occurrences of this pattern, as well as the length of time the goroutines had been stuck (~30 minutes), struck me as odd, so I counted the occurrences...

$ grep 'net.lookupIPDeadline' crash.log | wc -l
420563

Woah. That's a lot of stuck lookups.
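As an aside (this wasn't part of the original investigation): if the process is still alive, you can get the same kind of goroutine dump without crashing it by importing net/http/pprof; a minimal sketch, with the port being an arbitrary choice:

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	go func() {
		// curl 'http://localhost:6060/debug/pprof/goroutine?debug=2'
		// dumps every goroutine's stack, much like the crash log above.
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	select {} // stand-in for the program's real work
}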

But they had a deadline and should have cleaned up after themselves, right?

Nope. Here's lookupIPDeadline:

// lookupIPDeadline looks up a hostname with a deadline.
func lookupIPDeadline(host string, deadline time.Time) (addrs []IPAddr, err error) {
    if deadline.IsZero() {
        return lookupIPMerge(host)
    }

    // We could push the deadline down into the name resolution
    // functions. However, the most commonly used implementation
    // calls getaddrinfo, which has no timeout.

    timeout := deadline.Sub(time.Now())
    if timeout <= 0 {
        return nil, errTimeout
    }
    t := time.NewTimer(timeout)
    defer t.Stop()

    ch := lookupGroup.DoChan(host, func() (interface{}, error) {
        return lookupIP(host)
    })

    select {
    case <-t.C:
        // The DNS lookup timed out for some reason. Force
        // future requests to start the DNS lookup again
        // rather than waiting for the current lookup to
        // complete. See issue 8602.
        lookupGroup.Forget(host)

        return nil, errTimeout

    case r := <-ch:
        return lookupIPReturn(r.v, r.err, r.shared)
    }
}

A few very interesting things:

    • If no deadline is specified, there is no guarantee of this function ever returning
    • When a deadline is specified, the timeout is a Go-level timeout, not a lower (OS) level timeout; the sketch after this list shows why that distinction matters
    • That issue #8602 mention, which we'll get back to later on
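To make the failure mode concrete, here's a minimal sketch of the same pattern outside the standard library (lookupWithDeadline is a made-up name; it assumes the usual errors, net, and time imports). The select unblocks the caller at the deadline, but nothing unblocks the worker underneath:

func lookupWithDeadline(host string, timeout time.Duration) ([]net.IP, error) {
    type result struct {
        addrs []net.IP
        err   error
    }

    // Buffered, so the worker can deliver its result and exit even
    // if the caller has already given up.
    ch := make(chan result, 1)

    go func() {
        // If this call blocks forever (e.g. getaddrinfo never
        // returns), this goroutine stays blocked forever with it.
        addrs, err := net.LookupIP(host)
        ch <- result{addrs, err}
    }()

    select {
    case r := <-ch:
        return r.addrs, r.err
    case <-time.After(timeout):
        // The caller is unblocked, but the goroutine above keeps
        // waiting on net.LookupIP until it returns... if it ever does.
        return nil, errors.New("lookup timed out")
    }
}

That is exactly the shape of those 420,563 goroutines: each one parked behind a lookup that never returned.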

I was kind of baffled that there wasn't some sort of lower-level timeout while performing a DNS lookup... So, following through to lookupIPMerge:

func lookupIPMerge(host string) (addrs []IPAddr, err error) {
    addrsi, err, shared := lookupGroup.Do(host, func() (interface{}, error) {
        return lookupIP(host)
    })
    // ...

Now, the lookupIP function is platform-dependent. In this case, it's defined in lookup_unix.go (the binary runs on a Linux machine):

func lookupIP(host string) (addrs []IPAddr, err error) {
    addrs, err, ok := cgoLookupIP(host)
    if !ok {
        addrs, err = goLookupIP(host)
    }
    return
}

cgoLookupIP (which defers to cgoLookupIPCNAME) is defined in cgo_unix.go.

Nowhere along this path will you see a timeout being raised from below. Looking at the documentation for getaddrinfo, there isn't a single mention of "timeout".

Basically, if the lookupIP(host) call running inside that lookupGroup's DoChan() hangs (which apparently it can, since this is where getaddrinfo ends up being called), it'll take a goroutine with it. Forever. Bye.

The only way to truly fix this is to push the timeouts down the stack. I went back to issue #8602 and, in the fix commit, noticed this TODO:

TODO(bradfitz): Consider pushing the deadline down into the
name resolution functions. But this involves fixing it for
the native Go resolver, cgo, Windows, etc.

And how it got replaced by this:

We could push the deadline down into the name resolution
functions. However, the most commonly used implementation
calls getaddrinfo, which has no timeout.

You read that right:

getaddrinfo, which has no timeout.

At the end of the day, the only thing preventing a real leak is the implementation of getaddrinfo having some sort of hard limit on DNS lookup time and eventually returning.

While this struck me as a severe oversight at first, understanding that there is no portable, cross-platform way to push the timeouts down the stack made me realize that I'd probably also end up erring on the side of trusting the underlying implementation to return in a timely fashion. It's either that, or re-invent the wheel and end up writing a resolver.

In our case, this was a bug in glibc's getaddrinfo, fixed in v2.20, so the only real solution was to upgrade glibc.

To check which version of glibc you're running on your (Linux) system:

$ ldd --version
ldd (Ubuntu EGLIBC 2.19-0ubuntu6) 2.19

Now that the real problem is solved, it's time to go back and make things a little better...

Use explicit timeouts across the board

When you do something like this...

transport := &http.Transport{}

... wherever this transport ends up being used, it could hang for an indeterminate period of time performing lookups and/or TLS handshakes.

For unencrypted connections, the Transport instance will use whichever function is assigned to its Dial field to establish the connection. If no function is assigned, it defaults to net.Dial, which creates a temporary Dialer that may or may not have a timeout. The keyword here is may.

And since you'll very likely be using this transport with an http.Client, I'd also recommend setting a cap with the client's Timeout field. Just keep in mind that it's a hard, global timeout on the entire request cycle: dial, send the request, and read the response.
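A minimal sketch (the duration is an arbitrary example, not a recommendation):

client := &http.Client{
    // A hard cap on the whole exchange: connection, TLS handshake,
    // writing the request, and reading the response.
    Timeout: 10 * time.Second,
}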

I traced the DNS lookups back up to our code and found this:

client := &http.Client{
    Transport: &http.Transport{},
}

There it is, that second line. The quick & dirty fix would be to simply replace &http.Transport{} with http.DefaultTransport, but you'll want to consider something a bit more explicit and/or aggressive for production systems (the durations below are illustrative):

// Rather aggressive
client := &http.Client{
    Transport: &http.Transport{
        Proxy: http.ProxyFromEnvironment,
        Dial: (&net.Dialer{
            Timeout:   2 * time.Second,
            KeepAlive: 30 * time.Second,
        }).Dial,
        TLSHandshakeTimeout: 2 * time.Second,
    },
}
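With that in place, a hung dial or handshake surfaces as an ordinary error rather than a parked goroutine; a usage sketch (the URL is a placeholder):

resp, err := client.Get("http://example.com/")
if err != nil {
    // A hung dial, handshake, or response now lands here as an
    // error instead of blocking a goroutine indefinitely.
    log.Println("request failed:", err)
    return
}
defer resp.Body.Close()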

Conclusion

    • Use explicit timeouts everywhere; take the driver's seat
    • http.DefaultClient is relatively safe, since it uses http.DefaultTransport, which has a dial timeout set (it won't have request timeouts though, so tread carefully)
    • Make sure your http.Client's Dial function has timeouts, and set TLSHandshakeTimeout
    • Update to glibc 2.20 or higher