Go Performance Tales


This entry was cross-posted on the Datadog blog. If you want to learn more about Datadog or how we deal with the mountain of data we receive, check it out!

The last few months I've had the pleasure of working on a new bit of intake processing at Datadog. It was my first production service written in Go, and I wanted to nail the performance of a few vital consumer, processing, and scheduling idioms that would form the basis for future projects. I wrote a lot of benchmarks and spent a lot of time examining profile output, learning new things about Go and relearning old things about programming. Although intuition can be a flawed approach to achieving good performance, learning why you get certain behaviors usually proves valuable. I wanted to share a few of the things I've learned.

Use integer map keys if possible

Our new service is designed to manage indexes which track how recently metrics, hosts, and tags have been used by a customer. These indexes are used on the front end for overview pages and auto-completion. By taking this burden off of the main intake processor, we could free it up for other tasks and add more indexes to speed up other parts of the site.

This stateful processor would keep a history of all the metrics we've seen recently. If a data point coming off the queue was not in the history, it would be flushed to the indexes quickly to ensure that new hosts and metrics appear on the site as soon as possible. If it was in the history, then it was likely already in the indexes, and it could be put in a cache to be flushed much less frequently. This approach would maintain low latency for new data points while drastically reducing the number of duplicate writes.
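A minimal sketch of that fast-path/slow-path split, keyed on the metric name as we originally did (the type, field, and channel names here are assumptions, not the production code):

    // Metric is a stand-in for a parsed data point off the queue.
    type Metric struct {
        Name string
    }

    type history struct {
        seen  map[string]struct{} // metrics indexed recently
        cache []*Metric           // likely duplicates, flushed on a slow timer
    }

    // handle flushes brand-new metrics immediately and defers the rest.
    func (h *history) handle(m *Metric, flushNow chan<- *Metric) {
        if _, ok := h.seen[m.Name]; !ok {
            h.seen[m.Name] = struct{}{}
            flushNow <- m // new host/metric: index it as soon as possible
            return
        }
        // probably already indexed: batch it for a much rarer flush
        h.cache = append(h.cache, m)
    }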

We started out using a map[string]struct{} to implement these histories and caches. Although our metric names are generally hierarchical and Patricia tries/radix trees seemed a perfect fit, I couldn't find nor build one that could compete with Go's map implementation, even for sets on the order of tens of millions of elements. Comparing lots of substrings as you traverse the tree kills lookup performance compared to the hash, and memory-wise, 8-byte pointers mean you need pretty large matching substrings to save space over a map. It is also trickier to expire entries to keep memory usage bounded.

Even with maps, we still weren't seeing the kind of throughput I thought we could achieve with Go. Map operations were prominent in our profiles. Could we get any more performance out of them? All of our existing indexes were based on string data which had associated integer IDs in our backend, so I benchmarked the insert/hashing performance for maps with integer keys and maps with string keys.
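The benchmarks were roughly of this shape: a hot loop of set insertions, once with string keys and once with integer keys (the key construction below is an assumption; only the key type matters):

    import (
        "strconv"
        "testing"
    )

    // String-keyed set insertion: every assignment pays a string hash.
    func BenchmarkTypedSetStrings(b *testing.B) {
        keys := make([]string, 1<<16)
        for i := range keys {
            keys[i] = "metric.name." + strconv.Itoa(i)
        }
        set := make(map[string]struct{}, len(keys))
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            set[keys[i%len(keys)]] = struct{}{}
        }
    }

    // Integer-keyed set insertion: hashing an int is far cheaper.
    func BenchmarkTypedSetInts(b *testing.B) {
        set := make(map[int]struct{}, 1<<16)
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            set[i%(1<<16)] = struct{}{}
        }
    }

The results: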

    BenchmarkTypedSetStrings     1000000      1393 ns/op
    BenchmarkTypedSetInts       10000000       275 ns/op

This looked pretty promising. Since the data points coming from the queue were already normalized to their IDs, we had the integers available for use as map keys without having to do extra work. Using a map[int]*Metric instead of a map[string]struct{} would give us the integer key we knew would be faster while keeping access to the strings we needed for the indexes. Indeed, it was much faster: overall throughput doubled.
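Roughly the shape of the change, as a sketch (field and function names are assumptions): key the history on the already-normalized integer ID, but keep the *Metric value so the strings are still there when we write the indexes.

    type Metric struct {
        ID   int    // backend ID, already present on every point from the queue
        Name string // still needed when writing the indexes
    }

    // remember reports whether m is new; before the change, the set was a
    // map[string]struct{} keyed on m.Name, so every check paid a string hash.
    func remember(seen map[int]*Metric, m *Metric) bool {
        if _, ok := seen[m.ID]; ok {
            return false
        }
        seen[m.ID] = m
        return true
    }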

AES-NI processor extensions really boost string hash performance

Eventually, we wanted to add new indexes to track recently seen "apps". This concept was based on some ad-hoc structure in the metric names themselves, which generally look like "app.throughput" or "app.latency". We didn't have associated backend IDs for apps, so we restored the string-keyed maps for them, and overall throughput dropped like a stone. Predictably, the string map assignment in the app history, which we already knew to be slow, was to blame.

In fact, the runtime·strhash → runtime·memhash path dominated the profile output, using more time than all of the integer hashing and all of our channel communication combined. This is illustrative proof, if proof were needed, that one should prefer structs to maps wherever a simple collection of named values is required.
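In code, the moral looks something like this (the types are illustrative):

    // A fixed collection of named values as a struct: field access compiles
    // to an offset, with no hashing at all.
    type requestStats struct {
        received, flushed, dropped int
    }

    func recordStruct(s *requestStats) { s.received++ }

    // The same data in a map: every access runs strhash/memhash on the key.
    func recordMap(m map[string]int) { m["received"]++ }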

Still, the strhash performance here seemed pretty bad. How could hashing take up so much more time under heavy insertion than all of the other map overhead? These were not large keys. When I asked about improving string hash performance in #go-nuts, someone tipped me off to the fact that, since Go 1.1, strhash has a fast path that uses the AES-NI processor extensions.

A quick grep aes /proc/cpuinfo showed that the AWS c1.xlarge box I was on lacked these. After finding another machine in the same class with them, throughput increased by 50-65% and strhash's prominence was drastically reduced in the profiles.

Note that the string vs. int profiles on sets above were done on a machine without AES-NI support. It goes without saying that these extensions would bring those results closer together.

De-mystifying Channels

The queue we read from sends messages which contain many individual metrics; in Go terms you can think of a message as type Message []Metric, where the length is fairly variable. I made the decision early on to standardize our unit of channel communication on the single metric, as metrics are all the same size on the wire. This allowed much more predictable memory usage and simple, stateless processing code. As the program started to come together, I gave it a test run on the production firehose, and the performance wasn't satisfactory. Profiling showed a lot of time spent in the atomic asm wrapper runtime·xchg and in runtime·futex.
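The two candidate units of communication look roughly like this as a sketch (buffer sizes and the Metric fields are placeholders):

    type Metric struct {
        ID    int
        Value float64
    }

    // Message is what actually arrives off the wire: a variable-length
    // batch of metrics.
    type Message []Metric

    var (
        perMetric  = make(chan Metric, 1024) // one channel op per data point
        perMessage = make(chan Message, 64)  // one channel op per batch
    )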

These atomics are used in various places by the runtime: the memory allocator, GC, scheduler, locks, semaphores, et al. In our profile they were mostly descendants of runtime·chansend and selectgo, which are part of Go's channel implementation. It seemed like the problem was a lot of locking and unlocking while using buffered channels.

While channels provide powerful concurrency semantics, their implementation is not magic. Most paths for sending, receiving, and selecting on async channels currently involve locking to maintain thread safety; though their semantics combined with goroutines change the game, as a data structure they're exactly like many other implementations of synchronized queues/ring buffers. There is an ongoing effort to improve channel performance, but it isn't going to result in an entirely lock-free implementation.

Today, sending or receiving calls runtime·lock on the channel shortly after establishing that it isn't nil. Though the channel performance work being done by Dmitry looks promising, even more exciting for future performance improvements is his proposal for atomic intrinsics, which could reduce the overhead of all of these atomic locking primitives all over the runtime. At this time, it looks likely to miss 1.3, but will hopefully be revisited for 1.4.

My decision to send metrics one by one meant that we were sending, receiving, and selecting more often than necessary, locking and unlocking many times per message. Although it added some extra complexity in the form of looping in our metric processing code, re-standardizing on passing whole messages instead reduced the number of these locking sends and reads so much that they virtually dropped off our subsequent profiles. Throughput improved by nearly 6x.
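The re-standardized consumer, sketched with the Message and Metric types from above: one receive and one send per message, with a plain loop in between, so the channel's lock is taken a couple of times per batch instead of once per metric (handleMetric is a placeholder for the per-metric work):

    // consume drains whole messages off the channel and loops over the
    // metrics in ordinary code; the only channel operations left are one
    // receive and one send per message.
    func consume(in <-chan Message, out chan<- Message) {
        for msg := range in {
            for i := range msg {
                handleMetric(&msg[i]) // per-metric work, unchanged
            }
            out <- msg
        }
    }

    func handleMetric(m *Metric) { /* history, cache, and index logic */ }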

Cgo and Borders

One of the sources of slowness that I expected before joining the project was Go's implementation of zlib. I'd done some testing in the past that showed it was significantly slower than Python's for a number of file sizes in the range of our typical message sizes. The C implementation of zlib has a reputation for being well optimized, and when I discovered that Intel had contributed a number of patches to it quite recently, I was interested to see how it would measure up.

Luckily, the Vitess project from YouTube had already implemented a really nice Go wrapper named cgzip, which performed quite a bit better than Go's gzip in my testing. Still, it was outperformed by Python's gzip, which puzzled me. I dove into the code of both Python's zlibmodule.c and cgzip's reader.go, and noticed that cgzip was managing its buffers from Go while Python was managing them entirely in C.

I vaguely remembered some experiments that showed there was a bit of overhead to cgo calls. Further digging revealed some reasons for this overhead:

    • Cgo has to do some coordination with the Go scheduler so that it knows the calling goroutine is blocked, which might involve creating another thread to prevent deadlock. This involves acquiring and releasing a lock.
    • The Go stack must be swapped out for a C stack, as Go has no idea what the memory requirements are for the C stack, and then they must be swapped back upon return.
    • A C shim is generated for each C function call which maps some of C and Go's call/return semantics together in a clean way; e.g. struct returns in C work as multi-value returns in Go.

Similar to the channel communication above, the boundary between Go function calls and C function calls was taxed. If I wanted to find more performance, I'd have to reduce the amount of communication by increasing the amount of work done per call. Because of the channel changes, entire messages were now the smallest processable unit in my pipeline, so the undoubted benefits of a streaming gzip reader were relatively diminished. I used Python's zlibmodule.c as a template to do all of the buffer handling in C, returning a raw char * I could copy into a []byte on the Go side, and did some profiling:

    452 byte test payload (1071 orig)
    BenchmarkUnsafeDecompress     200000        9509 ns/op
    BenchmarkFzlibDecompress      200000       10302 ns/op
    BenchmarkCzlibDecompress      100000       26893 ns/op
    BenchmarkZlibDecompress        50000       46063 ns/op

    7327 byte test payload (99963 orig)
    BenchmarkUnsafeDecompress      10000      198391 ns/op
    BenchmarkFzlibDecompress       10000      244449 ns/op
    BenchmarkCzlibDecompress       10000      276357 ns/op
    BenchmarkZlibDecompress                   495731 ns/op

    359925 byte test payload (410523 orig)
    BenchmarkUnsafeDecompress                1527395 ns/op
    BenchmarkFzlibDecompress                 1583300 ns/op
    BenchmarkCzlibDecompress                 1885128 ns/op
    BenchmarkZlibDecompress          200     7779899 ns/op

Above, "Fzlib" is my "pure-c" implementation of zlib for Go, "Unsafe" was a version of this where the final copy to []byte I s skipped but the underlying memory of the result must be manually freed, "Czlib" was vitess ' Cgzip Library modified to Han Dle zlib instead of gzip, and "zlib" is Go's built in library.
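For a sense of the pattern, here is a minimal sketch in the spirit of fzlib, not the actual implementation: a single cgo call does the whole decompression against C-managed memory using zlib's one-shot uncompress(), and only the final result is copied into a Go []byte. It assumes the caller already knows the uncompressed size; real code has to deal with unknown output sizes, buffer growth, and richer error handling.

    /*
    #cgo LDFLAGS: -lz
    #include <stdlib.h>
    #include <zlib.h>
    */
    import "C"

    import (
        "fmt"
        "unsafe"
    )

    // decompress inflates in into a C buffer of origLen bytes, then copies
    // the result into Go-managed memory exactly once.
    func decompress(in []byte, origLen int) ([]byte, error) {
        out := C.malloc(C.size_t(origLen))
        defer C.free(out)

        destLen := C.uLongf(origLen)
        rc := C.uncompress((*C.Bytef)(out), &destLen,
            (*C.Bytef)(unsafe.Pointer(&in[0])), C.uLong(len(in)))
        if rc != C.Z_OK {
            return nil, fmt.Errorf("zlib uncompress error: %d", int(rc))
        }
        return C.GoBytes(out, C.int(destLen)), nil
    }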

Measure everything

In the end, the differences between fzlib and czlib were only notable on small messages. This was one of the few times in the project I optimized prior to profiling, and as you might imagine it produced some of the least important performance gains. As the chart of channel fill levels showed, when at full capacity the message processing code cannot keep up with the intake and parsing code: the post-parsed channel (purple) stays full while the post-processed channel (blue) maintains some capacity.

You might think the obvious lesson to learn here is that age-old nut about premature optimization, but this chart taught me something far more interesting. The concurrency and communication primitives you get in Go allow you to build single-process programs in the same style you'd use when building distributed systems, with goroutines as your processes, channels as your sockets, and select completing the picture. You can then measure ongoing performance using the same well-understood techniques, tracking throughput and latency incredibly easily.
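Getting a chart like that is close to free once the stages are goroutines joined by channels: sample len() and cap() of each channel on a ticker and ship the numbers to your metrics system. A sketch, reusing the Message type from above, with report() standing in for whatever stats client you use:

    import (
        "log"
        "time"
    )

    // report stands in for a real metrics client (statsd, etc.).
    func report(name string, length, capacity int) {
        log.Printf("%s %d/%d", name, length, capacity)
    }

    // monitorChannels samples the fill level of each pipeline stage.
    func monitorChannels(parsed, processed chan Message) {
        for range time.Tick(10 * time.Second) {
            report("parsed_chan.fill", len(parsed), cap(parsed))
            report("processed_chan.fill", len(processed), cap(processed))
        }
    }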

Seeing this pattern of expensive boundary crossing twice in quick succession impressed upon me the importance of identifying it quickly when investigating performance problems. I also learned quite a lot about cgo and its performance characteristics, which might save me from ill-fated adventures later on, and quite a lot about Python's zlib module, including some pathological memory allocation in its compression buffer handling.

The tools at your disposal to get the most performance out of Go are very good. The benchmarking facilities included in the testing library are simple but effective. The sampling profiler is low-impact enough to be turned on in production, and its associated tools (like the chart output above) highlight issues in your code with great clarity. The architectural idioms that feel natural in Go lend themselves to easy measurement. The source for the runtime is available, clean, and straightforward, and once you finally understand your performance issues, the language itself is amenable to fixing them.


http://jmoiron.net/blog/go-performance-tales/
