Performance without the event loop

This article is based on a presentation I gave earlier this year at OSCON. It has been edited for brevity and to address some of the points of feedback I received after the talk.

A common refrain when talking about Go is that it's a language that works well on the server: static binaries, powerful concurrency, and high performance.

This article focuses on the last two items: how the language and the runtime transparently let Go programmers write highly scalable network servers without having to worry about thread management or blocking I/O.

An argument for an efficient programming language

But before I launch into the technical discussion, I want to make two arguments to illustrate the market that Go targets.

Moore's Law

Image credit: Herb Sutter (Dr. Dobb's Journal, March 2005)

The oft misquoted Moore's law states that the number of transistors per square inch doubles roughly every 18 months.

However, clock speed, which is a function of entirely different properties, topped out a decade ago with the Pentium 4 and has been slipping backwards ever since.

From space constrained to power constrained

Sun Enterprise E450: about the size of a bar fridge, about the same power consumption. Image credit: eBay

This is the Sun e450. When I started my career, these were the workhorses.

These things were massive. Three of them, stacked one on top of another, would consume an entire 19″ rack. They only consumed a few hundred watts each.

Over the last decade, data centres have moved from being space constrained to being power constrained. In the last data centre rollout I was involved in, we ran out of power when the racks were barely a third full.

Because compute densities have improved so rapidly, data centre space is no longer a problem. However, modern servers consume significantly more power in a much smaller area, making cooling harder, yet at the same time critical.

Being power constrained has effects at the macro level (you can't get enough power for a rack full of 1RU servers) and at the micro level, where all this power, hundreds of watts, is dissipated in a tiny silicon die.

Where does this power consumption come from?

CMOS inverter. Image credit: Wikipedia

This is an inverter, one of the simplest logic gates possible. If the input, A, is high, then the output, Q, will be low, and vice versa.

All of today's consumer electronics are built with CMOS logic. CMOS stands for Complementary Metal Oxide Semiconductor. The complementary part is the key: each logic element inside the CPU is implemented with a pair of transistors; as one switches on, the other switches off.

When the circuit is fully on or off, no current flows directly from the source to the drain. However, during the transition there is a brief period where both transistors are conducting, creating a direct short.

Power consumption, and thus heat dissipation, is directly proportional to the number of transitions per second: the CPU clock speed¹.
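
To a first approximation (this is the standard textbook formulation, not a figure from the original talk), the dynamic power of a CMOS circuit is

    P ≈ α · C · V² · f

where α is the activity factor (how often gates switch), C the switched capacitance, V the supply voltage, and f the clock frequency. Power rises linearly with clock speed and quadratically with supply voltage, which is why lowering voltage has been the main lever for keeping heat dissipation in check.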

CPU feature size reductions are primarily aimed at reducing power consumption. Reducing power consumption doesn't just mean "green"; the primary goal is to keep power consumption, and thus heat dissipation, below levels that will damage the CPU.

With clock speeds falling, and in direct conflict with power consumption, performance improvements come mainly from microarchitecture tweaks and esoteric vector instructions, which are not directly useful for general purpose computation. Added up, each microarchitecture change (a roughly five year cycle) yields at most a 10% improvement per generation, and more recently barely 4-6%.

"The free lunch are over"

Hopefully it is clear to you now that hardware is not getting any faster. If performance and scale are important to you, then you'll agree with me that the days of throwing hardware at the problem are over, at least in the conventional sense. As Herb Sutter put it, "the free lunch is over".

You need a language which is efficient, because inefficient languages just do not justify themselves in production, at scale, on a capital expenditure basis.

An argument for a concurrent programming language

My second argument follows from my first. CPUs are not getting faster, but they are getting wider. This is where the transistors are going, and it shouldn't be a great surprise.

Image credit: Intel

Simultaneous multithreading, or as Intel calls it, Hyper-Threading, allows a single core to execute multiple instruction streams in parallel with the addition of a modest amount of hardware. Intel uses Hyper-Threading to artificially segment the market for its processors, while Oracle and Fujitsu apply simultaneous multithreading more aggressively to their products, using 8 or more hardware threads per core.

Dual socket machines have been a reality since the late 1990s with the Pentium Pro, and are now mainstream, with most servers supporting dual or quad socket designs. Increasing transistor counts have allowed entire CPUs to be co-located with their siblings on the same chip: dual core on mobile parts, quad core on desktop parts, and even more cores on server parts are now the reality. You can buy effectively as many cores in a server as your budget will allow.

And to take advantage of these additional cores, you need a language with a solid concurrency story.

Processes, Threads and Goroutines

Go has goroutines, which are the foundation of its concurrency story. I want to step back for a moment and explore the history that leads us to goroutines.

Processes

In the beginning, computers ran one job at a time in a batch processing model. In the 1960s a desire for more interactive forms of computing led to the development of multiprocessing, or time sharing, operating systems. By the 1970s this idea was well established for network servers: FTP, telnet, rlogin, and later Tim Berners-Lee's CERN httpd, handled each incoming network connection by forking a child process.

In a time-sharing system, the operating system maintains the illusion of concurrency by rapidly switching the attention of the CPU between active processes, recording the state of the current process and then restoring the state of another. This is called context switching.

Context switching

Image credit: Immae (CC BY-SA 3.0)

There are three main costs of a context switch.

    • The kernel needs to store the contents of all the CPU registers for one process, then restore the values for another process. Because a process switch can occur at any point in a process's execution, the operating system needs to store the contents of all of these registers because it does not know which are currently in use².
    • The kernel needs to flush the CPU's virtual address to physical address mappings (TLB cache)³.
    • The overhead of the operating system context switch, and the overhead of the scheduler function to choose the next process to occupy the CPU.

These costs are relatively fixed by the hardware, and depend on the amount of work done between context switches to amortise their cost; rapid context switching tends to overwhelm the amount of work done between switches.

Threads

This led to the development of threads, which are conceptually the same as processes but share the same memory space. As threads share address space, they are lighter to schedule than processes, so they are faster to create and faster to switch between.

Threads still have an expensive context switch cost; a lot of state must be retained. Goroutines take the idea of threads a step further.

Goroutines

Rather than relying on the kernel to manage their time sharing, goroutines are cooperatively scheduled. The switch between goroutines only happens at well defined points, when an explicit call is made to the Go runtime scheduler. The major points where a goroutine will yield to the scheduler include:

    • Channel send and receive operations, if those operations would block.
    • The go statement, although there is no guarantee that the new goroutine will be scheduled immediately.
    • Blocking syscalls like file and network operations.
    • After being stopped for a garbage collection cycle.

In other words, places where the goroutine cannot continue until it has more data, or more space to put data.
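
Several of these yield points can be seen in a toy program like the following (a minimal sketch of mine, not from the original talk; the exact set of yield points has varied between Go releases):

    package main

    import "fmt"

    func main() {
            ch := make(chan int) // unbuffered: sends and receives block

            // The go statement is itself a scheduling point, although the
            // new goroutine is not guaranteed to run immediately.
            go func() {
                    ch <- 42 // blocks until a receiver is ready: a yield point
            }()

            // A blocking channel receive also yields to the scheduler, which
            // is free to run other goroutines while this one waits.
            fmt.Println(<-ch)
    }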

Many goroutines are multiplexed onto a single operating system thread by the Go runtime. This makes goroutines cheap to create and cheap to switch between. Tens of thousands of goroutines in a single process are the norm; hundreds of thousands are not unexpected.
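
A quick way to convince yourself of this is a sketch like the one below (mine, not from the original talk; exact memory usage depends on the Go version, but 100,000 parked goroutines cost only a few hundred megabytes of stack, which would be impossible with one operating system thread each):

    package main

    import (
            "fmt"
            "runtime"
            "sync"
    )

    func main() {
            var wg sync.WaitGroup
            done := make(chan struct{})
            for i := 0; i < 100000; i++ {
                    wg.Add(1)
                    go func() {
                            defer wg.Done()
                            <-done // park each goroutine on a channel receive
                    }()
            }
            fmt.Println("goroutines:", runtime.NumGoroutine())
            close(done) // release all the parked goroutines
            wg.Wait()
    }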

From the point of view of the language, scheduling looks like a function call, and has the same semantics. The compiler knows which registers are in use and saves them automatically. A thread calls into the scheduler holding a specific goroutine stack, and may return with a different goroutine stack. Compare this to threaded applications, where a thread can be preempted at any time, at any instruction.

This results in relatively few operating system threads per Go process, with the Go runtime taking care of assigning runnable goroutines to free operating system threads.
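
You can observe this split from inside a program; NumCPU, GOMAXPROCS, and NumGoroutine are all real functions from the standard runtime package (the program itself is just an illustrative sketch):

    package main

    import (
            "fmt"
            "runtime"
    )

    func main() {
            // GOMAXPROCS(0) queries, without changing, the number of OS threads
            // allowed to execute Go code simultaneously; it is typically far
            // smaller than the number of goroutines a program creates.
            fmt.Println("CPUs:      ", runtime.NumCPU())
            fmt.Println("GOMAXPROCS:", runtime.GOMAXPROCS(0))
            fmt.Println("goroutines:", runtime.NumGoroutine())
    }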

Stack Management

In the previous sections I discussed how goroutines reduce the overhead of managing many, sometimes hundreds of thousands of, concurrent threads of execution. There is another side to the goroutine story, and that's stack management.

Process Address Space

This is a diagram of the typical memory layout of a process. The key thing we are interested in is the location of the heap and the stack.

Inside the address space of a process, traditionally the heap is at the bottom of memory, just above the program code, and grows upwards.

The stack is located at the top of the virtual address space, and grows downwards.

Because the heap and stack overwriting each other would be catastrophic, the operating system arranges an area of inaccessible memory between the stack and the heap.

This is called a guard page, and effectively limits the stack size of a process, usually to the order of several megabytes.

Thread Stacks

Threads share the same address space, so each thread must have its own stack and its own guard page.

Because it is hard to predict the stack requirements of a particular thread, a large amount of memory must be reserved for each thread's stack. The hope is that this will be more than is ever needed and the guard page will never be hit.

The downside is that as the number of threads in your program increases, the amount of available address space is reduced.

Goroutine Stack Management

The early process model allowed the programmer to view the heap and the stack as large enough not to be a concern. The downside was a complicated and expensive subprocess model.

Threads improved the situation a bit, but they require the programmer to guess the most appropriate stack size; too small and your program will abort, too large and you run out of virtual address space.

We've seen that the Go runtime schedules a large number of goroutines onto a small number of threads, but what about the stack requirements of those goroutines?

Goroutine Stack Growth

Each goroutine starts with a small stack, allocated from the heap. The size has fluctuated over time, but in Go 1.5 each goroutine starts with a 2k allocation.

Instead of using guard pages, the Go compiler inserts a check as part of every function call to test if there is sufficient stack for the function to run. If there is, the function runs as normal.

If there is insufficient space, the runtime will allocate a larger stack segment on the heap, copy the contents of the current stack to the new segment, free the old segment, and restart the function call.

Because of this check, a goroutine's initial stack can be made much smaller, which in turn permits Go programmers to treat goroutines as cheap resources. Goroutine stacks can also shrink if a sufficient portion remains unused; this is handled during garbage collection.
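
The sketch below (my example, not from the original talk) exercises this machinery. The recursion consumes roughly 15 MB of stack, which would likely overflow a fixed-size thread stack in C, but here the goroutine's stack is simply grown and copied on demand:

    package main

    import "fmt"

    // deep recurses n times. Each frame reserves a small buffer, so the
    // goroutine's stack grows far beyond its initial 2k allocation. The
    // stack check in each function prologue triggers the runtime to
    // allocate a bigger segment and copy the old contents across.
    func deep(n int) int {
            var buf [128]byte
            buf[0] = byte(n)
            if n == 0 {
                    return int(buf[0])
            }
            return deep(n-1) + int(buf[0])
    }

    func main() {
            fmt.Println(deep(100000))
    }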

Integrated Network Poller

In 2002 Dan Kegel published what he called the c10k problem. Simply put: how to write server software that can handle at least 10,000 TCP sessions on the commodity hardware of the day. Since that paper was written, conventional wisdom has suggested that high performance servers require native threads, or more recently, event loops.

Threads carry a high overhead in terms of scheduling cost and memory footprint. Event loops ameliorate those costs, but introduce their own requirement: a complex, callback driven style.

Go provides programmers the best of both worlds.

Go's answer to c10k

In Go, syscalls are usually blocking operations; this includes reading and writing to file descriptors. The Go scheduler handles this by finding a free thread, or spawning another, to continue to service goroutines while the original thread blocks. In practice this works well for file IO, as a small number of blocking threads can quickly exhaust your local IO bandwidth.

However, for network sockets, by design at any one time almost all of your goroutines are going to be blocked waiting for network IO. In a naive implementation this would require as many threads as goroutines, all blocked waiting on network traffic. Go's integrated network poller handles this efficiently through cooperation between the runtime and the net package.
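
This is what lets an idiomatic Go server use a goroutine per connection while the runtime quietly multiplexes them over epoll or kqueue. A minimal sketch of the style (a toy echo server of my own, with error handling pared back):

    package main

    import (
            "io"
            "log"
            "net"
    )

    func main() {
            l, err := net.Listen("tcp", ":8080")
            if err != nil {
                    log.Fatal(err)
            }
            for {
                    conn, err := l.Accept()
                    if err != nil {
                            log.Fatal(err)
                    }
                    // One goroutine per connection. The reads and writes below
                    // look blocking, but the runtime parks this goroutine on the
                    // network poller; no OS thread is tied up while the
                    // connection is idle.
                    go func(c net.Conn) {
                            defer c.Close()
                            io.Copy(c, c) // echo bytes back until the client hangs up
                    }(conn)
            }
    }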

In older versions of Go, the network poller was a single goroutine responsible for polling for readiness notifications using kqueue or epoll. The polling goroutine would communicate back to waiting goroutines via a channel. This achieved the goal of avoiding a thread per syscall, but used a generalised wakeup mechanism of channel sends, which meant the scheduler was not aware of the source or importance of the wakeup.

In current versions of Go, the network poller has been integrated into the runtime itself. As the runtime knows which goroutine is waiting for a socket to become ready, it can put the goroutine back on the same CPU as soon as the packet arrives, reducing latency and increasing throughput.

Goroutines, stack management, and an integrated network poller

In conclusion, goroutines provide a powerful abstraction that frees the programmer from worrying about thread pools or event loops.

The stack of a goroutine is as big as it needs to be, without the programmer having to be concerned with sizing thread stacks or thread pools.

The integrated network poller lets Go programmers avoid convoluted callback styles while still leveraging the most efficient IO completion logic available from the operating system.

The runtime makes sure that there will be just enough threads to service all your goroutines and keep your cores active.

And all of these features are transparent to the Go programmer.

Footnotes:

  1. CMOS power consumption isn't only caused by the short circuit current that flows while the circuit is switching. Additional power consumption comes from charging the output capacitance of the gate, and leakage current through the MOSFET gate increases as the size of the transistor decreases. You can read more about this in the lecture materials from CMU's ECE322 course. Bill Herd has published a series of articles on how CMOS works.
  2. This is an oversimplification. In some cases the operating system can avoid saving and restoring infrequently used architectural registers by starting the process in a mode where access to floating point or MMX/SSE registers will cause the program to fault, thereby informing the kernel that the process will now use those registers and that it should from then on save and restore them.
  3. Some CPUs have what is known as a tagged TLB. With a tagged TLB, the operating system can tell the processor to associate particular TLB cache entries with an identifier derived from the process ID, rather than treating each entry as global. The upside is that this avoids flushing entries on each process switch if the process is placed back on the same CPU in short order.
