How does Google use Linux?

KS2009: How Google uses Linux
By Jonathan Corbet
October 2009, from LWN's 2009 Kernel Summit coverage

There may be no single organization which runs more Linux systems than Google. But the kernel development community knows little about how Google uses Linux and what sort of problems are encountered there. Google's Mike Waychison traveled to Tokyo to help shed some light on the situation; the result is an interesting view of what it takes to run Linux in this extremely demanding setting.

Mike started the talk by giving the developers a good laugh: it seems that Google manages its kernel code with Perforce. He apologized for that. There is a single tree that all developers commit to. Every so often, Google rebases its work onto a current mainline release; what follows is a long struggle to make everything work again. Once that's done, internal "feature" releases happen about every six months.

This way of doing things is far from ideal; it means that Google lags far behind the mainline and has a hard time talking with the kernel development community about its problems.

There is a team of engineers working on Google's kernel. Currently they tend to check their changes into the tree, then forget about them until the next rebase. This leads to some real maintenance issues; developers often have little idea of what's actually in Google's tree until it breaks.

And there's a lot in that tree. Google started with the 2.4.18 kernel, but they patched a great many files, inserting 492,000 lines of code. Among other things, they backported 64-bit support into that kernel. Eventually they moved to 2.6.11, primarily because they needed SATA support. A 2.6.18-based kernel followed, and they are now working on preparing a 2.6.26-based kernel for deployment in the near future. They are currently carrying 1208 patches to 2.6.26, inserting almost 300,000 lines of code. Roughly 25% of those patches, Mike estimates, are backports of newer features.

There are plans to change all of this; Google's kernel group is trying to get to a point where they can work better with the kernel community. They're moving to git for source code management, and developers will maintain their changes in their own trees. Those trees will be rebased onto mainline kernel releases on a regular schedule; that should, it is hoped, motivate developers to make their code more maintainable and more closely aligned with the upstream kernel.

Linus asked: why aren't these patches upstream? Is it because Google is embarrassed by them, or is it secret stuff that they don't want to disclose, or is it a matter of internal process problems? The answer was simply "yes." Some of this code is ugly stuff which has been carried forward from the 2.4.18 kernel. There are also doubts internally about how much of it would actually be useful to the rest of the world. But perhaps about half of this code could be upstreamed eventually.

As much as 3/4 of Google's code consists of changes to the core kernel; device support is a relatively small part of the total.

Google has a number of "pain points" which make working with the community harder. Keeping up with the upstream kernel is hard; it simply moves too fast. There is also a real problem with developers posting a patch, then being asked to rework it in a way which turns it into a much larger project. Alan Cox had a simple response to that one: people will always ask for more, but sometimes the right thing to do is to simply tell them "no."

In the area of CPU scheduling, Google found the move to the completely fair scheduler to be painful. In fact, it was enough of a problem that they finally forward-ported the old O(1) scheduler and run it on 2.6.26. Changes in the semantics of sched_yield() created grief, especially with the user-space locking that Google uses. High-priority threads can make a mess of the load balancing, even if they only run for very short periods of time. And load balancing matters: Google runs something like 5000 threads on systems with 16-32 cores.
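
The sched_yield() complaint is easier to see with an example. Below is a minimal sketch, in C, of the kind of user-space spin-then-yield lock that leans on sched_yield() behavior; the lock type, threshold, and names are hypothetical illustrations, not Google's actual code.

    #include <sched.h>
    #include <stdatomic.h>

    typedef struct { atomic_int locked; } ulock_t;

    static void ulock_acquire(ulock_t *l)
    {
        int spins = 0;
        /* try to flip 0 -> 1; on failure, spin briefly, then yield */
        while (atomic_exchange_explicit(&l->locked, 1,
                                        memory_order_acquire)) {
            if (++spins > 100) {
                /* Under the O(1) scheduler, sched_yield() reliably let
                 * the lock holder run next; early CFS could reschedule
                 * the yielder almost immediately, turning this loop
                 * into a CPU burn. */
                sched_yield();
                spins = 0;
            }
        }
    }

    static void ulock_release(ulock_t *l)
    {
        atomic_store_explicit(&l->locked, 0, memory_order_release);
    }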

On the memory-management side, newer kernels changed the management of dirty bits, leading to overly aggressive writeout. The system could easily get into a situation where lots of small I/O operations generated by kswapd would fill the request queues, starving other writeback; this particular problem should be fixed by the per-BDI writeback changes merged for 2.6.32.

As noted above, Google runs systems with lots of threads, not an uncommon mode of operation in general. One thing they found is that sending signals to a large thread group can lead to a lot of run-queue lock contention. They also have trouble with contention for the mmap_sem semaphore; one sleeping reader can block a writer which, in turn, blocks other readers, bringing the whole thing to a halt. The kernel needs to be fixed to not wait for I/O with that semaphore held.
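
The mmap_sem convoy can be reproduced in miniature with an ordinary reader-writer lock. The following is a user-space analogy only (the kernel's rwsem is not pthreads): a reader sleeps while holding the read lock, standing in for a page fault waiting on disk I/O, a writer queues behind it, and on writer-preferring implementations any later reader queues behind the writer.

    #include <pthread.h>
    #include <stdio.h>
    #include <unistd.h>

    static pthread_rwlock_t rw = PTHREAD_RWLOCK_INITIALIZER;

    static void *slow_reader(void *arg)
    {
        pthread_rwlock_rdlock(&rw);
        puts("reader: holding read lock, 'waiting on disk'");
        sleep(2);                  /* simulated I/O with the lock held */
        pthread_rwlock_unlock(&rw);
        return NULL;
    }

    static void *writer(void *arg)
    {
        puts("writer: queued behind the sleeping reader");
        pthread_rwlock_wrlock(&rw);   /* blocks until the reader wakes */
        pthread_rwlock_unlock(&rw);
        puts("writer: finally ran");
        return NULL;
    }

    int main(void)
    {
        pthread_t r, w;
        pthread_create(&r, NULL, slow_reader, NULL);
        sleep(1);                  /* ensure the reader is asleep first */
        pthread_create(&w, NULL, writer, NULL);
        /* a rdlock attempted here may now queue behind the writer too */
        pthread_join(r, NULL);
        pthread_join(w, NULL);
        return 0;
    }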

Google makes a lot of use of the out-of-memory (OOM) killer to pare back overloaded systems. That can create trouble, though, when processes holding mutexes encounter the OOM killer. Mike wonders why the kernel tries so hard, rather than just failing allocation requests when memory gets too tight.
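
Mainline does have a knob that approximates the behavior Mike was asking for: with vm.overcommit_memory=2, allocations that exceed the commit limit fail up front instead of triggering the OOM killer later. This sketch shows the failure surfacing where user space can actually handle it; the 1 TiB size is just a deliberately absurd request.

    #define _GNU_SOURCE
    #include <stdio.h>
    #include <sys/mman.h>

    int main(void)
    {
        size_t len = (size_t)1 << 40;   /* 1 TiB, deliberately absurd */
        void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (p == MAP_FAILED) {
            perror("mmap");             /* graceful failure, no OOM kill */
            return 1;
        }
        munmap(p, len);
        return 0;
    }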

So what is Google doing with all of that code in its kernel? They try very hard to get the most out of every machine they have, so they cram a lot of work onto each. This work is segmented into three classes: "latency sensitive," which gets short-term resource guarantees; "production batch," which has guarantees over longer periods; and "best effort," which gets no guarantees at all. This separation of classes is done partly through the division of each machine into a large number of fake "NUMA nodes." Specific jobs are then assigned to one or more of those nodes. One thing added by Google is "NUMA-aware VFS LRUs": virtual memory management which focuses on specific NUMA nodes. Nick Piggin remarked that he has been working on something along those lines and would have liked to have seen Google's code.
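
The talk did not say how the fake nodes are created; the closest mainline mechanism is the x86 numa=fake= boot parameter, after which libnuma can confine a task to one node. A rough sketch under that assumption (the function name is made up, and Google's real container plumbing is internal):

    #include <numa.h>     /* link with -lnuma */
    #include <stdio.h>

    /* Confine the calling task's CPUs and memory to one (fake) node. */
    int run_on_fake_node(int node)
    {
        if (numa_available() < 0) {
            fprintf(stderr, "no NUMA support on this system\n");
            return -1;
        }
        if (numa_run_on_node(node) < 0)    /* restrict CPU placement */
            return -1;
        numa_set_preferred(node);          /* prefer that node's memory */
        return 0;   /* the job's work would proceed here, confined */
    }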

There is a special SCHED_GIDLE scheduling class which is a truly idle class; if there is no spare CPU available, jobs in that class will not run at all. To avoid priority-inversion problems, SCHED_GIDLE processes have their priority temporarily increased whenever they sleep in the kernel (but not if they are preempted in user space). Networking is managed with the HTB queueing discipline, augmented with a bunch of bandwidth-control logic. For disks, they are working on proportional I/O scheduling.
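
SCHED_GIDLE itself is Google-internal, but mainline's SCHED_IDLE policy is the closest public analogue; the sketch below drops the calling process into it. The in-kernel priority boost described above has no user-space equivalent, so nothing here models it.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        /* SCHED_IDLE requires a static priority of 0 */
        struct sched_param sp = { .sched_priority = 0 };

        if (sched_setscheduler(0, SCHED_IDLE, &sp) != 0) {
            perror("sched_setscheduler");
            return 1;
        }
        /* from here on, this process runs only when a CPU would
         * otherwise be idle */
        return 0;
    }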

Beyond that, a lot of Google's code is there for monitoring. They monitor all disk and network traffic, record it, and use it later for analyzing their operations. Hooks have been added to let them associate all disk I/O back to applications, including asynchronous writeback I/O. Mike was asked if they could use tracepoints for this task; the answer was "yes," but, naturally enough, Google is using its own scheme for now.

Google has a number of important goals for 2010; they include:

They are excited about CPU limits; these are intended to give priority access to latency-sensitive tasks while still keeping those tasks from taking over the system entirely.

RPC-aware CPU scheduling; this involves inspection of incoming RPC traffic to determine which process will wake up in response and how important that wakeup is.

A related initiative is delayed scheduling. For most threads, latency is not all that important, but the kernel tries to run them immediately when RPC messages come in; these messages tend not to be evenly distributed across CPUs, leading to serious load-balancing problems. So threads can be tagged for delayed scheduling; when a wakeup arrives, they are not immediately put onto the run queue. Instead, they wait until the next global load-balancing operation before becoming runnable.

Idle cycle injection: high-bandwidth power management so they can run their machines right on the edge of melting down, but not beyond.

Better memory controllers are on the list, including accounting for kernel memory use.

"Offline memory." Mike noted that it's increasingly hard to buy memory which actually works, especially if you want to go cheap. So they need to is able to set bad pages aside. The Hwpoison work may help them.

They need dynamic huge pages, which can be assembled and broken down on demand.

On the networking side, there is a desire to improve support for receive-side scaling: directing incoming traffic to specific queues. They need to be able to account for software-interrupt time and attribute it to specific tasks; networking processing can often involve large amounts of softirq processing. They've been working on better congestion control; the algorithms they have come up with are "not Internet safe" but work well in the data center. And "TCP pacing" slows down outgoing traffic to avoid overloading switches (a sketch of a pacing interface that later reached mainline follows this list).

For storage, there is a lot of interest in reducing block-layer overhead so that it can keep up with flash. Using flash for disk acceleration at the block layer is on the list. They're looking at in-kernel flash translation layers, though it was suggested that it might be better to handle that logic directly in the filesystem.
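
On the TCP pacing goal mentioned above: Google's implementation was internal, but years after this talk mainline grew a socket option in the same spirit. The sketch below uses SO_MAX_PACING_RATE, which did not exist in the 2.6.26-era kernels discussed here, so treat it as a later analogue rather than what Google actually ran.

    #include <stdio.h>
    #include <sys/socket.h>

    /* Cap a socket's outgoing rate, e.g. 125000000 bytes/s for ~1 Gb/s.
     * Requires a kernel and headers new enough for SO_MAX_PACING_RATE. */
    int cap_socket_rate(int sock, unsigned int bytes_per_sec)
    {
        if (setsockopt(sock, SOL_SOCKET, SO_MAX_PACING_RATE,
                       &bytes_per_sec, sizeof(bytes_per_sec)) != 0) {
            perror("setsockopt(SO_MAX_PACING_RATE)");
            return -1;
        }
        return 0;
    }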

Mike concluded with a couple of "interesting problems." One of those is that Google would like a way to pin filesystem metadata in memory. The problem here is being able to bound the time required to service I/O requests. The time required to read a block from disk is known, but if the relevant metadata is not in memory, more than one disk I/O operation may be required. That slows things down in undesirable ways. Google is currently getting around this problem by reading file data directly from raw disk devices in user space, but they would like to stop doing that.
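
The raw-device workaround usually looks something like the sketch below: O_DIRECT reads from the device with aligned buffers, bypassing the page cache (and filesystem metadata entirely) so each request costs exactly one disk I/O. The device name and sizes are placeholders, not Google specifics.

    #define _GNU_SOURCE
    #include <fcntl.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/dev/sdb", O_RDONLY | O_DIRECT);  /* placeholder */
        if (fd < 0) { perror("open"); return 1; }

        /* O_DIRECT wants sector-aligned buffers, offsets, and lengths */
        void *buf;
        if (posix_memalign(&buf, 4096, 4096) != 0) { close(fd); return 1; }

        ssize_t n = pread(fd, buf, 4096, 0);   /* exactly one disk I/O */
        if (n < 0)
            perror("pread");

        free(buf);
        close(fd);
        return n < 0;
    }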

The other problem is lowering the system-call overhead of providing caching advice (with fadvise()) to the kernel. It's not clear exactly what the problem is here.
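
For reference, this is what the advice calls look like from user space; the complaint is presumably about the per-call cost when issuing many of them. The path and sizes are placeholders.

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(void)
    {
        int fd = open("/var/tmp/datafile", O_RDONLY);   /* placeholder */
        if (fd < 0) { perror("open"); return 1; }

        /* hint: the first 1 MiB will be read soon, prefetch it */
        posix_fadvise(fd, 0, 1 << 20, POSIX_FADV_WILLNEED);

        /* ... reads happen here ... */

        /* hint: done with those pages, the cache may drop them */
        posix_fadvise(fd, 0, 1 << 20, POSIX_FADV_DONTNEED);

        close(fd);
        return 0;
    }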

All told, this was seen as one of the more successful sessions, with the kernel community learning a lot about one of its biggest customers. If Google's plans to become more community-oriented come to fruition, the result should be a better kernel for all.
