An article on how Google uses Linux

Source: Internet
Author: User
Tags: semaphore

At this year's Linux kernel developers' conference, Google's engineers took the stage to give a talk called "How Google Uses Linux". I have ventured to translate it, with my own annotations in parentheses; readers' comments and corrections are welcome.

Link to the original text see: http://lwn.net/Articles/357658/

(The opening paragraphs, on how Google manages and tracks its Linux kernel code, were fragmentary and are not translated.)

Of the code Google has added to Linux, about three quarters changes the kernel core; device-driver code is only a relatively small part.

(For most contributors, Linux has matured to the point where mainly new device drivers need to be added.)

If Google wants to work with the Linux community, it faces a number of issues. Keeping up with the mainline kernel is hard: the code changes too fast. And in a large project, having developers submit patches and then rework them again and again is a real burden. Alan Cox's answer to this was simple: people are always greedy, but sometimes you should just say "no" to them.

(Alan Cox is the number-two figure in Linux kernel development and now works at Intel. I think it is a good thing for kernel developers that CPU companies like Intel employ them.)

On CPU scheduling, Google found the switch to the new CFS ("Completely Fair Scheduler", merged in 2.6.23 and based on ideas from Con Kolivas's work) too much trouble. Because of the trouble, Google went back to the O(1) scheduler (the algorithm kernels used before 2.6.23) and only moved forward at 2.6.26, when everything worked. The kernel's change to the semantics of sched_yield() was also problematic, especially because Google uses user-space locks. High-priority threads can disturb load balancing on a server (load balancing here means distributing tasks across the CPUs of a single machine, not distributed load balancing across machines), even if those threads run only briefly. Load balancing matters: Google typically runs 5,000 threads on 16- to 32-core servers (a strange usage!).
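The user-space locking pattern that made the sched_yield() semantics change painful can be sketched as a spin-then-yield lock. This is a minimal illustration, not Google's code; the class name and spin limit are invented, and on Linux the raw call is exposed in Python as os.sched_yield():

```python
import os
import threading

# Toy user-space lock: spin briefly on a flag, then call sched_yield()
# to give up the CPU. How well this works depends entirely on what the
# scheduler does with a yielding thread, which is exactly why the CFS
# semantics change mattered to workloads built on this pattern.
class YieldingSpinLock:
    def __init__(self, spin_limit=100):
        self._flag = threading.Lock()  # stand-in for an atomic test-and-set
        self._spin_limit = spin_limit

    def acquire(self):
        spins = 0
        # Try to grab the flag without blocking; yield after spinning a while.
        while not self._flag.acquire(blocking=False):
            spins += 1
            if spins >= self._spin_limit:
                os.sched_yield()  # let another runnable thread use this CPU
                spins = 0

    def release(self):
        self._flag.release()

counter = 0
lock = YieldingSpinLock()

def worker(iterations):
    global counter
    for _ in range(iterations):
        lock.acquire()
        counter += 1
        lock.release()

threads = [threading.Thread(target=worker, args=(10000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 40000
```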

In memory management, newer Linux kernels changed how dirty data is managed, producing a large amount of active writeback (dirty data being written back to disk). The system easily got into a state where kswapd (the swap daemon) generated many small I/O operations that crammed the block device's request queue, so other writeback could never complete (writeback "starvation"). This was addressed by the per-BDI writeback patches in the 2.6.32 kernel.

(The main idea of per-BDI writeback is that block devices no longer share a single waiting queue; there is one queue per backing device, i.e. per disk spindle, because the spindle is the real unit of work in the hardware. This gives good I/O performance on servers with many disks. My personal guess, though: if kswapd's small requests could be merged, might that also improve performance?)

As mentioned above, Google pushes the system with many unusual workloads. They found that sending a signal to a large thread group can cause heavy contention on run-queue locks. Google also found contention problems with the mmap_sem semaphore (the kernel semaphore that protects the mmap areas in struct mm_struct): a sleeping reader can block a writer, which in turn blocks other readers, eventually deadlocking the system. The kernel should be changed not to wait for I/O while holding this semaphore.

(My guess is that the problem Google hit when signalling a thread group is the "thundering herd" effect: many tasks sleep on one queue, and a single wake-up makes them all wake at once, so resource contention is inevitable. I personally think this is not really a Linux bug; Google's way of using Linux is just so peculiar that kernel developers never noticed it.)

Google makes heavy use of the OOM killer to relieve heavily loaded servers. This can be problematic when a process holding a lock is killed by the OOM killer (the lock is never released, so other tasks block). Mike (the speaker) wondered why the kernel struggles along with an OOM killer at all, instead of simply returning an error when a memory allocation fails.

(Not only Mike; everyone asks this. I suspect the answer can only be found in the kernel mailing list. As for the "killed process holds a lock and never releases it, blocking others" problem Google mentioned: Yahoo solved it back in the FreeBSD 4.11 era, with a very clever and lightweight approach. We all assume Google's technology is the strongest, but in truth there are many strong engineers at many companies; most are just not so high-profile. In our country, though, companies that can improve their servers by improving the kernel are very rare.)
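Google's own OOM tooling is not public, but the mainline knob for steering the OOM killer toward or away from a process can be sketched. This assumes a modern kernel with /proc/&lt;pid&gt;/oom_score_adj (range -1000 to 1000; raising the value needs no privilege, lowering it does), and the helper name is mine:

```python
import os

def set_oom_score_adj(pid, adj):
    """Make a process a more (positive adj) or less (negative adj)
    preferred victim of the OOM killer; returns the value read back."""
    path = "/proc/%d/oom_score_adj" % pid
    with open(path, "w") as f:
        f.write(str(adj))
    with open(path) as f:
        return int(f.read())

# Mark the current process as an early OOM victim.
print(set_oom_score_adj(os.getpid(), 500))  # 500
```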

(A passage classifying Google's kernel development work is also omitted; I did not understand it.)

Google added a SCHED_GIDLE scheduling class, a truly idle class: if no spare CPU is available, tasks in this class do not run at all (they do not even compete for the CPU). To avoid the "priority inversion" problem, a SCHED_GIDLE process has its priority temporarily raised while it sleeps in the kernel (sleeps inside the kernel, not sleeps requested via system call). Networking is managed with the HTB queueing discipline together with a set of traffic-control logic. For disks, they are working on Linux I/O scheduling.

(Suppose three processes A, B, and C with priorities A &gt; B &gt; C. C runs first and grabs an important shared resource; A wants that resource too, so A waits for C to finish with it. But because B's priority is higher than C's, the scheduler preempts C to run B before C is done. The net effect is that B runs before A, even though A's priority is higher than B's. This is the "priority inversion" problem. The usual fix: whoever holds the important shared resource gets its priority temporarily boosted; for example, C is raised to A's level while holding the resource and restored after releasing it. Put plainly, the processes contending for a resource had better run at the same priority, or there will be trouble.)
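SCHED_GIDLE itself never went mainline, but mainline Linux has a close analogue, SCHED_IDLE, which runs a task only when nothing else wants the CPU. A minimal sketch, assuming a Linux host (the function name is mine):

```python
import os

def make_idle_class(pid=0):
    """Move a process (0 = the caller) into the SCHED_IDLE class,
    so it runs only when the CPU would otherwise be idle."""
    # Moving from the default class down to SCHED_IDLE needs no privilege.
    os.sched_setscheduler(pid, os.SCHED_IDLE, os.sched_param(0))
    return os.sched_getscheduler(pid)

if __name__ == "__main__":
    print(make_idle_class() == os.SCHED_IDLE)  # True
```

Note that mainline SCHED_IDLE still gives such tasks a tiny scheduling weight rather than none at all, which limits the starvation and priority-inversion hazards described above.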

Besides all this, Google has a great deal of monitoring code. They monitor all disk and network traffic and record it for later analysis of their operations. Google has added many hooks to the kernel so that all disk I/O can be attributed back to applications, including asynchronous writeback I/O. When Mike was asked whether they use tracepoints, the answer was "yes", but, naturally, Google uses its own set of tracing tools.

Google's kernel improvements have a number of important goals for 2010:

CPU throttling excites Google: it lets them give "low-latency tasks" higher priority without worrying about those tasks taking over the entire system.

RPC-aware CPU scheduling, which includes inspecting incoming RPC traffic to decide which process to wake. (This has a strong distributed-OS flavor.)

Delayed scheduling. For many tasks latency is not a big deal, but when an RPC message arrives the kernel tries to run all of them at once, and the messages are not spread across different CPUs (meaning the service processes that handle the requests may all be running on a few CPUs), which unbalances the load between CPUs. Such tasks should therefore be marked for "delayed scheduling": when woken, they are not placed directly on the run queue, but wait until global CPU load balancing has run.

Idle-cycle injection. Advanced power management would let Google run its servers right at the edge of burning up, but never past that edge.

Better memory management is on the plan, including accounting of kernel memory usage.

"Offline memory." Mike insisted that it is getting harder and harder to buy memory that is both cheap and reliable, so Google needs a way to mark bad memory offline. HWPOISON might be able to help them.

On the networking side, Google wants to improve support for receive-side scaling, that is, steering incoming traffic to specific queues. They also need to account for software-interrupt time and attribute it to specific tasks, because network processing involves a large number of soft interrupts. Google has done a lot of work on congestion control; they developed a congestion algorithm that is "unsafe for the Internet" but works well inside Google's data centers. An algorithm named "TCP pacing" slows the server's outbound traffic to avoid overloading the switches.

(A company that runs its own data centers is different: its network optimization is done very finely.)
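The pacing idea later reached mainline in a different form: since Linux 3.13, a socket's transmit rate can be capped with the SO_MAX_PACING_RATE socket option (honored by the fq qdisc and by TCP's internal pacing). This sketches that mainline knob, not Google's in-house algorithm; the numeric fallback 47 is the option's value in &lt;asm-generic/socket.h&gt;:

```python
import socket

# Older Python versions do not define the constant, so fall back to the
# kernel's numeric value for SO_MAX_PACING_RATE.
SO_MAX_PACING_RATE = getattr(socket, "SO_MAX_PACING_RATE", 47)

def cap_send_rate(sock, bytes_per_sec):
    """Ask the kernel to pace this socket's output, and return the
    rate it reports back."""
    sock.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, bytes_per_sec)
    return sock.getsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE)

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
print(cap_send_rate(s, 1_000_000))  # bytes per second
s.close()
```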

On the storage side, Google has spent a lot of effort reducing block-layer bottlenecks so that high-speed flash can be used well. Raising storage efficiency with flash at the block layer is on the development plan. Google is considering adding a flash translation layer to the kernel, but the suggestion from the room was that it would be better to put the flash-handling logic directly into the file system layer.

Mike concluded with a few "problems of interest". One is that Google would like to pin the file system's metadata in memory, the goal being to cut I/O request latency: the time to read a block from disk is known, but if the metadata is not in memory, extra I/O operations are needed to fetch it first, which slows file reads. Google's current workaround is to read data directly from the raw block device into user space (presumably via O_DIRECT, managing a metadata cache themselves in user space and bypassing the page cache), but they do not want to keep doing that.

(I do not know exactly which file system metadata Google means, because metadata differs greatly between file systems. Since Google wants to keep this data in memory, it cannot be small; so why does it not cache well? I hope I get a chance to ask Mike.)

Another problem is the overhead of the fadvise system call, which degrades performance. The details of the problem were not clear at the time.
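For context on what the call does: posix_fadvise(2) lets an application declare its access pattern for a file, for example asking the kernel to drop a file's cached pages with POSIX_FADV_DONTNEED. A minimal sketch assuming a Linux host (the helper name is mine):

```python
import os
import tempfile

def drop_cached_pages(path):
    """Advise the kernel that the whole file will not be needed again,
    so its page-cache pages can be reclaimed."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # offset=0, length=0 means "the entire file".
        os.posix_fadvise(fd, 0, 0, os.POSIX_FADV_DONTNEED)
        return True
    finally:
        os.close(fd)

with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"x" * 4096)
print(drop_cached_pages(tmp.name))  # True
os.remove(tmp.name)
```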

Google's talk was a success, and the Linux community learned a lot from one of its biggest customers. If Google's plans for closer engagement with the Linux community are put into action, Linux will end up with a better kernel.

(Note: Google may well be the IT company that uses Linux most heavily, and its kernel team of some 30 people is very impressive. Looking at our own country, few companies and few people have contributed to open source; it is embarrassing even to bring up, and I feel ashamed myself.)
