High-Performance Server Architecture
Source: http://pl.atyp.us/content/tech/servers.html
Source: http://www.lupaworld.com/home/space-341888-do-blog-id-136718.html
(Translator's note: I re-read this article; the existing translation is quite good, so I have tidied it up and am re-releasing it here as a memo.)
Introduction
This article shares some of what I have learned over my years of server development. The "server" discussed here is, more precisely, a program that handles a large number of discrete messages or requests per second. Network servers usually fit this description, but not every network program is a server in this strict sense. "High-performance request handler" would be a terrible title, though, so for simplicity's sake I will just say "server".
This article is not about "mildly parallel" applications, in which a single program now routinely handles several tasks at once. Your browser, for example, may do some things in parallel, but that kind of parallelism presents no great challenge. The interesting challenge appears when the server itself is the performance bottleneck and the only way to improve performance is to improve the architecture. A browser running on a gigahertz CPU with a gigabyte of memory is not seriously challenged by a few concurrent downloads over a DSL line. Here the point is not sipping through a straw but drinking from the tap at full flow: the hard part is working right up against the limits of the hardware. (Translator's note: the author's point is how to raise throughput to the limits imposed by the hardware.)
Some people will no doubt question some of my ideas and suggestions, or believe there is a better way; that is unavoidable. I am not trying to play God in this article; these are approaches that have worked for me, not only in improving server performance but also in reducing debugging difficulty and increasing the extensibility of the system. Results may differ for other people, and if another approach works better for you, that's great. But note that, in my experience, the alternatives to each of the suggestions in this article have been tried, and the results were disappointing. Your own clever idea may well have fared better in those experiments, but if I encouraged readers to go down that path I would only be annoying the innocent ones. You don't want to annoy the reader, do you?
The remainder of this article will focus on the four killers that affect server performance:
1) Data copies
2) Context switches
3) Memory allocation
4) Lock contention
Other important factors are discussed at the end of the article, but these four are the main ones. If your server can handle most requests without copying data, without context switches, without memory allocation, and without lock contention, I am confident its performance will be excellent.
Data Copies
This section is short, because most people have already learned this lesson. Almost everyone knows that copying data is bad; it seems obvious. Well, it seems obvious only because you learned it early in your career, because somebody started saying so ten or more years ago. I know that's true for me. Today it is covered in nearly every university course and nearly every how-to document; even marketing brochures have turned "zero copy" into a popular buzzword.
Although the harm done by data copies is obvious, people still overlook them, because the code that copies data is often well hidden and disguised. Do you know whether the library or driver code you call copies data? The answer is often surprising. Guess what "programmed I/O" on a PC really means. A hash function is an example of a disguised copy: it has the memory-access cost of a copy plus extra computation. Once it is pointed out that hashing is effectively "copying plus", avoiding it seems obvious, but as far as I know some very smart people have found it quite hard to do. If you really want to get rid of data copies, whether because they hurt server performance or because you want to show off your "zero copy" technique at a hacker conference, you will have to track down every place a copy might occur, rather than believe the advertising.
One way to avoid data copies is to use buffer descriptors (or chains of buffer descriptors) instead of plain buffer pointers. Each buffer descriptor should consist of the following elements:
- A pointer to the buffer and the length of the entire buffer
- A pointer to the real data within the buffer and its length, or an offset and length
- Forward and backward pointers to other buffer descriptors, forming a doubly linked list
- A reference count
Now, instead of copying data in memory, code can simply increment the reference count on the appropriate descriptor. This works remarkably well under some conditions, including the way a typical network protocol stack operates, but it can also become a very big headache. Generally speaking, it is easy to add buffers at the beginning or end of a chain, to add references to a whole buffer, and to free a whole chain at once. Adding a buffer in the middle of a chain, freeing a piece of a buffer, or taking a reference to only part of a buffer is harder. Splitting or combining chains will simply drive you crazy.
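To make this concrete, here is a minimal sketch of such a reference-counted buffer descriptor in C. The struct layout and the function names (buf_desc, buf_retain, buf_release) are my own illustration, not code from the original article, and a real server would protect the reference count with an atomic operation or a lock:

```c
#include <stdlib.h>

/* Sketch of a reference-counted buffer descriptor (names are illustrative). */
struct buf_desc {
    char            *buffer;      /* pointer to the underlying buffer        */
    size_t           buf_len;     /* length of the entire buffer             */
    char            *data;        /* pointer to the real data in the buffer  */
    size_t           data_len;    /* length of the real data                 */
    struct buf_desc *prev, *next; /* doubly linked list of chained buffers   */
    int              refcnt;      /* reference count                         */
};

/* "Copy" the data by taking another reference instead of calling memcpy().
 * In a multithreaded server this increment would need to be atomic. */
static void buf_retain(struct buf_desc *d) { d->refcnt++; }

/* Drop a reference; free the buffer only when the last reference is gone. */
static void buf_release(struct buf_desc *d)
{
    if (--d->refcnt == 0) {
        free(d->buffer);
        free(d);
    }
}
```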
I do not recommend using this technique in every situation, because once you have to walk the descriptor chain to find the block you want, it can be even worse than a copy. The technique works best for large blocks of data in the program: give each such block its own descriptor as described above, so that copying is avoided without disturbing the rest of the server's work. (Translator's note: copying large blocks is CPU-intensive and interferes with other concurrently running threads.)
The last point about data copies is: don't go to extremes in avoiding them. I have seen too much code that avoids a copy and ends up worse than the copy would have been, for example by forcing a context switch or breaking up a large I/O request. Copying data is expensive, but avoiding it has diminishing returns. Rather than contorting the code to eliminate the last few copies and doubling its complexity, spend the time on something else.
Context Switches
Whereas nearly everyone recognizes the cost of data copies, a surprising number of people completely ignore the cost of context switches. In my experience, context switches, far more than data copies, are what actually kill high-load applications: the system spends more time switching between threads than it spends inside threads doing useful work. What is surprising is how common the causes of excessive context switching are. The number one cause is having more active threads than CPUs. As the ratio of active threads to CPUs grows, the number of context switches also grows; with luck the growth is linear, but more commonly it is exponential. This simple fact explains why thread-per-connection designs scale so poorly. For a scalable system, the only realistic option is to limit the number of active threads to (at most) the number of CPUs. One variant of this approach is to use only a single active thread; although it avoids context contention and avoids locks as well, it cannot use more than one CPU to increase total throughput, so unless the program is non-CPU-bound (typically network-I/O-bound), the more general approach should be preferred.
The first thing a program with a sensible number of threads must work out is how to make one thread handle many connections at once. This usually means a front end built on select/poll, asynchronous I/O, signals, or completion ports, with an event-driven framework behind it. There are many religious arguments about which front-end API is best; Dan Kegel's C10K paper is a good survey of the field. Personally, I find select/poll and signals rather ugly, so I prefer AIO or completion ports, but it actually doesn't matter much. Except perhaps for select(), they are all reasonable, so don't spend too much time worrying about what happens at the outermost edge of your system's front end.
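As a rough illustration of the "one thread managing many connections" idea, here is a minimal single-threaded poll() loop. It is only a sketch: the listening socket is assumed to be set up elsewhere, error handling is mostly omitted, and the actual request processing is left as a comment.

```c
#include <poll.h>
#include <sys/socket.h>
#include <unistd.h>

#define MAX_CONNS 1024

/* One thread multiplexing many connections with poll().
 * listen_fd is assumed to be a bound, listening socket set up elsewhere. */
void event_loop(int listen_fd)
{
    struct pollfd fds[MAX_CONNS];
    int nfds = 1;

    fds[0].fd = listen_fd;
    fds[0].events = POLLIN;

    for (;;) {
        if (poll(fds, nfds, -1) < 0)
            continue;                          /* interrupted; just retry */

        if (fds[0].revents & POLLIN) {         /* new connection arrived */
            int conn = accept(listen_fd, NULL, NULL);
            if (conn >= 0 && nfds < MAX_CONNS) {
                fds[nfds].fd = conn;
                fds[nfds].events = POLLIN;
                nfds++;
            } else if (conn >= 0) {
                close(conn);                   /* table full; drop it */
            }
        }

        for (int i = 1; i < nfds; i++) {
            if (fds[i].revents & (POLLIN | POLLHUP | POLLERR)) {
                char buf[4096];
                ssize_t n = read(fds[i].fd, buf, sizeof buf);
                if (n <= 0) {                  /* closed or error */
                    close(fds[i].fd);
                    fds[i--] = fds[--nfds];    /* compact the array */
                } else {
                    /* hand the data off to the request-processing stages */
                }
            }
        }
    }
}
```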
The simplest conceptual model of a multi-threaded, event-driven server has a request queue at its center: one or more listener threads read requests from clients and put them on the queue, and one or more worker threads take requests off the queue and process them. Conceptually this is a fine model, and a great deal of code is structured this way. So what is the problem? The number two cause of context switches is moving the processing of a request from one thread to another. Some people even bounce the response back to the original thread to be sent, which is worse still, because now each request causes at least two context switches. It is very important to use a "smooth" (symmetric) approach, in which a request can go from being handled by a listener thread to being handled by a worker thread and back again without a context switch. Whether this means dividing the connections among the threads or having all threads take turns acting as listener for every connection does not matter much.
Even looking ahead, there is no way to know how many threads will be active at the same moment inside the server: a request may arrive on any connection at any moment, and "background" threads dedicated to special tasks may wake up at any time. If you don't know how many threads are active, how can you limit the number of active threads? In my experience, one of the simplest and most effective methods is an old-fashioned counting semaphore that each thread must hold whenever it is doing real work. If the semaphore has already reached its limit, a thread in listen mode may incur one extra context switch when it is awakened (the listener thread wakes up because a connection request arrives, but because the semaphore is full it immediately goes back to sleep), after which it blocks on the semaphore. Once all threads in listen mode are blocked this way, they stop competing for resources until one of the working threads releases the semaphore, so the effect on context switching is negligible. More importantly, this method handles threads that are dormant most of the time, and therefore should not count against the active-thread limit, far more gracefully than the alternatives.
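A minimal sketch of the counting-semaphore idea, using POSIX semaphores; the names and the structure of the worker loop are illustrative assumptions, not the article's code:

```c
#include <pthread.h>
#include <semaphore.h>

/* Counting semaphore that limits how many threads do "real work" at once.
 * The semaphore is initialized to the number of CPUs; every thread must
 * hold it while working, so at most num_cpus threads are active at a time.
 * Names here are illustrative. */
static sem_t active_slots;

void init_limiter(unsigned int num_cpus)
{
    sem_init(&active_slots, 0, num_cpus);
}

void *worker(void *arg)
{
    (void)arg;
    for (;;) {
        sem_wait(&active_slots);  /* block if too many threads are already active */
        /* ... accept or dequeue a request and run it through its stages ... */
        sem_post(&active_slots);  /* let a blocked thread proceed */
    }
}
```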
Once request processing has been divided into two stages (listening and working), it is natural to divide it further into more stages in the future. In the simplest form, a request completes one stage and then moves on to the next (such as sending the response). But it can be more complex: a stage may fork into two different execution paths, or it may simply produce an answer immediately (such as returning a cached value). In each case the stage needs to know what should happen to the request next. There are three possibilities, indicated by the return value of the stage's dispatch function:
- The request needs to be passed on to another stage (return a descriptor or pointer identifying it)
- The request has been completed (return OK)
- The request is blocked (return "request blocked"). This is the same as the previous case, except that the request is blocked until another thread frees the needed resource
Note that in this model, the queuing of requests between stages happens within a single thread, not across two threads. This avoids constantly putting a request on the next stage's queue only to pull it right back off and execute it; there is no need for all that activity around queues and locks.
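Here is one possible way to express these three outcomes in C. The enum values, the dispatch-function signature, and the driver loop are illustrative assumptions; the point is simply that a single thread drives the request from stage to stage based on each stage's return value:

```c
/* The three possible results of a stage's dispatch function (names are
 * illustrative; the article only describes them abstractly). */
enum stage_status {
    STAGE_FORWARD,  /* pass the request on to another stage            */
    STAGE_DONE,     /* the request has been completed                  */
    STAGE_BLOCKED   /* blocked on a resource; resume when it is freed  */
};

struct request;     /* opaque request object, defined elsewhere */

typedef enum stage_status (*stage_fn)(struct request *req, int *next_stage);

/* One thread drives a request from stage to stage; the request is never
 * queued to, or handed off to, another thread between stages. */
void run_request(struct request *req, stage_fn *stages, int first_stage)
{
    int cur = first_stage;
    for (;;) {
        int next = cur;
        switch (stages[cur](req, &next)) {
        case STAGE_DONE:
            return;
        case STAGE_BLOCKED:
            return;             /* some other thread will resume it later */
        case STAGE_FORWARD:
            cur = next;         /* continue in the same thread */
            break;
        }
    }
}
```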
This way of splitting a complex task into smaller, cooperating parts probably looks familiar, because it is actually quite old. My approach has its roots in Communicating Sequential Processes (CSP), formulated by C.A.R. Hoare in 1978, whose ideas can be traced back to Per Brinch Hansen and Matthew Conway in 1963, before I was born! However, when Hoare coined the term CSP, "process" was meant in an abstract mathematical sense; a CSP process is not related to the operating-system entity of the same name. In my opinion, implementing CSP in a way that runs like multithreading on top of a single OS-provided thread is what has given so many people scalability headaches.
A practical example is Matt Welsh's SEDA, which shows the idea of staged execution (stage-execution) being taken in a sensible direction. SEDA is a good example of "server architecture done right", and its characteristics are worth commenting on:
- SEDA's batching tends to emphasize one stage processing multiple requests, while my approach tends to emphasize one request being processed in multiple stages.
- In my opinion, SEDA's major drawback is that it allocates a separate thread pool to each stage and only redistributes threads between stages "in the background" in response to load. As a result, the context switches caused by causes 1 and 2 above are still plentiful.
- In a purely academic research project, implementing SEDA in Java may be useful, but in practical applications I think this approach is rarely chosen.
Memory Allocation
Allocating and freeing memory are among the most common operations in an application, so many clever tricks have been invented to make general-purpose allocators more efficient. But no amount of cleverness can make up for the fact that, in many situations, general-purpose allocation is simply inefficient. To reduce calls into the system allocator, I have three suggestions.
The first suggestion is preallocation. We all know that imposing artificial limits on a program's functionality through static allocation is bad design. But there are many other forms of preallocation that are very beneficial. In general, allocating memory from the system once is better than doing it several times, even if some memory is wasted in the process. If you can determine the kinds of memory the program will use, preallocating them at startup is a reasonable choice. Even when you are not sure, preallocating everything a request handler might need at the start beats allocating each piece as it becomes necessary. Allocating several pieces in one system call can also dramatically reduce error-handling code. Preallocation may not be an option when memory is tight, but under all but the most extreme conditions it is a clear win.
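As a small illustration of preallocation, the sketch below grabs, in a single allocation at startup, every buffer a request handler might need. The structure layout, sizes, and names are assumptions made up for the example:

```c
#include <stdlib.h>

/* Preallocate everything a request handler might need in one allocation
 * at startup. The member sizes and names are made-up assumptions. */
struct request_ctx {
    char header[1024];
    char body[16 * 1024];
    char scratch[4096];
};

static struct request_ctx *ctx_pool;
static size_t              ctx_pool_size;

int preallocate_requests(size_t max_in_flight)
{
    /* One system allocation, one error check, instead of one per buffer
     * per request. */
    ctx_pool = calloc(max_in_flight, sizeof *ctx_pool);
    if (ctx_pool == NULL)
        return -1;
    ctx_pool_size = max_in_flight;
    return 0;
}
```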
The second suggestion is to use lookaside lists for objects that are allocated and freed frequently. The basic idea is to put recently freed objects on a list instead of actually freeing them; when such an object is needed again soon afterward, it is taken from the list rather than allocated from the system. An additional benefit of lookaside lists is that the initialization and teardown of complex objects can often be skipped.
Generally it is a bad idea to let a lookaside list grow without bound, never freeing anything even when the program is idle, so inactive objects need to be "purged" periodically while avoiding the introduction of complicated locks or contention. A good approach is to have the lookaside list consist of two independently lockable lists: a "new" list and an "old" list. Allocation takes from the new list first and falls back to the old list; objects are always freed onto the new list. The purge thread operates as follows:
- Lock both lists
- Save the head of the old list
- Make the (previously) new list into the old list
- Unlock
- At leisure, free every object on the list whose head was saved in step two
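A minimal sketch of such a two-list lookaside allocator, with a purge routine that follows the steps above. It assumes one lookaside list per object type (so every recycled object is at least sizeof(struct free_node) bytes), and all names are illustrative:

```c
#include <pthread.h>
#include <stdlib.h>

/* Two-list lookaside allocator: allocate from "new" then "old", always free
 * onto "new", and purge the "old" list periodically. One instance per
 * object type; objects must be at least sizeof(struct free_node) bytes. */
struct free_node { struct free_node *next; };

struct lookaside {
    pthread_mutex_t   new_lock, old_lock;
    struct free_node *new_head, *old_head;
    size_t            obj_size;
};

void *la_alloc(struct lookaside *la)
{
    struct free_node *n = NULL;

    pthread_mutex_lock(&la->new_lock);           /* try the new list first */
    if (la->new_head) { n = la->new_head; la->new_head = n->next; }
    pthread_mutex_unlock(&la->new_lock);

    if (n == NULL) {
        pthread_mutex_lock(&la->old_lock);       /* then fall back to the old list */
        if (la->old_head) { n = la->old_head; la->old_head = n->next; }
        pthread_mutex_unlock(&la->old_lock);
    }
    return n ? (void *)n : malloc(la->obj_size); /* last resort: the system */
}

void la_free(struct lookaside *la, void *obj)
{
    struct free_node *n = obj;                   /* objects always go back on "new" */
    pthread_mutex_lock(&la->new_lock);
    n->next = la->new_head;
    la->new_head = n;
    pthread_mutex_unlock(&la->new_lock);
}

/* Purge step, run periodically by a cleanup thread, following the steps
 * listed above; the actual freeing happens with no locks held. */
void la_purge(struct lookaside *la)
{
    struct free_node *stale;

    pthread_mutex_lock(&la->new_lock);           /* lock both lists */
    pthread_mutex_lock(&la->old_lock);
    stale = la->old_head;                        /* save the head of the old list */
    la->old_head = la->new_head;                 /* the new list becomes the old list */
    la->new_head = NULL;
    pthread_mutex_unlock(&la->old_lock);         /* unlock */
    pthread_mutex_unlock(&la->new_lock);

    while (stale) {                              /* free the saved list at leisure */
        struct free_node *next = stale->next;
        free(stale);
        stale = next;
    }
}
```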
With this system, objects are freed only when they have not been needed for at least one purge interval (that is, the interval between runs of the purge thread), but never for more than two intervals, and the purge thread does the actual freeing without holding any locks, so it barely competes with normal threads. In theory the same approach could be extended to more than two stages, but I have not found that necessary.
One concern with lookaside lists is that keeping a list pointer in each free object can slightly increase memory use, but even then the benefits usually more than make up for the extra memory.
The third suggestion involves locks, which we have not discussed yet, but I will bring it up anyway. Even with lookaside lists, lock contention in the allocator is often the biggest cost of allocating memory. The solution is to keep thread-private lookaside lists, which eliminates contention between threads. Going further, one list per processor would be even better, but that helps only in non-preemptive threading environments. At the extreme, private lookaside lists can even be combined with a shared list.
Lock Contention
Efficient locking schemes are notoriously hard to design; I call them the Scylla and Charybdis of server architecture (see the appendix). On one side, simplistic locking (coarse-grained locks) serializes work that could proceed in parallel, hurting concurrency and scalability; on the other side, elaborate locking (fine-grained locks) eats away at performance through both space and time overhead. Stray too far in either direction and you run into deadlock or contention. Between the two lies a narrow channel that leads to both correctness and efficiency, but where is it?
Because locks tend to be deeply entangled with program logic, it is impossible to design a good locking scheme without changing how the program works. That is why people hate locks, and why they rationalize the unscalable single-threaded designs they produce instead.
Almost every locking scheme starts out as "one big lock around everything" plus the hope that performance will not suffer. When that hope is dashed (it almost always is), the big lock is broken into several smaller locks, we pray again (that performance will not suffer), and the whole process repeats (the small locks are broken into still smaller ones), presumably until performance is acceptable. Typically each iteration increases complexity and locking overhead by 20-50% in exchange for a 5-10% reduction in contention. With luck the net result is still a modest gain, but an actual decrease in performance is not unusual. The designer is left scratching his head: "I made the locks fine-grained just like the book said. Why is performance still terrible?"
In my experience, the above approach is flawed at its foundation. Imagine the solution space as a mountain range, with the excellent solutions as peaks and the bad ones as valleys. Starting from the "one big lock" scheme and splitting repeatedly is like a typical bad climb: the climber starts in a valley surrounded by ravines, gullies, foothills, and dead ends, and it is hard to reach the summit from such a place. So what does the right route to the summit look like?
The first thing to do is draw a chart of the locking in your program, with two axes:
- The vertical axis of the chart represents code. If you are using the staged architecture described earlier (in which requests are divided into stages), you probably already have such a division, much like the familiar diagram of the OSI seven-layer network model.
- The horizontal axis of the chart represents data sets. At every stage of a request there should be a data set that belongs to that stage.
Now you have a grid, in which each cell represents a particular data set needed at a particular stage. The most important rule to follow is this: two requests should never be in contention unless they need the same data set at the same stage. If you can stick to this rule strictly, you are already halfway to success.
Once the grid above is defined, every type of lock in your system can be identified. Your next goal is to make sure the identified locks are spread as evenly as possible across both axes; this part of the work is application-specific. You have to work like a diamond cutter, using your knowledge of the program to find the natural "cleavage lines" between request stages and data sets. Sometimes they are easy to find, sometimes they are not, and you may need to keep looking. Splitting the code into stages is a complicated programming matter and I have no particular advice to offer, but for defining the data sets, here are some suggestions:
- If requests can be numbered sequentially, or hashed, or associated with an object ID, the data can be partitioned according to those numbers or IDs.
- Sometimes it is better to assign requests to data sets dynamically, based on which data set has the most resources available, rather than on some intrinsic property of the request, much the way multiple integer units in a modern CPU divide up instructions.
- It helps to make sure the data sets assigned to each stage are different, so that data contended for at one stage will not be contended for at another stage.
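As a small illustration of partitioning data by ID, the sketch below hashes an object ID onto one of a fixed number of independently locked "stripes", so that two requests contend only when they land on the same stripe. The stripe count, hash, and names are assumptions made up for the example:

```c
#include <pthread.h>

/* Hash object IDs onto a fixed number of independently locked "stripes";
 * two requests contend only when they hit the same stripe at the same
 * stage. Stripe count, hash, and names are illustrative. */
#define NUM_STRIPES 64

static pthread_mutex_t stripe_lock[NUM_STRIPES];

void init_stripes(void)
{
    for (int i = 0; i < NUM_STRIPES; i++)
        pthread_mutex_init(&stripe_lock[i], NULL);
}

static unsigned int stripe_of(unsigned long object_id)
{
    return (unsigned int)((object_id * 2654435761UL) % NUM_STRIPES);
}

void update_object(unsigned long object_id)
{
    unsigned int s = stripe_of(object_id);

    pthread_mutex_lock(&stripe_lock[s]);
    /* ... touch only the data belonging to this stripe at this stage ... */
    pthread_mutex_unlock(&stripe_lock[s]);
}
```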
If you have partitioned the "lock space" (meaning the actual distribution of locks) both vertically and horizontally, and made sure the locks are spread evenly across the grid, congratulations: you have a good scheme. You are now standing at a good base camp with a gentle slope to the summit, but you are not there yet. Now is the time to profile the lock contention and see how much it can be improved. Divide the stages and data sets in different ways, measure the contention each time, and keep going until the split is satisfactory. When you get there, the view from the summit will stretch endlessly at your feet.
Other Aspects
I have covered the four biggest areas that affect performance. But there are a few more important issues worth mentioning, most of which come down to your platform or environment:
- How does your storage subsystem perform on large versus small reads and writes, and on sequential versus random access? How good is it at read-ahead and write-behind?
- How efficient is the network protocol you are using? Can its parameters be tuned for better performance? Are there tricks such as TCP_CORK, MSG_PUSH, or Nagle-toggling that can be used to avoid generating tiny messages?
- Does your system support scatter/gather I/O (e.g. readv/writev)? Using it can improve performance and spare you some of the pain of using buffer chains (see the discussion in the data-copies section, and the short sketch after this list). (Translator's note: in a DMA transfer, the source and destination physical addresses must be contiguous. On some computer systems, such as IA, memory that is contiguous in virtual addresses is not necessarily physically contiguous, so a DMA transfer has to be split into several pieces. If an interrupt is raised after each physically contiguous chunk is transferred, before the host starts the next one, that is block DMA mode. Scatter/gather is different: it describes the physically discontiguous memory with a list and hands the head of that list to the DMA master, which then transfers chunk after chunk by following the list and raises a single interrupt at the end. Clearly scatter/gather is more efficient than block DMA mode.)
- What is your system's page size? What is the cache line size? Is it worth aligning data on these boundaries? How expensive are system calls and context switches?
- Are you aware of phenomena such as starvation of locking primitives? Does your event mechanism suffer from the "thundering herd" problem? Does your wake/sleep mechanism have the bad behavior in which, when X wakes Y, the context switches to Y immediately even though X still has unfinished work?
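For the scatter/gather point above, here is a minimal writev() sketch that sends a header and a body from two separate buffers in one system call, without first copying them into a contiguous staging buffer. The function name and parameters are illustrative:

```c
#include <string.h>
#include <sys/types.h>
#include <sys/uio.h>

/* Send a reply header and body from two separate buffers in one writev()
 * call, with no copy into a contiguous staging buffer. fd is assumed to
 * be a connected socket; names are illustrative. */
ssize_t send_reply(int fd, const char *header, const char *body, size_t body_len)
{
    struct iovec iov[2];

    iov[0].iov_base = (void *)header;
    iov[0].iov_len  = strlen(header);
    iov[1].iov_base = (void *)body;
    iov[1].iov_len  = body_len;

    return writev(fd, iov, 2);   /* one system call, two discontiguous buffers */
}
```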
I could think of many more questions along these lines, and I am sure you have too. In any particular situation it may not be worth doing anything about a given one of these issues, but it is usually worth at least considering its effects. If you cannot find the answers in the system manuals, go find them out: write a test program to discover the answer; writing such test code is a useful skill exercise in itself. If your code has to run on multiple platforms, abstract the relevant functionality into a per-platform library, and you will be ahead of the game on whichever platform supports a particular feature.
For your own code, "know why": understand the important high-level operations and what they cost under different conditions. This differs from traditional performance analysis in that it is about the design, not the concrete implementation. Low-level optimization is always the last refuge of a botched design.
(Translator's note: the text below is not in the original article; it is the translator's background notes on the translation.)
[Appendix: Odysseus (also rendered "Ulysses"), in Greek mythology the king of the island of Ithaca and the protagonist of the two epics, the Iliad and the Odyssey (the 11th to 9th centuries BC are known in Greek history as the "Homeric Age"). The Homeric epics, comprising the Iliad and the Odyssey, are famous masterpieces of the ancient world. Odysseus took part in the famous Trojan War, where he was renowned for his bravery and resourcefulness; to win the war he devised and built the famous "Trojan Horse" (which later in the West became a byword for a gift that destroys the enemy). After the fall of Troy he endured many perils on his way home, and Homer's Odyssey recounts those adventures. The episode of "Scylla and Charybdis" is one of its most thrilling and terrifying scenes.
Legend has it that Scylla and Charybdis were monsters of ancient Greek myth. The she-monster Scylla lived in a cave in the strait between Italy and Sicily, and across from her lived the other monster, Charybdis. Together they menaced everyone who sailed past. According to Homer, Scylla has twelve misshapen feet and six snake-like necks, each neck bearing a horrible head with gaping jaws and three rows of venomous teeth, ready to seize her prey. Every day the two of them churn the waters of the strait between Italy and Sicily, and sailors who must pass between the two monsters are in extraordinary danger. In the middle of the strait, Charybdis swallows the sea into a great whirlpool, surging and spraying, bursting forth from the cliff three times a day and dragging under every ship caught in the ebb. When Odysseus's ship approached the whirlpool of Charybdis, it was like a cauldron of boiling water on a stove, the waves towering and throwing snow-white spray across the sky. When the tide receded, the sea turned murky and roared like thunder, shaking the earth, and the dark, muddy mouth of the cave appeared below. While the crew stared in horror at this dreadful sight, and the helmsman gingerly steered the ship to the left away from the whirlpool, Scylla appeared before them and snatched six of his companions in one bite. Odysseus watched his comrades writhing hands and feet between the monster's teeth; they struggled only a moment before being chewed and crushed into a bloody mass. The rest of the crew passed through the perilous passage between the whirlpool of Charybdis and the sea monster Scylla, and after many further disasters finally returned home to the island of Ithaca.
The story circulates widely in linguistics and translation circles. Barkhudarov, the famous Soviet translation theorist, likened "Scylla and Charybdis" to "literal translation and free translation". He said: "Figuratively speaking, the translator must always steer a course between literal translation and free translation, as if threading the passage between Scylla and Charybdis, in order to find a channel narrow enough yet deep enough to reach the ideal destination: maximally equivalent translation."
The famous German linguist Humboldt said something similar: "I am convinced that any translation is undoubtedly an attempt to accomplish an impossible task, for every translator is bound to founder on one of two rocks: either he keeps too exactly to the original at the expense of the characteristics of his own language, or he keeps too closely to the characteristics of his own language at the expense of the original. Steering between the two is not merely difficult but downright impossible."
For a long stretch of history, translation could only choose between two extremes: adhering to the letter (literal translation) or adhering to the spirit (free translation, paraphrase). This too is a "Scylla and Charybdis" of translation. Today "Scylla and Charybdis" has become a byword for a double danger, the sea monster and the whirlpool, and the phrase "between Scylla and Charybdis" means being beset by danger on both sides; it is a metaphor for a dilemma, used here for the translator's agonizing back-and-forth between literal and free translation.]
Original address: http://blog.csdn.net/marising/article/details/5186643
"Go" High Performance server architecture (High-performance server Architecture)