What exactly are processes and threads? How does the traditional network service model work? What are the relationships and differences between threads and processes? When does the IO actually occur?
I. A brief overview of context-switching techniques
Before we go any further, let's review the various context-switching techniques.
But first, a bit of terminology. When we say "context", we mean a state in the execution of a program. We usually represent this state with the call stack: the stack records where each invocation level is executing, along with all the relevant information about the environment at that point.
When we say "context switching", we mean the technique of switching from one context to another. "Scheduling" refers to the method that decides which context gets the CPU next.
Process
The process is the oldest and most typical context system. Each process has its own address space and resource handles, and processes do not interfere with one another.
Each process has a data structure in the kernel that we call the process descriptor. These descriptors contain the information the system needs to manage the process and are placed on a queue called the task queue.
Obviously, when a new process is created, we need to allocate a new process descriptor and a new address space (whose mapping is identical to the parent's, with both put into copy-on-write state). These steps incur a certain overhead.
Process states
Ignoring the complex state-transition tables of the Linux kernel, we can boil process states down to the three most important ones: ready, running, and sleeping. This is the classic three-state diagram found in any operating systems textbook.
Ready and running convert into each other; this is essentially what scheduling does. When a running process waits for some condition (most typically IO), it falls into the sleeping state. When the condition is met, it generally moves back to ready automatically.
Blocking
When a process performs IO on a file handle and the FD has no data to give it, the process blocks. Specifically, the kernel records that process X is blocked on FD Y, marks the process as sleeping, and schedules it out. When data arrives on the FD, for example when a packet comes in from the peer, the processes blocked on that FD are woken up. They then enter the ready queue and wait to be scheduled at an appropriate time.
Wake-up after blocking is an interesting topic in its own right. When multiple contexts are blocked on the same FD (not as uncommon as it sounds; an example follows), how many of them should be woken when the FD becomes ready? Traditionally you wake all of them, because if you wake only one and that context cannot consume all the data, the other contexts are left sleeping pointlessly, forever.
But there is a well-known counterexample: accept, which also uses read-readiness to signal that a connection has arrived. What happens if multiple threads try to accept on the same FD? When a new connection comes in, all the contexts become ready, but only the first one actually gets the FD; the others immediately block again. This is the thundering herd problem.
The modern Linux kernel solves this in a surprisingly simple way: accept takes a lock.
(inet_connection_sock.c:inet_csk_wait_for_connect)
Thread
A thread is a lightweight process. In the Linux kernel there is in fact almost no difference between the two, except that a thread does not get a new address space or new resource descriptors; it reuses those of its parent process.
Either way, scheduling threads, like scheduling processes, requires trapping into the kernel.
II. Traditional network service models
Process model
Assign one process per client. The advantage is isolation: an error in one process does not affect the whole system, or even other processes. Oracle traditionally uses the process model. The disadvantage is that process creation and teardown are very expensive, so Oracle needs a connection pool to cut down on creation and destruction, reusing connections rather than creating new ones at will.
Threading model
Assign one thread per client. The advantages are lighter weight, faster creation and teardown, and very fast communication between contexts. The downside is that one misbehaving thread can easily crash the entire system.
An example
py_http_fork_thread.py
In this example, the threading pattern and the process pattern can be easily interchanged.
How it works:
- The parent process listens on the service port
- When a new connection arrives, the parent process calls fork, producing a child-process copy
- The child execs if needed (for example, for CGI)
- The parent process proceeds to accept and blocks (ideally the child should be scheduled first and exec quickly, to avoid COW overhead)
- Context switch: the kernel scheduler picks the next context, which, barring surprises, should be the newly forked child
- The child enters its read-processing state, blocks on the read call, and eventually all contexts are asleep
- When a SYN or data packet arrives, the kernel wakes the contexts blocked on the corresponding FD's wait_queue, marks them ready, and puts them on the run queue
- Each context runs until its next blocking call, or until it is preempted when its time slice is exhausted
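The example file itself is not reproduced here; a minimal sketch of the fork mode it describes might look like this (the port and response body are illustrative, not taken from py_http_fork_thread.py):

```python
import os
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("0.0.0.0", 8080))   # illustrative port
listener.listen(128)

while True:
    conn, addr = listener.accept()     # parent blocks here
    if os.fork() == 0:                 # child: a COW copy of the parent
        listener.close()               # the child does not need the listener
        data = conn.recv(4096)         # child blocks on read until data arrives
        conn.sendall(b"HTTP/1.0 200 OK\r\n\r\nhello\r\n")
        conn.close()
        os._exit(0)
    conn.close()                       # parent drops its copy and loops back to accept
```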
Evaluation
- Synchronous model, natural to write: each context can operate as if the other contexts did not exist, and every read can be treated as a read that must return data.
- The process model naturally isolates connections. Even if the program is complex and crash-prone, a crash affects only one connection, not the entire system.
- Creation and teardown costs are high (see the process-fork and thread-creation cost tests in the efficiency test report), so reuse must be considered.
- Inter-client communication in the process model is cumbersome, especially when sharing large amounts of data.
Performance
Thread mode, virtual machine:
1: 909.27 2: 3778.38 3: 4815.37 4: 5000.04 10: 4998.16 50: 4881.93 100: 4603.24 200: 3445.12 500: 1778.26 (errors occurred)
Fork mode, virtual machine:
1: 384.14 2: 435.67 3: 435.17 4: 437.54 10: 383.11 50: 364.03 100: 320.51 (errors occurred)
Thread mode, physical machine:
1: 6942.78 2: 6891.23 3: 6584.38 4: 6517.23 10: 6178.50 50: 4926.91 100: 2377.77
Note that although Python has a GIL, a thread releases the GIL when it enters blocking network IO. So from the start of the IO call to its end, minus the CPU time spent switching contexts, execution can proceed multi-threaded. The observable symptom is that Python's CPU usage can briefly exceed 100% in this situation.
If multiple contexts are executing, they can exploit this IO time. The observed result: on a single core, within a small range, Python performance actually rises as concurrency increases. Move the same program to a physical machine and this conclusion basically no longer holds, mainly because kernel operations are more expensive on a virtual machine.
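A small sketch of this effect (the target host is illustrative): several threads block in network IO at once, so wall time approaches one round trip rather than ten.

```python
import socket
import threading
import time

def fetch():
    s = socket.create_connection(("example.com", 80))  # illustrative host
    s.sendall(b"HEAD / HTTP/1.0\r\nHost: example.com\r\n\r\n")
    s.recv(1024)      # the GIL is released while this call blocks in the kernel
    s.close()

start = time.time()
threads = [threading.Thread(target=fetch) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print("elapsed:", time.time() - start)  # close to a single round trip
```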
III. The C10K problem
When the number of simultaneous connections reaches around 10K, the traditional models no longer apply. In fact, the thread-switching overhead section of the efficiency test report shows that performance becomes a mess beyond 1K.
Problems with the process model:
At C10K, starting and shutting down that many processes is an unacceptable overhead. In fact, the naive fork-per-connection model should already be abandoned at C1K.
Apache's prefork model uses a pre-allocated process pool, and the processes are reused. But even with reuse, many of the problems described below remain unavoidable.
Problems with the thread model
Every test shows that the thread model is sturdier and performs better than the process model, yet it is still powerless against C10K. The question is: where exactly does the thread model go wrong?
Memory?
Some people think the thread model fails first on memory. If you think so, it must be because you read very old material and did not think it through.
You may have seen that a thread stack consumes 8 MB of memory (the Linux default; see ulimit -s), so 512 thread stacks consume 4 GB and 10K threads consume 80 GB. So, the reasoning goes, the first thing to do is shrink the stack depth, and then worry about stack overflows.
It sounds reasonable; the problem is that the Linux stack is allocated through page faults (see "How does stack allocation work in Linux?"), so not all of the stack address space is backed by physical memory. 8 MB is therefore the maximum consumption; actual memory consumption is only slightly larger than what is actually needed (internal fragmentation, under 4 KB each). However, once allocated, stack memory is hard to reclaim (until the thread exits), which is a genuine flaw of the thread model.
The premise of the memory argument is the limited 32-bit address space: 10K threads will not necessarily run out of memory, but 512 threads will certainly exhaust a 32-bit address space. On the now-mainstream 64-bit systems, this problem does not exist.
Kernel overhead?
The so-called kernel overhead is the cost of the CPU switching from unprivileged to privileged mode, plus some input validation. These costs vary greatly between systems.
Since the thread model spends most of its time in context switches, it sounds plausible that trapping into the kernel is the big overhead. In fact, that is not true. When does a thread trap into a switch? Normally, when IO blocks. But doesn't the same amount of IO require the same traps in other models? Only the non-blocking model has a good chance of returning directly, with no context switch at all.
The basic syscall cost section of the efficiency test report confirms that kernel entry on contemporary operating systems is surprisingly cheap (on the order of 10 clock cycles).
The real problem with the thread model: high switching costs
Anyone familiar with the Linux kernel knows that the modern Linux scheduler went through several stages of development:
- The Linux 2.4 scheduler
- The O(1) scheduler
- CFS
In fact, not until O(1) did the scheduler's complexity become independent of run-queue length. Before that, too many threads meant scheduling overhead grew with the number of threads (with no guarantee of linearity).
The O(1) scheduler appears completely unaffected by the number of threads. However, it has significant drawbacks: it is hard to understand and maintain, and in some cases it makes interactive programs respond sluggishly.
CFS uses a red-black tree to manage the ready queue. Every scheduling decision and every context state transition queries or modifies this tree. Red-black tree operations cost roughly O(log m), where m is roughly the number of active contexts (more precisely, contexts at the same priority), which is roughly the number of active clients.
Therefore, an O(log m) cost is incurred every time a thread reads or writes the network and hits a blocking point, and another O(log m) cost is paid every time a packet arrives and wakes the context blocked on that FD.
Analysis
O(log m) may look small, but here it is unacceptable, because IO blocking happens all the time: the cost is paid on every single block. Worse, the number of active threads is determined by users, not by us. And when performance drops, responses slow down; with the same number of users, more contexts are active at once (because each response takes longer), which drags performance down even further.
The point is that an HTTP service does not need to be perfectly fair to every user; occasionally stretching one user's response time is acceptable. In that case, using a red-black tree to organize the list of pending FDs (which is really a list of contexts) and computing a fair schedule over and over again is pointless.
IV. Overview of multiplexing
To break through C10K, the number of active contexts in the system must be reduced (not strictly necessarily; one could instead switch schedulers, say to a real-time policy like SCHED_RR), which requires one context to handle multiple connections at the same time. For that, read and write system calls must return immediately; otherwise the context keeps blocking on the call, and how could it then be reused? This requires the FD to be in non-blocking mode, or the data to be known to be ready.
All the IO operations discussed so far refer specifically to their blocking versions. Blocking means the context waits on the IO call until suitable data is available; this mode gives the feeling that "a read is a read that must return data." A non-blocking call, by contrast, returns immediately: if there is data, it brings back the data; if there is none, it brings back an error (EAGAIN). Hence, "there is an error, but it does not signify an error."
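A minimal Python illustration of this "error that is not an error" (the target host is illustrative):

```python
import errno
import socket

sock = socket.create_connection(("example.com", 80))  # illustrative host
sock.setblocking(False)            # switch the FD to non-blocking mode

try:
    data = sock.recv(4096)         # returns immediately, ready or not
except BlockingIOError as e:
    # EAGAIN/EWOULDBLOCK: no data yet -- an error, but not an error
    assert e.errno in (errno.EAGAIN, errno.EWOULDBLOCK)
    data = None
```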
However, even in non-blocking mode, the problem of readiness notification remains. Without a proper readiness notification mechanism, we could only blindly retry across many FDs until we happened to hit a ready one. The efficiency gap is easy to imagine.
Readiness notification technology comes in two major flavors: readiness event notification and asynchronous IO. The difference boils down to two points. Readiness notification maintains a state for the user to read; asynchronous IO has the system call a user-supplied callback. Readiness notification takes effect when data becomes ready; asynchronous IO fires only once the IO has completed.
On Linux, the mainstream has always been readiness notification; the kernel's asynchronous IO interface is not even wrapped by glibc. Around readiness notification, Linux has offered three different mechanisms. We skip over select and poll and look at the features of epoll.
One more aside. Interestingly, when epoll is used (more precisely, in LT mode), it hardly matters whether the FD is non-blocking: epoll guarantees there is data each time you read, so the call will not block.
Epoll
A user can create an epoll file handle and associate other FDs with this "epoll FD". From then on, all ready file handles can be obtained through the epoll FD.
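A minimal sketch of this usage through Python's select.epoll wrapper (Linux only; the port and buffer size are illustrative):

```python
import select
import socket

listener = socket.socket()
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("0.0.0.0", 8080))
listener.listen(128)
listener.setblocking(False)

ep = select.epoll()                              # the "epoll FD"
ep.register(listener.fileno(), select.EPOLLIN)   # LT mode by default
conns = {}

while True:
    for fd, events in ep.poll():       # read ready handles off the epoll FD
        if fd == listener.fileno():
            conn, _ = listener.accept()
            conn.setblocking(False)
            conns[conn.fileno()] = conn
            ep.register(conn.fileno(), select.EPOLLIN)
        elif events & select.EPOLLIN:
            data = conns[fd].recv(4096)
            if data:
                conns[fd].sendall(data)  # echo back
            else:                        # peer closed
                ep.unregister(fd)
                conns.pop(fd).close()
```

Registering with select.EPOLLIN | select.EPOLLET would switch that FD to ET mode, which leads to the distinction below.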
epoll has two major modes, ET and LT. In LT (level-triggered) mode, every wait returns all handles that are currently ready. In ET (edge-triggered) mode, only handles that became ready since the last call are returned. In other words, in ET mode, if a handle is read once but never drained, i.e. never goes from ready back to not-ready, it will never be returned by a later call, even though it is actually full of data, because it never passes through the not-ready-to-ready transition again.
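The practical consequence of ET mode is that every ready FD must be drained until EAGAIN; a hedged sketch, assuming a non-blocking socket registered with EPOLLET:

```python
def drain(conn):
    """Read everything currently available from a non-blocking socket.
    In ET mode this loop must run until EAGAIN, or the fd will never be
    reported again even though data remains buffered."""
    chunks = []
    while True:
        try:
            chunk = conn.recv(4096)
        except BlockingIOError:   # EAGAIN: the fd went from ready to not-ready
            break
        if not chunk:             # peer closed the connection
            break
        chunks.append(chunk)
    return b"".join(chunks)
```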
Like CFS, epoll also uses a red-black tree, but to organize all the FDs added to the epoll instance. epoll's ready list is a doubly linked queue, which makes it cheap for the kernel to add an FD to the queue or remove one from it.
To dig further into the implementation of epoll, refer to the kernel source analysis of poll and epoll under Linux.
Performance
With non-blocking calls there is no blocking IO to trigger context switches; instead, switches happen when the time slice runs out and the context is preempted (in most cases), so the extra overhead per read and write disappears. And epoll's routine operations are all O(1). The copy done by epoll_wait is proportional to the number of FDs it returns (in LT mode roughly equal to the m above; much smaller in ET mode).
But epoll has one subtlety. epoll manages its FDs in a red-black tree, so adding or removing an FD costs O(log n) (n being the total number of connections), and the registration call must be made once per FD. So under a large number of connections that are opened and closed frequently (ultra-short connections), there is still some performance cost. Still, registration happens only once per connection; and with ultra-short connections, the TCP setup and teardown overhead is already hard to accept, so the impact on overall performance is small.
Inherent defects
In principle, epoll is implemented via wait_queue callback functions, so in principle it could monitor any object that activates a wait_queue. epoll's biggest problem is that it cannot be used for regular files, because regular files are always reported ready, even when a read on them will actually block.
As a result, every epoll-based design still blocks its context the moment it reads a regular file. Go's solution is to run each such syscall on an independent thread; therefore, ordinary file IO in Go does not block its network processing.
V. A brief introduction to several programming models under event-notification mechanisms
One awkwardness of notification mechanisms is what happens right after an IO operation is issued: the IO has not completed, so the current logical flow cannot proceed, yet the requirement to reuse threads means the current thread must keep running. From the question of how to program asynchronously, several distinct schools of design have emerged.
User-space scheduling
The first thing to know is that asynchronous programming is almost always accompanied by a user-space scheduling problem, even when no coroutine-style context technique is used.
Since the system no longer automatically wakes the appropriate context based on the FD it is blocked on, someone else must do that job: generally, some kind of framework.
Picture a big map from FDs to objects: when epoll tells us an FD is ready, we wake the corresponding object and let it process that FD's data.
Of course, reality is more complicated. In principle, every wait that consumes no CPU time must be interrupted, put to sleep, and managed by some authority that wakes it at the right moment: sleep calls, file IO, locks, and so on. More precisely, everything that would involve a kernel wait_queue needs the same mechanism rebuilt inside the framework; in other words, kernel-side scheduling and waiting are moved up into user space.
There is, of course, the reverse approach: push the program down into the kernel. Perhaps the most famous example is Microsoft's HTTP server.
The most common form of these wakeable, interruptible objects is the coroutine.
Coroutines
A coroutine is a programming component that performs context switching without trapping into the kernel. With it, we can bind a coroutine context object to an FD and let the FD's readiness resume that coroutine's execution.
Of course, since switching address spaces and resource descriptors still has to go through the kernel, coroutines can only schedule among contexts within the same process.
How is it done?
When the kernel switches contexts, what actually happens is that all the current registers are saved into memory, and another set of previously saved registers is loaded from another piece of memory. For a Turing machine, the current registers encode the machine state, that is, the entire context; everything else, memory on the stack and objects on the heap, is reached directly or indirectly through registers.
But if you think about it, changing registers does not seem to require entering kernel mode at all. In fact, user-space switching uses exactly this kind of scheme.
Most C coroutine implementations are precisely this saving and restoring of the execution state. In Python, greenlet works by saving and swapping the top stack frames of the current thread.
Tragically, though, the pure user-space primitives (setjmp/longjmp) perform very well on most systems but were not designed for coroutines. setjmp does not copy the whole stack (and most coroutine schemes shouldn't either); it saves only the register state. The consequence is that the new register state shares the same stack with the old one, and the two corrupt each other as they execute. A complete coroutine scheme must create a new stack at the appropriate moment.
The better-suited primitives (makecontext/swapcontext) trap into the kernel (for sigprocmask), which makes the whole call very slow.
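A minimal sketch of such user-space switching with the third-party greenlet package (which creates a fresh stack per greenlet, avoiding the setjmp problem above); the producer/consumer pair is illustrative:

```python
from greenlet import greenlet

def consumer():
    while True:
        item = producer_gr.switch()    # save our frames, resume the producer
        print("consumed", item)

def producer():
    consumer_gr.switch()               # let the consumer run to its first switch
    for i in range(3):
        consumer_gr.switch(i)          # resume the consumer, handing it i

producer_gr = greenlet(producer)
consumer_gr = greenlet(consumer)
producer_gr.switch()                   # every switch stays in user space
```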
The relationship between coroutines and threads
First, to be clear: a coroutine cannot schedule contexts that live in another process. And for any coroutine to get CPU time, it must execute inside a thread. Therefore, the number of CPUs that coroutines can exploit is directly related to the number of threads used to carry them.
As a corollary, coroutines executing within a single thread can be viewed as a single-threaded application: they are neither preempted nor required to synchronize with contexts on other CPUs until execution reaches a specific point (basically, a blocking operation). Therefore, any stretch of coroutine code that cannot trigger a blocking call executes within one thread, and that stretch can be treated as synchronized.
You will often see coroutine applications that launch several processes at startup. This is not cross-process coroutine scheduling: generally, a large set of FDs is handed out across several processes, and each process does its own FD-to-coroutine scheduling.
A coroutine framework based on readiness notification
- Step one: wrap read/write, and check the return value when a read is issued. If it is EAGAIN, mark the current coroutine as blocked on the corresponding FD, then run the scheduling function.
- The scheduling function runs epoll (or pulls from a cache of the previous call's results, to reduce kernel traps) and takes one ready FD from it. If there is none, the context should block until at least one FD is ready.
- Look up the coroutine context object bound to that FD, and switch to it.
- When a coroutine is switched away from, it is usually on its way back into the scheduler, i.e. at the point where read/write could not get data. So upon resumption, the read should be retried; if it fails again, go back to step one.
- If data is read, return it directly.
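A hedged sketch of these steps using greenlet and epoll (the names wrapped_recv and schedule, and the single global scheduler, are illustrative simplifications):

```python
import select
from greenlet import getcurrent, greenlet

ep = select.epoll()
waiting = {}                            # fd -> coroutine blocked on that fd

def wrapped_recv(sock, n):
    """Step one: try the read; on EAGAIN, park on the fd and enter the scheduler."""
    while True:
        try:
            return sock.recv(n)         # data is there: return it directly
        except BlockingIOError:         # EAGAIN
            fd = sock.fileno()
            waiting[fd] = getcurrent()  # mark ourselves blocked on this fd
            ep.register(fd, select.EPOLLIN)
            scheduler.switch()          # run the scheduling function
            ep.unregister(fd)           # resumed: the fd was reported ready; retry

def schedule():
    """Pull ready fds from epoll and dispatch to the coroutines bound to them."""
    while True:
        for fd, _ in ep.poll():         # blocks until at least one fd is ready
            coro = waiting.pop(fd, None)
            if coro is not None:
                coro.switch()           # switch to the coroutine bound to this fd

scheduler = greenlet(schedule)
```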
In this way, asynchronous reads and writes become, to our eyes, synchronous. And as we know, the synchronous model can greatly reduce the programming burden.
CPS model
This model actually has a more popular name: the callback model. The reason for dressing it up as CPS (continuation-passing style) is that it touches many interesting topics.
First, the rough flow of the callback model. At the time of an IO call, a function is passed in as the continuation. When the IO finishes, the passed-in function is invoked to handle whatever comes next. The model sounds simple enough.
Then CPS. In one sentence: it treats everything as IO. Whatever the computation is, the result is delivered through a callback function. From this point of view, the IO callback model is merely a special case of CPS.
For example, to compute 1 + 2 * 3, in CPS we would have to write:
```python
mul(lambda x: add(pprint.pprint, x, 1), 2, 3)
```
Where mul and add are defined in Python as follows:
```python
from functools import reduce  # Python 3: reduce lives in functools
import pprint

add = lambda f, *nums: f(sum(nums))
mul = lambda f, *nums: f(reduce(lambda x, y: x * y, nums))
```
And because Python has no TCO (tail-call optimization), this style of writing creates a great many stack frames.
But to understand the model correctly, you need to think through the following questions:
- Why must the calling procedure of functions be a stack?
- When does the IO actually occur? When the call is made, or when the callback is invoked?
- Where is the callback function invoked? If you inspected the context with a tool, what would the call stack look like?
Function calls and return values
Have you ever wondered why the function-call hierarchy (the context stack) is expressed as a stack? Is there any necessity that defines the procedure of a function call as a stack?
The reason is return values and sequential side effects. For most functions, we need the result of the callee's computation. To get it, the caller must block until the callee returns; the caller's execution state must therefore be saved and restored when the callee returns. In this sense, a function call is actually the simplest context-switching tool. As for the minority of functions that need no return value, we usually still depend on their ordered side effects: killing a process, switching on a light, or simply appending an entry to an environment variable. A sequential side effect likewise requires waiting for the callee to return, since the return is what signals that the effect has taken place.
So what if we need neither a return value nor a sequential side effect? For example, launching a background task that sends data to a peer with no delivery guarantee, or kicking off a crawl without guaranteeing its success.
Usually we just cover such needs with a synchronous call: it is rarely a serious problem. But where blocking is quite severe, many people do treat this behavior as an asynchronous process. The most popular tool for decomposing asynchronous calls is the MQ: not only asynchronous, but distributed too. There is, of course, a simpler non-distributed option: spawn a coroutine.
CPS goes in yet another direction: a function's return value can be handed to a third party instead of being returned to the caller.
When the IO actually occurs
The core of the question is: is the whole callback model based on readiness multiplexing or on asynchronous IO?
In principle, both are possible: you can listen for FD readiness, or listen for IO completion. Of course, even listening for IO completion does not imply that the kernel's asynchronous interface is in use; it may just be a wrapper around epoll.
Context of the callback function
This question needs to be considered together with the user-space scheduling framework above. The essence of registering an IO callback is binding the callback function to a certain FD, just as you would bind a coroutine. The difference is that a coroutine lets you continue executing sequentially, whereas a callback shreds the function into pieces. In most implementations, using callbacks does have one benefit: a coroutine switch costs at least about 50 ns, while a plain function call costs only about 2 ns.
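A toy sketch of the binding (the Reactor class and its method names are illustrative): the callback is attached to the FD and later runs on the dispatcher's stack, which is exactly what inspecting the context with a tool would reveal.

```python
import select

class Reactor:
    """Toy user-space dispatcher: fd -> callback, driven by epoll."""
    def __init__(self):
        self.ep = select.epoll()
        self.callbacks = {}

    def register(self, sock, callback):
        self.callbacks[sock.fileno()] = callback    # bind the callback to the fd
        self.ep.register(sock.fileno(), select.EPOLLIN)

    def run_once(self):
        for fd, _ in self.ep.poll():
            # the callback is invoked from here: the call stack would show
            # run_once() as the caller, not the code that registered it
            self.callbacks[fd]()
```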
State machine Model
The state machine model is harder to understand and program; in essence, it is re-entry, performed over and over.
Imagine that you suffer from periodic amnesia (as in "One Week Friends"). How would you complete a task that spans multiple cycles, such as embroidery, growing crops, or... getting a boyfriend?
Of course, the amnesia analogy has its limits: ordinary life skills and common-sense knowledge must lie outside the scope of the cyclic amnesia; if you had to re-learn reading every cycle, nothing could get done at all.
The answer is: take notes. Each time the amnesia strikes, you read your notes, see which step you finished last, and what the next step is. This means splitting a job into many steps, "re-entering" at each step until that step completes and the work moves on to the next state.
Similarly, under the state machine model, every execution must first deduce the current state and continue from there, until the job is complete. This model is rarely used in practice, because it is even harder to understand and use than the callback model, while the performance difference is small.
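A minimal sketch of the note-taking idea: a parser that is re-entered once per chunk of input and records its progress explicitly (the protocol, a header line followed by a 4-byte body, is invented for illustration):

```python
def make_parser():
    """Parse a header line plus a 4-byte body, incrementally across re-entries."""
    note = {"state": "HEADER", "buf": b"", "header": None}   # the "notes"

    def feed(chunk):                    # re-entered each time data is ready
        note["buf"] += chunk
        if note["state"] == "HEADER" and b"\n" in note["buf"]:
            note["header"], note["buf"] = note["buf"].split(b"\n", 1)
            note["state"] = "BODY"      # record the next step in the notes
        if note["state"] == "BODY" and len(note["buf"]) >= 4:
            return note["header"], note["buf"][:4]
        return None                     # this step is unfinished; come back later

    return feed

feed = make_parser()
assert feed(b"GET /\n12") is None              # header done, body incomplete
assert feed(b"34") == (b"GET /", b"1234")      # state survived the re-entry
```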
Finally, as an aside: the getting-a-boyfriend plan works a little differently from the others, relying mainly on good looks and an aloof-but-adorable contrast, so ordinary people should not attempt it... Then again, ordinary people don't lose their memory once a week either; life is not a Korean drama or a Japanese anime...
Processes, threads, coroutines, synchronization, asynchrony, and callbacks in Python