Processes, Threads, Coroutines, Synchronous and Asynchronous IO, and Callbacks in Python
What are processes and threads? How does the traditional network service model work? How do coroutines relate to, and differ from, threads? And when does the IO actually occur?
I. A Brief Introduction to Context Switching
Before proceeding, let's review various context switching technologies.
First, though, some terminology. By "context" we mean the state of a program in execution. We generally use the call stack to represent this state: the stack records, for each level of invocation, the call site, the execution environment, and other information.
By "context switching" we mean the technique of switching from one context to another, and "scheduling" refers to the method that decides which context gets the next slice of CPU time.
Process
The process is the oldest and most typical context system. Each process has an independent address space and its own resource handles, so processes do not interfere with one another.
Each process is represented by a data structure in the kernel, which we call the process descriptor. These descriptors contain the information the system needs to manage processes and are kept in a queue called the task queue.
Obviously, creating a process means allocating a new process descriptor and a new address space (mapped to the same pages as the parent's address space, with both sides put into COW, copy-on-write, state at the same time). These steps incur a certain overhead.
Process status
Ignoring the complex state transition table in the Linux kernel, process states can be reduced to three major ones: ready, running, and sleeping. This is the three-state transition diagram found in any operating systems textbook.
Ready and running convert into each other; that conversion, essentially, is scheduling. When a running program needs to wait for some condition (most typically IO), it falls into the sleeping state; once the condition is met, the system moves it back to the ready state.
Blocking
When a process performs IO on a file handle and the fd has no data for it, blocking occurs. Concretely, the kernel records that process X is blocked on fd Y, marks the process as sleeping, and schedules it out. When data appears on the fd (for example, data sent from the peer arrives), the processes blocked on that fd are woken, enter the ready queue, and wait for a suitable moment to be scheduled.
Wake-up after blocking is itself an interesting topic. When multiple contexts are blocked on one fd (rare, but we will see an example shortly) and the fd becomes ready, how many of them should be woken? Traditionally, all of them: if only one context were woken and it could not consume all the data, the other contexts would go on sleeping for no reason.
However, there is a famous case: accept, which uses read-readiness to signal an incoming connection. What happens if multiple processes block on accept? When a new connection arrives, all of their contexts become ready, but only the first one scheduled actually gets the fd; the others block again as soon as they are scheduled. This is the notorious thundering herd problem.
The modern Linux kernel has solved this, and the fix is remarkably simple: accept takes a lock.
(inet_connection_sock.c:inet_csk_wait_for_connect)
Thread
A thread is a lightweight process. In the Linux kernel there is in fact almost no difference between the two, except that a thread does not create a new address space or resource descriptor table; it reuses its parent's.
In either case, though, scheduling of both threads and processes must happen in kernel mode.
II. Traditional Network Service Models
Process Model
Assign one process to each client. The advantage is isolation: an error in one process affects neither the whole system nor the other processes. Oracle traditionally uses the process model. The disadvantage is the very high cost of allocating and releasing processes, which is why Oracle needs a connection pool to keep connections alive between creation and release and to reuse them as much as possible rather than creating new ones at will.
Thread Model
Allocate one thread to each client. The advantage is that threads are lighter: faster to create and release, and communication between contexts is faster too. The disadvantage is that one misbehaving thread can easily bring down the whole system.
Example
py_http_fork_thread.py
In this example, thread mode and process mode can be swapped easily.
How it works:
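The per-connection dispatch the example file uses can be sketched as follows. This is a minimal hypothetical version, not the actual py_http_fork_thread.py; the handler logic and the fixed response are illustrative only.

```python
import socket
import threading

def handle_connection(conn):
    """Serve one client: read the request, send a fixed HTTP response."""
    try:
        conn.recv(4096)  # read (and here ignore) the request
        conn.sendall(b"HTTP/1.0 200 OK\r\nContent-Length: 2\r\n\r\nok")
    finally:
        conn.close()

def serve_forever(listener):
    """Thread mode: dedicate one thread to each accepted connection.

    Replacing threading.Thread with os.fork() here turns this into
    process mode, which is why the two modes interchange easily."""
    while True:
        conn, _addr = listener.accept()
        t = threading.Thread(target=handle_connection, args=(conn,))
        t.daemon = True
        t.start()
```

A handler can be exercised without a real listening socket by driving one end of a socketpair and reading the response from the other.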
Evaluation
- The synchronous model is natural to write: each context can be coded as if no other contexts existed, and a read, once it returns, is guaranteed to have produced data.
- The process model naturally isolates connections: even if the program is complex and crash-prone, a crash affects only one connection, not the whole system.
- The overhead of creation and release is very high (see the process fork and thread mode overhead sections of the efficiency test), so reuse has to be considered.
- Communication among clients is troublesome in process mode, especially when large amounts of data must be shared.
Performance
Thread mode, virtual machine:
1: 909.27 2: 3778.38 3: 4815.37 4: 5000.04 10: 4998.16 50: 4881.93 100: 4603.24 200: 3445.12 500: 1778.26 (error)
Fork mode, virtual machine:
1: 384.14 2: 435.67 3: 435.17 4: 437.54 10: 383.11 50: 364.03 100: 320.51 (error)
Thread mode, physical machine:
1: 6942.78 2: 6891.23 3: 6584.38 4: 6517.23 10: 6178.50 50: 4926.91 100: 2377.77
Note that in Python the GIL is released while a thread is blocked on network IO. Therefore, the interval from the start of such a call to its end, minus the time the CPU spends switched to other contexts, is time in which multithreading genuinely helps; during these intervals the Python process's CPU usage can briefly exceed 100%.
If multiple contexts execute, that time can be fully used. What we observe is that Python remains effectively single-core, yet within a small range performance rises as concurrency increases. If the same test is moved to a physical machine, this conclusion can hardly be reproduced, mainly because kernel operations on a virtual machine carry higher overhead.
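The claim that blocked threads release the GIL can be checked with a small timing experiment. Here time.sleep stands in for a blocking network call, an assumption worth flagging: both release the GIL while waiting, so the waits overlap.

```python
import threading
import time

def blocking_wait():
    # time.sleep releases the GIL while waiting, just as a thread
    # blocked on a socket recv does.
    time.sleep(0.3)

start = time.monotonic()
threads = [threading.Thread(target=blocking_wait) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.monotonic() - start
# Four 0.3 s waits overlap rather than serializing to 1.2 s.
print(round(elapsed, 2))
```

If the GIL were held during the waits, the four threads would take about 1.2 seconds in total; because it is released, the elapsed time stays close to 0.3 seconds.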
III. The C10K Problem
When the number of simultaneous connections reaches roughly 10,000, the traditional models no longer apply. Indeed, the thread switching overhead section of the efficiency test report shows that performance is already poor beyond 1K.
Process Model problems:
Starting and closing that many processes is unacceptable at C10K; in fact, the naive fork-per-connection model should be abandoned as early as C1K.
Apache's prefork model uses a pre-allocated process pool, and those processes are reused. But even with reuse, many of the problems described in this article are unavoidable.
Problems with the thread model
Every test indicates that the thread model is sturdier than the process model and performs better, yet it still falls short of C10K. So the question is: what exactly is wrong with the thread model?
Memory?
Some people think the thread model fails first on memory. If you believe that, it is probably because you have been reading outdated material without thinking it through.
The argument goes: a thread stack consumes 8 MB of memory (the default ulimit on Linux), so 512 thread stacks consume 4 GB, and 10,000 threads would consume 80 GB. Surely we must first think about shrinking the stack depth, and then worry about stack overflow.
It sounds reasonable. The problem is that Linux allocates stack memory through page faults (see "How does stack allocation work in Linux?"): not all of the stack's address space is backed by physical memory. So 8 MB is the maximum consumption; actual memory use is only slightly more than what is actually touched (the internal bookkeeping amounts to less than 4 KB per thread). However, once allocated, the memory is hard to reclaim (until the thread exits), and that is a genuine defect of the thread model.
The caveat is that a 32-bit address space is limited. Even if 512 threads do not exhaust memory, more threads will exhaust the address space. On mainstream 64-bit systems, though, this problem does not exist.
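Since the 8 MB figure is only the default reservation, it can also be tuned down per program. In Python the knob is threading.stack_size; the 256 KiB value below is an arbitrary illustration, and platform minimums apply (a sketch, not a recommendation).

```python
import threading

# Shrink the per-thread stack reservation from the platform default
# (often 8 MB via ulimit) to 256 KiB. The value must be at least
# 32 KiB on most platforms and only affects threads created after
# the call.
threading.stack_size(256 * 1024)

result = []

def worker():
    result.append("ran")

t = threading.Thread(target=worker)
t.start()
t.join()
print(result)
```

With a reduced reservation, the 32-bit address-space exhaustion described above arrives much later, at the price of a lower ceiling before stack overflow.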
Kernel overhead?
The so-called kernel overhead is the cost of the CPU switching from unprivileged to privileged mode and checking its inputs. This cost varies greatly across systems.
The thread model revolves around context switching, so blaming trap overhead sounds plausible. In fact, that too is wrong. When does a thread get switched out? Under normal circumstances, when it blocks on IO. Don't the other models have to trap into the kernel for the same amount of IO? Well, a non-blocking model may return directly, without a context switch.
The basic call overhead section of the efficiency test report confirms that trap overhead on contemporary operating systems is remarkably small (on the order of ten clock cycles).
The thread model has a high switching cost.
Anyone familiar with the Linux kernel knows that the modern Linux scheduler has gone through several stages of development.
In fact, the scheduler's complexity became independent of run-queue length only with the O(1) scheduler. Before that, too many threads meant scheduling overhead that grew with the number of threads (non-linearly).
The O(1) scheduler appears completely unaffected by the number of threads, but it has a significant drawback: it is hard to understand and maintain, and in some cases it makes interactive programs respond slowly.
CFS manages the ready queue with a red-black tree. Every scheduling decision and every context state transition queries or modifies that tree, at a cost of roughly O(log m), where m is approximately the number of active contexts (precisely, the number of contexts at the same priority), which is about the same as the number of active clients.
Therefore, every time a thread tries to read or write the network and blocks, an O(log m) cost is incurred; and every time data arrives and a context blocked on an fd is woken, another O(log m) cost is incurred.
Analysis
O(log m) may not look like much, but here it is an unacceptable overhead. IO blocking is routine, and every block incurs the cost. The number of active threads is determined by the users and is not under our control. Worse, as performance declines and response times grow, the number of active contexts rises under the same user load (because responses are slow), which degrades performance further.
The crux is that an HTTP service does not need to be perfectly fair to every user; it is acceptable for some users' response times to be stretched considerably. Given that, organizing the list of pending fds (really, the list of contexts) in a red-black tree and repeatedly recomputing a schedule over it is unnecessary work.
IV. Multiplexing
To break through the C10K problem, the number of active contexts in the system must be reduced (not strictly necessarily: one could, for instance, use a different scheduler, such as SCHED_RR under the RT class), which means a single context must handle multiple connections at once. For a context to do that, every system call it makes to read or write data must return immediately; otherwise the context blocks on the call, and a blocked context cannot be reused. This requires the fd to be in non-blocking mode, or the data to be known ready in advance.
All the IO operations mentioned so far were their blocking versions. Blocking means the context waits on the IO call until suitable data is available; this mode gives you the feeling that a read, once issued, always yields data. A non-blocking call returns immediately: if there is data, it brings the data back; if there is none, it returns an error (EAGAIN). Hence the saying: "it returns an error, but it is not really an error."
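The "error that is not an error" is easy to observe directly. In Python, EAGAIN on a non-blocking socket surfaces as BlockingIOError; a minimal sketch using a socketpair:

```python
import socket
import time

a, b = socket.socketpair()
b.setblocking(False)  # put the fd in non-blocking mode

# No data yet: instead of blocking, the read "fails" with EAGAIN,
# which Python surfaces as BlockingIOError.
try:
    b.recv(100)
    got_eagain = False
except BlockingIOError:
    got_eagain = True

# Once data is available, the very same call returns it immediately.
a.sendall(b"hello")
data = b""
for _ in range(100):        # brief retry in case delivery lags
    try:
        data = b.recv(100)
        break
    except BlockingIOError:
        time.sleep(0.01)

a.close(); b.close()
print(got_eagain, data)
```

The retry loop here is exactly the "blind retry" the next paragraph warns about; readiness notification exists so that real programs never have to spin like this.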
Even with non-blocking calls available, the readiness problem remains. Without a readiness notification mechanism, we could only blindly retry each fd until we happened to hit one that was ready; the difference in efficiency is easy to imagine.
Readiness notification technology comes in two major flavors: readiness event notification and asynchronous IO. They differ in two respects. Readiness notification maintains a state that the user reads, while asynchronous IO has the system call a user-supplied callback function. Readiness notification fires when the data becomes ready; the asynchronous IO callback fires only once the IO has completed.
Mainstream Linux solutions have always been on the readiness notification side; the kernel-mode asynchronous IO interface is not even wrapped by glibc. Around readiness notification, Linux has offered three mechanisms in total. Let's skip past select and poll and look at the characteristics of epoll.
One more point. Interestingly, when epoll is used (more precisely, only in LT mode), it does not matter whether the fd is non-blocking: epoll guarantees that each read finds data to read, so the call will not block.
Epoll
You create an epoll file handle and associate other fds with it. Thereafter, all the ready file handles can be retrieved through the epoll fd.
Epoll has two modes, ET and LT. In LT (level-triggered) mode, every call returns the complete set of ready handles. In ET (edge-triggered) mode, only the handles that became ready between the previous call and this one are returned. Put another way: in ET mode, if a handle was reported ready at some point and has never since been fully drained, that is, has never gone from ready back to not-ready, then it will never be returned by a later call, no matter how much data it holds, because it cannot go through the not-ready-to-ready transition again.
Like CFS, epoll also uses a red-black tree, but here it organizes all the fds added to the epoll instance. Epoll's ready list is a doubly linked queue, which lets the system add an fd to or remove one from the queue cheaply.
For more information about epoll implementation, refer to the linux poll and epoll kernel source code analysis.
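Python exposes epoll directly as select.epoll on Linux; the portable wrapper selectors.DefaultSelector (which is epoll-backed on Linux, kqueue-backed on BSD) shows the same register-then-wait pattern. A sketch:

```python
import selectors
import socket

# DefaultSelector is backed by epoll on Linux.
sel = selectors.DefaultSelector()

a, b = socket.socketpair()
b.setblocking(False)

# Associate the fd with the selector, like epoll_ctl(EPOLL_CTL_ADD).
sel.register(b, selectors.EVENT_READ, data="conn-b")

a.sendall(b"ping")

# Like epoll_wait: returns only the handles that are ready to read.
results = []
for key, events in sel.select(timeout=1.0):
    results.append((key.data, key.fileobj.recv(100)))

sel.unregister(b)
sel.close()
a.close(); b.close()
print(results)
```

The data argument on register is where a framework would hang the fd-to-object mapping discussed later in this article.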
Performance
If non-blocking calls are used, context switches are no longer caused by blocking on IO but only by preemption (in most cases), so the extra per-read/write switching overhead disappears. Epoll's common operations are all O(1), and the copy performed by epoll_wait is proportional to the number of fds returned (in LT mode nearly the m above; in ET mode greatly reduced).
Epoll has its details, though. Since epoll manages fds in a red-black tree, adding or deleting an fd costs O(log n) (where n is the total number of connections), and the association call must be made once per fd. So very frequent connection setup and teardown at large connection counts (ultra-short connections) still carries some cost. But the association calls are comparatively rare, and if connections really are ultra-short, the TCP connection setup and teardown overhead is already hard to bear, so the impact on overall performance is small.
Inherent defects
In principle, epoll is implemented over wait_queue callback functions, so in principle it can monitor any object that activates a wait_queue. The biggest problem with epoll, however, is that it cannot be used on regular files, because regular files are always reported ready, even though the actual read may still take time.
As a result, in epoll-based solutions, a context that reads a regular file will still block. To solve this, golang makes each such syscall in a separately started thread; hence golang's network IO is never blocked by IO on regular files.
V. Several Programming Models under the Event Notification Mechanism
A major drawback of using the notification mechanism is the question of what to do after issuing IO: the IO has not completed, so the current context's logic cannot continue, yet because threads must be reused, the current thread still has to keep running. Hence several different solutions to asynchronous programming have emerged.
User-mode Scheduling
The first thing to understand is that asynchronous programming is usually accompanied by user-mode scheduling, even when no coroutine-style context technology is used.
Because the system will not automatically wake the appropriate context when an fd unblocks, that work must be done by someone, generally a framework.
Picture a big lookup table mapping fds to objects. When epoll tells us a certain fd is ready, we find and wake the corresponding object and let it process the data on that fd.
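That fd-to-object table is the heart of a user-mode scheduler. A toy version, with all names (Reactor, on_ready, the fd numbers) purely illustrative:

```python
class Reactor:
    """Toy user-mode scheduler: maps fds to wakeable objects."""

    def __init__(self):
        self.waiters = {}  # fd -> object with an on_ready(fd) method

    def wait_on(self, fd, obj):
        """Park obj until fd is ready (what blocking does in-kernel)."""
        self.waiters[fd] = obj

    def dispatch(self, ready_fds):
        """Called with the fds epoll reported ready: wake each owner."""
        for fd in ready_fds:
            obj = self.waiters.pop(fd, None)
            if obj is not None:
                obj.on_ready(fd)

class EchoHandler:
    def __init__(self):
        self.woken_on = None

    def on_ready(self, fd):
        self.woken_on = fd  # a real handler would read fd's data here

reactor = Reactor()
h = EchoHandler()
reactor.wait_on(7, h)     # 7 is a made-up fd number
reactor.dispatch([7, 9])  # 9 has no waiter and is ignored
print(h.woken_on)
```

A real framework would feed dispatch from an epoll_wait loop instead of a hand-written list, but the bookkeeping is the same.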
The real situation is more complicated, of course. In principle, every wait that does not consume CPU time needs to be interrupted, put to sleep, and tracked by some authority, to be woken at the proper moment: sleeps, file IO, locks, and so on. More precisely, everything that goes through a wait_queue in the kernel needs an equivalent mechanism in the framework; that is, scheduling and waiting move from the kernel into user mode.
There is actually a reverse solution too: push the program into the kernel. The most famous example is Microsoft's http server.
The most common form of this "wakeable, interruptible object" is the coroutine.
Coroutine
A coroutine is a programming component that performs context switches without trapping into the kernel. With it, we can bind a coroutine's context object to an fd and let the coroutine resume once the fd is ready.
Of course, since switching address spaces and resource descriptors must in any case be done by the kernel, a coroutine can only schedule among contexts within the same process.
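Python generators already provide user-mode, in-process context switches. A sketch of binding a "coroutine" to an fd-style readiness event, where fd 7 and the payload string are invented for illustration:

```python
def fetch(log):
    """A coroutine that 'blocks' by yielding the fd it waits on."""
    log.append("start")
    data = yield 7          # suspend: tell the framework we wait on fd 7
    log.append("got " + data)

log = []
coro = fetch(log)
waited_fd = next(coro)      # run the coroutine until it suspends
# ... the framework would epoll-wait here; when fd 7 is ready,
# it resumes the coroutine with the data, like a wakeup:
try:
    coro.send("payload")
except StopIteration:
    pass                    # the coroutine ran to completion
print(waited_fd, log)
```

Between next and send, the thread is free to run other coroutines; no kernel trap is involved in either switch.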
How it is implemented
When the kernel performs a context switch, it actually saves all the current registers to memory and then loads another saved register set from elsewhere. For a Turing machine, the current register state denotes the machine's state, that is, the entire context; everything else, including stack memory and objects, is reached directly or indirectly through registers.
But think about it: saving and restoring registers does not seem to require entering kernel mode. And indeed, user-mode switching uses essentially the same scheme.
C coroutine implementations are basically a matter of saving and restoring this state. In Python, greenlet saves the top frame of the current thread.
Sadly, though, the pure user-mode primitives (setjmp/longjmp), while very efficient on most systems, were not designed for coroutines. setjmp does not copy the whole stack (nor should most coroutine solutions); it only saves register state. As a result, the new register set shares a stack with the old one, and the two trash each other as they execute. A complete coroutine solution must allocate a fresh stack at the appropriate moment.
The better-behaved primitives (makecontext/swapcontext) trap into the kernel (sigprocmask), which makes every switch very slow.
Relationship between coroutine and thread
First, let's be clear: a coroutine cannot schedule contexts in other processes. Moreover, every coroutine must run inside some thread to obtain CPU time. Therefore, the number of CPUs a coroutine system can use is directly related to the number of threads driving the coroutines.
As a corollary, coroutines running in a single thread can be regarded as a single-threaded application: until they reach certain points (essentially blocking operations), they are never preempted and never race with contexts on other CPUs. So a stretch of coroutine code that contains no blocking calls and runs in a single thread can be treated as synchronous.
We often see coroutine applications that start several processes. This is not cross-process coroutine scheduling. Generally it means that a large set of fds is distributed across multiple processes, and each process schedules its own fds onto coroutines.
Coroutine framework based on ready notification
In this way, asynchronous data reads and writes become synchronous in our mental model. And we know that the synchronous model greatly reduces the programming burden.
CPS Model
In fact, this model has a more popular name: the callback model. The reason for invoking the grand term CPS (continuation-passing style) is that it touches on many interesting topics.
First, the general flow of the callback model: when issuing an IO call, a function is passed in as the continuation; when the IO finishes, the passed-in function is called to handle whatever follows. The model sounds simple enough.
Then there is CPS proper. In one sentence: it treats every operation as IO. Whatever the operation, its result is delivered through a callback function. From this viewpoint, the IO callback model is merely a special case of CPS.
For example, to compute 1 + 2*3, in CPS we write:
mul(lambda x: add(pprint.pprint, x, 1), 2, 3)
In Python, add and mul can be defined as follows (note that pprint must be imported, and in Python 3 reduce lives in functools):
import pprint
from functools import reduce

add = lambda f, *nums: f(sum(nums))
mul = lambda f, *nums: f(reduce(lambda x, y: x * y, nums))
Running the expression above prints 7: the value of 1 + 2*3, delivered to pprint.pprint as the final continuation.
In addition, because Python has no TCO (tail-call optimization), writing in this style piles up stack frames.
But to understand this model correctly, you need to think carefully about the following questions:
Function composition and return values
Have you ever wondered why the function call hierarchy (the context stack) is expressed as a stack? Is there any inherent need to define the function call process as a stack?
The reason is return values and synchronization order. For most functions, we need the result of their computation, and to get it the caller must block until the callee returns; the caller's state must therefore be saved and restored when the callee comes back. In this sense, a function call is actually the simplest form of context switch. For the minority of functions that return nothing, we usually still rely on their ordered external effects: killing a process, turning on a light, or simply adding an entry to an environment variable. An ordered external effect likewise requires waiting for the callee to return as the signal that the effect has taken place.
But what if we need neither the return value nor ordered external effects? For example, starting a background job that pushes data to the peer without needing confirmation of success, or kicking off a data collection operation.
Usually such needs are simply folded into synchronous calls, and that is mostly harmless. But where blocking is severe, many people do consider splitting this behavior out as an asynchronous process. The most popular decomposition tool at present is the MQ (message queue): not merely asynchronous but distributed. Of course, there is also a simpler, non-distributed option: spawn a coroutine.
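The simple non-distributed variant, a worker thread draining an in-process queue (a poor man's MQ), can be sketched as follows. The task names are invented for illustration.

```python
import queue
import threading

tasks = queue.Queue()
sent = []

def sender():
    """Background worker: processes fire-and-forget jobs."""
    while True:
        item = tasks.get()
        if item is None:       # sentinel: shut down the worker
            break
        sent.append(item)      # a real worker would push data to a peer
        tasks.task_done()

worker = threading.Thread(target=sender)
worker.start()

# The 'callers' enqueue and move on; they never wait for a result.
tasks.put("packet-1")
tasks.put("packet-2")

tasks.put(None)
worker.join()
print(sent)
```

The callers get neither a return value nor an ordering guarantee relative to their own subsequent code, which is exactly the trade this section describes.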
CPS goes in yet another direction: a function's return value can be delivered to a third party rather than to the caller.
When does the IO occur?
In fact, the core of this question is: is the whole callback model built on multiplexing, or on asynchronous IO?
In principle, either works: you can wait on fd readiness or on IO completion. Of course, even waiting on IO completion does not mean the kernel-mode asynchronous interface is used; most likely it is just a wrapper over epoll.
Context of the callback function
This question needs to be combined with the "user-mode scheduling framework" discussed above. Registering an IO callback essentially binds the callback function to a certain fd, just as one binds a coroutine. The difference is that a coroutine lets you write sequential code, whereas callbacks chop the function into pieces. In most implementations, though, callbacks do hold one advantage: the minimum switching overhead of a coroutine is around 50 ns, while a plain function call costs only about 2 ns.
State Machine Model
The state machine model is the hardest to understand and program. Its essence is re-entry: entering afresh on every invocation.
Imagine you suffer from periodic amnesia (like the heroine of One Week Friends). How would you complete a task that takes a long stretch of time? Say embroidery, growing crops, or, well, finding a boyfriend.
Of course, the amnesia has to have limits. Ordinary life skills and common knowledge must be outside the scope of the periodic amnesia; relearning how to read every cycle would get nobody anywhere.
The answer: take notes. After each bout of amnesia, you read your notes to see which steps have been completed and which come next. This requires the job to be divided into multiple steps, with "re-entry" at each step until that step completes and the work advances to the next state.
Likewise, in the state machine model, each execution loads the proper state and proceeds from there until the work is done. The model sees little use, because a state machine is even harder to understand and use than callbacks, and the performance difference is not large.
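The note-taking analogy maps directly onto code: the "notes" are an explicit state field, and each call re-enters at the recorded step. A toy sketch with invented step names:

```python
class EmbroideryTask:
    """Re-entrant state machine: each call does one step, records it."""

    def __init__(self):
        self.state = "thread_needle"   # the 'notes': where we left off
        self.done_steps = []

    def step(self):
        # Re-entry point: consult the notes, do one step, update them.
        if self.state == "thread_needle":
            self.done_steps.append("thread_needle")
            self.state = "stitch"
        elif self.state == "stitch":
            self.done_steps.append("stitch")
            self.state = "tie_off"
        elif self.state == "tie_off":
            self.done_steps.append("tie_off")
            self.state = "done"
        return self.state

task = EmbroideryTask()
while task.step() != "done":   # each iteration is one bout of 'amnesia'
    pass
print(task.done_steps)
```

In an event-driven server, step would be called from the readiness dispatcher, with the state field replacing the stack a coroutine would keep for you.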
Finally, a remark: the find-a-boyfriend scheme works a bit differently from the others, relying mainly on good looks and the charm of contrast, so the average person should not attempt it... Then again, the average person does not lose their memory once a week either. Life is neither a Korean drama nor a Japanese anime, after all...