This article grew out of my reply to the post "Dialog with a Netizen: TCP 10,000-Connection System Design". A short comment was not enough to explain the problem, so I wrote a separate article.
For ordinary applications, the operating system is good enough. For extreme applications, the operating system often becomes our obstacle, in two senses: first, for various reasons it withholds or restricts many mechanisms that could improve performance; second, it constrains our thinking.
Because the operating system must serve a wide variety of applications, its design contains many defensive measures, and for a specific application these measures are often not optimal. When necessary we must customize, just as Google modifies the Linux file system for its own needs. To build extreme network applications, we need to hack the operating system: for such applications, the traditional network programming model is simply not enough.
First, we will list several application scenarios:
Scenario 1: SmartBits is a very powerful protocol-testing tool, but it covers only the most basic protocols. To test a wider range of protocols, or custom protocols, you have to write your own test tool, and a home-grown tool is limited by the operating system: with a traditional programming model it is hard to reach ultra-high performance.
Scenario 2: Some applications have fixed logic and demanding performance requirements: dedicated servers, time servers, DNS servers, large simulation systems, switches, routers, IDSes (intrusion detection systems), and so on. Of course these can be solved with hardware, but wouldn't it be better if general-purpose computers and software could do the job?
For these applications, we need to abandon the traditional notion of a network application. Sockets, IOCP: all of it can go to hell, because it is not enough. Of course, many problems can be solved by conventional development; the scenario in "Dialog with a Netizen: TCP 10,000-Connection System Design", for example, needs no special means. What we discuss here is what to do when the conventional methods cannot solve the problem and even IOCP is nowhere near enough.
======================================
Large applications tend to discard relational databases and return to traditional key-value stores to improve performance. In the same spirit, we need to abandon the traditional programming model and return to the most primitive one.
The operating system affects network-application performance mainly in four ways:
(1) Processes: the current OS is a multitasking system; a single-task system would perform best.
(2) Memory: packet copying. Data is copied from the NIC to the kernel and then to the application, several times over, and most of these copies are unnecessary.
(3) System calls: system calls are expensive. Socket access, memory allocation, reading the clock... all are system calls.
(4) Programming model: IOCP is not optimal, and the traditional thread-based model cannot handle very large concurrency. You cannot, say, run 10 million threads; at that scale you can only use protothreads (or lightweight processes like Erlang's). What if we go higher, to 100 million? Then threads must be eliminated altogether, replaced by a fully discrete, event-based processing mechanism.
======================================
Next, let's see what we can do to squeeze the system, Foxconn-style, for every last drop of performance.
(1) Process
Writing a dedicated operating system just for this is not cost-effective, so let's see how to solve the problem on an existing OS. Clearly we do not want other applications interfering with our network application, so it must be given the highest priority. Without hacking the kernel, the highest priority available is that of a real-time process.
(2) (3) memory and system call
To avoid memory copies and reduce system calls, we must say goodbye to the traditional socket. The best way is to implement a protocol stack yourself. You do not have to implement a full stack, only the parts you need. In essence, UDP takes only a few hundred lines of code, TCP only 3,000 to 4,000, and there are many open-source implementations to consult. The work looks daunting but is not actually that hard. The other problem is memory allocation: the operating system's allocator is inefficient, so use an object pool to reuse memory.
(4) Programming Model
Discard threads and adopt the original event-based, discrete processing model. Simply put, break every task into events and put them into a queue. The application? It is just a loop that pulls events from the queue and executes them; if follow-up work remains, the handler generates new events and appends them to the queue.
If some events must execute first, a simple queue will not do; we need a priority queue. Going further, if we introduce a time model, we need to schedule an event to run at a given time, and an event's priority becomes its due time. What data structure should we use? Traditionally, a heap: insertion is O(log n) and finding the earliest element is O(1). Is a heap optimal? No! There is an even better data structure: the calendar queue, with O(1) insertion and O(1) extraction of the earliest element on average. So we build the event scheduler on a calendar queue. For details, see R. Brown, "Calendar Queues: A Fast O(1) Priority Queue Implementation for the Simulation Event Set Problem," Communications of the ACM, 31(10):1220-1227, October 1988. Open-source calendar-queue schedulers are available online, only a few hundred lines of code. Scheduling thus moves from the operating system into the program itself, with O(1) complexity.
The biggest problem with event-based processing is that no single event may take too long to handle, or the whole loop stalls. In practice, a long task must be split into a chain of small, discrete events, which places high demands on the programmer.
======================================
Next, let's put the points above together and look at a complete implementation.
First, the choice of OS: Windows is out, since it is not easy to hack; choose Linux. Strictly speaking, a real-time Linux should be used, because on ordinary Linux the priority of real-time processes is still below that of some system processes, but since I was not familiar with real-time Linux, I used ordinary Linux. Ordinary Linux proved sufficient.
Then, how do we send and receive packets? Having abandoned the traditional socket, we must build our own send/receive mechanism. I tried two:
The first is signal-based: when a packet arrives, the operating system sends a signal to the application, which then fetches the packet itself. Tests showed heavy packet loss under high traffic.
The second is based on the PF_RING socket (http://www.ntop.org/PF_RING.html). PF_RING creates a ring buffer in the kernel, and the application periodically scans the ring directly. No copy is needed between the kernel ring and user space, and the operating system's complex socket machinery is bypassed, maximizing performance and minimizing the packet loss rate.
After taking data from the ring buffer, you parse it yourself. There is plenty of lightweight protocol-stack code online that can be adapted, and the core protocols are not many lines of code. Once parsed, the data is wrapped into an event, a handler is assigned to it, and the event is hung on the calendar queue to await processing.
The performance is superb: the entire processing pipeline dissolves into O(1). That is, until the hardware limit is reached, metrics such as response time and packet-loss rate depend only on throughput, not on the number of concurrent connections; there is no real notion of concurrency here, only packet receiving, sending, and O(1) scheduling. In actual tests, when the application-layer computation was light, the bottleneck was the NIC: with only four ordinary PCs, the NICs were completely saturated while the CPUs still had plenty of headroom. When the application layer computes heavily, the bottleneck is the CPU. Using this programming model, I implemented both the client and server sides of the QQ protocol and ran 20,000 QQ clients on one machine, speaking the real UDP-based QQ protocol: logging in, fetching friend lists, sending messages, sending group messages. Each client used a different IP address and port (possible precisely because the system socket is bypassed). Another machine ran a simulated QQ server that received messages from the clients and forwarded each to its destination client. Both machines ran single-threaded with the CPU fully loaded, the cycles concentrated in codec and application logic; memory usage was only 30 MB and packet loss was low. The limit here was the CPU: profiling showed networking and scheduling together took only about 20% of it.
This is asynchronous programming at its hardest. But with proper encapsulation the work can be simplified considerably; wrapped well, such programs become a pleasure to write. In reality, every network protocol is simulated before it is formally submitted, and network-protocol simulation is built on exactly this model, because its performance is the highest and its flexibility the greatest. With a bit of "borrow-ism", many application-layer protocols can be obtained by lightly modifying existing code.
======================================
Of course, all of this concerns system design under extreme conditions. In many cases we do need an extreme design but, trapped in conventional thinking, cannot break the mindset and settle for a generic one. It need not be so; IOCP is the limit of the general-purpose approach, not of hacking. The fact is that a slightly better machine can sustain on the order of 100,000 packets per second. At that throughput there may be 1 million sessions (the term is used loosely here), each sending a packet every 10 seconds, or 10 sessions, each sending 10,000 packets per second. Existing operating system APIs are good at the latter, yet the actual throughput of the two is the same; the slight difference is that 1 million sessions occupy more memory.
And sometimes, of course, an extreme design is simply not needed, in which case none of these methods is required.