Author: sodme | Source: http://blog.csdn.net/sodme
Copyright notice: This article may be reproduced without the author's consent, provided the first two lines (the copyright, author, and source information) are retained.
What are the main advantages of the completion port?
The biggest advantage of the completion port is its efficiency when managing massive numbers of connections; this high I/O throughput is achieved through mechanisms inside the operating system kernel. Note: the completion port's strength lies in handling a huge number of connections, not a huge volume of data. The completion port is best suited to this case: the connection count is huge, while the packets sent and received on each connection are relatively small, typically a few KB or even under 1 KB.
Since the completion port exists to handle massive numbers of connections, the first thing to optimize is how those connections are managed. To this end, we introduce the concept of the "pool".
In completion-port design, the "pool" principle is almost mandatory. "Pool" covers several things, including the thread pool, the memory pool, and the connection pool. The following sections describe the meaning and usage of each of these pools one by one. Where possible, I will compare efficiency with and without a pool, and you will see just how lovable the "pool" is.
In large online systems, frequently creating and releasing data buffers wastes system resources. To address this, we introduce a "memory pool" for managing data buffers.
As you know, every WSASend and WSARecv call must be handed a per-I/O structure variable. There are several ways to create and release these structures.
Method 1: create on demand. A new structure is allocated each time WSASend or WSARecv is called, and destroyed in the worker thread after the get function completes;
Method 2: create only when a new connection arrives. The structure is allocated along with the new connection and bound to the new client socket; it is destroyed with the client object only when that socket is closed;
Method 3: create a fixed number of structures in advance and keep them in an idle queue. Whenever WSASend or WSARecv is called, a structure is taken from the idle queue and, after use, put back into it.
In terms of both efficiency and convenience, we recommend managing the idle structures as in method 3. The tricky part of method 3 is deciding how many structures to create. That depends on the processing efficiency of your server's completion port: if the completion port processes events quickly, the idle queue can be relatively short; if it processes them slowly, the queue must be longer.
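Method 3 can be sketched as a simple free list of pre-allocated per-I/O blocks. The names here (`IoBlock`, `IoBufferPool`) and the fixed buffer size are illustrative, not from the original code; a real server would also guard the queue with a lock, since worker threads share it.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Stand-in for the per-I/O structure delivered with each WSASend/WSARecv.
struct IoBlock {
    char buffer[4096];
    std::size_t bytes = 0;
};

// Method 3: pre-allocate a fixed number of blocks and recycle them
// through an idle queue instead of calling new/delete per operation.
class IoBufferPool {
public:
    explicit IoBufferPool(std::size_t count) : storage_(count) {
        for (IoBlock& b : storage_) idle_.push_back(&b);
    }
    IoBlock* acquire() {                    // take a block before posting I/O
        if (idle_.empty()) return nullptr;  // pool exhausted: queue too short
        IoBlock* b = idle_.front();
        idle_.pop_front();
        return b;
    }
    void release(IoBlock* b) {              // return it after the I/O completes
        idle_.push_back(b);
    }
    std::size_t idleCount() const { return idle_.size(); }
private:
    std::vector<IoBlock> storage_;  // fixed backing storage, never resized
    std::deque<IoBlock*> idle_;     // queue of blocks ready for reuse
};
```

An `acquire` returning null is the signal that the queue length was chosen too small for the completion port's actual throughput, which is exactly the sizing question discussed above.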
Next comes the "connection pool". We know that when a client connection is received with the traditional accept, the function returns a newly created client socket; that is, accept itself creates the socket. The good news is that Windows provides another function that lets us create the socket before the connection arrives and then associate the accepted connection with it. This function is AcceptEx: its socket parameter takes a socket created in advance with the socket function, which you pass in when calling AcceptEx. At this point, those in the know may already be smiling: doesn't this let us create a large batch of sockets before any connections arrive, waiting for AcceptEx to use them, and thereby save the overhead of creating a socket on the spot? Exactly. In practice, we can create many sockets in advance and then accept client connections with AcceptEx. We can call this collection of sockets a "socket connection pool" (though you have probably heard "database connection pool" more often); I coined the name myself, so if it bothers you, just call it a "socket pool". Haha.
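The socket pool idea can be sketched as follows. This is a portable illustration, not real Winsock code: `SocketHandle` stands in for the Windows SOCKET type, and the factory callback stands in for the real `socket()` call that would be made up front; the handle taken from the pool is what would be passed to AcceptEx.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>

// Stand-in for a socket handle; on Windows this would be SOCKET.
using SocketHandle = int;

// Pre-create sockets so AcceptEx can be handed an existing handle
// instead of paying for socket creation at accept time.
class SocketPool {
public:
    SocketPool(std::size_t count, std::function<SocketHandle()> factory) {
        // The factory stands in for socket(AF_INET, SOCK_STREAM, ...),
        // called `count` times up front rather than per connection.
        for (std::size_t i = 0; i < count; ++i) idle_.push_back(factory());
    }
    bool takeForAccept(SocketHandle& out) {  // hand a handle to AcceptEx
        if (idle_.empty()) return false;
        out = idle_.front();
        idle_.pop_front();
        return true;
    }
    void recycle(SocketHandle s) {           // after the client disconnects
        idle_.push_back(s);
    }
    std::size_t available() const { return idle_.size(); }
private:
    std::deque<SocketHandle> idle_;
};
```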
As for the thread pool, I believe anyone who has written multithreaded code has heard of the concept, so I will not introduce it at length here; if you are interested, Google it yourself. The thread pool I refer to here is not only the worker thread pool maintained by the completion port itself, but also the thread pools built on top of the completion-port model, divided into a logic thread pool and a sending thread pool. Some recommend running only one logic thread, since multiple logic threads are unnecessary and synchronization becomes very troublesome. I agree, but with one premise: such a design requires you to partition the server's functions sensibly, or that single thread will be run ragged.
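The single-logic-thread design described above can be sketched as a queue between the completion-port workers and one consumer. This is an illustrative sketch, not the author's code: worker threads push completed events, and because only one thread ever runs them, game state needs no locking of its own.

```cpp
#include <cassert>
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Completion-port worker threads push completed events here; a single
// logic thread consumes them in order, so the logic itself is lock-free.
class LogicQueue {
public:
    void push(std::function<void()> task) {
        {
            std::lock_guard<std::mutex> lk(m_);
            q_.push(std::move(task));
        }
        cv_.notify_one();
    }
    void runUntilStopped() {   // body of the single logic thread
        for (;;) {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lk(m_);
                cv_.wait(lk, [this] { return !q_.empty(); });
                task = std::move(q_.front());
                q_.pop();
            }
            if (!task) return;  // an empty task is the stop sentinel
            task();
        }
    }
    void stop() { push(nullptr); }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> q_;
};
```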
A friend raised a question about this series of articles: can this really be called performance optimization? I want to point out that the optimization discussed in this series is not limited to optimizing specific pieces of code. That will certainly appear, but optimization is not only about such details; the optimization I discuss here also includes considerations at the level of the model's architecture.
As I mentioned last time, introducing "pools" into the model can effectively improve server efficiency. A completion port handles thousands of client connections; an occasional redundant operation on a single connection may not hurt your system, but on a high-performance server built on completion ports it should be avoided. Otherwise you may find that using a completion port is no more efficient than other models. It should also be pointed out that the completion port is not a universal model: in some cases it is the right choice, in others completely unnecessary. How the completion port fits into an online game server model specifically, I will cover in another article.
The main purpose of introducing "pools" is to keep a relatively static data store while the server runs and to operate on that relatively static space. However, even after introducing pools everywhere we can, we still inevitably face the following problems: data movement during packet assembly, data copying in memory, and the one-to-one correspondence and lookup between sockets and client objects. The sections below cover optimizations for these three aspects.
For convenience, a few state constants are introduced here:
stAccept: a connection is being established;
stRecv: data is being received;
stSend: data is being sent.
Note: in this article and those that follow, the function GetQueuedCompletionStatus is abbreviated as the "get" function.
Now let us discuss packet assembly when the get function returns in the stRecv state. We know that TCP is a stream protocol: the size of what it delivers is not necessarily the size of the logical packet we intended to send or receive; how many bytes arrive in each receive depends on actual network conditions. A logically complete packet may be delivered by TCP in two pieces, the first half in one receive and the second half in the next, which raises the problem of assembling partial packets. To assemble them, there must be somewhere to put the two halves together, after which a logical packet of the appropriate length can be extracted according to the length field defined in its header. "Putting the two halves together" is where the data-copy problem arises. In practice, two assembly schemes are available. The first copies the second half of the received data to the end of the first half, forming one contiguous buffer, and assembles within that space. The second does not copy the newly received data to the end of the first half; instead it assembles directly across the two existing buffers: once the truncation point in the first buffer is found, the pointer is moved to the new packet header and a complete logical packet is taken out.
Either form of assembly will, sooner or later, involve copying and moving the leftover data. In the first scheme, memcpy is executed up front to append the newly received data to the end of the first half. In the second scheme, although no memcpy happens at first, after you "take out all complete logical packets" a new partial packet may remain in the buffer, and you must still memcpy that remainder back to the front of the buffer so it can be assembled with the data that arrives next. Comparing the two, the second scheme typically copies far fewer bytes per memcpy, so its efficiency is relatively higher.
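The second scheme can be sketched as follows. The wire format here is an assumption for illustration (each logical packet begins with a 2-byte little-endian payload length); only after all complete packets are extracted is the leftover partial packet moved back to the front, which is the single deferred copy described above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Assumed wire format (illustrative): each logical packet starts with a
// 2-byte little-endian length field covering the payload that follows.
class PacketAssembler {
public:
    // Append newly received bytes (what a WSARecv completion delivered).
    void feed(const char* data, std::size_t len) {
        buf_.insert(buf_.end(), data, data + len);
    }
    // Take out every complete logical packet, then shift only the
    // remaining partial packet to the front of the buffer: the single
    // memcpy the second scheme defers until extraction is done.
    std::vector<std::vector<char>> extract() {
        std::vector<std::vector<char>> packets;
        std::size_t pos = 0;
        while (buf_.size() - pos >= 2) {
            std::uint16_t len =
                std::uint16_t(std::uint8_t(buf_[pos])) |
                (std::uint16_t(std::uint8_t(buf_[pos + 1])) << 8);
            if (buf_.size() - pos - 2 < len) break;  // only half a packet so far
            packets.emplace_back(buf_.begin() + pos + 2,
                                 buf_.begin() + pos + 2 + len);
            pos += 2 + len;
        }
        // One move of the remainder, instead of one copy per receive.
        buf_.erase(buf_.begin(), buf_.begin() + pos);
        return packets;
    }
    std::size_t pending() const { return buf_.size(); }
private:
    std::vector<char> buf_;
};
```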
Of course, if our only goal is packet assembly, we can avoid the copy altogether by using a circular buffer. Implementing a ring buffer is fairly simple; I will not post concrete code here, only the basic idea. Anyone who has studied data structures knows the "circular queue" (if not, Google it); the circular buffer discussed here is simply a receive buffer with those characteristics. In the server's receive handling, after all complete logical packets have been taken out of the buffer, a new partial packet may remain. With a circular buffer, there is no need to copy that data back to the head of the buffer before assembling subsequent data; the next extraction simply proceeds from the recorded head and tail pointers. The ring buffer is a widely used optimization in IOCP processing, and indeed in the receive handling of any network model that must process data efficiently.
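Although the article leaves concrete code out, a minimal sketch of such a ring buffer might look like this (an illustrative implementation, not the author's): leftover bytes stay where they are, and only the head and tail indices advance.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Circular receive buffer: after extracting complete packets, leftover
// bytes are never copied back to the front; the head index just advances.
class RingBuffer {
public:
    explicit RingBuffer(std::size_t capacity) : data_(capacity) {}
    std::size_t size() const { return count_; }
    bool write(const char* src, std::size_t len) {  // on receive completion
        if (len > data_.size() - count_) return false;  // no room
        for (std::size_t i = 0; i < len; ++i) {
            data_[tail_] = src[i];
            tail_ = (tail_ + 1) % data_.size();     // wrap past the end
        }
        count_ += len;
        return true;
    }
    std::size_t read(char* dst, std::size_t len) {  // extract packet bytes
        if (len > count_) len = count_;
        for (std::size_t i = 0; i < len; ++i) {
            dst[i] = data_[head_];
            head_ = (head_ + 1) % data_.size();
        }
        count_ -= len;
        return len;
    }
private:
    std::vector<char> data_;
    std::size_t head_ = 0, tail_ = 0, count_ = 0;
};
```

A production version would read and write in at most two memcpy chunks rather than byte by byte; the loop form is kept here for clarity.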
Optimizing the memcpy function itself: you can search Google for related material; fastmemcpy is the most widely used optimized version. Here are two links on memcpy optimization:
http://www.blogcn.com/user8/flier_lu/blog/1577430.html
http://www.blogcn.com/user8/flier_lu/blog/1577440.html
Please read those two articles carefully. The core idea of the optimization is to choose the best copy strategy for the system's hardware architecture. Following the two links above, memcpy can achieve a clearly measurable speedup.
For the network layer, the unique identifier of each client connected to the server is its socket value. In many cases, however, we need a so-called client object that corresponds one-to-one with that socket; that is, we use the socket value to find the unique client object attached to it. Naturally we think of the STL's map to implement this mapping, and in practice we fetch the client object from the socket value through the map's find function. This is the usual approach. But although map's lookup algorithm is optimized, it still costs time. If all we want is to determine, when the get function returns, which client object the returned socket represents, we can instead design an extended OVERLAPPED structure and locate the client object directly through the extended OVERLAPPED structure the get function returns.
My extended OVERLAPPED structure looks like this:
struct PER_IO_DATA {
    OVERLAPPED ov;
    ...
    CIocpClient* iocpClient;
    ...
};
As you can see, my extended OVERLAPPED structure carries a client object pointer, iocpClient. Every time a WSASend or WSARecv request is posted, this pointer travels with it. When the operation completes, the client object pointer can be recovered from the PER_IO_DATA structure the get function returns, skipping the map lookup by socket value. During sustained sending and receiving, performing a map lookup on every get return would noticeably hurt performance; passing the client pointer inside the extended OVERLAPPED structure and locating the client object directly saves that lookup cost, and is undoubtedly another significant step in performance optimization.
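The pointer recovery can be demonstrated portably. `OVERLAPPED_STUB` here is a stand-in for the real Winsock OVERLAPPED (its fields are abbreviated and illustrative); because the OVERLAPPED member sits first in PER_IO_DATA, the pointer the get function returns is also a pointer to the whole extended structure.

```cpp
#include <cassert>

// Stand-in for the Winsock OVERLAPPED structure (fields abbreviated).
struct OVERLAPPED_STUB {
    unsigned long internal = 0;
    unsigned long offset = 0;
};

struct CIocpClient { int socketValue; };

// The extended per-I/O structure: OVERLAPPED first, client pointer after.
struct PER_IO_DATA {
    OVERLAPPED_STUB ov;       // must be the first member for the cast below
    int opType = 0;           // stAccept / stRecv / stSend
    CIocpClient* iocpClient = nullptr;  // set once when the op is posted
};

// GetQueuedCompletionStatus hands back the OVERLAPPED pointer we posted;
// since ov is the first member, that pointer also addresses the enclosing
// PER_IO_DATA, so the client object is reached without any map lookup.
CIocpClient* clientFromOverlapped(OVERLAPPED_STUB* ov) {
    PER_IO_DATA* io = reinterpret_cast<PER_IO_DATA*>(ov);
    return io->iocpClient;
}
```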
<To be continued>