IOCP Programming Summary (Part 2)


The previous article covered some basic concepts. This article describes some of my personal IOCP programming techniques.

 

Requirements and design of front-end servers for online games

First, some technical background on this server. In a distributed online game server cluster, a front-end connection server is a common design. It has the following responsibilities:

1. Provide software routing between the client and the back-end game logic servers. Once a client establishes a TCP connection with the front-end server, it can communicate with the back-end game servers through this single connection, instead of establishing a new connection with each back-end server.

2. Take on the I/O load from clients. A typical online game server needs to serve thousands or tens of thousands of game clients (up to 100,000 for casual games), so the I/O processing load is considerable. A group of front-end servers can effectively offload this I/O burden, leaving the back-end servers to focus on implementing game logic and effectively decoupling I/O from business logic.

Architecture

 

For online games, the client and server need to communicate frequently, but each packet is small, typically from a few bytes to a few dozen bytes; at the same time, the user's upstream data volume is much smaller than the downstream volume. Different game types have different latency requirements: FPS games need latency under 50 ms, MMOs can tolerate roughly 100-400 ms, and for casual card games a latency of around 400 ms or more is still acceptable. Therefore, communication in online games needs to optimize for latency while also merging packets to avoid network congestion; which factor dominates depends on the specific game type.

This is the technical background. The IOCP connection server described later is designed to meet these requirements.

 

Evaluating existing IOCP server frameworks

Before implementing this connection server, I first looked at some existing open-source IOCP server framework libraries, such as ACE. The whole library is too large and bloated, and the code style feels dated, so I had no appetite for it. Boost.Asio is said to be a good network framework that also supports IOCP; I compiled and ran its examples and tried to read the asio code, but it felt terrible, I could not work out how it was implemented internally, so I gave up. I am quite put off by asio: in the usual boost fashion it places C++ language tricks above design and code readability. I also read through some other, less mainstream IOCP frameworks, with all sorts of implementations. My overall feeling is that IOCP is genuinely hard to grasp and abstract, which is why the implementations vary so much. In the end I decided to reinvent the wheel myself.

 

Service Framework Abstraction

In essence, any server framework is a wrapper around an event message loop. The application layer only needs to register event handler functions with the framework and respond to events as they arrive. A framework for synchronous I/O generally receives an I/O event first and then performs the I/O operation; this kind of event-handling framework is called a Reactor. What makes IOCP special is that the user initiates an I/O operation first and then receives the I/O-completion event, the reverse of the Reactor's order; this kind of framework is called a Proactor. From the prefixes Re and Pro we can easily see the difference between the two. Besides network I/O events, a server should also be able to respond to timer events and user-defined events. The framework puts all of these events into a message queue, extracts events from the queue, and dispatches them to the corresponding handler functions.

IOCP gives us a system-level message queue (called a completion queue), and the event loop is built around this completion queue. After we initiate an I/O operation, the system processes it asynchronously (if it can be handled immediately, it is handled directly), and when the operation completes a message is automatically delivered to the queue; whether it was handled directly or asynchronously, a message is always delivered.

A side note on a performance optimization opportunity: when an I/O operation can complete immediately, you can tell the system not to deliver a completion message, saving one system call (at least a few microseconds), by calling SetFileCompletionNotificationModes(handle, FILE_SKIP_COMPLETION_PORT_ON_SUCCESS). See MSDN for details.
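
As a minimal sketch of that call (assuming a socket 'sock' that has already been associated with the completion port; requires Windows Vista / Server 2008 or later):

// With the flag set, a WSARecv/WSASend that succeeds immediately (returns 0) will NOT
// queue a completion packet, so the result must be handled inline at the call site.
if (!SetFileCompletionNotificationModes((HANDLE)sock,
                                        FILE_SKIP_COMPLETION_PORT_ON_SUCCESS))
{
    // Fall back: every completed operation posts a completion packet as usual.
}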

User-defined events can be delivered by posting to the completion queue (PostQueuedCompletionStatus). For timer events, my approach is to implement a TimerHeap data structure and then periodically check the TimerHeap in the message loop to dispatch timed-out timer events.
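
A minimal TimerHeap sketch is shown below; it is illustrative only, not the author's actual implementation. Timers sit in a min-heap ordered by expiry time, and the event loop calls Update() after each GetQueuedCompletionStatus timeout:

#include <queue>
#include <vector>
#include <functional>
#include <windows.h>

struct Timer {
    DWORD expireAt;                    // tick count (ms) at which the timer fires
    std::function<void()> callback;
};

struct TimerCompare {
    bool operator()(const Timer& a, const Timer& b) const {
        return a.expireAt > b.expireAt;   // earliest expiry ends up on top
    }
};

class TimerHeap {
public:
    void Add(DWORD delayMs, std::function<void()> cb) {
        Timer t;
        t.expireAt = GetTickCount() + delayMs;   // tick wrap-around ignored for brevity
        t.callback = cb;
        _heap.push(t);
    }
    // Fire every timer whose expiry time has passed; called from the message loop.
    void Update() {
        DWORD now = GetTickCount();
        while (!_heap.empty() && _heap.top().expireAt <= now) {
            Timer t = _heap.top();
            _heap.pop();
            t.callback();
        }
    }
private:
    std::priority_queue<Timer, std::vector<Timer>, TimerCompare> _heap;
};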

The message returned from the IOCP completion queue consists of an OVERLAPPED structure pointer and a ULONG_PTR completion key. The completion key is bound when the user associates a socket handle with the IOCP, and it is not very useful in practice; the OVERLAPPED structure is supplied when the user initiates an I/O operation, and it can be extended by the user through inheritance. Making good use of the OVERLAPPED structure is therefore the key to encapsulating IOCP.

Here I use a small C++ template technique to extend the OVERLAPPED structure. First look at the code:

struct IOCPHandler
{
    virtual void Complete(ULONG_PTR key, DWORD size) = 0;
    virtual void OnError(ULONG_PTR key, DWORD error) {}
    virtual void Destroy() = 0;
};

struct Overlapped : public OVERLAPPED
{
    IOCPHandler* handler;
};

template <class T>
struct OverlappedWrapper : T
{
    Overlapped overlap;

    OverlappedWrapper() {
        ZeroMemory(&overlap, sizeof(overlap));
        overlap.handler = this;
    }

    operator OVERLAPPED*() { return &overlap; }
};

IOCPHandler is the interface for the user object; by extending this interface you implement the handling of I/O-completion events. The template class OverlappedWrapper<T> then wraps the user object and the OVERLAPPED structure into a single object, where T is the user's extension type. We pass the address of the embedded Overlapped member to the I/O operation API, and that Overlapped structure carries a pointer back to the user object right after the OVERLAPPED fields. When GetQueuedCompletionStatus returns the OVERLAPPED pointer, we can follow that pointer to find the user object and then call the virtual functions Complete or OnError.

The object layout (illustrated with a diagram in the original article): the user object T comes first, followed by the embedded Overlapped structure, whose handler field points back to the user object.

Processing in the event loop:

DWORD size;
ULONG_PTR key;
Overlapped* overlap;
BOOL ret = ::GetQueuedCompletionStatus(_iocp, &size, &key, (LPOVERLAPPED*)&overlap, dt);
if (ret) {
    if (overlap == 0) {
        OnExit();
        break;
    }
    overlap->handler->Complete(key, size);
    overlap->handler->Destroy();
}
else {
    DWORD err = GetLastError();
    if (err == WAIT_TIMEOUT)
        UpdateTimer();
    else if (overlap) {
        overlap->handler->OnError(key, err);
        overlap->handler->Destroy();
    }
}

Here we use C++ polymorphism to extend the OVERLAPPED structure. The framework layer does not need to know what kind of I/O event it has received; only the application layer cares. At the same time this avoids an ugly and hard-to-extend switch..case structure.

The most painful thing about asynchronous operations is that code with originally sequential logic is forcibly split into multiple callbacks, which breaks up that sequential flow, and the context variables of each code block cannot be shared. An object therefore has to be created to hold these context variables, which in turn raises the problem of managing the object's lifetime, especially painful in C++ without GC. There are currently two approaches to easing the pain of asynchronous logic. One is to use coroutines (cooperatively scheduled threads) to turn asynchronous logic back into synchronous logic; on Windows, Fibers can be used to implement coroutines. The other is to use closures, originally a feature of functional languages that C++ lacks; fortunately, we can simulate closure behavior in a slightly more laborious way. Coroutines are the best tool for untangling asynchronous logic, especially when a function needs to perform several asynchronous operations in sequence (a case where closures also fall short). On the other hand, coroutines are complicated to implement, and manually scheduling fibers can easily trip you up, so they are somewhat difficult to use for IOCP asynchronous operations. In the end I decided to simulate closure behavior in C++.

The following code demonstrates a typical asynchronous IO usage:

Example of asynchronous sending:

void Client::Send(const char* data, int size)
{
    const char* buf = AllocSendBuffer(data, size);

    struct SendHandler : public IOCPHandler
    {
        Client* client;
        int cookie;

        virtual void Destroy() { delete this; }
        virtual void Complete(ULONG_PTR key, DWORD size) {
            if (!client->CheckAvaliable(cookie))
                return;
            client->EndSend(size);
        }
        virtual void OnError(ULONG_PTR key, DWORD error) {
            if (!client->CheckAvaliable(cookie))
                return;
            client->OnError(E_SocketError, error);
        }
    };

    OverlappedWrapper<SendHandler>* handler = new OverlappedWrapper<SendHandler>();
    handler->cookie = _clientId;
    handler->client = this;
    int sent = 0;
    Error e = _socket.AsyncSend(buf, size, *handler, &sent);
    if (e.Check()) {
        LogError2("SendAsync Failed. %s", FormatAPIError(_socket.CheckError()).c_str());
        handler->Destroy();
        OnError(E_SocketError, _socket.CheckError());
    }
    else if (sent == size) {
        handler->Destroy();
        EndSend(size);
    }
}

In this example, we define a SendHandler object inside the function to simulate the behavior of a closure. We can put the necessary context variables in SendHandler and access them in the subsequent callback. The example also records a cookie in SendHandler: by the time the asynchronous operation returns, the Client object may already have been recycled, and calling EndSend on it would then produce incorrect results, so the cookie is used to check whether the Client object is still the same one that started the asynchronous operation.

Although a closure is not as elegant as the sequential logic of a coroutine, it is still a convenient way to string together the various asynchronous callbacks and share the necessary context variables between them. In addition, the latest C++ standard has native support for closures (lambdas), which makes this much easier to write. If your compiler is new enough, you can try the new C++ feature.
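
As a rough sketch only: assuming a hypothetical AsyncSend overload that accepts a callable (which is not the IOCPHandler-based interface shown above), a C++11 lambda could capture the same context that SendHandler carries by hand, inside the same Client::Send body:

int cookie = _clientId;
Client* self = this;
_socket.AsyncSend(buf, size, [self, cookie](DWORD bytes, DWORD error) {
    // Hypothetical callback-style overload; 'self' and 'cookie' play the roles of
    // SendHandler::client and SendHandler::cookie in the hand-written version above.
    if (!self->CheckAvaliable(cookie))   // the Client object may have been recycled
        return;
    if (error != 0)
        self->OnError(E_SocketError, error);
    else
        self->EndSend(bytes);
});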

 

  

I/O worker threads: single thread vs. multithreading

Most articles about IOCP recommend using multiple worker threads to handle I/O events and setting the number of worker threads to twice the number of CPU cores. As far as I recall, this advice came from early official Microsoft documentation, but in my opinion it is quite misleading. IOCP was designed precisely to handle I/O events with as few threads as possible, so there is nothing wrong with using a single thread, and it simplifies the implementation. If you use multiple threads instead, you must constantly watch out for thread safety and deal with locking; improper locking can cut performance drastically, even below that of a single-threaded program. Some may argue that multithreading exploits multi-core CPUs, but today's CPUs are fast enough for I/O event handling: a single core of a modern CPU is more than enough for the I/O events of a gigabit NIC, and can usually handle two NICs at once, with the bottleneck normally at the NIC itself. If you want to raise I/O throughput with multiple NICs, I recommend scaling out with multiple processes instead: multiple processes can be scaled not only within a single physical server but also across several physical servers, which is far more scalable than multithreading.

Microsoft made that recommendation at the time mainly because, besides I/O handling, there was also business logic to process in the I/O thread, and multithreading can prevent the business logic from blocking I/O. However, putting business logic in the I/O thread is not a good design: it fails to decouple I/O from the business and also limits the scalability of the server. A good design decouples I/O from the business logic and handles the business logic in separate processes or threads; the I/O thread is responsible only for the simplest I/O work and forwards received messages to the business-logic process or thread. My front-end connection server also follows this design.

Disabling the send buffer to implement your own Nagle algorithm

The biggest advantage of IOCP is its flexibility, and disabling the send buffer on the socket is one example. Many people think the point of disabling the send buffer is to save one memory copy; in my opinion that is a trivial gain. The maximum throughput of a mainstream gigabit NIC is under 120 MB/s, while the throughput of an in-memory copy is above 10 GB/s, so copying 120 MB/s of data once consumes only about 1% of memory bandwidth, which is of limited significance.
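
For reference, a minimal sketch of disabling the send buffer with the standard Winsock call ('sock' is assumed to be a connected SOCKET); this is what makes the application-level Nagle policy described below possible:

int zero = 0;
if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, (const char*)&zero, sizeof(zero)) != 0)
{
    // handle WSAGetLastError()
}
// Since packet merging is now done by the application, the kernel's Nagle algorithm
// would typically be switched off as well (an assumption, not stated in the article):
BOOL noDelay = TRUE;
setsockopt(sock, IPPROTO_TCP, TCP_NODELAY, (const char*)&noDelay, sizeof(noDelay));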

In ordinary socket programming, our only choice is whether to enable the Nagle algorithm or not; we cannot choose the policy or fine-tune its parameters. Once the send buffer is disabled, each send operation waits until the data has reached the peer's protocol stack and been acknowledged (ACK) before its completion message is returned, and this gives us the chance to implement a custom Nagle algorithm. For online games that frequently send small packets, enabling the Nagle algorithm can effectively merge small packets and reduce the network I/O burden, but it also increases latency, which hurts gameplay. With the send buffer disabled, we can decide the details of the Nagle policy ourselves: before the previous send has completed, we can choose either to send new data immediately (to reduce latency) or to accumulate it and send it when the previous send finishes or a timeout expires. A more elaborate policy is to tolerate several outstanding send operations and only start accumulating once a threshold is exceeded, so that I/O throughput and latency reach a reasonable balance.
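
A hedged sketch of that accumulate-or-send decision follows; all names and structure here are illustrative assumptions, not the author's code:

#include <vector>

class SendQueue {
public:
    explicit SendQueue(int maxOutstanding)
        : _maxOutstanding(maxOutstanding), _outstanding(0) {}

    // Called by the application whenever it has data to send.
    void Queue(const char* data, int size) {
        _pending.insert(_pending.end(), data, data + size);  // always accumulate first
        if (_outstanding < _maxOutstanding)
            Flush();              // low-latency path: send right away
        // otherwise keep accumulating; Flush() runs when an earlier send completes
        // (or when a timer fires, not shown here).
    }

    // Called from the completion handler of a previous overlapped send.
    void OnSendComplete() {
        --_outstanding;
        if (!_pending.empty())
            Flush();              // merged small packets go out in one send
    }

private:
    void Flush() {
        // Issue one overlapped send covering everything pending, e.g. through the
        // AsyncSend wrapper shown earlier; in a real server the data would first be
        // copied into the shared send ring described in the next section.
        // _socket.AsyncSend(&_pending[0], (int)_pending.size(), ...);
        ++_outstanding;
        _pending.clear();
    }

    std::vector<char> _pending;
    int _maxOutstanding;
    int _outstanding;
};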

 

Sending Buffer Allocation Policy

As mentioned above, once the socket send buffer is disabled, we have to decide how to allocate the send buffers ourselves.

One strategy is to assign a fixed-size ring buffer to each socket. This causes a problem: when the unsent data accumulated in the buffer plus newly submitted data exceeds the buffer size, we have to either block and wait for the previous data to be sent (but the I/O thread must not block) or simply close the socket. A compromise is to make the send buffer as large as possible, but that wastes a lot of memory.

Another strategy is to let all client sockets share one very large ring buffer. Suppose we reserve a 1 GB memory region for this ring buffer; each time data needs to be sent to a client, memory is allocated from the ring buffer, and when the allocation reaches the end it wraps around to the beginning. Because the buffer is so large, even a gigabit NIC needs at least 10 s to send 1 GB, and in practice it takes far longer. So by the time new data is allocated from the beginning again, the old data there has long since been sent, and we need not worry about overwriting it; even if the network is congested and a packet has not been sent within 10 s, we can treat that as a timeout and proactively close the socket.
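
A minimal sketch of such a shared ring allocator (illustrative names, and without the send-timeout bookkeeping described above):

class SendRing {
public:
    explicit SendRing(size_t capacity)
        : _buffer(new char[capacity]), _capacity(capacity), _cursor(0) {}
    ~SendRing() { delete[] _buffer; }

    // Returns a pointer to 'size' contiguous bytes; wraps to the start of the region
    // when the remaining tail is too small to hold the request.
    char* Alloc(size_t size) {
        if (_cursor + size > _capacity)
            _cursor = 0;          // wrap; data here is assumed to have been sent long ago
        char* p = _buffer + _cursor;
        _cursor += size;
        return p;
    }

private:
    char*  _buffer;    // one large region shared by all client sockets
    size_t _capacity;
    size_t _cursor;
};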

 

Socket pool and Object pool Allocation Policy

Reusing sockets is another advantage of IOCP. When the server starts, we can allocate all socket resources according to the expected maximum number of users. In general, each socket needs a corresponding client object that records client information; this object pool can be bound to the sockets and also pre-allocated. Before the service runs, the memory for all of these large objects is pre-allocated and managed with a FreeList, and when a client goes away its resources are recycled back into the pool. This avoids dynamically allocating large objects while the service is running. For the small objects that do need to be allocated on the fly (such as OVERLAPPED structures), we can use a general-purpose allocator such as tcmalloc: it uses a small-object pool algorithm internally, offers excellent allocation performance and stability, and its interface is non-intrusive, so the code can keep using malloc/free and new/delete. Many services suffer from low efficiency and high memory usage after running for a long time, and this is related to frequent allocation and deallocation producing large amounts of memory fragmentation, so managing the server's memory allocation well is essential.
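
A hedged sketch of the pre-allocated FreeList idea (names and structure are illustrative, not the author's actual code; T must be default-constructible):

#include <vector>

template <class T>
class FreeList {
public:
    explicit FreeList(size_t count) : _storage(count) {
        for (size_t i = 0; i < count; ++i)
            _free.push_back(&_storage[i]);   // every object is pre-allocated up front
    }
    T* Acquire() {
        if (_free.empty()) return 0;         // pool exhausted: reject the new client
        T* obj = _free.back();
        _free.pop_back();
        return obj;
    }
    void Release(T* obj) { _free.push_back(obj); }  // recycle when the client goes away

private:
    std::vector<T>  _storage;   // fixed block of objects, never reallocated after startup
    std::vector<T*> _free;
};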

 

To be continued ....

 

In the next article, we will analyze the server performance and bottlenecks through several stress tests and profiling examples.

 
