Chen Shuo (giantchen_at_gmail)
Muduo full article list: http://blog.csdn.net/Solstice/category/779646.aspx
This article introduces the design and implementation of the input and output buffer in muduo.
In this article, lowercase "buffer" refers to application-layer buffers and buffering techniques in general, while capitalized "Buffer" refers specifically to the muduo::net::Buffer class.
The first two sections of this article have been posted on the muduo English blog (http://muduo.chenshuo.com/2011/04/essentials-of-non-blocking-tcp-network.html).
Muduo I/O model
Section 6.2 of UNPv1 summarizes five I/O models on Unix/Linux: blocking, non-blocking, I/O multiplexing, signal-driven, and asynchronous. These are all single-threaded I/O models.
The C10k problem page introduces five I/O strategies and takes threads into account. (Today C10k is no longer a problem and C100k is not a big one either; C1000k is the real challenge.)
In this multi-core era, threads are unavoidable. How should a server-side network program choose its threading model? I agree with the author of libev: one loop per thread is usually a good model. I have said this more than once; for details, see "Common programming models of multi-threaded servers" and "Application scenarios of multi-threaded servers".
With the one loop per thread model, multi-threaded server programming reduces to designing an efficient and easy-to-use event loop and then running one event loop in each thread (synchronization and mutual exclusion are, of course, still indispensable). On the "efficiency" side there are already many mature examples (libev, libevent, memcached, varnish, lighttpd, and nginx); I hope muduo can make a difference on the "ease of use" side. (Muduo implements the Reactor pattern in modern C++, which is much easier to use than the original Reactor.)
Event loop is the core of non-blocking network programming. In real life, non-blocking is almost always used with IO-multiplexing for two reasons:
Therefore, when I say non-blocking, I actually mean non-blocking + IO-multiplexing; using either one alone is impractical. In addition, every "connection" in this article is a TCP connection, and "socket" and "connection" are used interchangeably.
Of course, non-blocking programming is much harder than blocking programming. See the "TCP network programming essence" section in Chen Shuo's "Muduo network programming examples: preface". Network programming based on an event loop is similar to writing a single-threaded Windows program in C/C++: the program must not block, otherwise the window stops responding; an event handler should give up control as soon as possible and return to the window's event loop.
Why is an application-layer buffer required in non-blocking network programming?
The core idea of non-blocking I/O is to avoid blocking in read(), write(), or any other I/O system call, so that a thread of control can be reused as much as possible and one thread can serve many socket connections. An I/O thread may block only in an IO-multiplexing function such as select()/poll()/epoll_wait(). This makes an application-layer buffer necessary: every TCP socket needs a stateful input buffer and output buffer.
TcpConnection must have an output buffer
Consider a common scenario: the program wants to send 100 KB of data over a TCP connection, but the write() call accepts only 80 KB (how much the operating system accepts is governed by the TCP advertised window; see TCPv1 for details). The program should not wait in place, because there is no telling how long the wait will be (it depends on when the peer receives the data and slides the TCP window). It should give up control as soon as possible and return to the event loop. What, then, happens to the remaining 20 KB?
An application only produces data; it should not care whether the data goes out in one piece or in several. That is the network library's concern. The program just calls TcpConnection::send(), and the network library takes care of the rest: it takes over the remaining 20 KB, stores it in the connection's output buffer, and registers for the POLLOUT event. As soon as the socket becomes writable, the data is sent. This second write() may again fail to write all 20 KB; if bytes remain, the network library keeps watching POLLOUT. Once all 20 KB have been written, the network library must stop watching POLLOUT to avoid a busy loop. (Muduo's EventLoop uses level-triggered epoll; I will discuss the reason later.)
If the program then writes another 50 KB while the output buffer still holds 20 KB of unsent data, the network library must not call write() directly. It should append the 50 KB after the 20 KB and send everything when the socket becomes writable.
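The output-buffer logic described above can be sketched roughly as follows. This is a minimal illustration, not muduo's code: OutputSide, tryWrite, and watchingWritable are hypothetical names, tryWrite stands in for a non-blocking write() on the socket, and the flag stands in for registering or unregistering POLLOUT with the event loop.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>

// Hypothetical stand-in for the sending half of a connection.
struct OutputSide {
  // Stand-in for a non-blocking write(): returns how many bytes the kernel
  // accepted, which may be fewer than requested.
  std::function<std::size_t(const char*, std::size_t)> tryWrite;
  std::string outputBuffer;       // accepted from the program, not yet sent
  bool watchingWritable = false;  // stands in for "POLLOUT is registered"

  // Called by the program; never blocks.
  void send(const std::string& data) {
    std::size_t written = 0;
    // Write directly only when nothing is queued; otherwise the new data
    // would overtake the bytes that are still waiting.
    if (outputBuffer.empty()) {
      written = tryWrite(data.data(), data.size());
    }
    if (written < data.size()) {
      outputBuffer.append(data, written, std::string::npos);
      watchingWritable = true;  // ask the event loop for POLLOUT
    }
  }

  // Called by the event loop when the socket becomes writable.
  void handleWritable() {
    const std::size_t n = tryWrite(outputBuffer.data(), outputBuffer.size());
    assert(n <= outputBuffer.size());
    outputBuffer.erase(0, n);
    if (outputBuffer.empty()) {
      watchingWritable = false;  // stop watching POLLOUT to avoid a busy loop
    }
  }
};
```

With a tryWrite that accepts only 80 of 100 bytes, send() leaves 20 bytes queued and starts watching for writability; a second send() of 50 bytes is queued behind them instead of being written directly.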
If the output buffer still holds unsent data and the program wants to close the connection (from the program's point of view, once TcpConnection::send() has been called the data will be sent sooner or later), the network library cannot close the connection immediately; it must wait until the data has been sent. See the article "Why does muduo's shutdown() not directly close the TCP connection?".
In summary, to keep the program from blocking in write operations, the network library must provide an output buffer for every TCP connection.
TcpConnection must have an input buffer
TCP is a byte-stream protocol without message boundaries. The receiver must handle both "the data received so far does not form a complete message" and "the data of two messages arrives in one read". A common scenario: the sender sends two 10 KB messages (20 KB in total), and the receiver may receive the data in any of the following ways:
When the network library handles a "socket readable" event, it must read all the data in the socket in one go (from the operating system's buffer into the application-layer buffer); otherwise the POLLIN event fires repeatedly and causes a busy loop. (Again, muduo's EventLoop uses level-triggered epoll; I will discuss the reason later.)
The network library therefore has to cope with incomplete data: received data goes into the input buffer first, and the program's business logic is notified only once a complete message has been assembled. This is usually the job of a codec; see the "TCP message framing" code in Chen Shuo's "Muduo network programming example 2: Boost.Asio chat server".
Thus, in TCP network programming, the network library must provide an input buffer for every TCP connection.
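To make the "incomplete data" case concrete, here is a minimal sketch of a codec sitting on top of an input buffer. It is not muduo's codec: the class name LengthPrefixedDecoder, the helper encode(), and the framing (a 4-byte big-endian length followed by the payload) are assumptions chosen for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

const std::size_t kHeaderLen = 4;  // assumed framing: 4-byte length header

// Hypothetical decoder: keeps partial messages in its own input buffer.
class LengthPrefixedDecoder {
 public:
  // Feed whatever bytes arrived; returns the messages that are now complete.
  std::vector<std::string> feed(const char* data, std::size_t len) {
    input_.append(data, len);
    std::vector<std::string> messages;
    while (input_.size() >= kHeaderLen) {
      std::uint32_t bodyLen = 0;
      for (std::size_t i = 0; i < kHeaderLen; ++i) {
        bodyLen = (bodyLen << 8) | static_cast<unsigned char>(input_[i]);
      }
      if (input_.size() < kHeaderLen + bodyLen) {
        break;  // incomplete message: keep the bytes and wait for more
      }
      messages.push_back(input_.substr(kHeaderLen, bodyLen));
      input_.erase(0, kHeaderLen + bodyLen);
    }
    return messages;
  }

 private:
  std::string input_;
};

// Helper for the example: frames one message the same way.
inline std::string encode(const std::string& body) {
  assert(body.size() <= 0xFFFFFFFFu);
  const std::uint32_t n = static_cast<std::uint32_t>(body.size());
  std::string out;
  for (int shift = 24; shift >= 0; shift -= 8) {
    out.push_back(static_cast<char>((n >> shift) & 0xFF));
  }
  return out + body;
}
```

Whether the bytes of two messages arrive all at once or one byte at a time, feed() hands back the same two complete messages, so the business logic never sees a partial one.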
All I/O in muduo is buffered I/O. You never read() or write() a socket yourself; you only operate on the input buffer and output buffer of a TcpConnection. More precisely, you read the input buffer in the onMessage() callback and call TcpConnection::send() to operate on the output buffer indirectly. Normally you do not touch the output buffer directly.
By the way, the prototype of muduo's onMessage() is shown below. It can be a free function or a member function; muduo's TcpConnection only recognizes boost::function<> anyway.
void onMessage(const TcpConnectionPtr& conn, Buffer* buf, Timestamp receiveTime);
For a network program, a simple acceptance test is to deliver the input one byte at a time (200 bytes of input arrive in 200 separate reads, 10 ms apart) and check that the program still behaves correctly. In a muduo program, a codec can separate "receiving a message" from "processing a message"; see the introduction to "codec" in Chen Shuo's "Implementing protobuf decoder and message distributor in muduo".
If a network library provides only the equivalent of a raw char array as its buffer, or no buffer at all, and merely tells the program "a socket is readable / a socket is writable", then I/O buffering becomes the program's own problem, and such a library is inconvenient to use. (You know what I mean.)
Buffer requirements
Muduo buffer is designed to address common network programming requirements. I try to find a balance between ease of use and performance. At present, this balance is more inclined to ease of use.
Muduo buffer design points:
Buffer is essentially a queue: data is written at the tail and read from the head.
Who uses Buffer? Who writes and who reads? According to the analysis above, TcpConnection has two Buffer members: the input buffer and the output buffer.
In short, client code reads from the input buffer and writes to the output buffer; for TcpConnection it is the other way around.
Below is the class diagram of muduo::net::Buffer. Note that, for the convenience of the figures that follow, this class diagram differs slightly from the actual code, but that does not affect my points.
This section does not describe each member function in detail; that is left to the "muduo network programming examples" series. The role of readIndex and writeIndex is described below.
Buffer::readFd()
I wrote in "Muduo network programming examples: preface":
The approach is to prepare a 65536-byte stackbuf on the stack and read with readv(). The iovec has two entries: the first points to the writable bytes in the muduo Buffer, the second to the stackbuf on the stack. If only a little data arrives, all of it goes straight into the Buffer; if it exceeds the Buffer's writable bytes, the excess lands in stackbuf, and the program then append()s the data in stackbuf to the Buffer.
For the code, see http://code.google.com/p/muduo/source/browse/trunk/muduo/net/Buffer.cc#36
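For readers who do not want to open the link, the technique can be sketched as below. This is a simplified illustration of the readv() idea, not a copy of Buffer.cc: the function name readFdSketch and its parameters (a bare vector plus a write index) are assumptions made to keep the example self-contained.

```cpp
#include <sys/uio.h>  // readv, struct iovec
#include <unistd.h>   // ssize_t
#include <cassert>
#include <cstddef>
#include <vector>

// Reads from fd into buf starting at writeIndex. Bytes that do not fit into
// the writable area land in a 64 KB stack buffer and are appended afterwards.
// Returns the result of readv(); on error the caller inspects errno.
inline ssize_t readFdSketch(int fd, std::vector<char>& buf,
                            std::size_t& writeIndex) {
  assert(writeIndex <= buf.size());
  char stackBuf[65536];
  const std::size_t writable = buf.size() - writeIndex;

  struct iovec vec[2];
  vec[0].iov_base = buf.data() + writeIndex;  // first: the buffer's free tail
  vec[0].iov_len = writable;
  vec[1].iov_base = stackBuf;                 // second: overflow area on stack
  vec[1].iov_len = sizeof stackBuf;

  const ssize_t n = ::readv(fd, vec, 2);
  if (n < 0) {
    return n;
  }
  if (static_cast<std::size_t>(n) <= writable) {
    writeIndex += static_cast<std::size_t>(n);  // everything fit in place
  } else {
    const std::size_t extra = static_cast<std::size_t>(n) - writable;
    buf.insert(buf.end(), stackBuf, stackBuf + extra);  // grow and append
    writeIndex = buf.size();
  }
  return n;
}
```

Reading 3000 bytes into a buffer with only 1024 writable bytes takes a single readv() call here: 1024 bytes go directly into the buffer and the remaining 1976 are appended from the stack.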
Using temporary stack space this way avoids the memory waste of giving every connection a huge Buffer, and also avoids the overhead of calling read() repeatedly (usually a single readv() system call reads all the data).
This is a small innovation.
Thread safety?
muduo::net::Buffer is not thread-safe. This is intentional, for the following reasons:
If TcpConnection::send() is called in the I/O thread that owns the TcpConnection, it calls TcpConnection::sendInLoop(), and sendInLoop() operates on the output buffer in the current thread (the I/O thread). If TcpConnection::send() is called in another thread, it does not call sendInLoop() in that thread; instead it transfers the sendInLoop() call to the I/O thread through EventLoop::runInLoop() (sounds like magic?), so sendInLoop() still operates on the output buffer in the I/O thread and there is no thread-safety problem. Of course, transferring a function call across threads means passing its arguments across threads. A simple way is to copy the data, which is absolutely safe (read the code if this is unclear).
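The hand-off described above can be sketched as follows. This is a heavily simplified illustration, not muduo's EventLoop or TcpConnection: LoopSketch, ConnectionSketch, and doPendingTasks() are hypothetical names, and the wake-up of a sleeping loop thread is left out entirely.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <string>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical event loop: it only remembers its owning thread and keeps a
// queue of functors to be run on that thread.
class LoopSketch {
 public:
  LoopSketch() : owner_(std::this_thread::get_id()) {}
  bool isInLoopThread() const { return owner_ == std::this_thread::get_id(); }

  void runInLoop(std::function<void()> task) {
    if (isInLoopThread()) {
      task();  // already in the I/O thread: run right away
    } else {
      std::lock_guard<std::mutex> lock(mutex_);
      pending_.push_back(std::move(task));  // hand over to the I/O thread
    }
  }

  // Called by the owning thread once per loop iteration.
  void doPendingTasks() {
    assert(isInLoopThread());
    std::vector<std::function<void()>> tasks;
    {
      std::lock_guard<std::mutex> lock(mutex_);
      tasks.swap(pending_);
    }
    for (auto& task : tasks) task();
  }

 private:
  const std::thread::id owner_;
  std::mutex mutex_;
  std::vector<std::function<void()>> pending_;
};

class ConnectionSketch {
 public:
  explicit ConnectionSketch(LoopSketch* loop) : loop_(loop) {}

  // Callable from any thread: the message is copied into the functor.
  void send(const std::string& message) {
    loop_->runInLoop([this, message] { sendInLoop(message); });
  }
  const std::string& outputBuffer() const { return outputBuffer_; }

 private:
  void sendInLoop(const std::string& message) {
    assert(loop_->isInLoopThread());  // only the I/O thread touches the buffer
    outputBuffer_ += message;
  }
  LoopSketch* loop_;
  std::string outputBuffer_;  // deliberately has no lock of its own
};
```

Only the task queue is protected by a mutex; the output buffer itself needs none, because every access to it happens in the thread that owns the loop.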
A more efficient approach is to use swap(). This is why one overload of TcpConnection::send() takes Buffer* rather than const Buffer&: it would allow Buffer::swap() to move the data between threads efficiently instead of copying it. (For now this is only an idea that has not been implemented; data is still copied between threads, with a slight performance loss.)
Muduo Buffer data structure
A Buffer contains a vector of char, which is a contiguous block of memory. In addition, Buffer has two data members that refer to elements of the vector. These two indices are of type int rather than char*, to avoid the iterator-invalidation problem. The design of muduo Buffer draws on Netty's ChannelBuffer and libevent 1.4.x's evbuffer; its prependable area, however, may count as a "micro-innovation".
The muduo buffer data structure is as follows:
The two indices divide the contents of the vector into three parts: prependable, readable, and writable. The size of each part is (Formula 1):
prependable = readIndex
readable = writeIndex - readIndex
writable = size() - writeIndex
(The role of prependable will be discussed later.)
readIndex and writeIndex satisfy the following invariant:
0 ≤ readIndex ≤ writeIndex ≤ data.size()
Muduo Buffer has two constants, kCheapPrepend and kInitialSize, which define the initial size of prependable and the initial size of writable. (The initial size of readable is 0.) After initialization, the Buffer's data structure is as follows; the numbers in parentheses are the values of the variables or constants.
The size of each block can be computed from Formula 1 above. A freshly initialized Buffer holds no payload, so readable = 0.
Muduo Buffer operations
Basic read-write cycle
The state of the Buffer after initialization is shown in Figure 1. If someone writes 200 bytes to the Buffer, the layout becomes:
In Figure 3, writeIndex has advanced by 200 bytes, readIndex is unchanged, and the values of readable and writable have changed accordingly.
If someone then read()s and retrieve()s (hereafter simply "reads") 50 bytes from the Buffer, the result is shown in Figure 4. Compared with the previous figure, readIndex has advanced by 50 bytes, writeIndex is unchanged, and the values of readable and writable have changed accordingly (this remark is omitted from here on).
Then another 200 bytes are written: writeIndex advances by 200 bytes and readIndex is unchanged. See Figure 5.
Next, 350 bytes are read in one go. Note that because all the data has been read, readIndex and writeIndex return to their initial positions, ready for the next round. See Figure 6, which is the same as Figure 2.
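The cycle above can be replayed with a small sketch. MiniBuffer is a hypothetical, stripped-down class written for this walkthrough, not muduo's Buffer; the values 8 and 1024 for kCheapPrepend and kInitialSize are the ones I recall from muduo's Buffer.h and should be treated as assumptions here, and growth is deliberately left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

const std::size_t kCheapPrepend = 8;    // assumed initial prependable size
const std::size_t kInitialSize = 1024;  // assumed initial writable size

class MiniBuffer {
 public:
  MiniBuffer()
      : data_(kCheapPrepend + kInitialSize),
        readIndex_(kCheapPrepend),
        writeIndex_(kCheapPrepend) {}

  std::size_t readIndex() const { return readIndex_; }
  std::size_t writeIndex() const { return writeIndex_; }
  std::size_t readableBytes() const { return writeIndex_ - readIndex_; }
  std::size_t writableBytes() const { return data_.size() - writeIndex_; }

  // Write at the tail; only writeIndex moves.
  void append(const std::string& bytes) {
    assert(bytes.size() <= writableBytes());  // growth is not part of this sketch
    std::copy(bytes.begin(), bytes.end(), data_.begin() + writeIndex_);
    writeIndex_ += bytes.size();
  }

  // Read from the head: copy out len bytes and advance readIndex.
  std::string read(std::size_t len) {
    assert(len <= readableBytes());
    std::string out(data_.begin() + readIndex_,
                    data_.begin() + readIndex_ + len);
    readIndex_ += len;
    if (readIndex_ == writeIndex_) {
      // Everything consumed: rewind both indices for the next round.
      readIndex_ = writeIndex_ = kCheapPrepend;
    }
    return out;
  }

 private:
  std::vector<char> data_;
  std::size_t readIndex_;
  std::size_t writeIndex_;
};
```

Replaying the steps (write 200, read 50, write 200, read 350) moves the indices from (8, 8) to (8, 208), (58, 208), (58, 408), and back to (8, 8).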
This process can be viewed as the sender sending two messages, of 50 bytes and 350 bytes respectively; the receiver receives the data in two batches of 200 bytes each, splits it into messages, and calls back the client code twice.
Automatic growth
Muduo Buffer is not fixed-length; it grows automatically, which is a direct benefit of using vector.
Assume the current state is as shown in Figure 7. (This is the same as Figure 5 above.)
The client code writes 1000 bytes at once, but only 624 bytes are currently writable, so the Buffer grows automatically to hold all the data. The result is shown in Figure 8. Note that readIndex has returned to the front, so that prependable equals kCheapPrependable. Because the vector has reallocated its memory, pointers to its elements are invalidated, which is why readIndex and writeIndex are integer subscripts rather than pointers.
Then 350 bytes are read and readIndex advances; see Figure 9.
Finally, after the remaining 1000 bytes are read, readIndex and writeIndex return to kCheapPrependable, as shown in Figure 10.
Note that the Buffer does not shrink: the next write of 1350 bytes will not reallocate memory. In other words, the size() of muduo Buffer is adaptive. Its initial value is 1 KB; if the program often sends and receives 10 KB of data, size() grows to 10 KB after a few rounds and then stays there. This avoids wasting memory (some programs may need only 4 KB of buffer) and avoids repeated memory allocation. Of course, client code can shrink() the Buffer manually.
size() and capacity()
Another benefit of using vector is that its capacity() mechanism reduces the number of memory allocations. For example, if the program repeatedly writes 1 byte, muduo Buffer does not allocate memory each time: the capacity() of a vector grows exponentially, which makes the amortized complexity of push_back() constant. For example, after the first growth, size() is just large enough for the write, as shown in Figure 11. At that point, however, the vector's capacity() is larger than size(), so writing up to capacity() - size() more bytes later will not reallocate memory. See Figure 12.
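Both properties of std::vector can be checked with a few lines. This is only a small experiment, not part of muduo; the function names and the sizes used are made up for the illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Counts how often a vector reallocates while it grows one byte at a time.
inline std::size_t countReallocations(std::size_t totalBytes) {
  std::vector<char> v;
  std::size_t reallocations = 0;
  for (std::size_t i = 0; i < totalBytes; ++i) {
    const std::size_t before = v.capacity();
    v.push_back('x');
    if (v.capacity() != before) {
      ++reallocations;  // capacity changed, so memory was reallocated
    }
  }
  return reallocations;
}

// resize() within the current capacity() keeps the same storage.
inline bool resizeWithinCapacityKeepsStorage() {
  std::vector<char> v;
  v.reserve(2048);
  assert(v.capacity() >= 2048);
  v.resize(1358);
  const char* before = v.data();
  v.resize(1558);  // 200 more bytes, still within capacity()
  return v.data() == before;
}
```

Because capacity grows geometrically, writing 100,000 bytes one at a time triggers only a few dozen reallocations on common standard library implementations rather than 100,000; the exact count depends on the library's growth factor.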
Careful readers may notice that the capacity() mechanism is not perfect and leaves room for optimization. Specifically, vector::resize() initializes the new memory (memset/bzero), which we do not need because the memory is about to be filled anyway. For example, writing 200 bytes on top of Figure 12 does not reallocate memory, since capacity() is large enough; that is good. But vector::resize() first sets those 200 bytes to 0 (Figure 13), and only then does muduo Buffer fill in the data (Figure 14). This is wasteful, but I do not plan to optimize it unless it proves to be a performance bottleneck. (Readers proficient in the STL may suggest appending directly to the vector, for example with vector::insert(), to avoid the waste, but writeIndex and size() are not necessarily aligned, which brings other troubles.)
Google protobuf has a function called STLStringResizeUninitialized that does exactly this.
Internal move
Sometimes, after a number of reads and writes, readIndex ends up fairly far back, leaving a large prependable area. See Figure 14.
What if we now want to write 300 bytes while writable is only 200 bytes? In this case muduo Buffer does not reallocate memory; it first moves the existing data to the front to free up writable space. See Figure 15.
Then the 300 bytes can be written. See Figure 16.
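Automatic growth and the internal move can be put into one "make space" routine. The sketch below follows the behaviour described in this article's figures (readable data ends up right behind the prepend area in both cases); it is not muduo's actual code, which may differ in detail, and SpaceSketch, makeSpace(), and the value 8 for kCheapPrepend are assumptions for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

const std::size_t kCheapPrepend = 8;  // assumed size of the prepend area

struct SpaceSketch {
  std::vector<char> data;
  std::size_t readIndex;
  std::size_t writeIndex;

  std::size_t readable() const { return writeIndex - readIndex; }
  std::size_t writable() const { return data.size() - writeIndex; }

  // Guarantees at least len writable bytes.
  void makeSpace(std::size_t len) {
    if (writable() >= len) {
      return;  // already enough room at the tail
    }
    if (readIndex > kCheapPrepend) {
      // Internal move: slide the readable bytes to the front, turning the
      // unused prependable space into writable space. The destination starts
      // before the source, so a forward std::copy is safe here.
      const std::size_t n = readable();
      std::copy(data.begin() + readIndex, data.begin() + writeIndex,
                data.begin() + kCheapPrepend);
      readIndex = kCheapPrepend;
      writeIndex = readIndex + n;
    }
    if (writable() < len) {
      data.resize(writeIndex + len);  // still too small: grow the vector
    }
    assert(writable() >= len);
  }
};
```

With 300 readable and 200 writable bytes in a 1032-byte vector (readIndex 532, writeIndex 832, numbers chosen for illustration), makeSpace(300) only moves data and size() stays at 1032. Starting from readIndex 58 and writeIndex 408, makeSpace(1000) has to grow the vector to 1358 bytes.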
The reason is that reallocating memory would also require copying the data to the newly allocated area, at an even higher cost.
Prepend
As mentioned earlier, muduo Buffer has a small innovation (perhaps not an innovation; I remember seeing a similar practice somewhere but have forgotten the source): it provides a prependable area so that the program can add a few bytes in front of the data at very low cost.
For example, suppose the program uses a fixed 4-byte field for the message length (the LengthHeaderCodec in "Muduo network programming example 2: Boost.Asio chat server"). I want to serialize a message but do not know its length in advance, so I keep calling append() until serialization is complete (Figure 17, 200 bytes written) and then prepend the message length in front of the data (Figure 18, the number 200 is prepended to the head).
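A minimal sketch of this append-then-prepend sequence is shown below. PrependSketch and prependInt32() are hypothetical names, the 8-byte prepend area and 1024-byte initial size are assumed values, and the length is written in big-endian (network) byte order as an assumption about the wire format.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

const std::size_t kCheapPrepend = 8;  // assumed size of the prepend area

class PrependSketch {
 public:
  PrependSketch()
      : data_(kCheapPrepend + 1024),
        readIndex_(kCheapPrepend),
        writeIndex_(kCheapPrepend) {}

  void append(const std::string& bytes) {
    assert(bytes.size() <= data_.size() - writeIndex_);  // growth omitted
    std::copy(bytes.begin(), bytes.end(), data_.begin() + writeIndex_);
    writeIndex_ += bytes.size();
  }

  // Puts bytes in front of the readable data; no existing byte is moved.
  void prepend(const char* bytes, std::size_t len) {
    assert(len <= readIndex_);
    readIndex_ -= len;
    std::copy(bytes, bytes + len, data_.begin() + readIndex_);
  }

  // Prepends a 32-bit value in big-endian byte order.
  void prependInt32(std::uint32_t value) {
    const char be[4] = {static_cast<char>((value >> 24) & 0xFF),
                        static_cast<char>((value >> 16) & 0xFF),
                        static_cast<char>((value >> 8) & 0xFF),
                        static_cast<char>(value & 0xFF)};
    prepend(be, sizeof be);
  }

  std::size_t readIndex() const { return readIndex_; }
  std::string readable() const {
    return std::string(data_.begin() + readIndex_,
                       data_.begin() + writeIndex_);
  }

 private:
  std::vector<char> data_;
  std::size_t readIndex_;
  std::size_t writeIndex_;
};
```

After appending a 200-byte payload and calling prependInt32(200), readIndex has moved from 8 to 4 and the readable region is the 4-byte header followed by the payload, with no byte of the payload copied or shifted.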
By reserving kCheapPrependable bytes of space, the client code becomes simpler: a simple trade of space for time.
Other design schemes
Here are some other possible designs for an application-layer buffer.
No vector<char>?
If you are averse to the STL, you can manage the memory yourself and use four pointers as the buffer's data members. The data structure is shown in Figure 19.
To be honest, I do not see how this is any better than vector: the code becomes more complex, and the performance does not improve noticeably.
If you give up the "contiguity" requirement, you can use a circular buffer, which saves a little memory copying (no "internal move").
Zero copy?
If performance requirements are high and you cannot accept copy() and resize(), consider implementing a piecewise-contiguous zero-copy buffer and pairing it with gather/scatter I/O. The data structure is shown in Figure 20; this is the design of libevent 2.0.x. The mbuf scheme in the BSD TCP/IP implementation described in TCPv2 is similar, and Linux's sk_buff is presumably similar too. The details differ, but the basic idea is the same: instead of requiring the data to be contiguous in memory, link the data blocks together with a linked list.
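The idea of a chained buffer combined with gather I/O can be sketched in a few lines. This is only an illustration of the principle, unrelated to the actual code of libevent, mbuf, or sk_buff: ChainSketch and writeTo() are hypothetical names, and partial writes and the IOV_MAX limit are ignored.

```cpp
#include <sys/uio.h>  // writev, struct iovec
#include <unistd.h>   // ssize_t
#include <cassert>
#include <cstddef>
#include <list>
#include <string>
#include <utility>
#include <vector>

// Hypothetical chained buffer: every append becomes its own block, and the
// blocks are never merged into one contiguous area.
class ChainSketch {
 public:
  void append(std::string block) {
    total_ += block.size();
    blocks_.push_back(std::move(block));  // take ownership, no merging
  }
  std::size_t size() const { return total_; }

  // Gather write: one writev() call sends all blocks in order.
  ssize_t writeTo(int fd) const {
    assert(fd >= 0);
    std::vector<struct iovec> vec;
    for (const std::string& block : blocks_) {
      struct iovec v;
      v.iov_base = const_cast<char*>(block.data());
      v.iov_len = block.size();
      vec.push_back(v);
    }
    return ::writev(fd, vec.data(), static_cast<int>(vec.size()));
  }

 private:
  std::list<std::string> blocks_;
  std::size_t total_ = 0;
};
```

The price shows up on the reading side: a message may straddle two blocks, so a parser can no longer assume that it sees one contiguous range of bytes.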
Of course, the price of high performance is obscure code, and since the buffer is no longer contiguous, parsing messages becomes slightly more troublesome. If your program only handles protobuf Messages, this is not a problem, because protobuf has the ZeroCopyInputStream interface: implement this interface and parsing can be left to protobuf Message.
Is performance a problem? It depends on what you compare with
Some readers may ask: with so much room left for optimization, is muduo Buffer's performance too low? My answer is: it can be optimized, but optimizing it is not necessarily worthwhile.
Muduo is designed for developing distributed programs inside a company. In other words, it is meant for writing dedicated servers such as a Sudoku server or a game server, not a general-purpose httpd, ftpd, or web proxy. The former usually carries heavier business logic, while the latter emphasizes high concurrency and high throughput.
Take Sudoku as an example. If solving one Sudoku puzzle takes 0.2 ms and the server has 8 cores, then ideally it can solve at most 40,000 puzzles per second. Each Sudoku request is under 100 bytes (a 9x9 grid is only 81 bytes, and with the header it can be kept under 100 bytes), so a throughput of 100 x 40,000 = 4 MB per second is enough to saturate the server's CPUs. In that situation, optimizing the number of memory copies in Buffer seems pointless.
As another example, the raw throughput of the now-common Gigabit Ethernet is 125 MB/s; after deducting the Ethernet, IP, and TCP headers, application-layer throughput is about 115 MB/s. The DDR2/DDR3 memory commonly found in servers has a bandwidth of at least 4 GB/s, more than 30 times that of Gigabit Ethernet. In other words, copying a few KB or a few dozen KB of data in memory is not a problem: limited by Ethernet latency and bandwidth, the programs on other machines that talk to this program will not notice any performance difference.
As a final example, if your service program has to talk to a database, the bottleneck is often the DB, and optimizing the service program itself does not necessarily improve performance (one database read can often wipe out all the low-level optimization you have done). In that case it is better to put the effort into optimizing the database.
Another difference between dedicated and general-purpose service programs is the benchmark they are measured against. If you set out to write an httpd, people will naturally compare it with the current best, nginx, and performance is judged immediately. But if you write a service program implementing your company's internal business (such as distributed storage, search, microblogging, or short URLs), there is no open-source implementation with the same functionality to compare against, so you need not pour all your energy into optimization; it is enough for each version to be better than the last. Implement the required functionality correctly first, put it into production, and then optimize according to the real load. That is probably more effective than tuning blindly at the coding stage.
One of muduo's design goals is to saturate Gigabit Ethernet, that is, to send and receive about 120 MB of data per second. This is easy to achieve without any special effort.
If memory bandwidth does become the problem, your application is extremely demanding, and perhaps you should consider moving it into the Linux kernel instead of trying all kinds of optimizations in user space. After all, true zero copy is possible only when the program lives in the kernel; otherwise there is always a memory copy between kernel space and user space. If even that does not meet your requirements, either write a new kernel yourself or operate the network adapter directly with an FPGA or ASIC to build your high-performance server.
(To be continued)