The TCP/IP protocol suite has four layers: link layer, network layer, transport layer, and application layer.
Ethernet data frames belong to the link layer.
IP packets belong to the network layer.
TCP or UDP packets belong to the transport layer.
The data carried inside TCP or UDP (Data) belongs to the application layer.
Their relationship is: data frame {IP packet {TCP or UDP packet {Data}}}
---------------------------------------------------------------------------------
The maximum length of Data used in applications depends on the underlying limits.
Let's analyze from bottom to top:
1. Link layer: the physical characteristics of Ethernet limit the length of a data frame to between (46 + 18) and (1500 + 18) bytes, where 18 is the frame header plus the frame trailer. The maximum frame content, excluding header and trailer, is therefore 1500 bytes; this is the MTU (Maximum Transmission Unit).
2. Network layer: the IP header occupies 20 bytes, so the room left for the IP payload is 1500 - 20 = 1480 bytes.
3. Transport layer: the UDP header occupies 8 bytes, so the room left for UDP data is 1480 - 8 = 1472 bytes.
Therefore, at the application layer, Data can be at most 1472 bytes. (When a UDP packet carries more than 1472 bytes of data, the sender's IP layer has to fragment it for transmission and the receiver's IP layer has to reassemble the datagram. Because UDP is an unreliable transport protocol, if any fragment is lost, reassembly fails and the whole UDP packet is discarded.)
From the above analysis, in a typical LAN environment the largest UDP data size that avoids fragmentation is 1472 bytes.
However, when programming for the Internet, routers along the path may be configured with different MTU values (smaller than the default). The standard conservative MTU value on the Internet is 576 (the datagram size every IPv4 host must be able to accept), so for UDP programming over the Internet the data length should be at most 576 - 20 - 8 = 548 bytes.
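The arithmetic above can be written down as a small helper. This is only a sketch: the function name `max_udp_payload` is made up for illustration, and it assumes an IPv4 header without options (20 bytes).

```c
enum { IPV4_HEADER_BYTES = 20, UDP_HEADER_BYTES = 8 };

/* Largest UDP payload that fits in one unfragmented IPv4 packet for a given MTU.
   Assumes a 20-byte IPv4 header (no options). Returns 0 if the MTU cannot
   even hold the two headers. */
int max_udp_payload(int mtu)
{
    int room = mtu - IPV4_HEADER_BYTES - UDP_HEADER_BYTES;
    return room > 0 ? room : 0;
}
```

For example, `max_udp_payload(1500)` gives the LAN figure and `max_udp_payload(576)` gives the conservative Internet figure.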
---------------------------------------------------------------------------------
MTU matters a great deal for UDP programming. How can we check the MTU along a route?
On Windows: ping -f -l data_length gateway_ip, for example: ping -f -l 1472 192.168.0.1
If the message "Packet needs to be fragmented but DF set." is displayed, the MTU is smaller than 1500. Keep reducing data_length until the ping succeeds; the MTU of the gateway can then be calculated.
On Linux: ping -c count -M do -s data_length gateway_ip, for example: ping -c 1 -M do -s 1472 192.168.0.1
If the output contains "Frag needed and DF set ......", the MTU is less than 1500. Reduce the data length and test again to find the MTU of the gateway.
Principle: the ping program uses ICMP packets. The ICMP header occupies 8 bytes and the IP header occupies 20 bytes, so the MTU equals the largest data size that gets through plus 28 bytes.
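Instead of lowering the data length by hand, the search can be done by bisection. The sketch below does not run ping itself: `probe_fn` is a hypothetical callback standing in for "send one DF ping with this many data bytes and report whether it got through", and `simulated_link` is a stand-in used only to exercise the search.

```c
#include <stdbool.h>

enum { ICMP_HEADER_BYTES = 8, IPV4_HEADER_BYTES = 20 };

/* Hypothetical probe: true when a DF ping carrying `payload` data bytes succeeds. */
typedef bool (*probe_fn)(int payload, void *ctx);

/* Binary-search the largest payload in [lo, hi] that the probe accepts,
   then add the 28 header bytes to obtain the MTU.
   Returns -1 if even the smallest payload fails. */
int find_mtu(probe_fn probe, void *ctx, int lo, int hi)
{
    if (!probe(lo, ctx))
        return -1;
    while (lo < hi) {
        int mid = lo + (hi - lo + 1) / 2;   /* round up so the range always shrinks */
        if (probe(mid, ctx))
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo + ICMP_HEADER_BYTES + IPV4_HEADER_BYTES;
}

/* Stand-in for a real ping: ctx points to the MTU of an imaginary link. */
bool simulated_link(int payload, void *ctx)
{
    int mtu = *(const int *)ctx;
    return payload + ICMP_HEADER_BYTES + IPV4_HEADER_BYTES <= mtu;
}
```

In a real tool the probe would wrap the ping command shown above (or a raw ICMP socket); that part is left out here on purpose.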
---------------------------------------------------------------------------------
The maximum length of an IP packet is 64 KB (65535 bytes), because the IP header uses two bytes to describe the packet length, and the largest value two bytes can hold is 65535.
Because the IP protocol offers upper-layer protocols the ability to split and reassemble packets, in principle there is no limit on the packet length of a transport layer protocol. In practice there are still limits, because the ID field of an IP packet cannot be infinitely long; for IPv4, the upper limit would seem to be 4G (64K * 64K). Relying on this mechanism, the TCP header has no "packet length" field and depends on the IP layer to delimit each segment. This is why TCP is often called a "stream protocol". When using TCP, developers do not have to worry about packet size: they simply treat the socket as the entry to a data stream and put data into it, and TCP takes care of congestion and flow control.
UDP is different from TCP. The UDP header has a total length field that is also two bytes, so a UDP packet is limited to 65535 bytes in total and fits into a single IP packet; this makes UDP/IP protocol stacks simple and efficient to implement. Subtracting the 8 bytes of the UDP header from 65535 gives a maximum payload of 65527 bytes as far as the UDP length field is concerned. This is the value associated with the SO_MAX_MSG_SIZE option of getsockopt(). For any socket of type SOCK_DGRAM, a single send cannot exceed this value; otherwise an error is returned. Note that over IPv4 the 20-byte IP header must also fit inside the 65535-byte IP packet, so the practical maximum payload is 65535 - 20 - 8 = 65507 bytes.
What happens when the IP packet is handed to the lower-layer protocol? That depends on the data link layer protocol. In general, the data link layer is responsible for splitting an IP packet into smaller frames and reassembling them at the destination. On Ethernet, the frame size is as described above. For IP over ATM, the IP packet is split into ATM cells of 53 bytes each.
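To get a feeling for what a large UDP datagram costs, the number of IPv4 fragments can be estimated as below. This is a sketch under stated assumptions: a 20-byte IPv4 header without options, the UDP header carried as part of the IP payload, and the IPv4 rule that every fragment except the last carries a multiple of 8 data bytes. The function name is invented for illustration.

```c
enum { IPV4_HEADER_BYTES = 20, UDP_HEADER_BYTES = 8, IPV4_MAX_PACKET = 65535 };

/* Number of IPv4 fragments needed to carry one UDP datagram with
   `udp_payload` data bytes over a link with the given MTU.
   Returns -1 for impossible inputs. */
int ipv4_fragment_count(int udp_payload, int mtu)
{
    int room = mtu - IPV4_HEADER_BYTES;               /* IP payload bytes per packet */
    int per_fragment = room & ~7;                     /* non-final fragments: multiple of 8 */
    int ip_payload = udp_payload + UDP_HEADER_BYTES;  /* UDP header is part of the IP payload */

    if (udp_payload < 0 || per_fragment <= 0)
        return -1;
    if (ip_payload > IPV4_MAX_PACKET - IPV4_HEADER_BYTES)
        return -1;                                    /* does not fit the 16-bit total length */
    if (ip_payload <= room)
        return 1;                                     /* fits unfragmented */
    /* The final fragment may use the full room; the others carry per_fragment bytes. */
    return 1 + (ip_payload - room + per_fragment - 1) / per_fragment;
}
```

With an Ethernet MTU of 1500, a 1472-byte payload needs one packet, while the largest possible IPv4 UDP payload (65507 bytes) needs 45 fragments, and losing any one of them loses the whole datagram.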
Some typical MTU values:
Network                             MTU (bytes)
Hyperchannel                        65535
16 Mbit/s token ring (IBM)          17914
4 Mbit/s token ring                 4464
FDDI                                4352
Ethernet                            1500
IEEE 802.3/802.2                    1492
X.25                                576
Point-to-point (low delay)          296
Path MTU: if two hosts communicate across several networks, the link layer of each network may have a different MTU. What matters is not the MTU of the networks the two hosts sit on, but the smallest MTU along the path between them. This is called the path MTU.
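Put as code, the path MTU is simply the minimum over the links on the path. The helper below is an illustrative sketch; in practice the per-link values are not known in advance and have to be discovered by probing.

```c
#include <stddef.h>

/* Path MTU = the smallest MTU among the links on the path.
   Returns -1 for an empty path. */
int path_mtu(const int *link_mtus, size_t link_count)
{
    if (link_count == 0)
        return -1;
    int smallest = link_mtus[0];
    for (size_t i = 1; i < link_count; i++) {
        if (link_mtus[i] < smallest)
            smallest = link_mtus[i];
    }
    return smallest;
}
```

For a path made of Ethernet, an 802.3/802.2 link, an X.25 link, and Ethernet again, the result is the X.25 value of 576.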
Nagle algorithm in TCP transmission
In TCP/IP, however much data is sent, a protocol header is always added in front of it, and the peer must also send an ACK to confirm receipt. To make full use of network bandwidth, TCP wants to send data in blocks that are as large as possible. (Each connection has an MSS parameter, so TCP/IP prefers to send data in MSS-sized blocks.) The Nagle algorithm tries to send data in blocks as large as possible, to avoid flooding the network with many small data blocks.
The basic definition of the Nagle algorithm is that at any time there can be at most one unacknowledged small segment. A "small segment" is a data block smaller than the MSS. "Unacknowledged" means the block has been sent but the peer's ACK confirming receipt has not yet arrived.
1. Nagle algorithm rules:
(1) If the packet length reaches the MSS, it may be sent;
(2) If the packet contains FIN, it may be sent;
(3) If the TCP_NODELAY option is set, it may be sent;
(4) If the TCP_CORK option is not set and all previously sent small packets (shorter than the MSS) have been acknowledged, it may be sent;
(5) If none of the above conditions holds but a timeout occurs (generally 200 ms), it is sent immediately.
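The five rules can be read as a single decision function. The sketch below mirrors the list as written here; it is not the code of any real TCP stack, and the struct and its field names are invented for illustration.

```c
#include <stdbool.h>

/* Illustrative snapshot of what the sender knows when it considers sending. */
struct segment_state {
    int  length;         /* bytes waiting to be sent */
    int  mss;            /* maximum segment size of the connection */
    bool has_fin;        /* the segment carries FIN */
    bool nodelay;        /* TCP_NODELAY is set */
    bool cork;           /* TCP_CORK is set */
    bool small_unacked;  /* an earlier small segment is still unacknowledged */
    bool timed_out;      /* the (roughly 200 ms) timer has expired */
};

/* Returns true when the rules above allow the segment to be sent now. */
bool nagle_may_send(const struct segment_state *s)
{
    if (s->length >= s->mss)
        return true;                      /* rule 1: full-size segment */
    if (s->has_fin)
        return true;                      /* rule 2: FIN */
    if (s->nodelay)
        return true;                      /* rule 3: Nagle disabled */
    if (!s->cork && !s->small_unacked)
        return true;                      /* rule 4: nothing small in flight */
    return s->timed_out;                  /* rule 5: timeout forces the send */
}
```

A small write on an idle connection passes rule 4; a second small write made before the ACK returns is held until the ACK arrives or the timer fires.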
The Nagle algorithm allows only one unacknowledged small packet in the network at a time, however small it is. It is therefore in effect an extended stop-and-wait protocol, except that it stops and waits per packet rather than per byte. The algorithm is driven entirely by TCP's ACK mechanism, which can cause problems. For example, if the peer's ACKs come back quickly, Nagle will not coalesce many packets; network congestion is avoided, but overall network utilization is still low.
The Nagle algorithm is one half of the silly window syndrome (SWS) avoidance algorithm. SWS avoidance prevents small amounts of data from being sent. The Nagle algorithm is the half implemented by the sender. The receiver's half is not to advertise small increases in buffer space, that is, not to announce a small window unless the buffer space has grown significantly. Here, a significant increase is defined as one full-size segment (MSS) or half of the maximum window.
Note: the BSD implementation allows the last small segment of a large write to be sent on an idle connection. That is, when more than one MSS of data is written, the kernel sends n full MSS-sized packets in sequence and then sends the small trailing packet without any delay (assuming the network is not congested and the receive window is large enough).
For example, the client first calls write on the socket to send one int (block A). Because the connection is idle at this point (there are no unacknowledged small segments), the int is sent to the server immediately. The client then calls write again to send '\r\n' (block B). The ACK for block A has not come back yet, so there is one unacknowledged small segment, and block B is not sent immediately; it is held until the ACK for block A arrives (about 40 ms later).
A question is hidden here: why does the ACK for block A arrive only 40 ms later? Because TCP/IP has not only the Nagle algorithm but also a delayed ACK mechanism. When the server receives data, it does not send the ACK to the client immediately. Instead it delays the ACK for a period (call it t), hoping that the server will send response data to the client within t, so that the ACK can piggyback on the response data. In my earlier tests, t was about 40 ms. This explains why '\r\n' (block B) is always sent about 40 ms after block A.
Of course, the delayed ACK time is not always 40 ms. The delayed ACK timer of a TCP connection is generally initialized to a minimum of 40 ms and is then adjusted continuously according to parameters such as the connection's retransmission timeout (RTO) and the interval between the previous and the current received packet. In addition, delayed ACK can be turned off by setting the TCP_QUICKACK option.
For details about delayed ACK, refer to: http://blog.csdn.net/turkeyzhou/article/details/6764389
2. TCP_NODELAY option
By default, data is sent using the Nagle algorithm. This improves network throughput but reduces real-time responsiveness. Highly interactive applications can disable the Nagle algorithm with the TCP_NODELAY option.
In that case, every packet the application hands to the kernel is sent immediately. Note that even with the Nagle algorithm disabled, network transmission is still affected by the delayed ACK mechanism.
3. TCP_CORK option
CORK means a stopper: figuratively, the connection is plugged with a cork so that no data goes out, and the data is sent once the cork is pulled. With this option set, the kernel tries its best to coalesce small packets into one large packet (one MTU) before sending it. Of course, if after a certain time (generally 200 ms; this value has yet to be confirmed) the kernel still has not accumulated a full MTU, the existing data must be sent anyway (data cannot be kept waiting forever).
However, TCP_CORK may not be as perfect as you imagine; CORK does not plug the connection completely. The kernel does not know when the application will send a second batch of data to combine with the first and reach the MTU size, so it imposes a time limit: if no large packet (as close to the MTU as possible) has been assembled within that time, the kernel sends what it has unconditionally. In other words, if the application does not send its small packets at short intervals, TCP_CORK brings no benefit and only costs real-time responsiveness (every small packet is delayed for some time before being sent).
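A common way to use the option is to cork the socket, write the pieces that belong together, and then uncork to flush. The sketch below is Linux-specific (TCP_CORK is not available on Windows); `set_cork` and `send_corked` are hypothetical helper names, and partial sends are simply treated as failures to keep the example short.

```c
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>
#include <sys/types.h>

/* Plug (on = 1) or unplug (on = 0) a TCP socket. Linux-specific.
   Returns 0 on success, -1 on error (errno is set by setsockopt). */
int set_cork(int fd, int on)
{
    return setsockopt(fd, IPPROTO_TCP, TCP_CORK, &on, sizeof(on));
}

/* Send a header and a body as one burst: cork, write both, uncork to flush.
   Returns 0 on success, -1 on any error or short write. */
int send_corked(int fd, const void *hdr, size_t hdr_len,
                const void *body, size_t body_len)
{
    if (set_cork(fd, 1) != 0)
        return -1;
    int ok = send(fd, hdr, hdr_len, 0) == (ssize_t)hdr_len
          && send(fd, body, body_len, 0) == (ssize_t)body_len;
    if (set_cork(fd, 0) != 0)   /* pulling the cork flushes what was queued */
        return -1;
    return ok ? 0 : -1;
}
```

Uncorking right after the last write means the application, not the kernel's time limit, decides when the combined data leaves.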
4. Differences between the Nagle algorithm and the CORK algorithm
The Nagle algorithm and the CORK algorithm are very similar, but their emphasis differs. The Nagle algorithm mainly avoids network congestion caused by too many small packets (where protocol headers make up a large share of the traffic), whereas the CORK algorithm aims to improve network utilization by making the overall share of protocol headers as small as possible. Seen this way, both avoid sending small packets. At the level of user control, the Nagle algorithm is not really controlled by the user's socket: you can only switch it off by setting TCP_NODELAY. The CORK algorithm is likewise switched on or off by setting or clearing TCP_CORK. However, the Nagle algorithm cares about network congestion and sends as soon as all ACKs have come back, while the CORK algorithm lets you care about content: provided consecutive packets are sent at short intervals (this is very important; otherwise the kernel will send the scattered packets for you), even if you write many small pieces of data separately, enabling CORK lets them be coalesced into one packet. With the Nagle algorithm alone you may not be able to achieve this.
In fact, the Nagle algorithm is not very complex. Its main job is to accumulate data, and there are two thresholds: one is that the number of bytes in the buffer reaches a certain amount, the other is that a certain time has elapsed (generally the Nagle algorithm waits 200 ms). When either threshold is reached, the data must be sent. When traffic is heavy, the second condition never comes into play; when small packets are sent, the second threshold takes effect and prevents data from sitting in the buffer indefinitely, which would not be a good thing. Having understood the principle of the TCP Nagle algorithm, we can implement a similar algorithm ourselves. Before doing so, remember one important thing, which is also the main motivation for implementing it ourselves: sometimes data has to be sent urgently. So in addition to the two thresholds above, there is one more condition: urgent data must be sent right away.
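A user-space version of this idea could look like the sketch below: a buffer that is flushed when a byte threshold is reached, when the oldest buffered data has waited long enough, or when the caller marks the data as urgent. All names and the concrete thresholds (1460 bytes, 200 ms) are assumptions chosen for illustration, and the caller supplies the current time in milliseconds.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

enum { BATCH_CAPACITY = 2048, BATCH_FLUSH_BYTES = 1460, BATCH_MAX_WAIT_MS = 200 };

struct batcher {
    unsigned char buf[BATCH_CAPACITY];
    size_t        used;       /* bytes currently buffered */
    long          first_ms;   /* arrival time of the oldest buffered byte */
};

/* Copy data into the batch. Returns false (and copies nothing) if it does
   not fit; the caller should flush and then retry. */
bool batcher_add(struct batcher *b, const void *data, size_t len, long now_ms)
{
    if (len > BATCH_CAPACITY - b->used)
        return false;
    if (b->used == 0)
        b->first_ms = now_ms;
    memcpy(b->buf + b->used, data, len);
    b->used += len;
    return true;
}

/* Decide whether the buffered bytes should be sent now. */
bool batcher_should_flush(const struct batcher *b, long now_ms, bool urgent)
{
    if (b->used == 0)
        return false;                                  /* nothing to send */
    if (urgent)
        return true;                                   /* urgent data skips both thresholds */
    if (b->used >= BATCH_FLUSH_BYTES)
        return true;                                   /* size threshold */
    return now_ms - b->first_ms >= BATCH_MAX_WAIT_MS;  /* time threshold */
}
```

After a flush the caller would hand `buf` and `used` to send() and reset `used` to 0; a periodic timer has to call `batcher_should_flush` so that the time threshold can fire when no new data arrives.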
Consider an application that sends data 10 times per second, 85 to 100 bytes each time. With the Nagle algorithm enabled (the default), I had the sender transmit a fixed 85 bytes per frame at a fixed interval, and at the receiver (used in blocking mode) the amount of data received alternated between 43 and 138 bytes, which may be caused by the time threshold of the algorithm. With the Nagle algorithm disabled, the receiver got 85 bytes per frame.
The Nagle algorithm suits scenarios with small packets and high latency; it does not suit B/S or C/S applications that need fast interaction. A newly created socket uses the Nagle algorithm by default, which can seriously slow down interaction, so such applications need to call setsockopt to set TCP_NODELAY to 1. However, disabling the Nagle algorithm increases the number of small TCP segments and reduces efficiency.
Disable the Nagle algorithm to avoid the performance impact when, for example, a control end has to send a large number of small packets that must go out immediately:
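// Windows (Winsock): disable the Nagle algorithm on this socket.
const char chOpt = 1;
int nErr = setsockopt(pContext->m_Socket, IPPROTO_TCP, TCP_NODELAY, &chOpt, sizeof(char));
if (nErr == -1)
{
    // Report the Winsock error code and give up.
    TRACE(_T("setsockopt() error %d\n"), WSAGetLastError());
    return;
}

// Linux: enable TCP_CORK on sockfd.
int on = 1;
setsockopt(sockfd, SOL_TCP, TCP_CORK, &on, sizeof(on)); // set TCP_CORK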
Efficiency of small data packets transmitted over TCP
Abstract: when small packets are transmitted over TCP, program design matters a great deal. If the design does not take TCP delayed acknowledgment, the Nagle algorithm, and Winsock buffering into account, program performance will suffer badly. This article discusses these issues, presents two cases, and offers some optimizations for transmitting small packets.
Background: when the Microsoft TCP stack receives a packet, it starts a 200-millisecond timer. Once the ACK has been sent, the timer is reset; when the next packet is received, the 200-millisecond timer is started again. To improve application performance on both intranets and the Internet, the Microsoft TCP stack uses the following policy to decide when to send the ACK after a packet is received:
1. If a second packet is received before the 200-millisecond timer expires, an ACK is sent immediately.
2. If there is data to be sent to the peer that is waiting for the ACK, the ACK is piggybacked on that data and sent immediately.
3. When the timer expires, the ACK is sent immediately.
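The three rules can be summarized as a small decision function. This is only an illustration of the policy as described here, not Microsoft's implementation; the types and names are invented, and the order in which rules 1 and 2 are checked is an assumption.

```c
#include <stdbool.h>

enum ack_action { ACK_WAIT, ACK_SEND_NOW, ACK_PIGGYBACK };

/* Illustrative receiver-side state. */
struct ack_state {
    int  unacked_packets;    /* packets received since the last ACK was sent */
    bool has_outgoing_data;  /* data is queued for the peer */
    bool timer_expired;      /* the 200 ms timer has run out */
};

enum ack_action delayed_ack_decide(const struct ack_state *s)
{
    if (s->unacked_packets == 0)
        return ACK_WAIT;         /* nothing to acknowledge */
    if (s->has_outgoing_data)
        return ACK_PIGGYBACK;    /* rule 2: ride along with outgoing data */
    if (s->unacked_packets >= 2)
        return ACK_SEND_NOW;     /* rule 1: second packet arrived in time */
    if (s->timer_expired)
        return ACK_SEND_NOW;     /* rule 3: timer ran out */
    return ACK_WAIT;             /* keep waiting, up to 200 ms */
}
```

The last branch is the costly one for small-packet senders: a single packet with no reply traffic waits for the full timer.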
To keep small packets from congesting the network, the Microsoft TCP stack enables the Nagle algorithm by default. It coalesces the data from the application's send calls and sends it together when the ACK for the previous packet arrives. The exceptions to the Nagle algorithm are:
1. If the data coalesced by the Microsoft TCP stack exceeds the MTU value, it is sent immediately without waiting for the ACK of the previous packet. On Ethernet, the MTU (Maximum Transmission Unit) value for TCP is 1460 bytes (strictly speaking this is the MSS: 1500 minus 20 bytes of IP header and 20 bytes of TCP header).
2. If the TCP_NODELAY option is set, the Nagle algorithm is disabled, and the data passed to each send call is delivered to the network immediately, without delay.
To optimize performance at the application layer, Winsock copies the data passed to send from the application buffer into a Winsock kernel buffer. The Microsoft TCP stack uses a method similar to the Nagle algorithm to decide when the data is actually delivered to the network. The default size of the kernel buffer is 8 KB, and it can be changed with the SO_SNDBUF option. If necessary, Winsock can buffer more data than the SO_SNDBUF size. In most cases, completion of a send call only means that the data has been copied into the Winsock kernel buffer, not that it has actually been delivered to the network. The only exception is when the Winsock kernel buffer is disabled by setting SO_SNDBUF to 0.
Winsock uses the following rules to signal completion of a send call to the application:
1. If the socket is still within its SO_SNDBUF quota, Winsock copies the application's data into the kernel buffer and completes the send call.
2. If the socket has exceeded its SO_SNDBUF quota but only one previously buffered send is in the kernel buffer, Winsock copies the data into the kernel buffer and completes the send call.
3. If the socket has exceeded its SO_SNDBUF quota and more than one previously buffered send is in the kernel buffer, Winsock copies the data into the kernel buffer and then delivers data to the network until the socket drops back within the SO_SNDBUF quota or only one buffered send remains; only then does it complete the send call.
Case 1
A Winsock TCP client needs to send 10000 records to a Winsock TCP server, which saves them to a database. Record sizes range from 20 to 100 bytes. For simple application logic, one possible design is:
1. The client sends in blocking mode, and the server receives in blocking mode.
2. The client sets SO_SNDBUF to 0 so that each record is sent as a separate packet.
3. The server calls recv in a loop, passing a 200-byte buffer so that each record can be obtained in a single recv call.
Performance:
In testing, the client could send only 5 records per second to the server. The 10000 records in total (about … KB) took more than half an hour to reach the server.
Analysis:
Because the client does not set the TCP_NODELAY option, the Nagle algorithm forces the TCP stack to wait for the ACK of the previous packet before sending the next one. But the client has set SO_SNDBUF to 0, which disables the kernel buffer, so the 10000 send calls can only be sent and acknowledged one packet at a time. Each ACK is delayed by 200 ms, for the following reasons:
1. When the server receives a packet, it starts a 200-millisecond timer.
2. The server has no data to send to the client, so the ACK cannot be piggybacked on an outgoing packet.
3. The client cannot send the next packet before it receives the ACK for the previous one.
4. When the timer on the server expires, the ACK is sent to the client.
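The measured figures follow directly from this stall: one record per 200 ms is 5 records per second, and 10000 records then take 2000 seconds, a little over 33 minutes. A minimal sketch of the arithmetic (function names are made up, and the model ignores everything except the delayed ACK wait):

```c
/* Records that get through per second when every record waits for one
   delayed ACK of `ack_delay_ms` milliseconds. */
long records_per_second(long ack_delay_ms)
{
    return ack_delay_ms > 0 ? 1000 / ack_delay_ms : 0;
}

/* Total transfer time in whole seconds under the same assumption. */
long stalled_transfer_seconds(long records, long ack_delay_ms)
{
    return records * ack_delay_ms / 1000;
}
```

`stalled_transfer_seconds(10000, 200) / 60` gives 33 minutes, which matches the "more than half an hour" observed in the test.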
How to improve performance:
This design has two problems. First, there is the delay problem: the client needs to be able to send two packets to the server within 200 milliseconds. Since the client uses the Nagle algorithm by default, it should use the default kernel buffer and not set SO_SNDBUF to 0. Once the data coalesced by the TCP stack exceeds the MTU value, the packet is sent immediately without waiting for the previous ACK. Second, this design calls send once for every small record, and sending such small packets is inefficient. In this case, each record could be padded to 100 bytes and 80 records sent per send call. So that the server knows how many records each batch contains, the client can put a header in front of the records.
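One possible layout for such a batch is sketched below: a 2-byte big-endian record count followed by fixed 100-byte slots, each holding one zero-padded record. The header format, the zero padding, and all names are assumptions made for illustration; the original text only says that a header precedes the records.

```c
#include <stddef.h>
#include <string.h>

enum { RECORD_SLOT_BYTES = 100, RECORDS_PER_BATCH = 80, BATCH_HEADER_BYTES = 2 };

/* Build one batch in `out`: a 2-byte big-endian record count, then `count`
   fixed-size slots with zero-padded records.
   Returns the number of bytes written, or 0 on invalid input. */
size_t pack_batch(const char *const records[], const size_t lengths[],
                  size_t count, unsigned char *out, size_t out_cap)
{
    size_t need = BATCH_HEADER_BYTES + count * RECORD_SLOT_BYTES;
    if (count == 0 || count > RECORDS_PER_BATCH || out_cap < need)
        return 0;

    out[0] = (unsigned char)(count >> 8);
    out[1] = (unsigned char)(count & 0xFF);

    for (size_t i = 0; i < count; i++) {
        if (lengths[i] > RECORD_SLOT_BYTES)
            return 0;                               /* record too large for a slot */
        unsigned char *slot = out + BATCH_HEADER_BYTES + i * RECORD_SLOT_BYTES;
        memset(slot, 0, RECORD_SLOT_BYTES);         /* pad the slot with zeros */
        memcpy(slot, records[i], lengths[i]);
    }
    return need;
}
```

A full batch of 80 records is 8002 bytes, slightly under the default 8 KB kernel buffer, and the whole batch goes out with a single send call.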
Case 2:
A Winsock TCP client opens two connections to a Winsock TCP server that provides a stock quotation service. The first connection is used as a command channel to send stock numbers to the server. The second connection is used as a data channel to receive stock quotations. After both connections are established, the client sends a stock number to the server over the command channel and then waits for the quotation to come back on the data channel. After receiving the first quotation, the client sends the next stock number request to the server. Neither the client nor the server sets the SO_SNDBUF or TCP_NODELAY options.
Performance:
During the test, the client can obtain only five quotation records per second.
Analysis:
This design allows only one outstanding stock request at a time. The first stock number is sent to the server over the command channel, and the quotation returned by the server over the data channel is received immediately. The client then immediately sends the second request; the send call returns at once, because the data is copied into the kernel buffer. However, the TCP stack cannot deliver this packet to the network right away, because the ACK for the previous packet has not been received. 200 milliseconds later, the timer on the server expires and the ACK for the first request is sent back to the client; only then is the second request delivered to the network. The quotation for the second request comes back immediately over the data channel, because by this time the client's timer has also expired and the ACK for the first quotation has already been sent to the server. This process repeats in a cycle.
How to improve performance:
Here, the two-connection design is unnecessary. If a single connection is used both to request and to receive quotations, the ACK for a stock request is piggybacked on the returned quotation and comes back immediately. To improve performance further, the client should send several stock requests in one send call, and the server should return several quotations at a time. If two one-way connections are required for some special reason, both the client and the server should set the TCP_NODELAY option, so that small packets are sent immediately without waiting for the ACK of the previous packet.
Suggestions for improving performance:
The two cases above illustrate some worst-case scenarios. When designing a solution that sends and receives a large number of small packets, follow these suggestions:
1. If the data fragments are not urgent, the application should combine them into larger blocks before calling send. Because the send buffer is very likely to be copied into the kernel buffer, the buffer should not be too large; slightly under 8 KB is usually quite efficient. As long as the Winsock kernel buffer receives a block larger than the MTU value, it sends several packets and leaves only the last one. Apart from that last packet, sending is not held up by the 200-millisecond timer.
2. If possible, avoid one-way socket data streams.
3. Do not set SO_SNDBUF to 0 unless you want to guarantee that packets are delivered to the network as soon as send is called. In fact, the 8 KB buffer suits most situations and does not need to be changed unless testing shows that a new buffer size is more efficient than the default.
4. If data transmission does not need guaranteed reliability, use UDP.
Summary of MTU, TCP, and UDP optimization settings in network programming