TCP/IP in Detail: TCP/UDP Optimization Settings Summary & an Introduction to MTU

Source: Internet
Author: User
Tags: ack

The first thing to look at is the TCP/IP protocol suite, which involves four layers: the link layer, the network layer, the transport layer, and the application layer.


Ethernet data frames at the link layer
IP packets at the network layer
TCP or UDP segments at the transport layer
application data carried inside TCP or UDP at the application layer
Their nesting relationship is: data frame { IP packet { TCP or UDP segment { data } } }
---------------------------------------------------------------------------------
The maximum length of data usable at the application layer depends directly on the constraints of the layers below it.
Let's analyze it from the bottom up:
1. At the link layer, the physical nature of Ethernet limits the data frame to between 46+18 and 1500+18 bytes, where the 18 bytes are the frame header and trailer. In other words, the frame payload (excluding the header and trailer) is at most 1500 bytes: this is the MTU (Maximum Transmission Unit).
2. At the network layer, the IP header takes 20 bytes, so at most 1500-20 = 1480 bytes remain for the IP payload.
3. At the transport layer, the UDP header takes another 8 bytes, leaving 1480-8 = 1472 bytes.
So, at the application layer, your data can be at most 1472 bytes long.

(When the data in a UDP packet exceeds this limit (1472 bytes), the sender's IP layer must fragment it for transmission, and the receiver's IP layer must reassemble the fragments. Because UDP is an unreliable transport protocol, if fragmentation leads to a reassembly failure, the whole UDP datagram is discarded.)

  
Judging from the above analysis: in a normal LAN environment, UDP can carry at most 1472 bytes of data without triggering fragmentation and reassembly.
In Internet programming, however, routers along the path may be configured with smaller MTUs than the Ethernet default, and 576 bytes is the datagram size every Internet host is required to accept. Therefore, for UDP programming over the Internet, it is best to keep the data length within 576-20-8 = 548 bytes.
---------------------------------------------------------------------------------
The MTU matters a great deal for UDP programming. How do you discover the MTU along a route?
On Windows: ping -f -l <size> <host>, for example: ping -f -l 1472 192.168.0.1
If the reply says "Packet needs to be fragmented but DF set", the path MTU is less than 1500; keep lowering the data length until the ping succeeds, and you can work out the path's MTU.
On Linux: ping -c 1 -M do -s <size> <host>, for example: ping -c 1 -M do -s 1472 192.168.0.1
If the reply says "Frag needed and DF set ...", the path MTU is less than 1500; again, lower the size step by step to measure the path's MTU.

Principle: ping uses ICMP packets. The ICMP header takes 8 bytes and the IP header takes 20 bytes, so add 28 bytes to the data size to obtain the MTU value.

---------------------------------------------------------------------------------

The maximum length of an IP packet is 64K bytes (65535), because the total length field in the IP header is 2 bytes, and the largest number 2 bytes can express is 65535.

Because the IP protocol provides fragmentation and reassembly for the upper-layer protocols, the packet length of a transport-layer protocol is, in principle, unrestricted.

In practice there are still limits, because the IP Identification field is not infinitely long; for IPv4 it is 16 bits, which is where the often-quoted upper bound of 4G (64K x 64K) comes from.

Relying on this mechanism, the TCP header carries no "packet length" field; TCP depends entirely on the IP layer to handle fragmentation and reassembly.

This is why TCP is often called a "stream protocol": developers using the TCP service do not have to care about packet sizes. Treat the socket as the entry point of a data stream, put the data in, and the TCP protocol itself takes care of segmentation along with congestion and flow control.



UDP is different from TCP: the UDP header contains a total length field, likewise 2 bytes, so the total length of a UDP datagram is limited to 65535. This lets it fit into an IP packet, keeping the UDP/IP protocol stack simple and efficient. Subtracting the 8 bytes occupied by the UDP header itself, the maximum payload of a UDP datagram is only 65527 bytes.

This is the value returned for SO_MAX_MSG_SIZE when you call getsockopt() on Windows: no matter how the SOCK_DGRAM socket is used, the data in a single send cannot exceed this value, or you will get an error.

So how is an IP packet handled when it is handed to the lower-layer protocol? That depends on the data link layer protocol: in general, the data link layer is responsible for cutting the IP packet into smaller frames and reassembling them at the destination. On Ethernet, the frame size is limited as described above. With IP over ATM, the IP packet is cut into 53-byte ATM cells.

Some typical MTU values are:

Network                             MTU (bytes)
Hyperchannel                        65535
16 Mb/s Token Ring (IBM)            17914
4 Mb/s Token Ring (IEEE 802.5)      4464
FDDI                                4352
Ethernet                            1500
IEEE 802.3/802.2                    1492
X.25                                576
Point-to-Point (low delay)          296

Path MTU: when two hosts communicate across multiple networks, each network's link layer may have a different MTU. What matters is not the MTU of the networks the two hosts sit on, but the smallest MTU anywhere along the path between the two communicating hosts. This is called the path MTU.

The Nagle algorithm in TCP transmission

In the TCP/IP protocol, no matter how little data is sent, a protocol header is always prepended to it; likewise, when the peer receives the data, it must send back an ACK to confirm it. To make the best use of network bandwidth, TCP always wants to send segments that are as large as possible.

(Each connection negotiates an MSS parameter, so TCP/IP wants to send data in MSS-sized chunks whenever it can.)

The Nagle algorithm was designed to send data in chunks as large as possible and avoid flooding the network with many small packets.

The basic rule of the Nagle algorithm is: at any given moment, at most one small segment may be unacknowledged.

A "small segment" is a data block smaller than the MSS; "unacknowledged" means a block that has been sent out but for which the peer's ACK confirming receipt has not yet arrived.

1. Rules of the Nagle algorithm:

(1) If the packet length reaches the MSS, send it.

(2) If it contains a FIN, send it.

(3) If the TCP_NODELAY option is set, send it.

(4) If the TCP_CORK option is not set, and all previously sent small packets (shorter than the MSS) have been acknowledged, send it.

(5) If none of the conditions above is met but a timeout occurs (generally 200 ms), send immediately.

The Nagle algorithm allows only one un-ACKed packet to exist in the network at a time, regardless of the packet's size; it is effectively an extended stop-and-wait protocol, only based on packets rather than bytes. Because it is driven entirely by TCP's ACK mechanism, this leads to some problems: if the peer's ACKs come back very quickly, Nagle does not actually coalesce many packets, so although network congestion is avoided, the overall utilization of the network remains very low.

The Nagle algorithm is one half of the Silly Window Syndrome (SWS) avoidance scheme. SWS avoidance prevents sending tiny amounts of data; the Nagle algorithm is its implementation on the sender side. On the receiver side, the receiver does not advertise very small increases in buffer space: small windows are not advertised unless the buffer space grows significantly, where "significant" growth is defined as a full segment size (MSS) or more than half of the maximum window.

Note: the BSD implementation allows a large write on an idle connection to send its final short segment; that is, when more than one MSS of data is written, the kernel first sends N full-MSS packets and then sends the trailing short packet, with no delay in between (provided the network is not congested and the receive window is large enough).

For example, a client first calls a socket write operation to put an int of data (call it block A) onto the network. Because the connection is idle (that is, there are no unacknowledged segments), the int is sent to the server immediately. The client then calls write again to send "\r\n" (call it block B). At this point, the ACK for block A has not returned, so there is already one unacknowledged small segment, and block B is not sent immediately; it waits until the ACK for block A is received (about 40 ms later), and only then is B sent. That is what the whole process looks like.

There is also a hidden question here: why is the ACK for block A not received until 40 ms later? This is not the Nagle algorithm but another TCP mechanism, acknowledgment delay. When the server receives data, it does not send an ACK to the client immediately; instead it delays the ACK for a period of time (call it T), hoping that within T the server will have response data to send to the client, so the ACK can be piggybacked on that response. In this example T is about 40 ms. That explains why "\r\n" (block B) always trails block A by 40 ms.

Of course, the 40 ms TCP acknowledgment delay is not constant. The delayed-acknowledgment time of a TCP connection is generally initialized to a minimum of 40 ms and is then continuously adjusted based on the connection's retransmission timeout (RTO) and the arrival intervals of recently received packets.

You can also cancel the acknowledgment delay by setting the TCP_QUICKACK option.

A detailed introduction to TCP acknowledgment delay: http://blog.csdn.net/turkeyzhou/article/details/6764389

2. The TCP_NODELAY option

By default, data is sent using the Nagle algorithm. This improves network throughput but reduces real-time responsiveness. Some highly interactive applications cannot tolerate that, so they use the TCP_NODELAY option to disable the Nagle algorithm.

At that point, every packet the application submits to the kernel is sent out immediately.

Note that even though the Nagle algorithm is disabled, transmission is still affected by the TCP acknowledgment delay mechanism.

3. The TCP_CORK option

A "cork" is a plug: the image is that the connection is corked up so data is not sent out at first, and is released once the cork is pulled. With this option set, the kernel tries to stitch small packets together into one large packet (one MTU) before sending. Of course, if a certain amount of time passes (typically 200 ms, though this value needs confirmation) and the kernel still has not assembled a full MTU, it must send the existing data; it cannot hold the data waiting forever.

However, the implementation of TCP_CORK may not be as perfect as you'd expect: the cork does not completely plug the connection.

The kernel cannot know whether the application layer will ever send a second batch of data that would bring the first batch up to the MTU size. It therefore imposes a time limit, and if the data has not been stitched into a large packet (approaching the MTU) within it, the kernel sends unconditionally. This means that if the application's packets do not follow each other closely enough, TCP_CORK achieves nothing and only costs you real-time delivery (every packet is delayed for some time before it is sent).

4. Differences between the Nagle algorithm and the cork algorithm

The Nagle algorithm and the cork algorithm are very similar, but their focus differs. The Nagle algorithm mainly avoids network congestion caused by too many small packets (where protocol headers make up a large share of the bytes), while the cork algorithm aims to improve network utilization by making the overall header overhead as small as possible. The two look consistent in that both avoid sending small packets, but they differ in the level of user control. The Nagle algorithm offers the user socket no control beyond a switch: you can only disable it by setting TCP_NODELAY. It cares about congestion, and sends only once the outstanding ACK has returned. The cork algorithm, enabled or disabled by setting or clearing TCP_CORK, lets you care about the content: provided the interval between successive packets is very short (this is important; otherwise the kernel will send your scattered packets separately), even data written as several small packets can be spliced into one packet by the cork algorithm, which the Nagle algorithm might not manage.

In fact, the Nagle algorithm is not very complex. Its main job is to accumulate data, and there are really two thresholds: one is the number of bytes in the buffer reaching a certain amount, the other is waiting a certain time (a typical Nagle implementation waits 200 ms). Whichever threshold is reached first, the data is sent. In general, if the data flow is heavy, the second condition never comes into play; but when small packets are being sent, the time threshold takes effect and prevents data from being cached in the buffer indefinitely, which would not be a good thing. Once you understand the principle of TCP's Nagle algorithm, you can implement a similar algorithm yourself. Before starting, remember one important point, which is also the main motive for implementing Nagle by hand: sometimes data is urgent and must be sent now. So, on top of the two thresholds above, add a third: urgent data is sent right away.

Take an application of mine that sends data 10 times per second, with each send fixed in the 85-100 byte range. With the default Nagle algorithm enabled, the sender transmitted a fixed 85 bytes per frame at 100 ms intervals, and the receiving end (using blocking mode) saw alternating reads of 43 and 138 bytes, probably an artifact of the algorithm's time threshold. With the Nagle algorithm disabled, the receiving end was guaranteed a complete 85-byte frame on every receive.

The Nagle algorithm suits situations with small packets and high latency, but not interactive B/S or C/S applications that need fast round trips.

When a socket is created, the Nagle algorithm is used by default, which can severely reduce interactive responsiveness; so use the setsockopt function to set TCP_NODELAY to 1. But canceling the Nagle algorithm increases the number of small TCP segments, and efficiency may drop.

Turn off the Nagle algorithm where it would hurt performance, for example when the controlling side sends very small packets that must go out immediately:

const char chOpt = 1;

int nErr = setsockopt(pContext->m_Socket, IPPROTO_TCP, TCP_NODELAY, &chOpt, sizeof(char));

if (nErr == -1)
{
    TRACE(_T("setsockopt() error %d\n"), WSAGetLastError());
    return;
}

setsockopt(sockfd, SOL_TCP, TCP_CORK, &on, sizeof(on)); /* set TCP_CORK */

Efficiency problems when transmitting small packets over TCP

Summary: When using TCP to transmit small packets, the design of the program matters a great deal. If TCP delayed acknowledgment, the Nagle algorithm, and Winsock buffering are not taken into account in the design, performance suffers badly. This section discusses these issues through two cases and gives some optimized designs for transmitting small packets.

Background: When the Microsoft TCP stack receives a packet, it starts a 200-millisecond timer. When the ACK for that packet goes out, the timer is reset, and a fresh 200-millisecond timer starts when the next packet arrives. To improve transmission performance on intranets and on the Internet, the Microsoft TCP stack uses the following policy to decide when to send an ACK after receiving a packet:
1. If a second packet is received before the 200-millisecond timer expires, the ACK is sent immediately.
2. If there happens to be data to send back to the peer at that moment, the ACK is piggybacked on that data packet and sent immediately.
3. When the timer expires, the ACK is sent immediately.


Nagle algorithm:
1. If the Microsoft TCP stack has coalesced more than an MTU's worth of data, that data is sent immediately, without waiting for the previous packet's ACK. On Ethernet, TCP's maximum segment size (MSS) is 1460 bytes.
2. If the TCP_NODELAY option is set, the Nagle algorithm is disabled, and packets passed to send by the application are posted to the network immediately, without delay.
To optimize performance at the application layer, Winsock copies the data passed to send from the application's buffer into the Winsock kernel buffer, and the Microsoft TCP stack then uses heuristics like the Nagle algorithm to decide when to actually post the data to the network.


The default size of the kernel buffer is 8K, and the SO_SNDBUF option can change it. If necessary, Winsock can buffer more data than the SO_SNDBUF size. In most cases, the completion of a send call only indicates that the data has been copied into the Winsock kernel buffer, not that the data has actually been posted to the network.

The only exception is when the Winsock kernel buffer has been disabled by setting SO_SNDBUF to 0.

Winsock uses the following rules to indicate a completed send call to the application:
1. If the socket is still within the SO_SNDBUF limit, Winsock copies the application's data into the kernel buffer, and the send call completes.
2. If the socket has exceeded the SO_SNDBUF limit and only one previously buffered send remains in the kernel buffer, Winsock copies the data into the kernel buffer, and the send call completes.
3. If the socket has exceeded the SO_SNDBUF limit and the kernel buffer holds more than one buffered send, Winsock copies the data into the kernel buffer and then posts data to the network until the socket drops back within the SO_SNDBUF limit or only one pending send remains; only then does the send call complete.

Case 1
A Winsock TCP client needs to send 10,000 records to a Winsock TCP server for storage in a database. The records vary from 20 bytes to 100 bytes in size.

For simple application logic, a possible design is:
1. The client sends in blocking mode; the server receives in blocking mode.
2. The client sets SO_SNDBUF to 0, disabling Winsock buffering, so that each packet is sent out on its own.
3. The server calls recv in a loop to receive packets, passing a 200-byte buffer to each recv, so that every record is picked up by a single recv call.

Performance:
Testing showed the client could send only 5 records per second to the server. In total, the 10,000 records (about 976K bytes) took more than half an hour to upload.

Analysis:
Because the client did not set the TCP_NODELAY option, the Nagle algorithm forces the TCP stack to wait for the previous packet's ACK before sending the next packet. Meanwhile, the client set SO_SNDBUF to 0, disabling the kernel buffer. Therefore, the 10,000 send calls could only proceed one packet at a time, send-then-acknowledge, with each ACK delayed by 200 milliseconds, for the following reasons:
1. When the server gets a packet, it starts a 200-millisecond timer.
2. The server has nothing to send back to the client, so the ACK cannot be piggybacked on a data packet.
3. The client cannot send the next packet until it receives the ACK for the previous one.
4. Only after the server's timer expires is the ACK sent back to the client.

How to improve performance:
There are two problems in this design. First, the delay problem: the client needs to be able to send two packets to the server within 200 milliseconds. Because the client uses the Nagle algorithm by default, it should keep the default kernel buffer and not set SO_SNDBUF to 0. Once the TCP stack has coalesced more than an MTU's worth of data, that packet is sent immediately, without waiting for the previous ACK. Second, this design calls send once for every packet, and the packets are very small.

Sending such small packets is inefficient. In such cases, it is better to pad each record to 100 bytes and send 80 records per call to send. To let the server know how many records were sent in each batch, the client can prepend a header to the records.

Case 2:
A Winsock TCP client program opens two connections to communicate with a Winsock TCP server providing a stock quote service.

The first connection is used as a command channel to send stock symbols to the server; the second connection is used as a data channel to receive stock quotes. After the two connections are established, the client sends a stock symbol to the server through the command channel, then waits for the returned quote on the data channel.

Only after receiving the first quote does the client send the next stock symbol request. Neither the client nor the server sets the SO_SNDBUF or TCP_NODELAY options.

Performance:
Testing showed the client could get only 5 quotes per second.

Analysis:
This design allows only one stock quote to be obtained at a time. The first stock symbol is sent to the server through the command channel, and the quote comes back from the server through the data channel. The client then immediately sends the second request; send returns at once, the data having been copied into the kernel buffer. But the TCP stack cannot post that packet to the network immediately, because the ACK for the previous packet on that connection has not arrived. After 200 milliseconds, the server-side timer expires and the ACK for the first request packet is sent back to the client, so the second request is finally posted to the network. The quote for the second request then comes straight back on the data channel; but by this time the client-side timer has expired too, and the ACK for the first quote has been sent to the server.

This cycle repeats over and over.

How to improve performance:
Here, the two-connection design is unnecessary. If a single connection is used both to request and to receive quotes, the ACK for a stock request is immediately carried back on the returned quote data.

To improve performance further, the client should send multiple stock requests in one send call, and the server should return multiple quotes at a time. If, for some special reason, two one-way connections must be used, both client and server should set the TCP_NODELAY option so that small packets are sent immediately without waiting for the previous packet's ACK.

Recommendations for improving performance:
The two cases above illustrate some worst cases. When designing a solution that sends and receives a large number of small packets, follow these recommendations:
1. If the data fragments are not urgent, the application should coalesce them into larger blocks before calling send. Because the send buffer is likely to be copied into the kernel buffer, the block should not be too large; usually a bit less than 8K is efficient. As long as the Winsock kernel buffer holds more than an MTU's worth of data, several full packets are sent immediately, leaving only the final partial packet, so only the last packet of a batch is subject to the 200-millisecond timer.
2. If possible, avoid one-way socket data flows.
3. Do not set SO_SNDBUF to 0 unless you need a guarantee that the data has been posted to the network once send completes. In fact, the 8K buffer suits most situations and does not need changing unless a newly chosen buffer size has been tested and is indeed more efficient than the default.
4. If the data transmission does not need to be reliable, use UDP.

