High-Performance Network Programming (2) -- Sending TCP Messages

In the previous article, we established a TCP connection, which corresponds to a socket allocated by the operating system. Sending data over TCP means writing to a byte stream. When you call a method such as send or write to transmit data to another host, what happens inside the operating system kernel? We will analyze three questions: When the sending method returns successfully, has the host at the other end of the TCP connection received the message? Does a successful return at least guarantee that the data has been put onto the network? And how does the sending method behave differently on a blocking socket versus a non-blocking one?
Answering these three questions requires quite a bit of background, so let's start at the TCP layer and see what the kernel does when the sending method is called. I will not list the kernel data structures and methods in detail; most application developers do not need that level of depth, and a rough picture is enough, as shown below:
Figure 1: the process of sending a TCP message in a typical scenario.

Before describing the 10 steps in detail, several concepts should be clarified: MTU, MSS, the tcp_write_queue send queue, blocking and non-blocking sockets, the congestion window, the sliding window, and the Nagle algorithm.

When we call the sending method, the message stream constructed in our code is passed in as a parameter. This stream can be large or small, from a few bytes to several megabytes. When the stream is large, it may be split up, so let's discuss fragmentation first.

1. MSS and TCP segmentation

As the previous article showed, TCP is the Layer 4 transport protocol, so the constraints of the IP layer (Layer 3) and the data link layer (Layer 2) beneath it also apply to TCP. Consider a data-link-layer concept first: the maximum transmission unit (MTU). Every kind of data link limits the length of a frame; Ethernet, for example, limits the payload to 1500 bytes, and 802.3 limits it to 1492 bytes. When the kernel's IP layer tries to send a packet whose length exceeds the MTU, it splits the packet into several fragments, each smaller than the MTU and each carrying its own IP header. Look at the IP header format:

Figure 2: the IP header format.

The total-length field of the IP header is 16 bits (2 bytes), which means an IP packet can contain at most 65535 bytes. If the TCP layer tries to send a message larger than 1500 bytes over Ethernet, then when the IP-layer sending method is called, the IP layer automatically obtains the MTU of the local network and fragments the packet according to that MTU. The IP layer intends this to be transparent to the transport layer: the receiver's IP layer reassembles the fragments back into the original packet based on the IP headers it receives. This IP-layer fragmentation is very inefficient, because all fragments must arrive before the packet can be reassembled, and if any single fragment is lost, every fragment must be retransmitted. The TCP layer therefore tries to prevent the IP layer from fragmenting its datagrams.

To avoid IP-layer fragmentation, the TCP protocol defines the maximum segment size (MSS): the maximum length of a single segment that a host is willing to receive from its peer on a TCP connection. During the TCP three-way handshake, both parties announce their MSS to each other. For example (captured with tcpdump):

15:05:08.230782 IP 10.7.80.57.64569 > houyi-vm02.dev.sd.aliyun.com.tproxy: S 3027092051:3027092051(0) win 8192 <mss 1460,nop,wscale 8,nop,nop,sackOK>
15:05:08.234267 IP houyi-vm02.dev.sd.aliyun.com.tproxy > 10.7.80.57.64569: S 26006838:26006838(0) ack 3027092052 win 5840 <mss 1460,nop,nop,sackOK,nop,wscale 9>
15:05:08.233320 IP 10.7.80.57.64543 > houyi-vm02.dev.sd.aliyun.com.tproxy: P 78972532:78972923(391) ack 12915963 win 255

In this example both hosts are on Ethernet, whose MTU is 1500; subtracting the lengths of the IP header and the TCP header gives an MSS of 1460. In the three-way handshake, both SYN packets carry the desired MSS.
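On Linux (and most Unix-like systems) an application can read the MSS that is actually in effect for a connected socket. A minimal sketch, assuming fd is a connected TCP socket created elsewhere (the helper name is mine):

#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static void print_mss(int fd)
{
    int mss = 0;
    socklen_t len = sizeof(mss);

    /* TCP_MAXSEG reports the maximum segment size currently in effect;
     * after the three-way handshake it reflects the negotiated MSS
     * (and may be lowered later, e.g. by path-MTU discovery). */
    if (getsockopt(fd, IPPROTO_TCP, TCP_MAXSEG, &mss, &len) == 0)
        printf("effective MSS: %d bytes\n", mss);
    else
        perror("getsockopt(TCP_MAXSEG)");
}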
When the application layer calls the sending method provided by the TCP layer, the kernel's TCP module splits the message stream in the tcp_sendmsg method according to the MSS announced by the peer, dividing it into several segments (three in Figure 1), and then calls the IP-layer method to send the data.

Can this MSS change? Yes. As mentioned above, the MSS exists to avoid IP-layer fragmentation, but the MSS announced when the connection is established is not necessarily reliable, because it is only an estimate. If the two hosts of the TCP connection sit on different networks, there may be many intermediate networks between them, with different data link layers and therefore many different MTUs. In particular, if the MTU of an intermediate router is smaller than the MTU of either host's own network, the chosen MSS is still too large, and the intermediate router will perform IP-layer fragmentation anyway. How can fragmentation in the intermediate network be avoided? The DF (Don't Fragment) flag in the IP header tells every IP layer the packet passes through not to fragment it. If the packet is too large and would have to be fragmented, the router instead returns an ICMP error indicating that fragmentation is required, together with the MTU it can accept, so the sending host can recompute its MSS.
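On Linux, an application can explicitly request this DF/path-MTU-discovery behaviour on a socket and query the path MTU the kernel has discovered so far. A minimal sketch, assuming fd is a connected TCP socket (IP_MTU_DISCOVER and IP_MTU are Linux-specific options; the helper name is mine):

#include <stdio.h>
#include <netinet/in.h>
#include <sys/socket.h>

static void enable_pmtu_discovery(int fd)
{
    int val = IP_PMTUDISC_DO;   /* always set DF; never fragment locally */
    int mtu = 0;
    socklen_t len = sizeof(mtu);

    if (setsockopt(fd, IPPROTO_IP, IP_MTU_DISCOVER, &val, sizeof(val)) != 0)
        perror("setsockopt(IP_MTU_DISCOVER)");

    /* IP_MTU returns the path MTU currently known for a connected socket. */
    if (getsockopt(fd, IPPROTO_IP, IP_MTU, &mtu, &len) == 0)
        printf("current path MTU: %d bytes\n", mtu);
}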
2. After the sending method returns successfully, has the data been delivered to the other end of the TCP connection?

The answer, of course, is no. Before explaining why, let's look at how TCP guarantees reliable delivery. TCP assigns a sequence number to every byte in the data stream being sent, and reliability means that after receiving data, the peer must send an ACK telling the sender up to which byte it has received. In other words, the only way to know that data has been delivered is to wait for an ACK covering the sequence numbers of that data, and the send or write method provided by the TCP layer does not wait for that. So what exactly does it do? Let's walk through Figure 1.

Figure 1 consists of 10 steps. (1) The application calls the send method to send a long piece of data. (2) The kernel handles the call mainly in the tcp_sendmsg method. (3)(4) The actual transmission of packets is not synchronous with the send call; in other words, a successful return of the send method does not necessarily mean all the data has gone out onto the network as IP packets. The data therefore has to be copied from user-space memory into kernel-space memory, so that the kernel no longer depends on the user-space buffer and the process can quickly reuse or free the memory holding the data being sent. This copy does not simply duplicate the buffer: the data to be sent is divided into segments of at most MSS bytes, each copied into a kernel sk_buff structure, and these sk_buffs are linked into the tcp_write_queue send queue of the TCP connection. (5) The kernel buffer allocated to this TCP connection is limited (the default is given by /proc/sys/net/core/wmem_default). When no free kernel buffer remains to receive the user-space data, the kernel calls sk_stream_wait_memory to wait for the sliding window to move forward and release some buffers. (Once an ACK arrives, the segments it acknowledges no longer need to be kept for retransmission, so their buffers can be freed.) For example:
wait_for_memory:
    if (copied)
        tcp_push(sk, tp, flags & ~MSG_MORE, mss_now, TCP_NAGLE_PUSH);

    if ((err = sk_stream_wait_memory(sk, &timeo)) != 0)
        goto do_error;

Here, the sk_stream_wait_memory method takes a timeo parameter, which is the maximum time to wait. This value is computed at the beginning of the tcp_sendmsg method, as follows:

timeo = sock_sndtimeo(sk, flags & MSG_DONTWAIT);

Let's look at its implementation:

static inline long sock_sndtimeo(const struct sock *sk, int noblock)
{
    return noblock ? 0 : sk->sk_sndtimeo;
}

In other words, for a blocking socket, timeo is the send timeout specified by the SO_SNDTIMEO option; for a non-blocking socket, timeo is 0, and sk_stream_wait_memory simply returns immediately with the error code set to EAGAIN. (6) In the scenario of Figure 1 we assume the socket is blocking, and after a long wait the peer's ACK arrives and the sliding window frees some buffer space. (7) The remaining user-space data is then copied into kernel sk_buffs. (8) Finally, tcp_push (or a similar method) is called, which ultimately invokes the IP-layer method to send the segments queued in tcp_write_queue. Note that when the IP-layer method returns, the packets have not necessarily been transmitted onto the wire. (9)(10) The sending method returns.
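From the application's side, both behaviours can be exercised with standard socket calls. Below is a minimal sketch (the helper functions are hypothetical, not from the kernel code above): set_send_timeout configures the SO_SNDTIMEO value that ends up in sk_sndtimeo, and try_send_nonblocking shows a full send buffer being reported as EAGAIN/EWOULDBLOCK on a non-blocking socket.

#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <sys/types.h>

static void set_send_timeout(int fd, int seconds)
{
    /* Bounds how long a blocking send() may wait for send-buffer space;
     * this is the value that ends up in sk->sk_sndtimeo. */
    struct timeval tv = { .tv_sec = seconds, .tv_usec = 0 };
    setsockopt(fd, SOL_SOCKET, SO_SNDTIMEO, &tv, sizeof(tv));
}

static ssize_t try_send_nonblocking(int fd, const void *buf, size_t len)
{
    /* Switch the socket to non-blocking mode */
    fcntl(fd, F_SETFL, fcntl(fd, F_GETFL, 0) | O_NONBLOCK);

    ssize_t n = send(fd, buf, len, 0);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) {
        /* Send buffer full: nothing was queued; retry later (e.g. via epoll). */
        fprintf(stderr, "send would block\n");
    }
    return n;
}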
From the 10 steps in Figure 1 we can see that, whether the socket is blocking or non-blocking, a successful return from the sending method (whether it accepted all of the data or only part of it) neither means that the host at the other end of the TCP connection has received the message, nor that the local machine has put the message onto the network. It only means that the kernel will try to deliver the message to the peer.
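Because a successful return may cover only part of the buffer, applications usually loop until every byte has been handed to the kernel. A minimal sketch of such a loop (the send_all helper is an illustration, not part of the kernel code discussed here):

#include <errno.h>
#include <sys/socket.h>
#include <sys/types.h>

static int send_all(int fd, const char *buf, size_t len)
{
    size_t sent = 0;

    while (sent < len) {
        ssize_t n = send(fd, buf + sent, len - sent, 0);
        if (n < 0) {
            if (errno == EINTR)
                continue;        /* interrupted by a signal: just retry */
            return -1;           /* real error (or EAGAIN on a non-blocking socket) */
        }
        sent += (size_t)n;       /* bytes copied into the kernel, not yet ACKed */
    }
    return 0;
}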
3. How the Nagle algorithm, the sliding window, and the congestion window affect the sending method

What does the tcp_push method in step (8) of Figure 1 actually do? Let's look at its main flow:
Figure 3: the simplified flow of sending a TCP segment.

Let's briefly go over these concepts. (1) Sliding window. Everyone is familiar with the sliding window, so I will not describe it in detail. Each side of a TCP connection advertises its receive window size to the other side, and the peer's receive window bounds the local send window. When tcp_push sends data it must respect the send window, which changes over time: it grows as ACKs arrive and shrinks as new segments are sent, and of course it can never exceed the receive window most recently advertised by the peer. When tcp_push sends data, it uses the tcp_snd_wnd_test method to check whether the sequence numbers of the data to be sent exceed the send window. For example:

// Check whether the last sequence number of the segment to be sent exceeds the send window
static inline int tcp_snd_wnd_test(struct tcp_sock *tp, struct sk_buff *skb, unsigned int cur_mss)
{
    // end_seq is the last sequence number carried by this skb
    u32 end_seq = TCP_SKB_CB(skb)->end_seq;

    if (skb->len > cur_mss)
        end_seq = TCP_SKB_CB(skb)->seq + cur_mss;

    // snd_una is the oldest unacknowledged sequence number; snd_wnd is the send window size
    return !after(end_seq, tp->snd_una + tp->snd_wnd);
}
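The after() macro used above compares 32-bit TCP sequence numbers in a wrap-around-safe way. A minimal sketch of the same idea in userspace C (an illustration of the technique, not the kernel's actual macro):

#include <stdint.h>
#include <stdio.h>

/* True when sequence number a is "later" than b, even across a 2^32 wrap. */
static int seq_after(uint32_t a, uint32_t b)
{
    return (int32_t)(a - b) > 0;
}

int main(void)
{
    printf("%d\n", seq_after(10, 0xFFFFFFF0u));   /* 1: 10 comes "after" 0xFFFFFFF0 */
    printf("%d\n", seq_after(0xFFFFFFF0u, 10));   /* 0 */
    return 0;
}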
(2) Slow start and the congestion window. The network between two hosts may be very complex; over a WAN in particular, the forwarding capacity of intermediate routers can be the bottleneck. If one side simply sent data according to the sliding window advertised by the other host during the three-way handshake, the routers along the path could be overwhelmed and more packets would be lost. For this reason, every operating system kernel adds slow start and congestion avoidance algorithms to the TCP sending path. Put plainly, the window size advertised by the peer only reflects the peer's ability to receive TCP segments, not the intermediate network's ability to forward them, so the sender should first make sure the path is actually clear before ramping up to the peer's advertised window. The congestion window, cwnd below, is used to implement slow start. When the connection is established, the congestion window is much smaller than the send window: it starts at one MSS, and every received ACK enlarges it by one MSS, up to at most the receive window advertised by the peer. To avoid unbounded exponential growth, the congestion window later grows more slowly, in a smooth, linear fashion (congestion avoidance). Therefore, when tcp_push sends segments, it also checks the congestion window: the number of segments in flight must be smaller than the congestion window, and the length of data being sent must not exceed what the congestion window allows. First, the tcp_cwnd_test method checks whether the number of segments in flight is smaller than the congestion window (counted in MSS-sized segments):
static inline unsigned int tcp_cwnd_test(struct tcp_sock *tp, struct sk_buff *skb)
{
    u32 in_flight, cwnd;

    /* Don't be strict about the congestion window for the final FIN. */
    if (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN)
        return 1;

    // segments in flight, i.e. sent but not yet acknowledged
    in_flight = tcp_packets_in_flight(tp);
    cwnd = tp->snd_cwnd;

    // if the congestion window permits, return how many more segments may be sent
    if (in_flight < cwnd)
        return (cwnd - in_flight);

    return 0;
}

Then the tcp_window_allows method takes the smaller of the congestion window and the sliding window, and checks whether the data to be sent exceeds it:

static unsigned int tcp_window_allows(struct tcp_sock *tp, struct sk_buff *skb, unsigned int mss_now, unsigned int cwnd)
{
    u32 window, cwnd_len;

    // bytes still permitted by the send window, and bytes permitted by the congestion window
    window = (tp->snd_una + tp->snd_wnd - TCP_SKB_CB(skb)->seq);
    cwnd_len = mss_now * cwnd;
    return min(window, cwnd_len);
}
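Before moving on to the Nagle algorithm: on Linux, the congestion window, MSS and in-flight segment count discussed above can be observed from userspace with the TCP_INFO socket option. A minimal sketch, assuming fd is a connected TCP socket (Linux-specific; the helper name is mine):

#include <stdio.h>
#include <string.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static void print_cwnd(int fd)
{
    struct tcp_info info;
    socklen_t len = sizeof(info);

    memset(&info, 0, sizeof(info));
    if (getsockopt(fd, IPPROTO_TCP, TCP_INFO, &info, &len) == 0) {
        /* tcpi_snd_cwnd is counted in segments (MSS units), like snd_cwnd in
         * struct tcp_sock; tcpi_unacked is the number of segments in flight. */
        printf("mss=%u cwnd=%u segments, in flight=%u segments\n",
               info.tcpi_snd_mss, info.tcpi_snd_cwnd, info.tcpi_unacked);
    }
}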
(3) Is the Nagle algorithm satisfied? The motivation behind the Nagle algorithm is this: an application may call the sending method with only a few bytes at a time, causing the machine to emit many tiny TCP packets, and tiny packets increase the risk of network congestion and hurt overall network efficiency. So, where possible, adjacent small packets should be merged into one larger packet (still no larger than the MSS) before being sent. The Nagle algorithm requires that a TCP connection have at most one unacknowledged small segment outstanding; no further small segments may be sent until that one has been acknowledged. In the kernel, this is implemented by the tcp_nagle_test method. Let's take a quick look:
 
static inline int tcp_nagle_test(struct tcp_sock *tp, struct sk_buff *skb, unsigned int cur_mss, int nonagle)
{
    // if the TCP_NAGLE_PUSH flag is set, the segment may be sent immediately
    if (nonagle & TCP_NAGLE_PUSH)
        return 1;

    // urgent data, or a segment carrying the FIN flag used to close the connection, may also be sent
    if (tp->urg_mode || (TCP_SKB_CB(skb)->flags & TCPCB_FLAG_FIN))
        return 1;

    // otherwise, check the Nagle algorithm
    if (!tcp_nagle_check(tp, skb, cur_mss, nonagle))
        return 1;

    return 0;
}

Now look at the tcp_nagle_check method. Note that its return value has the opposite meaning from the previous methods: a return value of 0 means the segment may be sent, and a non-zero return value means it may not.

static inline int tcp_nagle_check(const struct tcp_sock *tp, const struct sk_buff *skb, unsigned mss_now, int nonagle)
{
    // only a "small" segment (shorter than the MSS) is held back, and only when either
    // TCP_NAGLE_CORK is set, or the Nagle algorithm is enabled (nonagle == 0) and there
    // are already unacknowledged segments in flight (packets_out != 0)
    return (skb->len < mss_now &&
            ((nonagle & TCP_NAGLE_CORK) ||
             (!nonagle && tp->packets_out && tcp_minshall_check(tp))));
}

Finally, let's see what tcp_minshall_check does:

static inline int tcp_minshall_check(const struct tcp_sock *tp)
{
    // the last small segment sent has not yet been acknowledged ...
    return after(tp->snd_sml, tp->snd_una) &&
           // ... and it does not lie beyond the next sequence number to be sent
           !after(tp->snd_sml, tp->snd_nxt);
}

Imagine a scenario where request latency matters a great deal and the network environment is very good (for example, two machines in the same data center): the Nagle algorithm can then be disabled with the TCP_NODELAY socket option. Let's see how setsockopt interacts with the methods above:

static int do_tcp_setsockopt(struct sock *sk, int level, int optname, char __user *optval, int optlen)
{
    ...
    switch (optname) {
    ...
    case TCP_NODELAY:
        if (val) {
            // TCP_NODELAY set: update the nonagle flags and push out any pending frames
            tp->nonagle |= TCP_NAGLE_OFF | TCP_NAGLE_PUSH;
            tcp_push_pending_frames(sk, tp);
        } else {
            tp->nonagle &= ~TCP_NAGLE_OFF;
        }
        break;
    }
}

We can see that the nonagle flag is changed in this way.
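For completeness, this is what the corresponding application-side call looks like. A minimal sketch that disables the Nagle algorithm on a socket fd, driving the do_tcp_setsockopt path shown above (the helper name is mine):

#include <stdio.h>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

static void disable_nagle(int fd)
{
    int one = 1;

    if (setsockopt(fd, IPPROTO_TCP, TCP_NODELAY, &one, sizeof(one)) != 0)
        perror("setsockopt(TCP_NODELAY)");
    /* From now on, small segments are pushed out immediately instead of being
     * held back until the previous small segment has been ACKed. */
}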
Of course, even after the IP-layer method has been called and has returned, the data may still not have been sent onto the network. In the next article we will discuss how TCP messages are received, and what the kernel does after an ACK arrives.
