TCP/IP Principles, Fundamentals, and Implementation on Linux

Last Update:2014-07-26 Source: Internet

Author: User

Introduction: This article lays the theoretical groundwork. It explains the basic principles of TCP/IP and the important protocol details, and on that basis describes how TCP/IP is implemented on Linux.

OSI Reference Model and TCP/IP Reference Model

The OSI (Open Systems Interconnection) reference model was developed from a proposal by the International Organization for Standardization (ISO) and is divided into seven layers. With the advent of satellite and wireless networks, the existing protocols had trouble interworking with these networks, so a new reference architecture was needed that could connect multiple networks seamlessly. This architecture is the TCP/IP reference model.

TCP protocol

The Internet has two main protocols in the transport layer: a connection-oriented protocol and a connectionless protocol. TCP (Transmission Control Protocol) is the connection-oriented one, designed to provide a reliable end-to-end byte stream over an unreliable internetwork. TCP service is obtained by having both the sender and the receiver create a communication endpoint called a socket. All TCP connections are full-duplex and point-to-point.

The sending and receiving TCP entities exchange data in the form of datagrams (more precisely, TCP segments). A datagram consists of a fixed 20-byte header, an optional part, and zero or more bytes of data. Two limits restrict the size of a datagram. First, each datagram, including the TCP header, must fit in the payload of an IP datagram; since the total length of an IP datagram cannot exceed 65,535 bytes, a 20-byte IP header leaves at most 65,515 bytes. Second, each network has a maximum transmission unit (MTU), and each datagram must fit within the MTU. If a datagram enters a network whose MTU is smaller than the datagram, the router at the network boundary breaks it into smaller fragments.

The basic protocol used by TCP entities is the sliding window protocol. When the sender transmits a datagram, it also starts a timer. When the datagram arrives at the destination, the receiving TCP entity sends back a datagram bearing an acknowledgment number equal to the next sequence number it expects to receive. If the sender's timer expires before the acknowledgment arrives, the sender retransmits the datagram.

2.1 TCP Data Header

Figure 2 shows the format of the TCP data header.

Source port, Destination port: 16 bits each. They identify the endpoints of the connection on the sending and receiving hosts.

Sequence Number: 32 bits long. Gives the position in the byte stream of the first data byte in the datagram, which lets the receiver put the data back in order.

Acknowledgment Number: 32 bits long. The sequence number of the next byte the receiver expects.

TCP Header Length: 4 bits long. Indicates how many 32-bit words the TCP header contains.

The next 6 bits are unused.

ACK: Set to 1 to indicate that the acknowledgment number is valid. If ACK is 0, the datagram carries no acknowledgment and the acknowledgment number field is ignored.

PSH: Marks pushed data. The receiver is requested to deliver the data to the application on arrival instead of waiting until a full buffer has been received.

RST: Used to reset a connection that has become confused because of a host crash or some other cause. It is also used to reject an invalid datagram or to refuse a connection request.

SYN: Used to establish a connection.

FIN: Used to release the connection.

Window Size: 16 bits long. Indicates how many bytes may be sent, starting at the byte acknowledged.

Checksum: 16 bits long. Provided for extra reliability. It checksums the header, the data, and a pseudo-header.

Options: 0 or more 32-bit words. Includes options such as the maximum TCP payload (maximum segment size), the window scale, and selective retransmission.

Maximum TCP payload: Allows each host to state the largest TCP payload it is willing to accept. During connection setup both sides announce their maximum and the smaller value is used. If a host does not use this option, its payload defaults to 536 bytes.
Window Scale: Allows the sender and receiver to negotiate a window scale factor. The factor is a left shift of the window size field by at most 14 bits, which allows windows of up to about 2^30 bytes.
Selective retransmission: Allows the receiver to ask the sender to retransmit one or more specific datagrams.
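The header layout described above can be made concrete with a short Python sketch. It is only an illustration under stated assumptions: the function name parse_tcp_header and the sample values are invented here, and the flag bit positions follow the standard TCP header layout (ACK 0x10, PSH 0x08, RST 0x04, SYN 0x02, FIN 0x01).

```python
import struct


def parse_tcp_header(segment: bytes) -> dict:
    """Decode the fixed 20-byte part of a TCP header (hypothetical helper)."""
    if len(segment) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    (src_port, dst_port, seq, ack,
     offset_flags, window, checksum, _urgent) = struct.unpack(
        "!HHIIHHHH", segment[:20])
    header_len = (offset_flags >> 12) * 4   # 4-bit length, counted in 32-bit words
    flags = offset_flags & 0x3F             # the low 6 bits carry the flags
    return {
        "src_port": src_port,
        "dst_port": dst_port,
        "seq": seq,
        "ack": ack,
        "header_len": header_len,
        "ACK": bool(flags & 0x10),
        "PSH": bool(flags & 0x08),
        "RST": bool(flags & 0x04),
        "SYN": bool(flags & 0x02),
        "FIN": bool(flags & 0x01),
        "window": window,
        "checksum": checksum,
        "options": segment[20:header_len],  # empty when header_len is 20
    }


# A hand-built connection request: ports 40000 -> 80, SYN=1, ACK=0.
syn = struct.pack("!HHIIHHHH", 40000, 80, 1000, 0, (5 << 12) | 0x02, 65535, 0, 0)
hdr = parse_tcp_header(syn)
```

Here hdr["SYN"] is true and hdr["ACK"] is false, which is the combination used by a connection request.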

2.2 Connection Management

Connections are established in TCP with a three-way handshake. To establish a connection, one side, say the server, passively waits for an incoming connection request by executing the listen and accept primitives.

The other side, say the client, executes the connect primitive, specifying the IP address and port number it wants to connect to, the maximum TCP datagram size it is willing to accept, and optionally some user data. The connect primitive sends a datagram with SYN=1 and ACK=0 to the destination and waits for a response.

When this datagram arrives at the destination, the TCP entity there checks whether a process is listening on the port given in the Destination port field. If not, it sends a reply with RST=1 to reject the connection.

If a process is listening on the port, the incoming TCP datagram is handed to that process, which can accept or reject the connection. If it accepts, an acknowledgment datagram is sent back. The normal sequence of TCP connection setup is shown in Figure 3.

To release a connection, either side can send a TCP datagram with FIN=1, indicating that it has no more data to send. When the FIN datagram is acknowledged, the connection is closed in that direction. When both directions have been closed, the connection is fully released. Normally, releasing a connection takes four TCP datagrams: one FIN and one ACK in each direction.
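As a rough illustration of these primitives, the following Python sketch opens and closes one connection over the loopback interface using the standard socket module. The kernel performs the actual SYN, ACK, and FIN exchange; the function name loopback_handshake is made up for this example, and it assumes the loopback address 127.0.0.1 is usable.

```python
import socket


def loopback_handshake() -> bytes:
    """Open one TCP connection on the loopback interface, send data, close it."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))        # port 0: let the kernel pick a free port
    server.listen(1)                     # passive side: wait for connection requests
    port = server.getsockname()[1]

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(("127.0.0.1", port))  # active side: kernel runs the handshake
    conn, _peer = server.accept()        # take the established connection

    client.sendall(b"hello")
    client.shutdown(socket.SHUT_WR)      # FIN: the client has no more data to send

    received = b""
    while True:
        chunk = conn.recv(1024)
        if not chunk:                    # empty read: the peer's FIN has arrived
            break
        received += chunk

    conn.close()                         # closes the other direction
    client.close()
    server.close()
    return received
```

Calling loopback_handshake() returns the bytes the server side read before it saw the end of the stream.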

2.3 Transport Policy

TCP uses a sliding window for transmission control. The window size tells the sender how much buffer space the receiver has available for incoming data, and the sender uses it to decide how many bytes it may send. When the window is 0, the sender may not normally send datagrams, with two exceptions. First, urgent data may be sent, for example to allow the user to kill a process running on the remote machine. Second, the sender may send a 1-byte datagram to make the receiver reannounce the next byte it expects and its window size.

2.4 Congestion Control

Congestion occurs when the load offered to a network exceeds what it can handle. On the Internet there are two potential problems, the capacity of the network and the capacity of the receiver, and they should be dealt with separately. The sender therefore maintains two windows: the window the receiver has granted and the congestion window. The number of bytes that may be sent is the minimum of the two windows.

When a connection is established, the sender initializes the congestion window to the size of the maximum datagram in use on the connection and then sends one maximum-size datagram. If this datagram is acknowledged before the timer expires, the sender adds one datagram's worth of bytes to the congestion window, making it two maximum-size datagrams, and sends two datagrams. As each of these datagrams is acknowledged, the congestion window grows by one maximum datagram size. When the congestion window is n datagrams and all n are acknowledged in time, the congestion window grows by the byte count of n datagrams, so it doubles. The congestion window keeps growing exponentially until either a timeout occurs or the receiver's window is reached. It then stays at the largest value that neither causes a timeout nor exceeds the receiver's window.
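The growth pattern just described can be sketched as a small simulation. This is a simplified model, not the kernel's algorithm: the function name slow_start and the capacity parameter (the burst size above which a timeout is assumed to occur) are hypothetical, and one loop iteration stands for one round trip.

```python
def slow_start(mss: int, receiver_window: int, capacity: int,
               max_rounds: int = 20) -> list:
    """Return the number of bytes sent in each round trip.

    capacity is an assumed network limit: a burst larger than this
    is treated as causing a timeout.
    """
    cwnd = mss                                   # start with one maximum-size datagram
    history = []
    for _ in range(max_rounds):
        in_flight = min(cwnd, receiver_window)   # minimum of the two windows
        history.append(in_flight)
        if in_flight > capacity:                 # this burst times out: stop growing
            break
        if cwnd >= receiver_window:              # receiver's window reached
            break
        cwnd *= 2                                # all datagrams acknowledged: window doubles
    return history


bursts = slow_start(mss=1024, receiver_window=65536, capacity=20000)
```

With these sample numbers the bursts double each round trip until the 32768-byte burst exceeds the assumed capacity, so in this model the usable congestion window would be the previous value, 16384 bytes.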

2.5 Timer Management

TCP uses several timers, such as the retransmission timer, the persistence timer, and the keepalive timer. The most important is the retransmission timer. When a datagram is sent, a retransmission timer is started. If the datagram is acknowledged before the timer expires, the timer is stopped; if the timer expires before the acknowledgment arrives, the datagram is retransmitted.

The persistence timer is used to prevent deadlock. If the receiver announces a window of 0 and its later window update is lost, both sides would wait forever; when the persistence timer goes off, the sender probes the receiver. The keepalive timer goes off when a connection has been idle for a long time, causing one side to check whether the other side is still there. If the other side does not respond, the connection is terminated.
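On the application side, the keepalive timer is normally switched on per socket. The sketch below uses the standard SO_KEEPALIVE socket option; the helper name enable_keepalive and the 60-second idle value are assumptions for illustration, and TCP_KEEPIDLE is only set where the platform exposes it (it does on Linux).

```python
import socket


def enable_keepalive(sock: socket.socket, idle_seconds: int = 60) -> bool:
    """Turn on keepalive probing for one socket and report whether it is on."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    if hasattr(socket, "TCP_KEEPIDLE"):
        # Idle time before the first probe; not available on every platform.
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    return sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0


s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
keepalive_on = enable_keepalive(s)
s.close()
```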

UDP protocol

The Internet protocol suite also supports a connectionless transport protocol, UDP (User Datagram Protocol). UDP uses the underlying Internet Protocol to carry messages and, like IP, provides an unreliable, connectionless datagram delivery service. It does not use acknowledgments to confirm that messages arrive, does not order received datagrams, and does not provide feedback to control the rate of information flow between machines. Reliability problems in UDP communication, including lost, duplicated, and out-of-order messages, must be handled by the application that uses UDP.

A UDP datagram consists of an 8-byte header followed by the data. The header format is shown in Figure 4; it has four fields, each 16 bits long. The source port and destination port serve the same purpose as in TCP and identify the endpoints at the source and destination. The UDP length field gives the length of the datagram, including the 8-byte header and the data. The UDP checksum field is optional and holds a checksum over the UDP header, the UDP pseudo-header, and the user data.
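A minimal Python sketch of the four header fields and the checksum follows. The helper names internet_checksum, pseudo_header, and build_udp are invented for this example, and the addresses and ports are sample values. The pseudo-header layout used here (source address, destination address, a zero byte, protocol number 17, UDP length) is the standard IPv4 one.

```python
import socket
import struct


def internet_checksum(data: bytes) -> int:
    """16-bit one's complement of the one's complement sum of the data."""
    if len(data) % 2:
        data += b"\x00"                      # pad to a whole number of 16-bit words
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)   # fold carries back in
    return ~total & 0xFFFF


def pseudo_header(src_ip: str, dst_ip: str, udp_length: int) -> bytes:
    return (socket.inet_aton(src_ip) + socket.inet_aton(dst_ip)
            + struct.pack("!BBH", 0, 17, udp_length))   # 17 = UDP protocol number


def build_udp(src_ip, dst_ip, src_port, dst_port, payload: bytes) -> bytes:
    length = 8 + len(payload)                # header plus data
    header = struct.pack("!HHHH", src_port, dst_port, length, 0)
    csum = internet_checksum(pseudo_header(src_ip, dst_ip, length) + header + payload)
    if csum == 0:
        csum = 0xFFFF                        # 0 is reserved for "no checksum"
    return struct.pack("!HHHH", src_port, dst_port, length, csum) + payload


datagram = build_udp("10.0.0.1", "10.0.0.2", 5000, 53, b"hello")
src_port, dst_port, length, checksum = struct.unpack("!HHHH", datagram[:8])
# A receiver verifies by summing pseudo-header and datagram; a valid result is 0.
valid = internet_checksum(pseudo_header("10.0.0.1", "10.0.0.2", length) + datagram) == 0
```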

IP protocol

IP provides an unreliable, connectionless datagram delivery mechanism. TCP/IP is designed to accommodate a wide variety of physical networks, and this adaptability is achieved mainly in the IP layer. Because physical networks differ, their frame formats and address formats differ greatly. To hide these low-level details and let hosts on different physical networks communicate, TCP/IP uses the IP datagram and the IP address as uniform descriptions of the physical frame and the physical address, respectively. In this way IP presents a uniform IP datagram and a uniform IP address to the upper layers, so that the differences between physical frames and physical addresses are invisible to upper-layer protocols.

4.1 IP Data Header

An IP datagram consists of a header and a data part. The header has a fixed 20-byte part and an optional part of variable length. The header format is shown in Figure 5.

Version: 4 bits long. Records which version of the protocol the datagram belongs to. There are currently two versions of IP: IPv4 and IPv6.

IHL: 4 bits long. Gives the total length of the header, in 32-bit words.

Service type: 8 bits long. Allows the host to tell the subnet what kind of service it wants. The field is divided into 5 parts: a 3-bit precedence field giving the priority, three flag bits for delay, throughput, and reliability, and 2 unused bits.

Total length: 16 bits. The total length of the header and the data. The maximum length is 65,535 bytes.

ID: 16 bits. Lets the destination host determine which datagram a newly arrived fragment belongs to. All fragments of the same datagram contain the same identification value.

DF: Don't Fragment. It orders routers not to fragment the datagram because the destination cannot put the fragments back together.

MF: More Fragments. This bit is set in all fragments except the last one, so the destination can tell when all fragments of a datagram have arrived.

Fragment offset: 13 bits. Tells where in the current datagram the fragment belongs.

Time to live: 8 bits. A counter used to limit packet lifetime. It is decremented at each hop and may be decremented further while the packet is queued in a router.

Protocol: 8 bits. Tells which transport process the packet should be handed to, such as TCP or UDP.

Header checksum: 16 bits. Verifies the header only.

Source Address: 32 bits. The IP address of the source host that generated the IP datagram.

Destination Address: 32 bits. The IP address of the destination host of the IP datagram.

Options: Variable length. Each option begins with a 1-byte code identifying the option. Some options are followed by a 1-byte option length field and then one or more data bytes. Five options are currently defined: security, strict source routing, loose source routing, record route, and timestamp. Not all routers support all five.

The security option describes how secret the information is.

The strict source routing option gives the complete path from source to destination as a sequence of IP addresses. The datagram must follow exactly that path. This option is useful when routing tables are corrupted, when a system administrator needs to send emergency packets, or when making timing measurements.

The loose source routing option requires the packet to traverse the listed routers in the order given, but it may pass through other routers on the way.

The record route option tells the routers along the path to append their IP addresses to the option field, which allows system administrators to track down errors in the routing algorithms.

The timestamp option is like the record route option, except that each router records a 32-bit timestamp in addition to its 32-bit IP address. This option can likewise be used to check for errors in routing algorithms.
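To tie the field list together, here is a hedged Python sketch that decodes the fixed 20-byte part of an IPv4 header and verifies the header checksum. The function names and the sample addresses are assumptions made for this example; the field widths follow the list above.

```python
import socket
import struct

IP_FORMAT = "!BBHHHBBH4s4s"   # the fixed 20-byte part of the header


def internet_checksum(data: bytes) -> int:
    """16-bit one's complement sum, as used by the IP header checksum."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def parse_ipv4_header(packet: bytes) -> dict:
    (ver_ihl, tos, total_len, ident, flags_frag,
     ttl, proto, _checksum, src, dst) = struct.unpack(IP_FORMAT, packet[:20])
    header_len = (ver_ihl & 0x0F) * 4             # IHL counts 32-bit words
    return {
        "version": ver_ihl >> 4,
        "header_len": header_len,
        "precedence": tos >> 5,                   # top 3 bits of the service type
        "total_len": total_len,
        "id": ident,
        "DF": bool(flags_frag & 0x4000),
        "MF": bool(flags_frag & 0x2000),
        "fragment_offset": (flags_frag & 0x1FFF) * 8,   # stored in 8-byte units
        "ttl": ttl,
        "protocol": proto,                        # 6 = TCP, 17 = UDP
        "src": socket.inet_ntoa(src),
        "dst": socket.inet_ntoa(dst),
        "checksum_ok": internet_checksum(packet[:header_len]) == 0,
    }


def build_sample_header() -> bytes:
    """Build a sample header: IPv4, IHL 5, DF set, TTL 64, protocol TCP."""
    fields = [0x45, 0, 40, 0x1234, 0x4000, 64, 6, 0,
              socket.inet_aton("192.168.0.1"), socket.inet_aton("192.168.0.2")]
    fields[7] = internet_checksum(struct.pack(IP_FORMAT, *fields))
    return struct.pack(IP_FORMAT, *fields)


info = parse_ipv4_header(build_sample_header())
```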

4.2 Fragmentation and reassembly of IP datagrams

An IP datagram is transmitted by encapsulating it in a physical frame. Because the Internet interconnects many different physical network technologies, the maximum frame size (maximum transmission unit, MTU) may differ from one part of the Internet to another. To make the best use of the physical network, the IP module sizes its IP datagrams according to the MTU of the physical network it is attached to. When an IP datagram passes between two networks with different MTUs, it may have to be fragmented and later reassembled.

Three fields in the IP header control fragmentation and reassembly: the identification field, the flags field, and the fragment offset field. The identification is the identifier that the source host assigns to the IP datagram; the destination host uses it to determine which datagram a received fragment belongs to when reassembling. The DF bit in the flags field indicates whether the datagram may be fragmented. If a datagram needs to be fragmented but DF is set to 1, the gateway discards the datagram and sends an error message to the source host. The MF bit in the flags field indicates whether a fragment is the last one. The fragment offset field records the offset of the fragment within the original datagram. The offset is an integer multiple of 8 bytes, and it determines the order in which fragments are put together during reassembly.

Once an IP datagram has been fragmented, the fragments travel as separate IP datagrams and may be fragmented again, possibly several times, before reaching the destination host. Reassembly, however, is done only at the destination host.
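A worked example helps with the offset arithmetic. The sketch below splits a payload for a given MTU; the function name fragment and the tuple layout are invented for illustration, and a 20-byte header without options is assumed.

```python
def fragment(payload: bytes, mtu: int, header_len: int = 20) -> list:
    """Split a payload into (offset_in_8_byte_units, more_fragments, data) tuples."""
    max_data = (mtu - header_len) // 8 * 8   # data per fragment, a multiple of 8 bytes
    if max_data <= 0:
        raise ValueError("MTU too small to carry any data")
    fragments = []
    for start in range(0, len(payload), max_data):
        chunk = payload[start:start + max_data]
        more = start + len(chunk) < len(payload)   # MF is clear only on the last fragment
        fragments.append((start // 8, more, chunk))
    return fragments


# 4000 bytes of data crossing a network with a 1500-byte MTU.
pieces = fragment(bytes(4000), mtu=1500)
summary = [(offset, more, len(data)) for offset, more, data in pieces]
```

Each fragment can carry 1480 bytes of data, so the 4000-byte payload becomes three fragments with offsets 0, 185, and 370 (in 8-byte units) and only the last has MF cleared.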

4.3 IP handling of input datagrams

IP handles input datagrams in two ways: processing by a host and processing by a gateway.

When an IP datagram arrives at a host, IP accepts the datagram and passes it to the higher-layer protocol software if the destination address of the datagram matches the host's address; otherwise it discards the datagram.

A gateway behaves differently. When an IP datagram reaches the IP layer of a gateway, the gateway first checks whether it is itself the destination host of the datagram. If so, the gateway accepts the datagram and passes it to the higher-layer protocol software. If not, the gateway looks up a route for the datagram and forwards it.
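The two cases reduce to one small decision, sketched below. The function name handle_input and the sample addresses are hypothetical; a real IP layer also handles broadcast and multicast addresses, which this sketch ignores.

```python
def handle_input(dst: str, local_addresses: set, is_gateway: bool) -> str:
    """Decide what to do with an incoming IP datagram addressed to dst."""
    if dst in local_addresses:
        return "deliver"                 # hand to the higher-layer protocol software
    return "forward" if is_gateway else "drop"


host_result = handle_input("10.0.0.9", {"10.0.0.5"}, is_gateway=False)
gateway_result = handle_input("10.0.0.9", {"10.0.0.1", "192.168.1.1"}, is_gateway=True)
```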

4.4 IP handling of output datagrams

IP handles output datagrams in two ways as well: processing by a host and processing by a gateway.

A gateway, after receiving an IP datagram, looks up its transmission path. The path is in fact the IP address of the next gateway on the full route. The gateway then hands the IP datagram and the address of the next gateway to the network interface software. When the network interface software receives the IP datagram and the next gateway address, it first calls ARP to map the next gateway's IP address to a physical address, then encapsulates the IP datagram in a frame, and finally the subnet completes the physical transmission of the datagram.
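The lookup of the next gateway can be sketched with Python's ipaddress module. Everything in the table below is made up for illustration, and the selection rule (prefer the most specific matching network, often called longest-prefix match) is the usual convention rather than something this article specifies.

```python
import ipaddress

# Hypothetical routing table: (destination network, next gateway, device).
# A gateway of None means the destination is directly connected.
ROUTES = [
    (ipaddress.ip_network("192.168.1.0/24"), None, "eth0"),
    (ipaddress.ip_network("10.0.0.0/8"), "192.168.1.254", "eth0"),
    (ipaddress.ip_network("0.0.0.0/0"), "192.168.1.1", "eth0"),   # default route
]


def next_hop(dst: str) -> tuple:
    """Return (IP address to resolve with ARP, device) for a destination."""
    addr = ipaddress.ip_address(dst)
    matches = [route for route in ROUTES if addr in route[0]]
    if not matches:
        raise LookupError("no route to " + dst)
    _network, gateway, device = max(matches, key=lambda r: r[0].prefixlen)
    return (gateway or dst, device)      # direct delivery resolves dst itself


local = next_hop("192.168.1.7")
via_gateway = next_hop("10.2.3.4")
default = next_hop("8.8.8.8")
```

The address returned is the one that would be handed to ARP for mapping to a physical address.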

ICMP protocol

ICMP (Internet Control Message Protocol) is used mainly to construct error and control messages and to obtain certain network information. ICMP belongs to the same layer as IP, but an ICMP message is encapsulated by IP and sent as an IP datagram. ICMP is not treated as a separate protocol layer, because no upper-layer protocol is built on it and conceptually it does not form a layer of its own.

ICMP messages include the following types: destination unreachable, time exceeded, parameter problem, source quench, redirect, echo request, echo reply, timestamp request, and timestamp reply.

The destination unreachable message is used when the subnet or a router cannot locate the destination, or when a packet with the DF bit set cannot be delivered because a "small-packet" network stands in the way.

The time exceeded message is sent when a packet is discarded because its counter has reached zero.

The parameter problem message indicates that an illegal value was found in a header field.

The source quench message is used to throttle hosts that send too many packets. A host that receives this message is expected to slow down.

The redirect message is sent when a router notices that a packet appears to have been routed wrongly.

The echo request and echo reply messages are used to test whether a destination is reachable and alive. On receiving an echo request, the destination is expected to send back an echo reply. The timestamp request and timestamp reply are similar, except that the arrival time of the message and the departure time of the reply are recorded in the reply. This makes them useful for measuring network performance.
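The echo request and echo reply pair can be sketched at the byte level. The helper names below are invented for this example; the message layout (type, code, checksum, identifier, sequence number, then data, with type 8 for a request and type 0 for a reply) is the standard ICMP echo format. Nothing is sent on the network, since raw sockets usually need administrator privileges.

```python
import struct

ECHO_REQUEST, ECHO_REPLY = 8, 0


def internet_checksum(data: bytes) -> int:
    """16-bit one's complement sum over the whole ICMP message."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(struct.unpack("!%dH" % (len(data) // 2), data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_echo(msg_type: int, ident: int, seq: int, payload: bytes) -> bytes:
    header = struct.pack("!BBHHH", msg_type, 0, 0, ident, seq)
    csum = internet_checksum(header + payload)
    return struct.pack("!BBHHH", msg_type, 0, csum, ident, seq) + payload


def answer_echo(request: bytes) -> bytes:
    """Build the echo reply a destination would send back for a request."""
    msg_type, _code, _csum, ident, seq = struct.unpack("!BBHHH", request[:8])
    if msg_type != ECHO_REQUEST or internet_checksum(request) != 0:
        raise ValueError("not a valid echo request")
    return build_echo(ECHO_REPLY, ident, seq, request[8:])   # data is echoed unchanged


request = build_echo(ECHO_REQUEST, ident=1, seq=7, payload=b"ping")
reply = answer_echo(request)
```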

The implementation of IP on Linux

As shown in Figure 6, Linux implements TCP/IP as a layered software structure. BSD sockets are supported by generic socket management software, the INET socket layer, which manages the communication endpoints of the IP-based TCP and UDP protocols. When transmitting UDP datagrams, Linux does not have to care whether a datagram reaches its destination safely. For TCP datagrams, however, Linux must number the datagrams, and the source and destination must cooperate to ensure that datagrams are not lost or delivered out of order. The IP layer contains the code that handles datagram headers, and it must pass each incoming datagram to the correct upper layer, TCP or UDP. Below the IP layer are Linux's network devices, such as Ethernet or PPP devices. Unlike other devices in a Linux system, network devices do not always represent physical hardware; the loopback device, for example, is purely software. The ARP protocol provides address resolution and therefore sits between the IP layer and the network devices.

Figure 6 Linux Network hierarchy diagram

6.1 Socket Buffers

Linux uses socket buffers (sk_buff) to pass data between the protocol layers and the network devices. An sk_buff contains pointer and length fields that allow each protocol layer to manipulate the application data through standard functions. As shown in Figure 7, each sk_buff has an associated block of data, four data pointers, and two length fields. With the four pointers each protocol layer can manipulate and manage the data in the socket buffer. The four pointers are used as follows.

head: Points to the start of the data area in memory. Its value is fixed once the sk_buff and its associated data block have been allocated.

data: Points to the current start of the protocol data. Its value varies with the protocol layer that currently owns the sk_buff.

tail: Points to the current end of the protocol data. Like the data pointer, its value varies with the protocol layer that currently owns the sk_buff.

end: Points to the end of the data area in memory. Like the head pointer, its value is fixed when the sk_buff is allocated.

The two length fields of sk_buff, len and truesize, describe the length of the current protocol datagram and the total size of the data buffer, respectively.
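How the four pointers move as headers are added and removed can be shown with a toy model. This Python class is only a sketch: the class name SkBuff and its methods are invented here (loosely named after the kind of reserve, put, push, and pull operations a buffer like this needs), indexes stand in for pointers, and truesize counts only the data area as described above.

```python
class SkBuff:
    """Toy model of a socket buffer: indexes play the role of the pointers."""

    def __init__(self, size: int):
        self.buf = bytearray(size)
        self.head = 0            # start of the data area, fixed
        self.end = size          # end of the data area, fixed
        self.data = 0            # start of the current protocol data
        self.tail = 0            # end of the current protocol data
        self.truesize = size     # size of the data buffer

    @property
    def len(self) -> int:
        return self.tail - self.data

    def reserve(self, n: int) -> None:
        """Leave headroom for headers (call on an empty buffer)."""
        self.data += n
        self.tail += n

    def put(self, payload: bytes) -> None:
        """Append data at the tail."""
        if self.tail + len(payload) > self.end:
            raise ValueError("no room at the tail")
        self.buf[self.tail:self.tail + len(payload)] = payload
        self.tail += len(payload)

    def push(self, header: bytes) -> None:
        """Prepend a protocol header in front of the data."""
        if self.data - len(header) < self.head:
            raise ValueError("no headroom")
        self.data -= len(header)
        self.buf[self.data:self.data + len(header)] = header

    def pull(self, n: int) -> bytes:
        """Strip n bytes from the front, as a receiving layer does."""
        if n > self.len:
            raise ValueError("not enough data")
        removed = bytes(self.buf[self.data:self.data + n])
        self.data += n
        return removed

    def payload(self) -> bytes:
        return bytes(self.buf[self.data:self.tail])


skb = SkBuff(64)
skb.reserve(28)                  # room for a 20-byte and an 8-byte header
skb.put(b"payload")              # application data
skb.push(b"UDPHDR__")            # transport layer adds its 8-byte header
skb.push(b"I" * 20)              # network layer adds its 20-byte header
sent_len = skb.len
ip_header = skb.pull(20)         # a receiver strips the headers again
```

Only data and tail move; head and end stay where they were when the buffer was allocated.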

6.2 Receiving IP Datagrams

When a network device receives a datagram from the network, it must convert the received data into an sk_buff data structure and add it to the backlog queue. If the backlog queue grows too large, received sk_buffs are discarded. When a new sk_buff is added to the backlog queue, the network bottom-half handler is flagged as ready to run, so that the scheduler can dispatch it.

The scheduler eventually runs the network bottom-half handler. The handler processes any datagrams waiting to be transmitted and works through the sk_buff structures in the backlog queue, determining for each received datagram which protocol layer it should be passed to.

When Linux initializes the network layer, each protocol registers itself by adding a packet_type data structure to either the ptype_all linked list or the ptype_base hash table. The packet_type structure contains the protocol type, a pointer to a network device, a pointer to the protocol's receive processing routine, and so on. ptype_base is a hash table indexed by a hash of the protocol identifier, and the kernel normally uses it to decide which protocol should receive an incoming datagram. By walking the ptype_all list and the ptype_base hash table, the bottom-half handler clones the new sk_buff as needed, and the sk_buff is finally passed to the receive routines of one or more matching protocols.
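The registration and dispatch scheme can be modelled in a few lines of Python. This is a sketch, not kernel code: the function names, the table size of 16, and the modulo hash are assumptions chosen for illustration, while the protocol identifiers 0x0800 (IP), 0x0806 (ARP), and 0x0003 (all protocols) are the standard values.

```python
ETH_P_ALL = 0x0003
ETH_P_IP = 0x0800
ETH_P_ARP = 0x0806
PTYPE_HASH_SIZE = 16      # assumed table size for this sketch

ptype_all = []                                        # handlers that see every datagram
ptype_base = [[] for _ in range(PTYPE_HASH_SIZE)]     # hash table keyed by protocol


def register_packet_type(proto: int, handler) -> None:
    if proto == ETH_P_ALL:
        ptype_all.append(handler)
    else:
        ptype_base[proto % PTYPE_HASH_SIZE].append((proto, handler))


def deliver(proto: int, skb: bytes) -> list:
    """Pass a copy of the buffer to every handler registered for proto."""
    handlers = list(ptype_all)
    bucket = ptype_base[proto % PTYPE_HASH_SIZE]
    handlers += [h for p, h in bucket if p == proto]  # a bucket may hold several protocols
    return [handler(bytes(skb)) for handler in handlers]


register_packet_type(ETH_P_IP, lambda skb: ("ip", len(skb)))
register_packet_type(ETH_P_ARP, lambda skb: ("arp", len(skb)))
register_packet_type(ETH_P_ALL, lambda skb: ("sniffer", len(skb)))

results = deliver(ETH_P_IP, b"\x45" + bytes(19))
```

The handler registered for all protocols sees the datagram as well as the IP handler, which is why the buffer has to be copied for each of them.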

6.3 Sending IP datagrams

The network processing code must build an sk_buff to hold the data to be transmitted, and as the data passes between protocol layers, the various protocol headers and trailers have to be added.

First, IP must decide which network device to use, and that choice depends on the best route for the datagram. The choice is simple for a computer connected to the network only through a modem and PPP, but more complex for a computer connected to an Ethernet.

For each IP datagram to be transmitted, IP uses the routing tables to resolve the route to the destination IP address. For each destination IP address successfully looked up, the routing table returns an rtable data structure describing the route to use. This includes the source address to use, the address of the device data structure for the network device, and a prebuilt hardware header. The hardware header is specific to the network device and contains the source and destination physical addresses as well as other media-specific information.

6.4 Fragmentation and reassembly of datagrams

When an IP datagram is transmitted, IP finds the network device that will send it from the IP routing table. The device data structure for that network device contains an mtu field describing its maximum transmission unit. If the MTU of the device is smaller than the IP datagram waiting to be sent, the datagram must be broken into smaller fragments. Each fragment is represented by an sk_buff whose IP header is marked to show that it is a fragment and what offset it has within the IP datagram. The last fragment is marked as the last IP fragment. If IP cannot allocate an sk_buff during fragmentation, the transmission fails.

Receiving IP fragments is harder than sending them, because fragments can arrive in any order and all of them must be received before they can be reassembled. Each time an IP datagram is received, IP checks whether it is a fragment. When the first fragment of a message arrives, IP creates a new ipq data structure and links it into the ipqueue list of IP fragments awaiting reassembly. As further fragments arrive, IP finds the correct ipq data structure and creates a new ipfrag data structure to describe each fragment. Each ipq data structure contains the source and destination IP addresses, the upper-layer protocol identifier, and the identifier of the IP frame, which together uniquely describe a fragmented IP frame being received. When all the fragments have been received, they are combined into a single sk_buff and passed up to the next protocol layer. If the timer expires before all fragments have arrived, the ipq data structure and its ipfrag structures are discarded and the message is presumed lost in transit; the higher-level protocol must then ask the source host to retransmit the lost information.
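The bookkeeping described above can be approximated in Python. The sketch keeps one queue per (source, destination, protocol, identifier) key, in the spirit of the ipq and ipfrag structures; the class and function names are invented, fragments are assumed not to overlap, and the reassembly timer is left out.

```python
class FragmentQueue:
    """Plays the role of one ipq entry; each stored piece stands for an ipfrag."""

    def __init__(self):
        self.fragments = {}      # byte offset -> fragment data
        self.total_len = None    # known once the last fragment (MF clear) arrives

    def add(self, offset_units: int, more_fragments: bool, data: bytes) -> None:
        offset = offset_units * 8            # the header stores 8-byte units
        self.fragments[offset] = data
        if not more_fragments:
            self.total_len = offset + len(data)

    def complete(self) -> bool:
        if self.total_len is None:
            return False
        expected = 0
        for offset in sorted(self.fragments):
            if offset != expected:           # a hole: some fragment is still missing
                return False
            expected += len(self.fragments[offset])
        return expected == self.total_len

    def assemble(self) -> bytes:
        return b"".join(self.fragments[o] for o in sorted(self.fragments))


ipqueue = {}   # (src, dst, protocol, ident) -> FragmentQueue


def receive_fragment(src, dst, protocol, ident, offset_units, more_fragments, data):
    """Return the reassembled payload once all fragments are in, else None."""
    key = (src, dst, protocol, ident)
    queue = ipqueue.setdefault(key, FragmentQueue())
    queue.add(offset_units, more_fragments, data)
    if queue.complete():
        del ipqueue[key]
        return queue.assemble()
    return None


payload = bytes(range(256)) * 10             # 2560 bytes of sample data
key = ("10.0.0.1", "10.0.0.2", 17, 42)
# Fragments arrive out of order: last, first, middle.
r1 = receive_fragment(*key, 256, False, payload[2048:])
r2 = receive_fragment(*key, 0, True, payload[:1024])
r3 = receive_fragment(*key, 128, True, payload[1024:2048])
```

Only the third call returns data, because only then is every byte from offset 0 up to the end of the last fragment present.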

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
