"TCP/IP Illustrated, Volume 2: The Implementation" notes: socket I/O

Source: Internet
Author: User

This article describes the system calls that read and write data on a network connection, in three parts.

The first part covers the four system calls that send data: write, writev, sendto, and sendmsg. The second part covers the four system calls that receive data: read, readv, recvfrom, and recvmsg. The third part covers the select system call, which monitors the state of descriptors in general and of sockets in particular.

The core of the socket layer is two functions: sosend and soreceive. These two functions handle all I/O between the socket layer and the protocol layer.


1. Socket buffers

Each socket has a send buffer and a receive buffer, both of type sockbuf. The following figure lists the definition of the sockbuf structure.


sb_hiwat and sb_lowat (the high-water and low-water marks) tune the socket flow-control algorithms; later sections of this article explain how. The following figure shows the default settings for the Internet protocols.


Because the source address of each incoming UDP datagram is queued together with its data, the default UDP value of sb_hiwat is set to accommodate 40 datagrams of 1K bytes each, plus the sockaddr_in structure (16 bytes) that accompanies each one.
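
The arithmetic behind that default can be worked through in a few lines. This is only a sketch of the calculation described above; the constant and function names are illustrative and are not kernel symbols.

```python
# Sizes taken from the text; the names below are illustrative, not kernel symbols.
DATAGRAM_BYTES = 1024        # one 1K-byte datagram
SOCKADDR_IN_BYTES = 16       # size of a sockaddr_in structure, per the text
DATAGRAM_COUNT = 40


def udp_default_recv_hiwat(count=DATAGRAM_COUNT):
    """Bytes needed to queue `count` datagrams, each with its source address."""
    return count * (DATAGRAM_BYTES + SOCKADDR_IN_BYTES)


print(udp_default_recv_hiwat())
```

Each queued datagram costs 1024 + 16 bytes, so the address overhead is part of the high-water mark rather than being accounted for separately.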

sb_sel is a selinfo structure used to implement the select system call.

The following figure lists the possible values of sb_flags.


sb_timeo limits how long a process may block in a read or write call, measured in clock ticks. The default is 0, which means the process waits indefinitely. The SO_SNDTIMEO and SO_RCVTIMEO socket options read or change the value of sb_timeo.
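
From user space, the receive timeout can be read and changed through the SO_RCVTIMEO option. The sketch below assumes a 64-bit Linux-style layout in which the option value is a struct timeval made of two native C longs; the helper names are hypothetical, and other platforms may need a different struct format.

```python
import socket
import struct

# Assumption: struct timeval is two native C longs, as on 64-bit Linux.
TIMEVAL = struct.Struct("ll")


def set_recv_timeout(sock, seconds):
    """Set SO_RCVTIMEO; the kernel converts the value to its own time units."""
    whole = int(seconds)
    micros = int(round((seconds - whole) * 1_000_000))
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO,
                    TIMEVAL.pack(whole, micros))


def get_recv_timeout(sock):
    """Read SO_RCVTIMEO back as seconds; 0.0 means wait indefinitely."""
    raw = sock.getsockopt(socket.SOL_SOCKET, socket.SO_RCVTIMEO, TIMEVAL.size)
    whole, micros = TIMEVAL.unpack(raw)
    return whole + micros / 1_000_000


a, b = socket.socketpair()
default_timeout = get_recv_timeout(b)    # nothing configured yet
set_recv_timeout(b, 2)
configured_timeout = get_recv_timeout(b)
a.close()
b.close()
```

The process supplies seconds and microseconds; the conversion to clock ticks described above happens inside the kernel.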


2. write, writev, sendto, and sendmsg system calls

All of these write system calls call sosend, directly or indirectly. sosend copies data from the process into the kernel and passes it to the protocol associated with the socket. The following figure shows the sosend workflow.



The write and writev system calls apply to any descriptor, while the other calls apply only to socket descriptors. writev and sendmsg can accept data from multiple buffers. Writing from multiple buffers is called gathering, and the corresponding read operation is called scattering. In a gather operation, the kernel accepts, in order, the data from each buffer specified in an array of iovec structures. The array has at most UIO_MAXIOV elements. The following figure shows the iovec structure.


iov_base points to the start of a buffer that is iov_len bytes long.

Without this interface, a process would have to copy multiple buffers into one large buffer, or issue multiple write system calls, to send data from multiple buffers. The following figure illustrates iovec structures used with the writev system call, where iovp points to the first element of the array and iovcnt is the number of elements in the array.
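
A small user-space illustration of gathering and scattering, using Python's os.writev and os.readv wrappers over a local socket pair. The buffer contents and variable names are made up for the example; this shows the calling pattern, not the kernel implementation.

```python
import os
import socket

a, b = socket.socketpair()

# Gather: three separate buffers leave the process in one writev call.
header, body, trailer = b"HDR:", b"payload", b":END"
sent = os.writev(a.fileno(), [header, body, trailer])

# Scatter: one readv call fills the caller's buffers in array order.
first, rest = bytearray(4), bytearray(32)
received = os.readv(b.fileno(), [first, rest])

a.close()
b.close()
```

The list passed to each call plays the role of the iovec array: every element stands for one (iov_base, iov_len) pair.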


A datagram protocol requires a destination address for every write. Because the write, writev, and send interfaces cannot specify a destination address, they can be used only after connect has associated a destination address with the connectionless socket. A destination address must be supplied when calling sendto or sendmsg, unless connect has already been called to specify one.
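
The difference is easy to observe with UDP on the loopback interface. In this sketch the addresses and payloads are arbitrary, and the specific error reported for the unconnected send (EDESTADDRREQ) is what Linux returns; other systems may report a different error.

```python
import errno
import socket

receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
receiver.bind(("127.0.0.1", 0))          # port 0: let the kernel pick a port
receiver.settimeout(5)
dest = receiver.getsockname()

sender = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)

# Unconnected socket: send() has no destination, so the kernel rejects it.
try:
    sender.send(b"no destination")
    unconnected_error = None
except OSError as exc:
    unconnected_error = exc.errno

# sendto supplies the destination with each datagram.
sender.sendto(b"via sendto", dest)
first, first_source = receiver.recvfrom(64)

# After connect, plain send() uses the recorded destination.
sender.connect(dest)
sender.send(b"via send")
second, second_source = receiver.recvfrom(64)

sender_address = sender.getsockname()
sender.close()
receiver.close()
```

recvfrom also hands back the source address of each datagram, which is how a receiver on a connectionless socket learns where to reply.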

The following figure shows that the sendmsg system call accepts optional control information.


Only the sendmsg system call supports control information. The control information and several other arguments are passed to sendmsg together in a single msghdr structure, rather than as separate arguments.


Control information is formatted as a cmsghdr structure.


The following figure illustrates the msghdr structure during a call to sendmsg.
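
One common use of control information on modern systems is passing a file descriptor over a Unix domain socket with an SCM_RIGHTS control message. The sketch below uses Python's sendmsg and recvmsg wrappers; it is an illustration of the interface, not something taken from the book, and the pipe is only there to prove that the received descriptor works.

```python
import array
import os
import socket

a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
read_end, write_end = os.pipe()

# Control information: one control message of type SCM_RIGHTS carrying a descriptor.
fds = array.array("i", [write_end])
a.sendmsg([b"take ", b"this"], [(socket.SOL_SOCKET, socket.SCM_RIGHTS, fds)])

data, ancillary, msg_flags, _address = b.recvmsg(
    64, socket.CMSG_LEN(fds.itemsize))
level, ctype, cdata = ancillary[0]
received = array.array("i")
received.frombytes(cdata[:fds.itemsize])
passed_fd = received[0]

# The received descriptor is a new reference to the same pipe.
os.write(passed_fd, b"hello")
echoed = os.read(read_end, 5)

for fd in (read_end, write_end, passed_fd):
    os.close(fd)
a.close()
b.close()
```

The data buffers, the control message, and the flags all travel in one call, which mirrors how the msghdr structure bundles them for sendmsg.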



3. sendmsg system call

The sendmsg and sendit functions prepare the data structures that sosend needs, and sosend then passes the message to the appropriate protocol. For SOCK_DGRAM protocols, a message is a datagram. For SOCK_STREAM protocols, a message is a sequence of bytes. For SOCK_SEQPACKET protocols, a message may be a complete record or part of a larger record. The sendmsg system call is declared as follows:


The approximate processing flow of the sendmsg function is as follows:

1. Copy the iovec array from user space into an array on the stack or, if that array is too small, into a larger dynamically allocated array.

2. Call the sendit function.


4. sendit function

sendit initializes a uio structure and copies the control and address information from the process into the kernel. First we need to introduce the uio structure, which contains the array of iovec structures along with some other information.

The approximate processing flow of the sendit function is as follows:

1. Initialize the uio structure so that data is gathered from the output buffers specified by the process into kernel buffers.

2. Copy the address and control information from the process.

3. Send the data and clean up. The socket, destination address, uio structure, control information, and flags are all passed to sosend. Before returning, sendit releases the buffer that holds the destination address. sosend is responsible for releasing the buffer that holds the control information.


5. sosend function

sosend is one of the most complicated functions in the socket layer. Its job is to pass data and control information to the pr_usrreq function of the protocol associated with the socket, according to the semantics supported by the protocol and the buffer limits of the socket. sosend never places data in the send buffer; storing and removing the data is the protocol's responsibility.

Reliable protocol buffering

For a reliable data transfer protocol, the send buffer holds both data that has not yet been transmitted and data that has been sent but not yet acknowledged. sb_cc is the number of bytes of data in the send buffer, and 0 <= sb_cc <= sb_hiwat. sb_cc may temporarily exceed sb_hiwat when out-of-band data is sent.

sosend must ensure that there is enough space in the send buffer before passing data to the protocol layer through pr_usrreq. The protocol layer places the data in the send buffer. sosend transfers data to the protocol layer in one of the following two ways:

1. If PR_ATOMIC is set (in pr_flags of the protosw structure), sosend must preserve the message boundaries between the process and the protocol layer. sosend waits until enough buffer space is available to hold the entire message, then builds an mbuf chain containing the whole message and passes it to the protocol layer in a single pr_usrreq call. RDP and SPP are protocols of this type.

2. If PR_ATOMIC is not set, sosend passes the message to the protocol one mbuf at a time, and may pass a partially filled mbuf to avoid exceeding the high-water mark. This method is used with SOCK_STREAM protocols such as TCP and SOCK_SEQPACKET protocols such as TP4.

When a message is too large for the available buffer space and the protocol allows the message to be split into segments, sosend still does not pass any data to the protocol layer until the free space in the buffer is greater than sb_lowat.

Unreliable protocol buffering

For protocols that provide unreliable data transfer (such as UDP), the send buffer never needs to hold data, and no acknowledgment is awaited. The socket layer passes each message to the protocol immediately, and the protocol queues it for transmission on the appropriate network device. In this case sb_cc is always 0, and sb_hiwat specifies the maximum size of each write, which indirectly limits the maximum size of a datagram.

The details of the sosend function are described in four parts: initialization, error and resource checking, data transfer, and protocol processing.

The approximate processing flow is as follows:

1. Compute the number of bytes to transfer.

2. Disable routing if requested. If the process asked that only this message bypass the routing tables, the don't-route setting is applied to it.

3. Check for errors. An error is returned in the following cases: output on the socket is prohibited; the socket is in an error state; the protocol requires a connection, and a connection has not been established or a connection request has not been started; no destination address was specified for a connectionless protocol.

4. Compute the available space. sosend calculates the number of free bytes remaining in the send buffer. The limit also prevents many small messages from consuming too many mbufs. For out-of-band data the limit is relaxed by 1024 bytes, which gives out-of-band data a higher priority.

5. Enforce the message size limit. If atomic is set and the message is larger than the high-water mark (sb_hiwat), an error is returned: the message is too large for the protocol to accept even if the buffer is empty. The same error is returned if the control information is longer than the high-water mark.

6. Decide whether to wait for more space. If there is not enough space in the send buffer, the data comes from a process, and one of the following conditions is true, sosend must wait for more space: the message must be passed to the protocol in one piece; the message may be split, but the free space is below the low-water mark; the message may be split, but the control information does not fit in the available space. When data is passed to sosend through top (a pointer to an mbuf chain), the data is already in mbufs, so sosend ignores the high-water and low-water marks because no additional buffers are needed to hold it. If sosend must wait for space and the socket is nonblocking, an error is returned. Otherwise, the buffer lock is released and sosend calls sbwait to wait until the state of the buffer changes. When sbwait returns, sosend reenables protocol processing, reacquires the buffer lock, and checks for errors and buffer space again.

7. Allocate a packet header or a standard mbuf. When atomic is set, a packet header is allocated on the first pass and standard mbufs are allocated afterward. When atomic is not set, a packet header is always allocated.

8. Use clusters where possible. Without a cluster, the number of bytes stored in the mbuf is limited by the smallest of three quantities: the free space in the mbuf, the number of bytes in the message, and the space in the buffer.

9. Copy data from the process. Bytes are copied from the process into the mbuf. After the copy, the length of the mbuf is updated, the previous mbuf is linked to the new one, and the length of the mbuf chain is updated.

10. Decide whether to fill another buffer. When atomic is not set, only one mbuf is passed to the protocol at a time. When atomic is set, sosend continues filling mbufs; it reaches this point only after enough buffer space was available to hold the entire message.

11. Pass the data to the protocol. The data mbuf chain, the address, and the control mbufs are passed to the protocol associated with the socket. If the process is sending out-of-band data, a PRU_SENDOOB request is issued; otherwise a PRU_SEND request is issued.
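
Steps 4 through 6 amount to a small decision procedure, which can be sketched as a simplified model. The function below is not the kernel code: its name, parameters, and string return values are invented for illustration, and the two error values (EMSGSIZE for an oversized message, EWOULDBLOCK for a nonblocking socket that would have to wait) are the ones Net/3 uses as far as these notes' author can tell from the book.

```python
import errno

OOB_EXTRA_SPACE = 1024   # extra allowance for out-of-band data (step 4)


def sosend_space_check(free_space, hiwat, lowat, msg_len, ctl_len=0,
                       atomic=False, oob=False, nonblocking=False,
                       from_process=True):
    """Return 'send', 'wait', or an errno value for one pass of the checks."""
    # Step 4: out-of-band data gets a relaxed limit.
    space = free_space + (OOB_EXTRA_SPACE if oob else 0)

    # Step 5: a message the protocol could never accept is rejected outright.
    if (atomic and msg_len > hiwat) or ctl_len > hiwat:
        return errno.EMSGSIZE

    # Step 6: wait only if space is short, the data comes from a process,
    # and one of the three listed conditions holds.
    must_wait = (space < msg_len + ctl_len and from_process and
                 (atomic or space < lowat or space < ctl_len))
    if not must_wait:
        return "send"
    return errno.EWOULDBLOCK if nonblocking else "wait"


# A stream-style socket with little free space waits for the low-water mark.
print(sosend_space_check(1000, 8192, 2048, 4000))
# With more free space than the low-water mark, part of the message can go now.
print(sosend_space_check(3000, 8192, 2048, 4000))
```

The model makes the asymmetry visible: an atomic message either fits entirely or waits, while a non-atomic message proceeds as soon as the free space clears the low-water mark.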


6. read, readv, recvfrom, and recvmsg system calls

We refer to these system calls collectively as the read system calls; they receive data from a network connection. The first three are simpler than recvmsg. The following figure summarizes the features of these four system calls and of one library function, recv.


Only the read and readv system calls apply to descriptors of any type; the other calls apply only to socket descriptors. As with the write calls, an array of iovec structures can specify multiple buffers. For datagram protocols, recvfrom and recvmsg return the source address of each received datagram. For connection-oriented protocols, getpeername returns the address of the peer.

The following figure illustrates the flow of the read system calls.



7. recvmsg system call

The recvmsg function is the most general read system call. The approximate processing flow of the function is as follows:

1. Copy the iov array. As with sendmsg, recvmsg copies the msghdr structure into the kernel, allocates a larger iovec array if the automatic array aiov is too small, and copies the array elements from the process into the kernel array that iov points to.

2. Call recvit and release the buffer. After recvit has received the data, the msghdr structure, with its updated buffer lengths and flags, is copied back to the process. If a larger iovec array was allocated, it is released before returning.


8. recvit function

The recvit function is called by recv, recvfrom, and recvmsg. Based on the msghdr structure prepared by the recvxxx call, recvit prepares a uio structure.

The approximate processing flow of the recvit function is as follows:

1. Initialize the uio structure, which describes a data transfer from the kernel to the process.

2. Call the soreceive function.

3. Copy the address and control information to the process. If the process supplied a buffer for the address, for the control information, or for both, recvit writes the results into those buffers and adjusts their lengths according to what soreceive returned. If a buffer is too small, the address information may be truncated.

4. Release the mbufs that hold the source address and control information.


9. soreceive function

The soreceive function transfers data from the socket's receive buffer to the buffers specified by the process. Some protocols also provide the sender's address, which can be returned together with any additional control information.

recvmsg is the only read system call that returns a flags field to the process. In the other system calls, the information is discarded by the kernel before control returns to the process. The following figure lists the flags that recvmsg can set in msghdr.
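
One such flag that is easy to demonstrate is MSG_TRUNC, which reports that a datagram did not fit in the supplied buffer. The sketch below uses a Unix domain datagram socket pair rather than UDP so that it needs no network setup; the payloads and buffer sizes are arbitrary.

```python
import socket

a, b = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)

a.send(b"0123456789")                                  # one 10-byte datagram
data, ancillary, msg_flags, _address = b.recvmsg(4)    # buffer holds only 4 bytes
truncated = bool(msg_flags & socket.MSG_TRUNC)

a.send(b"abc")
data2, _anc, msg_flags2, _addr = b.recvmsg(16)         # buffer is large enough
truncated2 = bool(msg_flags2 & socket.MSG_TRUNC)

a.close()
b.close()
```

A plain recv of the first datagram would return the same four bytes but give the process no indication that the remainder had been discarded.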


9.1. Out-of-band data

Out-of-band (OOB) data has different meanings in different protocols. In general, a protocol expedites OOB data along a previously established communication connection. OOB data may not stay in sequence with previously sent normal data. The socket layer supports two protocol-independent mechanisms for handling OOB data: tagging and synchronization. This article discusses the abstract OOB mechanisms implemented by the socket layer. UDP does not support OOB data. TCP's urgent data mechanism is related to the socket layer's OOB data.

A sending process tags data as OOB data by setting the MSG_OOB flag in any of the sendxxx calls. sosend passes this information to the socket's protocol, which gives the data special treatment, such as expediting it or using a different queueing strategy.

When a protocol receives OOB data, the data is set aside instead of being placed in the socket's receive buffer. A process receives the pending OOB data by setting the MSG_OOB flag in a recvxxx call. Alternatively, by setting the SO_OOBINLINE socket option, the receiving process can ask the protocol to place OOB data inline with the normal data. When SO_OOBINLINE is set, the protocol places incoming OOB data in the receive buffer along with the normal data, and MSG_OOB is not used to receive it. A read call returns either all normal data or all OOB data. The two types are never mixed in the input buffer of a single input call. A process that uses recvmsg to receive data can check the MSG_OOB flag to determine whether the returned data is normal data or OOB data.
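
Tagging can be tried out over a TCP connection on the loopback interface. The sketch below reflects how current Linux systems behave with TCP urgent data (one byte of OOB data, reported to select as an exceptional condition); it is an illustration under those assumptions rather than a description of Net/3, and the payloads are arbitrary.

```python
import select
import socket

listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))
listener.listen(1)
sender = socket.create_connection(listener.getsockname())
receiver, _peer = listener.accept()
listener.close()

sender.sendall(b"ab")                 # normal data
sender.send(b"!", socket.MSG_OOB)     # tagged as out-of-band

# select reports pending OOB data as an exceptional condition.
_r, _w, exceptional = select.select([], [], [receiver], 5)
oob = receiver.recv(1, socket.MSG_OOB) if exceptional else None
normal = receiver.recv(16)            # the OOB byte is not mixed into this read

sender.close()
receiver.close()
```

Because SO_OOBINLINE is left off, the OOB byte is fetched only by the call that sets MSG_OOB, and the ordinary read sees just the normal data.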


9.2. Receive buffer organization: message boundaries

For protocols that maintain message boundaries, each message is stored in a single mbuf chain. The following figure illustrates the structure of a UDP receive buffer consisting of three mbuf chains.



9.3. Receive buffer organization: no message boundaries

When the protocol does not need to maintain message boundaries (that is, for SOCK_STREAM protocols such as TCP), incoming data is appended to the end of the last mbuf chain in the buffer. If the incoming data is larger than the buffer, it is trimmed to fit. The following figure illustrates the structure of a TCP receive buffer that contains only normal data.
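
The two buffer organizations lead to visibly different behavior for the reading process. The sketch below uses Unix domain socket pairs as stand-ins for UDP and TCP, so that it runs without network setup; recv_exactly is a hypothetical helper written for this example.

```python
import socket


def recv_exactly(sock, count):
    """Read until `count` bytes have arrived; a stream may return them in pieces."""
    chunks = []
    while count > 0:
        chunk = sock.recv(count)
        if not chunk:
            break
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)


# Datagram socket: each send is one record and each recv returns one record.
da, db = socket.socketpair(socket.AF_UNIX, socket.SOCK_DGRAM)
da.send(b"abc")
da.send(b"def")
datagrams = [db.recv(64), db.recv(64)]

# Stream socket: the two writes become one undivided run of bytes.
sa, sb = socket.socketpair(socket.AF_UNIX, socket.SOCK_STREAM)
sa.sendall(b"abc")
sa.sendall(b"def")
stream = recv_exactly(sb, 6)

for s in (da, db, sa, sb):
    s.close()
```

On the datagram socket a 64-byte read still stops at the end of the first record, while the stream socket gives the reader no way to tell where one write ended and the next began.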



9.4. Control information and out-of-band data

Unlike TCP, some stream protocols support control information. For these protocols, the control information and its associated data are appended to the receive buffer as a new mbuf chain. If the protocol supports inline OOB data, a new mbuf chain is inserted after any mbuf chain that contains OOB data but before any mbuf chain that contains normal data. This ensures that incoming OOB data is always queued ahead of normal data. The following figure illustrates the structure of a receive buffer that contains control information and OOB data.



10. soreceive function code

When receiving data, soreceive must respect message boundaries, handle addresses and control information, and handle any special semantics identified by the read flags. Generally speaking, soreceive processes one record per call and tries to return the number of bytes requested. The approximate processing flow of the function is as follows:

1. Receive OOB data. Because OOB data is not stored in the receive buffer, soreceive allocates a standard mbuf and issues a PRU_RCVOOB request to the protocol. A while loop copies the data returned by the protocol into the buffers specified by the process.

2. Wait for data if necessary. soreceive checks several conditions and, if necessary, waits for more data to arrive before continuing. If soreceive sleeps here, it checks again after waking up whether enough data has arrived. This continues until enough data has been received.

3. Process address and control information. Address and control information are processed before any data is transferred.

4. Set up the data transfer. Because only OOB data or normal data is transferred in a single soreceive call, soreceive remembers the type of data at the front of the queue so that it can stop the transfer when the type changes.

5. Data transfer loop. The loop continues as long as there are mbufs in the buffer, the requested amount of data has not been transferred, and no error has occurred.

6. Exit processing. This mainly updates pointers, flags, and offsets, releases the socket buffer lock, reenables protocol processing, and returns.


11. select system call

The following figure lists the socket conditions that select can monitor.

In outline, select scans the file descriptors indicated by the process and returns when one or more descriptors are ready, when the timer expires, or when a signal occurs.
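
The readable and writable conditions can be observed from user space with Python's select.select wrapper. This sketch uses a local socket pair and arbitrary timeouts; it shows the calling pattern only.

```python
import select
import socket

a, b = socket.socketpair()

# Nothing has been sent yet: b is not readable, and a zero timeout returns at once.
readable_before, _w, _x = select.select([b], [], [], 0)

# An empty send buffer has room, so a is writable immediately.
_r, writable, _x = select.select([], [a], [], 0)

a.send(b"ping")
readable_after, _w, _x = select.select([b], [], [], 5)

a.close()
b.close()
```

The three lists passed to select correspond to the read, write, and exception descriptor sets, and the final argument is the timer mentioned above.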


11.1. selscan function

The heart of select is the selscan function. For each bit set in any of the descriptor sets, selscan locates the descriptor associated with it and dispatches control to the fo_select function associated with that descriptor. For sockets, this is the soo_select function.


11.2. soo_select function

For each descriptor that selscan finds in the input descriptor sets, selscan calls the function referenced by the fo_select pointer in the fileops structure associated with the descriptor. That function determines whether the socket is readable, writable, or has an exceptional condition pending, and calls the selrecord function.


11.3. selrecord function

This function records enough information for the protocol processing layer to wake up the process when the contents of the buffer change.


11.4. selwakeup function

When protocol processing changes the state of a socket buffer and only one process is selecting on that buffer, Net/3 can put the process on the run queue immediately, using the information recorded by selrecord.

Every process that calls select uses selwait as the wait channel when it calls tsleep. This means that the corresponding wakeup wakes up every process blocked in select.

The following figure illustrates how selwakeup is called.

The protocol processing layer is responsible for notifying the socket layer, by calling the functions listed at the bottom of the figure, when an event occurs that changes the state of a socket. These functions cause selwakeup to be called, and any process selecting on the socket is scheduled to run.
