IP is the core protocol of the TCP/IP protocol family. All TCP, UDP, ICMP, and IGMP data is transmitted in the form of IP datagrams.
IP is an unreliable protocol: it does not guarantee that each IP datagram reaches its destination, and provides only a best-effort delivery service. If an error occurs (for example, a router has temporarily run out of buffers), IP applies a simple error-handling algorithm: it discards the datagram and then sends an ICMP message to the sender. Each datagram is processed independently of the others, so IP datagrams may be received out of the order in which they were sent. Any reliability must be provided by an upper-layer protocol, such as TCP.
The input, output, and forwarding of IP datagrams, the processing of control information, and the management of peer information blocks involve the following files:
include/net/ip.h defines the structures, macros, and function prototypes related to the IP layer
include/linux/inetdevice.h defines the structures and macros related to IPv4-specific network devices
include/linux/errqueue.h defines the structures related to error handling
include/net/inetpeer.h defines the structures, macros, and function prototypes of the peer information block
include/net/dst.h defines the structures, macros, and function prototypes of the destination route cache
net/ipv4/ip_output.c IP datagram output
net/ipv4/ip_sockglue.c IP layer socket options
net/ipv4/ip_input.c IP datagram input
net/ipv4/ip_forward.c IP datagram forwarding
net/ipv4/inetpeer.c peer information block management
net/ipv4/af_inet.c interface between the network layer and the transport layer
IP datagram format
See TCP/IP protocol study notes (3): Internet Protocol (IP)
IP datagram input and output
The network layer sits between the transport layer and the link layer, and also has to interact with the routing table and the neighbor subsystem. On input, it provides an input interface for the link layer to call, and calls the input interface of the transport layer to pass data up. On output, it provides an output interface for the transport layer to call, and calls the output interface of the link layer to pass data down. On both the input and the output path, a route lookup is performed and the datagram is passed through netfilter.
Private information control block of the IP layer
The IP layer keeps a private information control block, struct inet_skb_parm, in the SKB. It is stored in the cb member of the sk_buff structure. The IP layer accesses it through the IPCB macro, which makes the code more readable. This private control block mainly stores the IP options and the flags set during IP processing.
The IP layer must process IP options on both input and output. During input, ip_rcv_options() parses the options in the IP header and saves them to the opt member of the inet_skb_parm structure. During output, ip_options_build() assembles the options from the opt member of the inet_skb_parm structure and writes them into the IP header. During forwarding, ip_forward_options() processes the options as appropriate.
struct inet_skb_parm
{
	struct ip_options	opt;		/* Compiled IP options */
	unsigned char		flags;

#define IPSKB_FORWARDED		1
#define IPSKB_XFRM_TUNNEL_SIZE	2
#define IPSKB_XFRM_TRANSFORMED	4
#define IPSKB_FRAG_COMPLETE	8
#define IPSKB_REROUTED		16
};

#define IPCB(skb) ((struct inet_skb_parm*)((skb)->cb))
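The control-block overlay can be tried outside the kernel. The following is a minimal user-space sketch, not kernel code: every `mock_` and `MOCK_` name is a hypothetical stand-in, and the 48-byte size of `cb` is an assumption made for illustration. It only shows how a macro like IPCB reinterprets a generic scratch area as a layer-private structure.

```c
#include <stdint.h>

/* Hypothetical stand-ins for the kernel types; only the layout idea matters. */
struct mock_ip_options {
	uint8_t optlen;
};

struct mock_inet_skb_parm {
	struct mock_ip_options opt;
	unsigned char flags;
};

#define MOCK_IPSKB_FORWARDED 1

struct mock_sk_buff {
	/* Scratch area owned by whichever layer currently handles the buffer.
	 * The size (48) is an assumption of this sketch. */
	_Alignas(8) char cb[48];
};

/* Same shape as IPCB(): view the scratch area as the IP-private block.
 * The kernel is built with strict aliasing disabled; portable code that
 * needs full conformance would copy with memcpy() instead. */
#define MOCK_IPCB(skb) ((struct mock_inet_skb_parm *)((skb)->cb))

/* The private block must fit inside the scratch area. */
_Static_assert(sizeof(struct mock_inet_skb_parm) <= 48,
	       "control block too large for cb");

/* Record a flag in the control block. */
static void mock_mark_forwarded(struct mock_sk_buff *skb)
{
	MOCK_IPCB(skb)->flags |= MOCK_IPSKB_FORWARDED;
}

/* Test the flag later on the same buffer. */
static int mock_was_forwarded(struct mock_sk_buff *skb)
{
	return (MOCK_IPCB(skb)->flags & MOCK_IPSKB_FORWARDED) != 0;
}
```

Because the flag lives inside the buffer itself, any function that later receives the same buffer can read it without extra parameters.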
The entry point for setting IP-layer socket options is ip_setsockopt(). This function first checks the option level. If the level is not SOL_IP, it returns an error indicating an invalid protocol option (-ENOPROTOOPT). Otherwise it calls do_ip_setsockopt() to process the specific option. In do_ip_setsockopt(), ip_mroute_opt() identifies the multicast-routing options, which are handled separately; all other options are processed by do_ip_setsockopt() itself.
The IP-layer socket options include the following:
1) IP_OPTIONS
Sets or gets the IP options placed in the IP header of every datagram sent from the socket. The options may be at most 40 bytes long. The argument is a pointer to a buffer holding the options, together with the option length.
2) IP_PKTINFO
Determines whether IP_PKTINFO information about the local address can be obtained through the IP_PKTOPTIONS option or the recvmsg() system call.
3) IP_TTL
Sets the time to live of outgoing IP datagrams. Valid values are 1 to 255. Each router that an IP datagram passes through decrements the TTL by 1. Once the TTL reaches 0, the datagram is discarded, which prevents routing loops from leaving datagrams wandering in the network forever.
The specific options are as follows:
include/linux/in.h
#define IP_TOS		1
#define IP_TTL		2
#define IP_HDRINCL	3
#define IP_OPTIONS	4
#define IP_ROUTER_ALERT	5
#define IP_RECVOPTS	6
#define IP_RETOPTS	7
#define IP_PKTINFO	8
#define IP_PKTOPTIONS	9
#define IP_MTU_DISCOVER	10
#define IP_RECVERR	11
#define IP_RECVTTL	12
#define IP_RECVTOS	13
#define IP_MTU		14
#define IP_FREEBIND	15
#define IP_IPSEC_POLICY	16
#define IP_XFRM_POLICY	17
#define IP_PASSSEC	18
#define IP_TRANSPARENT	19

/* BSD compatibility */
#define IP_RECVRETOPTS	IP_RETOPTS

/* TProxy original addresses */
#define IP_ORIGDSTADDR	20
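From user space, these options are reached through setsockopt() and getsockopt(). The following is a minimal sketch for IP_TTL; the helper names `set_ip_ttl` and `get_ip_ttl` are hypothetical, and IPPROTO_IP is used as the level (on Linux it has the same value as SOL_IP).

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Set the TTL used for datagrams sent on an IPv4 socket.
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
static int set_ip_ttl(int fd, int ttl)
{
	return setsockopt(fd, IPPROTO_IP, IP_TTL, &ttl, sizeof(ttl));
}

/* Read the current TTL back. Returns the TTL, or -1 on error. */
static int get_ip_ttl(int fd)
{
	int ttl = 0;
	socklen_t len = sizeof(ttl);

	if (getsockopt(fd, IPPROTO_IP, IP_TTL, &ttl, &len) < 0)
		return -1;
	return ttl;
}
```

A caller would typically create a socket with socket(AF_INET, SOCK_DGRAM, 0), call set_ip_ttl(fd, 64), and check the return value; a value above 255 is expected to be rejected.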
The ipv4_devconf structure holds the IPv4 system configuration of a network device interface. The kernel has a global variable named ipv4_devconf whose settings apply to all interfaces. In addition, the IP control block of each network device has its own configuration, which applies only to that device.
enum
{
	NET_IPV4_CONF_FORWARDING=1,
	NET_IPV4_CONF_MC_FORWARDING=2,
	NET_IPV4_CONF_PROXY_ARP=3,
	NET_IPV4_CONF_ACCEPT_REDIRECTS=4,
	NET_IPV4_CONF_SECURE_REDIRECTS=5,
	NET_IPV4_CONF_SEND_REDIRECTS=6,
	NET_IPV4_CONF_SHARED_MEDIA=7,
	NET_IPV4_CONF_RP_FILTER=8,
	NET_IPV4_CONF_ACCEPT_SOURCE_ROUTE=9,
	NET_IPV4_CONF_BOOTP_RELAY=10,
	NET_IPV4_CONF_LOG_MARTIANS=11,
	NET_IPV4_CONF_TAG=12,
	NET_IPV4_CONF_ARPFILTER=13,
	NET_IPV4_CONF_MEDIUM_ID=14,
	NET_IPV4_CONF_NOXFRM=15,
	NET_IPV4_CONF_NOPOLICY=16,
	NET_IPV4_CONF_FORCE_IGMP_VERSION=17,
	NET_IPV4_CONF_ARP_ANNOUNCE=18,
	NET_IPV4_CONF_ARP_IGNORE=19,
	NET_IPV4_CONF_PROMOTE_SECONDARIES=20,
	NET_IPV4_CONF_ARP_ACCEPT=21,
	NET_IPV4_CONF_ARP_NOTIFY=22,
	NET_IPV4_CONF_SRC_VMARK=24,
	__NET_IPV4_CONF_MAX
};

struct ipv4_devconf
{
	void	*sysctl;
	int	data[__NET_IPV4_CONF_MAX - 1];
	DECLARE_BITMAP(state, __NET_IPV4_CONF_MAX - 1);
};
NET_IPV4_CONF_FORWARDING
Indicates whether IP datagram forwarding is enabled.
NET_IPV4_CONF_PROXY_ARP
Indicates whether proxy ARP is enabled.
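Because the identifiers in the enum start at 1 while the data array holds __NET_IPV4_CONF_MAX - 1 entries, a setting with identifier id naturally lives at data[id - 1]. The sketch below illustrates that indexing together with a global and a per-device configuration. All `mock_` and `MOCK_` names are hypothetical, and the rule that proxy ARP is on when either the global or the device setting is on is an assumption of this sketch, not something the text above states.

```c
/* Hypothetical subset of the identifiers; values follow the enum above. */
enum {
	MOCK_CONF_FORWARDING = 1,
	MOCK_CONF_PROXY_ARP  = 3,
	MOCK_CONF_MAX        = 25
};

struct mock_devconf {
	int data[MOCK_CONF_MAX - 1];	/* identifier id is stored at data[id - 1] */
};

/* Read one setting by identifier. */
static int mock_conf_get(const struct mock_devconf *c, int id)
{
	return c->data[id - 1];
}

/* Write one setting by identifier. */
static void mock_conf_set(struct mock_devconf *c, int id, int val)
{
	c->data[id - 1] = val;
}

/* Assumed combination rule for this sketch: the feature is on if it is
 * enabled either globally ("all") or on the specific device. */
static int mock_proxy_arp_on(const struct mock_devconf *all,
			     const struct mock_devconf *dev)
{
	return mock_conf_get(all, MOCK_CONF_PROXY_ARP) ||
	       mock_conf_get(dev, MOCK_CONF_PROXY_ARP);
}
```

Enabling the setting in the global structure would turn it on for every device under this rule, while enabling it in one device structure affects only that device.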
Socket error queue
The transmission control block contains a queue, sk_error_queue, that stores error information. When ICMP receives an error message, or when a UDP or raw socket fails to output a packet, an SKB describing the error is generated and added to this queue. To obtain detailed error information through system calls, the application must set the IP_RECVERR socket option and then call recvmsg() with MSG_ERRQUEUE in the flags parameter.
Error message data streams and function call relationships
IP_RECVERR has different semantics for connection-oriented sockets. Instead of saving errors to the error queue, it passes all received errors to the user process immediately. This is useful for TCP applications with short-lived connections, which need fast error handling. Note that TCP has no error queue, and MSG_ERRQUEUE is invalid for connection-oriented sockets.
When an error is delivered to the user, it is not passed as message content. It is stored as an error information block in the SKB control block, and that block is accessed through the SKB_EXT_ERR macro.
#define SKB_EXT_ERR(skb) ((struct sock_exterr_skb *) ((skb)->cb))

struct sock_exterr_skb
{
	union {
		struct inet_skb_parm	h4;
#if defined(CONFIG_IPV6) || defined (CONFIG_IPV6_MODULE)
		struct inet6_skb_parm	h6;
#endif
	} header;
	struct sock_extended_err	ee;
	u16				addr_offset;
	__be16				port;
};
Errors are also processed at the IP layer. To remain compatible with the IP control block, the front part of the error information block consists of the IP control block, followed by the error information.
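The effect of placing the IP control block first can be checked with a small user-space model. This is a sketch under assumptions: the `mock_` structures are hypothetical, simplified stand-ins whose field sizes do not match the kernel. It shows that error fields written after the header leave data previously stored through the IP control block view untouched.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; sizes are not those of the kernel. */
struct mock_inet_skb_parm {
	unsigned char opt[16];
	unsigned char flags;
};

struct mock_extended_err {
	uint32_t ee_errno;
	uint8_t  ee_origin, ee_type, ee_code, ee_pad;
	uint32_t ee_info, ee_data;
};

struct mock_exterr_skb {
	union {
		struct mock_inet_skb_parm h4;	/* IP control block comes first */
	} header;
	struct mock_extended_err ee;		/* error information follows */
	uint16_t addr_offset;
	uint16_t port;
};

/* The IP control block view and the error block view start at the same byte. */
_Static_assert(offsetof(struct mock_exterr_skb, header) == 0,
	       "IP control block must sit at the front");

/* Write a flag through the IP view, fill in error data through the error
 * view, then read the flag back through the IP view. */
static int mock_flags_survive_error_fill(void)
{
	union {
		struct mock_inet_skb_parm ip;
		struct mock_exterr_skb err;
	} cb;

	memset(&cb, 0, sizeof(cb));
	cb.ip.flags = 1;		/* set earlier by IP processing */
	cb.err.ee.ee_errno = 111;	/* illustrative errno value */
	cb.err.port = 0;
	return cb.ip.flags;
}
```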
Adding an ICMP error message
When an ICMP error message is received, the ICMP module calls the error handling routine of the transport layer protocol of the original datagram. That routine in turn calls ip_icmp_error() to add the error information to the error queue of the transmission control block that sent the datagram in error.
void ip_icmp_error(struct sock *sk, struct sk_buff *skb, int err,
		   __be16 port, u32 info, u8 *payload)
{
	struct inet_sock *inet = inet_sk(sk);
	struct sock_exterr_skb *serr;

	if (!inet->recverr)
		return;

	skb = skb_clone(skb, GFP_ATOMIC);
	if (!skb)
		return;

	serr = SKB_EXT_ERR(skb);
	serr->ee.ee_errno = err;
	serr->ee.ee_origin = SO_EE_ORIGIN_ICMP;
	serr->ee.ee_type = icmp_hdr(skb)->type;
	serr->ee.ee_code = icmp_hdr(skb)->code;
	serr->ee.ee_pad = 0;
	serr->ee.ee_info = info;
	serr->ee.ee_data = 0;
	serr->addr_offset = (u8 *)&(((struct iphdr *)(icmp_hdr(skb) + 1))->daddr) -
				   skb_network_header(skb);
	serr->port = port;

	if (skb_pull(skb, payload - skb->data) != NULL) {
		skb_reset_transport_header(skb);
		if (sock_queue_err_skb(sk, skb) == 0)
			return;
	}
	kfree_skb(skb);
}
sk: the transmission control block that sent the datagram in error
skb: the ICMP error message passed from the ICMP module to the transport layer
err: the error code
port: for UDP, the destination port of the datagram in error; otherwise 0
info: extended error information
payload: pointer to the application-layer data of the original datagram that triggered the ICMP error
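The addr_offset value computed in ip_icmp_error() is the distance from the start of the outer IP header to the destination address inside the original IP header that the ICMP message quotes. The sketch below reproduces that arithmetic on a plain byte buffer. It is an illustration only: `inner_daddr_offset` is a hypothetical helper, and struct iphdr and struct icmphdr are taken from the Linux/glibc user-space headers.

```c
#include <stddef.h>
#include <stdint.h>
#include <netinet/ip.h>		/* struct iphdr (Linux/glibc) */
#include <netinet/ip_icmp.h>	/* struct icmphdr (Linux/glibc) */

/* Given a pointer to the outer IP header of a received ICMP error, return
 * the byte offset of the destination address in the quoted (inner) IP
 * header, measured from the start of the outer IP header. */
static size_t inner_daddr_offset(const uint8_t *net_hdr)
{
	/* Low nibble of the first byte is the header length in 32-bit words. */
	size_t outer_len = (size_t)(net_hdr[0] & 0x0f) * 4;
	const uint8_t *icmp = net_hdr + outer_len;
	/* The quoted IP header starts right after the fixed ICMP header. */
	const uint8_t *inner = icmp + sizeof(struct icmphdr);

	return (size_t)(inner + offsetof(struct iphdr, daddr) - net_hdr);
}
```

With an outer header of 20 bytes (no options), the result is 20 + 8 + 16 = 44; storing this offset lets the reader of the error queue later recover the destination address of the original datagram from the saved packet.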
Adding locally generated error information
When a UDP or raw socket sends data and the length of the data exceeds what an IP datagram can carry, ip_local_error() is called to add the datagram-too-long error to the error queue of the sending transmission control block. The implementation of ip_local_error() is similar to that of ip_icmp_error(). The difference is that the error information originates locally.
void ip_local_error(struct sock *sk, int err, __be32 daddr, __be16 port, u32 info)
{
	struct inet_sock *inet = inet_sk(sk);
	struct sock_exterr_skb *serr;
	struct iphdr *iph;
	struct sk_buff *skb;

	if (!inet->recverr)
		return;

	skb = alloc_skb(sizeof(struct iphdr), GFP_ATOMIC);
	if (!skb)
		return;

	skb_put(skb, sizeof(struct iphdr));
	skb_reset_network_header(skb);
	iph = ip_hdr(skb);
	iph->daddr = daddr;

	serr = SKB_EXT_ERR(skb);
	serr->ee.ee_errno = err;
	serr->ee.ee_origin = SO_EE_ORIGIN_LOCAL;
	serr->ee.ee_type = 0;
	serr->ee.ee_code = 0;
	serr->ee.ee_pad = 0;
	serr->ee.ee_info = info;
	serr->ee.ee_data = 0;
	serr->addr_offset = (u8 *)&iph->daddr - skb_network_header(skb);
	serr->port = port;

	__skb_pull(skb, skb_tail_pointer(skb) - skb->data);
	skb_reset_transport_header(skb);

	if (sock_queue_err_skb(sk, skb))
		kfree_skb(skb);
}
Reading error information
Normally, recvmsg() is used to receive data sent to the socket by the remote end. However, by setting MSG_ERRQUEUE in flags, it can also read the error information in the error queue of the transmission control block. The recvmsg() implementations of UDP and raw sockets check whether the MSG_ERRQUEUE flag is set. If it is, they call ip_recv_error() directly to read an error from the error queue of the transmission control block and return it.
/*
 *	Handle MSG_ERRQUEUE
 */
int ip_recv_error(struct sock *sk, struct msghdr *msg, int len)
{
	struct sock_exterr_skb *serr;
	struct sk_buff *skb, *skb2;
	struct sockaddr_in *sin;
	struct {
		struct sock_extended_err ee;
		struct sockaddr_in	 offender;
	} errhdr;
	int err;
	int copied;

	err = -EAGAIN;
	skb = skb_dequeue(&sk->sk_error_queue);
	if (skb == NULL)
		goto out;

	copied = skb->len;
	if (copied > len) {
		msg->msg_flags |= MSG_TRUNC;
		copied = len;
	}
	err = skb_copy_datagram_iovec(skb, 0, msg->msg_iov, copied);
	if (err)
		goto out_free_skb;

	sock_recv_timestamp(msg, sk, skb);

	serr = SKB_EXT_ERR(skb);

	sin = (struct sockaddr_in *)msg->msg_name;
	if (sin) {
		sin->sin_family = AF_INET;
		sin->sin_addr.s_addr = *(__be32 *)(skb_network_header(skb) +
						   serr->addr_offset);
		sin->sin_port = serr->port;
		memset(&sin->sin_zero, 0, sizeof(sin->sin_zero));
	}

	memcpy(&errhdr.ee, &serr->ee, sizeof(struct sock_extended_err));
	sin = &errhdr.offender;
	sin->sin_family = AF_UNSPEC;
	if (serr->ee.ee_origin == SO_EE_ORIGIN_ICMP) {
		struct inet_sock *inet = inet_sk(sk);

		sin->sin_family = AF_INET;
		sin->sin_addr.s_addr = ip_hdr(skb)->saddr;
		sin->sin_port = 0;
		memset(&sin->sin_zero, 0, sizeof(sin->sin_zero));
		if (inet->cmsg_flags)
			ip_cmsg_recv(msg, skb);
	}

	put_cmsg(msg, SOL_IP, IP_RECVERR, sizeof(errhdr), &errhdr);

	/* Now we could try to dump offended packet options */

	msg->msg_flags |= MSG_ERRQUEUE;
	err = copied;

	/* Reset and regenerate socket error */
	spin_lock_bh(&sk->sk_error_queue.lock);
	sk->sk_err = 0;
	skb2 = skb_peek(&sk->sk_error_queue);
	if (skb2 != NULL) {
		sk->sk_err = SKB_EXT_ERR(skb2)->ee.ee_errno;
		spin_unlock_bh(&sk->sk_error_queue.lock);
		sk->sk_error_report(sk);
	} else
		spin_unlock_bh(&sk->sk_error_queue.lock);

out_free_skb:
	kfree_skb(skb);
out:
	return err;
}
Peer information block
The peer information block, described by the inet_peer structure, stores information about a peer, including its address and the timestamp used by the transport layer. It is mainly used to defend against fragmentation attacks when reassembling IP datagrams and, when a TCP connection is established, to check whether the connection request segment is valid and whether its sequence number has wrapped around.
struct inet_peer
{
	/* group together avl_left,avl_right,v4daddr to speedup lookups */
	struct inet_peer	*avl_left, *avl_right;
	__be32			v4daddr;	/* peer's address */
	__u16			avl_height;
	__u16			ip_id_count;	/* IP ID for the next packet */
	struct list_head	unused;
	__u32			dtime;		/* the time of last use of not
						 * referenced entries */
	atomic_t		refcnt;
	atomic_t		rid;		/* Frag reception counter */
	__u32			tcp_ts;
	unsigned long		tcp_ts_stamp;
};
Creating and looking up peer information blocks
Peer information blocks are created and looked up through inet_getpeer(). The create parameter indicates whether a block should be created if the lookup fails.
/* Called with or without local BH being disabled. */
struct inet_peer *inet_getpeer(__be32 daddr, int create)
{
	struct inet_peer *p, *n;
	struct inet_peer **stack[PEER_MAXDEPTH], ***stackptr;

	/* Look up for the address quickly. */
	read_lock_bh(&peer_pool_lock);
	p = lookup(daddr, NULL);
	if (p != peer_avl_empty)
		atomic_inc(&p->refcnt);
	read_unlock_bh(&peer_pool_lock);

	if (p != peer_avl_empty) {
		/* The existing node has been found. */
		/* Remove the entry from unused list if it was there. */
		unlink_from_unused(p);
		return p;
	}

	if (!create)
		return NULL;

	/* Allocate the space outside the locked region. */
	n = kmem_cache_alloc(peer_cachep, GFP_ATOMIC);
	if (n == NULL)
		return NULL;
	n->v4daddr = daddr;
	atomic_set(&n->refcnt, 1);
	atomic_set(&n->rid, 0);
	n->ip_id_count = secure_ip_id(daddr);
	n->tcp_ts_stamp = 0;

	write_lock_bh(&peer_pool_lock);
	/* Check if an entry has suddenly appeared. */
	p = lookup(daddr, stack);
	if (p != peer_avl_empty)
		goto out_free;

	/* Link the node. */
	link_to_pool(n);
	INIT_LIST_HEAD(&n->unused);
	peer_total++;
	write_unlock_bh(&peer_pool_lock);

	if (peer_total >= inet_peer_threshold)
		/* Remove one less-recently-used entry. */
		cleanup_once(0);

	return n;

out_free:
	/* The appropriate node is already in the pool. */
	atomic_inc(&p->refcnt);
	write_unlock_bh(&peer_pool_lock);
	/* Remove the entry from unused list if it was there. */
	unlink_from_unused(p);
	/* Free preallocated the preallocated node. */
	kmem_cache_free(peer_cachep, n);
	return p;
}
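The control flow of inet_getpeer() follows a common pattern: look up first, allocate outside the critical section, then look up again before linking in case another context inserted the same entry in the meantime. The sketch below models only that flow in user space. It is deliberately simplified: a singly linked list replaces the AVL tree, locking is reduced to comments, and all `mock_` names are hypothetical.

```c
#include <stdint.h>
#include <stdlib.h>

struct mock_peer {
	uint32_t addr;
	int refcnt;
	struct mock_peer *next;
};

static struct mock_peer *mock_pool;	/* list head; stands in for the AVL tree */

/* Linear search; returns NULL when the address is not in the pool. */
static struct mock_peer *mock_lookup(uint32_t addr)
{
	struct mock_peer *p;

	for (p = mock_pool; p != NULL; p = p->next)
		if (p->addr == addr)
			return p;
	return NULL;
}

/* Find the entry for addr, optionally creating it. */
static struct mock_peer *mock_getpeer(uint32_t addr, int create)
{
	struct mock_peer *p, *n;

	/* Fast path (the kernel holds the read lock here). */
	p = mock_lookup(addr);
	if (p != NULL) {
		p->refcnt++;
		return p;
	}
	if (!create)
		return NULL;

	/* Allocate outside the locked region. */
	n = calloc(1, sizeof(*n));
	if (n == NULL)
		return NULL;
	n->addr = addr;
	n->refcnt = 1;

	/* Re-check (the kernel holds the write lock here): another context
	 * may have inserted the same address while we were allocating. */
	p = mock_lookup(addr);
	if (p != NULL) {
		p->refcnt++;
		free(n);	/* drop the preallocated node */
		return p;
	}

	n->next = mock_pool;	/* link the new node */
	mock_pool = n;
	return n;
}
```

Allocating before taking the write lock keeps the exclusive section short, at the cost of occasionally throwing away a freshly allocated node.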
Deleting peer information blocks
After a peer information block has been used, it is released with inet_putpeer(). In fact, once the last reference is dropped, inet_putpeer() only adds the block to the unused_peers list, marking it as currently unused. The actual deletion and freeing are left to garbage collection.
void inet_putpeer(struct inet_peer *p)
{
	spin_lock_bh(&inet_peer_unused_lock);
	if (atomic_dec_and_test(&p->refcnt)) {
		list_add_tail(&p->unused, &unused_peers);
		p->dtime = (__u32)jiffies;
	}
	spin_unlock_bh(&inet_peer_unused_lock);
}
Garbage collection
Garbage collection runs in two ways: synchronously and asynchronously. Synchronous collection is triggered when a peer information block is created and the current number of blocks has reached inet_peer_threshold. Asynchronous collection is triggered when a timer expires.
A peer information block that is in use never expires. Only idle blocks can expire: as soon as a block becomes idle, it is added to the unused_peers list and the time is recorded. During synchronous or asynchronous cleanup, a block expires once its idle time reaches the threshold. Expiration depends on inet_peer_minttl, inet_peer_maxttl, and inet_peer_threshold.
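The interplay between releasing a block and expiring it can be modelled in a few lines. The sketch below is a simplified user-space model, not the kernel implementation: the `mock_` names are hypothetical, the current time is passed in explicitly instead of reading jiffies, and the cleanup step only unlinks the oldest idle entry when it has been idle for at least ttl ticks.

```c
#include <stddef.h>
#include <stdint.h>

struct mock_peer {
	int refcnt;
	uint32_t dtime;			/* time the entry became idle */
	struct mock_peer *next_unused;
};

static struct mock_peer *mock_unused_head, *mock_unused_tail;

/* Drop one reference; an entry with no users goes to the tail of the
 * unused list and has its idle time stamped. */
static void mock_putpeer(struct mock_peer *p, uint32_t now)
{
	if (--p->refcnt == 0) {
		p->next_unused = NULL;
		if (mock_unused_tail != NULL)
			mock_unused_tail->next_unused = p;
		else
			mock_unused_head = p;
		mock_unused_tail = p;
		p->dtime = now;
	}
}

/* Try to expire the oldest idle entry.
 * Returns 0 if one entry was removed, -1 if the list is empty or the
 * oldest entry has been idle for less than ttl. */
static int mock_cleanup_once(uint32_t ttl, uint32_t now)
{
	struct mock_peer *p = mock_unused_head;

	/* Unsigned subtraction keeps the comparison valid across wraparound. */
	if (p == NULL || (uint32_t)(now - p->dtime) < ttl)
		return -1;

	mock_unused_head = p->next_unused;
	if (mock_unused_head == NULL)
		mock_unused_tail = NULL;
	p->next_unused = NULL;
	return 0;
}
```

Since entries are appended in the order they become idle, the head of the list is always the longest-idle entry, so checking only the head is enough. A ttl of 0, as in the synchronous path, removes the head regardless of how recently it was used.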
Synchronous cleanup
struct inet_peer *inet_getpeer(__be32 daddr, int create)
{
	...
	if (peer_total >= inet_peer_threshold)
		/* Remove one less-recently-used entry. */
		cleanup_once(0);
	...
}
Asynchronous cleanup
Synchronous cleanup is a special case for handling high memory pressure. It is rare in practice and has a considerable impact on performance. To avoid it, or at least make it less likely, the peer_periodic_timer timer performs periodic cleanup. The timer handler is peer_check_expire(), and the interval is set in inet_initpeers().
/* Called with local BH disabled. */
static void peer_check_expire(unsigned long dummy)
{
	unsigned long now = jiffies;
	int ttl;

	if (peer_total >= inet_peer_threshold)
		ttl = inet_peer_minttl;
	else
		ttl = inet_peer_maxttl
				- (inet_peer_maxttl - inet_peer_minttl) / HZ *
					peer_total / inet_peer_threshold * HZ;
	while (!cleanup_once(ttl)) {
		if (jiffies != now)
			break;
	}

	/* Trigger the timer after inet_peer_gc_mintime .. inet_peer_gc_maxtime
	 * interval depending on the total number of entries (more entries,
	 * less interval). */
	if (peer_total >= inet_peer_threshold)
		peer_periodic_timer.expires = jiffies + inet_peer_gc_mintime;
	else
		peer_periodic_timer.expires = jiffies
			+ inet_peer_gc_maxtime
			- (inet_peer_gc_maxtime - inet_peer_gc_mintime) / HZ *
				peer_total / inet_peer_threshold * HZ;
	add_timer(&peer_periodic_timer);
}
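The ttl expression in peer_check_expire() is a linear interpolation: with no entries the idle limit is inet_peer_maxttl, and it shrinks toward inet_peer_minttl as peer_total approaches inet_peer_threshold. The sketch below evaluates the same integer arithmetic with illustrative numbers. The constants (HZ of 1000, a minimum of 120 seconds, a maximum of 600 seconds, a threshold of 65664) are assumptions chosen for this example, and `mock_peer_ttl` is a hypothetical name.

```c
/* Assumed values for illustration only. */
#define MOCK_HZ 1000L

static const long mock_minttl    = 120 * MOCK_HZ;	/* assumed minimum idle limit */
static const long mock_maxttl    = 600 * MOCK_HZ;	/* assumed maximum idle limit */
static const long mock_threshold = 65664;		/* assumed entry threshold */

/* Idle limit, in ticks, for a pool holding peer_total entries. Dividing
 * by MOCK_HZ before multiplying by peer_total mirrors the kernel
 * expression and keeps the intermediate product small. */
static long mock_peer_ttl(long peer_total)
{
	if (peer_total >= mock_threshold)
		return mock_minttl;
	return mock_maxttl
		- (mock_maxttl - mock_minttl) / MOCK_HZ *
			peer_total / mock_threshold * MOCK_HZ;
}
```

Under these assumptions, a pool filled to half the threshold gets an idle limit halfway between the two bounds, so idle entries are reclaimed sooner as the pool fills up. The timer interval in the same function is interpolated in the same way between inet_peer_gc_maxtime and inet_peer_gc_mintime.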