Implementation of IP protocol (V4) in Linux (V)

Source: Internet
Author: User
Tags htons

This article covers IP-layer management and statistics.

First, let's look at long-lived IP peer information.

We know that IP is a stateless protocol. Even so, the kernel keeps some information for each destination IP address (in other words, for each host that has communicated with the local machine) in order to improve performance.

The peer subsystem is generally used by TCP and by the routing subsystem.

This information is stored in struct inet_peer. The peers are organized as an AVL tree keyed by IP address, so every lookup is O(log n):

C code
struct inet_peer
{
    /// Left and right subtrees of the AVL tree
    struct inet_peer *avl_left, *avl_right;
    /// IP address of the remote peer
    __be32 v4daddr; /* peer's address */
    /// Height of the tree
    __u16 avl_height;
    /// ID for the next packet sent to this peer. IP IDs are derived from this field: each time a packet is sent to this address, the current value is fetched with inet_getid.
    __u16 ip_id_count; /* IP ID for the next packet */
    /// Links this peer into the list of unused peers. Because the pool has a memory limit, peers that are no longer referenced are parked on this list and reclaimed after a given time. A peer is on the list only while its reference count is 0.
    struct list_head unused;
    /// Time at which this inet_peer was added to the unused list (by inet_putpeer).
    __u32 dtime; /* the time of last use of not
                  * referenced entries */
    /// Reference count
    atomic_t refcnt;
    /// Fragment reception counter
    atomic_t rid; /* Frag reception counter */
    /// The next two fields are used by TCP to manage timestamps.
    __u32 tcp_ts;
    unsigned long tcp_ts_stamp;
};
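
To make the keyed lookup concrete, here is a minimal user-space sketch of searching a binary tree by address. It is not the kernel's lookup code: peer_node, peer_find and peer_insert_unbalanced are hypothetical names, and the AVL rotations that keep the height at O(log n) are deliberately left out to keep the sketch short.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for struct inet_peer. */
struct peer_node {
    struct peer_node *left, *right;
    uint32_t addr;              /* lookup key: the peer's IPv4 address */
};

/* Walk down from the root, going left for smaller keys and right for
 * larger ones. The number of steps is bounded by the tree height. */
struct peer_node *peer_find(struct peer_node *root, uint32_t addr)
{
    while (root != NULL && root->addr != addr)
        root = (addr < root->addr) ? root->left : root->right;
    return root;
}

/* Plain binary-search-tree insert WITHOUT rebalancing (an assumption of
 * this sketch). Returns the root of the subtree; a node whose key is
 * already present is ignored. */
struct peer_node *peer_insert_unbalanced(struct peer_node *root,
                                         struct peer_node *n)
{
    assert(n != NULL);
    if (root == NULL) {
        n->left = n->right = NULL;
        return n;
    }
    if (n->addr < root->addr)
        root->left = peer_insert_unbalanced(root->left, n);
    else if (n->addr > root->addr)
        root->right = peer_insert_unbalanced(root->right, n);
    return root;
}
```

A real AVL tree would additionally track the height of each node (the avl_height field above) and rotate subtrees after an insert so that the search path stays short.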

The peer subsystem is initialized in inet_initpeers, which is called by ip_init, the initialization function of the IPv4 protocol. Its main tasks are:

1. Allocate a cache for storing inet_peer objects.

2. Define the maximum memory limit that can be used by inet_peer.

3. Enable the GC timer.

C code
/// Memory limit
extern int inet_peer_threshold;

/// Slab cache
static struct kmem_cache *peer_cachep __read_mostly;
/// Timer. Its handler is peer_check_expire, which is covered later.
static DEFINE_TIMER(peer_periodic_timer, peer_check_expire, 0, 0);
/// The read/write lock protecting the pool.
static DEFINE_RWLOCK(peer_pool_lock);

void __init inet_initpeers(void)
{
    struct sysinfo si;

    /* Use the straight interface to information about memory. */
    si_meminfo(&si);

    /// Only the memory information is used here: inet_peer_threshold is scaled down according to the total amount of RAM.
    if (si.totalram <= (32768*1024)/PAGE_SIZE)
        inet_peer_threshold >>= 1; /* max pool size about 1MB on IA32 */
    if (si.totalram <= (16384*1024)/PAGE_SIZE)
        inet_peer_threshold >>= 1; /* about 512KB */
    if (si.totalram <= (8192*1024)/PAGE_SIZE)
        inet_peer_threshold >>= 2; /* about 128KB */

    /// Create the cache.
    peer_cachep = kmem_cache_create("inet_peer_cache",
            sizeof(struct inet_peer),
            0, SLAB_HWCACHE_ALIGN | SLAB_PANIC,
            NULL);
    /// Initialize and start the timer.
    peer_periodic_timer.expires = jiffies
        + net_random() % inet_peer_gc_maxtime
        + inet_peer_gc_maxtime;
    add_timer(&peer_periodic_timer);
}
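
The three memory checks in inet_initpeers are cumulative, which is easy to miss when reading them. The following user-space sketch repeats the same arithmetic in a pure function so that the effect can be tried with different memory sizes. The function name scale_peer_threshold and the starting value used in the note below are assumptions for illustration, not kernel definitions.

```c
#include <assert.h>

/* Mirrors the three checks in inet_initpeers. totalram_pages is the amount
 * of RAM in pages and page_size is the page size in bytes. */
unsigned int scale_peer_threshold(unsigned int threshold,
                                  unsigned long totalram_pages,
                                  unsigned long page_size)
{
    assert(page_size > 0);
    if (totalram_pages <= (32768UL * 1024) / page_size)
        threshold >>= 1;        /* at most 32 MB: halve */
    if (totalram_pages <= (16384UL * 1024) / page_size)
        threshold >>= 1;        /* at most 16 MB: halve again */
    if (totalram_pages <= (8192UL * 1024) / page_size)
        threshold >>= 2;        /* at most 8 MB: divide by four more */
    return threshold;
}
```

With 4096-byte pages and an assumed starting value of 65536, a 16 MB machine (4096 pages) passes the first two checks and ends up with 16384, while an 8 MB machine passes all three and ends up with 4096.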

The core function of the peer subsystem is inet_getpeer, the interface provided to other subsystems. It wraps the lookup function, which is just a plain AVL tree search.

inet_getpeer takes a key (the IP address) and a create flag. When the lookup fails and create is set, it creates a new tree node and initializes the node's IP packet ID with secure_ip_id.

First, let's look at its call graph, and then analyze the entire function:

[Figure: call graph of inet_getpeer]

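The one thing to note here is that the function checks whether the peer exists twice. The first check runs under the read lock and the second runs after the write lock has been taken. Between releasing the read lock and acquiring the write lock, another context may have added the same peer, so the second check is required.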

C code
struct inet_peer *inet_getpeer(__be32 daddr, int create)
{
    struct inet_peer *p, *n;
    struct inet_peer **stack[PEER_MAXDEPTH], ***stackptr;

    /// Check whether the peer node already exists.
    read_lock_bh(&peer_pool_lock);
    p = lookup(daddr, NULL);
    /// If it does, take a reference.
    if (p != peer_avl_empty)
        atomic_inc(&p->refcnt);
    read_unlock_bh(&peer_pool_lock);

    if (p != peer_avl_empty) {
        /// If the node is on the unused list, remove it from that list and return it.
        unlink_from_unused(p);
        return p;
    }

    /// If create is 0, return NULL.
    if (!create)
        return NULL;

    /// Start creating a new peer node.
    n = kmem_cache_alloc(peer_cachep, GFP_ATOMIC);
    if (n == NULL)
        return NULL;
    n->v4daddr = daddr;
    atomic_set(&n->refcnt, 1);
    atomic_set(&n->rid, 0);
    /// Pick the initial packet ID.
    n->ip_id_count = secure_ip_id(daddr);
    n->tcp_ts_stamp = 0;

    write_lock_bh(&peer_pool_lock);
    /* Check if an entry has suddenly appeared. */
    p = lookup(daddr, stack);
    if (p != peer_avl_empty)
        goto out_free;

    /// Add the node to the AVL tree.
    link_to_pool(n);
    /// Initialize its unused list head.
    INIT_LIST_HEAD(&n->unused);
    peer_total++;
    write_unlock_bh(&peer_pool_lock);
    /// If the pool has reached its limit, remove the entry at the head of the unused list (an LRU policy). cleanup_once is analyzed later.
    if (peer_total >= inet_peer_threshold)
        /* Remove one less-recently-used entry. */
        cleanup_once(0);

    return n;

out_free:
    ........................................
}
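
The double check can be illustrated outside the kernel. The sketch below is single-threaded and uses no real locks: comments mark where the read and write locks would be held, and a hypothetical between_locks callback stands in for another CPU that inserts the same address in the unlocked window. All names (pool, pool_get, racing_insert) and the linear search are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define POOL_MAX 8                      /* hypothetical fixed capacity */

struct peer { uint32_t addr; int refcnt; };
struct pool { struct peer slots[POOL_MAX]; size_t count; };

/* Linear search stands in for the AVL lookup. */
struct peer *pool_lookup(struct pool *pl, uint32_t addr)
{
    for (size_t i = 0; i < pl->count; i++)
        if (pl->slots[i].addr == addr)
            return &pl->slots[i];
    return NULL;
}

/* Unconditional insert; the caller must already have ruled out duplicates. */
struct peer *pool_insert(struct pool *pl, uint32_t addr)
{
    assert(pl->count < POOL_MAX);
    struct peer *p = &pl->slots[pl->count++];
    p->addr = addr;
    p->refcnt = 1;
    return p;
}

struct peer *pool_get(struct pool *pl, uint32_t addr, int create,
                      void (*between_locks)(struct pool *, uint32_t))
{
    /* (read lock would be held from here) */
    struct peer *p = pool_lookup(pl, addr);
    if (p != NULL)
        p->refcnt++;
    /* (read lock released) */
    if (p != NULL || !create)
        return p;

    /* Unlocked window: another context may insert addr right now. */
    if (between_locks != NULL)
        between_locks(pl, addr);

    /* (write lock would be held from here) */
    p = pool_lookup(pl, addr);          /* second check */
    if (p != NULL)
        p->refcnt++;                    /* somebody else won the race */
    else
        p = pool_insert(pl, addr);
    /* (write lock released) */
    return p;
}

/* Hypothetical racing writer used to exercise the second check. */
void racing_insert(struct pool *pl, uint32_t addr)
{
    pool_insert(pl, addr);
}
```

Calling pool_get(&pl, addr, 1, racing_insert) on an empty pool leaves exactly one entry with a reference count of 2. Without the second lookup the pool would end up with two entries for the same address.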

Next, let's look at the cleanup_once function. It is called not only by inet_getpeer but also by the handler of peer_periodic_timer:

C code
static int cleanup_once(unsigned long ttl)
{
    struct inet_peer *p = NULL;

    /* Remove the first entry from the list of unused nodes. */
    spin_lock_bh(&inet_peer_unused_lock);
    if (!list_empty(&unused_peers)) {
        __u32 delta;

        p = list_first_entry(&unused_peers, struct inet_peer, unused);

        /// Time elapsed since the peer was last used, that is, since it was put on the unused list.
        delta = (__u32)jiffies - p->dtime;

        /// If the elapsed time is smaller than ttl, do nothing and return (ttl is how long an entry must wait on the unused list before it may be deleted). inet_getpeer above passes 0, so the first peer is deleted right away.
        if (delta < ttl) {
            /* Do not prune fresh entries. */
            spin_unlock_bh(&inet_peer_unused_lock);
            return -1;
        }

        list_del_init(&p->unused);
        /// Take an extra reference (+1) so that the node cannot disappear before unlink_from_pool is called.
        atomic_inc(&p->refcnt);
    }
    spin_unlock_bh(&inet_peer_unused_lock);

    if (p == NULL)
        /* It means that the total number of USED entries has
         * grown over inet_peer_threshold. It shouldn't really
         * happen because of entry limits in route cache. */
        return -1;
    /// unlink_from_pool checks the reference count of p. If it is 1, the node is removed from the AVL tree and then freed. If it is not 1, the node is in use again, so it goes back through the unused-list handling (note that it is not removed from the AVL tree).
    unlink_from_pool(p);
    return 0;
}
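
The age test in cleanup_once relies on unsigned 32-bit subtraction, which stays correct even after the tick counter wraps around. The sketch below isolates that arithmetic in user space; ticks_since and entry_expired are hypothetical helper names, not kernel functions.

```c
#include <stdint.h>

/* Ticks elapsed since dtime. Unsigned subtraction is performed modulo
 * 2^32, so the result is still right after the counter has wrapped. */
uint32_t ticks_since(uint32_t now, uint32_t dtime)
{
    return now - dtime;
}

/* Returns 1 when an unused entry is old enough to reclaim, that is, when
 * its age is at least ttl. A ttl of 0 therefore always reclaims. */
int entry_expired(uint32_t now, uint32_t dtime, uint32_t ttl)
{
    return ticks_since(now, dtime) >= ttl;
}
```

For example, an entry stamped at 0xFFFFFFFB and checked at tick 5 has an age of 10 ticks, even though the raw counter value went down.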

 

Next let's look at the timer processing function:

C code
static void peer_check_expire(unsigned long dummy)
{
    unsigned long now = jiffies;
    int ttl;

    /// If the pool has reached its limit, use the minimum TTL.
    if (peer_total >= inet_peer_threshold)
        ttl = inet_peer_minttl;
    else
        /// Otherwise the TTL is scaled according to the current number of peers, peer_total.
        ttl = inet_peer_maxttl
            - (inet_peer_maxttl - inet_peer_minttl) / HZ *
                peer_total / inet_peer_threshold * HZ;
    while (!cleanup_once(ttl)) {
        if (jiffies != now)
            break;
    }

    /// Note that the timer interval is also adjusted according to the current value of peer_total.
    if (peer_total >= inet_peer_threshold)
        peer_periodic_timer.expires = jiffies + inet_peer_gc_mintime;
    else
        peer_periodic_timer.expires = jiffies
            + inet_peer_gc_maxtime
            - (inet_peer_gc_maxtime - inet_peer_gc_mintime) / HZ *
                peer_total / inet_peer_threshold * HZ;
    add_timer(&peer_periodic_timer);
}
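
The TTL expression in peer_check_expire is a linear interpolation done in integer arithmetic. As a rough sketch, the same formula can be written as a pure function and evaluated by hand. The name gc_ttl and the sample numbers in the note below are assumptions chosen for illustration.

```c
#include <assert.h>

/* Same integer arithmetic as the TTL computation in peer_check_expire:
 * the TTL falls from max_ttl (empty pool) towards min_ttl (pool at its
 * threshold). All times are in ticks; hz is the number of ticks per
 * second. */
long gc_ttl(long total, long threshold, long min_ttl, long max_ttl, long hz)
{
    assert(threshold > 0 && hz > 0 && min_ttl <= max_ttl);
    if (total >= threshold)
        return min_ttl;
    return max_ttl - (max_ttl - min_ttl) / hz * total / threshold * hz;
}
```

Assuming hz = 100, min_ttl = 12000, max_ttl = 60000 and threshold = 1000, a half-full pool (total = 500) gives 60000 - 480 * 500 / 1000 * 100 = 36000 ticks. Because the range is divided by hz first and multiplied by hz last, the result moves in whole-second steps.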

 

Next, let's take a look at how the kernel implements the ID field of the IP header (that is, how the IP packet ID is selected).

The function that implements this in the kernel is __ip_select_ident. Callers usually go through ip_select_ident, a wrapper around __ip_select_ident. The wrapper only checks the DF bit first (mainly to work around a Win95 bug) and otherwise calls __ip_select_ident.

Let's look at the implementation:
C code
void __ip_select_ident(struct iphdr *iph, struct dst_entry *dst, int more)
{
    struct rtable *rt = (struct rtable *) dst;

    if (rt) {
        /// If no peer is bound yet, call rt_bind_peer to create one.
        if (rt->peer == NULL)
            rt_bind_peer(rt, 1);

        /* If peer is attached to destination, it is never detached,
           so that we need not to grab a lock to dereference it.
         */
        if (rt->peer) {
            /// Fetch the current ID of the peer (the ip_id_count field). Note that inet_getid also advances ip_id_count.
            iph->id = htons(inet_getid(rt->peer, more));
            return;
        }
    } else
        printk(KERN_DEBUG "rt_bind_peer(0) @%p\n",
               __builtin_return_address(0));

    /// If no peer is available, fall back to ip_select_fb_ident.
    ip_select_fb_ident(iph);
}

static void ip_select_fb_ident(struct iphdr *iph)
{
    static DEFINE_SPINLOCK(ip_fb_id_lock);
    static u32 ip_fallback_id;
    u32 salt;

    spin_lock_bh(&ip_fb_id_lock);
    /// No peer could be obtained, so bypass the peer subsystem and generate the ID directly.
    salt = secure_ip_id((__force __be32)ip_fallback_id ^ iph->daddr);
    iph->id = htons(salt & 0xFFFF);
    ip_fallback_id = salt;
    spin_unlock_bh(&ip_fb_id_lock);
}
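
The listing calls inet_getid but does not show it. The following user-space sketch shows one plausible shape of such a helper: return the current counter and advance it. The step of 1 + more is an assumption of this sketch, as are the names peer_ids, peer_getid and store_ip_id, and the locking a real implementation would need is omitted. The second helper shows what htons contributes when the ID is stored in the header.

```c
#include <arpa/inet.h>  /* htons */
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in holding only the ID counter of a peer. */
struct peer_ids { uint16_t ip_id_count; };

/* Return the ID for the current packet and advance the counter by
 * 1 + more (assumed step). The 16-bit counter wraps from 65535 to 0. */
uint16_t peer_getid(struct peer_ids *p, int more)
{
    uint16_t id = p->ip_id_count;
    p->ip_id_count = (uint16_t)(p->ip_id_count + 1 + more);
    return id;
}

/* Store the ID into a 2-byte header field in network byte order
 * (most significant byte first), regardless of host endianness. */
void store_ip_id(unsigned char field[2], uint16_t id)
{
    uint16_t wire = htons(id);
    memcpy(field, &wire, sizeof wire);
}
```

Storing 0x1234 this way always yields the bytes 0x12 0x34 on the wire, on little-endian and big-endian hosts alike.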

 

Now let's take a look at the IP-layer statistics.

The statistics are kept in the per-CPU variable ip_statistics. Many parts of the network subsystem keep statistics of this kind, and all of them are initialized in ipv4_mib_init_net.

C code
static __net_init int ipv4_mib_init_net(struct net *net)
{

    /// Statistics for TCP, IP and the other protocols are set up here. Every statistics variable is per-CPU.
    if (snmp_mib_init((void **)net->mib.tcp_statistics,
              sizeof(struct tcp_mib)) < 0)
        goto err_tcp_mib;
    if (snmp_mib_init((void **)net->mib.ip_statistics,
              sizeof(struct ipstats_mib)) < 0)
        goto err_ip_mib;
    if (snmp_mib_init((void **)net->mib.net_statistics,
              sizeof(struct linux_mib)) < 0)
        goto err_net_mib;
    if (snmp_mib_init((void **)net->mib.udp_statistics,
              sizeof(struct udp_mib)) < 0)
        goto err_udp_mib;
    if (snmp_mib_init((void **)net->mib.udplite_statistics,
              sizeof(struct udp_mib)) < 0)
        goto err_udplite_mib;
    if (snmp_mib_init((void **)net->mib.icmp_statistics,
              sizeof(struct icmp_mib)) < 0)
        goto err_icmp_mib;
    if (snmp_mib_init((void **)net->mib.icmpmsg_statistics,
              sizeof(struct icmpmsg_mib)) < 0)
        goto err_icmpmsg_mib;

    tcp_mib_init(net);
    return 0;
    ...........................
}

Each CPU's counters record the packets handled by the interrupts serviced on that CPU.

The kernel provides several macros for updating the statistics. Some of them are meant for interrupt (softirq) context, and others can also be used outside of it.

C code
#define IP_INC_STATS(net, field) SNMP_INC_STATS((net)->mib.ip_statistics, field)

// The two _BH variants are for use in softirq (bottom-half) context
#define IP_INC_STATS_BH(net, field) SNMP_INC_STATS_BH((net)->mib.ip_statistics, field)
#define IP_ADD_STATS_BH(net, field, val) SNMP_ADD_STATS_BH((net)->mib.ip_statistics, field, val)
 

Finally, let's take a look at the implementation of some IP configuration tools.

This is only a brief introduction; for the details, read the source code.

There are four ways to configure IP:

1 ioctl
This is mainly used by ifconfig. The corresponding kernel entry point is the do_ioctl function of the network device.

2 Netlink
This is mainly used by iproute2. For example, the RTMGRP_IPV4_IFADDR multicast group is used to notify user space of IPv4 address changes.

3 The /proc file system
That is, /proc/sys/net/ipv4.

4 RARP/BOOTP/DHCP
These obtain the IP configuration from a remote host.

The IP subsystem also provides the inetaddr_chain notification chain to notify other kernel subsystems (such as the routing subsystem and netfilter masquerading) of IP configuration changes.
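
The notification chain idea can be illustrated with a small user-space sketch: subscribers register a callback, and a configuration change walks the list and calls each one. This is a simplified model and not the kernel's notifier API. The names notifier, chain_register, chain_notify and count_event are hypothetical, and features such as priorities or stopping the chain early are left out.

```c
#include <stddef.h>

struct notifier {
    int (*call)(struct notifier *self, unsigned long event, void *data);
    struct notifier *next;
};

/* Add a subscriber at the head of the chain. */
void chain_register(struct notifier **head, struct notifier *n)
{
    n->next = *head;
    *head = n;
}

/* Invoke every subscriber with the event; returns how many were called. */
int chain_notify(struct notifier *head, unsigned long event, void *data)
{
    int called = 0;
    for (struct notifier *n = head; n != NULL; n = n->next) {
        n->call(n, event, data);
        called++;
    }
    return called;
}

/* Example subscriber: counts events in the int that data points to. */
int count_event(struct notifier *self, unsigned long event, void *data)
{
    (void)self;
    (void)event;
    ++*(int *)data;
    return 0;
}
```

With this design the code that changes an address does not need to know who is interested: the routing code or a masquerading module would each register their own callback and react to the same event independently.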
