Analysis of the Layer 2 (Link Layer) Packet Sending Process -- lvyilong316

Note: The kernel version involved in this series of blog posts is 2.6.32.
After a packet is prepared by the upper layers, it is handed to the link layer, where the dev_queue_xmit function processes it for transmission. A packet can be sent along two paths: the normal transmission path, through the NIC driver, and the soft-interrupt path, NET_TX_SOFTIRQ (see note 3). For ease of understanding, first take a look at the overall call-relationship diagram of the dev_queue_xmit function (not reproduced here).

dev_queue_xmit

This function queues an skb for transmission on a network device. Before calling it, the caller must set the skb's device and priority. The function can be called from interrupt context.

Return value:

A nonzero return value (positive or negative) means the function failed. A return value of 0 means the packet was accepted for transmission, but even then it may still be dropped later, for example by rate limiting or traffic shaping.

Whatever the outcome, the skb passed in is consumed by this call. So if you want to keep control of the packet, for example to retransmit it, you must take an extra reference on the skb before calling.

When this function is called, interrupts must be enabled, because the code that re-enables bottom halves requires IRQs to be enabled; otherwise a deadlock can occur.
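To make this contract concrete, here is a minimal kernel-style sketch of a caller that keeps its own reference across the call. It is illustrative only (the function send_and_keep is an invented example, not kernel API) and is meant for kernel context, not as a standalone program:

    #include <linux/skbuff.h>
    #include <linux/netdevice.h>

    /* Illustrative sketch: transmit an skb but keep our own reference,
     * since dev_queue_xmit() consumes the one passed in. */
    static int send_and_keep(struct sk_buff *skb, struct net_device *dev)
    {
        int rc;

        skb->dev = dev;           /* caller must set the device ... */
        skb->priority = 0;        /* ... and the priority beforehand */

        skb_get(skb);             /* take an extra reference */
        rc = dev_queue_xmit(skb); /* 0 = accepted (may still be dropped) */

        /* Our extra reference keeps the skb alive here, e.g. for a retry;
         * drop it once we are done with the packet. */
        kfree_skb(skb);
        return rc;
    }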

 
 
    int dev_queue_xmit(struct sk_buff *skb)
    {
        struct net_device *dev = skb->dev;
        struct netdev_queue *txq;
        struct Qdisc *q;
        int rc = -ENOMEM;

        /* GSO will handle the following emulations directly. */
        if (netif_needs_gso(dev, skb))
            goto gso;

        /* If the skb carries a frag_list but the device cannot handle one
         * (no NETIF_F_FRAGLIST), linearize it. */
        if (skb_has_frags(skb) &&
            !(dev->features & NETIF_F_FRAGLIST) &&
            __skb_linearize(skb))
            goto out_kfree_skb;

        /* If the skb has page fragments but the device does not support
         * scatter/gather, or a fragment lies in high memory that the device
         * cannot DMA from, merge everything into the linear buffer.
         * __skb_linearize() is essentially __pskb_pull_tail(skb,
         * skb->data_len), which works much like pskb_may_pull(): it checks
         * whether the linear buffer already holds len bytes and, if not,
         * reallocates the skb head and copies data in from the fragments.
         * With len set to skb->data_len all data ends up in the linear
         * buffer, so the skb becomes linear (and if it is already linear,
         * __skb_linearize returns immediately). */
        if (skb_shinfo(skb)->nr_frags &&
            (!(dev->features & NETIF_F_SG) || illegal_highdma(dev, skb)) &&
            __skb_linearize(skb))
            goto out_kfree_skb;

        /* If the checksum has not been computed yet and the device cannot
         * checksum this protocol, compute it here (see note 1). Note the
         * difference between frags and frag_list: the former puts extra data
         * in separately allocated pages of a single sk_buff, while the
         * latter chains several sk_buffs together. */
        if (skb->ip_summed == CHECKSUM_PARTIAL) {
            skb_set_transport_header(skb, skb->csum_start -
                                          skb_headroom(skb));
            if (!dev_can_checksum(dev, skb) && skb_checksum_help(skb))
                goto out_kfree_skb;
        }

    gso:
        /* Disable softirqs, which also disables CPU preemption. */
        rcu_read_lock_bh();

        /* Pick a transmit queue. If the driver provides a select_queue
         * callback, use it; otherwise the kernel chooses one. This is the
         * core's multiqueue support, but to benefit from it the NIC itself
         * must support multiple queues; most NICs have a single queue,
         * whose count is set when alloc_etherdev() allocates the
         * net_device. */
        txq = dev_pick_tx(dev, skb);

        /* Get the device's qdisc from the netdev_queue structure. */
        q = rcu_dereference(txq->qdisc);

        /* If the qdisc has an enqueue method, go through __dev_xmit_skb. */
        if (q->enqueue) {
            rc = __dev_xmit_skb(skb, q, dev, txq);
            goto out;
        }

        /* What follows handles devices without a transmit queue. Software
         * devices such as lo or tunnels usually have none; all we can do is
         * call the driver's hard_start_xmit directly, and if that fails the
         * packet is dropped, since there is no queue to hold it. */
        if (dev->flags & IFF_UP) {              /* is the device up? */
            int cpu = smp_processor_id();       /* ok because BHs are off */

            if (txq->xmit_lock_owner != cpu) {  /* lock not held by this cpu */
                HARD_TX_LOCK(dev, txq, cpu);

                if (!netif_tx_queue_stopped(txq)) {  /* queue is running */
                    rc = NET_XMIT_SUCCESS;
                    if (!dev_hard_start_xmit(skb, dev, txq)) {
                        HARD_TX_UNLOCK(dev, txq);
                        goto out;
                    }
                }
                HARD_TX_UNLOCK(dev, txq);
                if (net_ratelimit())
                    printk(KERN_CRIT "Virtual device %s asks to "
                           "queue packet!\n", dev->name);
            } else {
                /* txq->xmit_lock_owner == cpu: recursion */
                if (net_ratelimit())
                    printk(KERN_CRIT "Dead loop on virtual device "
                           "%s, fix it urgently!\n", dev->name);
            }
        }

        rc = -ENETDOWN;
        rcu_read_unlock_bh();

    out_kfree_skb:
        kfree_skb(skb);
        return rc;
    out:
        rcu_read_unlock_bh();
        return rc;
    }
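The linearization step above is easy to model in user space: data scattered across several buffers (analogous to skb fragments) is copied into one contiguous buffer (analogous to the skb's linear data area). A toy sketch, not kernel code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* A toy "fragment": a pointer plus a length, loosely analogous to one
     * entry of skb_shinfo(skb)->frags. */
    struct frag {
        const void *data;
        size_t len;
    };

    /* Copy all fragments into one contiguous buffer, as __skb_linearize()
     * conceptually does when the device lacks scatter/gather support. */
    static void *linearize(const struct frag *frags, int n, size_t *out_len)
    {
        size_t total = 0, off = 0;
        int i;
        char *buf;

        for (i = 0; i < n; i++)
            total += frags[i].len;
        buf = malloc(total);
        if (!buf)
            return NULL;
        for (i = 0; i < n; i++) {
            memcpy(buf + off, frags[i].data, frags[i].len);
            off += frags[i].len;
        }
        *out_len = total;
        return buf;
    }

    int main(void)
    {
        struct frag frags[] = {
            { "hdr|", 4 }, { "payload-a|", 10 }, { "payload-b", 9 },
        };
        size_t len;
        char *pkt = linearize(frags, 3, &len);

        if (!pkt)
            return 1;
        printf("%.*s (%zu bytes)\n", (int)len, pkt, len);
        free(pkt);
        return 0;
    }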

__dev_xmit_skb

The __dev_xmit_skb function mainly does two things:

(1) If the qdisc's queue is empty (and the qdisc allows bypassing), it tries to transmit the packet directly.

(2) Otherwise, it enqueues the packet on the qdisc and runs the qdisc.

 
 
    static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                                     struct net_device *dev,
                                     struct netdev_queue *txq)
    {
        spinlock_t *root_lock = qdisc_lock(q);  /* see note 2 */
        int rc;

        spin_lock(root_lock);   /* lock the qdisc */
        if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
            /* the qdisc has been deactivated: drop the packet */
            kfree_skb(skb);
            rc = NET_XMIT_DROP;
        } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
                   !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state)) {
            /*
             * This is a work-conserving queue; there are no old skbs
             * waiting to be sent out; and the qdisc is not running -
             * xmit the skb directly.
             */
            __qdisc_update_bstats(q, skb->len);
            if (sch_direct_xmit(skb, q, dev, txq, root_lock))
                __qdisc_run(q);
            else
                clear_bit(__QDISC_STATE_RUNNING, &q->state);

            rc = NET_XMIT_SUCCESS;
        } else {
            rc = qdisc_enqueue_root(skb, q);
            qdisc_run(q);
        }
        spin_unlock(root_lock);

        return rc;
    }
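The three-way branch above is easy to paraphrase in user space: drop if the qdisc is being torn down; transmit directly when bypassing is allowed, the queue is empty, and nobody else is running it; otherwise enqueue. A simplified single-threaded sketch of just the decision (all names are invented for illustration):

    #include <stdbool.h>
    #include <stdio.h>

    struct toy_qdisc {
        bool can_bypass;    /* like TCQ_F_CAN_BYPASS */
        bool running;       /* like __QDISC_STATE_RUNNING */
        int  qlen;          /* like qdisc_qlen() */
    };

    /* Mirrors the three-way branch in __dev_xmit_skb(). */
    static const char *xmit_decision(struct toy_qdisc *q, bool deactivated)
    {
        if (deactivated)
            return "drop";                      /* qdisc torn down */
        if (q->can_bypass && q->qlen == 0 && !q->running) {
            q->running = true;                  /* test_and_set_bit analogue */
            return "transmit directly";         /* sch_direct_xmit path */
        }
        return "enqueue, then qdisc_run";       /* normal queued path */
    }

    int main(void)
    {
        struct toy_qdisc q = { .can_bypass = true, .running = false, .qlen = 0 };

        printf("%s\n", xmit_decision(&q, false));   /* direct */
        q.qlen = 3;
        q.running = false;
        printf("%s\n", xmit_decision(&q, false));   /* enqueue */
        return 0;
    }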

qdisc_run

qdisc_run() is called from two places:

1. __dev_xmit_skb()

2. The transmit softirq handler, NET_TX_SOFTIRQ

 
 
    static inline void qdisc_run(struct Qdisc *q)
    {
        /* mark the qdisc as running; if it already was, do nothing */
        if (!test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
            __qdisc_run(q);
    }
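The effect of test_and_set_bit on __QDISC_STATE_RUNNING can be reproduced in user space with a C11 atomic flag: many threads may race to "run the qdisc", but only a thread that flips the flag from 0 to 1 enters the run loop. A sketch (compile with -pthread):

    #include <stdatomic.h>
    #include <stdio.h>
    #include <pthread.h>

    static atomic_flag running = ATOMIC_FLAG_INIT; /* __QDISC_STATE_RUNNING analogue */
    static atomic_int runs;

    static void *try_run(void *arg)
    {
        (void)arg;
        /* test-and-set: returns the previous value and sets the flag */
        if (!atomic_flag_test_and_set(&running)) {
            atomic_fetch_add(&runs, 1);     /* only a flag-winner gets here */
            /* ... the __qdisc_run() body would go here ... */
            atomic_flag_clear(&running);    /* clear_bit analogue */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[8];
        int i;

        for (i = 0; i < 8; i++)
            pthread_create(&t[i], NULL, try_run, NULL);
        for (i = 0; i < 8; i++)
            pthread_join(t[i], NULL);
        printf("qdisc ran %d time(s), never on two threads at once\n",
               atomic_load(&runs));
        return 0;
    }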

__qdisc_run

 
 
    void __qdisc_run(struct Qdisc *q)
    {
        unsigned long start_time = jiffies;

        /* A nonzero return from qdisc_restart() means the qdisc still has
         * packets queued. If the queue has been running for too long, stop
         * and put the qdisc on the per-cpu output_queue list instead. */
        while (qdisc_restart(q)) {
            /*
             * Postpone processing if
             * 1. another process needs the CPU;
             * 2. we've been doing it for too long.
             */
            if (need_resched() || jiffies != start_time) {
                /* this qdisc may not keep running now; add it to the
                 * output_queue list of the per-cpu softnet_data */
                __netif_schedule(q);
                break;
            }
        }

        /* clear the running flag */
        clear_bit(__QDISC_STATE_RUNNING, &q->state);
    }
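The jiffies check gives the qdisc a budget of one tick per invocation. The same pattern in user space, as a hedged sketch: drain a work queue, but once the time slice is used up, stop and defer the remainder, which is what __netif_schedule arranges via the softirq:

    #include <stdio.h>
    #include <time.h>

    static long now_ms(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
    }

    /* Drain a work queue, but stop after one "tick" (here: 1 ms) so a
     * single queue cannot monopolize the CPU -- the kernel would then call
     * __netif_schedule() and finish the rest in NET_TX_SOFTIRQ. */
    static int drain_with_budget(int *pending)
    {
        long start = now_ms();
        int deferred = 0;

        while (*pending > 0) {
            (*pending)--;                /* "transmit" one packet */
            if (now_ms() != start) {     /* jiffies != start_time analogue */
                deferred = 1;            /* would __netif_schedule() here */
                break;
            }
        }
        return deferred;
    }

    int main(void)
    {
        int pending = 10000000;

        while (drain_with_budget(&pending))
            ;   /* softirq analogue: keep resuming until the queue drains */
        printf("all packets sent, pending=%d\n", pending);
        return 0;
    }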

qdisc_restart is called in a loop to transmit packets. It is the function that actually sends them: it dequeues the next frame from the qdisc and tries to transmit it; if transmission fails, the frame is usually requeued.

Its return value is the remaining queue length on success, and 0 on failure (note that a successful send that leaves the queue empty also returns 0).

qdisc_restart

The __QDISC_STATE_RUNNING flag guarantees that only one CPU processes this qdisc at a time, while qdisc_lock(q) serializes access to the queue itself.

In general, netif_tx_lock serializes (grants exclusive) access to the device driver, and qdisc_lock(q) serializes access to the qdisc. The two are mutually exclusive: before taking one, the other must be released.

 
 
    static inline int qdisc_restart(struct Qdisc *q)
    {
        struct netdev_queue *txq;
        struct net_device *dev;
        spinlock_t *root_lock;
        struct sk_buff *skb;

        /* Dequeue packet */
        skb = dequeue_skb(q);   /* first call the qdisc's dequeue */
        if (unlikely(!skb))
            return 0;           /* 0: the queue is empty or throttled */

        root_lock = qdisc_lock(q);
        dev = qdisc_dev(q);
        txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));

        return sch_direct_xmit(skb, q, dev, txq, root_lock); /* transmit the packet */
    }

sch_direct_xmit

Transmits one skb. The caller must already have set the qdisc to the __QDISC_STATE_RUNNING state, which guarantees that only one CPU runs this function. Returns 0 if the queue is empty or transmission is throttled, and a value greater than 0 if the queue is not empty.

 
 
    int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
                        struct net_device *dev, struct netdev_queue *txq,
                        spinlock_t *root_lock)
    {
        int ret = NETDEV_TX_BUSY;

        /* Release the qdisc lock, since the driver lock is taken next. */
        spin_unlock(root_lock);

        /* HARD_TX_LOCK -> __netif_tx_lock -> spin_lock(&txq->_xmit_lock),
         * guaranteeing exclusive access to the device driver. */
        HARD_TX_LOCK(dev, txq, smp_processor_id());
        if (!netif_tx_queue_stopped(txq) &&     /* queue not stopped */
            !netif_tx_queue_frozen(txq))        /* and not frozen */
            ret = dev_hard_start_xmit(skb, dev, txq);  /* transmit the packet */
        HARD_TX_UNLOCK(dev, txq);               /* calls __netif_tx_unlock */

        spin_lock(root_lock);

        switch (ret) {
        case NETDEV_TX_OK:
            /* the driver sent the packet: return the remaining queue length */
            ret = qdisc_qlen(q);
            break;

        case NETDEV_TX_LOCKED:
            /* the driver lock could not be taken */
            ret = handle_dev_cpu_collision(skb, txq, q);
            break;

        default:
            /* driver busy: requeue the skb (sent later via the softirq) */
            if (unlikely(ret != NETDEV_TX_BUSY && net_ratelimit()))
                printk(KERN_WARNING "BUG %s code %d qlen %d\n",
                       dev->name, ret, q->q.qlen);

            ret = dev_requeue_skb(skb, q);
            break;
        }

        if (ret && (netif_tx_queue_stopped(txq) ||
                    netif_tx_queue_frozen(txq)))
            ret = 0;

        return ret;
    }

dev_hard_start_xmit

 
 
    int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
                            struct netdev_queue *txq)
    {
        const struct net_device_ops *ops = dev->netdev_ops;
        int rc;

        if (likely(!skb->next)) {
            /* Every transmitted packet is also delivered to ptype_all.
             * Creating a packet socket with proto ETH_P_ALL registers a
             * handler on ptype_all, so such a socket receives both
             * transmitted and received packets. */
            if (!list_empty(&ptype_all))
                dev_queue_xmit_nit(skb, dev);

            if (netif_needs_gso(dev, skb)) {
                if (unlikely(dev_gso_segment(skb)))
                    goto out_kfree_skb;
                if (skb->next)
                    goto gso;
            }

            /* If the device does not need skb->dst, release it here. */
            if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
                skb_dst_drop(skb);

            /* Call the transmit function the driver registered, i.e.
             * dev->netdev_ops->ndo_start_xmit(skb, dev). */
            rc = ops->ndo_start_xmit(skb, dev);
            if (rc == NETDEV_TX_OK)
                txq_trans_update(txq);
            return rc;
        }

    gso:
        ......
    }

dev_queue_xmit_nit

 
 
    static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
    {
        struct packet_type *ptype;

    #ifdef CONFIG_NET_CLS_ACT
        if (!(skb->tstamp.tv64 && (G_TC_FROM(skb->tc_verd) & AT_INGRESS)))
            net_timestamp(skb); /* record when the packet left */
    #else
        net_timestamp(skb);
    #endif

        rcu_read_lock();
        list_for_each_entry_rcu(ptype, &ptype_all, list) {
            /* Never send packets back to the socket
             * they originated from */
            /* Walk the ptype_all list and deliver the packet to every raw
             * (packet) socket that matches. */
            if ((ptype->dev == dev || !ptype->dev) &&
                (ptype->af_packet_priv == NULL ||
                 (struct sock *)ptype->af_packet_priv != skb->sk)) {
                /* The packet is delivered to this raw socket in addition
                 * to the normal path, so clone it. */
                struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
                if (!skb2)
                    break;

                /* skb->nh should be correctly (ensure the header offset is
                   correct) set by the sender, so that the second statement
                   is just protection against buggy protocols.
                 */
                skb_reset_mac_header(skb2);

                if (skb_network_header(skb2) < skb2->data ||
                    skb2->network_header > skb2->tail) {
                    if (net_ratelimit())  /* rate-limits printk in network code */
                        printk(KERN_CRIT "protocol %04x is "
                               "buggy, dev %s\n",
                               skb2->protocol, dev->name);
                    skb_reset_network_header(skb2); /* reset the L3 header offset */
                }

                skb2->transport_header = skb2->network_header;
                skb2->pkt_type = PACKET_OUTGOING;
                ptype->func(skb2, skb->dev, ptype, skb->dev); /* the (ptype_all) handler's receive function */
            }
        }
        rcu_read_unlock();
    }
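A practical consequence of dev_queue_xmit_nit: a packet socket opened with ETH_P_ALL also sees outgoing frames, tagged PACKET_OUTGOING. A minimal user-space sniffer that shows this (requires root, Linux only; error handling kept minimal):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <arpa/inet.h>

    int main(void)
    {
        /* ETH_P_ALL registers the socket on ptype_all, so it receives both
         * incoming frames and the clones made by dev_queue_xmit_nit. */
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) {
            perror("socket (need root)");
            return 1;
        }

        for (;;) {
            char buf[2048];
            struct sockaddr_ll sll;
            socklen_t slen = sizeof(sll);
            ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
                                 (struct sockaddr *)&sll, &slen);
            if (n < 0)
                break;
            /* sll_pkttype reflects skb->pkt_type; outgoing clones are
             * tagged PACKET_OUTGOING by dev_queue_xmit_nit. */
            printf("%zd bytes, ifindex %d, %s\n", n, sll.sll_ifindex,
                   sll.sll_pkttype == PACKET_OUTGOING ? "outgoing"
                                                      : "incoming");
        }
        return 0;
    }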

Loopback device

For the loopback device, ops->ndo_start_xmit is initialized to the loopback_xmit function.

 
 
    static const struct net_device_ops loopback_ops = {
        .ndo_init       = loopback_dev_init,
        .ndo_start_xmit = loopback_xmit,
        .ndo_get_stats  = loopback_get_stats,
    };

drivers/net/loopback.c

 
 
    static netdev_tx_t loopback_xmit(struct sk_buff *skb,
                                     struct net_device *dev)
    {
        struct pcpu_lstats *pcpu_lstats, *lb_stats;
        int len;

        skb_orphan(skb);

        skb->protocol = eth_type_trans(skb, dev);

        /* it's OK to use per_cpu_ptr() because BHs are off */
        pcpu_lstats = dev->ml_priv;
        lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());

        len = skb->len;
        if (likely(netif_rx(skb) == NET_RX_SUCCESS)) { /* hand directly to the receive path */
            lb_stats->bytes += len;
            lb_stats->packets++;
        } else
            lb_stats->drops++;

        return NETDEV_TX_OK;
    }
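Since loopback_xmit hands the skb straight to netif_rx, a datagram sent to 127.0.0.1 simply climbs back up the receive path of the same host. A small self-contained demonstration (port 40000 is an arbitrary choice):

    #include <stdio.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr = { 0 };
        char buf[64];

        addr.sin_family = AF_INET;
        addr.sin_port = htons(40000);
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  /* 127.0.0.1 */
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        /* The datagram goes down through dev_queue_xmit to loopback_xmit,
         * which calls netif_rx; it then climbs back up the receive path
         * and lands in our own socket. */
        sendto(fd, "ping", 4, 0, (struct sockaddr *)&addr, sizeof(addr));

        ssize_t n = recv(fd, buf, sizeof(buf) - 1, 0);
        if (n > 0) {
            buf[n] = '\0';
            printf("received back over lo: %s\n", buf);
        }
        close(fd);
        return 0;
    }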


  • Notes:

1. CHECKSUM_PARTIAL means hardware checksumming is in use: the checksum of the L4 pseudo-header has already been computed and stored in the check field (e.g. uh->check for UDP), so the device only has to compute the checksum over the L4 header and payload.
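For reference, here is a hedged user-space sketch of the pseudo-header arithmetic for UDP over IPv4. It only illustrates the one's-complement sum; the kernel's own convention (csum_tcpudp_magic) stores the unfolded complement in uh->check, and the device then folds in the UDP header and payload. Addresses and lengths below are made-up example values:

    #include <stdint.h>
    #include <stdio.h>
    #include <arpa/inet.h>

    /* One's-complement sum over 16-bit words, as used by IP checksums. */
    static uint32_t csum_add(uint32_t sum, const void *data, size_t len)
    {
        const uint16_t *p = data;
        while (len > 1) {
            sum += *p++;
            len -= 2;
        }
        if (len)
            sum += *(const uint8_t *)p;
        return sum;
    }

    /* Fold the 32-bit accumulator into 16 bits and complement it. */
    static uint16_t csum_fold(uint32_t sum)
    {
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

    int main(void)
    {
        /* Pseudo-header fields for UDP over IPv4 (network byte order). */
        uint32_t saddr  = inet_addr("192.0.2.1");
        uint32_t daddr  = inet_addr("192.0.2.2");
        uint16_t proto  = htons(17);   /* IPPROTO_UDP */
        uint16_t udplen = htons(12);   /* UDP header + 4 bytes payload */

        uint32_t sum = 0;
        sum = csum_add(sum, &saddr, 4);
        sum = csum_add(sum, &daddr, 4);
        sum = csum_add(sum, &proto, 2);
        sum = csum_add(sum, &udplen, 2);

        /* Folded here just for display; the hardware completes the real
         * checksum by adding the UDP header and payload. */
        printf("folded pseudo-header checksum: 0x%04x\n", csum_fold(sum));
        return 0;
    }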


2. The entire transmit path involves three mutual-exclusion mechanisms:

(1) spinlock_t *root_lock = qdisc_lock(q);

(2) test_and_set_bit(__QDISC_STATE_RUNNING, &q->state)

(3) __netif_tx_lock, i.e. spin_lock(&txq->_xmit_lock)

(1) and (3) are ordinary spinlocks, while (2) is a bit in the queue state. To understand how these three synchronization mechanisms are used in the code, first look at the relationship between the relevant data structures, illustrated in the original post's diagram, whose green parts mark the two spinlocks (1) and (3). First, the code behind (1):

  
  
    static inline spinlock_t *qdisc_lock(struct Qdisc *qdisc)
    {
        return &qdisc->q.lock;
    }

Thus root_lock protects the skb queue inside the qdisc: it must be held when enqueueing, dequeueing, or requeueing skbs.

The __QDISC_STATE_RUNNING flag ensures that a traffic-control object (qdisc) is not run by multiple CPUs at the same time.

The spinlock at (3), _xmit_lock in struct netdev_queue, serializes access to the transmit function registered by the device, i.e. it synchronizes the driver.

Finally, as the kernel comments note, (1) and (3) are mutually exclusive: before taking the lock at (1), the lock at (3) must be released, and vice versa. Why? A plausible reason is concurrency: while one CPU holds the driver lock inside ndo_start_xmit, other CPUs can still take the qdisc lock to enqueue or dequeue packets, and never holding both locks at once also rules out lock-ordering deadlocks.
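This lock hand-off can be mimicked with two pthread mutexes: drop the queue lock, take the driver lock, "transmit", release it, then retake the queue lock; other threads could enqueue under qdisc_lock while the driver lock is held. A sketch of the pattern from sch_direct_xmit (single-threaded here just to show the sequence):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t qdisc_lock = PTHREAD_MUTEX_INITIALIZER; /* (1) */
    static pthread_mutex_t xmit_lock  = PTHREAD_MUTEX_INITIALIZER; /* (3) */
    static int queue_len = 3;

    /* Mirrors sch_direct_xmit(): never hold both locks across the driver call. */
    static void send_one(void)
    {
        pthread_mutex_lock(&qdisc_lock);
        if (queue_len == 0) {
            pthread_mutex_unlock(&qdisc_lock);
            return;
        }
        queue_len--;                        /* dequeue under the qdisc lock */
        pthread_mutex_unlock(&qdisc_lock);  /* release (1) ... */

        pthread_mutex_lock(&xmit_lock);     /* ... before taking (3) */
        /* ndo_start_xmit() would run here; meanwhile other threads may
         * take qdisc_lock and enqueue more packets */
        pthread_mutex_unlock(&xmit_lock);

        pthread_mutex_lock(&qdisc_lock);    /* retake (1) to report qlen */
        printf("sent one, qlen now %d\n", queue_len);
        pthread_mutex_unlock(&qdisc_lock);
    }

    int main(void)
    {
        while (queue_len > 0)
            send_one();
        return 0;
    }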

3. Given that the dev_queue_xmit function already exists, why is a transmit softirq needed at all?

As shown above, dev_queue_xmit performs the final processing of the skb (for example, merging fragments and computing the checksum), after which the skb is ready to transmit. dev_queue_xmit first enqueues the skb (this function is normally where the skb enters the qdisc) and calls qdisc_run to try to send it; the attempt may fail, in which case the skb is requeued, the transmit softirq is scheduled, and the function simply returns.

The softirq, by contrast, only transmits skbs already sitting in the qdisc (and frees skbs whose transmission has completed); it never needs to linearize or checksum them. Moreover, if the queue has been stopped, dev_queue_xmit can still enqueue packets but cannot send them, so when the queue is woken up a softirq is needed to send the backlog that accumulated while it was stopped. In short, dev_queue_xmit does the final processing of an skb and makes the first transmit attempt; the softirq sends whatever failed or never got sent. (The transmit softirq also helps release transmitted packets: in some drivers transmit completion is signalled by a hardware interrupt, and to keep that handler fast the kernel defers freeing the skb to the softirq. dev_kfree_skb_irq adds the skb to the completion_queue of softnet_data and raises the transmit softirq; net_tx_action then frees every skb on completion_queue.)
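The deferred-free mechanism in the parenthesis above can be modeled with a singly linked list that the "interrupt" side pushes onto atomically and the "softirq" side detaches and frees in one pass, loosely mirroring how dev_kfree_skb_irq and net_tx_action cooperate on softnet_data->completion_queue. A user-space sketch:

    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct buf {
        struct buf *next;
        int id;
    };

    /* completion_queue analogue: pushed from "hardirq", drained in "softirq" */
    static _Atomic(struct buf *) completion_queue;

    /* dev_kfree_skb_irq analogue: cheap enough for an interrupt handler --
     * just link the buffer in and (in the kernel) raise NET_TX_SOFTIRQ. */
    static void kfree_buf_irq(struct buf *b)
    {
        b->next = atomic_load(&completion_queue);
        while (!atomic_compare_exchange_weak(&completion_queue, &b->next, b))
            ;   /* b->next was refreshed to the current head; retry */
    }

    /* net_tx_action analogue: detach the whole list at once, then free it. */
    static void tx_action(void)
    {
        struct buf *b = atomic_exchange(&completion_queue, NULL);

        while (b) {
            struct buf *next = b->next;
            printf("freeing buffer %d in softirq context\n", b->id);
            free(b);
            b = next;
        }
    }

    int main(void)
    {
        for (int i = 0; i < 3; i++) {
            struct buf *b = malloc(sizeof(*b));
            b->id = i;
            kfree_buf_irq(b);   /* "transmit completed" in hardirq */
        }
        tx_action();            /* later, the softirq frees them */
        return 0;
    }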
