Analysis of the Layer 2 (Link Layer) Packet Sending Process -- lvyilong316

Note: The kernel version involved in this series of blog posts is 2.6.32.
After a packet is prepared by the upper layers, it is handed to the link layer, where the dev_queue_xmit function processes it for transmission. A packet can be sent along two paths: the normal transmission path, through the NIC driver, and the soft-interrupt path, NET_TX_SOFTIRQ (see note 3). For ease of understanding, first take a look at the overall call-relationship diagram of the dev_queue_xmit function (not reproduced here).

dev_queue_xmit

This function queues an skb for transmission on a network device. Before calling it, the caller must set the skb's device and priority. The function can be called from interrupt context.

Return value:

A nonzero return value (positive or negative) means the function failed. A return value of 0 means the packet was accepted for transmission, but even then it may still be dropped later, for example by rate limiting or traffic shaping.

Whatever the outcome, the skb passed in is consumed by this call. So if you want to keep control of the packet, for example to retransmit it, you must take an extra reference on the skb before calling.

When this function is called, interrupts must be enabled, because the code that re-enables bottom halves requires IRQs to be enabled; otherwise a deadlock can occur.
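To make this contract concrete, here is a minimal kernel-style sketch of a caller that keeps its own reference across the call. It is illustrative only (the function send_and_keep is an invented example, not kernel API) and is meant for kernel context, not as a standalone program:

    #include <linux/skbuff.h>
    #include <linux/netdevice.h>

    /* Illustrative sketch: transmit an skb but keep our own reference,
     * since dev_queue_xmit() consumes the one passed in. */
    static int send_and_keep(struct sk_buff *skb, struct net_device *dev)
    {
        int rc;

        skb->dev = dev;           /* caller must set the device ... */
        skb->priority = 0;        /* ... and the priority beforehand */

        skb_get(skb);             /* take an extra reference */
        rc = dev_queue_xmit(skb); /* 0 = accepted (may still be dropped) */

        /* Our extra reference keeps the skb alive here, e.g. for a retry;
         * drop it once we are done with the packet. */
        kfree_skb(skb);
        return rc;
    }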

 
 
    int dev_queue_xmit(struct sk_buff *skb)
    {
        struct net_device *dev = skb->dev;
        struct netdev_queue *txq;
        struct Qdisc *q;
        int rc = -ENOMEM;

        /* GSO will handle the following emulations directly. */
        if (netif_needs_gso(dev, skb))
            goto gso;

        /* If the skb carries a frag_list but the device cannot handle one
         * (no NETIF_F_FRAGLIST), linearize it. */
        if (skb_has_frags(skb) &&
            !(dev->features & NETIF_F_FRAGLIST) &&
            __skb_linearize(skb))
            goto out_kfree_skb;

        /* If the skb has page fragments but the device does not support
         * scatter/gather, or a fragment lies in high memory that the device
         * cannot DMA from, merge everything into the linear buffer.
         * __skb_linearize() is essentially __pskb_pull_tail(skb,
         * skb->data_len), which works much like pskb_may_pull(): it checks
         * whether the linear buffer already holds len bytes and, if not,
         * reallocates the skb head and copies data in from the fragments.
         * With len set to skb->data_len all data ends up in the linear
         * buffer, so the skb becomes linear (and if it is already linear,
         * __skb_linearize returns immediately). */
        if (skb_shinfo(skb)->nr_frags &&
            (!(dev->features & NETIF_F_SG) || illegal_highdma(dev, skb)) &&
            __skb_linearize(skb))
            goto out_kfree_skb;

        /* If the checksum has not been computed yet and the device cannot
         * checksum this protocol, compute it here (see note 1). Note the
         * difference between frags and frag_list: the former puts extra data
         * in separately allocated pages of a single sk_buff, while the
         * latter chains several sk_buffs together. */
        if (skb->ip_summed == CHECKSUM_PARTIAL) {
            skb_set_transport_header(skb, skb->csum_start -
                                          skb_headroom(skb));
            if (!dev_can_checksum(dev, skb) && skb_checksum_help(skb))
                goto out_kfree_skb;
        }

    gso:
        /* Disable softirqs, which also disables CPU preemption. */
        rcu_read_lock_bh();

        /* Pick a transmit queue. If the driver provides a select_queue
         * callback, use it; otherwise the kernel chooses one. This is the
         * core's multiqueue support, but to benefit from it the NIC itself
         * must support multiple queues; most NICs have a single queue,
         * whose count is set when alloc_etherdev() allocates the
         * net_device. */
        txq = dev_pick_tx(dev, skb);

        /* Get the device's qdisc from the netdev_queue structure. */
        q = rcu_dereference(txq->qdisc);

        /* If the qdisc has an enqueue method, go through __dev_xmit_skb. */
        if (q->enqueue) {
            rc = __dev_xmit_skb(skb, q, dev, txq);
            goto out;
        }

        /* What follows handles devices without a transmit queue. Software
         * devices such as lo or tunnels usually have none; all we can do is
         * call the driver's hard_start_xmit directly, and if that fails the
         * packet is dropped, since there is no queue to hold it. */
        if (dev->flags & IFF_UP) {              /* is the device up? */
            int cpu = smp_processor_id();       /* ok because BHs are off */

            if (txq->xmit_lock_owner != cpu) {  /* lock not held by this cpu */
                HARD_TX_LOCK(dev, txq, cpu);

                if (!netif_tx_queue_stopped(txq)) {  /* queue is running */
                    rc = NET_XMIT_SUCCESS;
                    if (!dev_hard_start_xmit(skb, dev, txq)) {
                        HARD_TX_UNLOCK(dev, txq);
                        goto out;
                    }
                }
                HARD_TX_UNLOCK(dev, txq);
                if (net_ratelimit())
                    printk(KERN_CRIT "Virtual device %s asks to "
                           "queue packet!\n", dev->name);
            } else {
                /* txq->xmit_lock_owner == cpu: recursion */
                if (net_ratelimit())
                    printk(KERN_CRIT "Dead loop on virtual device "
                           "%s, fix it urgently!\n", dev->name);
            }
        }

        rc = -ENETDOWN;
        rcu_read_unlock_bh();

    out_kfree_skb:
        kfree_skb(skb);
        return rc;
    out:
        rcu_read_unlock_bh();
        return rc;
    }
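The linearization step above is easy to model in user space: data scattered across several buffers (analogous to skb fragments) is copied into one contiguous buffer (analogous to the skb's linear data area). A toy sketch, not kernel code:

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* A toy "fragment": a pointer plus a length, loosely analogous to one
     * entry of skb_shinfo(skb)->frags. */
    struct frag {
        const void *data;
        size_t len;
    };

    /* Copy all fragments into one contiguous buffer, as __skb_linearize()
     * conceptually does when the device lacks scatter/gather support. */
    static void *linearize(const struct frag *frags, int n, size_t *out_len)
    {
        size_t total = 0, off = 0;
        int i;
        char *buf;

        for (i = 0; i < n; i++)
            total += frags[i].len;
        buf = malloc(total);
        if (!buf)
            return NULL;
        for (i = 0; i < n; i++) {
            memcpy(buf + off, frags[i].data, frags[i].len);
            off += frags[i].len;
        }
        *out_len = total;
        return buf;
    }

    int main(void)
    {
        struct frag frags[] = {
            { "hdr|", 4 }, { "payload-a|", 10 }, { "payload-b", 9 },
        };
        size_t len;
        char *pkt = linearize(frags, 3, &len);

        if (!pkt)
            return 1;
        printf("%.*s (%zu bytes)\n", (int)len, pkt, len);
        free(pkt);
        return 0;
    }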

__dev_xmit_skb

The __dev_xmit_skb function mainly does two things:

(1) If the qdisc's queue is empty (and the qdisc allows bypassing), it tries to transmit the packet directly.

(2) Otherwise, it enqueues the packet on the qdisc and runs the qdisc.

 
 
    static inline int __dev_xmit_skb(struct sk_buff *skb, struct Qdisc *q,
                                     struct net_device *dev,
                                     struct netdev_queue *txq)
    {
        spinlock_t *root_lock = qdisc_lock(q);  /* see note 2 */
        int rc;

        spin_lock(root_lock);   /* lock the qdisc */
        if (unlikely(test_bit(__QDISC_STATE_DEACTIVATED, &q->state))) {
            /* the qdisc has been deactivated: drop the packet */
            kfree_skb(skb);
            rc = NET_XMIT_DROP;
        } else if ((q->flags & TCQ_F_CAN_BYPASS) && !qdisc_qlen(q) &&
                   !test_and_set_bit(__QDISC_STATE_RUNNING, &q->state)) {
            /*
             * This is a work-conserving queue; there are no old skbs
             * waiting to be sent out; and the qdisc is not running -
             * xmit the skb directly.
             */
            __qdisc_update_bstats(q, skb->len);
            if (sch_direct_xmit(skb, q, dev, txq, root_lock))
                __qdisc_run(q);
            else
                clear_bit(__QDISC_STATE_RUNNING, &q->state);

            rc = NET_XMIT_SUCCESS;
        } else {
            rc = qdisc_enqueue_root(skb, q);
            qdisc_run(q);
        }
        spin_unlock(root_lock);

        return rc;
    }
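The three-way branch above is easy to paraphrase in user space: drop if the qdisc is being torn down; transmit directly when bypassing is allowed, the queue is empty, and nobody else is running it; otherwise enqueue. A simplified single-threaded sketch of just the decision (all names are invented for illustration):

    #include <stdbool.h>
    #include <stdio.h>

    struct toy_qdisc {
        bool can_bypass;    /* like TCQ_F_CAN_BYPASS */
        bool running;       /* like __QDISC_STATE_RUNNING */
        int  qlen;          /* like qdisc_qlen() */
    };

    /* Mirrors the three-way branch in __dev_xmit_skb(). */
    static const char *xmit_decision(struct toy_qdisc *q, bool deactivated)
    {
        if (deactivated)
            return "drop";                      /* qdisc torn down */
        if (q->can_bypass && q->qlen == 0 && !q->running) {
            q->running = true;                  /* test_and_set_bit analogue */
            return "transmit directly";         /* sch_direct_xmit path */
        }
        return "enqueue, then qdisc_run";       /* normal queued path */
    }

    int main(void)
    {
        struct toy_qdisc q = { .can_bypass = true, .running = false, .qlen = 0 };

        printf("%s\n", xmit_decision(&q, false));   /* direct */
        q.qlen = 3;
        q.running = false;
        printf("%s\n", xmit_decision(&q, false));   /* enqueue */
        return 0;
    }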

qdisc_run

qdisc_run() is called from two places:

1. __dev_xmit_skb()

2. The transmit softirq handler, NET_TX_SOFTIRQ

 
 
    static inline void qdisc_run(struct Qdisc *q)
    {
        /* mark the qdisc as running; if it already was, do nothing */
        if (!test_and_set_bit(__QDISC_STATE_RUNNING, &q->state))
            __qdisc_run(q);
    }
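The effect of test_and_set_bit on __QDISC_STATE_RUNNING can be reproduced in user space with a C11 atomic flag: many threads may race to "run the qdisc", but only a thread that flips the flag from 0 to 1 enters the run loop. A sketch (compile with -pthread):

    #include <stdatomic.h>
    #include <stdio.h>
    #include <pthread.h>

    static atomic_flag running = ATOMIC_FLAG_INIT; /* __QDISC_STATE_RUNNING analogue */
    static atomic_int runs;

    static void *try_run(void *arg)
    {
        (void)arg;
        /* test-and-set: returns the previous value and sets the flag */
        if (!atomic_flag_test_and_set(&running)) {
            atomic_fetch_add(&runs, 1);     /* only a flag-winner gets here */
            /* ... the __qdisc_run() body would go here ... */
            atomic_flag_clear(&running);    /* clear_bit analogue */
        }
        return NULL;
    }

    int main(void)
    {
        pthread_t t[8];
        int i;

        for (i = 0; i < 8; i++)
            pthread_create(&t[i], NULL, try_run, NULL);
        for (i = 0; i < 8; i++)
            pthread_join(t[i], NULL);
        printf("qdisc ran %d time(s), never on two threads at once\n",
               atomic_load(&runs));
        return 0;
    }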

__qdisc_run

 
 
    void __qdisc_run(struct Qdisc *q)
    {
        unsigned long start_time = jiffies;

        /* A nonzero return from qdisc_restart() means the qdisc still has
         * packets queued. If the queue has been running for too long, stop
         * and put the qdisc on the per-cpu output_queue list instead. */
        while (qdisc_restart(q)) {
            /*
             * Postpone processing if
             * 1. another process needs the CPU;
             * 2. we've been doing it for too long.
             */
            if (need_resched() || jiffies != start_time) {
                /* this qdisc may not keep running now; add it to the
                 * output_queue list of the per-cpu softnet_data */
                __netif_schedule(q);
                break;
            }
        }

        /* clear the running flag */
        clear_bit(__QDISC_STATE_RUNNING, &q->state);
    }
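The jiffies check gives the qdisc a budget of one tick per invocation. The same pattern in user space, as a hedged sketch: drain a work queue, but once the time slice is used up, stop and defer the remainder, which is what __netif_schedule arranges via the softirq:

    #include <stdio.h>
    #include <time.h>

    static long now_ms(void)
    {
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec * 1000 + ts.tv_nsec / 1000000;
    }

    /* Drain a work queue, but stop after one "tick" (here: 1 ms) so a
     * single queue cannot monopolize the CPU -- the kernel would then call
     * __netif_schedule() and finish the rest in NET_TX_SOFTIRQ. */
    static int drain_with_budget(int *pending)
    {
        long start = now_ms();
        int deferred = 0;

        while (*pending > 0) {
            (*pending)--;                /* "transmit" one packet */
            if (now_ms() != start) {     /* jiffies != start_time analogue */
                deferred = 1;            /* would __netif_schedule() here */
                break;
            }
        }
        return deferred;
    }

    int main(void)
    {
        int pending = 10000000;

        while (drain_with_budget(&pending))
            ;   /* softirq analogue: keep resuming until the queue drains */
        printf("all packets sent, pending=%d\n", pending);
        return 0;
    }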

qdisc_restart is called in a loop to transmit packets. It is the function that actually sends them: it dequeues the next frame from the qdisc and tries to transmit it; if transmission fails, the frame is usually requeued.

Its return value is the remaining queue length on success, and 0 on failure (note that a successful send that leaves the queue empty also returns 0).

qdisc_restart

The __QDISC_STATE_RUNNING flag guarantees that only one CPU processes this qdisc at a time, while qdisc_lock(q) serializes access to the queue itself.

In general, netif_tx_lock serializes (grants exclusive) access to the device driver, and qdisc_lock(q) serializes access to the qdisc. The two are mutually exclusive: before taking one, the other must be released.

 
 
    static inline int qdisc_restart(struct Qdisc *q)
    {
        struct netdev_queue *txq;
        struct net_device *dev;
        spinlock_t *root_lock;
        struct sk_buff *skb;

        /* Dequeue packet */
        skb = dequeue_skb(q);   /* first call the qdisc's dequeue */
        if (unlikely(!skb))
            return 0;           /* 0: the queue is empty or throttled */

        root_lock = qdisc_lock(q);
        dev = qdisc_dev(q);
        txq = netdev_get_tx_queue(dev, skb_get_queue_mapping(skb));

        return sch_direct_xmit(skb, q, dev, txq, root_lock); /* transmit the packet */
    }

sch_direct_xmit

Transmits one skb. The caller must already have set the qdisc to the __QDISC_STATE_RUNNING state, which guarantees that only one CPU runs this function. Returns 0 if the queue is empty or transmission is throttled, and a value greater than 0 if the queue is not empty.

 
 
    int sch_direct_xmit(struct sk_buff *skb, struct Qdisc *q,
                        struct net_device *dev, struct netdev_queue *txq,
                        spinlock_t *root_lock)
    {
        int ret = NETDEV_TX_BUSY;

        /* Release the qdisc lock, since the driver lock is taken next. */
        spin_unlock(root_lock);

        /* HARD_TX_LOCK -> __netif_tx_lock -> spin_lock(&txq->_xmit_lock),
         * guaranteeing exclusive access to the device driver. */
        HARD_TX_LOCK(dev, txq, smp_processor_id());
        if (!netif_tx_queue_stopped(txq) &&     /* queue not stopped */
            !netif_tx_queue_frozen(txq))        /* and not frozen */
            ret = dev_hard_start_xmit(skb, dev, txq);  /* transmit the packet */
        HARD_TX_UNLOCK(dev, txq);               /* calls __netif_tx_unlock */

        spin_lock(root_lock);

        switch (ret) {
        case NETDEV_TX_OK:
            /* the driver sent the packet: return the remaining queue length */
            ret = qdisc_qlen(q);
            break;

        case NETDEV_TX_LOCKED:
            /* the driver lock could not be taken */
            ret = handle_dev_cpu_collision(skb, txq, q);
            break;

        default:
            /* driver busy: requeue the skb (sent later via the softirq) */
            if (unlikely(ret != NETDEV_TX_BUSY && net_ratelimit()))
                printk(KERN_WARNING "BUG %s code %d qlen %d\n",
                       dev->name, ret, q->q.qlen);

            ret = dev_requeue_skb(skb, q);
            break;
        }

        if (ret && (netif_tx_queue_stopped(txq) ||
                    netif_tx_queue_frozen(txq)))
            ret = 0;

        return ret;
    }

dev_hard_start_xmit

 
 
    int dev_hard_start_xmit(struct sk_buff *skb, struct net_device *dev,
                            struct netdev_queue *txq)
    {
        const struct net_device_ops *ops = dev->netdev_ops;
        int rc;

        if (likely(!skb->next)) {
            /* Every transmitted packet is also delivered to ptype_all.
             * Creating a packet socket with proto ETH_P_ALL registers a
             * handler on ptype_all, so such a socket receives both
             * transmitted and received packets. */
            if (!list_empty(&ptype_all))
                dev_queue_xmit_nit(skb, dev);

            if (netif_needs_gso(dev, skb)) {
                if (unlikely(dev_gso_segment(skb)))
                    goto out_kfree_skb;
                if (skb->next)
                    goto gso;
            }

            /* If the device does not need skb->dst, release it here. */
            if (dev->priv_flags & IFF_XMIT_DST_RELEASE)
                skb_dst_drop(skb);

            /* Call the transmit function the driver registered, i.e.
             * dev->netdev_ops->ndo_start_xmit(skb, dev). */
            rc = ops->ndo_start_xmit(skb, dev);
            if (rc == NETDEV_TX_OK)
                txq_trans_update(txq);
            return rc;
        }

    gso:
        ......
    }

dev_queue_xmit_nit

 
 
    static void dev_queue_xmit_nit(struct sk_buff *skb, struct net_device *dev)
    {
        struct packet_type *ptype;

    #ifdef CONFIG_NET_CLS_ACT
        if (!(skb->tstamp.tv64 && (G_TC_FROM(skb->tc_verd) & AT_INGRESS)))
            net_timestamp(skb); /* record when the packet left */
    #else
        net_timestamp(skb);
    #endif

        rcu_read_lock();
        list_for_each_entry_rcu(ptype, &ptype_all, list) {
            /* Never send packets back to the socket
             * they originated from */
            /* Walk the ptype_all list and deliver the packet to every raw
             * (packet) socket that matches. */
            if ((ptype->dev == dev || !ptype->dev) &&
                (ptype->af_packet_priv == NULL ||
                 (struct sock *)ptype->af_packet_priv != skb->sk)) {
                /* The packet is delivered to this raw socket in addition
                 * to the normal path, so clone it. */
                struct sk_buff *skb2 = skb_clone(skb, GFP_ATOMIC);
                if (!skb2)
                    break;

                /* skb->nh should be correctly (ensure the header offset is
                   correct) set by the sender, so that the second statement
                   is just protection against buggy protocols.
                 */
                skb_reset_mac_header(skb2);

                if (skb_network_header(skb2) < skb2->data ||
                    skb2->network_header > skb2->tail) {
                    if (net_ratelimit())  /* rate-limits printk in network code */
                        printk(KERN_CRIT "protocol %04x is "
                               "buggy, dev %s\n",
                               skb2->protocol, dev->name);
                    skb_reset_network_header(skb2); /* reset the L3 header offset */
                }

                skb2->transport_header = skb2->network_header;
                skb2->pkt_type = PACKET_OUTGOING;
                ptype->func(skb2, skb->dev, ptype, skb->dev); /* the (ptype_all) handler's receive function */
            }
        }
        rcu_read_unlock();
    }
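A practical consequence of dev_queue_xmit_nit: a packet socket opened with ETH_P_ALL also sees outgoing frames, tagged PACKET_OUTGOING. A minimal user-space sniffer that shows this (requires root, Linux only; error handling kept minimal):

    #include <stdio.h>
    #include <sys/socket.h>
    #include <linux/if_ether.h>
    #include <linux/if_packet.h>
    #include <arpa/inet.h>

    int main(void)
    {
        /* ETH_P_ALL registers the socket on ptype_all, so it receives both
         * incoming frames and the clones made by dev_queue_xmit_nit. */
        int fd = socket(AF_PACKET, SOCK_RAW, htons(ETH_P_ALL));
        if (fd < 0) {
            perror("socket (need root)");
            return 1;
        }

        for (;;) {
            char buf[2048];
            struct sockaddr_ll sll;
            socklen_t slen = sizeof(sll);
            ssize_t n = recvfrom(fd, buf, sizeof(buf), 0,
                                 (struct sockaddr *)&sll, &slen);
            if (n < 0)
                break;
            /* sll_pkttype reflects skb->pkt_type; outgoing clones are
             * tagged PACKET_OUTGOING by dev_queue_xmit_nit. */
            printf("%zd bytes, ifindex %d, %s\n", n, sll.sll_ifindex,
                   sll.sll_pkttype == PACKET_OUTGOING ? "outgoing"
                                                      : "incoming");
        }
        return 0;
    }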

Loopback device

For the loopback device, ops->ndo_start_xmit is initialized to the loopback_xmit function.

 
 
    static const struct net_device_ops loopback_ops = {
        .ndo_init       = loopback_dev_init,
        .ndo_start_xmit = loopback_xmit,
        .ndo_get_stats  = loopback_get_stats,
    };

drivers/net/loopback.c

 
 
    static netdev_tx_t loopback_xmit(struct sk_buff *skb,
                                     struct net_device *dev)
    {
        struct pcpu_lstats *pcpu_lstats, *lb_stats;
        int len;

        skb_orphan(skb);

        skb->protocol = eth_type_trans(skb, dev);

        /* it's OK to use per_cpu_ptr() because BHs are off */
        pcpu_lstats = dev->ml_priv;
        lb_stats = per_cpu_ptr(pcpu_lstats, smp_processor_id());

        len = skb->len;
        if (likely(netif_rx(skb) == NET_RX_SUCCESS)) { /* hand directly to the receive path */
            lb_stats->bytes += len;
            lb_stats->packets++;
        } else
            lb_stats->drops++;

        return NETDEV_TX_OK;
    }
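Since loopback_xmit hands the skb straight to netif_rx, a datagram sent to 127.0.0.1 simply climbs back up the receive path of the same host. A small self-contained demonstration (port 40000 is an arbitrary choice):

    #include <stdio.h>
    #include <unistd.h>
    #include <arpa/inet.h>
    #include <sys/socket.h>

    int main(void)
    {
        int fd = socket(AF_INET, SOCK_DGRAM, 0);
        struct sockaddr_in addr = { 0 };
        char buf[64];

        addr.sin_family = AF_INET;
        addr.sin_port = htons(40000);
        addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);  /* 127.0.0.1 */
        bind(fd, (struct sockaddr *)&addr, sizeof(addr));

        /* The datagram goes down through dev_queue_xmit to loopback_xmit,
         * which calls netif_rx; it then climbs back up the receive path
         * and lands in our own socket. */
        sendto(fd, "ping", 4, 0, (struct sockaddr *)&addr, sizeof(addr));

        ssize_t n = recv(fd, buf, sizeof(buf) - 1, 0);
        if (n > 0) {
            buf[n] = '\0';
            printf("received back over lo: %s\n", buf);
        }
        close(fd);
        return 0;
    }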


  • Notes:

1. CHECKSUM_PARTIAL means hardware checksumming is in use: the checksum of the L4 pseudo-header has already been computed and stored in the check field (e.g. uh->check for UDP), so the device only has to compute the checksum over the L4 header and payload.
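For reference, here is a hedged user-space sketch of the pseudo-header arithmetic for UDP over IPv4. It only illustrates the one's-complement sum; the kernel's own convention (csum_tcpudp_magic) stores the unfolded complement in uh->check, and the device then folds in the UDP header and payload. Addresses and lengths below are made-up example values:

    #include <stdint.h>
    #include <stdio.h>
    #include <arpa/inet.h>

    /* One's-complement sum over 16-bit words, as used by IP checksums. */
    static uint32_t csum_add(uint32_t sum, const void *data, size_t len)
    {
        const uint16_t *p = data;
        while (len > 1) {
            sum += *p++;
            len -= 2;
        }
        if (len)
            sum += *(const uint8_t *)p;
        return sum;
    }

    /* Fold the 32-bit accumulator into 16 bits and complement it. */
    static uint16_t csum_fold(uint32_t sum)
    {
        while (sum >> 16)
            sum = (sum & 0xffff) + (sum >> 16);
        return (uint16_t)~sum;
    }

    int main(void)
    {
        /* Pseudo-header fields for UDP over IPv4 (network byte order). */
        uint32_t saddr  = inet_addr("192.0.2.1");
        uint32_t daddr  = inet_addr("192.0.2.2");
        uint16_t proto  = htons(17);   /* IPPROTO_UDP */
        uint16_t udplen = htons(12);   /* UDP header + 4 bytes payload */

        uint32_t sum = 0;
        sum = csum_add(sum, &saddr, 4);
        sum = csum_add(sum, &daddr, 4);
        sum = csum_add(sum, &proto, 2);
        sum = csum_add(sum, &udplen, 2);

        /* Folded here just for display; the hardware completes the real
         * checksum by adding the UDP header and payload. */
        printf("folded pseudo-header checksum: 0x%04x\n", csum_fold(sum));
        return 0;
    }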


2. The entire transmit path involves three mutual-exclusion mechanisms:

(1) spinlock_t *root_lock = qdisc_lock(q);

(2) test_and_set_bit(__QDISC_STATE_RUNNING, &q->state)

(3) __netif_tx_lock, i.e. spin_lock(&txq->_xmit_lock)

(1) and (3) are ordinary spinlocks, while (2) is a bit in the queue state. To understand how these three synchronization mechanisms are used in the code, first look at the relationship between the relevant data structures, illustrated in the original post's diagram, whose green parts mark the two spinlocks (1) and (3). First, the code behind (1):

  
  
    static inline spinlock_t *qdisc_lock(struct Qdisc *qdisc)
    {
        return &qdisc->q.lock;
    }

Thus root_lock protects the skb queue inside the qdisc: it must be held when enqueueing, dequeueing, or requeueing skbs.

The __QDISC_STATE_RUNNING flag ensures that a traffic-control object (qdisc) is not run by multiple CPUs at the same time.

The spinlock at (3), _xmit_lock in struct netdev_queue, serializes access to the transmit function registered by the device, i.e. it synchronizes the driver.

Finally, as the kernel comments note, (1) and (3) are mutually exclusive: before taking the lock at (1), the lock at (3) must be released, and vice versa. Why? A plausible reason is concurrency: while one CPU holds the driver lock inside ndo_start_xmit, other CPUs can still take the qdisc lock to enqueue or dequeue packets, and never holding both locks at once also rules out lock-ordering deadlocks.
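This lock hand-off can be mimicked with two pthread mutexes: drop the queue lock, take the driver lock, "transmit", release it, then retake the queue lock; other threads could enqueue under qdisc_lock while the driver lock is held. A sketch of the pattern from sch_direct_xmit (single-threaded here just to show the sequence):

    #include <pthread.h>
    #include <stdio.h>

    static pthread_mutex_t qdisc_lock = PTHREAD_MUTEX_INITIALIZER; /* (1) */
    static pthread_mutex_t xmit_lock  = PTHREAD_MUTEX_INITIALIZER; /* (3) */
    static int queue_len = 3;

    /* Mirrors sch_direct_xmit(): never hold both locks across the driver call. */
    static void send_one(void)
    {
        pthread_mutex_lock(&qdisc_lock);
        if (queue_len == 0) {
            pthread_mutex_unlock(&qdisc_lock);
            return;
        }
        queue_len--;                        /* dequeue under the qdisc lock */
        pthread_mutex_unlock(&qdisc_lock);  /* release (1) ... */

        pthread_mutex_lock(&xmit_lock);     /* ... before taking (3) */
        /* ndo_start_xmit() would run here; meanwhile other threads may
         * take qdisc_lock and enqueue more packets */
        pthread_mutex_unlock(&xmit_lock);

        pthread_mutex_lock(&qdisc_lock);    /* retake (1) to report qlen */
        printf("sent one, qlen now %d\n", queue_len);
        pthread_mutex_unlock(&qdisc_lock);
    }

    int main(void)
    {
        while (queue_len > 0)
            send_one();
        return 0;
    }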

3. Given that the dev_queue_xmit function already exists, why is a transmit softirq needed at all?

As shown above, dev_queue_xmit performs the final processing of the skb (for example, merging fragments and computing the checksum), after which the skb is ready to transmit. dev_queue_xmit first enqueues the skb (this function is normally where the skb enters the qdisc) and calls qdisc_run to try to send it; the attempt may fail, in which case the skb is requeued, the transmit softirq is scheduled, and the function simply returns.

The softirq, by contrast, only transmits skbs already sitting in the qdisc (and frees skbs whose transmission has completed); it never needs to linearize or checksum them. Moreover, if the queue has been stopped, dev_queue_xmit can still enqueue packets but cannot send them, so when the queue is woken up a softirq is needed to send the backlog that accumulated while it was stopped. In short, dev_queue_xmit does the final processing of an skb and makes the first transmit attempt; the softirq sends whatever failed or never got sent. (The transmit softirq also helps release transmitted packets: in some drivers transmit completion is signalled by a hardware interrupt, and to keep that handler fast the kernel defers freeing the skb to the softirq. dev_kfree_skb_irq adds the skb to the completion_queue of softnet_data and raises the transmit softirq; net_tx_action then frees every skb on completion_queue.)
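The deferred-free mechanism in the parenthesis above can be modeled with a singly linked list that the "interrupt" side pushes onto atomically and the "softirq" side detaches and frees in one pass, loosely mirroring how dev_kfree_skb_irq and net_tx_action cooperate on softnet_data->completion_queue. A user-space sketch:

    #include <stdatomic.h>
    #include <stdio.h>
    #include <stdlib.h>

    struct buf {
        struct buf *next;
        int id;
    };

    /* completion_queue analogue: pushed from "hardirq", drained in "softirq" */
    static _Atomic(struct buf *) completion_queue;

    /* dev_kfree_skb_irq analogue: cheap enough for an interrupt handler --
     * just link the buffer in and (in the kernel) raise NET_TX_SOFTIRQ. */
    static void kfree_buf_irq(struct buf *b)
    {
        b->next = atomic_load(&completion_queue);
        while (!atomic_compare_exchange_weak(&completion_queue, &b->next, b))
            ;   /* b->next was refreshed to the current head; retry */
    }

    /* net_tx_action analogue: detach the whole list at once, then free it. */
    static void tx_action(void)
    {
        struct buf *b = atomic_exchange(&completion_queue, NULL);

        while (b) {
            struct buf *next = b->next;
            printf("freeing buffer %d in softirq context\n", b->id);
            free(b);
            b = next;
        }
    }

    int main(void)
    {
        for (int i = 0; i < 3; i++) {
            struct buf *b = malloc(sizeof(*b));
            b->id = i;
            kfree_buf_irq(b);   /* "transmit completed" in hardirq */
        }
        tx_action();            /* later, the softirq frees them */
        return 0;
    }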
