Layer-3 Ethernet switches are now commonplace, and I have spent some time studying the problems related to CPU packet sending and receiving on these devices; I would like to share the results here in the hope that they are useful. Layer-3 Ethernet switches are developing rapidly: the bandwidth and switching capacity of network devices keep growing, and the range of protocols a device must support keeps expanding along with user demand.
How to ensure normal interaction of protocol packets between devices on a network carrying heavy traffic is therefore an important issue in Ethernet switch design. Taking an ASIC-based layer-3 Ethernet switch as an example, this article discusses and analyzes some typical problems related to CPU packet sending and receiving in a multi-process environment, covering CPU load, hardware/software queue configuration, and the communication mechanism between the CPU and the switching chip, and proposes solutions. The solutions also apply to switches built on network processors (NPs). In today's layer-3 Ethernet switches, layer-2 switching and layer-3 routing of packets are performed almost entirely by the switching chip or network processor; the CPU is essentially not involved in the switching and routing path and instead manages and controls the switching chip.
In this architecture, the CPU load comes mainly from three sources: protocol timers, user configuration, and external events. Of these, external events are the most random and unpredictable. Typical external events include port link changes (Up/Down), Media Access Control (MAC) address messages (learning, aging, and migration), packets received by the CPU through direct memory access (DMA), and packets sent by the CPU through DMA.
Of the external events listed above, the processing that follows packet reception through DMA is the most complex. When packets are passed from the lower layer to the upper-layer software, the processing performed by each protocol varies widely and may involve sending packets, operating on ports, batch table operations, and so on. Only when the issues related to CPU packet handling are dealt with properly can the relevant upper-layer protocols interact normally and the switch run stably and efficiently.
Possible problems
The following sections describe the aspects that CPU packet handling may involve. The analysis is based on a typical CPU packet sending and receiving mechanism: the CPU port is divided into queues, packets are received through DMA, and a ring queue is used.
CPU load and packet-receiving rhythm control
The number of packets that can be sent to the CPU per unit time is determined by the switch's packet-processing capability. Once that limit is fixed, the next question is how the packets are delivered. Suppose the maximum rate determined through evaluation is x packets per second; packets can then be delivered to the CPU in either of two ways.
(1) Delivery to the CPU at a constant rate
When packets are delivered to the CPU at a constant rate, the impact on the CPU queues is small, little buffering capacity is demanded of them, and the CPU queues have little work to do.
(2) Delivery to the CPU in burst mode
The hardware receive queue on the switching-chip (ASIC) side and the ring queue in the DMA memory space give a layer-3 Ethernet switch a certain amount of buffering capacity for packets destined to the CPU. With this buffering capacity we can lengthen the control period appropriately, set a control granularity (an upper limit on the number of packets delivered to the CPU per control period), and then dynamically enable and disable the CPU packet-receiving function using a mechanism similar to negative feedback in a circuit. In this way the rate of packets delivered to the CPU is controlled at a macro level. In addition, if the switching chip (ASIC) supports policing or shaping of the CPU port's outbound traffic based on the token-bucket algorithm [2-3], and the minimum policing or shaping threshold can meet the CPU rate-limit requirement, this function can be used to control the delivery rate of packets to the CPU and reduce the CPU load, which simplifies the software.
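As a concrete illustration, here is a minimal C sketch of the negative-feedback control described above: a counter tracks packets delivered to the CPU in the current control period, the receive DMA is switched off when the granularity limit is exceeded, and a periodic timer re-enables it. The functions rx_dma_enable()/rx_dma_disable() and the constants are hypothetical stand-ins for whatever the switch-chip SDK actually provides.

```c
#include <stdint.h>
#include <stdbool.h>

#define RX_LIMIT_PER_PERIOD  500   /* control granularity: max packets per control period */

static volatile uint32_t rx_count_this_period; /* incremented by the receive handler */
static bool              rx_dma_running = true;

/* Hypothetical hardware hooks provided by the switch-chip driver. */
void rx_dma_enable(void);
void rx_dma_disable(void);

/* Called from the packet-receive path for every packet handed to the CPU. */
void on_cpu_packet_received(void)
{
    if (++rx_count_this_period >= RX_LIMIT_PER_PERIOD && rx_dma_running) {
        rx_dma_disable();          /* too many packets this period: close the tap */
        rx_dma_running = false;
    }
}

/* Called by a periodic timer at the end of every control period. */
void on_control_period_expired(void)
{
    rx_count_this_period = 0;      /* start a fresh accounting window */
    if (!rx_dma_running) {
        rx_dma_enable();           /* re-open the tap: this is the negative feedback */
        rx_dma_running = true;
    }
}
```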
CPU port queue length planning
If only the buffering capacity of the CPU port is considered, the longer the CPU port queues the better. In practice, however, the impact on other functions and on performance must also be considered, and the trade-off has to be analyzed separately for each ASIC.
Zero copy
Zero copy means that throughout packet processing only pointers are passed as parameters and the packet itself is never copied, which greatly improves CPU efficiency. Zero copy does, however, reduce the flexibility of software processing to a certain extent. If the protocol stack needs to change the content of a packet, it can modify the receive buffer directly; but what happens when it needs to delete or add fields (for example, adding or removing a tag), that is, when the length of the packet has to change?
Adding or deleting fields inevitably moves either the head or the tail of the packet. If the tail is moved, the problem is simple as long as the total length does not exceed the buffer boundary. In practice, however, such operations usually happen near the header, so moving the head side is more efficient, and the protocol stack therefore tends to move the head during processing. In this case the driver must handle buffer allocation accordingly:
(1) When a packet is received, the header pointer must not point at the buffer boundary; it must be offset by a certain margin (headroom). Accordingly, the size of a single buffer must be the maximum transmission unit (MTU) plus this margin.
(2) When a packet buffer is released, its head pointer must be restored (normalized) to its original position. A sketch of such a buffer layout follows.
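The following C sketch shows one way to lay out such a buffer, assuming a simple buffer pool rather than any particular vendor API: headroom is reserved in front of the packet on receive, prepending a field moves the head pointer backwards, and release normalizes the head pointer.

```c
#include <stdint.h>
#include <stddef.h>

#define MTU       1518
#define HEADROOM  64                  /* margin reserved in front of the packet data */

typedef struct pkt_buf {
    uint8_t  mem[HEADROOM + MTU];     /* one buffer = MTU plus the margin */
    uint8_t *head;                    /* current start of packet data */
    size_t   len;
} pkt_buf_t;

/* On receive, the DMA engine is programmed to write the frame at
 * mem + HEADROOM, so the head pointer never sits on the buffer boundary. */
uint8_t *pkt_buf_rx_area(pkt_buf_t *b, size_t frame_len)
{
    b->head = b->mem + HEADROOM;
    b->len  = frame_len;
    return b->head;
}

/* Prepend a field (e.g. an extra tag) by moving the head pointer backwards;
 * no packet data is copied. */
uint8_t *pkt_buf_push(pkt_buf_t *b, size_t n)
{
    if ((size_t)(b->head - b->mem) < n)
        return NULL;                  /* not enough headroom left */
    b->head -= n;
    b->len  += n;
    return b->head;
}

/* On release, normalize the head pointer before the buffer goes back to the pool. */
void pkt_buf_release(pkt_buf_t *b)
{
    b->head = b->mem + HEADROOM;
    b->len  = 0;
}
```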
Interrupts and polling
Today's switching chips generate a variety of external interrupts for the CPU, covering DMA operations (such as packet reception, completion of packet transmission, and new-address messages) as well as error reports. If interrupt requests arrive too frequently, the frequent context switches between the interrupt service routine (ISR) and other processes consume a great deal of CPU time. If a large number of interrupt requests persists, the CPU stays busy servicing them, the protocols cannot get enough scheduling time, and serious faults such as protocol state-machine timeouts occur.
To avoid an uncontrollable event-trigger frequency, a polling mechanism can be used instead: a CPU timer periodically invokes the routine that was originally triggered by the external interrupt. Because the timer interval is fixed, the execution frequency of the routine is under control and the problems above are avoided.
Compared with interrupts, the only advantage of polling is its controllable pace (the pace of external interrupts depends on the frequency of external events and is beyond the CPU's control). Polling, however, has an unavoidable drawback: slow response, which cannot satisfy features with strict real-time requirements. For example, when large packets are pinged to a layer-3 interface of the switch, the latency of a layer-3 Ethernet switch working in polling mode is noticeably larger than that of one driven by interrupts. If some mechanism can prevent a sustained flood of interrupt requests, the CPU can be kept from being overwhelmed while the real-time advantage of interrupt processing is preserved. The events that typically generate large numbers of interrupts are packets delivered to the CPU and MAC address messages. Taking packet reception as an example, the burst method described in the section "CPU load and packet-receiving rhythm control" turns the receive DMA on and off according to real-time traffic, thereby controlling the interrupt source. This negative-feedback-like mechanism effectively prevents a continuous stream of interrupt events from reaching the CPU.
In short, polling is simple to control but has poor real-time behaviour; interrupts have good real-time behaviour but all the interrupt sources are hard to keep under control. In the initial system-design phase, both the feature requirements and the way the chip reports external events must be considered in deciding whether to use interrupts, polling, or a combination of the two.
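For illustration, here is a minimal sketch of timer-driven polling of the receive ring, with a per-poll budget so that each invocation stays within a short time slice; the descriptor layout and names are assumptions, not any specific chip's format.

```c
#include <stdint.h>

#define RING_SIZE    256
#define POLL_BUDGET   32     /* cap the work done per poll to bound the time slice */

struct rx_desc {
    volatile uint32_t owned_by_hw;   /* ownership flag toggled by the DMA engine */
    uint8_t          *data;
    uint16_t          len;
};

static struct rx_desc rx_ring[RING_SIZE];
static unsigned       rx_next;       /* next descriptor the software will examine */

void deliver_to_protocol_stack(uint8_t *data, uint16_t len);

/* Called from a periodic CPU timer instead of from the receive interrupt. */
void rx_poll_timer_handler(void)
{
    unsigned budget = POLL_BUDGET;

    while (budget-- && !rx_ring[rx_next].owned_by_hw) {
        struct rx_desc *d = &rx_ring[rx_next];

        deliver_to_protocol_stack(d->data, d->len);

        d->owned_by_hw = 1;                    /* hand the descriptor back to the hardware */
        rx_next = (rx_next + 1) % RING_SIZE;
    }
    /* Any remaining packets simply wait for the next timer tick, which is
     * exactly what gives polling its controllable but slower pace. */
}
```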
External event handling in a multi-process environment
Common external (interrupt) events include packet reception and packet transmission (CPU packet sending and receiving), MAC address messages, and MAC table operations. If the handling of all these interrupt events is placed in a single process, the coupling between the different events is artificially increased and so is the chance that they obstruct one another.
In a multi-task operating system, to handle the various events more flexibly and reduce the constraints between them, each type of event should be given its own process as far as possible, or the events should be divided among several processes according to how they are handled; at the very least, handling everything in a single process is not appropriate.
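A minimal sketch of this separation is shown below, assuming generic RTOS-style primitives (one task and one message queue per event class); queue_post()/queue_wait() stand in for whatever the operating system actually provides.

```c
#include <stdint.h>

enum event_type { EV_PKT_RX, EV_PKT_TX_DONE, EV_MAC_MSG, EV_MAC_TABLE_OP };

struct event {
    enum event_type type;
    void           *payload;
};

/* One queue (and one task) per event class, so a burst of MAC address
 * messages cannot starve protocol packet reception, and vice versa. */
extern void queue_post(int queue_id, const struct event *ev);
extern void queue_wait(int queue_id, struct event *ev);

enum { Q_PKT_RX, Q_MAC_MSG, Q_TABLE_OP };

/* Called from the ISR (or the polling handler): classify and dispatch only. */
void dispatch_external_event(const struct event *ev)
{
    switch (ev->type) {
    case EV_PKT_RX:       queue_post(Q_PKT_RX, ev);   break;
    case EV_MAC_MSG:      queue_post(Q_MAC_MSG, ev);  break;
    case EV_MAC_TABLE_OP: queue_post(Q_TABLE_OP, ev); break;
    default: /* EV_PKT_TX_DONE etc. handled elsewhere */ break;
    }
}

/* Each task blocks on its own queue and processes only its own event class. */
void pkt_rx_task(void)
{
    struct event ev;
    for (;;) {
        queue_wait(Q_PKT_RX, &ev);
        /* hand the packet to the protocol stack ... */
    }
}
```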
Protocol packet protection and CPU protection
For ASIC-based layer-3 Ethernet switches, protocol packet protection means using mechanisms of the ASIC to direct specific protocol packets to specific CPU port queues, so that their delivery to the CPU through the DMA queues is guaranteed a high priority. CPU protection means minimizing the impact of unnecessary packets on the CPU. The prerequisites for protocol packet protection are:
(1) The CPU port must support a scheduling algorithm such as strict priority (SP) or weighted round robin (WRR).
(2) The switching chip must have strong flow-classification capability and be able to direct different flows to different port queues.
In the system design, protocol packet protection and CPU protection must both be taken into account. As far as possible we should:
(1) Keep the CPU packet-receiving and packet-sending channels unobstructed.
(2) Match exactly and select only what is needed. Make full use of the access control list (ACL) function of the ASIC to match the various protocol packets precisely, down to layer-4 fields if necessary (see the sketch after this list). The constraints of other functions and of performance must also be taken into account when implementing the points above.
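As an illustration of point (2), the sketch below installs one exact-match ACL entry that steers OSPF packets (IP protocol 89, destination 224.0.0.5) into a high-priority CPU queue. The struct acl_rule and acl_install() are hypothetical stand-ins for the ASIC SDK's classifier API.

```c
#include <stdint.h>

struct acl_rule {
    uint8_t  ip_protocol;        /* layer-4 selector: IP protocol number */
    uint8_t  ip_protocol_mask;
    uint32_t dst_ip;             /* exact-match destination address */
    uint32_t dst_ip_mask;
    uint8_t  action_copy_to_cpu; /* 1 = deliver a copy to the CPU port */
    uint8_t  cpu_queue;          /* CPU port queue selected by the action */
};

extern int acl_install(const struct acl_rule *rule);

/* Example: OSPF (IP protocol 89) to 224.0.0.5, exactly matched and placed in a
 * high-priority CPU queue so bursts of data traffic cannot crowd it out. */
int protect_ospf_packets(void)
{
    struct acl_rule r = {
        .ip_protocol        = 89,
        .ip_protocol_mask   = 0xFF,
        .dst_ip             = 0xE0000005u,   /* 224.0.0.5 (AllSPFRouters) */
        .dst_ip_mask        = 0xFFFFFFFFu,
        .action_copy_to_cpu = 1,
        .cpu_queue          = 7,             /* highest queue under SP scheduling */
    };
    return acl_install(&r);
}
```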
Execution efficiency
In a multi-task operating system, every event must be handled in as short a time slice as possible so that other tasks get sufficient opportunity to be scheduled. The execution efficiency of any function called must therefore be considered. Besides the algorithm itself, frequent access to hardware is also time-consuming, a cost that is easily overlooked.
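One common way to avoid repeated slow hardware accesses is to shadow register contents in software, as in the sketch below; reg_read()/reg_write() and the register address are illustrative assumptions rather than a real driver interface.

```c
#include <stdint.h>

#define PORT_CTRL_REG  0x1000u          /* illustrative register offset */

extern uint32_t reg_read(uint32_t addr);             /* slow: goes out to the ASIC */
extern void     reg_write(uint32_t addr, uint32_t val);

static uint32_t port_ctrl_shadow;
static int      port_ctrl_valid;

/* Read through a software shadow so repeated calls do not touch the hardware again. */
uint32_t port_ctrl_get(void)
{
    if (!port_ctrl_valid) {
        port_ctrl_shadow = reg_read(PORT_CTRL_REG);
        port_ctrl_valid  = 1;
    }
    return port_ctrl_shadow;
}

/* Write through the shadow, skipping the hardware access when nothing changes. */
void port_ctrl_set_bits(uint32_t bits)
{
    uint32_t v = port_ctrl_get() | bits;
    if (v != port_ctrl_shadow)
        reg_write(PORT_CTRL_REG, v);
    port_ctrl_shadow = v;
}
```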
Conclusion
With the development of Ethernet technology, the processing capabilities of switching chips and network processors keep improving, whereas the CPU of a switching device remains far less capable than its switching chip or network processor. At the same time, the range of services supported by switching devices keeps growing, and so does the traffic the CPU must carry. The conflict between the growing capacity of switching devices, the rapid increase in service types, and the limited CPU resources will therefore become ever more prominent. Handling buffer management, queue scheduling, and traffic policing well at the interface between the CPU and the switching chip or network processor is thus a precondition for the secure and stable operation of switching equipment, and it remains an important topic for switch development now and in the future.