At present, many x86 firewall vendors claim line-rate forwarding of 64-byte packets, 94% ...... Haha, let's take a look at Kola's classic discussion of this (slightly modified in places):
I. Wire speed
Line-rate forwarding is an ideal requirement for a network transit device. However, most people only pay attention to a device's BPS (bits per second, the number of bits of data per second). Few realize that FPS (frames per second, the number of frames per second) is what actually tests a device's forwarding capability. Simply put, BPS is the number of bits passing through each second, and FPS is the number of packets passing through each second. For a 10 Mbps network, the BPS at line rate is 10 M, and the maximum FPS is 14880. So how is this 14880 calculated?
First, we need to know several rules:
1. The minimum frame size in Ethernet is 64 bytes, which includes a 6-byte destination MAC address, a 6-byte source MAC address, a 2-byte Ethernet type (or length), a 46-byte payload and a 4-byte CRC.
2. There must be a gap of at least 96 bits (12 bytes) between Ethernet frames (IFG, inter-frame gap) so that two packets can be told apart.
3. Each data frame must start with an 8-byte MAC preamble so that the sender and receiver can synchronize on the data bits.
Therefore, the smallest packet on Ethernet actually occupies 64 + 12 + 8 = 84 bytes = 672 bits. In a 10 M network environment, the maximum FPS is therefore 10 M bits per second / 672 bits per frame = 14880 frames per second. Similarly, with the largest frame (1518 bytes), the line-rate FPS in a 10 M network environment is 10 M bits per second / ((1518 + 12 + 8) x 8) bits per frame = 812 frames per second. In a 100 M network environment, these two values are 148809 and 8127 respectively.
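The frame-rate arithmetic above can be reproduced with a few lines of Python. This is only a sketch: the function and constant names are made up for illustration, and "10 M" and "100 M" are taken to mean 10,000,000 and 100,000,000 bits per second.

```python
# Per-frame overhead on the wire, from the rules listed above.
PREAMBLE_BYTES = 8   # preamble sent before every frame
IFG_BYTES = 12       # inter-frame gap (96 bits)


def max_fps(link_bps: int, frame_bytes: int) -> int:
    """Whole frames per second that fit on a link at line rate."""
    bits_on_wire = (frame_bytes + PREAMBLE_BYTES + IFG_BYTES) * 8
    return link_bps // bits_on_wire


if __name__ == "__main__":
    for link_bps in (10_000_000, 100_000_000):
        print(link_bps, max_fps(link_bps, 64), max_fps(link_bps, 1518))
```

Integer division is used because a partial frame does not count as a forwarded frame.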
II. Processing capability
We already know the maximum FPS at line rate. Now let's look at the processing capability required to reach it. Assume that a firewall on the market uses an x86 CII 900 MHz CPU, which provides 900 M clock cycles per second. In a 100 M network environment, the maximum number of clock cycles available to process one data frame is therefore: 900 M clock cycles per second / 148809 frames per second = 6048 clock cycles per frame. That is to say, to achieve line-rate forwarding, a 900 MHz CPU must finish processing a packet within 6048 clock cycles on average. This is only the ideal case. The CPU in an x86 architecture is also responsible for handling various interrupts (such as the system clock). When an interrupt arrives, the current running state must be saved and control switched to the interrupt handler; after the handler finishes, the state is restored and execution switches back to where it was. The switching alone takes at least 500 clock cycles, not counting the cycles used by the interrupt handler itself. Fortunately, this kind of interrupt is not "too frequent". After deducting the system overhead, the average processing time available for each packet is about 5500 clock cycles. Although Intel CPUs of the P3 class and above (such as the CII) were designed so that many common instructions (such as adding or subtracting two registers) take a single clock cycle, a firewall more often executes instructions that read and write memory (comparing the source address, calculating the IP checksum, and so on), and these take multiple clock cycles. As a result, only about 2000 instructions can be executed in 5500 clock cycles.
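The cycle budget can be checked the same way. The sketch below uses only the figures quoted in the text (900 MHz clock, 148809 frames per second, about 5500 usable cycles, about 2000 instructions); the function name and the derived cycles-per-instruction figure are illustrative, not measurements.

```python
def cycles_per_frame(cpu_hz: int, frames_per_second: int) -> int:
    """Whole clock cycles available per frame at a given frame rate."""
    return cpu_hz // frames_per_second


CPU_HZ = 900_000_000       # 900 MHz CPU from the text
LINE_RATE_FPS = 148_809    # 64-byte frames on a 100 M link

ideal_budget = cycles_per_frame(CPU_HZ, LINE_RATE_FPS)

# Figures quoted in the text after deducting system overhead.
usable_cycles = 5500
instructions_per_packet = 2000

# Average cycles per instruction implied by those two estimates.
implied_cpi = usable_cycles / instructions_per_packet

if __name__ == "__main__":
    print(ideal_budget, implied_cpi)
```

The implied average of 2.75 cycles per instruction reflects the text's point that memory-heavy code does not run at one instruction per cycle.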
To process a packet, the firewall must work through the following steps in order, starting from the memory allocated to the packet: check the Ethernet protocol type (for RFC 1042 format packets, the LLC header must also be skipped to find the real Ethernet protocol type), check the IP header and checksum (for TCP/UDP, also check the corresponding checksum), DoS defense, state inspection, NAT rules, security/statistics rules, update the ARP/route tables, and select the outgoing NIC. Processing is not finished until the packet is added to the send queue. Do you think 2000 x86 instructions can complete all this?
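To give a feel for just one of these steps, here is a sketch of the IP header checksum (the 16-bit ones' complement sum described in RFC 1071), written in Python for readability. It is not code from any firewall; the function name and the sample header bytes are illustrative. Note that every 16-bit word of the header has to be read from memory, which is exactly the kind of multi-cycle work described above.

```python
def internet_checksum(header: bytes) -> int:
    """16-bit ones' complement checksum over a header (RFC 1071 style)."""
    if len(header) % 2:
        header += b"\x00"          # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]   # one big-endian word
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)    # fold carries back in
    return ~total & 0xFFFF


# Illustrative 20-byte IPv4 header with the checksum field set to zero.
SAMPLE_HEADER = bytes.fromhex("450000730000400040110000c0a80001c0a800c7")

if __name__ == "__main__":
    print(hex(internet_checksum(SAMPLE_HEADER)))
```

A receiver verifies a header by running the same sum over the header with the checksum field filled in; a result of zero means the header is intact.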
III. Real data
2000 instructions may seem like a lot. In fact, it is not. For example, even the simplest statement, a = a + b, takes the following two instructions when optimized:
mov eax, [val_b]
add [val_a], eax
Unoptimized, it takes four:
mov eax, [val_a]
mov ebx, [val_b]
add eax, ebx
mov [val_a], eax
Most firewall development today is done on Unix/Linux. Take the GCC compiler as an example: its optimization is about 20% worse than that of commercial compilers such as VC/BC. That is to say, if a piece of C code compiles to 100 instructions with a commercial compiler, GCC compiles it to 120 instructions at best. In practice, even without any packet-filtering rules or other configuration, it takes about 14000 instructions to process one complete packet.
Therefore, according to the above calculation, many x86-architecture firewalls (PIII 800) achieve 42% of line-rate forwarding in a 100 M network environment, that is, a processing capability of about 62000 fps. As for 100%, 95%, and so on ......