Test and optimize software-based network analysis tools

Summary
Many devices that provide monitoring, statistics, security, traffic analysis, and alerting services for network operations and management need direct, efficient, real-time access to the data flowing over the network. Software tools are often favored for their low cost and high compatibility, but they generally perform poorly in high-speed network environments. Although it is widely believed that real-time network analysis software is inherently limited by the network environment and therefore far from the desired performance, we show that its performance can still be improved with limited hardware support. This article analyzes the bottlenecks of the Winpcap library, which is widely used in network analysis software, and proposes a set of efficiency improvements intended to make software-based network analysis tools viable in high-speed networks.

1. Introduction
With the increase in bandwidth, networks have grown rapidly and become more complex. Implementing functions such as monitoring, troubleshooting, and network security requires highly professional equipment. Tools such as network analyzers, firewalls, and monitoring devices are often hardware-based. However, hardware solutions are expensive and difficult to distribute (hardware cannot be copied or easily relocated), and they are less flexible than software solutions.
So far, software-based solutions have generally been implemented as extensions of standard operating systems that provide raw, real-time network data to applications. The well-known libpcap [1] and Winpcap [3] libraries, which have been ported to many operating systems (OSes), are examples. These libraries export a set of primitives that allow applications to operate on network traffic without any other intermediary layer. Software components are also convenient and flexible to distribute: a simple packet capture component can provide low-level data to applications such as firewalls, NATs, sniffers, and network monitors. They are easy to upgrade and cost-effective, which is why much professional network monitoring and analysis software is built this way. However, performance is the critical weakness of software tools; when dealing with high-speed networks, hardware is the preferred choice. Even though current CPUs are very powerful, real-time traffic analysis at high bit rates cannot be achieved by software alone.
At present, research groups around the world (such as [10], [13]) are working in this area, and improving the overall performance of network analysis tools is a popular research topic. Current work focuses on specific traffic analysis components (such as packet filters) and improves those components in various ways. However, users care about overall traffic analysis performance, not about a single component, so the results of this research have not been widely felt by end users. This work instead weighs the relative cost of the various components of a network analysis tool. A set of optimizations for the Winpcap library is designed and applied in a test version, and the resulting improvement is measured quantitatively. The measurements show that optimizing a single component improves system performance by only a few percent, which does not change the end user's experience.
This article consists of the following parts. Section 2 provides an overview of related work. Section 3 describes the structure of Winpcap as a typical system extension component. Section 4 presents detailed performance measurements for each component (such as the packet capture components and the application-layer components). Section 5 describes several optimizations, weighs the relative cost of each component, and examines their impact on overall performance. Finally, Section 6 presents our conclusions.

2. Related Work
The CMU/Stanford Packet Filter [14] (CSPF) can be described as the veteran of packet filters. It was the first practical system to give applications access to the data link layer. It also introduced the concept of the virtual filter machine: a packet filter based on a virtual CPU (with registers) and a small, efficient instruction set. Filters are executable programs that run on this virtual machine.
McCanne and Van Jacobson are two other outstanding contributors to the field. They released the BSD Packet Filter (BPF) in 1993 [2]. It improves on CSPF by limiting the number of packet copies and by defining a new, more efficient, register-based virtual processor with a complete instruction set (containing basic commands such as load, store, compare, and jump). BSD operating systems still provide BPF as the default packet capture facility, and other operating systems have similar implementations. The BPF virtual processor is also the preferred underlying implementation of the libpcap library.
The Mach Packet Filter (MPF) [9], Pathfinder [12], DPF [11], and BPF+ [10] improve traffic analysis performance by improving the filter processor. Packet classification [15] is another traffic analysis component conceptually similar to packet filtering; its role in routing devices has attracted considerable attention.
On the other hand, very few studies target the other stages of packet capture, such as buffer copies. The NFR team improved the performance of a BPF version by using a larger buffer, and examined a shared buffer to avoid duplicate copies [13]. Winpcap [3] is an open-source Windows library; it improves on libpcap by implementing a more efficient, memory-resident kernel buffer system. However, previous work did not focus on the performance of the overall analysis process, which is the purpose of this article. Our results show why the existing optimizations can shave only a few percent off the execution time of the capture process.

3. Packet Capture Model
This section defines a typical structural model for the packet capture and traffic analysis components. In particular, it follows a reference packet from its reception on the NIC and its transfer to the workstation's main memory, through the full path of device driver, operating system, and intermediate components, up to the application. Here NPF (Netgroup Packet Filter) [3], which derives from BPF and is the kernel component of the Winpcap library, serves as the reference; as noted above, NPF shares its basic principles with the other filters. Figures 1 and 2 show the steps and components involved when Winpcap processes a received packet and delivers it to the application.
3.1 NIC and NIC Device Driver
A NIC's onboard memory usually amounts to only a few Kbytes. Independent of the host workstation's memory, NIC memory is used to receive and send packets at full line speed. In addition, the NIC performs some preliminary checks, such as the CRC checksum and the Ethernet minimum frame length; bad frames are discarded as soon as they are stored on the NIC.

When the NIC receives a valid packet, it requests control of the bus from the system bus controller in order to transfer the data. Once it obtains the bus, the NIC transfers the contents of its buffer to the workstation's main memory (see Figure 2). After the bus is released, the NIC raises a hardware interrupt through the interrupt controller chip (the Advanced Programmable Interrupt Controller, APIC). The APIC then wakes up the OS interrupt handler, which triggers the Interrupt Service Routine (ISR) of the NIC device driver.
The device driver's Interrupt Service Routine (ISR) is kept efficient and does only the essentials: for example, it detects and acknowledges the interrupts belonging to its own device, because many devices share the same interrupt line on x86 machines. The ISR then schedules a low-priority function (called a Deferred Procedure Call, or DPC) that performs the post-interrupt processing of the packet and notifies the upper-layer drivers (protocol drivers and packet capture drivers). After all pending interrupt requests have been serviced, the CPU runs the DPC. NIC interrupts are temporarily disabled while the NIC device driver processes packets, since only one packet can be handled at a time. Moreover, interrupt generation consumes a lot of system resources, so newer schemes allow more than one packet to be transferred per interrupt, enabling the upper-layer driver to process several packets each time it is activated. A sketch of this split follows.
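As a rough illustration of the ISR/DPC split (a WDM-style sketch, not NPF's or any vendor's actual driver code): the MY_ADAPTER structure and the hardware helper functions below are hypothetical placeholders, and a real NDIS miniport would go through the NDIS wrapper functions instead.

#include <ntddk.h>

typedef struct _MY_ADAPTER {
    KDPC RxDpc;                /* DPC object for deferred receive work */
    /* ... device registers, receive ring, etc. ... */
} MY_ADAPTER;

/* Hypothetical hardware helpers, for illustration only. */
BOOLEAN DeviceRaisedThisInterrupt(MY_ADAPTER *a);
VOID    AckAndMaskDeviceInterrupt(MY_ADAPTER *a);
BOOLEAN DeviceHasPendingPacket(MY_ADAPTER *a);
VOID    IndicatePacketToUpperLayers(MY_ADAPTER *a);
VOID    UnmaskDeviceInterrupt(MY_ADAPTER *a);

/* ISR: runs at device IRQL, so it does as little as possible. */
BOOLEAN MyIsr(PKINTERRUPT Interrupt, PVOID Context)
{
    MY_ADAPTER *adapter = (MY_ADAPTER *)Context;

    /* many devices can share one interrupt line on x86 */
    if (!DeviceRaisedThisInterrupt(adapter))
        return FALSE;

    AckAndMaskDeviceInterrupt(adapter);            /* NIC interrupts off */
    KeInsertQueueDpc(&adapter->RxDpc, NULL, NULL); /* defer the real work */
    return TRUE;
}

/* DPC: runs later at DISPATCH_LEVEL and drains every packet that has
 * arrived, so one interrupt can serve several packets. */
VOID MyRxDpc(PKDPC Dpc, PVOID Context, PVOID Arg1, PVOID Arg2)
{
    MY_ADAPTER *adapter = (MY_ADAPTER *)Context;

    while (DeviceHasPendingPacket(adapter))
        IndicatePacketToUpperLayers(adapter); /* protocol + capture drivers */

    UnmaskDeviceInterrupt(adapter);           /* NIC interrupts back on */
}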

3.2 Packet Capture Driver
Packet capture components sit alongside the protocol stack and are generally transparent to other software modules, so they do not affect the normal operation of the system. They simply insert a "hook" into the system so that a callback function, tap(), is invoked promptly whenever a packet arrives. On Win32 systems, packet capture components are generally implemented as network protocol drivers.
The first duty of tap() is to filter packets, that is, to determine whether they are of interest to the user. The NPF filter engine evolved from BPF: it is a virtual processor with a simple instruction set that can perform byte operations on a generic buffer (the packet). Winpcap (and libpcap) provides a user-level API that translates high-level filter expressions (for example, "capture all UDP packets") into a set of pseudo-instructions (for example, "if the Ethernet type field is IP and the IP protocol field equals 17, return true"), which are then handed to the filter engine and executed. This architecture applies the filter as early as possible, avoiding unnecessary copies of unmatched packets; bus resources are consumed anyway, however, because the packets have already been transferred to system memory.
When a packet matches the filter, some physical-layer information is recorded along with it, such as its length and a timestamp, which simplifies later processing by the application. The packet is then copied into a buffer, usually called the kernel buffer, where it waits to be transferred to user level (see Figure 2). For the capture process, the size and design of this buffer directly affect overall performance: a large, well-designed buffer system can compensate for a slow user-level application during traffic peaks, and it reduces the number of calls needed to move data from the capture driver (that is, the kernel buffer) to the application.
User-level applications obtain packets from the kernel buffer through what amounts to a system call. When NPF is ready, the system invokes the "hook" function read() (see Figure 1) to check the NPF kernel buffer: when the buffer is not empty, its contents are transferred to memory allocated by the user program (marked "user buffer" in Figure 1). Once the data has been copied to user level, the application is awakened and starts processing it.
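To make the user-level half of this path concrete, here is a minimal sketch using the public libpcap/WinPcap API: it compiles the high-level expression "udp" into the filter's pseudo-instructions, installs it in the capture driver, and then consumes packets through the callback path described above. The device name "eth0" and the parameter values are placeholders (a WinPcap device name looks different).

#include <pcap.h>
#include <stdio.h>

static void on_packet(u_char *user, const struct pcap_pkthdr *h,
                      const u_char *bytes)
{
    /* h->ts and h->caplen were filled in by the kernel-side capture driver */
    printf("got %u bytes (of %u on the wire)\n",
           (unsigned)h->caplen, (unsigned)h->len);
}

int main(void)
{
    char errbuf[PCAP_ERRBUF_SIZE];
    struct bpf_program prog;

    /* open the adapter: 65535-byte snapshot, promiscuous, 1 s timeout */
    pcap_t *p = pcap_open_live("eth0", 65535, 1, 1000, errbuf);
    if (p == NULL) { fprintf(stderr, "%s\n", errbuf); return 1; }

    /* "udp" is compiled into BPF pseudo-instructions (roughly: if the
     * Ethernet type is IP and the IP protocol field equals 17, accept) */
    if (pcap_compile(p, &prog, "udp", 1, 0) == -1 ||
        pcap_setfilter(p, &prog) == -1) {
        fprintf(stderr, "%s\n", pcap_geterr(p));
        return 1;
    }

    pcap_loop(p, -1, on_packet, NULL);  /* tap() ultimately feeds this */
    pcap_close(p);
    return 0;
}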

4. Performance Evaluation
In this section we present detailed measurements of the network analysis system, with the goal of determining the efficiency of the packet capture process. For each component described in Section 3, we compute the exact amount of resources it consumes.
Following common practice, we measure consumption in CPU clock cycles. This has the advantage of being objective and fair, because the measurement is independent of the absolute time and CPU speed of a particular system.
4.1 Test Platform
Figure 3 shows the test platform: two PCs connected directly by an Ethernet link. One PC acts as a traffic generator; the other is used for the actual measurements. The latter runs a modified version of Winpcap that contains a measurement extension. Specifically, this extension reads the performance monitoring counters of Pentium-class microprocessors [5] [6]. These processors contain built-in counters (their type and number depend on the processor model and version) that can track events such as the number of executed instructions, the number of interrupt requests, and the number of cache loads. For example, the CPU's CLK_UNHALTED counter counts the number of clock cycles during which the CPU is not halted within a given interval, and it can separately account for cycles spent in user mode and in kernel mode. A program can read this counter with the RDPMC instruction.
In addition, the extension uses a specific dynamic link library (DLL), and the kernel drivers use equivalent routines, to measure the CPU clock cycles consumed by a given piece of code. These libraries use the x86 RDTSC (Read Time-Stamp Counter) instruction to determine the exact number of CPU clock cycles a given code section consumes during execution.
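A minimal sketch of such a cycle measurement, assuming GCC-style inline assembly; the code section being timed is a placeholder.

#include <stdint.h>
#include <stdio.h>

/* read the CPU's time-stamp counter with the RDTSC instruction */
static inline uint64_t read_tsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

int main(void)
{
    uint64_t start = read_tsc();
    /* ... code under test, e.g. one run of the packet filter ... */
    uint64_t elapsed = read_tsc() - start;
    printf("code section took %llu CPU clock cycles\n",
           (unsigned long long)elapsed);
    return 0;
}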
Finally, the measurements use the Intel VTune Performance Analyzer [16] for statistical sampling: the CPU is frozen at precise intervals and its state is examined to determine which driver or function is currently executing. The sampling can run for a long time, covering the whole set of software modules involved in the capture process.
The two PCs are equipped with network adapters from different vendors. The 3Com 3C996 Gigabit Ethernet adapter has a good reputation; operating at 100 Mb/s, it consumes a considerable amount of host system resources. The detailed analysis is based on the Intel 82557 Ethernet adapter, whose device driver is the efficient sample source code provided with the Microsoft Driver Development Kit [7]. It is one of the few Windows NIC drivers with public code, which facilitates further study and modification.

The traffic generator installed on one PC can generate bursts of packets at a precise rate. The generated traffic is addressed to a host that does not exist on the network segment, so the packets do not disturb the protocol stacks of the two hosts in the test platform. Both PCs run Microsoft Windows XP. We test at various packet rates, up to the maximum rate of the Ethernet link (148809 frames per second with 64-byte frames), which is also the worst-case environment for packet monitoring and analysis tools.
The generated traffic pattern is deliberately simple (a constant frame rate), because our purpose is to test the software under a sustained high load. Real traffic is gentler than our test environment (for example, variable-rate bursts or Poisson-distributed arrivals).
4.2 Additional Factors
Many factors affect packet processing, such as the NIC driver and the OS, even though strictly speaking they are not part of the packet capture module. They interact closely with the capture process and are therefore very important.
4.2.1 Operating System
When the network adapter receives a packet, the operating system is the first component to react. The amount of resources the OS consumes is related to the packet rate, but it is mainly proportional to the number of interrupts served. Interrupts are generated by the NIC to notify the system that packets have arrived and are waiting to be processed. On our test platform, each interrupt costs about 2700 clock cycles. During the test, the 3Com NIC generates 2999 interrupts per second at a rate of 148 kfps, so the interrupt cost amounts to roughly 2700 x 2999 / 148809, that is, about 54 clock cycles per frame. This value varies with the network adapter and the traffic, and also depends on the efficiency of the interrupt handling machinery in the OS kernel. In our tests, three operating system functions consumed the most cycles: HalBeginSystemInterrupt() (which raises the current interrupt level and acknowledges the interrupt controller), KeDispatchInterrupt() (which runs the NIC driver's DPC routine), and KeInitializeInterrupt() (undocumented, but probably used to lower the current interrupt level and clear the flag in the interrupt controller).
4.2.2 NIC and Its Device Driver
The network adapter itself does its work without requiring the CPU to participate in data handling, but its behavior affects other components, especially the OS and the device driver. For example, the number of interrupt requests (which drives OS cost) and the number of I/O accesses to the NIC's registers (which drives device driver cost) have an important impact on performance.
Regarding the latter, the ISR function (the first device driver function invoked, once per interrupt) is very simple but surprisingly expensive (about 850 clock cycles), because it requires a pair of I/O operations on the NIC to signal that the driver is processing the packets.
Performance can be improved by increasing the number of packets fetched from the NIC buffer on each interrupt, which reduces both OS overhead (fewer interrupt requests) and driver cost (fewer I/O operations on the NIC). Our measurements show that the performance gain grows linearly with the number of packets fetched per interrupt. For example, at 148 kfps the 3Com NIC processes an average of 49.61 frames per interrupt (corresponding to 2999 interrupts per second), which means the maximum rate of the adapter has been reached. Consequently, the low-level components (interrupt handling, NIC driver) leave very limited room for improving the performance of the network analysis software as a whole.
According to our measurements, when the NIC driver is bound only to Winpcap and to no other protocols (at the maximum rate of 148809 packets per second), the Intel 82557 adapter requires 2260 clock cycles per packet and the 3Com 3C996 adapter requires 1497.
Finally, one additional factor is hard to quantify: packets are transferred from the NIC to system memory through the system bus controller. Although this consumes no CPU cycles, the bus is a shared system resource during the transfer; bus contention cannot be measured directly, and it can delay the CPU's own memory requests.
4.3 Packet Capture Program
This section analyzes the resource cost of all the capture program components, that is, the cost of moving a packet along the entire path from the NIC device driver (and OS) to the user application.
4.3.1 Filter Program
The filter deserves particular attention because it is the only component that processes every packet flowing through the NIC. Its cost depends not only on the efficiency of the filter engine, but also on the complexity of the filter itself, that is, the number of checks performed on each packet. Figure 4 shows the time consumed by three filters of increasing complexity: the simplest accepts only IP packets (three NPF virtual processor pseudo-instructions); the second checks TCP packets against five different port values (21 pseudo-instructions); the third, the most complex, checks packets against 10 IP addresses and 10 TCP ports (50 pseudo-instructions). An example of what such pseudo-instructions look like is sketched below.
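For illustration, this is roughly what the simplest of the three filters ("accept only IP packets") looks like as pseudo-instructions, written with the classic BPF macros from the pcap headers; the explicit reject instruction is included, so the listing shows four slots rather than three.

#include <pcap.h>   /* brings in struct bpf_insn and the BPF_* macros */

struct bpf_insn ip_only[] = {
    /* load the 16-bit Ethernet type field at offset 12 */
    BPF_STMT(BPF_LD  + BPF_H   + BPF_ABS, 12),
    /* if it equals 0x0800 (IPv4), fall through; otherwise jump to reject */
    BPF_JUMP(BPF_JMP + BPF_JEQ + BPF_K, 0x0800, 0, 1),
    /* accept: return a non-zero snapshot length */
    BPF_STMT(BPF_RET + BPF_K, 65535),
    /* reject: return 0, the packet is discarded */
    BPF_STMT(BPF_RET + BPF_K, 0),
};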
The cost of filtering does not depend on the NIC or on the system bus structure; it is pure CPU work, so the clock cycle remains an appropriate unit of measurement and evaluation.
The whole filter program is executed on the data stream and, as expected, the number of clock cycles consumed grows linearly with the number of filter instructions, as Figure 4 shows. A typical filter costs some hundreds of clock cycles per packet.

4.3.2 Memory Copies
As explained in Section 3, packets are copied twice before reaching the user program (Figure 1): the first copy moves them from the NIC buffer to the kernel buffer (Figure 2); the second moves them to the user application's buffer. Figure 5 shows the clock cycle cost of the two copies on our test platform as a function of packet size.
According to the NDIS specification, the first copy is performed by the NdisTransferData() function. The overhead of this function is very high, for two main reasons.
1. Extra work is required before the copy itself. Since the NIC driver may indicate a packet to upper layers before the whole packet has been transferred to memory, the capture driver must use this function, according to the DDK documentation [7]. The function therefore first checks whether the entire packet has already been transferred to memory by the NIC; if not, it waits until the transfer is complete.
2. The data the function operates on is not in the CPU cache, because packets are transferred from NIC onboard memory to main memory over the bus (see Section 3.1).
These two points explain the high cost of the first copy shown in Figure 5. There is a fixed base overhead in the processing, independent of the amount of data handled (which is why small packets cost more per byte); with large packets the fixed cost is spread over more bytes, so the per-byte cost is lower.
The second copy uses a standard C library function (memcpy()). Because most of the data to be copied to the user area is no longer in the CPU cache, the average per-byte cost increases with packet size, since the cache hit rate decreases. In addition, the per-byte cost also depends on the size of the kernel buffer: if the kernel buffer is small, some of the data will still be in the CPU cache when it is copied to the user area.
In summary, the first copy costs between 540 and 10500 clock cycles per packet, and the second between 259 and 8550. Considering that a 20-byte header (containing the timestamp, packet length, and other information) is prepended to each packet before it is stored in the kernel buffer, the total cost of the second copy is actually between 364 and 8664.
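A schematic of the second copy under these assumptions: the 20-byte header layout below is illustrative only, not WinPcap's actual on-buffer format.

#include <stdint.h>
#include <string.h>

struct cap_header {            /* assumed 20-byte per-packet header */
    uint32_t ts_sec;           /* timestamp, seconds */
    uint32_t ts_usec;          /* timestamp, microseconds */
    uint32_t caplen;           /* bytes actually captured */
    uint32_t len;              /* bytes on the wire */
    uint32_t pad;              /* padding up to 20 bytes */
};

/* Copy one header+packet record from the kernel buffer into the user
 * buffer; returns the number of bytes consumed. */
static size_t copy_record(uint8_t *dst, const uint8_t *kbuf)
{
    const struct cap_header *h = (const struct cap_header *)kbuf;
    size_t total = sizeof(*h) + h->caplen;   /* 20 bytes + packet data */
    memcpy(dst, kbuf, total);                /* the memcpy() measured above */
    return total;
}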

4.3.3 Application
All interaction between applications and the capture driver goes through system calls. Windows provides ReadFile(), WriteFile(), DeviceIoControl(), and other system calls for I/O operations. Each of them involves two "context switches" (the term is not precise here: the transition from user level to kernel level is a privilege-level switch rather than an execution context switch): first, control passes from the user level (the application) to the kernel level (the driver); then it returns to the user application.
As is well known, these switches are complicated (they typically involve an interrupt and the initialization of some OS data structures) and costly. On our test platform, a system call such as read() takes about 33500 clock cycles. With such a high overhead, copying only one packet per system call would be hopelessly inefficient, so the capture driver transfers a whole batch of packets on each call. The number of packets transferred per system call depends on the size of the kernel buffer and on the CPU load. It also depends on the complexity of the user-level application: if the application spends a long time processing each packet, it returns less often to fetch data, and the kernel buffer cannot be "drained" in time. It is affected by the capture driver as well, whose code executes at a higher priority, so the kernel buffer is often close to full.
Because the frequency of read operations and the number of packets delivered per system call vary, the exact switch cost per packet cannot be stated precisely. On a fully loaded machine (CPU usage at 100%), receiving minimum-size frames while the capture driver transfers 256 Kbytes per system call (the value chosen in Winpcap), each call carries about 3200 packets (including the 20-byte header the driver prepends). In this case the switch cost per packet is roughly 33500 / 3200, about 10 clock cycles, which is negligible compared with the cost of the other components discussed next.
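The following sketch shows why this batching amortizes the system call cost: one ReadFile() returns many header+packet records, so the kernel transition is paid once per buffer instead of once per packet. The record-parsing helper is hypothetical, and the buffer size is the 256-Kbyte value mentioned above.

#include <windows.h>

#define USER_BUF_SIZE (256 * 1024)

/* hypothetical: reads the 20-byte header, hands the packet to the
 * application, and returns the size of the whole record */
DWORD parse_record(const unsigned char *rec);

void drain(HANDLE capture_dev)
{
    static unsigned char buf[USER_BUF_SIZE];
    DWORD got;

    /* one privilege-level switch buys a whole batch of packets */
    while (ReadFile(capture_dev, buf, sizeof(buf), &got, NULL) && got > 0) {
        DWORD off = 0;
        while (off < got)                 /* walk the batched records */
            off += parse_record(buf + off);
    }
}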
4.3.4 Other Processing Components
One might assume that the capture driver spends most of its execution time filtering, transferring, and copying packets (which is why almost all the literature focuses there to improve performance). Our tests, however, reveal other significant contributors to the capture cost, the most notable of which is the timestamp.
The NPF driver obtains the packet timestamp through the KeQueryPerformanceCounter() kernel function, the only kernel function that provides microsecond precision. This function reads the system timer chip, so its overhead is huge: on our test platform it takes about 1800 clock cycles (measured on several uniprocessor machines). In other words, the function that returns a microsecond-precision value itself takes several microseconds to execute. This sounds absurd, but it is true, and it cannot be avoided: an accurate timestamp must be taken for every arriving packet.
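Sketched in kernel-driver style, the timestamp path looks roughly like this; KeQueryPerformanceCounter() is the documented kernel API, while the conversion to seconds and microseconds is our illustration of how such a timestamp is typically consumed.

#include <ntddk.h>

/* Illustrative only: split the performance counter into seconds and
 * microseconds, the way a capture driver timestamps a packet. */
VOID stamp_packet(LONG *sec, LONG *usec)
{
    LARGE_INTEGER freq;
    /* ~1800 cycles on our test platform: the call reaches down to the
     * system timer chip */
    LARGE_INTEGER now = KeQueryPerformanceCounter(&freq);

    *sec  = (LONG)(now.QuadPart / freq.QuadPart);
    *usec = (LONG)(((now.QuadPart % freq.QuadPart) * 1000000) /
                   freq.QuadPart);
}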
Additional overhead comes from the interaction with NDIS and the kernel (these interactions usually require system support functions, which are expensive), from managing (mapping and unmapping) the kernel buffer, and from prepending the NPF header to each packet. In short, besides filtering and copying, there are nearly 830 further clock cycles of overhead per packet.
4.4 Total Overhead
Figure 6 shows the clock cycle cost of each component as a pie chart. The total cost of processing one packet is 5680 clock cycles. The configuration used for this measurement is: traffic rate 148 kfps, packet size 64 bytes, adapter 3Com 3C996 Gigabit, filter executing 21 virtual instructions.
As Figure 6 makes clear, with small packets the timestamp and the NIC driver account for the largest share of the cost. Because these two costs depend mostly on the hardware, software optimization there is of little value. Some small-scale optimizations of NIC drivers exist but are not widely applicable, because vendors refuse to publish their code; in any case, optimizing NIC driver software is far less effective than upgrading the chipset on the NIC itself.
Most importantly, Figure 6 shows why the literature's focus on copying and filtering yields limited results: optimizing them can bring at most about a 15% performance improvement.
Our performance analysis concentrates on small packets, and this is not an arbitrary limitation. Many network analysis tools (especially sniffers and network monitors) capture only the initial part of each packet, for example the first 98 bytes, and discard the rest. This is why we consider small packets in our tests.

4.5 Additional Remarks on the Test Results
Although our measurements were taken with a specific tool (Winpcap) on a specific platform (Win32), the results are broadly applicable. The costs inside Winpcap (tap() processing, first and second copy, filtering) are similar in other architectures (NPF and BPF, for example, are quite similar), and the OS-related costs (NIC driver, timestamp, context switch) are comparable as well. NIC driver cost can be reduced by moving some of the driver's work into the NIC's silicon at design time, but that is expensive. A hardware-generated timestamp is one of the most effective optimizations, as the widespread use of the Endace DAG card [17] demonstrates. Because Intel-based hardware lacks a dedicated timing chip for this purpose, the x86 platform offers no simple way to obtain precise timestamps other than internal logic (such as CPU counters). Worse, precise timing comes from 8253/8254 chips (or similar), and since they must be accessed with in/out operations across the system bus, reading them is quite slow.
Finally, the impact of the context switch can be disregarded here, because it is similar across modern operating systems (though it remains a parameter that deserves careful tuning).

5. Optimization
This section describes and evaluates the optimizations made in NPF to address the bottlenecks identified in the previous sections.
5.1 Filtering Component
The filtering system used by Winpcap is the BSD Packet Filter (BPF), proposed in 1993 [2]. Other filtering systems [9] [10] [11] [12] have since been described in the literature, but in normal operating conditions their performance is not decisively better than BPF's.
Among BPF optimizations, dynamic code generation (compiling the packet filter into native CPU instructions) promises excellent performance gains [11] [10]. We therefore integrated a Just-In-Time (JIT) engine into NPF that compiles BPF filter code into 80x86 binary code. As Figure 7 shows, this speeds up filtering by a factor of 3.1 to 5, which improves the overall performance of the capture mechanism by 8% (assuming a filter of 21 virtual instructions).
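The gain comes from replacing the per-packet fetch-decode-execute loop with one call to precompiled native code. The contrast can be sketched as follows; everything except bpf_filter() (libpcap's reference interpreter) is illustrative, and NPF's real JIT emits the 80x86 code once, at filter-load time.

#include <pcap.h>   /* struct bpf_insn, bpf_filter() */

/* a JIT-compiled filter is just native code with this shape */
typedef unsigned int (*jit_filter_fn)(const unsigned char *pkt,
                                      unsigned int len);

unsigned int run_filter(const struct bpf_insn *prog, jit_filter_fn jitted,
                        const unsigned char *pkt, unsigned int len)
{
    if (jitted != NULL)
        return jitted(pkt, len);          /* one direct call per packet */
    /* otherwise: interpret the pseudo-instructions one by one */
    return bpf_filter(prog, pkt, len, len);
}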

5.2 Memory Copies
As mentioned earlier, the first packet copy (from NIC memory to the kernel buffer) costs more than the second, the reason being the extra processing in NdisTransferData(). However, we observed that almost all network controllers (most network adapters) have completely transferred a packet to memory before indicating it to the NIC driver, so the NPF driver can find it in a contiguous buffer. We therefore abandon the old method and perform the copy with a standard C library function. Figure 8 shows the result. Because packets are transferred in blocks and a high CPU cache hit rate is maintained, the second copy speeds up as well; the two copies now have similar costs and follow the same trend.

Thanks to this optimization, on a fully loaded machine with 64-byte packets the cost of the first copy drops from 540 clock cycles to 300, while the cost of the second copy changes little. This improves the performance of the whole capture mechanism by 4%.
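A sketch of the optimized first copy under the assumption stated above (the whole packet already lies in one contiguous, memory-resident buffer); this is illustrative code built on documented NDIS calls, not NPF's actual source.

#include <ndis.h>

VOID copy_packet_fast(PNDIS_PACKET Packet, PUCHAR KernelBuffer)
{
    PNDIS_BUFFER first;
    PVOID va;
    UINT firstLen, totalLen;

    /* query the address of the packet data already in main memory */
    NdisGetFirstBufferFromPacket(Packet, &first, &va,
                                 &firstLen, &totalLen);

    if (firstLen == totalLen)                       /* contiguous packet */
        NdisMoveMemory(KernelBuffer, va, totalLen); /* plain memory copy */
    /* else: fall back to the NdisTransferData() path */
}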
5.3 Timestamp
KeQueryPerformanceCounter() can be replaced by the Time Stamp Counter (TSC) present in 32-bit Intel processors, which still yields microsecond-precision timestamps. This highly efficient counter increments on every processor clock cycle, so its resolution is close to the CPU clock frequency, and the x86 instruction set provides a fast instruction, RDTSC, to read it. Because converting the result into a standard struct timeval requires two 64-bit divisions, obtaining a timestamp through RDTSC costs about 270 clock cycles. This optimization improves the performance of the whole capture mechanism by 6.6%. However, it is disabled by default in the standard NPF release, because it requires specific platform support: RDTSC works only on Intel CPUs and compatible products (for example, AMD Athlon processors), and it is tied to the processor clock speed (some processors adjust their frequency according to external conditions, such as battery level). In any case, it shows that obtaining the timestamp from a simple hardware facility can significantly improve performance.
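A minimal sketch of this timestamp path: one TSC read plus the two 64-bit divisions mentioned above. The calibration of tsc_hz against the system clock is assumed to happen once at startup.

#include <stdint.h>

extern uint64_t tsc_hz;          /* calibrated CPU cycles per second */
extern uint64_t read_tsc(void);  /* RDTSC wrapper, as sketched earlier */

struct my_timeval { long tv_sec; long tv_usec; };

void tsc_timestamp(struct my_timeval *tv)
{
    uint64_t t = read_tsc();
    tv->tv_sec  = (long)(t / tsc_hz);                      /* 1st division */
    tv->tv_usec = (long)((t % tsc_hz) * 1000000 / tsc_hz); /* 2nd division */
}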
5.4 Optimizing the tap() Function
A similar gain in tap() comes from using the standard C library memory copy instead of NdisTransferData(). Transferring a packet with NdisTransferData() requires setting up the structures that describe the packet and registering a callback to be invoked when the copy completes; avoiding the function avoids these steps and noticeably improves performance. The simplified tap() processing reduces the per-packet capture overhead from 830 clock cycles to 560, a 5% improvement in the overall performance of the capture mechanism.
5.5 Total Overhead After Optimization
Figure 9 shows the per-component cost of kernel-level processing for 64-byte packets under the same conditions as Figure 6, but after applying the optimizations introduced here. The total cost drops to 3164 clock cycles, roughly half of the cost before optimization.

It is worth noting that after optimization, 49% of the CPU time goes to the NIC driver and kernel interrupt processing; these two costs are untouched by our optimizations. The components that were optimized are now 2.6 times faster.
5.6 Feasible Hardware Acceleration
Figure 9 shows that most of the remaining cost is caused by factors outside the capture component itself. A New Zealand company, Endace [17], sells adapter cards optimized for packet capture. These cards contain no exotic hardware tricks, yet they address the problem well by reducing the indirect overhead imposed by the operating system. In essence, the cards timestamp each received packet in hardware and transfer packets to system memory under the control of smarter hardware. By bypassing most of the work normally performed by the NIC driver and the OS's native protocol stack, user-space applications obtain the data they need directly, without passing through the intermediate layers.
Although we could not obtain direct experimental data (these cards support only Linux, while our test platform is Win32-based), the cards clearly reduce the indirect overhead of packet capture, even though they add some tap()-like processing of their own (buffer management is not entirely done in hardware, and some interaction with the kernel remains). [18] reports an indirect cost (on Linux) of 521 clock cycles, which some optimizations reduce to 190 (using the DAG API instead of the pcap API); compared with the original system, performance improves by a factor of 10.9 (29.9 with further optimization), and compared with the software-optimized system by a factor of 6.1 (16.7 with further optimization). Such effective hardware support promises a bright future for software-based packet capture and analysis tools.
6. Conclusion
This article has measured software-based network analysis and monitoring tools (such as sniffers) and presented the results. In the experiments, we used the CPU clock cycle to measure the cost of each component of the packet capture mechanism. The study yields a valuable quantitative conclusion: contrary to the common belief that the filtering and buffering components determine the overall performance of the packet capture mechanism, this is not the case. These two components have always received the most optimization attention, yet the measured improvements show that optimizing them does not significantly reduce the overall cost of the capture mechanism, especially with short packets (or when only the first bytes of each packet are captured). Our tests were carried out on a real system, and the results show that the biggest bottlenecks hide elsewhere: in the interactions between device drivers, applications and the OS, and between the OS and the hardware.
Because the performance of the packet capture mechanism is affected by many factors, effective optimization cannot concentrate on one or two components. This article has shown that a few well-placed optimizations are more effective than attempts to redesign the system structure, or than the exclusive focus on packet filtering and buffering found in parts of the literature. It has also shown that the performance of software-based traffic analysis tools can be improved further, even though the relevant algorithmic techniques were considered to have reached their limits, for example by generating timestamps in hardware and by bypassing the indirect overhead of the operating system.
All the details and data of the tests are available on the Winpcap home page [8], together with the relevant documents, source code, and examples. Most of the optimizations described in this article are implemented in Winpcap 3.0.
