About PF_RING/Intel 82599/transparent VPN

Source: Internet
Author: User
Tags macbook

I am close to the verge of collapse. This article was conceived in the hospital, because I am ill again. I would rather be put on a drip than take pills, but I cannot get online with my notebook there. I can't do anything, yet I want to learn something. I only have 3G, and I don't dare turn on a hotspot because nobody reimburses me for the traffic. I had only one day off this weekend, it rained, and then I had another night of it. After learning about PF_RING I was eager to run an experiment, so I ran home to verify it and then came back.

Here is how it started.

There are four questions in total.
1. About a network accelerator card. A few days ago I came across a network accelerator card that plugs into a PCIe slot. The card runs an independent Linux system and communicates with the host over PCIe. Its selling point? Packet capture performance, which makes it suitable for DPI or a transparent firewall. The vendor's technical staff personally demonstrated a firewall function for us: one specific website could not be reached, while everything else worked. What I found interesting is that the card has two 10-Gigabit optical ports, one connected to a PC that needs Internet access and the other to the egress switch. The PC could reach everything except the blocked site, yet the two optical ports were neither bridged nor given an IP address. How does the card send what comes in on one port out of the other with no bridge and no IP address? It hardly looks like a network device at all. So what is it?
2. Recently the company had its devices evaluated, and the results met the standard. I personally think there is still room for improvement, but the evaluation took place in Beijing and I was not there to see it, which I can only regret. Besides, the tester is not the kind of person who is passionate about technology, and our people had to accompany him the whole time. You can imagine the result: once the device passes, they accept it, and they would never let us play with their powerful performance-testing instruments.
The machine under test looks very powerful. It has many gigabytes of memory, more than 30 x86 CPU cores, and 10 Gbit/s NICs. It is so heavy that nobody wants to move it, and its ugly appearance is off-putting. In short, the hardware is powerful and cool, but the same cannot be said of the software. I have always thought that, on x86 with the Linux protocol stack, the network bottleneck lies in serial processing and in kernel locks. Protocol stack processing is itself serial, so the traditional Linux kernel stack cannot drive such hardware to its full performance. Hardware like this is generally used to host virtual machines. If you really want to match this hardware, I recommend not using the Linux kernel protocol stack. If you try to get the most out of the hardware without virtual machines while running traditional applications such as web services, you end up like everyone else: the bottleneck is not in the hardware but in the protocol stack and the application server.
3. If I have browsed a less-than-respectable website while the company's NAS online behavior management device is watching, I certainly don't want it to keep a record. Otherwise the administrators can look at it whenever they like and laugh at me behind my back; administrators really do have this hobby. After all, these days the network administrator holds enormous power over privacy. Later I heard that this device sits on the main link. As a network-savvy and curious R&D engineer, I naturally want to know how it achieves layer-2 monitoring. If it worked on mirrored traffic there would be nothing new about it. The key point is that it is not a bypass device but an inline one, meaning that packets really do pass through its processing logic. So if it did not want you to reach XXYY, it could act as a firewall and not just a monitor (if I were running the network, I would open up all access and only monitor, so that there is more fun to watch). This requires wire-speed forwarding, which ordinary Linux cannot deliver. How is such high-performance forwarding achieved, given that the box is still a layer-2 device?
4. Transparent VPN. This one begins with embarrassment. When I deployed my VPN last year, I told the customer that our product was a layer-2 device. The customer was relieved, because a layer-2 device needs no network parameters configured, which saves a lot of maintenance time. I thought so too, but he was wrong and so was I. After the deployment I went home happily, and for the next month, even several months, I was pestered by countless "routing problems in bridge mode". So I played implementation engineer once more and explained to the customer: "We pull the traffic we are interested in up to layer 3, and pass the traffic we are not interested in straight through at layer 2..." Saying this was like a slap in my own face, because it was clearly an after-the-fact explanation, an excuse. Why didn't we say so in the first place? If the customer had understood the technology, this would have been obvious...
Worn down by intensive deployments, I was a little lost. I kept thinking about how to make a layer-2 device "completely transparent" (a customer once taunted me that a certain Shenzhen company's VPN is a box you just plug into the link, with no configuration required...). There are some approaches, of course, such as the Linux Netfilter queue/packet socket mechanisms, or cleverly detecting a packet's "last hop" in the kernel and automatically filling in the MAC header on the return packet. In short, see my blog post from February: it has plenty of clever tricks, but none of them is practical. They are just tricks! What I need is a standard solution for implementing a layer-2 VPN.
Those are the questions I wanted to raise. For many people they are not problems at all, and in fact they are not problems for me either, but for many others a clear explanation is needed.
For the explanation below, I think a prototype has to come first and performance second. So I will first cover the general principle and then performance tuning.
One-in-one-out devices and PF_RING

The devices or cards mentioned in the questions above are all "one port in, one port out" devices, except in question 2. The machine in question 2 is not such a device; it is better seen as a server than as a forwarding device. I included it in order to discuss performance tuning.
This kind of device is especially suited to being connected inline on a link, because its forwarding principle is very simple and needs no table lookup at all: a packet that arrives on one port must leave from the other. You can think of it as an expensive full-duplex network cable. Such a device can certainly forward data, but before forwarding it can also do other things, such as filtering out some traffic or recording some traffic... Is there any way for ordinary people without such hardware to get a look at this kind of device?
Yes. You can easily build one with PF_RING. So what is PF_RING? The best source is the introduction on its website, but I will give a brief one here. PF_RING is a packet capture/injection mechanism. In essence, it takes a packet from the driver and places it in a ring buffer, or lets you place an Ethernet frame you constructed directly onto the NIC driver's send queue. In other words, it provides a direct path for getting at packets. Without it, under conventional protocol stack logic, a packet has to be stripped layer by layer, serially, to expose the data of each layer. Compared with libpcap built on PACKET sockets, the PF_RING mechanism is more flexible:
1. PF_RING uses mmap to place raw network data where user space can access it directly, rather than copying it into user memory through the socket read/write mechanism;
2. PF_RING supports the following three ways (2.1 to 2.3) of getting raw data into the mmap'd user-space ring buffer, plus the DNA mode (2.4):
2.1. Capture packets at the netif_receive_skb function, the same place the PACKET socket uses. This is the PACKET-socket-compatible method; the difference is that packets no longer reach user space through socket I/O but through mmap;
2.2. Packets are placed directly into the ring buffer at the NAPI level, and NAPI polling also delivers them to the skb queue. Of these two paths, the first is more efficient than method 2.1 because it shortens the packet's path through the kernel, but it requires a NIC driver that supports both NAPI and the PF_RING interface (normally NAPI polls packets into an skb queue);
2.3. The same as 2.2, except that NAPI polling does not deliver the packet to the kernel. Packets never enter the kernel protocol stack but are exposed to user space via mmap. This is especially suitable when all processing is done in user space, not just network auditing: since the kernel no longer processes network data, the CPU is freed for user-space network processing, and serial kernel processing can become parallel user-space processing. I will come back to this below;
2.4. The most powerful method, the DNA mode, bypasses the kernel protocol stack entirely: the NIC chip DMAs packets straight into the ring buffer and the kernel never sees them. Combined with Intel's 10G cards, this is very exciting;
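
The ring-buffer idea behind items 1 and 2 can be sketched in plain C. This is only a toy single-producer/single-consumer ring, not PF_RING's actual memory layout; the names toy_ring, ring_put, ring_peek and ring_release and the sizes are all made up for illustration. In PF_RING the producer would be the kernel module (or the NIC itself in DNA mode) and the memory would be shared through mmap, whereas here both sides live in one process.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define SLOT_COUNT 8u      /* hypothetical ring size */
#define SLOT_BYTES 1518u   /* room for one full Ethernet frame */

struct toy_slot {
    uint16_t len;
    uint8_t  data[SLOT_BYTES];
};

struct toy_ring {
    uint32_t head;                     /* next slot the producer writes */
    uint32_t tail;                     /* next slot the consumer reads */
    struct toy_slot slots[SLOT_COUNT];
};

/* Producer side: store one frame. Returns 0, or -1 if the ring is
 * full or the frame does not fit in a slot. */
int ring_put(struct toy_ring *r, const uint8_t *frame, uint16_t len)
{
    assert(r != NULL && frame != NULL);
    if (len > SLOT_BYTES || r->head - r->tail == SLOT_COUNT)
        return -1;
    struct toy_slot *s = &r->slots[r->head % SLOT_COUNT];
    memcpy(s->data, frame, len);
    s->len = len;
    r->head++;
    return 0;
}

/* Consumer side: return a pointer to the oldest frame in place,
 * without copying it, or NULL if the ring is empty. */
const uint8_t *ring_peek(const struct toy_ring *r, uint16_t *len)
{
    assert(r != NULL && len != NULL);
    if (r->head == r->tail)
        return NULL;
    const struct toy_slot *s = &r->slots[r->tail % SLOT_COUNT];
    *len = s->len;
    return s->data;
}

/* Consumer side: hand the oldest slot back to the producer. */
void ring_release(struct toy_ring *r)
{
    assert(r != NULL);
    if (r->head != r->tail)
        r->tail++;
}
```

The point of the sketch is the consumer path: ring_peek hands out a pointer into the shared slots, so reading a packet costs no extra copy, which is the property the mmap-based methods above are after.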

According to the PF_RING website, methods 2.1 to 2.3 differ from method 2.4 in how packets reach user space. (The original post showed the comparison in a figure.)




Unfortunately, method 2.4 is not free. Under its license you can download it for free, but the library is provided only in binary form as an evaluation version. For long-term use you need to buy an unlock code. That hurts a little, but its authors also need money to continue their research.
Behind PF_RING

Many people think PF_RING is only a high-performance packet capture mechanism that provides local mirroring of packets for analysis and network auditing. That is the traditional reading. Going further, PF_RING overturns the way intermediate network nodes interpret packets. In the traditional view, an intermediate node can only parse packets layer by layer at the protocol stack level: a router is a layer-3 device, a switch is a layer-2 device, firewalls are divided into layer-2 and layer-3 firewalls... A PF_RING device simply DMAs packets from the NIC chip into your machine's memory, and that is all. From there you process them in an application instead of in the kernel protocol stack. As for how your application might process packets, here are some possibilities:
1. Parse packets in depth, reassemble sessions at whatever granularity you can think of, and record audit information;
2. Provide high-performance intrusion detection;
3. Forward packets like a router. Instead of simply looking up the IP routing table, however, you can define your own forwarding table in various ways, for example by implementing a general SDN flow table;
4. Building on point 2, decide which packets to drop, which gives you a high-performance firewall.
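
Points 3 and 4 can be illustrated with a very small user-defined flow table. This is a sketch under my own assumptions, not anything PF_RING ships: the types flow_key and flow_rule, the action names, and the linear first-match lookup are all hypothetical, and a real device would extract the key from the captured frame and use a faster lookup structure.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum flow_action { ACT_FORWARD, ACT_DROP, ACT_AUDIT };

struct flow_key {
    uint32_t src_ip, dst_ip;   /* host byte order */
    uint16_t dst_port;
    uint8_t  proto;            /* 6 = TCP, 17 = UDP */
};

struct flow_rule {
    struct flow_key value;     /* what to match */
    struct flow_key mask;      /* zero bits are wildcards */
    enum flow_action action;
};

/* A rule matches when every masked bit of the key equals the rule value. */
static int rule_matches(const struct flow_rule *r, const struct flow_key *k)
{
    return ((k->src_ip   ^ r->value.src_ip)   & r->mask.src_ip)   == 0 &&
           ((k->dst_ip   ^ r->value.dst_ip)   & r->mask.dst_ip)   == 0 &&
           ((k->dst_port ^ r->value.dst_port) & r->mask.dst_port) == 0 &&
           ((k->proto    ^ r->value.proto)    & r->mask.proto)    == 0;
}

/* First matching rule wins; fall back to default_action otherwise. */
enum flow_action flow_lookup(const struct flow_rule *rules, size_t n,
                             const struct flow_key *k,
                             enum flow_action default_action)
{
    assert(k != NULL && (rules != NULL || n == 0));
    for (size_t i = 0; i < n; i++)
        if (rule_matches(&rules[i], k))
            return rules[i].action;
    return default_action;
}
```

With a default of ACT_FORWARD and a few ACT_DROP rules this behaves like the firewall in point 4; with ACT_AUDIT rules it behaves like the audit device in point 1. The table format is entirely up to the application, which is the flexibility the list is describing.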

Compared with the serial protocol stack, PF_RING is not only more efficient but also more flexible. With a multi-core processor you can even process packets at different layers in parallel in user space. (The original post showed this in a figure.)




PF_RING fits the multi-queue feature of modern high-end NICs. That is true, but consider a NIC without multi-queue support under the current Linux kernel protocol stack. Suppose the interrupt lands on one particular CPU (Linux's interrupt balancing is, frankly, unimpressive) and the NIC does not support interrupt balancing; the softirq it triggers will generally run on that same CPU. There is nothing you can do about it, and even with 8 CPU cores, what good are they? With PF_RING, however, one NIC can feed multiple rings. You can place different flows on different rings, or even different packets on different rings, and have each ring processed by an application pinned to a specific CPU. This effectively pushes multi-queue one level up. (The original post showed this in a figure.)
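
One simple way to place different flows on different rings is to hash each packet's 5-tuple and take the result modulo the number of rings. The sketch below is my own illustration of that idea; five_tuple, flow_hash and pick_ring are hypothetical names, the hash is an arbitrary simple mixer, and this is not how PF_RING itself distributes packets.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct five_tuple {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
};

/* Symmetric hash: source and destination are combined with XOR, so both
 * directions of a connection produce the same value. */
uint32_t flow_hash(const struct five_tuple *t)
{
    assert(t != NULL);
    uint32_t h = t->src_ip ^ t->dst_ip;
    h ^= (uint32_t)(t->src_port ^ t->dst_port);
    h ^= t->proto;
    h *= 2654435761u;          /* multiply by a large odd constant to mix bits */
    return h ^ (h >> 16);
}

/* Map a flow to one of ring_count rings (and so to one pinned worker). */
unsigned pick_ring(const struct five_tuple *t, unsigned ring_count)
{
    assert(ring_count > 0);
    return flow_hash(t) % ring_count;
}
```

Because the hash is symmetric, the request and the reply of one connection land on the same ring, so a worker pinned to one CPU sees the whole session and needs no lock shared with the other workers.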




For non-forwarding devices such as an app server, where traffic terminates locally, the future architecture may look like the one the original post showed in a figure.




Looking at that architecture, you may ask: how does the device know a packet is addressed to the local machine? In fact, that is not the device's responsibility. How do you even know that what I DMA into the ring buffer is an Ethernet frame rather than a bare HTTP message? In short, nothing is fixed here, which is remarkable. What is best about PF_RING is not what it implements but that it is a mechanism that lets you implement things. What you can build on the PF_RING framework is limited only by your imagination, and that is also why everyone thinks PF_RING can only capture packets.
The traditional misunderstanding is like asking "how does an SDN switch handle IP-layer routing?" in an SDN environment. I made the same mistake: I asked the vendor of the network accelerator card from question 1 for a "user-space protocol stack". Because network protocols have always been handled in a protocol stack, I wanted a user-space protocol stack for peace of mind even after network processing moved to user space. Whether the network is processed in user space or kernel space, I thought, the key is to have a stack so that it is processed properly. The answer I got was: if you need to develop your own protocol stack, we will fully cooperate and support you... Good grief, do I have to study the RFCs and write code according to every MUST and SHOULD? How would I handle complex IP routing and the TCP state machine? TLS? Isn't there a ready-made user-space protocol stack?
In fact, the PF_RING concept overturns not only how intermediate nodes interpret packets but also how packets are processed. It makes the network adapter no longer just a network adapter: it opens a hole in the protocol stack, so that packets no longer have to be processed in the stack. (Many layer-2 firewalls did this before PF_RING, but the techniques they were built on were not general and offered no general interface.) A stack is inherently serial, like a pile of plates: you cannot take out a lower one without removing those above it. Try it if you don't believe me. Many devices are built on PF_RING, such as network audit systems and firewalls, and they do not use a user-space protocol stack because they do not need one. A protocol stack exists to interpret the logic of a packet layer by layer; it is one particular way of achieving decoupling. Layering is king only when heterogeneous networks must interoperate. If you hold the entire packet, who can dictate how you must parse and process it? You only need to keep the interface to the outside network consistent: for example, whatever leaves your device must be a properly formatted frame.
The vendor didn't give me a user-space protocol stack, and I don't want to implement one. Even if there were a ready-made open-source stack, I wouldn't want to port it. Even so, you can still build useful things. In fact, forget the protocol stack. For Ethernet, make sure you can correctly receive and process an Ethernet frame (the IN interface) and that what you send is an Ethernet frame (the OUT interface). That is enough. There is no standard for how you process the data in between.
By the way, much of what I said above about user-space protocol stacks and the network accelerator card has nothing to do with PF_RING, but the idea is the same. Whether or not you have heard of it, there are many-core processing boards based on the Tilera-GX architecture, and the NetLib suite provided by Tilera offers a very convenient set of development interfaces that let you quickly build network processing logic running in user space. The NetLib development kit works directly with the underlying Tilera cores. Tilera's idea is that after the network processing chip on the board receives a packet, it does not hand it to the kernel protocol stack (Linux or otherwise) but places it in a ring buffer that user space can access directly. Up to this point it is basically a counterpart of PF_RING DNA. Tilera has NetLib, and PF_RING has libpfring. If you read the documents and samples, though, you will see that NetLib provides more than libpfring. With libpfring you need plenty of imagination and the ability to build things yourself. With NetLib many things are ready-made (including IP route lookup logic), although you still need some building ability. In any case, the two share the same idea.
Implementing a transparent forwarding device

The sections above covered a lot of theory and opinion, and some of the reasons for PF_RING's high performance. In practice we often care more about how to get things done; that is, I care more about the interfaces and how to use them. If you can use the interfaces and trust the layer beneath them, implementation is no problem; add a deep understanding of the principles and you are basically invincible (though in reality many people are content with just the interfaces). In this section I implement a simple transparent forwarding device, that is, an expensive network cable. It is as simple as it can be, with only forwarding and a little filtering, and its purpose is to show how the interfaces are used.

The topology is very simple. My iMac has Wi-Fi turned off and is connected to my MacBook by a network cable. The MacBook connects to the router over Wi-Fi. The MacBook hosts a Linux virtual machine with two NICs, both in bridged mode: one bridged to Wi-Fi and one bridged to Ethernet. (The original post showed the topology in a figure.)




If there is only one computer, this topology is really difficult to build. Don't tell me to use VMWare's LAN Segment. I hate that kind of thing.

The topology is configured as follows:
iMac-en0: 192.168.1.200/24, default gateway: 192.168.1.1
MacBook-en0: no IP address
Linux VM-eth0: up, no IP address
Linux VM-eth1: up, no IP address
MacBook-en1: 192.168.1.100/24 (assigned over Wi-Fi, unavoidable), default gateway: none
TP-Link LAN IP: 192.168.1.1/24
That is clear enough.
Next, the code:
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <getopt.h>
#include <pfring.h>

int main(int argc, char *argv[])
{
        /* Start from NULL so the cleanup code at "end" is always safe. */
        pfring *pfring_net1 = NULL, *pfring_net2 = NULL;
        char *dev1 = NULL;
        char *dev2 = NULL;
        int c;  /* getopt_long returns int, not char */
        struct option opts[] = {
                {.name = "net1", .has_arg = 1, .val = 'i'},
                {.name = "net2", .has_arg = 1, .val = 'o'},
                {NULL}
        };

        while ((c = getopt_long(argc, argv, "i:o:", opts, NULL)) != -1) {
                switch (c) {
                case 'i':
                        dev1 = strdup(optarg);
                        break;
                case 'o':
                        dev2 = strdup(optarg);
                        break;
                }
        }
        if (dev1 == NULL || dev2 == NULL) {
                fprintf(stderr, "usage: %s -i <dev1> -o <dev2>\n", argv[0]);
                goto end;
        }

        /* Open both interfaces in promiscuous mode with a 1518-byte snaplen. */
        pfring_net1 = pfring_open(dev1, 1518, PF_RING_PROMISC);
        pfring_net2 = pfring_open(dev2, 1518, PF_RING_PROMISC);
        if (pfring_net1 == NULL || pfring_net2 == NULL) {
                goto end;
        }

        /* Only ARP, TCP and UDP arriving on net1 are forwarded. */
        if (pfring_set_bpf_filter(pfring_net1, "arp or tcp or udp")) {
                goto end;
        }
        /* Capture received packets only, so forwarded frames are not read back. */
        if (pfring_set_direction(pfring_net1, rx_only_direction) ||
            pfring_set_direction(pfring_net2, rx_only_direction)) {
                goto end;
        }
        if (pfring_enable_ring(pfring_net1) ||
            pfring_enable_ring(pfring_net2)) {
                goto end;
        }

        while (1) {
                unsigned char *pkt;
                struct pfring_pkthdr ring_hdr;

                /* Non-blocking receive: > 0 means a packet was read,
                 * 0 means none was available, < 0 means an error. */
                if (pfring_recv(pfring_net1, &pkt, 0, &ring_hdr, 0) > 0) {
                        pfring_send(pfring_net2, (char *)pkt, ring_hdr.caplen, 1);
                }
                if (pfring_recv(pfring_net2, &pkt, 0, &ring_hdr, 0) > 0) {
                        pfring_send(pfring_net1, (char *)pkt, ring_hdr.caplen, 1);
                }
        }

end:
        if (pfring_net1) {
                pfring_close(pfring_net1);
        }
        if (pfring_net2) {
                pfring_close(pfring_net2);
        }
        free(dev1);
        free(dev2);
        return 0;
}

Compile it:
gcc test.c -o test -lpcap -lpfring -lrt
Note that the libpcap above is not the one installed by apt-get but the one under the userland directory of the PF_RING package. This article is not PF_RING documentation, so I will not go into these details. I would also like to thank whoever uploaded "Pf_ring .v5.4.4.pdf" to the CSDN resource channel: with it there is no need for a book, and the key point is that it costs 0 resource points! Run the compiled program:
./test -i eth0 -o eth1
Then open a web page on the iMac: it works perfectly. Ping Baidu? It fails! Why? Because of this line:
pfring_set_bpf_filter(pfring_net1, "arp or tcp or udp")
Only ARP, TCP, and UDP are let through; ICMP is not.
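
To make the effect of that filter concrete, here is a hand-written approximation of what "arp or tcp or udp" decides for a raw Ethernet frame. It is only a sketch: passes_filter is a hypothetical helper of mine, it looks at untagged Ethernet and IPv4 only (the real BPF expression also covers IPv6), and PF_RING does not run this code; it compiles the filter string instead.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETHERTYPE_IPV4 0x0800
#define ETHERTYPE_ARP  0x0806
#define ETH_HDR_LEN    14u
#define IPV4_MIN_HDR   20u

/* Returns 1 if the frame would be forwarded, 0 if it would be dropped. */
int passes_filter(const uint8_t *frame, size_t len)
{
    assert(frame != NULL);
    if (len < ETH_HDR_LEN)
        return 0;
    /* The EtherType field sits at bytes 12-13, in network byte order. */
    uint16_t ethertype = (uint16_t)((frame[12] << 8) | frame[13]);
    if (ethertype == ETHERTYPE_ARP)
        return 1;
    if (ethertype != ETHERTYPE_IPV4 || len < ETH_HDR_LEN + IPV4_MIN_HDR)
        return 0;
    /* The IPv4 protocol field is byte 9 of the IP header. */
    uint8_t proto = frame[ETH_HDR_LEN + 9];
    return proto == 6 /* TCP */ || proto == 17 /* UDP */;
}
```

An ICMP echo request carries protocol number 1, so it fails the last test and is never forwarded, which is why ping gets no reply. Changing the filter string to "arp or icmp or tcp or udp" would let ping through.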
Starting from the forwarding program above, a few modifications are enough to turn your machine into a transparent device. But remember that you are using PF_RING. Why emphasize this? As far as the effect goes, libpcap/PACKET sockets or Tilera's NetLib could do the same job, but PF_RING is different. Compared with PACKET sockets, PF_RING's design improves performance. Compared with NetLib, PF_RING is more general and its interface is better designed. If you want to learn PF_RING, download the latest version, unpack it, and compile it. The package is rich in content, including customized drivers, the user-space library, and the kernel module. It also contains many examples, and every README is worth reading.
Modern Gigabit/10-Gigabit Ethernet cards-Intel 82599 as an Example

If NetLib is the proprietary solution for the Tilera architecture, PF_RING is the general solution for x86. The Intel 82599 10-Gigabit card offers many new mechanisms and many extensions to old ones. Of all these, the most exciting is the refinement of multi-queue. (The original post showed this in a figure.)



Combining multi-queue with PF_RING gives the arrangement that the original post showed in a second figure.



Looking at that picture, the kernel protocol stack seems redundant, and indeed it is. Without PF_RING, how does the Linux kernel protocol stack cope with independent RX queues? If each queue is processed by a different CPU core, does performance improve? Of course it does, but don't forget that the main character is the protocol stack. The traditional Linux kernel protocol stack uses many global data structures, such as skb lists, the route cache, and socket tables, and operating on any of them requires a lock. Besides the locking overhead, the kernel protocol stack is serial by nature: while processing the MAC layer it cannot pre-process the IP layer, and while processing the IP layer it cannot pre-process TCP. This leaves the Linux kernel at a loss when faced with a NIC like the Intel 82599. Even with more than 60 CPU cores, they seem wasted on the Linux kernel. Although I have not tested it, I think the same holds for Windows, and even vendor-optimized UNIX systems do not do well enough.
To break up one large serial kernel protocol stack and escape the lock overhead, the approach is to split the protocol stack itself. On Linux the lightweight way is to create multiple net namespaces. Each namespace has its own set of protocol stack data structures, so network operations in different namespaces need no mutual exclusion. Annoyingly, though, Linux namespaces cannot divide up the Intel 82599's multiple RX queues, alas... On a machine with a single 82599 card, the card can belong to only one namespace, which blocks this path! Fortunately, the Linux kernel and the Intel driver can be modified at will... The heavyweight way is to use virtual machines, which Intel's official drivers fully support.
All this is too much trouble. Wouldn't a user-space protocol stack attached directly to PF_RING be better, with every CPU core kept busy? Why am I back at the protocol stack again? Didn't I say to forget it? Yes, a protocol stack is useless on a one-in-one-out device, but on an application server, as long as TCP/IP is in use, the TCP logic still has to be handled: the TCP state machine, flow control, congestion control. That does not mean it must be handled by the Linux protocol stack in the kernel. Since a packet passes through the NIC chip and the multi-queue logic, and PF_RING then carries it unharmed all the way to user space, why not feed it straight into a protocol stack there? For application servers that use PF_RING, a lock-free user-space protocol stack is the best companion.

How software copes with modern powerful NICs: PF_RING DNA (Direct NIC Access)

NIC chips, CPUs, chipsets, and buses are all improving rapidly. What about the software? Unfortunately, software seems to lag behind. PF_RING, however, goes a step further for you: it eliminates even the one remaining memory copy. Packets no longer need to be copied from the NIC into the mmap'd memory that user space reads; instead, when the NIC receives a packet it places it directly at a designated location. Concretely, NIC memory is mapped into the address space. Virtual memory really is a good thing.
Install a DNA-aware NIC driver, load the DNA user library, and run a DNA application. The NIC is then no longer just a NIC: it is as if the remote end of the wire extended directly into local memory.

A VPN device based on PF_RING

Now I can answer the VPN question. With PF_RING, I capture the packet into a user-space process, strip off the MAC header and save it, save the IP header as well, and encrypt the whole IP datagram. I then re-encapsulate the encrypted data with the saved IP header and MAC header (transport mode), or with a new IP header and the saved MAC header (tunnel mode). I could even encrypt just the TCP/UDP payload and keep the TCP/UDP header. This way my VPN device achieves extremely flexible encryption and decryption without any IP address configured, and it really is an expensive network cable.
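
The transport-mode idea can be sketched as an in-place transform on a captured frame. Everything here is an assumption for illustration: transport_mode_transform and toy_cipher are hypothetical names, the XOR "cipher" is a placeholder and NOT real encryption, and the sketch simplifies by leaving the Ethernet and IPv4 headers untouched and transforming only what follows them. A real device would use a proper cipher and would also have to fix up the IP length, protocol, and checksum fields.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ETH_HDR_LEN 14u

/* Placeholder cipher: XOR with a repeating key. NOT real encryption;
 * it only stands in for the encrypt/decrypt step. */
static void toy_cipher(uint8_t *buf, size_t len,
                       const uint8_t *key, size_t key_len)
{
    for (size_t i = 0; i < len; i++)
        buf[i] ^= key[i % key_len];
}

/* Keep the MAC header and IPv4 header as they are and transform the rest
 * of the frame in place. Returns 0 on success, -1 if the frame is not
 * IPv4 or is too short. */
int transport_mode_transform(uint8_t *frame, size_t len,
                             const uint8_t *key, size_t key_len)
{
    assert(frame != NULL && key != NULL && key_len > 0);
    if (len < ETH_HDR_LEN + 20u)
        return -1;
    if (frame[12] != 0x08 || frame[13] != 0x00)   /* EtherType must be IPv4 */
        return -1;
    /* The IPv4 header length is the low nibble of byte 0, in 32-bit words. */
    size_t ip_hdr_len = (size_t)(frame[ETH_HDR_LEN] & 0x0f) * 4u;
    if (ip_hdr_len < 20u || len < ETH_HDR_LEN + ip_hdr_len)
        return -1;
    size_t off = ETH_HDR_LEN + ip_hdr_len;
    toy_cipher(frame + off, len - off, key, key_len);
    return 0;
}
```

In the forwarding loop shown earlier, a call like this would sit between pfring_recv and pfring_send. Because the headers that the neighbouring devices look at are preserved, the box needs no IP address of its own, which is the point the paragraph above makes.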
It is now 01:11 on June 22, 2014. Argentina and Iran are still playing, and there has just been a dangerous attack... Five days until their birthday and four days until the end of my nightmare.
