Introduction to Linux Kernel Engineering-network: Filter (LSF, BPF)

Source: Internet
Author: User

Packet filtering

The LSF (Linux Socket Filter) originates from BPF (Berkeley Packet Filter) and is based on the same architecture, but is simpler to use. Its core idea is to offer the user a socket option, SO_ATTACH_FILTER, which attaches a custom filter to a socket so that only packets matching the filter's criteria are delivered to user space. Because there are many kinds of sockets, the filter can be applied at different levels: attached to a raw socket, it can filter all IP packets (this is how tcpdump works); for an HTTP analysis tool, it can be attached to a socket bound to port 80 (or another HTTP listening port). BPF can also be used offline: capture packets with libpcap, store them locally, and then run BPF code over the saved packets, which is very helpful for experimenting with new rules and testing BPF programs. At an even lower level, a kernel module can write eBPF programs directly (eBPF is the in-kernel representation, described later) and insert them into the kernel's execution path.

/proc/sys/net/core/bpf_jit_enable

Writing 0, 1, or 2 to this file disables the BPF JIT compiler, enables it, or enables it with debug output written to the kernel log, respectively.
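
A minimal sketch of inspecting this setting from a program, assuming the file holds a single decimal number as described above. jit_mode_name and read_jit_mode are hypothetical helper names, and changing the value would additionally require root privileges.

```c
#include <stdio.h>

#define BPF_JIT_ENABLE_PATH "/proc/sys/net/core/bpf_jit_enable"

/* Map the numeric setting to the mode it selects. */
static const char *jit_mode_name(int mode)
{
    switch (mode) {
    case 0:  return "disabled";
    case 1:  return "enabled";
    case 2:  return "enabled, with debug output";
    default: return "unknown";
    }
}

/* Read the current setting from `path`; returns -1 if it cannot be read. */
static int read_jit_mode(const char *path)
{
    int mode = -1;
    FILE *f = fopen(path, "r");

    if (!f)
        return -1;
    if (fscanf(f, "%d", &mode) != 1)
        mode = -1;
    fclose(f);
    return mode;
}
```

For example, printf("%s\n", jit_mode_name(read_jit_mode(BPF_JIT_ENABLE_PATH))) prints the current mode, or "unknown" when the file is missing.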

In user space, the simplest approach is to use the libpcap engine. BPF is an assembly-like language and is hard to write by hand, so libpcap provides higher-level wrappers that can be called directly. However, libpcap does not cover every need, such as the testing requirements of BPF module developers or advanced custom BPF scripts. In those cases you need to write the BPF code yourself, assemble it with the tools in the kernel's tools/net/ directory, and pass the result in through the socket interface. Inside the kernel, the BPF engine is mostly provided as modules, and different engines can be used; commonly used ones include xt_bpf (netfilter) and cls_bpf.
Full kernel support for BPF as part of iptables dates from 3.9: the kernel side is xt_bpf and the user-space library is libxt_bpf. iptables originally evaluated its rules sequentially, which inevitably becomes a performance bottleneck when many rules have to be matched; adding BPF support greatly improved its flexibility.

Other BPF Programs

The BPF programs described above are used for packet filtering. Does that mean BPF code can only be used for packet filtering? Not at all. The kernel's BPF support is infrastructure: an intermediate-code representation that gives user space a common interface for injecting executable code into the kernel. Most current applications use this interface for packet filtering, but there are others. seccomp BPF restricts the system calls a user process may make, cls_bpf classifies traffic, and the PTP dissector/classifier (which I am not familiar with) and similar components all use the kernel's eBPF language architecture for their own purposes, which are not necessarily packet filtering.

User Space BPF Support

Tools: tcpdump, tools/net, Cloudflare, seccomp BPF

User space BPF Compilation architecture analysis

Each assembly instruction in the BPF is in the following format:

struct sock_filter {    /* Filter block */
    __u16   code;   /* Actual filter code */
    __u8    jt;     /* Jump true */
    __u8    jf;     /* Jump false */
    __u32   k;      /* Generic multiuse field */
};

That is: code:16, jt:8, jf:8, k:32.
code is the actual instruction, jt is the jump taken when the instruction's result is true, jf is the jump taken when it is false, and k is the instruction's argument, whose meaning depends on the instruction. A BPF program is compiled into an array of sock_filter structures. It can be written in an assembly-like syntax and then assembled with the bpf_asm program provided with the kernel.
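
To make the encoding concrete, here is a small sketch of such an array, written with the BPF_STMT and BPF_JUMP macros from linux/filter.h. The array name ipv4_only is made up for this example, and the program assumes Ethernet framing, where the EtherType sits at byte offset 12.

```c
#include <linux/filter.h>     /* struct sock_filter, BPF_STMT, BPF_JUMP */
#include <linux/if_ether.h>   /* ETH_P_IP */

/* Accept IPv4 frames and drop everything else. In bpf_asm syntax:
 *         ldh [12]
 *         jeq #0x800, keep, drop
 *   keep: ret #-1
 *   drop: ret #0
 */
static struct sock_filter ipv4_only[] = {
    BPF_STMT(BPF_LD  | BPF_H   | BPF_ABS, 12),            /* A = EtherType */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, ETH_P_IP, 0, 1),  /* jt = 0, jf = 1 */
    BPF_STMT(BPF_RET | BPF_K, 0xffffffff),                /* keep the packet */
    BPF_STMT(BPF_RET | BPF_K, 0),                         /* drop the packet */
};
```

The jt and jf values are relative offsets counted from the next instruction, so jt = 0 falls through to the "keep" return and jf = 1 skips over it to the "drop" return.
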
BPF is in fact a virtual machine inside the kernel with its own set of virtual registers, the same principle as the familiar Java virtual machine. The design of this virtual machine is the key to LSF's success. There are 3 types of registers:

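  A      32 bits wide; the destination of all load instructions and the place where the result of every arithmetic instruction is stored
  X      32 bits wide; auxiliary register holding the second operand of binary instructions on A (for example a shift count or a divisor)
  M[]    16 registers of 32 bits each, indexed 0-15, free for any use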
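
The role of the three register types can be sketched with a toy user-space interpreter. This is only an illustration under simplifying assumptions, not the kernel's implementation: run_filter, SCRATCH_WORDS and the two sample programs are hypothetical names, and only a handful of opcodes are handled.

```c
#include <linux/filter.h>   /* struct sock_filter and the BPF_* opcode macros */
#include <stddef.h>
#include <stdint.h>

#define SCRATCH_WORDS 16    /* size of M[], as listed above */

/* Run `prog` over `pkt` and return the RET value (0 means "drop").
 * Unknown opcodes and out-of-range accesses drop the packet. */
static uint32_t run_filter(const struct sock_filter *prog, size_t n,
                           const uint8_t *pkt, size_t len)
{
    uint32_t A = 0, X = 0, M[SCRATCH_WORDS] = {0};

    for (size_t pc = 0; pc < n; pc++) {
        const struct sock_filter *f = &prog[pc];

        switch (f->code) {
        case BPF_LD | BPF_H | BPF_ABS:      /* A = big-endian half word at pkt[k] */
            if ((size_t)f->k + 2 > len) return 0;
            A = (uint32_t)pkt[f->k] << 8 | pkt[f->k + 1];
            break;
        case BPF_LD | BPF_B | BPF_ABS:      /* A = byte at pkt[k] */
            if (f->k >= len) return 0;
            A = pkt[f->k];
            break;
        case BPF_LD | BPF_W | BPF_MEM:      /* A = M[k] */
            if (f->k >= SCRATCH_WORDS) return 0;
            A = M[f->k];
            break;
        case BPF_ST:                        /* M[k] = A */
            if (f->k >= SCRATCH_WORDS) return 0;
            M[f->k] = A;
            break;
        case BPF_LDX | BPF_W | BPF_IMM:     /* X = k */
            X = f->k;
            break;
        case BPF_ALU | BPF_ADD | BPF_X:     /* A = A + X */
            A += X;
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:     /* relative jump on A == k */
            pc += (A == f->k) ? f->jt : f->jf;
            break;
        case BPF_RET | BPF_K: return f->k;
        case BPF_RET | BPF_A: return A;
        default:              return 0;
        }
    }
    return 0;
}

/* Sample 1: accept UDP over IPv4 on Ethernet (EtherType at 12, protocol at 23). */
static const struct sock_filter udp_over_ipv4[] = {
    BPF_STMT(BPF_LD | BPF_H | BPF_ABS, 12),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 0x0800, 0, 3),
    BPF_STMT(BPF_LD | BPF_B | BPF_ABS, 23),
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 17, 0, 1),
    BPF_STMT(BPF_RET | BPF_K, 0xffffffff),
    BPF_STMT(BPF_RET | BPF_K, 0),
};

/* Sample 2: exercises X and M[]: returns pkt[0] + 5. */
static const struct sock_filter scratch_demo[] = {
    BPF_STMT(BPF_LD | BPF_B | BPF_ABS, 0),   /* A = pkt[0] */
    BPF_STMT(BPF_ST, 3),                     /* M[3] = A   */
    BPF_STMT(BPF_LDX | BPF_W | BPF_IMM, 5),  /* X = 5      */
    BPF_STMT(BPF_LD | BPF_W | BPF_MEM, 3),   /* A = M[3]   */
    BPF_STMT(BPF_ALU | BPF_ADD | BPF_X, 0),  /* A += X     */
    BPF_STMT(BPF_RET | BPF_A, 0),            /* return A   */
};
```

Every load lands in A, X only ever supplies a second operand, and M[] is plain scratch storage, which is all the state a classic BPF program has.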

One of the most common operations is to load a word from the packet and make a decision based on it. BPF lets us address any location in the packet by offset. Because many protocols are common and have fixed layouts, for fields such as ports and IP addresses, BPF also provides predefined variables (extensions) that give direct access to the corresponding value. For example:

  len           skb->len
  proto         skb->protocol
  type          skb->pkt_type
  poff          Payload start offset
  ifidx         skb->dev->ifindex
  nla           Netlink attribute of type X with offset A
  nlan          Nested Netlink attribute of type X with offset A
  mark          skb->mark
  queue         skb->queue_mapping
  hatype        skb->dev->type
  rxhash        skb->hash
  cpu           raw_smp_processor_id()
  vlan_tci      skb_vlan_tag_get(skb)
  vlan_avail    skb_vlan_tag_present(skb)
  vlan_tpid     skb->vlan_proto
  rand          prandom_u32()
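
In the sock_filter encoding, most of these extensions are reached by an absolute word load at a special negative offset, SKF_AD_OFF plus a per-extension constant from linux/filter.h. The sketch below uses ifidx as the example; the array name only_ifindex_2 and the interface index 2 are arbitrary choices for illustration.

```c
#include <linux/filter.h>   /* SKF_AD_OFF, SKF_AD_IFINDEX, BPF_STMT, BPF_JUMP */

/* Accept only packets that arrived on the interface with index 2.
 * "ld ifidx" in bpf_asm becomes a word load at SKF_AD_OFF + SKF_AD_IFINDEX. */
static struct sock_filter only_ifindex_2[] = {
    BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
             (__u32)(SKF_AD_OFF + SKF_AD_IFINDEX)),   /* A = skb->dev->ifindex */
    BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, 2, 0, 1),     /* A == 2 ? keep : drop  */
    BPF_STMT(BPF_RET | BPF_K, 0xffffffff),            /* keep the packet */
    BPF_STMT(BPF_RET | BPF_K, 0),                     /* drop the packet */
};
```

Because the offset is negative, it can never collide with a real packet offset, which is how the kernel tells an extension load apart from an ordinary one.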

Better still, this list can be extended by users themselves, and the individual BPF engines also define their own extensions.

Kernel support for BPF

As we have seen, even the BPF code compiled on the user side is, to the kernel, only an array of structures; there is still a gap between it and machine code that can actually execute. To obtain directly executable binary code, a further compilation step takes place inside the kernel. First, the struct array submitted by the user is converted into eBPF code. eBPF is the kernel-level virtual assembly: the user does not write it but only supplies the sock_filter array, and the kernel performs the conversion to eBPF itself. The eBPF code is then translated into a binary that can be executed directly. Before eBPF, the in-kernel representation was the classic BPF format, which is still used on many platforms and is the same assembly used in user space. On x86, however, the kernel has switched to eBPF as its intermediate language, which means that on x86 the BPF used in user space is no longer the same as the one used inside the kernel. Even so, the eBPF definition reuses classic BPF as much as possible, and some instruction encodings and meanings, such as BPF_LD, are exactly the same.
So eBPF's ambitions clearly do not end here: as a platform-independent intermediate language living in the kernel, it aims to be the form into which any logic a user wants to run in the kernel is compiled.

Kernel eBPF Assembly Architecture Analysis

`
* R0 - return value from in-kernel function, and exit value for eBPF program
* R1 - R5 - arguments from eBPF program to in-kernel function
* R6 - R9 - callee saved registers that in-kernel function will preserve
* R10 - read-only frame pointer to access stack

`
To support its more powerful features, eBPF assembly uses more registers, and the register roles above fully embody the concept of a function call rather than the original load-and-process logic. With function-call semantics, an eBPF program can directly invoke functions inside the kernel (a security risk in principle, but internal mechanisms guard against it). In addition, because this register architecture closely resembles the real register architecture of CPUs such as x86, the actual implementation maps registers directly: each virtual register is backed by a real register with the same role, which greatly improves efficiency. On 64-bit machines these registers are 64 bits wide, making full use of the hardware. 64-bit support is not yet complete, but it is already usable.
The current kernel implementation only allows an eBPF program to call predefined kernel functions; it cannot call other eBPF programs (because eBPF itself has no concept of a function). This may seem minor, but it is a powerful capability: program logic can be implemented in C inside the kernel, and the eBPF program only needs to call that C function.
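
The calling convention can be illustrated by hand-encoding a tiny eBPF program with struct bpf_insn from linux/bpf.h. This is a sketch for illustration only: the array name random_bit is made up, and bpf_get_prandom_u32 is chosen simply because it is a predefined helper that takes no arguments.

```c
#include <linux/bpf.h>   /* struct bpf_insn, BPF_REG_*, BPF_FUNC_*, opcode macros */

/* r0 = bpf_get_prandom_u32(); r0 &= 1; exit
 * The helper takes no arguments, so R1-R5 are not set up. Its result
 * arrives in R0, which is also the program's exit value. */
static const struct bpf_insn random_bit[] = {
    { .code = BPF_JMP | BPF_CALL, .imm = BPF_FUNC_get_prandom_u32 },
    { .code = BPF_ALU64 | BPF_AND | BPF_K, .dst_reg = BPF_REG_0, .imm = 1 },
    { .code = BPF_JMP | BPF_EXIT },
};
```

A helper that does take arguments would expect them in R1 to R5 before the call instruction, and any value the program needs afterwards would have to be kept in R6 to R9.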

eBPF Data Interaction: Map

An eBPF program is not just code; it can also access external data, and importantly this external data can be managed from user space. The data takes the form of key-value maps, which user space creates, adds to, deletes from, and so on by invoking the bpf system call.
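
A minimal sketch of managing such a map from user space through the raw bpf system call. The counter_map_* helpers and the key and value sizes are assumptions made for this example; depending on kernel configuration and privileges, the calls may fail with EPERM.

```c
#define _GNU_SOURCE
#include <linux/bpf.h>     /* union bpf_attr, BPF_MAP_* commands and types */
#include <string.h>
#include <sys/syscall.h>   /* __NR_bpf */
#include <unistd.h>        /* syscall, close */

static long sys_bpf(int cmd, union bpf_attr *attr)
{
    return syscall(__NR_bpf, cmd, attr, sizeof(*attr));
}

/* Create a hash map with 4-byte keys and 8-byte values. Returns an fd or -1. */
static int counter_map_create(unsigned int max_entries)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_type = BPF_MAP_TYPE_HASH;
    attr.key_size = sizeof(__u32);
    attr.value_size = sizeof(__u64);
    attr.max_entries = max_entries;
    return (int)sys_bpf(BPF_MAP_CREATE, &attr);
}

/* Insert or overwrite map[key] = value. Returns 0 or -1. */
static int counter_map_update(int fd, __u32 key, __u64 value)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = (__u32)fd;
    attr.key = (__u64)(unsigned long)&key;
    attr.value = (__u64)(unsigned long)&value;
    attr.flags = BPF_ANY;
    return (int)sys_bpf(BPF_MAP_UPDATE_ELEM, &attr);
}

/* Copy map[key] into *value. Returns 0 if found, -1 otherwise. */
static int counter_map_lookup(int fd, __u32 key, __u64 *value)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = (__u32)fd;
    attr.key = (__u64)(unsigned long)&key;
    attr.value = (__u64)(unsigned long)value;
    return (int)sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr);
}
```

The map is identified by an ordinary file descriptor, so it disappears when the last descriptor referring to it is closed and no loaded program uses it any more.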

Direct programming method of eBPF

Besides using BPF from user space through netfilter and tcpdump, eBPF code can also be written directly in C, in the kernel or in other general-purpose programs, although this requires compiler support, for example from LLVM.

From user space, the BPF_PROG_LOAD command of the bpf system call sends eBPF code into the kernel. Code sent this way needs no conversion because it is already in eBPF format. A kernel module that wants to use eBPF can call the corresponding function interfaces directly to attach the eBPF program to sk_buff processing, providing powerful filtering.
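
The user-space path might look like the following sketch, which hands a two-instruction program to the kernel with BPF_PROG_LOAD. The names return_zero and load_socket_filter are hypothetical, the program type is picked only as an example, and on many systems the call fails with EPERM unless the caller is privileged.

```c
#define _GNU_SOURCE
#include <linux/bpf.h>     /* struct bpf_insn, union bpf_attr, BPF_PROG_LOAD */
#include <string.h>
#include <sys/syscall.h>   /* __NR_bpf */
#include <unistd.h>        /* syscall, close */

/* About the smallest valid program: r0 = 0; exit */
static const struct bpf_insn return_zero[] = {
    { .code = BPF_ALU64 | BPF_MOV | BPF_K, .dst_reg = BPF_REG_0, .imm = 0 },
    { .code = BPF_JMP | BPF_EXIT },
};

/* Ask the kernel to verify and load `cnt` instructions as a socket filter.
 * Returns a program fd on success, -1 on error (errno is set). */
static int load_socket_filter(const struct bpf_insn *insns, unsigned int cnt)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.prog_type = BPF_PROG_TYPE_SOCKET_FILTER;
    attr.insns = (__u64)(unsigned long)insns;
    attr.insn_cnt = cnt;
    attr.license = (__u64)(unsigned long)"GPL";
    return (int)syscall(__NR_bpf, BPF_PROG_LOAD, &attr, sizeof(attr));
}
```

On success the returned descriptor can be attached to a socket with the SO_ATTACH_BPF socket option, the eBPF counterpart of SO_ATTACH_FILTER.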

For kernel tracing

We have seen that eBPF combines the map data structure with the ability to execute programs, which makes it an ideal tracing framework. For example, a kprobe can insert an eBPF program into the I/O code to count I/O operations and report the values to user space through a map. The client only needs to read the map with the bpf system call whenever it wants the statistics. So why use eBPF rather than plain C code in a kprobe? The answer is eBPF's safety: it is designed so that a program can never crash the kernel or interfere with normal kernel logic. Choosing this tool therefore avoids many potential problems. Better still, eBPF natively supports tracepoints, which offers an alternative where kprobes are unstable.

The industry's use of eBPF

Brendan Gregg's blog describes an example of using eBPF with kprobes for tracing.
ktap makes creative use of the eBPF mechanism to script the kernel: with ktap you write scripts directly and can trace and instrument kernel code without compiling kernel modules. Behind it are eBPF and the kernel's tracing subsystem.
BPF subcommand for perf: Huawei is also adding BPF support to perf scripts.
As can be seen, eBPF originated in packet filtering but is now used more and more widely for tracing.

Meaning and summary

In summary, BPF code is written in user space using the traditional BPF syntax and registers; in the kernel it is converted into eBPF code and then compiled into a binary for execution. The traditional BPF syntax and registers are simple and application-oriented, rather like a high-level language, while the kernel's eBPF syntax and registers are more complex and closer to real assembly.
So why does the kernel go to the trouble of implementing such an engine? Because it is lightweight, safe, and portable. As intermediate code its portability goes without saying, but kernel modules that call kernel function interfaces are generally portable too, so this is not the most important reason. What matters more is that eBPF code is strictly restricted and checked before it runs: loops are prohibited and the code undergoes a security review, so an eBPF program is strictly a block of straight-line procedural code, not even a function, and a single program is limited to at most 4,096 instructions. That is its positioning: lightweight, safe, and loop-free.
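
To give a feel for the kind of static checking involved, here is a much simplified user-space sketch for classic sock_filter programs. It is not the kernel's verifier: filter_looks_sane and MAX_INSNS are hypothetical names, and only the size limit, jump targets and the final return are checked.

```c
#include <linux/filter.h>   /* struct sock_filter, BPF_CLASS, BPF_OP, BPF_* */
#include <stddef.h>

#define MAX_INSNS 4096      /* the instruction limit quoted in the text */

/* Returns 1 if the program passes these simplified checks, 0 otherwise.
 * Jump offsets are unsigned, so every jump goes forward and the program
 * cannot loop; it is enough to check that targets stay inside the program. */
static int filter_looks_sane(const struct sock_filter *prog, size_t n)
{
    if (n == 0 || n > MAX_INSNS)
        return 0;

    for (size_t pc = 0; pc < n; pc++) {
        if (BPF_CLASS(prog[pc].code) != BPF_JMP)
            continue;
        if (BPF_OP(prog[pc].code) == BPF_JA) {
            /* unconditional jump: the offset is stored in k */
            if (prog[pc].k >= n - pc - 1)
                return 0;
        } else if (pc + 1 + prog[pc].jt >= n || pc + 1 + prog[pc].jf >= n) {
            /* conditional jump: both branches must land inside the program */
            return 0;
        }
    }

    /* execution must not run off the end: the last instruction returns */
    return BPF_CLASS(prog[n - 1].code) == BPF_RET;
}
```

With forward-only jumps and a bounded length, such a program always terminates after at most MAX_INSNS steps, which is what makes it safe to run on every packet.
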
The above covers only a few of the uses of BPF; there are far more.

http://www.tcpdump.org/papers/bpf-usenix93.pdf
http://lwn.net/Articles/498231/
https://www.kernel.org/doc/Documentation/networking/filter.txt
