Linux Kernel Project Introduction-Network: Filter (LSF, BPF, eBPF)

Source: Internet
Author: User

Overview

LSF (Linux Socket Filter) originates from BPF (Berkeley Packet Filter). The two share the same underlying architecture, but LSF is much simpler to use. The BPF inside LSF was originally cBPF (classic BPF). Later, the x86 platform was the first to switch to eBPF (extended BPF). Because many user-level applications still use cBPF (tcpdump, iptables) and eBPF is not yet supported on many platforms, the kernel provides logic to convert cBPF to eBPF, and eBPF was designed to reuse much of the cBPF instruction encoding.

In the instruction set and registers, however, the two architectures differ greatly (for example, eBPF can call C functions and can jump to another eBPF program).
Soon after eBPF appeared, people discovered its value for kernel tracing: it can expose information about the running kernel while remaining safe, which makes it an ideal choice for kernel debugging and for developers. Excellent work in this area gradually followed, such as kprobe, ktap, and perf eBPF. The filtering side, by contrast, has not received as much attention.

TC (traffic control) is an excellent user-side program that uses eBPF; it allows new traffic control algorithms to be added and removed dynamically without recompiling a module. The Netfilter xtables framework has the xt_bpf module, which can attach an eBPF program to a hook point to implement filtering. The kernel also contains functions that compile cBPF to eBPF, so cBPF can be used in every case: the kernel detects it and converts it automatically.

Main Uses of BPF

The core mechanism is a pair of socket options offered to the user: SO_ATTACH_FILTER and SO_ATTACH_BPF. They allow the user to attach a custom filter to a socket, so that only packets matching the filter's criteria are delivered to user space.

Because there are many kinds of sockets, such a filter can be attached to sockets at every level. On a raw socket, the filter can match on all IP packets (this is how tcpdump works). To build an HTTP analysis tool, you can attach a filter to a socket bound to port 80 (or another HTTP listening port). There is also an offline mode: packets captured with libpcap are stored locally and can later be analyzed offline with BPF code, which is useful for experimenting with new rules and for testing BPF programs. SO_ATTACH_FILTER inserts cBPF code, and SO_ATTACH_BPF inserts eBPF code. eBPF is an enhancement of cBPF. The current user-space tcpdump still uses the cBPF version, which the kernel automatically converts to eBPF when it is loaded.

/proc/sys/net/core/bpf_jit_enable

Writing 0, 1, or 2 to this file turns the BPF JIT off, on, or on with debug logging, respectively.
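The setting can also be inspected from a program. The sketch below reads the file and maps the value to the three modes just described; the function names are hypothetical, and the file may be unreadable on some systems, in which case the reader returns -1.

```c
#include <stdio.h>

/* Map a bpf_jit_enable value to the modes described in the text. */
const char *bpf_jit_mode_name(int value)
{
    switch (value) {
    case 0:  return "disabled";
    case 1:  return "enabled";
    case 2:  return "enabled with debug output";
    default: return "unknown";
    }
}

/* Read the current setting; returns -1 if the file cannot be read. */
int read_bpf_jit_enable(void)
{
    int value = -1;
    FILE *f = fopen("/proc/sys/net/core/bpf_jit_enable", "r");

    if (!f)
        return -1;
    if (fscanf(f, "%d", &value) != 1)
        value = -1;
    fclose(f);
    return value;
}
```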

In user space, the simplest approach is to use the libpcap engine. Because BPF is an assembly-like language, it is hard to write by hand, so libpcap provides higher-level wrappers that can be called directly. libpcap does not cover every need, however, such as the testing needs of BPF module developers or advanced custom BPF scripts. In those cases you write the BPF code yourself, assemble it with the tools in the kernel's tools/net/ directory, and then pass the resulting code in through the socket interface.

The BPF engine is implemented in the kernel, but where a BPF program runs is determined by additional modules. It is often used with Netfilter xtables: xt_bpf runs BPF programs at Netfilter hook points, while cls_bpf and act_bpf classify and drop traffic (QoS).
Full kernel support for BPF starts with version 3.9. As part of iptables, the kernel side is xt_bpf and the user-space library is libxt_bpf. iptables evaluates its rules sequentially, which inevitably becomes a performance bottleneck when there are many match rules; adding BPF support greatly improves flexibility.

Every use of BPF mentioned above applies to both eBPF and cBPF, because the kernel checks before running a program whether its encoding needs to be converted.

Other BPF Programs

BPF programs are used for packet filtering, so can BPF code only be used for packet filtering? No. The kernel's BPF support is infrastructure: it is simply a way of expressing intermediate code, and it gives user space a common interface for injecting executable code into the kernel. It just happens that most applications today use this interface for packet filtering. Others, such as seccomp BPF, which restricts the system calls a user process may make, cls_bpf, which classifies traffic, and the PTP dissector/classifier (which I am not familiar with), use the kernel's eBPF language infrastructure for their own purposes, not necessarily for packet filtering.

User Space BPF Support

Tools: tcpdump, tools/net, Cloudflare, seccomp BPF, IO Visor, ktap

cBPF Assembly Architecture Analysis

Each cBPF assembly instruction has the following format:

    struct sock_filter {    /* Filter block */
        __u16   code;   /* Actual filter code */
        __u8    jt;     /* Jump true */
        __u8    jf;     /* Jump false */
        __u32   k;      /* Generic multiuse field */
    };

In short: op:16, jt:8, jf:8, k:32
op (the code field) is the actual instruction, jt is the jump taken when the instruction's result is true, jf is the jump taken when it is false, and k is the instruction's argument, whose meaning depends on the instruction.
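To make the four fields concrete, here is a small sketch of a cBPF program written out field by field as a sock_filter array. The program (accept only IPv4 frames by testing the EtherType at offset 12 of an Ethernet frame) and the names ipv4_only and ipv4_only_len are chosen here for illustration and are not from the article.

```c
#include <linux/filter.h>    /* struct sock_filter, BPF_* opcode bits */
#include <linux/if_ether.h>  /* ETH_P_IP */

/*
 * Each entry is { code, jt, jf, k }.
 * jt and jf are relative offsets counted from the next instruction.
 */
const struct sock_filter ipv4_only[] = {
    /* A = 16-bit value at byte offset 12 (the EtherType) */
    { BPF_LD  | BPF_H   | BPF_ABS, 0, 0, 12 },
    /* if (A == ETH_P_IP) continue (jt = 0), else skip one (jf = 1) */
    { BPF_JMP | BPF_JEQ | BPF_K,   0, 1, ETH_P_IP },
    /* accept: return the maximum length to keep */
    { BPF_RET | BPF_K,             0, 0, 0xffffffff },
    /* drop: return 0 */
    { BPF_RET | BPF_K,             0, 0, 0 },
};

const unsigned int ipv4_only_len = sizeof(ipv4_only) / sizeof(ipv4_only[0]);
```

The BPF_STMT and BPF_JUMP macros from the same header produce the same initializers in a more compact form.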

A BPF program is compiled into an array of sock_filter structures. It can be written in an assembly-like syntax and then assembled with the bpf_asm program provided with the kernel.
BPF is in fact a virtual machine inside the kernel with its own set of virtual registers, in principle the same as the Java virtual machine we are familiar with.

The design of this virtual machine is where LSF's success lies. cBPF has three kinds of registers:

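  A           32 bits wide; the destination of all load instructions and the place where the results of all arithmetic instructions are stored
  X           32 bits wide; auxiliary register for binary operations on A (for example the shift count or the divisor)
  M[]         16 32-bit registers, numbered 0-15, that can be used freely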
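The role of the three register kinds can be illustrated with a toy user-space interpreter. This is only a sketch covering a handful of opcodes, not the kernel's implementation; cbpf_run and ip_version_prog are hypothetical names, and anything the sketch does not understand is treated as "drop".

```c
#include <linux/filter.h>   /* struct sock_filter, BPF_* bits, BPF_MEMWORDS */
#include <stddef.h>
#include <stdint.h>

/* Toy interpreter: returns the filter result, or 0 on any error. */
uint32_t cbpf_run(const struct sock_filter *prog, size_t len,
                  const uint8_t *pkt, size_t pkt_len)
{
    uint32_t A = 0, X = 0;            /* accumulator and auxiliary register */
    uint32_t M[BPF_MEMWORDS] = {0};   /* 16 scratch words, M[0]..M[15] */

    for (size_t pc = 0; pc < len; pc++) {
        const struct sock_filter *f = &prog[pc];

        switch (f->code) {
        case BPF_LD | BPF_B | BPF_ABS:          /* A = pkt[k] */
            if (f->k >= pkt_len)
                return 0;
            A = pkt[f->k];
            break;
        case BPF_LD | BPF_H | BPF_ABS:          /* A = big-endian u16 at k */
            if (pkt_len < 2 || f->k > pkt_len - 2)
                return 0;
            A = ((uint32_t)pkt[f->k] << 8) | pkt[f->k + 1];
            break;
        case BPF_LDX | BPF_IMM:                 /* X = k */
            X = f->k;
            break;
        case BPF_ST:                            /* M[k] = A */
            if (f->k >= BPF_MEMWORDS)
                return 0;
            M[f->k] = A;
            break;
        case BPF_LD | BPF_MEM:                  /* A = M[k] */
            if (f->k >= BPF_MEMWORDS)
                return 0;
            A = M[f->k];
            break;
        case BPF_ALU | BPF_RSH | BPF_X:         /* A >>= X */
            A = (X < 32) ? (A >> X) : 0;
            break;
        case BPF_ALU | BPF_AND | BPF_K:         /* A &= k */
            A &= f->k;
            break;
        case BPF_JMP | BPF_JEQ | BPF_K:         /* relative forward jump */
            pc += (A == f->k) ? f->jt : f->jf;
            break;
        case BPF_RET | BPF_K:
            return f->k;
        case BPF_RET | BPF_A:
            return A;
        default:                                /* unsupported opcode */
            return 0;
        }
    }
    return 0;   /* ran off the end without a return instruction */
}

/* Example: return the high nibble of byte 0, using A, X and M[3]. */
const struct sock_filter ip_version_prog[] = {
    BPF_STMT(BPF_LD | BPF_B | BPF_ABS, 0),      /* A = pkt[0]   */
    BPF_STMT(BPF_LDX | BPF_IMM, 4),             /* X = 4        */
    BPF_STMT(BPF_ALU | BPF_RSH | BPF_X, 0),     /* A = A >> X   */
    BPF_STMT(BPF_ST, 3),                        /* M[3] = A     */
    BPF_STMT(BPF_LD | BPF_MEM, 3),              /* A = M[3]     */
    BPF_STMT(BPF_RET | BPF_A, 0),               /* return A     */
};
const size_t ip_version_len =
    sizeof(ip_version_prog) / sizeof(ip_version_prog[0]);
```

Every load lands in A, X only supplies the second operand, and M[] is plain scratch storage, which matches the division of labor in the table above.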

The most common use is to load a word from the packet data and make a decision based on it. BPF lets us address a location in the packet by offset, and many protocol fields are commonly used and sit at fixed positions, such as the port and the IP address. BPF also provides a number of predefined variables; using one of them yields the corresponding value directly. For example:

  len           skb->len
  proto         skb->protocol
  type          skb->pkt_type
  poff          Payload start offset
  ifidx         skb->dev->ifindex
  nla           Netlink attribute of type X with offset A
  nlan          Nested Netlink attribute of type X with offset A
  mark          skb->mark
  queue         skb->queue_mapping
  hatype        skb->dev->type
  rxhash        skb->hash
  cpu           raw_smp_processor_id()
  vlan_tci      skb_vlan_tag_get(skb)
  vlan_avail    skb_vlan_tag_present(skb)
  vlan_tpid     skb->vlan_proto
  rand          prandom_u32()

More valuable still, this list can be extended by users themselves, and the various BPF engine implementations each define their own extensions.

eBPF Assembly Architecture Analysis

Because users may submit cBPF code, the kernel first compiles the struct array submitted by the user into eBPF code (code that is already eBPF needs no conversion) and then turns the eBPF code into a binary that can run directly. cBPF is still used as is on many platforms, where the kernel runs the same code as user space. On the x86 architecture, however, the kernel has switched to eBPF as its intermediate language, so on x86 the BPF code used in user space is not the same as the code used in kernel space.

In defining eBPF, the kernel tried to reuse the cBPF encoding, and some instructions, such as BPF_LD, have the same encoding and meaning in both. On platforms that do not yet support eBPF, cBPF is the code that runs directly, with no conversion to eBPF.


eBPF represents each BPF instruction slightly differently from cBPF, as the following definition shows:

    struct bpf_insn {
        __u8    code;       /* opcode */
        __u8    dst_reg:4;  /* dest register */
        __u8    src_reg:4;  /* source register */
        __s16   off;        /* signed offset */
        __s32   imm;        /* signed immediate constant */
    };

The registers are also different:

  * R0      - return value from in-kernel function, and exit value for eBPF program
  * R1 - R5 - arguments from eBPF program to in-kernel function
  * R6 - R9 - callee saved registers that in-kernel function will preserve
  * R10     - read-only frame pointer to access stack
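As an illustration of the instruction layout and the register roles, the following sketch hand-encodes a three-instruction eBPF program with the constants from linux/bpf.h. The program itself and the name save_ctx_return_zero are made up for this example.

```c
#include <linux/bpf.h>   /* struct bpf_insn, BPF_REG_*, BPF_ALU64, BPF_MOV, ... */

/*
 * r6 = r1   (keep the context argument in a callee-saved register)
 * r0 = 0    (set the return value)
 * exit      (return r0 to the caller)
 */
const struct bpf_insn save_ctx_return_zero[] = {
    { .code = BPF_ALU64 | BPF_MOV | BPF_X,
      .dst_reg = BPF_REG_6, .src_reg = BPF_REG_1, .off = 0, .imm = 0 },
    { .code = BPF_ALU64 | BPF_MOV | BPF_K,
      .dst_reg = BPF_REG_0, .src_reg = 0, .off = 0, .imm = 0 },
    { .code = BPF_JMP | BPF_EXIT,
      .dst_reg = 0, .src_reg = 0, .off = 0, .imm = 0 },
};

const unsigned int save_ctx_return_zero_len =
    sizeof(save_ctx_return_zero) / sizeof(save_ctx_return_zero[0]);
```

Each instruction occupies 8 bytes, the same size as a cBPF sock_filter, but the jt/jf pair is replaced by two 4-bit register numbers and a signed 16-bit offset.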

To support more powerful features, the eBPF assembly architecture has more registers, as listed above, and they fully embody the concept of a function call, rather than the original load-and-process logic. With function calls, a program can directly invoke functions inside the kernel (this is a security risk, but the kernel has internal safeguards against it). In addition, because the register architecture closely resembles the real register architecture of CPUs such as x86, the implementation maps the virtual registers directly onto real registers, so a virtual register is in effect a hardware register. This is a great improvement in efficiency. On a 64-bit machine the registers are 64 bits wide, which makes full use of the hardware. 64-bit support is not yet very mature, but it is already usable.


In the current kernel implementation, an eBPF program can call only predefined kernel functions and cannot call other eBPF programs (it can, however, jump to another eBPF program through map support; this is described later). This may not seem significant, but it is a powerful capability: the eBPF program logic can be implemented in C, and the eBPF program only needs to call that C function.

eBPF Data Interaction: Map

eBPF is more than just a program: it can also access external data, and, importantly, this external data can be managed from user space. These maps, which store data in key-value form, are created, updated, and deleted from user space through the bpf system call.


Users can define several maps at the same time, and each map is accessed through a file descriptor (FD).

One special kind of map is the program array. It stores the FDs of other eBPF programs, and through it one eBPF program can jump to another. A jump does not return, and the maximum depth is 32, which prevents infinite loops (the mechanism can thus be used to implement a bounded loop). More importantly, this map can be modified at run time through the bpf system call, which provides a powerful dynamic programming capability, for example changing one stage of a large processing function while it is in use. There are three common map types:

    BPF_MAP_TYPE_HASH,        // hash type
    BPF_MAP_TYPE_ARRAY,       // array type
    BPF_MAP_TYPE_PROG_ARRAY,  // program array type

Direct Programming of eBPF

Besides using BPF from user space through Netfilter xtables and tcpdump, eBPF code can be written directly in C, in the kernel or in ordinary programs. This requires LLVM support.



Calling the bpf system call from user space with the BPF_PROG_LOAD command sends eBPF code into the kernel. Code sent this way needs no conversion because it is already in eBPF format. A kernel-space module that uses eBPF can use the corresponding function interfaces to apply the eBPF program directly to sk_buff data, which provides powerful filtering.


Linux provides the bpf system call for operating on the eBPF-related parts of the kernel:

    #include <linux/bpf.h>

    int bpf(int cmd, union bpf_attr *attr, unsigned int size);

See the bpf man page.
The first parameter, cmd, is the type of operation, as supported by the kernel. There are six: BPF_MAP_CREATE, BPF_MAP_LOOKUP_ELEM, BPF_MAP_UPDATE_ELEM, BPF_MAP_DELETE_ELEM, BPF_MAP_GET_NEXT_KEY, and BPF_PROG_LOAD. As the names show, five of them operate on maps.

As mentioned earlier, maps are the only way for a user program to communicate with an eBPF program in the kernel. These five commands are used by user-space programs. The last command, BPF_PROG_LOAD, loads eBPF code into the kernel.
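A minimal sketch of three of the map commands, called through the raw system call, might look like the following. The wrapper names (sys_bpf, map_create_hash, map_update, map_lookup) are hypothetical, and on many systems these calls fail without sufficient privileges, so every return value should be checked.

```c
#include <linux/bpf.h>     /* union bpf_attr, BPF_MAP_*, BPF_ANY */
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>   /* __NR_bpf */
#include <unistd.h>        /* syscall */

/* Thin wrapper: the example issues the system call directly. */
static long sys_bpf(int cmd, union bpf_attr *attr, unsigned int size)
{
    return syscall(__NR_bpf, cmd, attr, size);
}

/* BPF_MAP_CREATE: returns a map FD, or -1 with errno set. */
int map_create_hash(uint32_t key_size, uint32_t value_size,
                    uint32_t max_entries)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));    /* unused fields must be zero */
    attr.map_type = BPF_MAP_TYPE_HASH;
    attr.key_size = key_size;
    attr.value_size = value_size;
    attr.max_entries = max_entries;
    return (int)sys_bpf(BPF_MAP_CREATE, &attr, sizeof(attr));
}

/* BPF_MAP_UPDATE_ELEM: create or overwrite the entry for *key. */
int map_update(int map_fd, const void *key, const void *value)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = (uint32_t)map_fd;
    attr.key = (uint64_t)(uintptr_t)key;      /* pointers travel as u64 */
    attr.value = (uint64_t)(uintptr_t)value;
    attr.flags = BPF_ANY;
    return (int)sys_bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr));
}

/* BPF_MAP_LOOKUP_ELEM: copy the value for *key into *value_out. */
int map_lookup(int map_fd, const void *key, void *value_out)
{
    union bpf_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.map_fd = (uint32_t)map_fd;
    attr.key = (uint64_t)(uintptr_t)key;
    attr.value = (uint64_t)(uintptr_t)value_out;
    return (int)sys_bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr));
}
```

A caller would create the map once, keep the returned FD, and then poll it with map_lookup while an eBPF program in the kernel updates the same entries.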


The second parameter, attr, carries the detailed arguments for cmd and differs from command to command; for a load, it also contains the complete eBPF program.
Note that every map and every eBPF program is a file with a corresponding FD. In user space this FD looks like any other FD: it can be released, and it can be passed between processes over a UNIX domain socket. If you create a raw socket, you can attach the eBPF program to it as a filter, and the program can even be used directly as an iptables rule.

BPF-Related Kernel Modules and Subsystems

act_bpf
cls_bpf
IO Visor: probably the largest system built on eBPF, with a number of vendors participating.
xtables, xt_bpf

BPF for Kernel tracing

We know that eBPF has the map data structure and the ability to run programs, which makes it an ideal tracing framework. For example, an eBPF program can be attached through kprobe to I/O code to count I/O operations, with the values reported to user space through a map. The client only needs to read the map with the bpf system call from time to time to obtain the statistics. Why use eBPF rather than kprobe's own C code? Because of eBPF's safety: its mechanism is designed so that it cannot crash the kernel and does not interfere with normal kernel logic.

In other words, choosing this tool avoids many problems that could otherwise occur.

More valuable still, eBPF natively supports tracepoints, which offers an alternative where kprobe is unstable.

Industry Use of eBPF for Tracing

Brendan Gregg's blog describes an example of a kprobe test using eBPF.
ktap creatively uses the eBPF mechanism to script what would otherwise be kernel modules: with ktap, kernel code can be traced and instrumented directly from a script, without compiling a kernel module. Behind it are eBPF and the kernel's tracing subsystem.
BPF subcommand for perf: Huawei is also adding BPF support to perf.
As can be seen, eBPF originated in packet filtering but is now used more and more widely in tracing.

Meaning and summary

In summary, BPF code is written in user space using the traditional BPF syntax and registers. In the kernel, that code is compiled into eBPF code and then into a binary that runs. The traditional BPF syntax and registers are simple and more task-oriented, similar to a high-level programming language, while the kernel's eBPF syntax and registers are complex, similar to real assembly code.


So why does the kernel have to go through the trouble of implementing such an engine? Because it's lightweight, secure, and portable.

Because it is intermediate code, portability goes without saying; but kernel modules that call kernel function interfaces are generally portable too, so this is not the most important reason. What matters more is that eBPF code is strictly restricted when it runs: loops are prohibited and the code undergoes a security review, so an eBPF program is strictly limited to a block of statements that runs to completion. It is not even a function, and it may contain at most 4,096 instructions. That is its positioning: lightweight, safe, and loop-free.
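The two restrictions named above (a bounded instruction count and no loops) can be approximated with a very small check. The sketch below is a toy, not the kernel's verifier, which inspects far more than this; toy_check and TOY_MAX_INSNS are hypothetical names, and the check simply rejects any jump that goes backward or leaves the program.

```c
#include <linux/bpf.h>   /* struct bpf_insn, BPF_CLASS, BPF_OP, BPF_JMP, ... */
#include <stdbool.h>
#include <stddef.h>

#define TOY_MAX_INSNS 4096   /* the limit quoted in the text */

/* True if the program is short enough and every jump goes forward. */
bool toy_check(const struct bpf_insn *insns, size_t cnt)
{
    if (cnt == 0 || cnt > TOY_MAX_INSNS)
        return false;

    for (size_t i = 0; i < cnt; i++) {
        unsigned int cls = BPF_CLASS(insns[i].code);
        unsigned int op = BPF_OP(insns[i].code);

        /* Only jump instructions can create a loop; call and exit
         * do not use the offset field as a branch target. */
        if (cls != BPF_JMP || op == BPF_CALL || op == BPF_EXIT)
            continue;

        /* The target is relative to the instruction after the jump. */
        long target = (long)i + 1 + insns[i].off;

        if (insns[i].off < 0 || target >= (long)cnt)
            return false;
    }
    return true;
}
```

With only forward jumps, each instruction runs at most once, so the instruction limit also bounds the running time.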
The above covers a few uses of BPF, but there are far more.

http://www.tcpdump.org/papers/bpf-usenix93.pdf
http://lwn.net/Articles/498231/
https://www.kernel.org/doc/Documentation/networking/filter.txt
