A bit about history:

  • 1992, BPF bytecode injected from userspace into the kernel. The BPF VM and its bytecode was introduced by Steve McCane and Van Jacobson in their paper The BSD Packet Filter: A New Architecture for User-level Packet Capture, and it was presented for the first time at Usenix Conference Winter ‘93.
  • The Linux kernel features BPF support since v2.5, mainly added by Jay Schullist.
  • 2011, Eric Dumazet turned the BPF interpreter into a JIT (A JIT for packet filters). Instead of interpreting BPF bytecode, now the kernel was able to translate BPF programs directly to a target architecture: x86, ARM, MIPS, etc.
  • 2013, eBPF with new features such as maps and tail calls, old BPF became cBPF
  • 2014, Alexei Starovoitov introduced a new eBPF JIT.

Generally it has two fileds of application:

  1. kernel tracing and event monitoring
  2. network programming

XDP

We need better packet processing performance when working at speeds of 10Gbps or higher. Before BPF, if you wanted to do packet filtering you had to copy all the packets into userspace and then filter them there (with “tap”).

This had 2 problems:

  1. if you filter in userspace, it means you have to copy all the packets into userspace, copying data is expensive. This is called kernel bypass: User-space networking achieves high-speed performance by moving packet-processing out of the kernel’s realm into user-space.
  2. the filtering algorithms people were using were inefficient

The solution to problem #1 seems sort of obvious, move the filtering logic into the kernel somehow. (though the details of how that’s done isn’t obvious) This is called XDP which moves user-space networking programs (filters, mappers, routing, etc) into the kernel’s realm.

The Filter Model

Most applications of a packet capture facility reject far more packets than they accept.

The CSPF(Tree) Model

If you run tcpdump host foo it actually runs a relatively complicated query, which you could represent with this tree:

The CFG model

Evaluating this tree is kind of expensive. so the first insight is that you can actually represent this tree in a simpler way, like this:

These two model are computationally equivalent. However, in implementation they are very different: The tree model maps naturally into code for a stack machine while the CFG model maps naturally into code for a register machine. Since most modern machines are register based, we will argue that the CFG approach lends itself to a more efficient implementation. In the expression tree model, shown above, seven comparison predicates and six boolean operations are required to traverse the entire tree. The longest path through the CFG has five comparison operations, and the average number of comparisons is three.

BPF uses the CFG filter model.

After all, how does the kernel make possible for an user to execute their programs within the kernel’s realm?

With BPF!

BPF

BPF(Berkelet Packet Filtering) has a misleading name but in fact it’s a virtual machine model, which was originally designed for packet filtering processings.

The evolution of eBPF

The original Berkeley Packet Filter (BPF) was designed for capturing and filtering network packets that matched specific rules. Filters are implemented as programs to be run on a register-based virtual machine.

The ability to run user-supplied programs inside of the kernel proved to be a useful design decision but other aspects of the original BPF design didn’t hold up so well. For one, the design of the virtual machine and its instruction set architecture (ISA) were left behind as modern processors moved to 64-bit registers and invented new instructions required for multiprocessor systems, like the atomic exchange-and-add instruction (XADD). BPF’s focus on providing a small number of RISC instructions no longer matched the realities of modern processors.

So, Alexei Starovoitov introduced the extended BPF (eBPF) design to take advantage of advances in modern hardware. The eBPF virtual machine more closely resembles contemporary processors, allowing eBPF instructions to be mapped more closely to the hardware ISA for improved performance. One of the most notable changes was a move to 64-bit registers and an increase in the number of registers from two to ten. Since modern architectures have far more than two registers, this allows parameters to be passed to functions in eBPF virtual machine registers, just like on native hardware. Plus, a new BPF_CALL instruction made it possible to call in-kernel functions cheaply.

The ease of mapping eBPF to native instructions lends itself to just-in-time compilation, yielding improved performance. The original patch that added support for eBPF in the 3.15 kernel showed that eBPF was up to four times faster on x86-64 than the old classic BPF (cBPF) implementation for some network filter microbenchmarks, and most were 1.5 times faster. Many architectures support the just-in-time (JIT) compiler (x86-64, SPARC, PowerPC, ARM, arm64, MIPS, and s390).

Originally, eBPF was only used internally by the kernel and cBPF programs were translated seamlessly under the hood. But with commit daedfb22451d in 2014, the eBPF virtual machine was exposed directly to user space.

For example, tcpdump uses bpf.

1
sudo tcpdump -d "tcp dst port 80"
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
(000) ldh      [12]
(001) jeq      #0x86dd          jt 2    jf 6
(002) ldb      [20]
(003) jeq      #0x6             jt 4    jf 15
(004) ldh      [56]
(005) jeq      #0x50            jt 14   jf 15
(006) jeq      #0x800           jt 7    jf 15
(007) ldb      [23]
(008) jeq      #0x6             jt 9    jf 15
(009) ldh      [20]
(010) jset     #0x1fff          jt 15   jf 11
(011) ldxb     4*([14]&0xf)
(012) ldh      [x + 16]
(013) jeq      #0x50            jt 14   jf 15
(014) ret      #262144
(015) ret      #0
  • Instruction (000): loads the packet’s offset 12, as a 16-bit word, into the accumulator. Offset 12 represents a packet’s ethertype.
  • Instruction (001): compares the value of the accumulator to 0x86dd, which is the ethertype value for IPv6. If the result is true, the program counter jumps to instruction (002), if not it jumps to (006).
  • Instruction (006): compares the value to 0x800 (ethertype value of IPv4). If true jump to (007), if not (015). And so forth, until the packet-filtering program returns a result. This result is generally a boolean. Returning a non-zero value (instruction (014)) means the packet matched, whereas returning a zero value (instruction (015)) means the packet didn’t match.

BPF as a VM. It defines an environment where programs are executed. Besides a bytecode, it also defines a packet-based memory model (load instructions are implicitly done on the processing packet), registers (A and X; Accumulator and Index register), a scratch memory store and an implicit Program Counter.

eBPF extends the cBPF VM in several ways:

  • Similar architecture to x86-64. eBPF uses 64-bit registers and increases the number of available registers from 2 (Accumulator and X register) to 10. eBPF also extends the number of opcodes.
  • Decoupled from the networking subsystem. BPF was bounded to a packet-based data model. Since it was used for packet filtering, its code lived within the networking subsystem. However, the eBPF VM is no longer bounded to a data model and it can be used for any purpose. It’s possible to attach now an eBPF program to a tracepoint or to a kprobe. This opens up the door of eBPF to instrumentation, performance analysis and many more uses within other kernel subsystems. The eBPF code lives now at its own path: kernel/bpf.
  • Global data stores called Maps. Maps are a generic data structure that store different types of data in the form of key-value pairs. They allow sharing of data between eBPF kernel programs, and between kernel and user-space applications.
  • Helper functions. Such as packet rewrite, checksum calculation or packet cloning. Unlike user-space programming, these functions get executed inside the kernel. In addition, it’s possible to execute system calls from eBPF programs.
  • Tail-calls. eBPF programs are limited to 4096 bytes. The tail-call feature allows a eBPF program to pass control a new eBPF program, overcoming this limitation.

The eBPF in-kernel verifier

There are inherent security and stability risks with allowing user-space code to run inside the kernel. So, a number of checks are performed on every eBPF program before it is loaded. The first test ensures that the eBPF program terminates and does not contain any loops that could cause the kernel to lock up. This is checked by doing a depth-first search of the program’s control flow graph (CFG). Unreachable instructions are strictly prohibited; any program that contains unreachable instructions will fail to load.

The second stage is more involved and requires the verifier to simulate the execution of the eBPF program one instruction at a time. The virtual machine state is checked before and after the execution of every instruction to ensure that register and stack state are valid. Out of bounds jumps are prohibited, as is accessing out-of-range data.

The verifier doesn’t need to walk every path in the program, it’s smart enough to know when the current state of the program is a subset of one it’s already checked. Since all previous paths must be valid (otherwise the program would already have failed to load), the current path must also be valid. This allows the verifier to “prune” the current branch and skip its simulation.

The verifier also has a “secure mode” that prohibits pointer arithmetic. Secure mode is enabled whenever a user without the CAP_SYS_ADMIN privilege loads an eBPF program. The idea is to make sure that kernel addresses do not leak to unprivileged users and that pointers cannot be written to memory. If secure mode is not enabled, then pointer arithmetic is allowed but only after additional checks are performed. For example, all pointer accesses are checked for type, alignment, and bounds violations.

Registers with uninitialized contents (those that have never been written to) cannot be read; doing so cause the program load to fail. The contents of registers R0-R5 are marked as unreadable across functions calls by storing a special value to catch any reads of an uninitialized register. Similar checks are done for reading variables on the stack and to make sure that no instructions write to the read-only frame-pointer register.

Lastly, the verifier uses the eBPF program type (covered later) to restrict which kernel functions can be called from eBPF programs and which data structures can be accessed. Some program types are allowed to directly access network packet data, for example.

The bpf() system call

Programs are loaded using the bpf() system call with the BPF_PROG_LOAD command. The prototype of the system call is:

1
int bpf(int cmd, union bpf_attr *attr, unsigned int size);

The bpf_attr union allows data to be passed between the kernel and user space; the exact format depends on the cmd argument. The size argument gives the size of the bpf_attr union object in bytes.

Commands are available for creating and modifying eBPF maps; maps are the generic key/value data structure used for communicating between eBPF programs and the kernel or user space. Additional commands allow attaching eBPF programs to a control-group directory or socket file descriptor, iterating over all maps and programs, and pinning eBPF objects to files so that they’re not destroyed when the process that loaded them terminates (the latter is used by the tc classifier/action code so that eBPF programs persist without requiring the loading process to stay alive). The full list of commands can be found in the bpf() man page.

Though there appear to be many different commands, they can be broken down into three categories: commands for working with eBPF programs, working with eBPF maps, or commands for working with both programs and maps (collectively known as objects)

eBPF program types

The type of program loaded with BPF_PROG_LOAD dictates four things: where the program can be attached, which in-kernel helper functions the verifier will allow to be called, whether network packet data can be accessed directly, and the type of object passed as the first argument to the program. In fact, the program type essentially defines an API. New program types have even been created purely to distinguish between different lists of allowed callable functions (BPF_PROG_TYPE_CGROUP_SKB versus BPF_PROG_TYPE_SOCKET_FILTER, for example).

The current set of eBPF program types supported by the kernel is:

  • BPF_PROG_TYPE_SOCKET_FILTER: a network packet filter
  • BPF_PROG_TYPE_KPROBE: determine whether a kprobe should fire or not
  • BPF_PROG_TYPE_SCHED_CLS: a network traffic-control classifier
  • BPF_PROG_TYPE_SCHED_ACT: a network traffic-control action
  • BPF_PROG_TYPE_TRACEPOINT: determine whether a tracepoint should fire or not
  • BPF_PROG_TYPE_XDP: a network packet filter run from the device-driver receive path
  • BPF_PROG_TYPE_PERF_EVENT: determine whether a perf event handler should fire or not
  • BPF_PROG_TYPE_CGROUP_SKB: a network packet filter for control groups
  • BPF_PROG_TYPE_CGROUP_SOCK: a network packet filter for control groups that is allowed to modify socket options
  • BPF_PROG_TYPE_LWT_*: a network packet filter for lightweight tunnels
  • BPF_PROG_TYPE_SOCK_OPS: a program for setting socket parameters
  • BPF_PROG_TYPE_SK_SKB: a network packet filter for forwarding packets between sockets
  • BPF_PROG_CGROUP_DEVICE: determine if a device operation should be permitted or not As new program types were added, kernel developers discovered a need to add new data structures too.

eBPF data structures

The main data structure used by eBPF programs is the eBPF map, a generic data structure that allows data to be passed back and forth within the kernel or between the kernel and user space. As the name “map” implies, data is stored and retrieved using a key.

Maps are created and manipulated using the bpf() system call. When a map is successfully created, a file descriptor associated with that map is returned. Maps are normally destroyed by closing the associated file descriptor. Each map is defined by four values: a type, a maximum number of elements, a value size in bytes, and a key size in bytes. There are different map types and each provides a different behavior and set of tradeoffs:

  • BPF_MAP_TYPE_HASH: a hash table
  • BPF_MAP_TYPE_ARRAY: an array map, optimized for fast lookup speeds, often used for counters
  • BPF_MAP_TYPE_PROG_ARRAY: an array of file descriptors corresponding to eBPF programs; used to implement jump tables and sub-programs to handle specific packet protocols
  • BPF_MAP_TYPE_PERCPU_ARRAY: a per-CPU array, used to implement histograms of latency
  • BPF_MAP_TYPE_PERF_EVENT_ARRAY: stores pointers to struct perf_event, used to read and store perf event counters
  • BPF_MAP_TYPE_CGROUP_ARRAY: stores pointers to control groups
  • BPF_MAP_TYPE_PERCPU_HASH: a per-CPU hash table
  • BPF_MAP_TYPE_LRU_HASH: a hash table that only retains the most recently used items
  • BPF_MAP_TYPE_LRU_PERCPU_HASH: a per-CPU hash table that only retains the most recently used items
  • BPF_MAP_TYPE_LPM_TRIE: a longest-prefix match trie, good for matching IP addresses to a range
  • BPF_MAP_TYPE_STACK_TRACE: stores stack traces
  • BPF_MAP_TYPE_ARRAY_OF_MAPS: a map-in-map data structure
  • BPF_MAP_TYPE_HASH_OF_MAPS: a map-in-map data structure
  • BPF_MAP_TYPE_DEVICE_MAP: for storing and looking up network device references
  • BPF_MAP_TYPE_SOCKET_MAP: stores and looks up sockets and allows socket redirection with BPF helper functions All maps can be accessed from eBPF or user-space programs using the bpf_map_lookup_elem() and bpf_map_update_elem() functions. Some map types, such as socket maps, work with additional eBPF helper functions that perform special tasks.

How to write an eBPF program

Historically, it was necessary to write eBPF assembly by hand and use the kernel’s bpf_asm assembler to generate BPF bytecode. Fortunately, the LLVM Clang compiler has grown support for an eBPF backend that compiles C into bytecode. Object files containing this bytecode can then be directly loaded with the bpf() system call and BPF_PROG_LOAD command.

You can write your own eBPF program in C by compiling with Clang using the -march=bpf parameter. There are plenty of eBPF program examples in the kernel’s samples/bpf/ directory; the majority have a “_kern.c” suffix in their file name. The object file (eBPF bytecode) emitted by Clang needs to be loaded by a program that runs natively on your machine (these samples usually have “_user.c” in their filename). To make it easier to write eBPF programs, the kernel provides the libbpf library, which includes helper functions for loading programs and creating and manipulating eBPF objects. For example, the high-level flow of an eBPF program and user program using libbpf might go something like:

  • Read the eBPF bytecode into a buffer in your user application and pass it to bpf_load_program().
  • The eBPF program, when run by the kernel, will call bpf_map_lookup_elem() to find an element in a map and store a new value in it.
  • The user application calls bpf_map_lookup_elem() to read out the value stored by the eBPF program in the kernel.

However, all of the sample code suffers from one major drawback: you need to compile your eBPF program from within the kernel source tree. Luckily, the BCC project was created to solve this problem. It includes a complete toolchain for writing eBPF programs and loading them without linking against the kernel source tree.

An eBPF example

The Linux kernel sources include several eBPF examples. They’re available at samples/bpf/. To compile these examples simply type:

$ sudo make samples/bpf/

Generally, all the examples at samples/bpf/ consist of 2 files, for example, tracex4_kern.c(the eBPF program) and tracex4_user.c(the main program). The eBPF program defines Maps and functions hooked to a binary section. When the kernel emits a certain type of event (a tracepoint, for instance) our hooks will be executed. Maps are used to exchange data between the kernel program and the user-space program.We need to compile the tracex4_kern.c into eBPF bytecode. The Makefile uses clang to compile teacex4_kern.c into a object file.

1
2
3
4
5
6
7
8
9
10
11
12
13
$(obj)/%.o: $(src)/%.c
	@echo "  CLANG-bpf " $@
	$(Q)$(CLANG) $(NOSTDINC_FLAGS) $(LINUXINCLUDE) $(EXTRA_CFLAGS) -I$(obj) \
		-I$(srctree)/tools/testing/selftests/bpf/ \
		-D__KERNEL__ -D__BPF_TRACING__ -Wno-unused-value -Wno-pointer-sign \
		-D__TARGET_ARCH_$(ARCH) -Wno-compare-distinct-pointer-types \
		-Wno-gnu-variable-sized-type-not-at-end \
		-Wno-address-of-packed-member -Wno-tautological-compare \
		-Wno-unknown-warning-option $(CLANG_ARCH_ARGS) \
		-O2 -emit-llvm -c $< -o -| $(LLC) -march=bpf $(LLC_FLAGS) -filetype=obj -o $@
ifeq ($(DWARF2BTF),y)
	$(BTF_PAHOLE) -J $@
endif

tracex4_kern defines a map.

1
2
3
4
5
6
7
8
9
10
11
struct pair {
	u64 val;
	u64 ip;
};

struct bpf_map_def SEC("maps") my_map = {
	.type = BPF_MAP_TYPE_HASH,
	.key_size = sizeof(long),
	.value_size = sizeof(struct pair),
	.max_entries = 1000000,
};

BPF_MAP_TYPE_HASH is one of the many Maps types offered by eBPF. In this case, it’s simply a hash. You may also have noticed the SEC(“maps”) declaration. SEC is a macro used to create a new section in the binary. The tracex4_kern example defines two more sections:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
SEC("kprobe/kmem_cache_free")
int bpf_prog1(struct pt_regs *ctx)
{
	long ptr = PT_REGS_PARM2(ctx);

	bpf_map_delete_elem(&my_map, &ptr);
	return 0;
}

SEC("kretprobe/kmem_cache_alloc_node")
int bpf_prog2(struct pt_regs *ctx)
{
	long ptr = PT_REGS_RC(ctx);
	long ip = 0;

	/* get ip address of kmem_cache_alloc_node() caller */
	BPF_KRETPROBE_READ_RET_IP(ip, ctx);

	struct pair v = {
		.val = bpf_ktime_get_ns(),
		.ip = ip,
	};

	bpf_map_update_elem(&my_map, &ptr, &v, BPF_ANY);
	return 0;
}

These two functions will allow us to delete an entry from a map (kprobe/kmem_cache_free) and to add a new entry to a map (kretprobe/kmem_cache_alloc_node). All the function calls in capital letters are actually macros defined at bpf_helpers.h.

If I dump the sections of the object file, I should be able to see these new sections defined:

$ objdump -h tracex4_kern.o

tracex4_kern.o:     file format elf64-little

Sections:
Idx Name          Size      VMA               LMA               File off  Algn
  0 .text         00000000  0000000000000000  0000000000000000  00000040  2**2
                  CONTENTS, ALLOC, LOAD, READONLY, CODE
  1 kprobe/kmem_cache_free 00000048  0000000000000000  0000000000000000  00000040  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  2 kretprobe/kmem_cache_alloc_node 000000c0  0000000000000000  0000000000000000  00000088  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, CODE
  3 maps          0000001c  0000000000000000  0000000000000000  00000148  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  4 license       00000004  0000000000000000  0000000000000000  00000164  2**0
                  CONTENTS, ALLOC, LOAD, DATA
  5 version       00000004  0000000000000000  0000000000000000  00000168  2**2
                  CONTENTS, ALLOC, LOAD, DATA
  6 .eh_frame     00000050  0000000000000000  0000000000000000  00000170  2**3
                  CONTENTS, ALLOC, LOAD, RELOC, READONLY, DATAs

Then there is tracex4_user.c, the main program. Basically what the program does is to listen to kmem_cache_alloc_node events. When that event happens, the corresponding eBPF code is executed. The code stores the IP attribute of an object into a map, which is printed in loop in the main program. Example:

$ sudo ./tracex4
obj 0xffff8d6430f60a00 is  2sec old was allocated at ip ffffffff9891ad90
obj 0xffff8d6062ca5e00 is 23sec old was allocated at ip ffffffff98090e8f
obj 0xffff8d5f80161780 is  6sec old was allocated at ip ffffffff98090e8f

How the user-space program and the eBPF program are connected? On initialization, tracex4_user.c loads the tracex4_kern.o object file using the load_bpf_file function. When load_bpf_file is executed, the probes defined in the eBPF file are added to /sys/kernel/debug/tracing/kprobe_events. We’re listening now to those events and our program can do something when they happen.