tracee-ebpf: use cgroup id for container id resolution by yanivagman · Pull Request #1130 · aquasecurity/tracee · GitHub

tracee-ebpf: use cgroup id for container id resolution #1130


Merged
merged 3 commits into main from cgroup_to_container_id on Nov 22, 2021

Conversation

yanivagman
Collaborator

This PR changes the way we extract container id for events.

The current solution has several problems:

  1. Container id is saved per process, wasting memory.
  2. Container id is only 12 chars long, while the full container id (64 chars long) might be required.
  3. BPF code is responsible for extracting the container id from a cgroup name (we currently check, for example, whether the cgroup name has a "docker-" prefix, but this doesn't always work, e.g. with podman)
  4. Regexes used during container id map initialization don't match the container ids we extract at runtime (as BPF code can't handle regexes)

To solve these problems, let's use the cgroup id to get the container id:

  1. In the context of every event, send the task's cgroup id (instead of the 12-char container id)
  2. Add a new map in userspace that maps cgroup id to container id
  3. Update this map on init with existing containers, using the fact that the lower 32 bits of the cgroup id are the inode number of the cgroupfs entry
  4. Add two new events (tracepoints) to track cgroup creation/removal: cgroup_mkdir and cgroup_rmdir
  5. Using these two events, update the cgroup-to-container-id map at runtime when a match to a known container runtime is found
  6. Use the cgroup id as an index into the map to get the container id of a given event

Closes #473
Fixes #958

@rafaeldtinoco
Contributor
rafaeldtinoco commented Nov 10, 2021

A good thing about having the BTFHub embedded files now is that, after rebasing with my dev branch, I'm able to quickly test this in multiple kernels... with that said, it looks like this code has issues loading in a 5.4 kernel:

; struct cgroup *dst_cgrp = (struct cgroup*)ctx->args[0];
308: (79) r3 = *(u64 *)(r7 +0)
309: (b7) r1 = 288
310: (0f) r3 += r1
; char *path = (char*)ctx->args[1];
311: (79) r7 = *(u64 *)(r7 +8)
; struct kernfs_node *kn = READ_KERN(cgrp->kn);
312: (7b) *(u64 *)(r10 -72) = r6
313: (bf) r1 = r10
; struct cgroup *dst_cgrp = (struct cgroup*)ctx->args[0];
314: (07) r1 += -72
; struct kernfs_node *kn = READ_KERN(cgrp->kn);
315: (b7) r2 = 8
316: (85) call bpf_probe_read#4
last_idx 316 first_idx 294
regs=4 stack=0 before 315: (b7) r2 = 8
317: (79) r3 = *(u64 *)(r10 -72)
; if (kn == NULL)
318: (15) if r3 == 0x0 goto pc+7
 R0=inv(id=0) R3_w=inv(id=0) R6=invP0 R7=inv(id=0) R8=inv(id=0) R9=inv2344 R10=fp0 fp-8=???????m fp-16=mmmmmmmm fp-24=mmmmmmmm fp-32=mmmmmmmm fp-40=mmmmmmmm fp-48=mmmmmmmm fp-56=mmmmmmmm fp-64=mmmmmmmm fp-72=mmmmmmmm fp-80=0000mmmm fp-88=map_value fp-96=ctx fp-104=00000000 fp-112=00000000 fp-120=0000mmmm fp-128=mmmmmmmm fp-136=mmmmmmmm fp-144=mmmmmmmm fp-152=mmmmmmmm fp-160=mmmmmmmm fp-168=mmmmmmmm fp-176=mmmmmmmm fp-184=mmmmmmmm fp-192=mmmmmmmm fp-200=mmmmmmmm fp-208=mmmmmmmm fp-216=ctx fp-224=mmmmmmmm fp-232=mmmmmmmm
319: (85) call unknown#195896080
invalid func unknown#195896080
processed 319 insns (limit 1000000) max_states_per_insn 0 total_states 18 peak_states 18 mark_read 16

libbpf: -- END LOG --
libbpf: failed to load program 'tracepoint__cgroup__cgroup_mkdir'
libbpf: failed to load object 'embedded-core'
2021/11/10 02:20:52 error creating Tracee: failed to load BPF object

It works fine in kernel 5.8. Checking the code now...

Contributor
@rafaeldtinoco rafaeldtinoco left a comment


The overall change is VERY nice. I liked the approach of having cgroup_id <-> container_id relationship maintained by the cgroup events. After you deal with the 5.4 relocation error, I'm ready to +1.

@rafaeldtinoco rafaeldtinoco mentioned this pull request Nov 10, 2021
@yanivagman yanivagman force-pushed the cgroup_to_container_id branch 4 times, most recently from 2450c30 to a8c57dd Compare November 17, 2021 20:22
@rafaeldtinoco
Contributor
rafaeldtinoco commented Nov 19, 2021

@yanivagman, it seems I was able to add the type-relocation feature to btfgen:

kernfs_node->id is a union in kernels <= 5.4:

$ sudo bpftool btf dump file ./5.4.0-87-generic.btf format raw  | less

[71] STRUCT 'kernfs_node' size=128 vlen=1
        'id' type_id=65 bits_offset=832

[65] UNION 'kernfs_node_id' size=8 vlen=2
        '(anon)' type_id=123 bits_offset=0
        'id' type_id=38 bits_offset=0

[123] STRUCT '(anon)' size=8 vlen=2
        'ino' type_id=1 bits_offset=0
        'generation' type_id=1 bits_offset=32

[1] TYPEDEF 'u32' type_id=30
[30] TYPEDEF '__u32' type_id=78
[78] INT 'unsigned int' size=4 bits_offset=0 nr_bits=32 encoding=(none)

[38] TYPEDEF 'u64' type_id=11
[11] TYPEDEF '__u64' type_id=98
[98] INT 'long long unsigned int' size=8 bits_offset=0 nr_bits=64 encoding=(none)

kernfs_node->id is a plain u64 in kernels > 5.4:

$ sudo bpftool btf dump file ./5.13.0-20-generic.btf format raw | less

[59] STRUCT 'kernfs_node' size=128 vlen=1
        'id' type_id=82 bits_offset=832

[82] TYPEDEF 'u64' type_id=88
[88] TYPEDEF '__u64' type_id=34
[34] INT 'long long unsigned int' size=8 bits_offset=0 nr_bits=64 encoding=(none)
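Per the two dumps above, on kernels <= 5.4 the 64-bit value packs ino in the low 32 bits and generation in the high 32 bits, while on newer kernels id is the plain 64-bit value itself. A small sketch of the old layout (helper names are mine, not kernel code):

```go
package main

import "fmt"

// packKernfsID models the <= 5.4 union layout shown above: ino occupies
// bits 0-31 and generation bits 32-63 of the 64-bit id.
func packKernfsID(ino, generation uint32) uint64 {
	return uint64(generation)<<32 | uint64(ino)
}

func inoOf(id uint64) uint32 { return uint32(id) }       // low 32 bits
func genOf(id uint64) uint32 { return uint32(id >> 32) } // high 32 bits

func main() {
	id := packKernfsID(112, 1)
	fmt.Println(id, inoOf(id), genOf(id)) // 4294967408 112 1
}
```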

So let's consider that part "okay" (unless there is a bug in my code). Now I would like to mention something else I observed:

@rafaeldtinoco
Contributor
rafaeldtinoco commented Nov 19, 2021

Could you try using the full BTF file (from BTFHub) on a 5.4.0 kernel and see if you're getting the "container_id" value?

$ sudo TRACEE_BTF_FILE=/home/rafaeldtinoco/work/sources/ebpf/btfgen/btfs/5.4.0-87-generic.btf ./dist/tracee-ebpf --debug --trace container -trace event=openat,openat2
OSInfo: VERSION: "18.04.6 LTS (Bionic Beaver)"
OSInfo: ID: ubuntu
OSInfo: ID_LIKE: debian
OSInfo: PRETTY_NAME: "Ubuntu 18.04.6 LTS"
OSInfo: VERSION_ID: "18.04"
OSInfo: VERSION_CODENAME: bionic
OSInfo: KERNEL_RELEASE: 5.4.0-87-generic
BTF: bpfenv = false, btfenv = true, vmlinux = false
BPF: using embedded BPF object
BTF: using BTF file from environment: /home/rafaeldtinoco/work/sources/ebpf/btfgen/btfs/5.4.0-87-generic.btf
unpacked CO:RE bpf object file into memory
TIME             CONTAINER_ID  UID    COMM             PID/host        TID/host        RET              EVENT                ARGS
07:26:09:565417                0      bash             1      /2019    1      /2019    3                openat               dirfd: -100, pathname: /etc/ld.so.cache, flags: O_RDONLY|O_CLOEXEC, mode: 0
07:26:09:565483                0      bash             1      /2019    1      /2019    3                openat               dirfd: -100, pathname: /lib/x86_64-linux-gnu/libtinfo.so.6, flags: O_RDONLY|O_CLOEXEC, mode: 0
07:26:09:565612                0      bash             1      /2019    1      /2019    3                openat               dirfd: -100, pathname: /lib/x86_64-linux-gnu/libc.so.6, flags: O_RDONLY|O_CLOEXEC, mode: 0
07:26:09:566305                0      bash             1      /2019    1      /2019    3                openat               dirfd: -100, pathname: /dev/tty, flags: O_RDWR|O_NONBLOCK, mode: 0
07:26:09:566897                0      bash             1      /2019    1      /2019    3                openat               dirfd: -100, pathname: /etc/nsswitch.conf, flags: O_RDONLY|O_CLOEXEC, mode: 0
07:26:09:567006                0      bash             1      /2019    1      /2019    3                openat               dirfd: -100, pathname: /etc/passwd, flags: O_RDONLY|O_CLOEXEC, mode: 0
07:26:09:567519                0      bash             1      /2019    1      /2019    3                openat               dirfd: -100, pathname: /etc/bash.bashrc, flags: O_RDONLY, mode: 0
07:26:09:569169                0      groups           9      /2079    9      /2079    3                openat               dirfd: -100, pathname: /etc/ld.so.cache, flags: O_RDONLY|O_CLOEXEC, mode: 0

Doing a bpf_printk() I was able to get the following:

            runc-2675    [003] d...  2745.657214: 0: id = 4294967408
            runc-2675    [003] d...  2745.659337: 0: id = 4294967835
            runc-2675    [003] d...  2745.659748: 0: id = 4294969175
            runc-2675    [003] d...  2745.660197: 0: id = 4294968570
            runc-2675    [003] d...  2745.660759: 0: id = 4294967896
            runc-2675    [003] d...  2745.661054: 0: id = 4294968237
            runc-2675    [003] d...  2745.661345: 0: id = 4294967377
            runc-2675    [003] d...  2745.661690: 0: id = 4294967347
            runc-2675    [003] d...  2745.662122: 0: id = 4294967329
            runc-2675    [003] d...  2745.662458: 0: id = 4294967344
            runc-2675    [003] d...  2745.662700: 0: id = 4294967684

but the container_id is zeroed.
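Those values are consistent with the <= 5.4 union layout discussed earlier: each id carries a generation of 1 in the high 32 bits and a small cgroupfs inode number in the low 32 bits. A quick decode of a few of them:

```go
package main

import "fmt"

func main() {
	// ids taken from the bpf_printk output above:
	// low 32 bits = inode, high 32 bits = generation
	for _, id := range []uint64{4294967408, 4294967835, 4294969175} {
		fmt.Printf("id=%d ino=%d gen=%d\n", id, uint32(id), id>>32)
	}
	// Yields inodes 112, 539 and 1879, each with gen=1.
}
```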

@rafaeldtinoco
Contributor

I'm able to get the container id when using the full BTF file for a 5.13 kernel, for example:

$ sudo TRACEE_BTF_FILE=/home/rafaeldtinoco/work/sources/ebpf/btfgen/btfs/5.13.0-20-generic.btf ./dist/tracee-ebpf --debug --trace container -trace event=openat,openat2
OSInfo: KERNEL_RELEASE: 5.13.0-20-generic
OSInfo: PRETTY_NAME: "Ubuntu 21.10"
OSInfo: VERSION_ID: "21.10"
OSInfo: VERSION: "21.10 (Impish Indri)"
OSInfo: VERSION_CODENAME: impish
OSInfo: ID: ubuntu
OSInfo: ID_LIKE: debian
BTF: bpfenv = false, btfenv = true, vmlinux = true
BPF: using embedded BPF object
BTF: using BTF file from environment: /home/rafaeldtinoco/work/sources/ebpf/btfgen/btfs/5.13.0-20-generic.btf
unpacked CO:RE bpf object file into memory
TIME             CONTAINER_ID  UID    COMM             PID/host        TID/host        RET              EVENT                ARGS
04:52:27:510010  5040962671ac  0      bash             1      /3966413 1      /3966413 3                openat               dirfd: -100, pathname: /etc/ld.so.cache, flags: O_RDONLY|O_CLOEXEC, mode: 0
04:52:27:510301  5040962671ac  0      bash             1      /3966413 1      /3966413 3                openat               dirfd: -100, pathname: /lib/x86_64-linux-gnu/libtinfo.so.6, flags: O_RDONLY|O_CLOEXEC, mode: 0
04:52:27:510980  5040962671ac  0      bash             1      /3966413 1      /3966413 3                openat               dirfd: -100, pathname: /lib/x86_64-linux-gnu/libc.so.6, flags: O_RDONLY|O_CLOEXEC, mode: 0
04:52:27:516075  5040962671ac  0      bash             1      /3966413 1      /3966413 3                openat               dirfd: -100, pathname: /dev/tty, flags: O_RDWR|O_NONBLOCK, mode: 0
04:52:27:521596  5040962671ac  0      bash             1      /3966413 1      /3966413 3                openat               dirfd: -100, pathname: /etc/nsswitch.conf, flags: O_RDONLY|O_CLOEXEC, mode: 0
04:52:27:521795  5040962671ac  0      bash             1      /3966413 1      /3966413 3                openat               dirfd: -100, pathname: /etc/passwd, flags: O_RDONLY|O_CLOEXEC, mode: 0

@yanivagman
Collaborator Author
yanivagman commented Nov 21, 2021

Thanks @rafaeldtinoco for pointing me to this problem!
It turns out that my solution doesn't handle cgroup v1 properly.
Ubuntu 18.04.6 uses cgroup v1 by default, so you encountered this problem. Luckily for us, it is easy to check whether cgroup v2 is enabled on a system: https://docs.docker.com/config/containers/runmetrics/#enumerate-cgroups

The problem is that bpf_get_current_cgroup_id() helper returns cgroup v2 id, which doesn't match any of the cgroup ids found while walking cgroupfs (v1).

The solution I have in mind for this:

  1. On tracee init, check whether cgroup v2 is enabled (by checking for the existence of /sys/fs/cgroup/cgroup.controllers)
  2. Init the bpf config map with a cgroup v1/v2 enabled flag
  3. If cgroup v2 - use the current logic
  4. Otherwise, for each event context, get the cgroup id of subsystem 0 (cpuset) (task->cgroups->subsys[0]->cgroup->kn->id)
  5. As we use a specific cgroup subsystem (cpuset), only parse cpuset directories when iterating cgroup v1 on init and when a cgroup_mkdir is received

WDYT?

@yanivagman
Collaborator Author
yanivagman commented Nov 21, 2021

One more point that we should add to the docs:
If the user wants a correct enumeration of existing containers (whether cgroup v1 or v2), they should run tracee in the host cgroup namespace (cgroupns=host in Docker).
This point is also true today, when we walk cgroupfs on init.

@yanivagman yanivagman force-pushed the cgroup_to_container_id branch from a8c57dd to 80b8c63 Compare November 21, 2021 15:30
@yanivagman yanivagman force-pushed the cgroup_to_container_id branch from 80b8c63 to d850e35 Compare November 21, 2021 15:40
@yanivagman
Collaborator Author

Note about my recent optimization in the last commit (cgroup v1 XOR v2, not both), taken from https://medium.com/nttlabs/cgroup-v2-596d035be4d7 : "cgroup v1 and v2 are incompatible and can’t be enabled simultaneously. Although there is “hybrid” configuration that allows mounting both v1 hierarchy and v2 hierarchy, the “hybrid” mode is underutilized for containers because you can’t enable v2 controllers that are already enabled for v1."

@rafaeldtinoco
Contributor

Thanks @yanivagman, I'm reviewing this today and will let you know.

Contributor
@rafaeldtinoco rafaeldtinoco left a comment


Yep, it seems to be working now. I agree with the rationale discussed about cgroups v1, and this commit looks good to me. Special attention to this commit: we're starting to use type-based checks (which will require btfgen to support type-based relocations).

I'll provide the changes at:

kinvolk/btfgen#13

@rafaeldtinoco
Contributor

5. As we use a specific cgroup subsystem (cpuset), only parse cpuset folders when iterating cgroup v1 on init and when cgroup_mkdir is received

Yes, I think that is the secret: rely on only a single cgroup subsystem, and yep, it seems cpuset is the one that should be used.

@rafaeldtinoco
Contributor

Sorry, I accidentally closed the PR. Re-opened it.

@yanivagman yanivagman merged commit d421bb9 into aquasecurity:main Nov 22, 2021
@yanivagman yanivagman deleted the cgroup_to_container_id branch November 22, 2021 17:28