tracee-ebpf: use cgroup id for container id resolution by yanivagman · Pull Request #1130 · aquasecurity/tracee · GitHub

tracee-ebpf: use cgroup id for container id resolution #1130


Merged
merged 3 commits into main from cgroup_to_container_id on Nov 22, 2021

Conversation

yanivagman
Collaborator

This PR changes the way we extract container id for events.

The current solution has several problems:

  1. Container id is saved per process, wasting memory.
  2. Container id is only 12 chars long, while the full container id (64 chars long) might be required.
  3. BPF code is responsible for extracting the container id from a cgroup name (we currently check, for example, whether the cgroup name has a "docker-" prefix, but this doesn't always work, e.g. with podman)
  4. Regexes used during container id map initialization don't match the container ids we extract at runtime (as BPF code can't handle regexes)

To solve these problems, let's use the cgroup id to get the container id:

  1. In the context of every event, send the task's cgroup id (instead of the 12-char container id)
  2. Add a new map in userspace that maps cgroup id to container id
  3. Update this map on init with existing containers, using the fact that the lower 32 bits of the cgroup id are the inode number of the cgroupfs entry
  4. Add two new events (tracepoints) to track cgroup creation/removal: cgroup_mkdir and cgroup_rmdir
  5. Using these two events, update the cgroup-to-container-id map at runtime when a match to a known container runtime is found
  6. Use the cgroup id as an index into the map to get the container id of a given event

Closes #473
Fixes #958

@rafaeldtinoco
Contributor
rafaeldtinoco commented Nov 10, 2021

A good thing about having the BTFHub embedded files now is that, after rebasing with my dev branch, I'm able to quickly test this in multiple kernels... with that said, it looks like this code has issues loading in a 5.4 kernel:

; struct cgroup *dst_cgrp = (struct cgroup*)ctx->args[0];
308: (79) r3 = *(u64 *)(r7 +0)
309: (b7) r1 = 288
310: (0f) r3 += r1
; char *path = (char*)ctx->args[1];
311: (79) r7 = *(u64 *)(r7 +8)
; struct kernfs_node *kn = READ_KERN(cgrp->kn);
312: (7b) *(u64 *)(r10 -72) = r6
313: (bf) r1 = r10
; struct cgroup *dst_cgrp = (struct cgroup*)ctx->args[0];
314: (07) r1 += -72
; struct kernfs_node *kn = READ_KERN(cgrp->kn);
315: (b7) r2 = 8
316: (85) call bpf_probe_read#4
last_idx 316 first_idx 294
regs=4 stack=0 before 315: (b7) r2 = 8
317: (79) r3 = *(u64 *)(r10 -72)
; if (kn == NULL)
318: (15) if r3 == 0x0 goto pc+7
 R0=inv(id=0) R3_w=inv(id=0) R6=invP0 R7=inv(id=0) R8=inv(id=0) R9=inv2344 R10=fp0 fp-8=???????m fp-16=mmmmmmmm fp-24=mmmmmmmm fp-32=mmmmmmmm fp-40=mmmmmmmm fp-48=mmmmmmmm fp-56=mmmmmmmm fp-64=mmmmmmmm fp-72=mmmmmmmm fp-80=0000mmmm fp-88=map_value fp-96=ctx fp-104=00000000 fp-112=00000000 fp-120=0000mmmm fp-128=mmmmmmmm fp-136=mmmmmmmm fp-144=mmmmmmmm fp-152=mmmmmmmm fp-160=mmmmmmmm fp-168=mmmmmmmm fp-176=mmmmmmmm fp-184=mmmmmmmm fp-192=mmmmmmmm fp-200=mmmmmmmm fp-208=mmmmmmmm fp-216=ctx fp-224=mmmmmmmm fp-232=mmmmmmmm
319: (85) call unknown#195896080
invalid func unknown#195896080
processed 319 insns (limit 1000000) max_states_per_insn 0 total_states 18 peak_states 18 mark_read 16

libbpf: -- END LOG --
libbpf: failed to load program 'tracepoint__cgroup__cgroup_mkdir'
libbpf: failed to load object 'embedded-core'
2021/11/10 02:20:52 error creating Tracee: failed to load BPF object

It works fine in kernel 5.8. Checking the code now...

Contributor
@rafaeldtinoco rafaeldtinoco left a comment


The overall change is VERY nice. I liked the approach of having cgroup_id <-> container_id relationship maintained by the cgroup events. After you deal with the 5.4 relocation error, I'm ready to +1.

@rafaeldtinoco rafaeldtinoco mentioned this pull request Nov 10, 2021
@yanivagman yanivagman force-pushed the cgroup_to_container_id branch 4 times, most recently from 2450c30 to a8c57dd Compare November 17, 2021 20:22
@rafaeldtinoco
Contributor
rafaeldtinoco commented Nov 19, 2021

@yanivagman, it seems I was able to add the type-relocation feature to btfgen:

kernfs_node->id is a union in kernels <= 5.4:

$ sudo bpftool btf dump file ./5.4.0-87-generic.btf format raw  | less

[71] STRUCT 'kernfs_node' size=128 vlen=1
        'id' type_id=65 bits_offset=832

[65] UNION 'kernfs_node_id' size=8 vlen=2
        '(anon)' type_id=123 bits_offset=0
        'id' type_id=38 bits_offset=0

[123] STRUCT '(anon)' size=8 vlen=2
        'ino' type_id=1 bits_offset=0
        'generation' type_id=1 bits_offset=32

[1] TYPEDEF 'u32' type_id=30
[30] TYPEDEF '__u32' type_id=78
[78] INT 'unsigned int' size=4 bits_offset=0 nr_bits=32 encoding=(none)

[38] TYPEDEF 'u64' type_id=11
[11] TYPEDEF '__u64' type_id=98
[98] INT 'long long unsigned int' size=8 bits_offset=0 nr_bits=64 encoding=(none)

kernfs_node->id is a plain u64 in kernels > 5.4:

$ sudo bpftool btf dump file ./5.13.0-20-generic.btf format raw | less

[59] STRUCT 'kernfs_node' size=128 vlen=1
        'id' type_id=82 bits_offset=832

[82] TYPEDEF 'u64' type_id=88
[88] TYPEDEF '__u64' type_id=34
[34] INT 'long long unsigned int' size=8 bits_offset=0 nr_bits=64 encoding=(none)
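Per the two dumps above, on kernels <= 5.4 the 64-bit value packs ino in the low 32 bits and generation in the high 32 bits, while on newer kernels id is the plain 64-bit value itself. A small sketch of the old layout (helper names are mine, not kernel code):

```go
package main

import "fmt"

// packKernfsID models the <= 5.4 union layout shown above: ino occupies
// bits 0-31 and generation bits 32-63 of the 64-bit id.
func packKernfsID(ino, generation uint32) uint64 {
	return uint64(generation)<<32 | uint64(ino)
}

func inoOf(id uint64) uint32 { return uint32(id) }       // low 32 bits
func genOf(id uint64) uint32 { return uint32(id >> 32) } // high 32 bits

func main() {
	id := packKernfsID(112, 1)
	fmt.Println(id, inoOf(id), genOf(id)) // 4294967408 112 1
}
```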

So let's consider that part "okay" (unless there is a bug in my code). Now I would like to mention something else I observed:

@rafaeldtinoco
Contributor
rafaeldtinoco commented Nov 19, 2021

Could you try using the full BTF file (from BTFHub) on a 5.4.0 kernel and see if you're getting the "container_id" value?

$ sudo TRACEE_BTF_FILE=/home/rafaeldtinoco/work/sources/ebpf/btfgen/btfs/5.4.0-87-generic.btf ./dist/tracee-ebpf --debug --trace container -trace event=openat,openat2
OSInfo: VERSION: "18.04.6 LTS (Bionic Beaver)"
OSInfo: ID: ubuntu
OSInfo: ID_LIKE: debian
OSInfo: PRETTY_NAME: "Ubuntu 18.04.6 LTS"
OSInfo: VERSION_ID: "18.04"
OSInfo: VERSION_CODENAME: bionic
OSInfo: KERNEL_RELEASE: 5.4.0-87-generic
BTF: bpfenv = false, btfenv = true, vmlinux = false
BPF: using embedded BPF object
BTF: using BTF file from environment: /home/rafaeldtinoco/work/sources/ebpf/btfgen/btfs/5.4.0-87-generic.btf
unpacked CO:RE bpf object file into memory
TIME             CONTAINER_ID  UID    COMM             PID/host        TID/host        RET              EVENT                ARGS
07:26:09:565417                0      bash             1      /2019    1      /2019    3                openat               dirfd: -100, pathname: /etc/ld.so.cache, flags: O_RDONLY|O_CLOEXEC, mode: 0
07:26:09:565483                0      bash             1      /2019    1      /2019    3                openat               dirfd: -100, pathname: /lib/x86_64-linux-gnu/libtinfo.so.6, flags: O_RDONLY|O_CLOEXEC, mode: 0
07:26:09:565612                0      bash             1      /2019    1      /2019    3                openat               dirfd: -100, pathname: /lib/x86_64-linux-gnu/libc.so.6, flags: O_RDONLY|O_CLOEXEC, mode: 0
07:26:09:566305                0      bash             1      /2019    1      /2019    3                openat               dirfd: -100, pathname: /dev/tty, flags: O_RDWR|O_NONBLOCK, mode: 0
07:26:09:566897                0      bash             1      /2019    1      /2019    3                openat               dirfd: -100, pathname: /etc/nsswitch.conf, flags: O_RDONLY|O_CLOEXEC, mode: 0
07:26:09:567006                0      bash             1      /2019    1      /2019    3                openat               dirfd: -100, pathname: /etc/passwd, flags: O_RDONLY|O_CLOEXEC, mode: 0
07:26:09:567519                0      bash             1      /2019    1      /2019    3                openat               dirfd: -100, pathname: /etc/bash.bashrc, flags: O_RDONLY, mode: 0
07:26:09:569169                0      groups           9      /2079    9      /2079    3                openat               dirfd: -100, pathname: /etc/ld.so.cache, flags: O_RDONLY|O_CLOEXEC, mode: 0

Doing a bpf_printk() I was able to get the following:

            runc-2675    [003] d...  2745.657214: 0: id = 4294967408
            runc-2675    [003] d...  2745.659337: 0: id = 4294967835
            runc-2675    [003] d...  2745.659748: 0: id = 4294969175
            runc-2675    [003] d...  2745.660197: 0: id = 4294968570
            runc-2675    [003] d...  2745.660759: 0: id = 4294967896
            runc-2675    [003] d...  2745.661054: 0: id = 4294968237
            runc-2675    [003] d...  2745.661345: 0: id = 4294967377
            runc-2675    [003] d...  2745.661690: 0: id = 4294967347
            runc-2675    [003] d...  2745.662122: 0: id = 4294967329
            runc-2675    [003] d...  2745.662458: 0: id = 4294967344
            runc-2675    [003] d...  2745.662700: 0: id = 4294967684

but the container_id is zeroed.
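Those values are consistent with the <= 5.4 union layout discussed earlier: each id carries a generation of 1 in the high 32 bits and a small cgroupfs inode number in the low 32 bits. A quick decode of a few of them:

```go
package main

import "fmt"

func main() {
	// ids taken from the bpf_printk output above:
	// low 32 bits = inode, high 32 bits = generation
	for _, id := range []uint64{4294967408, 4294967835, 4294969175} {
		fmt.Printf("id=%d ino=%d gen=%d\n", id, uint32(id), id>>32)
	}
	// Yields inodes 112, 539 and 1879, each with gen=1.
}
```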

@rafaeldtinoco
Contributor

I'm able to get the container id when using the full BTF file for a 5.13 kernel, for example:

$ sudo TRACEE_BTF_FILE=/home/rafaeldtinoco/work/sources/ebpf/btfgen/btfs/5.13.0-20-generic.btf ./dist/tracee-ebpf --debug --trace container -trace event=openat,openat2
OSInfo: KERNEL_RELEASE: 5.13.0-20-generic
OSInfo: PRETTY_NAME: "Ubuntu 21.10"
OSInfo: VERSION_ID: "21.10"
OSInfo: VERSION: "21.10 (Impish Indri)"
OSInfo: VERSION_CODENAME: impish
OSInfo: ID: ubuntu
OSInfo: ID_LIKE: debian
BTF: bpfenv = false, btfenv = true, vmlinux = true
BPF: using embedded BPF object
BTF: using BTF file from environment: /home/rafaeldtinoco/work/sources/ebpf/btfgen/btfs/5.13.0-20-generic.btf
unpacked CO:RE bpf object file into memory
TIME             CONTAINER_ID  UID    COMM             PID/host        TID/host        RET              EVENT                ARGS
04:52:27:510010  5040962671ac  0      bash             1      /3966413 1      /3966413 3                openat               dirfd: -100, pathname: /etc/ld.so.cache, flags: O_RDONLY|O_CLOEXEC, mode: 0
04:52:27:510301  5040962671ac  0      bash             1      /3966413 1      /3966413 3                openat               dirfd: -100, pathname: /lib/x86_64-linux-gnu/libtinfo.so.6, flags: O_RDONLY|O_CLOEXEC, mode: 0
04:52:27:510980  5040962671ac  0      bash             1      /3966413 1      /3966413 3                openat               dirfd: -100, pathname: /lib/x86_64-linux-gnu/libc.so.6, flags: O_RDONLY|O_CLOEXEC, mode: 0
04:52:27:516075  5040962671ac  0      bash             1      /3966413 1      /3966413 3                openat               dirfd: -100, pathname: /dev/tty, flags: O_RDWR|O_NONBLOCK, mode: 0
04:52:27:521596  5040962671ac  0      bash             1      /3966413 1      /3966413 3                openat               dirfd: -100, pathname: /etc/nsswitch.conf, flags: O_RDONLY|O_CLOEXEC, mode: 0
04:52:27:521795  5040962671ac  0      bash             1      /3966413 1      /3966413 3                openat               dirfd: -100, pathname: /etc/passwd, flags: O_RDONLY|O_CLOEXEC, mode: 0

@yanivagman
Collaborator Author
yanivagman commented Nov 21, 2021

Thanks @rafaeldtinoco for pointing me to this problem!
It turns out that my solution doesn't handle cgroup v1 properly.
Ubuntu 18.04.6 uses cgroup v1 by default, so you encountered this problem. Luckily for us, it is easy to check whether cgroup v2 is enabled on a system: https://docs.docker.com/config/containers/runmetrics/#enumerate-cgroups

The problem is that bpf_get_current_cgroup_id() helper returns cgroup v2 id, which doesn't match any of the cgroup ids found while walking cgroupfs (v1).

The solution I have in mind for this:

  1. On tracee init, check whether cgroup v2 is enabled (by checking for the existence of /sys/fs/cgroup/cgroup.controllers)
  2. Init the bpf config map with a cgroup v1/v2 enabled flag
  3. If cgroup v2 - use the current logic
  4. Otherwise, for each event context, get the cgroup id of subsystem 0 (cpuset) (task->cgroups->subsys[0]->cgroup->kn->id)
  5. As we use a specific cgroup subsystem (cpuset), only parse cpuset directories when iterating cgroup v1 on init and when a cgroup_mkdir is received

WDYT?

@yanivagman
Collaborator Author
yanivagman commented Nov 21, 2021

One more point that we should add to the docs:
If the user wants a correct enumeration of existing containers (whether cgroup v1 or v2), they should run tracee in the host cgroup namespace (cgroupns=host in Docker).
This point is also true today, when we walk cgroupfs on init.

@yanivagman yanivagman force-pushed the cgroup_to_container_id branch from a8c57dd to 80b8c63 Compare November 21, 2021 15:30
@yanivagman yanivagman force-pushed the cgroup_to_container_id branch from 80b8c63 to d850e35 Compare November 21, 2021 15:40
@yanivagman
Collaborator Author

Note about my recent optimization in the last commit (cgroup v1 XOR v2, not both), taken from https://medium.com/nttlabs/cgroup-v2-596d035be4d7 : "cgroup v1 and v2 are incompatible and can’t be enabled simultaneously. Although there is “hybrid” configuration that allows mounting both v1 hierarchy and v2 hierarchy, the “hybrid” mode is underutilized for containers because you can’t enable v2 controllers that are already enabled for v1."

@rafaeldtinoco
Contributor

Thanks @yanivagman, I'm reviewing this today and will let you know.

Contributor
@rafaeldtinoco rafaeldtinoco left a comment


Yep, it seems to be working now. I agree with the rationale discussed about cgroups v1, and this commit looks good to me. Special attention to this commit: we're starting to use type-based checks (which will require btfgen to support type-based relocations).

I'll provide the changes at:

kinvolk/btfgen#13

@rafaeldtinoco
Contributor

5. As we use a specific cgroup subsystem (cpuset), only parse cpuset folders when iterating cgroup v1 on init and when cgroup_mkdir is received

Yes, I think that is the secret: rely on only a single cgroup subsystem, and yep, it seems cpuset is the one that should be used.

@rafaeldtinoco
Contributor

Sorry, I accidentally closed the PR. Re-opened it.

@yanivagman yanivagman merged commit d421bb9 into aquasecurity:main Nov 22, 2021
@yanivagman yanivagman deleted the cgroup_to_container_id branch November 22, 2021 17:28