
A report from the 2024 Image-Based Linux Summit

October 22, 2024

This article was contributed by Luca Boccassi

The Image-Based Linux Summit has by now established itself as a yearly event. Following on from last year's edition, the third edition was held in Berlin on September 24, the day before All Systems Go! 2024 (ASG). The purpose of this event is to gather stakeholders from various engineering groups and hold friendly but lively discussions around the topic of image-based Linux — that is, Linux distributions based around immutable images, instead of mutable root filesystems.

The format of the event consists of a series of BoF sessions held in sequence, on topics chosen by the attendees. Organizers Luca Boccassi and Lennart Poettering welcomed participants from the Linux Userspace API (UAPI) Group, who work for companies or on projects such as Microsoft, Canonical/Ubuntu Core, Debian, GNOME OS, Fedora, Red Hat, SUSE, Arch Linux, mkosi, Flatcar, NixOS, carbonOS, postmarketOS, Pengutronix, and Edgeless Systems.

Progress since the previous summit

The first order of business was letting participants summarize what they achieved on topics of interest since last year's summit. The UAPI Group's web site and GitHub organization have had more specifications added, including one that precisely defines the pattern to use for configuration file handling on a hermetic-usr system. That specification formalizes what projects such as systemd and libeconf already use.
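As a rough illustration of the lookup pattern that such a specification describes, the following Python sketch resolves a configuration file and its drop-ins across a prioritized list of directories. The list of roots, the ".conf" drop-in suffix, and the function name resolve_config are assumptions made for this example; the specification itself is authoritative and covers details (additional directories, masking rules) that are not modeled here.

```python
from pathlib import Path

# Search order, highest priority first. Passed as a parameter so the
# sketch can be exercised against a scratch directory. (Illustrative list.)
DEFAULT_ROOTS = ("/etc", "/run", "/usr/lib")


def resolve_config(name, roots=DEFAULT_ROOTS):
    """Return (main_file, drop_ins) for a config file name such as 'foo.conf'."""
    roots = [Path(r) for r in roots]

    # The first root that ships the main file wins outright.
    main = next((r / name for r in roots if (r / name).is_file()), None)

    # Drop-ins are merged by file name: walk from lowest to highest
    # priority so that a higher-priority root replaces a same-named file.
    drop_ins = {}
    for root in reversed(roots):
        dropin_dir = root / f"{name}.d"
        if dropin_dir.is_dir():
            for path in dropin_dir.glob("*.conf"):
                drop_ins[path.name] = path

    # Drop-ins are then applied in lexicographic order of their names.
    return main, [drop_ins[key] for key in sorted(drop_ins)]
```

With this layout the vendor ships defaults under /usr, while the administrator overrides individual settings by placing a same-named drop-in in /etc, without ever editing the vendor's file.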

Systemd

Perhaps unsurprisingly, the systemd project implemented a lot of new features in the two major releases that went out since the previous summit. One of the most important pieces of work was the implementation of the systemd-pcrlock tool, which aims to close a major gap in the measured-boot story, namely how to deal with inherently local platform-configuration registers (PCRs) that are not under the control of the OS vendor. Poettering presented at ASG on this topic the following day (a video of the talk is available). Once this tool is refined and ready for production use, it should push the measured-boot story on Linux much closer to completion. There were many other changes of course, and a "State of the Project" talk at ASG attempted to provide an overview.

Mkosi also saw several major updates, and can now run fully unprivileged. It has also dropped the use of bubblewrap to provide a mkosi-sandbox tool instead. Support for OpenSSL engines and providers was added to sign artifacts, and mkosi-initrd is now fully integrated to allow building initrds from packages for use in unified kernel images (UKIs). It can now produce artifacts that can be directly consumed by systemd-sysupdate. Support for new distributions, such as Azure Linux, was added.

Distributions

Distribution vendors and maintainers have been busy too. Flatcar has now adopted System Extensions (sysexts) as a way to extend its production deployments and to simplify user customisation. It also integrates systemd-sysupdate as a complementary service to let operators update custom extensions at their own pace. NixOS has fully integrated systemd-repart into its build system, including support for having a dm-verity signed /nix/store, and UKIs are available by default. Edgeless is trying to make progress on a proposal to write a shared specification for package-manager lockfiles that can be shared across multiple projects. The company is also still working on its Uplosi tool for uploading images to cloud providers.

OpenSUSE has implemented full disk encryption bound to the TPM using signed policies and pcrlock, added support for soft-reboot using the Btrfs-based transactional-updates mechanism, and provides systemd-boot as an option in the image installer. GNOME OS has made significant investments to improve and integrate systemd-homed and systemd-sysupdate, thanks to a grant from the Sovereign Tech Fund, and also started using sysexts for testing system components during development as part of its continuous-integration system. Red Hat made progress on the automotive use case by supporting dm-verity for the base image in the osbuild image-building tool. It is also working on the bootc project to make bootable container images.

Linux Plumbers Conference

The week before the summit, the 2024 Linux Plumbers Conference was held in Vienna, and many UAPI group members participated. They also organized one of the microconferences, the Kernel ↔ Userspace/Init/System Management boundaries and APIs MC. The experience was positive and the event was productive, with many topics of interest covered. One notable topic was a discussion about how to refactor the kernel's handling of initrd, with the end goal of being able to enforce an immutable, read-only initrd at run time, rather than the unpacked tmpfs that is currently used. This would avoid copying and the need to delete contents before the transition to the real root filesystem.

Kernel

A relevant update on the kernel side was the Integrity Policy Enforcement Linux Security Module (IPE LSM) being accepted for inclusion upstream during the 6.12 merge window. This new LSM lets image-based Linux deployments ship a code-integrity policy enforced by the kernel, so that only signed (and thus trusted) payloads can be executed at run time. Enabling this feature was always one of the goals of developing image-based Linux products, and a demo showing how this can work was given at ASG.

Dual-boot and the discoverable-partitions specification

After the updates were given, the participants discussed the compatibility of the discoverable-partitions specification (DPS) with dual booting and operating systems' ownership of their respective partitions. A new installer for GNOME OS has been introduced that no longer supports traditional /etc/fstab configurations. This change has raised questions about identifying which root and /usr partitions belong to which distribution.

To address this, the idea is that the root-filesystem discovery will be based on pattern matching on labels, though this feature has not yet been fully incorporated into the specifications. Corresponding /var partitions are identified through a hash of the machine ID. That is problematic when building images with mkosi, because the ID would need to be fixed at build time, which is the opposite of how it is supposed to be used. This limitation prompted questions about production adoption of DPS; for instance, SteamOS has not yet integrated it due to issues with discovering the complete partition set.
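To make the build-time problem concrete, here is a minimal Python sketch of how the expected /var partition UUID could be derived from a machine ID. It assumes the scheme is an HMAC-SHA256 of the /var partition type UUID keyed by the machine ID and truncated to 128 bits; the type UUID constant, the byte encoding, and the stamping of UUID version bits are written from memory and should be checked against the specification before being relied upon. The function name is hypothetical.

```python
import hashlib
import hmac
import uuid

# GPT partition type UUID for /var. Assumption: quoted from memory of the
# DPS, verify against the specification.
VAR_TYPE_UUID = uuid.UUID("4d21b016-b534-45c2-a9fb-5c16e091fd2d")


def expected_var_partition_uuid(machine_id_hex):
    """Derive the partition UUID a /var partition would need for this machine."""
    key = bytes.fromhex(machine_id_hex)
    if len(key) != 16:
        raise ValueError("machine ID must be 32 hexadecimal characters")

    # HMAC-SHA256 of the type UUID, keyed by the machine ID, truncated
    # to the first 128 bits.
    digest = hmac.new(key, VAR_TYPE_UUID.bytes, hashlib.sha256).digest()
    raw = bytearray(digest[:16])

    # Assumption: the result is stamped as a version-4, RFC 4122 UUID.
    raw[6] = (raw[6] & 0x0F) | 0x40
    raw[8] = (raw[8] & 0x3F) | 0x80
    return uuid.UUID(bytes=bytes(raw))
```

Because the machine ID is the key, an image builder cannot compute this value unless the ID is already known when the partition table is written, which is exactly the constraint described above.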

Proposals were made to enhance partition identification through label-based matching and filtering, ensuring backward compatibility with systems that do not use labels. The need to support multiple versions of the same OS (as opposed to just different OSes) was also noted, along with potential solutions for specifying root filesystems in configurations using UKIs and systemd credentials locked to the TPM.

Stateless OpenPGP verification

The next discussion focused on establishing a generic pattern to use Stateless OpenPGP for the verification of distribution artifacts, including repositories and packages. Participants identified numerous pitfalls associated with the current use of GnuPG, particularly its non-compliance with the latest standard and the statefulness of keyrings. APT, the package manager used by Debian, Ubuntu, and derivative distributions, currently supports a directory hierarchy under /etc/apt/trusted.gpg.d for OpenPGP keyring files. A similar but generalized scheme that could be adopted by various producers and consumers of keys would be ideal.

A proposal was made to explore additional technologies, such as PKCS #7, allowing for greater flexibility in how keys are managed and used across distributions. This would facilitate better integration with systemd for artifact authentication during downloads. The discussions emphasized the importance of establishing a clear specification for key management, ensuring that keys have designated purposes and are stored in a structured directory hierarchy, following the common /etc -> /run -> /usr pattern for discovery. The directory structure would indicate what the key is for (e.g. APT or systemd-sysupdate), but the policy defining how the key can be used (to sign packages, etc.) should be inside the key itself. The design of such a specification is currently in progress in the UAPI Group repository.

Kernel-enforced restrictions for unsigned filesystems

The need for robust security features in systemd was underscored, particularly regarding access to unauthenticated filesystems. Proposals included the implementation of a BPF LSM program to reject access to unauthenticated filesystems (so only authenticated filesystems, such as those protected with dm-verity and dm-crypt, or kernel-provided virtual filesystems such as procfs and sysfs would be permitted) and deny access to device nodes outside of /dev. Additionally, the rejection of AF_UNIX sockets in unexpected locations like /etc and /usr was proposed.
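The three proposed rules can be modeled as a single decision function. The Python sketch below is only a model of the policy logic, not a BPF program; an actual implementation would live in the kernel as a BPF LSM hook. The Access record, the is_allowed name, and the exact contents of the two sets are assumptions for illustration (devtmpfs is added here so that device nodes in /dev pass the first rule).

```python
from dataclasses import dataclass
from typing import Optional

# Kernel-provided virtual filesystems that need no authentication.
# procfs and sysfs come from the proposal; devtmpfs is an assumption.
VIRTUAL_FS = frozenset({"proc", "sysfs", "devtmpfs"})
# Block-layer protections that count as "authenticated".
AUTHENTICATED_BACKINGS = frozenset({"dm-verity", "dm-crypt"})


@dataclass(frozen=True)
class Access:
    path: str
    fs_type: str
    backing: Optional[str]  # e.g. "dm-verity", or None for a plain device
    kind: str               # "file", "device", or "socket"


def _is_under(path, directory):
    return path == directory or path.startswith(directory.rstrip("/") + "/")


def is_allowed(access):
    # Rule 1: only authenticated or kernel-provided filesystems.
    if access.fs_type not in VIRTUAL_FS and access.backing not in AUTHENTICATED_BACKINGS:
        return False
    # Rule 2: device nodes are only usable below /dev.
    if access.kind == "device" and not _is_under(access.path, "/dev"):
        return False
    # Rule 3: no AF_UNIX sockets in unexpected locations.
    if access.kind == "socket" and any(_is_under(access.path, d) for d in ("/etc", "/usr")):
        return False
    return True
```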

The community was encouraged to submit requests for enhancements to track these proposed security policies. Enhancing the visibility of loaded programs in the BPF filesystem was also discussed, which would aid in managing filesystem policies more effectively. As an alternative, the recently merged IPE LSM could be enhanced to provide such controls, and, in fact, it already does provide such a feature at a proof-of-concept stage inside Azure Boost.

Combining FIDO2 and TPM2 for authentication

The relationship between FIDO2 and TPM2 technologies was a significant point of discussion. Participants explored the potential of combining TPM2 and FIDO2 as a two-factor authentication mechanism.

A TPM2 policy can enforce that a challenge-response type of authentication takes place before a secret can be unlocked. This could be used to send the challenge to a FIDO2 device, but it should also work with a PKCS#11 hardware security module. This is the scheme that ChromeOS already supports, so it appears to be a viable option.

A scheme based on Shamir's secret sharing was also discussed, and an implementation had even started to take form. The main downside compared to the previous option is that combining the key shards has to happen in main memory and be implemented by the CPU, while the other scheme lets the security chips handle this, which is safer.
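For readers unfamiliar with the technique, the following is a minimal textbook sketch of Shamir's secret sharing in Python, over a prime field. It is not the implementation mentioned at the summit; the field size and the function names are choices made for this example. Note that combine_shares() runs entirely on the CPU with all shards in main memory, which is the downside described above.

```python
import secrets

# A Mersenne prime; secrets must be smaller than this. (Illustrative choice.)
PRIME = 2**127 - 1


def split_secret(secret, threshold, shares):
    """Split an integer secret into `shares` points; any `threshold` recover it."""
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range")
    if not 1 <= threshold <= shares:
        raise ValueError("need 1 <= threshold <= shares")

    # Random polynomial of degree threshold-1 whose constant term is the secret.
    coefficients = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]

    def evaluate(x):
        # Horner's rule, modulo PRIME.
        y = 0
        for coefficient in reversed(coefficients):
            y = (y * x + coefficient) % PRIME
        return y

    return [(x, evaluate(x)) for x in range(1, shares + 1)]


def combine_shares(points):
    """Recover the secret by Lagrange interpolation at x = 0."""
    secret = 0
    for i, (xi, yi) in enumerate(points):
        numerator, denominator = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                numerator = (numerator * -xj) % PRIME
                denominator = (denominator * (xi - xj)) % PRIME
        secret = (secret + yi * numerator * pow(denominator, -1, PRIME)) % PRIME
    return secret
```

In a two-factor setup, one shard could be protected by the TPM2 and another by the FIDO2 device, with a threshold of two so that neither is sufficient alone.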

Challenges of immutable systems and added complexity for contributors

How to deliver an immutable system without raising complexity for contributors was another point of discussion. The challenges of building images for postmarketOS, especially locally, were highlighted.

Plans are in motion for loading sysexts early in the boot process (while still in the initrd phase), so that they can be applied immediately on the rootfs. The idea of having a local writable layer that could be "committed" to a sysext was also floated. Another option could be to perform full image builds and sign them with local keys, with a fast reboot mechanism (provided by soft-reboot).

While this bypasses some security models, of course, it may serve as a way to let developers use the systems they are building while allowing for shorter development cycles, which are fundamental for productivity.

ChromeOS and NixOS both provide a "developer mode" where security requirements are relaxed, to allow for such workflows without impacting the security of production deployments. This is sometimes called a "break glass" mode. The GNOME OS developers were looking into providing such a feature, but there was interest in implementing this directly in systemd, instead, so that it can be integrated with the TPM.

Systemd on musl

The adaptation of systemd for use with musl libc garnered attention, particularly in the context of postmarketOS. The challenges faced by contributors were discussed, highlighting the need for collaboration to address the technical hurdles involved in porting systemd to this environment.

The current plan of record is for the postmarketOS developers to provide a shim library that implements the APIs missing from musl that are needed by systemd, such as pidfd_spawn(), gshadow(), and additional printf() formatter capabilities. These features are closely tied to the libc and should really be implemented by libc authors.

Discussions also touched on the need for better management of /etc as a writable configuration context. Suggestions included persisting the machine ID and exploring solutions for managing presets, such as mounting /var from initrd.

The complexities of overlay filesystems and their interaction with writable configurations were explored, with participants suggesting that early mounting of /var during the boot process could mitigate some of these issues.

The /etc dilemma

This is one of the most frequently recurring topics in this area, so it would have been strange had it not been discussed. The question of how to handle /etc on immutable systems is one that has many possible answers, some more complete than others.

Even on a fully immutable system, some files, like the machine ID, are inherently local to a specific installation and cannot be part of the rootfs. These should remain stable across reboots so they cannot be ephemeral either. There are ad-hoc solutions, like setting systemd.machine_id=firmware when booting a VM so that a machine ID can be generated from a VM UUID set by the hypervisor. Another proposed solution for physical machines could use the TPM or a sealed system credential to persist the machine ID, instantiated on first boot. But such an approach cannot scale, naturally.

The main issue is that any updated /etc files need to be visible from the beginning of the rootfs boot phase, but the most common solutions to mount data partitions such as /var do it as part of the same boot phase, so any files stored in /var and symbolically linked, bind-mounted, or otherwise made available on the rest of the system, are not visible from the beginning. A proposed solution to this problem would be to ensure /var is mounted already by the early boot process, before switching root. In fact, SUSE MicroOS already prepares its /etc overlay in the initrd, so there is a working precedent for such a setup. It might be time for systemd to take care of this issue and generalize the move to mounting /var in the initrd, so that the various OSes and distributions can employ their preferred mechanism to update files in /etc, be that via confexts, snapshots, or overlays. Another workaround discussed involves using confexts in mutable mode, and either writing changes to /etc directly or redirecting writes to a staging directory to generate a new confext from.

Progress on hermetic /usr

Being closely related to the /etc dilemma, the efforts to push forward the hermetic /usr concept were also discussed at length. While significant progress has been made, challenges remain due to a small number of projects' resistance to the proposed changes. On a minimal base system, some configuration files that the GNU C library (glibc) uses (/etc/services, ldconfig, and nsswitch) are the last remaining items to address; the glibc maintainers are amenable to accepting patches if someone were to work on them and there are plans to make this happen.

Outside of such a minimal setup, there are various strategies to deal with the lack of support for default configuration for other programs in /usr, which is less problematic as it tends to be a late-boot problem. A common solution is to ship /usr/share/factory/etc and create symbolic links via tmpfiles.d to link the configuration files into /etc. Another solution is to use overlayfs to layer configuration storage directories in /usr or /var on top of /etc, which is what SUSE MicroOS and Flatcar do.
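The first strategy can be sketched as a small generator for tmpfiles.d entries. The Python function below walks a factory directory and emits one symbolic-link ("L") line per file; the function name and the choice to spell out the link target explicitly are assumptions for this example, and the output is meant to be saved as a tmpfiles.d fragment rather than applied directly.

```python
from pathlib import Path


def factory_symlink_lines(factory_etc, target_prefix="/usr/share/factory/etc"):
    """Return tmpfiles.d 'L' lines linking each factory file into /etc."""
    factory_etc = Path(factory_etc)
    lines = []
    for path in sorted(factory_etc.rglob("*")):
        if path.is_file():
            relative = path.relative_to(factory_etc).as_posix()
            # Fields: type, path, mode, user, group, age, argument (link target).
            lines.append(f"L /etc/{relative} - - - - {target_prefix}/{relative}")
    return lines
```

Since tmpfiles.d only creates a link when nothing exists at the destination, a file that the administrator has already placed in /etc is left alone.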

Unprivileged image mounting and user ranges assignment

Systemd recently introduced the mountfsd and nsresourced services that allow unprivileged users to mount verified images and to request user namespaces with pre-mapped UID/GID ranges. Previously, this had to be done via tools like newuidmap that use setuid, but this approach is known to be prone to security problems, since the caller controls the execution environment. Nsresourced is an interprocess-communication service (using varlink), so its execution environment is set up by systemd, like any other system service.

Work on these components is not done yet, though, and some challenges remain, such as how to assign ranges for different use cases, especially without knowing in advance what will be deployed. Dynamic assignment is problematic due to clashes, and having to manually configure the assignments is cumbersome. The proposed solution is to assign a predefined range of UIDs/GIDs that all containers will use. Since the range is static and predefined, there is no need to know in advance what the situation is on the system where the container will be deployed, which greatly simplifies setups. All dynamic ranges will get mapped to this predefined range.
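The mapping between the predefined range and a dynamically assigned one is simple offset arithmetic, as the Python sketch below shows. The base value and range size are hypothetical numbers picked for illustration, as are the function names; in practice the translation would be done by the kernel through ID-mapped mounts and user namespaces, not by user-space code like this.

```python
# Hypothetical values for illustration only.
RANGE_SIZE = 65536
STATIC_BASE = 1_000_000_000  # start of the predefined on-disk range


def to_dynamic(on_disk_id, dynamic_base):
    """Translate an ID stored in an image to the ID in a dynamic range."""
    offset = on_disk_id - STATIC_BASE
    if not 0 <= offset < RANGE_SIZE:
        raise ValueError("ID is outside the predefined range")
    return dynamic_base + offset


def to_static(runtime_id, dynamic_base):
    """Inverse translation, e.g. when writing files back to the image."""
    offset = runtime_id - dynamic_base
    if not 0 <= offset < RANGE_SIZE:
        raise ValueError("ID is outside the dynamic range")
    return STATIC_BASE + offset
```

Two containers started from the same image receive different dynamic bases, so their processes never share a UID at run time even though the files on disk use identical IDs.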

One of the remaining issues is that nesting is not possible, although it seems that work is planned to solve this problem in the kernel.

Another issue is that, given that users do not own the files on the filesystem using this static range, mountfsd will need to gain the ability to clean them up. This seems like a solvable problem with a new API designed for this purpose. Likewise, only images are handled now, and mountfsd should be enhanced to also be able to mount directories for users. Compared to the problem of getting buy-in from various projects to the idea of using the static, fixed range to build images, these technical challenges seem easy.

ESP resizing

UKIs require more storage space than the EFI system partition (ESP) was originally planned to provide, so many existing installations are not large enough, especially once addons and extensions are factored in. The boot loader specification introduced the extended boot loader partition for this reason, so that existing systems can gain additional storage space without having to reformat their drives.

But sometimes this is not enough either, and there is a strong desire to be able to dynamically extend the ESP. The problem is that there is nothing that can resize a VFAT filesystem in place, so this problem comes up often for discussion. Android was brought up as a precedent: it ran into a similar issue with its OS partitions and solved it by concatenating partitions at the kernel level using dm-linear. An idea was proposed to implement something similar using a special GPT partition type and algorithm for deriving partition UUIDs. But, so far, nobody has stepped up to attempt to implement this strategy, nor would this help with the ESP, so a solution to this problem remains elusive for now.
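The concatenation that dm-linear performs amounts to translating a sector of the combined device into a sector of one of the underlying partitions. The Python sketch below models that lookup; the segment names and the function name locate are made up for this example, and it says nothing about how partitions would be discovered or ordered under the proposed GPT scheme.

```python
from bisect import bisect_right
from itertools import accumulate


def locate(segments, logical_sector):
    """Map a sector of the concatenated device to (segment name, local sector).

    `segments` is an ordered list of (name, length_in_sectors) pairs.
    """
    # Cumulative end offsets: segment i covers [ends[i-1], ends[i]).
    ends = list(accumulate(length for _, length in segments))
    total = ends[-1] if ends else 0
    if not 0 <= logical_sector < total:
        raise ValueError("sector is beyond the end of the device")

    index = bisect_right(ends, logical_sector)
    start = ends[index - 1] if index else 0
    return segments[index][0], logical_sector - start
```

A filesystem placed on top sees one contiguous device, so growing it only requires appending another segment to the list.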

Factory reset

Factory reset is implemented in user space, with a special target that services can be hooked into and that can be booted to. Systemd-repart also has support for deleting data partitions and recreating them. But this is only part of the picture, as nowadays there will be data on the ESP too, in the form of credentials, addons, extensions, and self-signed images, so a strategy to deal with those is also needed.

Managing the ESP is tricky as it could be shared among multiple OSes, and it might store vendor data that might be necessary to boot the machine, which should not be deleted. The agreed solution is to come up with a separate "vendor" directory for addons, extensions, and other artifacts that will never be removed on factory reset.

The TPM should also be reset, and fortunately an API already exists that can be called to queue such an operation for the next reboot. Integration in user space is required, but should be fairly straightforward.

And speaking of integration, a way to tie all of these mechanisms together is still needed. A proposal was made to allow users to request a factory reset directly from the boot menu. This reset process would trigger comprehensive system resets, including TPM resets and systemd-repart's factory-reset functionalities, and this should fill all of the gaps in the current implementation.

Customizing the boot process via credentials instead of the kernel command line

Projects implementing immutable systems largely rely on the boot loader to show options to users, to let them pick the desired snapshot, generation, or image to boot. The kernel command line is used as the medium to pass this information to the services in the initrd that set these systems up.

The problem is that the kernel command line is a kitchen sink; it is parsed by anything and everything, and used for diverse things, with no separation or namespacing. And, of course, it is also parsed and used by the kernel. An unprivileged user gaining access to the kernel command line could have catastrophic consequences for a system. The kernel even parses it before ExitBootServices has been called, so even the firmware is part of the attack surface.

The proposed solution is to switch to systemd credentials instead. These are scoped, individual, and targeted, so only the user-space service that needs a credential will receive it. And, of course, the kernel does not parse these credentials, so the attack surface is greatly diminished. There are two issues with this approach: first of all, the tooling is not up to scratch yet, and there is no GUI for selecting a credential or a subset of credentials to apply to a system when booting. Secondly, user-space programs have largely not yet been enhanced to use them.

The first problem appears to be more difficult, as implementing a usable and friendly GUI in the bootloader is no easy task, especially for one that is able to display a large matrix of possible choices in a way that is usable. The second problem is technically simpler, as systemd makes it really simple to opt in and use a credential, but requires more work to convince projects to adopt credentials. Having a fully implemented end-to-end story for credentials will probably be required before more projects take the plunge and adopt them as an alternative to the kernel command line for configuration.
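On the consuming side, opting in can be quite small. Systemd exposes the credentials passed to a service as files in the directory named by the $CREDENTIALS_DIRECTORY environment variable, so a service written in Python could read one as sketched below. The helper name, the fallback behavior, and the credential name in the usage note are assumptions for this example.

```python
import os
from pathlib import Path


def read_credential(name, default=None):
    """Return the contents of a systemd credential, or `default` if absent."""
    directory = os.environ.get("CREDENTIALS_DIRECTORY")
    if not directory:
        # Not started by systemd with credentials; fall back.
        return default

    # Credential names are plain file names; refuse anything that could
    # escape the credentials directory.
    if not name or "/" in name or name in (".", ".."):
        raise ValueError(f"invalid credential name: {name!r}")

    try:
        return (Path(directory) / name).read_text().strip()
    except FileNotFoundError:
        return default
```

An initrd service could, for instance, call read_credential("boot.snapshot", default="latest") (a hypothetical credential name) instead of scanning the kernel command line for the snapshot to boot.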

Conclusions

The day concluded as planned, with all participants agreeing it was productive and that work should continue on the UAPI Group and ancillary projects, and that the event should be repeated next year. The next immediate goal will be preparing for the Image-Based Linux devroom at FOSDEM 2025, hoping to repeat the success of the 2023 edition. The full minutes of the summit have been published on the UAPI Group web site. Pushing image-based Linux projects forward is not a single set of tasks but an ongoing process, one that requires participation and coordination from many projects, companies, and groups, and the Image-Based Linux Summit is the ideal forum for such activities.




Copyright © 2024, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds