NVIDIA and nouveau

By Jake Edge
October 5, 2022

The release of source code for NVIDIA graphics hardware was perhaps something of a surprise; at least at a quick glance, it seems like that could lead to an in-tree, officially supported driver. For many years, though, the nouveau project has been working on an upstream driver for NVIDIA hardware, so an obvious question is what happens with nouveau in light of the NVIDIA announcement. Kernel graphics maintainer Dave Airlie gave a talk at the 2022 Linux Plumbers Conference (LPC) to help shed some light on that subject.

NVIDIA

He began by giving a brief history of NVIDIA hardware, with a timeline that can be seen in his slides. The timeline was in part "cobbled together from Wikipedia" and is not completely accurate, he said, but shows just "how far back NVIDIA hardware stretches". While the timeline starts in 1999, things started getting interesting in 2006 with the NV50, he said. It introduced the per-context virtual memory addresses; that feature represented a major turning point for graphics hardware.

There is roughly a two-year cadence to the NVIDIA releases starting in 2010 with "Fermi" (GF1xx). Vulkan support was added in the "Kepler" (GK1xx) hardware in 2012. In 2014, "Maxwell" came in two versions (GM1xx and GM2xx); the latter, also known as "Maxwell 2", introduced signed firmware. That pace more or less continued with "Pascal" (GP1xx) in 2016, "Volta" (GV1xx) in 2017, "Turing" (TU1xx) in 2018, and "Ampere" (GA1xx) in 2020. Turing brought support for the GPU system processor (GSP); he explained the importance of that feature a bit later in the talk.

Starting with Maxwell 2, NVIDIA decided that firmware for its devices could not simply be loaded unsigned, for security and other reasons. So firmware needed to be signed by NVIDIA and loaded into the multiple processors on the device. This made life hard for the nouveau project because it required complicated boot sequences for poking multiple firmware images into the device in a specific order that was "very hard to get right".

NVIDIA and nouveau had worked out an arrangement where NVIDIA would provide signed firmware, but it was still difficult to get any of the hardware working. Even when all of the right things were done at boot time, the devices came up in their base configuration. The devices were powered on and functioning, but "you can't make it reclock, you can't make it go faster". Manually choosing a performance level for NVIDIA devices is known as "reclocking". There was also no power-management functionality available to the driver. This was a watershed moment for nouveau, Airlie said, because it did not make sense to put a lot of effort into a driver for graphics hardware running in its slowest possible mode, while not being battery-friendly either.

The GSP is a RISC-V-based processor that was added to the GPU for the Turing and later hardware. The GPU already had "six or seven little processors on it", but the GSP is meant to be "the one to rule them all". The firmware file for the GSP is around 30 or 40MB; most of the earlier firmware blobs were on the order of 256KB, so the GSP firmware is a substantial increase in size. But it is a single firmware image for the device that initializes the rest of the processors. Effectively, NVIDIA moved much of its proprietary kernel driver into the GSP.

That all happened around the same time as the announcement of the open-source NVIDIA kernel drivers, he said. Those are based on a fork of the NVIDIA proprietary driver that only interfaces directly to the GSP; it turns out that there is nothing all that interesting in the API between the kernel and the GSP, so it could be released as open source. Since NVIDIA has customers who are interested in open-source drivers, it makes sense for the company to do so. However, the drivers do not look or act like the existing kernel graphics drivers so they are not able to go into the upstream kernel, Airlie said.

nouveau

That is the current state in the NVIDIA world, which made for a good lead-in to talk about nouveau. That project started around 2007 with the goal of reverse-engineering NVIDIA GPUs in order to create Linux drivers. It supports hardware from NV04 (1999) through Ampere "in various states of disrepair".

But the project has recently stagnated somewhat due to various factors. One big problem that a community open-source graphics project faces is that once someone gets good at working on it, that becomes known, and they get hired away to work on some other graphics hardware. There is really only one full-time nouveau developer, Ben Skeggs at Red Hat, working on the project.

Also, once the signed firmware came about, with its lack of reclocking and power-management features, it was disheartening for the project; there was no way that the open-source driver was ever going to be able to compete with the proprietary one. It was hard to justify putting a lot of effort into nouveau. Beyond that, Skeggs spent a lot of time just trying to get the firmware provided by NVIDIA to load and run the hardware.

For the most part, the nouveau kernel driver is just for hardware enablement at this point. The firmware that NVIDIA provides is not the same as what is used by the proprietary driver, so it is not well-tested. Only NVIDIA can really debug problems with that firmware, so there have to be multiple round-trips with NVIDIA engineers. More recently, though, the project has been adding GSP support because that provides a high-level interface to things like reclocking, so the hope is that the nouveau kernel driver can use the standard NVIDIA GSP firmware and drive the hardware that way; "we will see".

OpenGL and Vulkan

There is a nouveau OpenGL driver in Mesa. He believes it has passed the OpenGL 4.5 conformance tests, but has never been submitted for certification. Up until recently, it had "horribly broken multithreading context support" so it worked for older single-threaded games and the like but not for programs like Firefox or modern games; that has been fixed recently, though. The driver has not seen a lot of optimization work, however, due to the lack of reclocking support for the hardware.

A Vulkan driver for nouveau was recently started by Jason Ekstrand, with help from Karol Herbst and Airlie. At the time of the talk, that was a bit of news, but things have progressed since that time. The driver is targeting Vulkan 1.0 for hardware from Kepler up through Ampere and is passing lots of the conformance tests at this point. But in order to finish the driver, and make it work the way they want it to, there is a need to add new user-space APIs to the kernel.

There are three features needed to get Vulkan really working, he said. The first is to split the physical memory allocations (for buffer objects) from the GPU virtual memory allocations. In nouveau, that's all done in one step, which is fine for OpenGL but does not work for Vulkan where more control over the mappings is required.
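To make the distinction concrete, here is a minimal sketch of what separating the two steps might look like. It is a toy model only: the class and function names are hypothetical and do not correspond to any real nouveau, DRM, or Vulkan interface, and the partial and aliased mappings are just one illustration of "more control over the mappings" that the talk did not spell out.

```python
# Toy model only: these names are hypothetical and do not mirror any real
# nouveau, DRM, or Vulkan interface.
from dataclasses import dataclass, field
from itertools import count


@dataclass
class BufferObject:
    """A chunk of backing memory with no GPU virtual address of its own."""
    handle: int
    size: int


@dataclass
class AddressSpace:
    """A per-context GPU virtual address space whose layout the caller controls."""
    mappings: dict = field(default_factory=dict)  # va -> (handle, offset, length)

    def map(self, va, bo, offset, length):
        if offset + length > bo.size:
            raise ValueError("mapping extends past the end of the buffer")
        # Refuse any range that overlaps an existing mapping.
        for start, (_, _, mapped_len) in self.mappings.items():
            if va < start + mapped_len and start < va + length:
                raise ValueError("virtual range already in use")
        self.mappings[va] = (bo.handle, offset, length)

    def unmap(self, va):
        del self.mappings[va]


_handles = count(1)


def alloc_bo(size):
    """Step one: allocate backing memory only; no address is assigned."""
    return BufferObject(next(_handles), size)


# Step two is separate and repeatable: the same buffer can be mapped in
# full at one address and in part at another.
vm = AddressSpace()
bo = alloc_bo(0x4000)
vm.map(0x10000, bo, 0, 0x4000)       # the whole buffer
vm.map(0x80000, bo, 0x1000, 0x1000)  # a 4KB slice of it, aliased elsewhere
```

In a one-step scheme, by contrast, the allocation call would pick the address itself and the caller would have no say over where, or how many times, a buffer appears in the address space.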

The second is that synchronization objects, and ways to work with them, need to be added so that the scheduler can wait for existing GPU work to complete before sending new tasks. It is the way to do proper interleaving of GPU work, Airlie said. The final piece is a virtual-memory-handling interface that is called VM_BIND; it is something that is being looked at for the Intel driver and the amdgpu driver already has many pieces of it. It is an API both for virtual memory and for command submission that is also geared toward the needs of Vulkan.
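The role of synchronization objects can be sketched with a small toy scheduler. Everything here is an assumption made for illustration: the names are invented, this is not the DRM sync-object API or any nouveau interface, and jobs "complete" instantly rather than being signaled by the GPU. VM_BIND is not modeled at all.

```python
# Toy model: hypothetical names, not the DRM syncobj API or a nouveau interface.
from collections import deque


class SyncObject:
    """A one-shot flag that is set when the job attached to it finishes."""
    def __init__(self):
        self.signaled = False


class Scheduler:
    def __init__(self):
        self.pending = deque()
        self.completed = []

    def submit(self, name, wait_for=(), signal=None):
        """Queue a job that must wait for some sync objects and may signal one."""
        self.pending.append((name, tuple(wait_for), signal))

    def run(self):
        """Run jobs whose dependencies are met until no further progress is possible."""
        progress = True
        while progress:
            progress = False
            for _ in range(len(self.pending)):
                name, waits, signal = self.pending.popleft()
                if all(s.signaled for s in waits):
                    self.completed.append(name)  # stand-in for GPU execution
                    if signal is not None:
                        signal.signaled = True
                    progress = True
                else:
                    self.pending.append((name, waits, signal))  # not ready yet


upload_done = SyncObject()
sched = Scheduler()
sched.submit("draw", wait_for=[upload_done])  # submitted first, but must wait
sched.submit("upload", signal=upload_done)
sched.run()
print(sched.completed)
```

Even though the draw job is queued first, it only runs once the upload has signaled its sync object, which is the kind of ordering guarantee the interleaving of GPU work depends on.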

Those are all non-trivial projects, he said. Once the GSP support is working and reclocking can be done, these are the next steps for nouveau, but they are going to take some time. The Vulkan driver developers have already started looking at that effort, but there are somewhat circular dependencies that make it difficult to see how to do the work incrementally. It will be a lot of code to review, so getting it into the upstream kernel in a piece-wise fashion will be challenging.

Future

There are some upcoming problems that have not yet been faced, he said. A 30 or 40MB firmware image is rather large; normally, those are put into the initramfs. But putting multiple initramfs images into the boot partition may overrun the space available. The problem gets worse because there may be a need to ship multiple NVIDIA firmware images due to a lack of a stable firmware ABI. The nouveau project will have to pick and choose which firmware versions to support, but each will need to be available; he has wondered if there might be a way to delay firmware loading until after the real root filesystem is mounted, but has not really worked that out yet.
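A rough back-of-the-envelope calculation shows how quickly the space adds up. Only the 40MB firmware figure comes from the talk; the boot-partition size, kernel size, base initramfs size, and the kernel and firmware counts are all assumed values chosen for illustration, and the firmware is assumed not to shrink when the initramfs is compressed.

```python
FIRMWARE_MB = 40  # upper end of the GSP firmware size quoted in the talk

# Everything below is an assumed, illustrative figure.
BOOT_PARTITION_MB = 512
KERNEL_MB = 12
BASE_INITRAMFS_MB = 35  # initramfs contents other than GSP firmware


def boot_usage_mb(kernels, fw_versions):
    """Space used when every installed kernel carries its own initramfs
    holding `fw_versions` copies of the GSP firmware."""
    per_kernel = KERNEL_MB + BASE_INITRAMFS_MB + fw_versions * FIRMWARE_MB
    return kernels * per_kernel


used = boot_usage_mb(kernels=3, fw_versions=2)
print(f"{used} MB of {BOOT_PARTITION_MB} MB used")
```

Under these assumptions, three kernels with two firmware versions each fit, but `boot_usage_mb(4, 3)` already exceeds the assumed 512MB partition.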

In the long run, it may not really make sense to pound the NVIDIA firmware API into the nouveau driver, which has its own ideas about how everything works. A new driver that leaves behind the existing nouveau legacy and only talks to the GSP using NVIDIA's API may be the right path instead. In addition, the ability to reclock the hardware and accelerate the GPU may allow creating a cross-platform compute stack, to replace the vendor-specific solutions (e.g. CUDA) that exist today. All of those solutions are on their own island, lacking any real developer community, but maybe that could be changed; "we've done it for Vulkan, we've done it for OpenGL, I don't see why we can't do it", though it will take a lot of time—and likely a lot of money.

An audience member asked about Vulkan Compute as a possibility, but Airlie said it was not geared toward the same kinds of problems as CUDA and others. It is better than OpenGL Compute, but is still a long way from what the real compute stacks provide. Ekstrand echoed that, noting that while there is desire to see Vulkan Compute handle more of the "scientific" computing use cases, it will never be a full-stack solution; at most Vulkan can provide the run-time piece, he said.

There was some discussion of the problem with the size of the GSP firmware and initramfs that Airlie had described, including several suggestions of ways to approach the problem. The YouTube video of the talk is available for those who are interested in that discussion or more of the details elsewhere in the talk.

[I would like to thank LWN subscribers for supporting my travel to Dublin for Linux Plumbers Conference.]

Index entries for this article
Conference	Linux Plumbers Conference/2022

NVIDIA and nouveau

Posted Oct 6, 2022 2:58 UTC (Thu) by dowdle (subscriber, #659) [Link]

Video available here: https://youtu.be/KkOdMwZRpYY?t=30721

NVIDIA and nouveau

Posted Oct 6, 2022 8:31 UTC (Thu) by marcH (subscriber, #57642) [Link] (1 responses)

> he has wondered if there might be a way to delay firmware loading until after the real root filesystem is mounted, but has not really worked that out yet.

Discussed in 2016:
https://lkml.iu.edu/hypermail/linux/kernel/1609.0/01530.html

https://chromium-review.googlesource.com/c/chromiumos/third_party/kernel/+/354089/3#message-6133ed6092de59e5a21cb25df03bdf1278a724d1

NVIDIA and nouveau

Posted Oct 10, 2022 9:05 UTC (Mon) by tiwai (subscriber, #39450) [Link]

Interesting that such a problem hasn't been solved yet.

I guess the easiest solution for now is to re-use the existing UMH of the firmware loader; basically you just need to enable CONFIG_FW_LOADER_USER_HELPER (but turn off CONFIG_FW_LOADER_USER_HELPER_FALLBACK), and modify / create a new helper to call request_firmware*() with FW_OPT_USERHELPER flag to allow the fallback via sysfs interface if the target firmware isn't found in initrd.
Then you can set up user-space as you like to pass the firmware at any feasible moment with a classic mechanism via sysfs. Systems without an extra setup would keep working as long as a target firmware is put in initrd, too (just like now).

NVIDIA and nouveau

Posted Oct 6, 2022 10:47 UTC (Thu) by gb (subscriber, #58328) [Link] (4 responses)

I can't believe that these firmware images are radically different, so there might/should be a way to compress them efficiently.

NVIDIA and nouveau

Posted Oct 6, 2022 11:30 UTC (Thu) by eru (subscriber, #2753) [Link] (3 responses)

The firmware images might already be compressed, in which case you cannot make them any smaller.

NVIDIA and nouveau

Posted Oct 6, 2022 13:23 UTC (Thu) by flussence (guest, #85566) [Link] (2 responses)

If they're encrypted as well as signed, we can forget about compressing them too.

I hope they don't end up part of the standard kernel firmware tarball. That's bloated enough as it is.

NVIDIA and nouveau

Posted Oct 7, 2022 1:17 UTC (Fri) by gb (subscriber, #58328) [Link] (1 responses)

All true, but nvidia itself should be interested in making smaller images.
So instead of providing N full signed blobs it could provide 1 big blob and N-1 delta blobs; two files would be loaded into the video card, and the hardware would deal with delta, compression and encryption.
So it should be safe and small in size.

NVIDIA and nouveau

Posted Oct 7, 2022 18:24 UTC (Fri) by zdzichu (subscriber, #17118) [Link]

Why would nvidia care about size? Current setup works for them.

NVIDIA and nouveau

Posted Oct 6, 2022 14:23 UTC (Thu) by mcon147 (subscriber, #56569) [Link] (11 responses)

Is nvidia unable to have a 'default' firmware on the device that it self loads without the host-system getting involved?

NVIDIA and nouveau

Posted Oct 6, 2022 16:38 UTC (Thu) by JoeBuck (subscriber, #2330) [Link] (9 responses)

It would be possible, but I don't see why they'd want to do that. The firmware is a highly complex piece of software; it will have bugs; after a very short time no one should be running the firmware that was originally shipped with the device.

Please pardon me if I'm interpreting your question wrong, but if the idea here is to enable the FSF fiction that if we don't give the user any way to modify proprietary firmware, even to fix severe bugs, and ship devices that put the firmware in ROM or effectively-ROM (nonvolatile memory that the OS provides no way to write to), we can claim to be running an entirely free system, that idea isn't worth promoting.

NVIDIA and nouveau

Posted Oct 6, 2022 18:36 UTC (Thu) by iabervon (subscriber, #722) [Link]

I could see an argument for having on-board firmware that interacts compatibly with the kernel driver and supports everything you'd want before mounting your real root partition or while booting off a rescue disk or running an installer. Once normal userspace is set up, it's a lot easier to supply large proprietary files that sometimes need to be updated, and makes it not the kernel's problem.

NVIDIA and nouveau

Posted Oct 6, 2022 19:11 UTC (Thu) by ncm (guest, #165) [Link] (5 responses)

If its buggy on-board firmware can anyway succeed at initializing the hardware adequately before being supplanted by the runtime blob, that would suffice. Then the blob doesn't need to be in initramfs.

NVIDIA and nouveau

Posted Oct 11, 2022 6:58 UTC (Tue) by marcH (subscriber, #57642) [Link] (4 responses)

> If its buggy on-board firmware can anyway succeed at initializing the hardware adequately before being supplanted by the runtime blob, that would suffice. Then the blob doesn't need to be in initramfs.

Exactly this. BTW this was discussed a lot in the Q&A session at the end of the presentation, the URL is above.

IMHO the key idea is to stop considering the GPU (and some others) as some "ancillary" device that should be fully initialized as early and quickly as possible. The CPU and GPU should instead be treated more like _peers_ in the "Distributed System on Chip", trying to boot at the same time with as few as possible early dependencies between each other.

There is clearly another, "full-blown" operating system in those 40 Megabytes; some Linux products are smaller than that!

So the GPU should have its own, basic "bootloader" that makes the screen just _usable_; the equivalent of UEFI on the main CPU. In fact you bet NVidia engineers have stuff like this internally _already_ because they need "bootloader" and minimal systems like this when they screw up the big image and it stops booting - exactly like when you fall back to UEFI and GRUB when you screw up the OS of the main CPU.

"NVidia must release option ROMs" was mentioned in the Q&A session.

NVIDIA and nouveau

Posted Oct 11, 2022 14:48 UTC (Tue) by luto (subscriber, #39314) [Link] (3 responses)

I would imagine that the UEFI framebuffer works even before the OS loads. It would be interesting to learn how this happens and what state the cards are in.

NVIDIA and nouveau

Posted Nov 6, 2022 4:02 UTC (Sun) by marcH (subscriber, #57642) [Link] (2 responses)

UEFI implementations are not called "Operating Systems" because they lack things like interrupts and a scheduler but they can do 10 times more than what MSDOS (MS Disk _Operating System_) ever did.

NVIDIA and nouveau

Posted Nov 6, 2022 10:22 UTC (Sun) by Wol (subscriber, #4433) [Link] (1 responses)

Because MS-DOS never claimed to be an _Operating System_. It operated the disk. That was all it claimed, it didn't claim to operate the computer ...

Cheers,
Wol

NVIDIA and nouveau

Posted Nov 10, 2022 19:02 UTC (Thu) by flussence (guest, #85566) [Link]

Precisely - DOS sits at the same layer GRUB does, it's the liminal space between the MBR and what you turned your computer on for, not a useful application unto itself.

(Suddenly the EFI Shell being designed the way it is makes a lot more sense to me…)

NVIDIA and nouveau

Posted Oct 7, 2022 6:37 UTC (Fri) by himi (subscriber, #340) [Link] (1 responses)

> Please pardon me if I'm interpreting your question wrong

I suspect the point of the question was to allow bringing up the card with that (hopefully small) default firmware, and then later on load the big blob from storage somewhere other than the initramfs and complete the bring up. It seems like a sensible option, though I'm not sure NVidia would want to bother with the work required to support it.

NVIDIA and nouveau

Posted Oct 7, 2022 7:47 UTC (Fri) by mjg59 (subscriber, #23239) [Link]

One approach would "simply" be for the driver to inherit whatever state has been configured by the firmware, and then defer full local init until firmware is available. It doesn't handle cases where the card hasn't been initialised before Linux starts, but that seems like a pretty niche scenario for desktop-oriented distros.

NVIDIA and nouveau

Posted Oct 7, 2022 17:13 UTC (Fri) by ju3Ceemi (subscriber, #102464) [Link]

Yes

For CPU, you have multiple ways to push a firmware in it:
- either via some bios update, which will permanently load the firmware
- or at runtime, from Linux

I cannot see why this is not the same for GPUs .. you can upgrade your "gpu bios" already

NVIDIA and nouveau

Posted Oct 6, 2022 16:49 UTC (Thu) by rgb (subscriber, #57129) [Link]

Regarding the firmware images: Would maybe some kind of delta compression work to reduce the total disk space required for storing multiple versions?

NVIDIA and nouveau

Posted Oct 7, 2022 16:29 UTC (Fri) by wsy (subscriber, #121706) [Link] (5 responses)

The firmware itself may be a complete Linux system image. I think that's a trend in the hardware industry.

NVIDIA and nouveau

Posted Oct 7, 2022 22:36 UTC (Fri) by mathstuf (subscriber, #69389) [Link] (2 responses)

That would just be asking for some…interesting GPLv2 source code requests, would it not?

NVIDIA and nouveau

Posted Oct 8, 2022 18:47 UTC (Sat) by wsy (subscriber, #121706) [Link] (1 responses)

Then you get a huge outdated tar ball without build script or tool chain.

NVIDIA and nouveau

Posted Oct 9, 2022 11:37 UTC (Sun) by mathstuf (subscriber, #69389) [Link]

Well, that'd be a vast improvement at least. But I don't think Nvidia would be so careless to allow such a "cheap shot" at their "crown jewels".

NVIDIA and nouveau

Posted Oct 11, 2022 10:37 UTC (Tue) by xnox (guest, #63320) [Link] (1 responses)

Upon basic inspection it appears to be RISC-V machine code rather than Linux-level application code.

NVIDIA and nouveau

Posted Oct 11, 2022 14:17 UTC (Tue) by wsy (subscriber, #121706) [Link]

Their network accelerator runs its own Linux system. *

It is possible their GPU goes the same route in the future.

* https://www.servethehome.com/a-quick-look-at-logging-into...

NVIDIA and nouveau

Posted Oct 13, 2022 13:30 UTC (Thu) by roblucid (guest, #48964) [Link]

Just another reason to NOT buy Nvidia.