GPU hangs when setting sriov_numvfs · Issue #279 · strongtz/i915-sriov-dkms · GitHub

Open
leo9800 opened this issue Apr 1, 2025 · 7 comments

@leo9800
Contributor
leo9800 commented Apr 1, 2025

I have run into an issue with i915-sriov-dkms that results in GPU hangs.

My system information is listed below:

  • Intel NUC11PAHi5
  • Intel i5-1135G7, Xe iGPU
  • Arch Linux (Linux 6.13.8)
  • Gnome DE

In short, the machine is an Intel NUC11PAHi5 (i5-1135G7 CPU with Xe iGPU) running Arch Linux with GNOME as the desktop environment.

I followed the procedure below to enable SR-IOV for the iGPU:

  1. install i915-sriov-dkms from AUR
  2. append intel_iommu=on iommu=pt to kernel command line
  3. edit /etc/modprobe.d/i915.conf:
# Uncomment the 3 lines below to set parameters for i915 and avoid xe being loaded.
blacklist xe
options i915 enable_guc=3
options i915 max_vfs=7
  4. edit /etc/tmpfiles.d/i915-set-sriov-numvfs.conf:
#Path                                              Mode UID  GID  Age Argument
#Uncomment the next line and change the argument to the number of VFs you want
w /sys/devices/pci0000:00/0000:00:02.0/sriov_numvfs -    -    -    -   1
  5. reboot

After rebooting, gdm hangs, although switching to another tty with Ctrl+Alt+Fx is still possible.

I also tried commenting out the line in /etc/tmpfiles.d/i915-set-sriov-numvfs.conf that sets sriov_numvfs. After a reboot, the system worked fine.

I then set the value manually by running echo 7 | sudo tee /sys/devices/pci0000:00/0000:00:02.0/sriov_numvfs.

After doing so, gnome crashed, but everything worked like a charm once gnome restarted.

# gnome crash log
Mar 20 01:19:44 Leo-NUC kernel: i915 0000:00:02.0: Enabled 7 VFs
Mar 20 01:19:43 Leo-NUC gnome-shell[1816]: Failed to hotplug secondary gpu '/dev/dri/renderD129': GDBus.Error:System.Error.ENODEV: No such device
Mar 20 01:19:44 Leo-NUC gnome-shell[1816]: g_close(fd:0) failed with EBADF. The tracking of file descriptors got messed up
Mar 20 01:19:44 Leo-NUC gnome-shell[1816]: Failed to hotplug secondary gpu '/dev/dri/card0': No suitable mode setting backend found
Mar 20 01:19:44 Leo-NUC gnome-shell[1816]: Failed to hotplug secondary gpu '/dev/dri/renderD130': GDBus.Error:System.Error.ENODEV: No such device
Mar 20 01:19:44 Leo-NUC gnome-shell[1816]: Failed to hotplug secondary gpu '/dev/dri/renderD131': GDBus.Error:System.Error.ENODEV: No such device
Mar 20 01:19:44 Leo-NUC gnome-shell[1816]: g_close(fd:0) failed with EBADF. The tracking of file descriptors got messed up
Mar 20 01:19:44 Leo-NUC gnome-shell[1816]: Failed to hotplug secondary gpu '/dev/dri/card2': No suitable mode setting backend found
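
The manual write to sriov_numvfs can be wrapped in a small helper. This is only a sketch: the function name set_numvfs is hypothetical, and the default path is the iGPU path used in this issue. It checks the request against sriov_totalvfs and writes 0 first when VFs are already enabled, because the PCI core refuses to go directly from one non-zero VF count to another.

```shell
#!/bin/sh
# Sketch: set the VF count of a PCI device through sysfs.
# set_numvfs is a hypothetical helper name; the default device path is the
# iGPU path from this issue.
set_numvfs() {
    want=$1
    dev_dir=${2:-/sys/devices/pci0000:00/0000:00:02.0}

    if [ ! -f "$dev_dir/sriov_numvfs" ]; then
        echo "no sriov_numvfs under $dev_dir" >&2
        return 1
    fi

    # sriov_totalvfs is the maximum VF count the device reports.
    total=$(cat "$dev_dir/sriov_totalvfs") || return 1
    if [ "$want" -gt "$total" ]; then
        echo "requested $want VFs, device supports at most $total" >&2
        return 1
    fi

    current=$(cat "$dev_dir/sriov_numvfs") || return 1
    if [ "$current" -eq "$want" ]; then
        return 0
    fi

    # Changing from one non-zero count to another is rejected, so disable first.
    if [ "$current" -ne 0 ]; then
        echo 0 > "$dev_dir/sriov_numvfs" || return 1
    fi
    echo "$want" > "$dev_dir/sriov_numvfs"
}
```

Run as root, for example set_numvfs 7. The second argument exists only so the helper can be pointed at another device directory.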

I own another NUC of the identical model, which runs as a server without any DE (it boots to a tty login prompt). I tried the procedure above on that system as well, and everything seems to work fine: no hangs, no freezes.

I also swapped the SSDs (which hold the Arch Linux installations) between the two systems to rule out a hardware issue. The server (now with the desktop's SSD) hangs when launching gdm, while the desktop (now with the server's SSD) works perfectly.

@leo9800
Contributor Author
leo9800 commented Apr 1, 2025

Also, if I reboot the desktop after running sudo systemctl disable gdm.service (which makes it equivalent to the server installation as far as the DE/graphics stack is concerned), it works and no longer hangs.

Unfortunately, these two NUCs of the same model are the only devices I own that support SR-IOV-based Intel vGPU (not the now-deprecated GVT-g).

So I have no way to try to reproduce this on another platform or CPU.

@bbaa-bbaa
Contributor

I think this might be related to gnome/mutter not handling hotplug of VFs correctly.
Maybe you could disable mutter's hotplug detection, but I can't find an option for that.

A possible solution is to delay the startup of gdm, or to load the i915 module as early as possible and enable the VFs at that point.
For the latter, use early KMS and set the sysfs parameter from a runtime hook.

@leo9800
Contributor Author
leo9800 commented Apr 1, 2025

@bbaa-bbaa Thanks for your suggestions!

I did not try the first approach (delaying gdm until after sriov_numvfs has been written in sysfs), but I went with the second one using the following procedure:

  1. ensure the kernel cmdline contains: intel_iommu=on iommu=pt i915.enable_guc=3 i915.max_vfs=7 module_blacklist=xe
  2. create /etc/initcpio/install/i915-sriov with:
#!/bin/bash

# Build hook: pull the i915 module and the runtime hook into the initramfs.
build() {
	add_module i915
	add_runscript
}

help() {
	echo i915 with SR-IOV support
}
  3. create /etc/initcpio/hooks/i915-sriov with:
#!/usr/bin/ash

# Runtime hook: load i915, wait up to 3 seconds for the sysfs node, then enable 3 VFs.
run_hook() {
	modprobe i915
	i=0
	while [ ! -f '/sys/devices/pci0000:00/0000:00:02.0/sriov_numvfs' ]; do
		if [ "$i" -eq 3 ]; then
			# return, not exit: the hook is sourced by the initramfs init script
			return 1
		fi
		sleep 1
		i=$((i+1))
	done
	echo 3 > '/sys/devices/pci0000:00/0000:00:02.0/sriov_numvfs'
}
  4. add i915-sriov to the HOOKS=() list in mkinitcpio.conf
  5. sudo mkinitcpio -P
  6. reboot

*: credit for steps 2 and 3: https://mop.koeln/blog/custom-mkinitcpio-hooks/
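
To confirm that the VFs really exist before gdm starts, one option is to count the virtfnN links that the kernel creates under the physical function's sysfs directory. This is a sketch only: count_vfs is a hypothetical helper name, and the default path is the iGPU path used above.

```shell
#!/bin/sh
# Sketch: count enabled VFs by counting the virtfn* symlinks under the PF.
# count_vfs is a hypothetical helper name.
count_vfs() {
    dev_dir=${1:-/sys/devices/pci0000:00/0000:00:02.0}
    n=0
    for link in "$dev_dir"/virtfn*; do
        # With no match the glob stays literal and fails both tests.
        [ -e "$link" ] || [ -L "$link" ] || continue
        n=$((n + 1))
    done
    echo "$n"
}
```

Running count_vfs from a tty right after boot should print the number written by the hook if the hook took effect.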

The issue persists. gnome-shell emits a flood of errors like "Page flip failed: drmModeAtomicCommit: No space left on device", and gdm refuses to work properly.

Apr 02 01:18:19 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:19 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:19 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:19 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:19 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:19 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:20 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:20 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:20 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:20 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:20 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:20 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:20 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:20 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:20 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:20 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:20 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:20 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:20 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:20 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:20 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:20 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:20 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:20 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:20 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:20 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:21 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:21 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:21 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:21 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:21 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:18:21 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device

@leo9800
Contributor Author
leo9800 commented Apr 1, 2025

Sorry, maybe I did not describe the issue clearly.

There are basically two failure patterns I have noticed.


The first one occurs when sriov_numvfs is set to a non-zero value after gdm has started, the user has logged in, and the gnome session is running. The gnome session crashes and the gdm login screen appears again. After logging back in, everything seems to work properly.

The related logs look like:

Mar 20 01:19:43 Leo-NUC gnome-shell[1816]: Failed to hotplug secondary gpu '/dev/dri/renderD129': GDBus.Error:System.Error.ENODEV: No such device
Mar 20 01:19:44 Leo-NUC gnome-shell[1816]: g_close(fd:0) failed with EBADF. The tracking of file descriptors got messed up
Mar 20 01:19:44 Leo-NUC gnome-shell[1816]: Failed to hotplug secondary gpu '/dev/dri/card0': No suitable mode setting backend found
Mar 20 01:19:44 Leo-NUC gnome-shell[1816]: Failed to hotplug secondary gpu '/dev/dri/renderD130': GDBus.Error:System.Error.ENODEV: No such device
Mar 20 01:19:44 Leo-NUC gnome-shell[1816]: Failed to hotplug secondary gpu '/dev/dri/renderD131': GDBus.Error:System.Error.ENODEV: No such device
Mar 20 01:19:44 Leo-NUC gnome-shell[1816]: g_close(fd:0) failed with EBADF. The tracking of file descriptors got messed up
Mar 20 01:19:44 Leo-NUC gnome-shell[1816]: Failed to hotplug secondary gpu '/dev/dri/card2': No suitable mode setting backend found

I suspect @bbaa-bbaa's suggestions were aimed at this failure pattern, since the error log mentions GPU hotplugging.


The second failure pattern occurs when sriov_numvfs is set before gdm starts. The gdm login screen hangs and there is no way to log in, but the ttys work fine.

The related logs look like:

Apr 02 01:16:55 Leo-NUC gnome-shell[970]: Thread 'KMS thread' will be using high priority scheduling
Apr 02 01:16:55 Leo-NUC gnome-shell[970]: Device '/dev/dri/card1' prefers shadow buffer
Apr 02 01:16:55 Leo-NUC gnome-shell[970]: Added device '/dev/dri/card1' (i915) using atomic mode setting.
Apr 02 01:16:55 Leo-NUC gnome-shell[970]: g_close(fd:0) failed with EBADF. The tracking of file descriptors got messed up
Apr 02 01:16:55 Leo-NUC gnome-shell[970]: Failed to open gpu '/dev/dri/card0': No suitable mode setting backend found
Apr 02 01:16:55 Leo-NUC gnome-shell[970]: g_close(fd:0) failed with EBADF. The tracking of file descriptors got messed up
Apr 02 01:16:55 Leo-NUC gnome-shell[970]: Failed to open gpu '/dev/dri/card2': No suitable mode setting backend found
Apr 02 01:16:55 Leo-NUC gnome-shell[970]: g_close(fd:0) failed with EBADF. The tracking of file descriptors got messed up
Apr 02 01:16:55 Leo-NUC gnome-shell[970]: Failed to open gpu '/dev/dri/card3': No suitable mode setting backend found
Apr 02 01:16:55 Leo-NUC gnome-shell[970]: Created gbm renderer for '/dev/dri/card1'
Apr 02 01:16:55 Leo-NUC gnome-shell[970]: Boot VGA GPU /dev/dri/card1 selected as primary
Apr 02 01:16:55 Leo-NUC gnome-shell[970]: Obtained a high priority EGL context
# unrelated
Apr 02 01:16:55 Leo-NUC gnome-shell[970]: Using public X11 display :1024, (using :1025 for managed services)
Apr 02 01:16:55 Leo-NUC gnome-shell[970]: Using Wayland display name 'wayland-0'
# unrelated
Apr 02 01:16:56 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:16:56 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:16:56 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
Apr 02 01:16:56 Leo-NUC gnome-shell[970]: Page flip failed: drmModeAtomicCommit: No space left on device
# repeated 'Page flip failed' ...
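
The two failure patterns can be told apart by the messages that appear in the gnome-shell log. The sketch below reads log text on stdin and prints which pattern it matches. The function name and the labels hotplug-crash and page-flip-hang are made up for illustration; the matched strings are taken from the logs quoted above.

```shell
#!/bin/sh
# Sketch: classify gnome-shell log text (stdin) into the two failure patterns.
# The function name and the output labels are hypothetical.
classify_gnome_shell_log() {
    log=$(cat)
    result=""
    case $log in
        *"Failed to hotplug secondary gpu"*) result="hotplug-crash" ;;
    esac
    case $log in
        *"Page flip failed: drmModeAtomicCommit"*)
            result="${result:+$result,}page-flip-hang" ;;
    esac
    echo "${result:-unknown}"
}
```

A possible invocation is journalctl -b _COMM=gnome-shell --no-pager | classify_gnome_shell_log.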

@bbaa-bbaa
Contributor

Perhaps we need to set sriov_numvfs after gdm has started and before the gnome session launches.
This can be achieved by creating a systemd unit file with After=gdm.service and removing every other mechanism that sets sriov_numvfs (tmpfiles, initcpio, etc.).
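
A unit of that kind might be generated as sketched below. The unit name i915-sriov-numvfs.service, the function name, and the choice of WantedBy=graphical.target are assumptions, not something taken from this repo. Note that After=gdm.service only orders the unit after gdm.service has been started by systemd; it says nothing about whether the greeter is already on screen, and whether this avoids the hang is untested here.

```shell
#!/bin/sh
# Sketch: write a oneshot unit that sets sriov_numvfs after gdm.service.
# Unit name, function name and WantedBy target are assumptions.
write_sriov_unit() {
    unit_path=${1:-/etc/systemd/system/i915-sriov-numvfs.service}
    num_vfs=${2:-1}
    cat > "$unit_path" <<EOF
[Unit]
Description=Enable i915 SR-IOV VFs after gdm has started
After=gdm.service

[Service]
Type=oneshot
ExecStart=/bin/sh -c 'echo $num_vfs > /sys/devices/pci0000:00/0000:00:02.0/sriov_numvfs'

[Install]
WantedBy=graphical.target
EOF
}
```

After writing the file as root, systemctl daemon-reload followed by systemctl enable on the unit would activate it for the next boot.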

@leo9800
Contributor Author
leo9800 commented Apr 5, 2025

@bbaa-bbaa Thanks (again) for your suggestion.

I actually tried this: I wrote a shell script that sets sriov_numvfs, created a systemd unit that invokes it, and added After=gdm.service to the unit file. No luck, though; gdm still hangs after the script runs.

Further investigation shows that whenever gdm is running with no user logged in (that is, gdm is sitting at the login page), setting sriov_numvfs either before or after gdm starts leads to a locked-up gdm login page.

It seems the only way is to set sriov_numvfs after the user has logged in and gnome-shell has started. Gnome shell then crashes and returns to the gdm login page, but, magically, there is no more freeze.

@leo9800
Contributor Author
leo9800 commented Apr 5, 2025

It's probably time to give up on Intel GPU virtualization on my workstation and just move the stack to my server build, which works like a charm following the README of this repo. :-)

Besides, I also tried setting sriov_drivers_autoprobe to 0 (the default is 1) to keep the VFs from being probed by i915 on the host. (This is what I did when enabling SR-IOV for Mellanox NICs, to keep the VFs from being probed by mlx5_core and later managed by network configuration tools such as systemd-networkd.) But again, it did not solve the problem.
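
For reference, that attempt can be sketched as follows. The function name is hypothetical, and the default path is the iGPU path used throughout this issue. The autoprobe flag only affects VFs created after it is written, so it has to be set before sriov_numvfs. As noted above, this did not resolve the hang.

```shell
#!/bin/sh
# Sketch: create VFs without letting the host driver bind to them.
# enable_vfs_without_host_probe is a hypothetical helper name.
enable_vfs_without_host_probe() {
    num_vfs=$1
    dev_dir=${2:-/sys/devices/pci0000:00/0000:00:02.0}

    # Must be written before the VFs are created.
    echo 0 > "$dev_dir/sriov_drivers_autoprobe" || return 1
    echo "$num_vfs" > "$dev_dir/sriov_numvfs"
}
```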
