Description
Impacted platform
All server-side products; first observed on Grace Hopper based systems
Impacted gdrcopy versions
2.4.1, 2.4.2, 2.4.3
Impacted gdrcopy configs
gdrdrv driver loaded with the module parameter use_persistent_mapping=1
Scenarios
If gdrcopy persistent mapping mode is enabled:
- If one process opens a connection to the driver (via gdr_open) and intends to explicitly share that connection with one or more other processes (using a UNIX domain socket), then the cleanup of the driver resources (via gdr_close) may be executed by one of the non-owning processes. That request is silently ignored, leading to CPU and GPU memory leaks.
- If a parent process A forks one or more child processes B (instead of Linux fork+exec), then B may attempt to close connections opened by A during an ungraceful termination of the processes via signals (SIGSEGV or SIGKILL), resulting in OS and GPU memory leaks.
When persistent mapping mode is disabled (the default), GPU resource cleanup in both scenarios is performed through an independent workflow in the CUDA driver, so dropping the request to close the connection is benign.
Irrespective of persistent mode, this bug may lead to small CPU kernel memory leaks.
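The second scenario depends on a general Linux behavior: a child created by fork alone receives a copy of the parent's open file descriptors, so it holds a handle to any connection the parent opened. The following is a minimal sketch of that behavior only. It does not call gdrcopy; the file /dev/null is a stand-in for the driver connection, and the function name child_sees_parent_fd is hypothetical.

```python
import os


def child_sees_parent_fd():
    """Return True if a forked child can still use an fd opened by the parent."""
    # Stand-in for a driver connection opened by parent process A.
    fd = os.open(os.devnull, os.O_RDONLY)
    # Pipe used by the child to report what it observed.
    report_r, report_w = os.pipe()

    pid = os.fork()
    if pid == 0:
        # Child process B: created by fork without exec, so it works on a
        # copy of A's descriptor table. Close-on-exec flags have no effect
        # here because no exec takes place.
        os.close(report_r)
        try:
            os.fstat(fd)  # succeeds only if fd is open in this process
            os.write(report_w, b"1")
        except OSError:
            os.write(report_w, b"0")
        os._exit(0)

    # Parent process A: collect the child's report and clean up.
    os.close(report_w)
    result = os.read(report_r, 1)
    os.waitpid(pid, 0)
    os.close(report_r)
    os.close(fd)
    return result == b"1"


if __name__ == "__main__":
    print("child inherited the parent's descriptor:", child_sees_parent_fd())
```

Because the child holds the same descriptor, a close issued from the child during teardown reaches the driver as a request from a process that did not open the connection, which is the situation described above.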
Signature of the defect
- On coherent platforms, e.g. Grace Hopper systems, GPU memory leaks can lead to unexpected side effects. For example, turning off the nvidia-persistenced service may hang, requiring a reboot of the machine.
- On non-coherent platforms, GPU memory leaks may reduce the functionality or performance of CUDA applications.
Known mitigations
Turn off persistent mapping mode by setting the driver module parameter use_persistent_mapping=0 and reloading the driver.
Fixed gdrcopy version
2.4.4
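For fleet audits, the version and configuration lists in this notice can be encoded as a small check. This is a sketch only: the function names parse_version and assess are hypothetical, and how the installed version string and the module parameter value are obtained is left to the caller.

```python
# Versions listed as impacted in this notice.
IMPACTED_VERSIONS = {(2, 4, 1), (2, 4, 2), (2, 4, 3)}


def parse_version(text):
    """Turn a string such as '2.4.3' or 'v2.4.3' into the tuple (2, 4, 3)."""
    return tuple(int(part) for part in text.strip().lstrip("v").split("."))


def assess(version, use_persistent_mapping):
    """Classify exposure using only the lists given in this notice."""
    if parse_version(version) not in IMPACTED_VERSIONS:
        return "not listed as impacted"
    if use_persistent_mapping:
        # Persistent mapping enabled: both scenarios can leak GPU memory.
        return "impacted: CPU and GPU memory leaks possible"
    # Persistent mapping disabled: only the small kernel memory leak remains.
    return "impacted: small CPU kernel memory leaks possible"


if __name__ == "__main__":
    for version, persistent in [("2.4.3", True), ("2.4.3", False), ("2.4.4", True)]:
        print(version, persistent, "->", assess(version, persistent))
```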