Observing GPU memory and/or CPU OS memory leaks with `use_persistent_mapping` enabled in `gdrdrv` during multi-process termination · Issue #313 · NVIDIA/gdrcopy
Open
@realarnavgoel

Description

Impacted platform
All server side products, first observed on Grace-Hopper based system

Impacted gdrcopy versions
2.4.1, 2.4.2, 2.4.3

Impacted gdrcopy configs
gdrdrv driver loaded with the module parameter use_persistent_mapping=1

Scenarios
If gdrcopy persistent mapping mode is enabled,

  1. If one process opens a connection to the driver (via gdr_open) and intends to explicitly share that connection (over a UNIX domain socket) with one or more other processes, then the cleanup of the driver resources (via gdr_close) may be executed by one of the non-owning processes. That request is silently ignored, which leads to CPU and GPU memory leaks.

  2. If a parent process A forks one or more child processes B without a subsequent exec (plain fork instead of Linux fork + exec), then B may attempt to close connections opened by A during an ungraceful termination of the processes via signals (SIGSEGV or SIGKILL), resulting in OS and GPU memory leaks.
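The precondition for scenario 2 is ordinary POSIX behavior: a forked child that does not exec keeps a copy of every descriptor the parent had open. The sketch below illustrates only that inheritance step. It uses `os.devnull` as a stand-in for the driver connection (an assumption for illustration; a real connection would come from gdr_open), and the function name `child_sees_parent_fd` is hypothetical. It does not touch gdrdrv and does not reproduce the leak.

```python
import os


def child_sees_parent_fd() -> bool:
    """Return True if a forked child can still use a descriptor opened by the parent."""
    # Stand-in for the driver connection (assumption: not a real gdrdrv handle).
    fd = os.open(os.devnull, os.O_RDONLY)

    pid = os.fork()
    if pid == 0:
        # Child: no exec happened, so the parent's descriptor table was copied.
        try:
            os.fstat(fd)   # succeeds only if the descriptor is valid here
            os._exit(0)
        except OSError:
            os._exit(1)

    # Parent: reap the child, then release its own copy of the descriptor.
    _, status = os.waitpid(pid, 0)
    os.close(fd)
    return os.WIFEXITED(status) and os.WEXITSTATUS(status) == 0


if __name__ == "__main__":
    print(child_sees_parent_fd())
```

Because the child holds its own reference to the same open connection, any teardown it triggers on exit is issued from a process that never called the open itself, which is the situation described in the list above. Descriptors that Python opens are close-on-exec, so a fork followed by exec would not carry them over; whether the same holds for the driver connection is an assumption here, though it is consistent with the issue calling out fork without exec.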

When persistent mode is disabled, which is the default, GPU resource cleanup in both scenarios is performed through an independent workflow in the CUDA driver, so dropping the request to close the connection is benign.

Irrespective of persistent mode, this bug may lead to small CPU kernel memory leaks.

Signature of the defect

  • On coherent platforms, e.g. Grace Hopper systems, GPU memory leaks can lead to unexpected side effects. For example, turning off the nvidia-persistenced service may hang, and recovering requires rebooting the machine.
  • On non-coherent platforms, GPU memory leaks may reduce the functionality or performance of CUDA applications.

Known mitigations
Turn off persistent mapping by setting the driver module parameter use_persistent_mapping=0 and reloading the driver.

Fixed gdrcopy version
2.4.4
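
For fleet audits, the version and configuration facts listed in this issue can be encoded as a small check. This is a minimal sketch: the helper name `exposed_to_gpu_leak` is hypothetical, the caller is assumed to already know the installed gdrcopy version and the value the module was loaded with, and only the versions named in this issue are treated as affected.

```python
# Versions and fix release exactly as listed in this issue.
AFFECTED_VERSIONS = {"2.4.1", "2.4.2", "2.4.3"}
FIXED_VERSION = "2.4.4"


def exposed_to_gpu_leak(gdrcopy_version: str, use_persistent_mapping: int) -> bool:
    """Return True if this version/config pair matches the impacted setup above."""
    # Accept tags such as "v2.4.2" as well as plain "2.4.2".
    normalized = gdrcopy_version.strip().lstrip("v")
    return normalized in AFFECTED_VERSIONS and use_persistent_mapping == 1


if __name__ == "__main__":
    for version, flag in [("2.4.2", 1), ("2.4.2", 0), (FIXED_VERSION, 1)]:
        print(version, flag, exposed_to_gpu_leak(version, flag))
```

The check covers only the GPU memory leak tied to persistent mapping. The small CPU kernel memory leak noted above is independent of that setting, so an affected version with use_persistent_mapping=0 still benefits from upgrading.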
