Memory Leak in Distributed RPC nightly #65670
Labels
oncall: distributed
triaged
🐛 Bug
I have encountered a memory leak using PyTorch Distributed RPC on the latest nightly build. I have stateless workers that accept and execute RPCs, each running in its own process. As they receive (and complete) more and more RPCs, each worker slowly leaks GBs of memory. I have run tracemalloc to monitor their memory usage, but tracemalloc reports that the memory used by Python objects is not increasing significantly. This suggests that the leak is occurring inside a C extension somewhere in the PyTorch Distributed RPC implementation. I am unsure how to debug further where within PyTorch the memory leak is occurring.
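For reference, this is roughly how I sample Python-side allocations inside each worker process. It is a minimal sketch, not the exact code from the repro branch; the `report_memory` helper and the 60-second sampling interval are illustrative:

```python
import time
import tracemalloc

def report_memory(prev_snapshot=None):
    """Print total traced Python allocations and the largest per-line deltas."""
    snapshot = tracemalloc.take_snapshot()
    current, peak = tracemalloc.get_traced_memory()
    print(f"python heap: current={current / 1e6:.1f} MB, peak={peak / 1e6:.1f} MB")
    if prev_snapshot is not None:
        # Show which source lines allocated the most since the last sample.
        for stat in snapshot.compare_to(prev_snapshot, "lineno")[:10]:
            print(stat)
    return snapshot

tracemalloc.start()
snapshot = None
while True:
    time.sleep(60)  # sample periodically while the worker serves RPCs
    snapshot = report_memory(snapshot)
```

The per-line deltas reported here stay essentially flat while the resident memory of the worker process keeps growing, which is why I suspect the leak is outside the Python heap.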
To Reproduce
Steps to reproduce the behavior:
1. Check out the `tracemalloc` branch of https://github.com/JDBumgardner/stone_ground_hearth_battles
2. Run `python3 /home/jeremy/PycharmProjects/hearthstone_battlegrounds/hearthstone/training/pytorch/ppo.py` (the general worker pattern is sketched below)
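The worker setup that leaks is roughly the following. This is a minimal sketch with hypothetical names (`run_inference`, the localhost rendezvous defaults, and the request loop are illustrative); the real training code lives in the branch linked above:

```python
import os
import torch
import torch.distributed.rpc as rpc

def run_inference(tensor):
    """Hypothetical stateless handler: the worker keeps no state between calls."""
    return tensor * 2

def main():
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    rank = int(os.environ["RANK"])
    world_size = int(os.environ["WORLD_SIZE"])

    # Each worker runs in its own process and just serves incoming RPCs.
    rpc.init_rpc(f"worker{rank}", rank=rank, world_size=world_size)

    if rank == 0:
        # The driver repeatedly calls into the workers; resident memory on the
        # workers grows with the number of completed RPCs.
        for step in range(100_000):
            fut = rpc.rpc_async(
                f"worker{1 + step % (world_size - 1)}",
                run_inference,
                args=(torch.randn(128, 128),),
            )
            fut.wait()

    rpc.shutdown()

if __name__ == "__main__":
    main()
```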
Expected behavior
Memory usage levels out after the first iteration.
Environment
Collecting environment information...
PyTorch version: 1.10.0.dev20210922+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A
OS: Debian GNU/Linux 10 (buster) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: 11.0.1-2
CMake version: version 3.18.4
Libc version: glibc-2.31
Python version: 3.9.1+ (default, Jan 20 2021, 14:49:22) [GCC 10.2.1 20210110] (64-bit runtime)
Python platform: Linux-4.19.0-11-amd64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration: GPU 0: GeForce GTX 960
Nvidia driver version: 440.33.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.10.0.dev20210922+cu102
[pip3] torchaudio==0.10.0.dev20210922+cu102
[pip3] torchvision==0.11.0.dev20210922+cu102
[conda] Could not collect
cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @gcramer23