Memory Leak in Distributed RPC nightly · Issue #65670 · pytorch/pytorch · GitHub
Memory Leak in Distributed RPC nightly #65670


Open
jeremysalwen opened this issue Sep 26, 2021 · 3 comments
Labels
oncall: distributed (add this issue/PR to the distributed oncall triage queue)
triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

jeremysalwen commented Sep 26, 2021

🐛 Bug

I have encountered a memory leak using PyTorch Distributed RPC on the latest nightly build. I have stateless workers that accept and execute RPCs, each running in its own process. As they receive (and complete) more and more RPCs, they each slowly leak GBs of memory. I have run tracemalloc to monitor their memory usage, but tracemalloc reports that the memory used by Python objects is not increasing significantly. This suggests that the leak is occurring inside a C extension somewhere in the PyTorch Distributed RPC implementation. I am unsure how to debug further where within PyTorch the leak is occurring.
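
For reference, the tracemalloc monitoring mentioned above follows the usual snapshot-diff pattern; the sketch below is a minimal illustration of that kind of check (where the snapshots sit relative to the worker's RPC handling is an assumption, not the exact code in the linked branch):

```python
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# ... the worker accepts and executes RPCs for a while ...

snapshot = tracemalloc.take_snapshot()
# Print the top allocation sites by growth since the baseline. In this case
# they stay roughly flat even though the process RSS keeps climbing, which
# points at allocations outside the Python heap.
for stat in snapshot.compare_to(baseline, "lineno")[:10]:
    print(stat)
```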

To Reproduce

Steps to reproduce the behavior:

  1. Check out the tracemalloc branch of https://github.com/JDBumgardner/stone_ground_hearth_battles
  2. Run python3 /home/jeremy/PycharmProjects/hearthstone_battlegrounds/hearthstone/training/pytorch/ppo.py
  3. Monitor the memory usage over the course of an hour, and view the tracemalloc output printed to the console (a stripped-down sketch of the underlying RPC call pattern follows this list)
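
The workload uses the standard torch.distributed.rpc trainer/worker setup. The sketch below illustrates only that general pattern; the process names, the run_inference helper, and the loop count are placeholders rather than code from the linked repository:

```python
import os

import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp


def run_inference(x):
    # Stateless work executed on the worker process; stands in for the real workload.
    return torch.relu(x).sum().item()


def main(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    if rank == 0:
        rpc.init_rpc("trainer", rank=rank, world_size=world_size)
        # The worker's memory keeps growing as it completes more and more of
        # these calls, even though it holds no state between them.
        for _ in range(100_000):
            rpc.rpc_sync("worker", run_inference, args=(torch.randn(64, 64),))
    else:
        rpc.init_rpc("worker", rank=rank, world_size=world_size)
    rpc.shutdown()


if __name__ == "__main__":
    mp.spawn(main, args=(2,), nprocs=2)
```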

Expected behavior

Memory usage levels out after the first iteration.

Environment

Collecting environment information...
PyTorch version: 1.10.0.dev20210922+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 10 (buster) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: 11.0.1-2
CMake version: version 3.18.4
Libc version: glibc-2.31

Python version: 3.9.1+ (default, Jan 20 2021, 14:49:22) [GCC 10.2.1 20210110] (64-bit runtime)
Python platform: Linux-4.19.0-11-amd64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration: GPU 0: GeForce GTX 960
Nvidia driver version: 440.33.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.10.0.dev20210922+cu102
[pip3] torchaudio==0.10.0.dev20210922+cu102
[pip3] torchvision==0.11.0.dev20210922+cu102
[conda] Could not collect

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @gcramer23

facebook-github-bot added the oncall: distributed label Sep 26, 2021
wayi1 (Contributor) commented Sep 27, 2021

Thanks for reporting this! Does the bug appear only on the latest nightly build, or also on the latest stable release?

wayi1 added the triaged label Sep 27, 2021
jeremysalwen (Author) commented

Yes, this also appears in the latest stable.

mrshenli (Contributor) commented

Hey @jeremysalwen, thanks for reporting this.

What layers (conv, linear, etc.) are used in your model? Is it CPU or GPU training? Does controlling the number of RPC threads help? (See the discussion in this issue: #61920.)

And could you please confirm that there is no leak when running your model in multi-threaded mode? (See the discussion at #64412.)
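
For reference, the RPC thread count mentioned in the question above can be capped through the backend options. A minimal sketch, assuming the default TensorPipe backend (the value 8 and the init_rpc arguments are arbitrary placeholders):

```python
import torch.distributed.rpc as rpc

# Limit the number of threads the RPC agent uses to execute incoming
# requests (lower than the default); see the discussion in #61920.
options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=8)
rpc.init_rpc(
    "worker",
    rank=1,
    world_size=2,
    rpc_backend_options=options,
)
```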
