Memory Leak in Distributed RPC nightly · Issue #65670 · pytorch/pytorch · GitHub
Memory Leak in Distributed RPC nightly #65670


Open
jeremysalwen opened this issue Sep 26, 2021 · 3 comments
Labels
oncall: distributed (add this issue/PR to the distributed oncall triage queue)
triaged (this issue has been looked at by a team member, and triaged and prioritized into an appropriate module)

Comments

jeremysalwen commented Sep 26, 2021

🐛 Bug

I have encountered a memory leak using PyTorch Distributed RPC on the latest nightly build. I have stateless workers that accept and execute RPCs, each running in its own process. As they receive (and complete) more and more RPCs, they each slowly leak GBs of memory. I have run tracemalloc to monitor their memory usage, but tracemalloc reports that the memory used by Python objects is not increasing significantly. This suggests that the leak is occurring inside a C extension somewhere in the PyTorch Distributed RPC implementation. I am unsure how to debug further where within PyTorch the leak is occurring.
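
For reference, the tracemalloc monitoring mentioned above follows the usual snapshot-diff pattern; the sketch below is a minimal illustration of that kind of check (where the snapshots sit relative to the worker's RPC handling is an assumption, not the exact code in the linked branch):

```python
import tracemalloc

tracemalloc.start()
baseline = tracemalloc.take_snapshot()

# ... the worker accepts and executes RPCs for a while ...

snapshot = tracemalloc.take_snapshot()
# Print the top allocation sites by growth since the baseline. In this case
# they stay roughly flat even though the process RSS keeps climbing, which
# points at allocations outside the Python heap.
for stat in snapshot.compare_to(baseline, "lineno")[:10]:
    print(stat)
```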

To Reproduce

Steps to reproduce the behavior:

  1. Check out the tracemalloc branch of https://github.com/JDBumgardner/stone_ground_hearth_battles
  2. Run python3 /home/jeremy/PycharmProjects/hearthstone_battlegrounds/hearthstone/training/pytorch/ppo.py
  3. Monitor the memory usage over the course of an hour, and view the tracemalloc output printed to the console (a stripped-down sketch of the underlying RPC call pattern follows this list)
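
The workload uses the standard torch.distributed.rpc trainer/worker setup. The sketch below illustrates only that general pattern; the process names, the run_inference helper, and the loop count are placeholders rather than code from the linked repository:

```python
import os

import torch
import torch.distributed.rpc as rpc
import torch.multiprocessing as mp


def run_inference(x):
    # Stateless work executed on the worker process; stands in for the real workload.
    return torch.relu(x).sum().item()


def main(rank, world_size):
    os.environ["MASTER_ADDR"] = "localhost"
    os.environ["MASTER_PORT"] = "29500"
    if rank == 0:
        rpc.init_rpc("trainer", rank=rank, world_size=world_size)
        # The worker's memory keeps growing as it completes more and more of
        # these calls, even though it holds no state between them.
        for _ in range(100_000):
            rpc.rpc_sync("worker", run_inference, args=(torch.randn(64, 64),))
    else:
        rpc.init_rpc("worker", rank=rank, world_size=world_size)
    rpc.shutdown()


if __name__ == "__main__":
    mp.spawn(main, args=(2,), nprocs=2)
```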

Expected behavior

Memory usage levels out after the first iteration.

Environment

Collecting environment information...
PyTorch version: 1.10.0.dev20210922+cu102
Is debug build: False
CUDA used to build PyTorch: 10.2
ROCM used to build PyTorch: N/A

OS: Debian GNU/Linux 10 (buster) (x86_64)
GCC version: (Debian 10.2.1-6) 10.2.1 20210110
Clang version: 11.0.1-2
CMake version: version 3.18.4
Libc version: glibc-2.31

Python version: 3.9.1+ (default, Jan 20 2021, 14:49:22) [GCC 10.2.1 20210110] (64-bit runtime)
Python platform: Linux-4.19.0-11-amd64-x86_64-with-glibc2.31
Is CUDA available: True
CUDA runtime version: 10.2.89
GPU models and configuration: GPU 0: GeForce GTX 960
Nvidia driver version: 440.33.01
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A

Versions of relevant libraries:
[pip3] numpy==1.19.5
[pip3] torch==1.10.0.dev20210922+cu102
[pip3] torchaudio==0.10.0.dev20210922+cu102
[pip3] torchvision==0.11.0.dev20210922+cu102
[conda] Could not collect

cc @pietern @mrshenli @pritamdamania87 @zhaojuanmao @satgera @rohan-varma @gqchen @aazzolini @osalpekar @jiayisuse @SciPioneer @H-Huang @gcramer23

facebook-github-bot added the oncall: distributed label Sep 26, 2021
wayi1 (Contributor) commented Sep 27, 2021

Thanks for reporting this! Does the bug appear only on the latest nightly build, or also on the latest stable release?

wayi1 added the triaged label Sep 27, 2021
jeremysalwen (Author) commented

Yes, this also appears in the latest stable.

mrshenli (Contributor) commented

Hey @jeremysalwen, thanks for reporting this.

What layers (conv, linear, etc.) are used in your model? Is it CPU or GPU training? Does controlling the number of RPC threads help? (See the discussion in this issue: #61920.)

And could you please confirm that there is no leak when running your model in multi-threaded mode? (See the discussion at #64412.)
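
For reference, the RPC thread count mentioned in the question above can be capped through the backend options. A minimal sketch, assuming the default TensorPipe backend (the value 8 and the init_rpc arguments are arbitrary placeholders):

```python
import torch.distributed.rpc as rpc

# Limit the number of threads the RPC agent uses to execute incoming
# requests (lower than the default); see the discussion in #61920.
options = rpc.TensorPipeRpcBackendOptions(num_worker_threads=8)
rpc.init_rpc(
    "worker",
    rank=1,
    world_size=2,
    rpc_backend_options=options,
)
```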
