-
Notifications
You must be signed in to change notification settings - Fork 24.4k
Fix persistent worker exits before pin_memory thread #71579
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
🔗 Helpful links
💊 CI failures summary and remediationsAs of commit 0802991 (more details on the Dr. CI page): 💚 💚 Looks good so far! There are no failures yet. 💚 💚 This comment was automatically generated by Dr. CI (expand for details).Please report bugs/suggestions to the (internal) Dr. CI Users group. |
Fixes #1551 As the comment in the code, adding a function to manually delete the iterator with persistent worker. It would invoke `__del__` inside Iterator object and make sure pin_memory_thread exits before worker process. I choose using `atexit` rather than adding `__del__` to DataLoader because it's not safe as the destructor function may not be invoked when Python interpreter exits. [ghstack-poisoned]
Fixes #1551 As the comment in the code, adding a function to manually delete the iterator with persistent worker. It would invoke `__del__` inside Iterator object and make sure pin_memory_thread exits before worker process. I choose using `atexit` rather than adding `__del__` to DataLoader because it's not safe as the destructor function may not be invoked when Python interpreter exits. [ghstack-poisoned]
torch/utils/data/dataloader.py
Outdated
# In some rare cases, persistent workers (daemonic processes) | ||
# would be terminated before `__del__` of iterator is invoked | ||
# when main process exits | ||
# It would cause failure when pin_memory_thread tries to read | ||
# corrupted data from worker_result_queue | ||
# atexit is used to prevent persistent workers exiting before | ||
# pin_memory_thread |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the comment.
test/test_dataloader.py
Outdated
@unittest.skipIf(not TEST_CUDA, "CUDA unavailable") | ||
def test_early_exit(self): | ||
import subprocess | ||
proc = subprocess.run([sys.executable, '-c', """\ | ||
import torch | ||
from torch.utils.data import DataLoader, IterableDataset | ||
|
||
class RandomDataset(IterableDataset): | ||
def __init__(self, len, size): | ||
super(RandomDataset).__init__() | ||
self.len = len | ||
self.size = size | ||
|
||
def __iter__(self): | ||
return self | ||
|
||
def __next__(self): | ||
if self.len <= 0: | ||
raise StopIteration | ||
self.len -= 1 | ||
return torch.randn(self.size) | ||
|
||
if __name__ == '__main__': | ||
dl = DataLoader( | ||
RandomDataset(64, (28, 28)), | ||
batch_size=16, | ||
num_workers=2, | ||
pin_memory=True, | ||
persistent_workers=True, | ||
) | ||
|
||
for _ in dl: | ||
break | ||
"""], stderr=subprocess.PIPE) | ||
self.assertTrue(proc.stderr == b'') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This test won't pass before this PR.
@ejguan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
# corrupted data from worker_result_queue | ||
# atexit is used to prevent persistent workers exiting before | ||
# pin_memory_thread | ||
if self._persistent_workers and self._pin_memory: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to potentially fix part of #71187 for the case of persistent workers, we may want to remove this condition on self._pin_memory
Summary: Pull Request resolved: #71579 Fixes #1551 As the comment in the code, register a function to terminate persistent workers. Using `atexit` to make sure termination of persistent workers always happens at the end (after pin_memory_thread exits). We need such mechanism because Python interpreter would clean up worker process before DataLoader iterator in some rare cases. Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D33694867 Pulled By: ejguan fbshipit-source-id: 0847f4d424a0cd6b3c0be8235d505415970254e8
Summary: Pull Request resolved: #71579 Fixes #1551 As the comment in the code, register a function to terminate persistent workers. Using `atexit` to make sure termination of persistent workers always happens at the end (after pin_memory_thread exits). We need such mechanism because Python interpreter would clean up worker process before DataLoader iterator in some rare cases. Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D33694867 Pulled By: ejguan fbshipit-source-id: 0847f4d424a0cd6b3c0be8235d505415970254e8 (cherry picked from commit 18ad462)
Reverting as it broke Windows test sanity(from https://github.com/pytorch/pytorch/runs/4903195822?check_suite_focus=true):
|
This pull request has been reverted by 401ac7b52a96377814c37dfeaa3f0bdea231d058. To re-land this change, follow these steps. |
This pull request has been reverted by 86aefdc. To re-land this change, follow these steps. |
Fixes #1551 As the comment in the code, register a function to terminate persistent workers. By adding a reference of these workers in `atexit`, it would prevent Python interpreter kills these persistent worker processes before `pin_memorh_thread` exits. And, if users explicitly kills DataLoader iterator, such function in `atexit` would be a no-op. [ghstack-poisoned]
Fixes #1551 As the comment in the code, register a function to terminate persistent workers. By adding a reference of these workers in `atexit`, it would prevent Python interpreter kills these persistent worker processes before `pin_memorh_thread` exits. And, if users explicitly kills DataLoader iterator, such function in `atexit` would be a no-op. [ghstack-poisoned]
@ejguan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Fixes #1551 As the comment in the code, register a function to terminate persistent workers. By adding a reference of these workers in `atexit`, it would prevent Python interpreter kills these persistent worker processes before `pin_memorh_thread` exits. And, if users explicitly kills DataLoader iterator, such function in `atexit` would be a no-op. Differential Revision: [D33896537](https://our.internmc.facebook.com/intern/diff/D33896537) [ghstack-poisoned]
@ejguan has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator. |
Summary: Pull Request resolved: #71579 Fixes #1551 As the comment in the code, register a function to terminate persistent workers. By adding a reference of these workers in `atexit`, it would prevent Python interpreter kills these persistent worker processes before `pin_memorh_thread` exits. And, if users explicitly kills DataLoader iterator, such function in `atexit` would be a no-op. Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D33896537 Pulled By: ejguan fbshipit-source-id: 36b57eac7523d8aa180180c2b61fc693ea4638ae
Summary: Pull Request resolved: #71579 Fixes #1551 As the comment in the code, register a function to terminate persistent workers. By adding a reference of these workers in `atexit`, it would prevent Python interpreter kills these persistent worker processes before `pin_memorh_thread` exits. And, if users explicitly kills DataLoader iterator, such function in `atexit` would be a no-op. Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D33896537 Pulled By: ejguan fbshipit-source-id: 36b57eac7523d8aa180180c2b61fc693ea4638ae (cherry picked from commit 05add2a)
Hey ejguan. |
@ejguan can you also put a topic label to it? 😊 |
Summary: Pull Request resolved: pytorch/pytorch#71579 Fixes #1551 As the comment in the code, register a function to terminate persistent workers. Using `atexit` to make sure termination of persistent workers always happens at the end (after pin_memory_thread exits). We need such mechanism because Python interpreter would clean up worker process before DataLoader iterator in some rare cases. Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D33694867 Pulled By: ejguan fbshipit-source-id: 0847f4d424a0cd6b3c0be8235d505415970254e8 (cherry picked from commit 18ad462)
Summary: Pull Request resolved: pytorch/pytorch#71579 Fixes #1551 As the comment in the code, register a function to terminate persistent workers. By adding a reference of these workers in `atexit`, it would prevent Python interpreter kills these persistent worker processes before `pin_memorh_thread` exits. And, if users explicitly kills DataLoader iterator, such function in `atexit` would be a no-op. Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D33896537 Pulled By: ejguan fbshipit-source-id: 36b57eac7523d8aa180180c2b61fc693ea4638ae (cherry picked from commit 05add2a)
Summary: Pull Request resolved: pytorch/pytorch#71579 Fixes #1551 As the comment in the code, register a function to terminate persistent workers. Using `atexit` to make sure termination of persistent workers always happens at the end (after pin_memory_thread exits). We need such mechanism because Python interpreter would clean up worker process before DataLoader iterator in some rare cases. Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D33694867 Pulled By: ejguan fbshipit-source-id: 0847f4d424a0cd6b3c0be8235d505415970254e8 (cherry picked from commit 18ad462)
Summary: Pull Request resolved: pytorch/pytorch#71579 Fixes #1551 As the comment in the code, register a function to terminate persistent workers. By adding a reference of these workers in `atexit`, it would prevent Python interpreter kills these persistent worker processes before `pin_memorh_thread` exits. And, if users explicitly kills DataLoader iterator, such function in `atexit` would be a no-op. Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D33896537 Pulled By: ejguan fbshipit-source-id: 36b57eac7523d8aa180180c2b61fc693ea4638ae (cherry picked from commit 05add2a)
ghstack-source-id: 2d15b14 Pull Request resolved: pytorch#71579
Summary: Pull Request resolved: pytorch/pytorch#71579 Fixes #1551 As the comment in the code, register a function to terminate persistent workers. Using `atexit` to make sure termination of persistent workers always happens at the end (after pin_memory_thread exits). We need such mechanism because Python interpreter would clean up worker process before DataLoader iterator in some rare cases. Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D33694867 Pulled By: ejguan fbshipit-source-id: 0847f4d424a0cd6b3c0be8235d505415970254e8 (cherry picked from commit 18ad462)
Summary: Pull Request resolved: pytorch/pytorch#71579 Fixes #1551 As the comment in the code, register a function to terminate persistent workers. By adding a reference of these workers in `atexit`, it would prevent Python interpreter kills these persistent worker processes before `pin_memorh_thread` exits. And, if users explicitly kills DataLoader iterator, such function in `atexit` would be a no-op. Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D33896537 Pulled By: ejguan fbshipit-source-id: 36b57eac7523d8aa180180c2b61fc693ea4638ae (cherry picked from commit 05add2a)
Summary: Pull Request resolved: pytorch/pytorch#71579 Fixes #1551 As the comment in the code, register a function to terminate persistent workers. Using `atexit` to make sure termination of persistent workers always happens at the end (after pin_memory_thread exits). We need such mechanism because Python interpreter would clean up worker process before DataLoader iterator in some rare cases. Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D33694867 Pulled By: ejguan fbshipit-source-id: 0847f4d424a0cd6b3c0be8235d505415970254e8 (cherry picked from commit 18ad462)
Summary: Pull Request resolved: pytorch/pytorch#71579 Fixes #1551 As the comment in the code, register a function to terminate persistent workers. By adding a reference of these workers in `atexit`, it would prevent Python interpreter kills these persistent worker processes before `pin_memorh_thread` exits. And, if users explicitly kills DataLoader iterator, such function in `atexit` would be a no-op. Test Plan: Imported from OSS Reviewed By: VitalyFedyunin Differential Revision: D33896537 Pulled By: ejguan fbshipit-source-id: 36b57eac7523d8aa180180c2b61fc693ea4638ae (cherry picked from commit 05add2a)
Fixes #1551
Stack from ghstack:
As the comment in the code, register a function to terminate persistent workers.
By adding a reference of these workers in
atexit
, it would prevent Python interpreter kills these persistent worker processes beforepin_memorh_thread
exits.And, if users explicitly kills DataLoader iterator, such function in
atexit
would be a no-op.Differential Revision: D33896537