tensor.dtype.to_complex() crashes kernel after ~100 calls in ipython kernel #124868
Comments
[Edit] Have a stable reproducer:

```python
import torch
import gc
import sys

for i in range(50):
    d = torch.float32.to_complex()
    print(i, sys.getrefcount(d))
print("That's all folks!")
print(gc.get_referrers(d))
```

Backtrace:
I can also repro the crash. Probably caused here, as the returned object is never incref'd (torch/csrc/Dtype.cpp, lines 85 to 92 in 9bd6e93).
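To make the symptom concrete, here is a small Python sketch (mine, not from the thread) of how the missing incref shows up from the outside: each call-and-release cycle drives the cached singleton's refcount down by one.

```python
import sys
import torch

# Sketch for affected builds (e.g. torch 2.2.x): to_complex() hands back the
# cached torch.complex64 singleton without taking a new reference, so the
# interpreter's matching decref bleeds one reference per call.
base = sys.getrefcount(torch.complex64)
for _ in range(3):
    torch.float32.to_complex()  # temporary result is released right away
print(sys.getrefcount(torch.complex64) - base)  # -3 on affected builds; 0 once fixed
```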
@tringwald do you want to submit a fix or should I?
I can submit a PR if you want. There also seem to be some other occurrences where the output of `torch::getTHPDtype()` is returned without incrementing its reference count.
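A quick way to probe other call sites from Python (my own sketch, not code from the thread) is to watch the refcount of a dtype singleton across call/release cycles:

```python
import sys
import torch

# Any API that hands out a dtype singleton should leave its refcount
# unchanged across call/release cycles; a steady drift points at a
# missing incref in that call site.
base = sys.getrefcount(torch.float32)
for _ in range(100):
    torch.zeros(1).dtype  # exercise a dtype-returning call site
print(sys.getrefcount(torch.float32) - base)  # 0 if the call site is correct
```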
Following fixes it:

```diff
diff --git a/torch/csrc/Dtype.cpp b/torch/csrc/Dtype.cpp
index 8eee2a02fae..d043b9096fa 100644
--- a/torch/csrc/Dtype.cpp
+++ b/torch/csrc/Dtype.cpp
@@ -79,7 +79,9 @@ PyObject* THPDtype_to_real(PyObject* _self, PyObject* noargs) {
   if (!at::isFloatingType(self->scalar_type)) {
     scalar_type = at::toRealValueType(self->scalar_type);
   }
-  return (PyObject*)torch::getTHPDtype(scalar_type);
+  auto rc = reinterpret_cast<PyObject*>(torch::getTHPDtype(scalar_type));
+  Py_INCREF(rc);
+  return rc;
 }
 
 PyObject* THPDtype_to_complex(PyObject* _self, PyObject* noargs) {
@@ -88,7 +90,9 @@ PyObject* THPDtype_to_complex(PyObject* _self, PyObject* noargs) {
   if (!at::isComplexType(self->scalar_type)) {
     scalar_type = at::toComplexType(self->scalar_type);
   }
-  return (PyObject*)torch::getTHPDtype(scalar_type);
+  auto rc = reinterpret_cast<PyObject*>(torch::getTHPDtype(scalar_type));
+  Py_INCREF(rc);
+  return rc;
 }
```
Up to you. I have a PR ready as well as a regression test, but since you found it, please go ahead and I'll be happy to review.
If you've already written a test, you can submit the PR. I was just wondering whether this is also a potential problem: pytorch/torch/csrc/tensor/python_tensor.cpp, lines 238 to 251 in 8885638
`getTHPLayout()` just above has the same issue, btw.

```cpp
type_obj.layout = Py_NewRef(torch::getTHPLayout(layout_from_backend(backend)));
type_obj.dtype = Py_NewRef(torch::getTHPDtype(scalarType));
```
This is OK, but not necessary, as a struct can hold a borrowed reference; the runtime will increment the refcount whenever it binds the object to a new variable.
Keeping this open to search for other incref/decref mishaps, but #125154 should have fixed the crash.
No, they cannot! See the PR above with the rest of the fix.
I just want to say thank you. I'm very impressed with everyone's work here and the speed with which it was done.
Finish fixing #124868: re-use our wrap() utils as much as possible and use Py_NewRef in other places.

Pull Request resolved: #125271
Approved by: https://github.com/colesbury
With the second PR, dtype/layout/memory_format are now consistent and fixed!
Fix ref leak in `dtype.to_complex()`/`to_real()` (#125154), by using `Py_NewRef`. Also, wrap the `THPDtype_to_real`/`THPDtype_to_complex` calls with `HANDLE_TH_ERRORS`. Add a regression test for the above issues: call to_complex for integral dtypes, which raises an exception, and check that the reference count is preserved across repeated to_complex/to_real calls, to detect whether a leak is happening.

Replace

```cpp
auto dtype = (PyObject*)torch::getTHPDtype(current_dtype);
Py_INCREF(dtype);
return dtype;
```

with a more compact/streamlined equivalent

```cpp
return Py_NewRef(torch::getTHPDtype(current_dtype));
```

Fixes #124868

Pull Request resolved: #125154
Approved by: https://github.com/Skylion007, https://github.com/albanD
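A minimal Python sketch of what that regression test checks (my reconstruction of the description above; the actual test in the PR may differ):

```python
import sys
import torch

# Integral dtypes cannot be converted to complex; with HANDLE_TH_ERRORS the
# C++ exception surfaces as a Python RuntimeError instead of crashing.
try:
    torch.int64.to_complex()
except RuntimeError:
    pass

# The refcount of the returned singleton must be stable across calls:
# neither leaking (growing) nor underflowing (shrinking).
d = torch.float32.to_complex()
before = sys.getrefcount(d)
for _ in range(10):
    d = torch.float32.to_complex()
assert sys.getrefcount(d) == before, "refcount drift in to_complex()"
```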
* Fix ref leak in `dtype.to_complex()`/`to_real()` (#125154), by using `Py_NewRef` (full message above) (cherry picked from commit 744f341)

* Revert "Fix ref leak in `dtype.to_complex()`/`to_real()` (#125154)"

  This reverts commit a1b04d8.

* Fix ref leak in `dtype.to_complex()`/`to_real()` (#125154) (cherry picked from commit 744f341)

* Revert "Fix ref leak in `dtype.to_complex()`/`to_real()` (#125154)"

  This reverts commit 5a28bad.

* Refactor autocast C++ APIs to be device-agnostic (#124359)

  # Motivation
  This PR aims to refactor the autocast **C++** APIs to be device-agnostic and deprecate the device-specific autocast **C++** APIs.

  On the C++ side:
  - `is_enabled()` -> `is_enabled(device_type)`
  - `set_enabled(new_enabled)` -> `set_enabled(device_type, new_enabled)`
  - `get_autocast_dtype()` -> `get_autocast_dtype(device_type)`
  - `set_autocast_dtype(dtype)` -> `set_autocast_dtype(device_type, dtype)`

  The following C++ APIs are deprecated and should be removed in PyTorch 2.5:
  - `is_cpu_enabled`, `set_cpu_enabled`, `get_autocast_cpu_dtype`, `set_autocast_cpu_dtype`
  - `is_xpu_enabled`, `set_xpu_enabled`, `get_autocast_xpu_dtype`, `set_autocast_xpu_dtype`
  - `is_ipu_enabled`, `set_ipu_enabled`, `get_autocast_ipu_dtype`, `set_autocast_ipu_dtype`
  - `is_hpu_enabled`, `set_hpu_enabled`, `get_autocast_hpu_dtype`, `set_autocast_hpu_dtype`
  - `is_xla_enabled`, `set_xla_enabled`, `get_autocast_xla_dtype`, `set_autocast_xla_dtype`
  - `is_privateuseone_enabled`, `set_privateuseone_enabled`, `get_autocast_privateuseone_dtype`, `set_autocast_privateuseone_dtype`

  On the Python side, provide 4 generic autocast APIs:
  - `torch.is_autocast_enabled(device_type)`
  - `torch.set_autocast_enabled(device_type, new_enabled)`
  - `torch.get_autocast_dtype(device_type)`
  - `torch.set_autocast_dtype(device_type, dtype)`

  # Additional Context
  We will submit another PR to refactor the autocast **Python** APIs based on this PR.

  Pull Request resolved: #124359
  Approved by: https://github.com/jgong5, https://github.com/albanD

* refactor autocast python APIs (#124479)

  Refactor autocast usage in `torch/amp/autocast_mode.py` and `torch/utils/checkpoint.py` to fix a naming conflict between `torch.xxx.get_autocast_xxx_dtype` defined in `autocast_mode.py` and `torch.xxx.get_autocast_dtype` defined in `checkpoint.py`. Use device-agnostic APIs like `torch.get_autocast_dtype` instead.

  Pull Request resolved: #124479
  Approved by: https://github.com/jgong5, https://github.com/gujinghui, https://github.com/EikanWang, https://github.com/albanD
  ghstack dependencies: #124359

* Fix ref leak in `dtype.to_complex()`/`to_real()` (#125154)

* Revert "refactor autocast python APIs (#124479)"

  This reverts commit 495b0c9.

* Revert "Refactor autocast C++ APIs to be device-agnostic (#124359)"

  This reverts commit 83106b7.

---------

Co-authored-by: Nikita Shulga <2453524+malfet@users.noreply.github.com>
Co-authored-by: Huy Do <huydhn@gmail.com>
Co-authored-by: Yu, Guangye <guangye.yu@intel.com>
Validated that @malfet's stable repro is fixed with 2.3.1.
🐛 Describe the bug
The following block of code consistently crashes an IPython kernel, without an error traceback, after ~100 iterations of the loop:
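The original snippet was not captured here; a sketch along the lines of the stable reproducer above (assumed, based on the issue title) would be:

```python
import torch

# Assumed reproducer: repeatedly converting a tensor's real dtype to its
# complex counterpart. On affected builds (torch 2.2.x) the cached dtype
# singleton's refcount underflows and the kernel dies after ~100 calls.
for i in range(200):
    torch.rand(2).dtype.to_complex()
```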
Running on torch==2.2.2+cu121; I also tried it on a couple of Google Colab instances with torch==2.2.1+cu121 and torch==2.2.0+cpu, with the same result.
It crashes with all real dtypes.
It does not crash in a plain terminal Python session, but it does sometimes cause a segmentation fault after the script ends.
For anyone looking for a workaround, you can just use this:
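The workaround snippet itself was not preserved; one plausible shape for it (my assumption, not the commenter's exact code) is an explicit lookup table that avoids `.to_complex()` entirely:

```python
import torch

# Hypothetical workaround: map real dtypes to their complex counterparts
# by hand instead of calling dtype.to_complex() on affected builds.
_REAL_TO_COMPLEX = {
    torch.float16: torch.complex32,
    torch.float32: torch.complex64,
    torch.float64: torch.complex128,
}

def to_complex(dtype: torch.dtype) -> torch.dtype:
    return _REAL_TO_COMPLEX[dtype]

print(to_complex(torch.float32))  # torch.complex64
```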
Versions
environment:
cc @ezyang @anjali411 @dylanbespalko @mruberry @lezcano @nikitaved @amjames @albanD