Unhelpful CrossEntropyLoss dimension error message #1328
Comments
There's no way to do that. CUDA doesn't allow adding error messages to asserts, and asserts are the least invasive way (perf-wise) in which we can catch these errors 😕
That's understandable but also horrifying considering "We hope you never spend hours debugging your code because of bad stack traces or asynchronous and opaque execution engines." Not exactly a stack trace, but even harder to parse. Would it make sense to add an option to turn on more expensive logging? Didn't see something like that already in place in the main documentation.
We've thought about this very hard. There is a hard technical limitation in the CUDA API on device asserts. However, it is possible that we can improve the generic error message shown when device asserts are triggered, so that it roughly covers all asserts. I'll improve this.
In software development, one of the most important things is good error messages. UX suffers dramatically when you need to spend hours to find out that you made a relatively stupid user/API error. Also, if performance is more important than UX/DX, then PyTorch should introduce a developer/debug mode in which the library, if activated, spends a bit more time generating good, self-explanatory error messages. That's really the key to being efficient and not wasting time.
Thanks for your advice. We understand what you are saying. As an open source project we are strapped for resources and always have to prioritize things.
Hey, PyTorch is great already--I recently migrated everything over from TensorFlow in half the code and 1/5 the time. I can certainly help write better error messages, but I am a researcher with little to no experience developing on large frameworks. I realize that priorities have to be set--my hope is just that this thread is kept somewhere for eventual consideration, as there's definitely another 2-3X productivity to be squeezed out of debugging time. Seconded that a developer/debug mode should be introduced at some point.
I think a generic "device-side assert triggered: perhaps you have an out-of-bounds index? try running on CPU" would be a good first step. Chainer has a full-fledged debug mode that catches OOB and NaNs, but I don't think that provides all that much that running on CPU wouldn't.
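To make the "debug mode" idea from the comments above concrete, here is a minimal sketch in plain Python. It is not PyTorch's or Chainer's API: `DEBUG`, `set_debug`, `debug_checked`, and `toy_loss` are hypothetical names invented for illustration, and the loss operates on nested lists rather than tensors. The point is only that out-of-bounds and NaN checks can be skipped entirely unless a switch is turned on.

```python
import math

DEBUG = False  # hypothetical global switch; off by default so normal runs skip the checks


def set_debug(flag):
    """Turn the (hypothetical) debug mode on or off."""
    global DEBUG
    DEBUG = bool(flag)


def debug_checked(num_classes):
    """Decorator: when DEBUG is on, validate targets before the call and the loss after it."""
    def decorate(loss_fn):
        def wrapper(scores, targets):
            if DEBUG:
                for i, t in enumerate(targets):
                    if not 0 <= t < num_classes:
                        raise IndexError(
                            f"target[{i}] = {t} is out of bounds for {num_classes} classes"
                        )
            loss = loss_fn(scores, targets)
            if DEBUG and math.isnan(loss):
                raise FloatingPointError("loss is NaN")
            return loss
        return wrapper
    return decorate


@debug_checked(num_classes=3)
def toy_loss(scores, targets):
    # Stand-in for a real criterion: negative mean of the score at each target index.
    return -sum(row[t] for row, t in zip(scores, targets)) / len(targets)


set_debug(True)
try:
    toy_loss([[0.1, 0.2, 0.7]], [5])  # class 5 does not exist
except IndexError as err:
    print(err)
```

With the switch off, the wrapper adds only two boolean tests per call, which is the trade-off the thread is asking for: cheap by default, explicit when requested.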
This is being worked on in #26776
Adding a few more debug dumps and a quick doc to help people get Python repros; removing obsolete code.
I believe I've stumbled upon a slight whoops in nn.CrossEntropyLoss(). If the criterion is called with (a, y), where a has shape (N, C) and y has shape (N), such that some yi ≥ C, I get the internal error message below (took a while to parse)... seems like this could use a wrapper. A simple note following the internal error would suffice--how about: "Ensure the class dimension of the predictions matches the class dimension of the targets"?
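Until the library's message improves, a user-side check before calling the criterion can surface the problem with a readable error. The following is a rough sketch, not part of PyTorch: `check_targets` is a hypothetical helper that takes plain Python integers. It assumes class indices are zero-based, so the valid range for C classes is 0 to C - 1.

```python
def check_targets(targets, num_classes):
    """Raise a readable error if any class index falls outside [0, num_classes)."""
    bad = [(i, t) for i, t in enumerate(targets) if not 0 <= t < num_classes]
    if bad:
        i, t = bad[0]
        raise ValueError(
            f"target[{i}] = {t} is outside the valid range [0, {num_classes - 1}] "
            f"for {num_classes} classes ({len(bad)} bad target(s) in total)"
        )


# Usage with plain Python ints.
check_targets([0, 2, 1], num_classes=3)      # passes silently
try:
    check_targets([0, 3, 1], num_classes=3)  # 3 is out of range for 3 classes
except ValueError as err:
    print(err)
```

For a target tensor `y`, one option is to pass `y.tolist()`. That copies the data to the host, so it is best kept to debugging runs.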
THCudaCheck FAIL file=/py/conda-bld/pytorch_1490903321756/work/torch/lib/THC/generic/THCTensorCopy.c line=65 error=59 : device-side assert triggered
System: Ubuntu 16.06, Python 3.6 (conda install).
cc @ngimel