Unhelpful CrossEntropyLoss dimension error message · Issue #1328 · pytorch/pytorch · GitHub

Unhelpful CrossEntropyLoss dimension error message #1328

Open

jsuarez5341 opened this issue Apr 22, 2017 · 8 comments
Labels
module: cuda Related to torch.cuda, and CUDA support in general module: error checking Bugs related to incorrect/lacking error checking module: loss Problem is related to loss function triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module

Comments

@jsuarez5341 commented Apr 22, 2017

I believe I've stumbled upon a slight whoops in nn.CrossEntropyLoss(). If the criterion is called with (a, y), where a has shape (N, C) and y has shape (N,) such that some y_i >= C, I get the internal error message below (it took a while to parse)... it seems like this could use a wrapper. A simple note following the internal error would suffice--how about: "Ensure that every target class index is within the range of the predictions' class dimension"?

THCudaCheck FAIL file=/py/conda-bld/pytorch_1490903321756/work/torch/lib/THC/generic/THCTensorCopy.c line=65 error=59 : device-side assert triggered

System: Ubuntu 16.06, Python 3.6 (conda install).
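For reference, a minimal sketch of the failing call (shapes and tensor names are illustrative, assuming a CUDA-enabled build):

```python
import torch
import torch.nn as nn

criterion = nn.CrossEntropyLoss()

N, C = 4, 3                               # 4 samples, 3 classes
a = torch.randn(N, C).cuda()              # predictions, shape (N, C)
y = torch.tensor([0, 1, 2, 3]).cuda()     # target 3 is out of range; valid classes are 0..C-1

loss = criterion(a, y)                    # trips the opaque device-side assert shown above
```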

cc @ngimel

@apaszke (Contributor) commented Apr 24, 2017

There's no way to do that. CUDA doesn't allow attaching error messages to device-side asserts, and an assert is the least invasive way (perf-wise) in which we can catch these errors 😕

@apaszke closed this as completed Apr 24, 2017
@jsuarez5341 (Author)

That's understandable but also horrifying considering "We hope you never spend hours debugging your code because of bad stack traces or asynchronous and opaque execution engines." Not exactly a stack trace, but even harder to parse. Would it make sense to add an option to turn on more expensive logging? I didn't see anything like that already in place in the main documentation.
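One knob along these lines already exists, although it only localizes the failure rather than explaining it: with synchronous kernel launches, the device-side assert is reported at the op that caused it instead of at some later, unrelated call. A minimal sketch (the script name in the comment is illustrative):

```python
# Equivalent to running `CUDA_LAUNCH_BLOCKING=1 python train.py` from the shell.
# Kernel launches become synchronous, so the traceback points at the launching op.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"   # must be set before the CUDA context is created

import torch                               # import torch (and do all CUDA work) after setting the flag
```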

@soumith (Member) commented Apr 26, 2017

We've thought about this very hard; there is a hard technical limitation in the CUDA API on device asserts. However, it is possible that we can improve the generic error message shown when device asserts are triggered so that it roughly covers all asserts. I'll improve this.

@soumith reopened this Apr 26, 2017
@marcj commented Apr 26, 2017

In software development, one of the most important things is good error messages. UX suffers dramatically when you need to spend hours to find out that you made a relatively stupid user/API error. Also, if performance is more important than UX/DX, then PyTorch should introduce a developer/debug mode where the lib spends a bit more time generating good, self-explanatory error messages - if activated. It's really the key to being efficient and not wasting time.

@soumith (Member) commented Apr 26, 2017

Thanks for your advice. We understand and know what you are saying. As an open-source project we are bootstrapped for resources and always have to prioritize things.
If you have time, feel free to improve error messages in PyTorch; we even have a list of known bad error messages that need to be improved in #39.

@jsuarez5341 (Author)

Hey, PyTorch is great already--I recently migrated everything over from TensorFlow in half the code and a fifth of the time. I can certainly help write better error messages, but I am a researcher with little to no experience developing on large frameworks. I realize that priorities have to be set--my hope is just that this thread is kept somewhere for eventual consideration, as there's definitely another 2-3x of productivity to be squeezed out of debugging time. Seconded that a developer/debug mode should be introduced at some point.

@jekbradbury (Contributor)

I think a generic "device-side assert triggered: perhaps you have an out-of-bounds index? try running on CPU" would be a good first step. Chainer has a full-fledged debug mode that catches OOB and NaNs, but I don't think that provides all that much that running on CPU wouldn't.
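A minimal sketch of that workaround (values are illustrative): re-running the same call on CPU turns the device-side assert into an ordinary host-side exception, whose exact wording depends on the PyTorch version.

```python
import torch
import torch.nn as nn

N, C = 4, 3
a = torch.randn(N, C)                      # CPU tensors on purpose
y = torch.tensor([0, 1, 2, 3])             # 3 is out of bounds for C = 3 classes

try:
    nn.CrossEntropyLoss()(a, y)
except Exception as e:                     # CPU path raises a regular exception, not a device assert
    print(type(e).__name__, e)
```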

@weiyangfb assigned and unassigned weiyangfb Aug 3, 2018
@ezyang added the feature label and removed the enhancement label Apr 1, 2019
@fmassa (Member) commented Oct 23, 2019

This is being worked on in #26776

@fmassa added the module: cuda, module: operators, and triaged labels and removed the medium priority (this tag is deprecated) label Oct 23, 2019
@mruberry added the module: error checking and module: loss labels and removed the feature and module: operators (deprecated) labels Oct 10, 2020
jjsjann123 added a commit to jjsjann123/pytorch that referenced this issue Mar 2, 2022
adding a few more debug dumps and a quick doc to help people get Python repros;
removing obsolete code.
akashveramd pushed a commit to akashveramd/pytorch that referenced this issue Apr 9, 2025