free up dispatch key space (in C++) #72827
Conversation
Reland of D34034848
Differential Revision: [D34227616](https://our.internmc.facebook.com/intern/diff/D34227616/)
**NOTE FOR REVIEWERS**: This PR has internal Facebook-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D34227616/)!
[ghstack-poisoned]
💊 CI failures summary: as of commit 1c367d7, 4 new failures were recognized by patterns; they do not appear to be due to upstream breakages.
DispatchKey::SparseCsrCPU,
DispatchKey::SparseCsrCUDA,
We'll likely need to detach CPU/CUDA from CSR dispatch keys as well, similar to Sparse:

Suggested change:
- DispatchKey::SparseCsrCPU,
- DispatchKey::SparseCsrCUDA,
+ DispatchKey::SparseCsr,
I'll probably do that in a followup. For context: there actually isn't a hard requirement on which way to represent SparseCsr - it's more about whether or not we want to trade off dispatch key slots for space in the runtime operator table. We could either:

(1) Make it a "per-backend functionality key". There would be one dispatch key for SparseCsr, and any backend (CPU, CUDA, etc.) could implement its own kernels for it separately. Advantage: we save some key space, and give every backend the flexibility to override its meaning (instead of being limited to just CPU and CUDA). Minor disadvantage: we blow up the size of the runtime operator table a bit, since we now need to store an extra (# backends) number of slots for SparseCsr, for every operator. (From what you said, this is also probably logically a bit cleaner.)

(2) Leave the existing SparseCsrCPU/SparseCsrCUDA keys, which saves some runtime operator table space.

It sounds like we should eventually do (1).
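For readers following along, here is a minimal standalone sketch of the trade-off in (1) vs. (2). Every name below (DedicatedKey, Functionality, PerBackendOperatorTable, etc.) is an illustrative assumption, not PyTorch's actual DispatchKey or operator-table implementation:

```cpp
// Illustrative sketch only: option (2) spends dedicated enum slots per backend,
// option (1) spends per-operator table slots indexed by (functionality, backend).
#include <array>
#include <cstddef>
#include <cstdint>
#include <iostream>

using KernelFn = void (*)();

// Option (2): a dedicated runtime dispatch key per (functionality, backend) pair.
// Each extra pair consumes a slot in the (bounded) key enum.
enum class DedicatedKey : std::uint8_t { CPU, CUDA, SparseCsrCPU, SparseCsrCUDA, NumKeys };

// Option (1): a single "per-backend functionality key" (SparseCsr) that any backend
// can implement; the runtime table is indexed by (functionality, backend) instead.
enum class Functionality : std::uint8_t { Dense, SparseCsr, NumFunctionalities };
enum class Backend : std::uint8_t { CPU, CUDA, NumBackends };

constexpr std::size_t kNumFunctionalities =
    static_cast<std::size_t>(Functionality::NumFunctionalities);
constexpr std::size_t kNumBackends = static_cast<std::size_t>(Backend::NumBackends);

// Per-operator kernel table for option (1): (# functionalities) x (# backends) slots,
// paid once per operator, in exchange for fewer entries in the key enum.
struct PerBackendOperatorTable {
  std::array<KernelFn, kNumFunctionalities * kNumBackends> kernels{};
  std::size_t index(Functionality f, Backend b) const {
    return static_cast<std::size_t>(f) * kNumBackends + static_cast<std::size_t>(b);
  }
};

int main() {
  PerBackendOperatorTable table;
  std::cout << "option (2) enum slots used: "
            << static_cast<int>(DedicatedKey::NumKeys) << "\n";
  std::cout << "option (1) per-operator table slots: " << table.kernels.size() << "\n";
}
```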
Summary: X-link: pytorch/pytorch#72827
Reland of D34034848
ghstack-source-id: 152161452
Reviewed By: ezyang
Differential Revision: D34227616
fbshipit-source-id: 6d1dd0fd8144dfbd9e194cd7564cce017e7db968
This pull request has been reverted by 9872a06. To re-land this change, please open another pull request, assign the same reviewers, fix the CI failures that caused the revert, and make sure that the failing CI runs on the PR by applying the proper ciflow label (e.g., ciflow/trunk).
…pace (in C++)"""

This PR is a re-land of #69633 (this is the second re-land attempt; the first one is at #72827). The original PR had a memory corruption bug that only surfaced on mobile builds.

*Background: Existing Mobile Optimization*

PyTorch mobile builds have an existing optimization ([here](https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/c10/core/DispatchKey.h#L382) and [here](https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/OperatorEntry.h#L214)), which works as follows: every operator in PyTorch has a "dispatch table" of function pointers, corresponding to all of the (up to 64) different kernels that we might dispatch to when we run an operator (autograd, CPU, CUDA, complex number support, etc.). In mobile builds, the size of that table is shrunk from 64 to 8 to save space, because mobile doesn't end up using the functionality associated with most dispatch keys.

The dispatcher also has a notion of "fallback kernels": kernels that you register to a particular dispatch key, but that should be able to work for "any operator". The array of fallback kernels is defined [here](https://github.com/pytorch/pytorch/blob/cc23725e89713138aa1c81ce5fb4a8dbcd440ccf/aten/src/ATen/core/dispatch/Dispatcher.h#L294). The mobile optimization currently does not extend to this array (it wouldn't be that useful anyway, because there is only one array of fallback kernels globally, vs. a separate dispatch table of function pointers per operator).

*The Bug*

This PR actually made it difficult to enable that optimization separately for the per-operator arrays vs. the fallback array, and incidentally shrunk the size of the fallback array from 64 to 8 for mobile (that happened on [this](https://github.com/pytorch/pytorch/pull/69633/files#diff-f735cd7aa68f15b624100cbc4bb3b5ea76ffc7c9d3bec3b0ccabaa09609e5319R294) line). That isn't a problem by itself, since mobile doesn't actually use any of the fallbacks that can no longer be stored. However, PyTorch core still registers all of those fallback kernels on startup in mobile builds, even if they aren't used. When one of those fallbacks was registered on startup, the kernel would be written somewhere in memory past the bounds of the (now smaller) array inside the Dispatcher object, backendFallbackKernels_.

**NOTE FOR REVIEWERS**: This PR has internal Facebook-specific changes or comments, please review them on [Phabricator](https://our.internmc.facebook.com/intern/diff/D35222806/)!

[ghstack-poisoned]
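A minimal standalone sketch of the failure mode described above. The constants, the FallbackKernel struct, and registerFallback are illustrative assumptions, not the actual Dispatcher code; the point is that the fallback array is sized for the shrunk mobile key space while startup registration still walks the full key space:

```cpp
// Illustrative sketch: the fallback array is sized with the mobile-shrunk key
// count, but registration at startup still iterates over the full key count.
#include <array>
#include <cstddef>
#include <iostream>

constexpr std::size_t kFullNumDispatchKeys = 64;   // desktop/server key space
constexpr std::size_t kMobileNumDispatchKeys = 8;  // shrunk key space on mobile builds

struct FallbackKernel { void (*fn)() = nullptr; };

// The array only has room for the mobile key space...
std::array<FallbackKernel, kMobileNumDispatchKeys> backendFallbackKernels_;

void registerFallback(std::size_t dispatch_key_index, void (*fn)()) {
  // ...so registering a fallback for any key index >= kMobileNumDispatchKeys
  // would write past the end of the array (the original code had no guard,
  // which is the memory corruption described above; this sketch just reports it).
  if (dispatch_key_index >= backendFallbackKernels_.size()) {
    std::cerr << "would write out of bounds at index " << dispatch_key_index << "\n";
    return;
  }
  backendFallbackKernels_[dispatch_key_index].fn = fn;
}

int main() {
  // Startup registers fallbacks for the full key space, even on mobile.
  for (std::size_t k = 0; k < kFullNumDispatchKeys; ++k) {
    registerFallback(k, [] {});
  }
}
```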
Reland of #69633, after fixing some internal mobile failures.
Stack from ghstack (oldest at bottom):
Differential Revision: D34227616
NOTE FOR REVIEWERS: This PR has internal Facebook-specific changes or comments, please review them on Phabricator!