Remove native_functions.yaml dependency from TensorTopK.cu #66794
Conversation
💊 CI failures summary (Dr. CI): as of commit 65945dd, 16 new failures were recognized by patterns; they do not appear to be due to upstream breakages.
@dagitses has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
```cpp
void launch_gather_topk_kernel(
    const TensorBase& self, int64_t k, int64_t dim, bool largest, bool sorted,
    const TensorBase& values, const TensorBase& indices) {
  int numDims = self.dim();
```
out of curiosity, what are the criteria by which these four lines remain in the .cu but the `if (k == 0)` check moves?

My understanding of your work is that you are prioritizing the following:
1) pull out dependencies on Tensor and ATen.h from .cu files.
2) less important, but where possible, extract other unnecessary dependencies from .cu files.
3) take advantage of other opportunities to move code out of .cu files.

Is that roughly the prioritization of your approach here? Under that, moving this block and the `k == 0` check both fit under 3) as the lowest priority.

Reordering the `k == 0` check does change the behavior, since it now skips the check for having too many dimensions. Is that OK? FWIW, I like the idea of being stringent on inputs rather than letting a loophole like this let the user get away with an invalid input.
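For concreteness, a minimal sketch of a call whose behavior would flip under the reordering (hypothetical example; it assumes the 25-dimension `MAX_DIMS` limit cited in the reply below):

```cpp
#include <ATen/ATen.h>
#include <vector>

int main() {
  // A CUDA tensor with 26 dimensions, one more than the kernel's MAX_DIMS.
  auto t = at::zeros(std::vector<int64_t>(26, 1),
                     at::TensorOptions().device(at::kCUDA));
  // Before this PR: the dimension check in the .cu entry point ran first,
  // so this threw even though no kernel launch was needed.
  // After this PR: the k == 0 early return in the .cpp wrapper fires first,
  // so this returns empty values/indices without complaint.
  auto result = at::topk(t, /*k=*/0, /*dim=*/0);
}
```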
Changing topic altogether, do you think that splitting code up this way causes any meaningful harm by creating cross-module optimization barriers?
> out of curiosity, what are the criteria by which these four lines remain in the .cu but the `if (k == 0)` check moves?

The `return` statement must go in the `.cpp` file function so we don't launch the sorting kernels. It would make sense to keep the `MAX_DIMS` checks together with it, but `MAX_DIMS` is defined in a `.cuh` header file so needs `nvcc`:

`constexpr int MAX_DIMS = 25;`
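A rough sketch of the resulting split (illustrative only: the wrapper name `topk_cuda_impl` and the check message are paraphrased, not the verbatim diff):

```cpp
// --- TensorTopK.cpp: host-compiled, only needs TensorBase ---
#include <ATen/core/TensorBase.h>

void launch_gather_topk_kernel(
    const at::TensorBase& self, int64_t k, int64_t dim, bool largest,
    bool sorted, const at::TensorBase& values, const at::TensorBase& indices);

void topk_cuda_impl(  // hypothetical wrapper name
    const at::TensorBase& self, int64_t k, int64_t dim, bool largest,
    bool sorted, const at::TensorBase& values, const at::TensorBase& indices) {
  if (k == 0) {
    return;  // early return lives here so the sorting kernels never launch
  }
  launch_gather_topk_kernel(self, k, dim, largest, sorted, values, indices);
}

// --- TensorTopK.cu: nvcc-compiled, can include the .cuh defining MAX_DIMS ---
void launch_gather_topk_kernel(
    const at::TensorBase& self, int64_t k, int64_t dim, bool largest,
    bool sorted, const at::TensorBase& values, const at::TensorBase& indices) {
  int numDims = self.dim();
  TORCH_CHECK(numDims <= MAX_DIMS, "input tensor has too many dimensions");
  // ... select and launch the CUDA sorting kernels ...
}
```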
> My understanding of your work is that you are prioritizing the following: [...]

This is mostly right, although `.cu` isn't actually in my criteria anywhere. I'm currently focusing on files that depend on `native_functions.yaml`, prioritized by their compile time (to maximize impact). It just so happens that CUDA code is much slower to compile, so the top of the list is all CUDA files. Somewhat interestingly, `GridSample.cpp` was above `GridSample.cu` in compile time, which is why that PR changes both.

The top of the list at the moment looks like this:
| Object file | Compile time |
|---|---|
| `caffe2/CMakeFiles/torch_cuda_cu.dir/__/aten/src/ATen/native/cuda/DistributionBernoulli.cu.o` | 1m 13s 344ms |
| `caffe2/CMakeFiles/torch_cuda_cu.dir/__/aten/src/ATen/native/cuda/ForeachUnaryOp.cu.o` | 1m 12s 425ms |
| `caffe2/CMakeFiles/torch_cuda_cu.dir/__/aten/src/ATen/native/cuda/Distributions.cu.o` | 1m 8s 582ms |
| `caffe2/CMakeFiles/torch_cuda_cu.dir/__/aten/src/ATen/native/cuda/Indexing.cu.o` | 1m 7s 595ms |
| `caffe2/CMakeFiles/torch_cuda_cu.dir/__/aten/src/ATen/native/cuda/group_norm_kernel.cu.o` | 0m 59s 931ms |
| `caffe2/CMakeFiles/torch_cuda_cu.dir/__/aten/src/ATen/RegisterCUDA.cpp.o` | 0m 55s 472ms |
> FWIW, I like the idea of being stringent on inputs rather than letting a loophole like this let the user get away with an invalid input.

I wouldn't say that applies here, since `MAX_DIMS` is an implementation limitation, not an invalid input. If, for example, `matmul` allowed empty tensors to have a shape mismatch, then I would agree.

> Changing topic altogether, do you think that splitting code up this way causes any meaningful harm by creating cross-module optimization barriers?

I don't think there's much the compiler can do here, but things like calling the same tensor method in both functions will have some impact (especially for virtual methods). However, generally speaking, the heavy lifting of these functions is done in the CUDA runtime to actually launch the kernel. So, if there is any slowdown, I expect it to be fairly minimal.
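A minimal sketch of that barrier (hypothetical file and function names): with the split, each translation unit makes its own `self.dim()` call, and the optimizer cannot merge them without LTO:

```cpp
// wrapper.cpp -- host compiler, one translation unit
#include <ATen/core/TensorBase.h>

void launch_kernel(const at::TensorBase& self);  // implemented in kernel.cu

void wrapper(const at::TensorBase& self, int64_t k) {
  if (k == 0) return;
  int64_t dim = self.dim();  // call #1: its result is invisible to kernel.cu
  (void)dim;                 // ... input validation would use it here ...
  launch_kernel(self);
}

// kernel.cu -- nvcc, a separate translation unit
void launch_kernel(const at::TensorBase& self) {
  int64_t dim = self.dim();  // call #2: cannot be folded with call #1
                             // across the TU boundary (short of LTO)
  // ... the kernel launch through the CUDA runtime dominates the cost ...
}
```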
…ytorch#66794)" This reverts commit 4262c89.
…pK.cu (pytorch#66794)"" This reverts commit a9a489e.
Stack from ghstack:
Differential Revision: D31856104