Remove native_functions.yaml dependency from ScanKernels.cu #66620
Conversation
This splits the Tensor-dependent code out into a cpp file. A slight complicating factor is `scan_dim` using `copy_` to handle non-contiguous out arguments, so I've moved that code into the caller, which does introduce some duplication, though it's only ~10 extra lines in total. [ghstack-poisoned]
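For illustration, the split could look roughly like this (a sketch with simplified names, not the exact code in this PR): the `.cu` side exposes a launch function that doesn't need `Tensor.h`, and the `.cpp` side keeps the `copy_` handling for non-contiguous out arguments in the caller.

```cpp
// ScanKernels.h -- sketch: declarations only, no Tensor.h, so the nvcc-compiled
// .cu file never pulls in the native_functions.yaml-generated operator headers.
namespace at { class TensorBase; }
namespace at { namespace native {
void launch_cumsum_cuda_kernel(const TensorBase& result, const TensorBase& self, int64_t dim);
}}

// ScanKernels.cpp -- sketch: compiled as ordinary C++, free to include Tensor.h
// and to use copy_ for the non-contiguous out case that used to live in scan_dim.
#include <ATen/ATen.h>
#include <ATen/native/cuda/ScanKernels.h>

namespace at { namespace native {
void cumsum_cuda_kernel(const Tensor& result, const Tensor& self, int64_t dim) {
  // Give the launcher a contiguous buffer; a real version would likely allocate
  // a fresh temporary rather than copying the stale out contents.
  auto result_ = result.is_contiguous() ? result : result.contiguous();
  launch_cumsum_cuda_kernel(result_, self, dim);
  if (!result.is_same(result_)) {
    result.copy_(result_);  // the duplicated ~10 lines mentioned above
  }
}
}} // namespace at::native
```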
CI Flow Status
⚛️ CI Flow Ruleset - Version:
You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun
# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and triggering the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow
For more information, please take a look at the CI Flow Wiki.
💊 CI failures summary and remediations
As of commit 8dcbd87 (more details on the Dr. CI page): 💚 💚 Looks good so far! There are no failures yet. 💚 💚
This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.
Summary: These stable sorts currently use a combination of `at::arange`, view ops, and `tensor.copy_` to fill in the initial values for the indices before calling into `CUB` to do the actual sort. This is somewhat inefficient because it requires 2 to 4 kernel launches, and the copies all use strided kernels instead of the more efficient contiguous kernels. Instead, a fairly straightforward custom kernel is more efficient in terms of both CUDA and CPU runtime.

In a simple benchmark I profiled `a.sort(stable=True, dim=1)` for different shapes and singled out the kernel invocations for initializing the index tensors (i.e. the non-`cub` kernels). Note that when the batch dim is `<128` we call `segmented_sort_pairs_by_full_sort` instead of `segmented_sort_pairs`:

| shape        | Master (us) | This PR (us) |
|--------------|:-----------:|:------------:|
| (100, 1000)  | 5.000       | 2.300        |
| (1000, 100)  | 2.070       | 1.090        |
| (100, 10000) | 87.34       | 26.47        |
| (1000, 1000) | 28.63       | 20.27        |

Of course, for sufficiently large inputs the overall runtime is dominated by the actual sort. But I have another motive: wanting to remove the operator calls from the middle of this kernel launch code. This change makes it easier to split the kernel code that needs to be compiled with `nvcc` into its own file that doesn't include `Tensor.h`, similar to what I'm doing in #66620.

Pull Request resolved: #66668
Reviewed By: H-Huang
Differential Revision: D31693722
Pulled By: ngimel
fbshipit-source-id: 5765926e4dbbc7a20d2940c098ed093b3de2204e
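For context, a minimal sketch of the kind of index-filling kernel described above (illustrative names and launch parameters; not the kernel that actually landed in #66668):

```cuda
#include <cstdint>

// Fill a contiguous (nsegments, nsort) int64 buffer with 0..nsort-1 per segment
// in a single launch, replacing the at::arange + view + copy_ sequence.
__global__ void fill_segment_indices(int64_t* indices, int64_t nsegments, int64_t nsort) {
  const int64_t total = nsegments * nsort;
  // Grid-stride loop so any reasonable grid size covers the whole buffer.
  for (int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x; i < total;
       i += (int64_t)gridDim.x * blockDim.x) {
    indices[i] = i % nsort;  // position of the element within its own segment
  }
}

// Host-side launch (error checking omitted):
//   fill_segment_indices<<<num_blocks, 256, 0, stream>>>(indices_ptr, nsegments, nsort);
```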
@dagitses has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This is a great approach but I am concerned about how to communicate to other contributors that this ought to be done going forward and how to do it.
The philosophy definitely needs to be documented, but the harder part is making sure everyone knows it or it is discoverable at the right time. Do you have any better ideas than linking it extensively in CUDA implementation files?
For example, I could see something like:
#define TORCH_ASSERT_NO_OPERATORS // see link to document
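e.g. a sketch of how a file preamble might read (the link is a placeholder), relying on the fact that the macro has to appear before any ATen includes so the headers can flag accidental operator dependencies:

```cpp
// Keep this file free of native_functions.yaml operator dependencies so it can
// be compiled without the generated headers; see <link to the split-build doc>.
// Must be defined before any ATen include so the headers can catch violations.
#define TORCH_ASSERT_NO_OPERATORS

#include <ATen/native/cuda/ScanKernels.h>
```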
TensorArg input_arg{ self, "input", 3 };
checkAllSameGPU(__func__, {output_arg, indices_arg, input_arg});

auto values_ = contiguous_out_arg(values);
at head, the epilogue of this function exists only in a single body, so it's a slight step back to be duplicating it now.
We could easily templatize this implementation on the launcher.
Don't do anything about this now; maybe there are better opportunities for authoring this when looking at all the changes.
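For the record, one shape the templating could take (a sketch with assumed helper names such as `launch_cumprod_cuda_kernel`; not code from this PR):

```cpp
// Share the contiguous-out epilogue across the scan ops by templating on the
// launcher, so the copy_ fallback only exists in one place.
template <typename Launcher>
void scan_cuda_op(const Tensor& result, const Tensor& self, int64_t dim,
                  Launcher&& launch) {
  auto result_ = contiguous_out_arg(result);  // assumed: dereferences to a contiguous Tensor
  launch(*result_, self, dim);
  if (!result.is_same(*result_)) {
    result.copy_(*result_);  // non-contiguous out argument: copy back from the temporary
  }
}

// Usage sketch:
//   void cumprod_cuda_kernel(const Tensor& result, const Tensor& self, int64_t dim) {
//     scan_cuda_op(result, self, dim, launch_cumprod_cuda_kernel);
//   }
```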
}

void cumprod_cuda_kernel(const Tensor& result, const Tensor& self, int64_t dim) {
  auto result_ = contiguous_out_arg(result);
same deal with this scaffolding.
@@ -176,7 +176,7 @@ endif()
 if(BUILD_SPLIT_CUDA)
   # Splitting the source files that'll be in torch_cuda between torch_cuda_cu and torch_cuda_cpp
   foreach(tmp ${Caffe2_GPU_SRCS})
-    if("${tmp}" MATCHES "(.*aten.*\\.cu|.*(b|B)las.*|.*((s|S)olver|Register.*CUDA|Legacy|THC|TensorShapeCUDA|BatchLinearAlgebra|ReduceOps|Equal|Activation).*\\.cpp)" AND NOT "${tmp}" MATCHES ".*(THC((CachingHost)?Allocator|General)).*")
+    if("${tmp}" MATCHES "(.*aten.*\\.cu|.*(b|B)las.*|.*((s|S)olver|Register.*CUDA|Legacy|THC|TensorShapeCUDA|BatchLinearAlgebra|ReduceOps|Equal|Activation|ScanKernels).*\\.cpp)" AND NOT "${tmp}" MATCHES ".*(THC((CachingHost)?Allocator|General)).*")
is there a way to write this conditional so that it performs efficiently and also doesn't require continuously extending a long line?
Yes, I was thinking about having a list of exceptions rather than this massive regex. I can prioritize that change for this week.
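For illustration, the exception list could slot into the existing `foreach(tmp ${Caffe2_GPU_SRCS})` loop something like this (a sketch, not the eventual change):

```cmake
# Sketch: one entry per file pattern instead of a single ever-growing regex.
set(TORCH_CUDA_CPP_PATTERNS
    ReduceOps
    Equal
    Activation
    ScanKernels
    # new exceptions get their own line here
)
set(_matches_cpp FALSE)
foreach(pattern ${TORCH_CUDA_CPP_PATTERNS})
  if("${tmp}" MATCHES ".*${pattern}.*\\.cpp")
    set(_matches_cpp TRUE)
  endif()
endforeach()
```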
i don't think you have to prioritize it so long as we have a plan to avoid this potential maintenance problem.
This splits the Tensor-dependent code out into a cpp file. A slight complicating factor is `scan_dim` using `copy_` to handle non-contiguous out arguments, so I've moved that code into the caller, which does introduce some duplication, though it's only ~10 extra lines in total. Differential Revision: [D31856106](https://our.internmc.facebook.com/intern/diff/D31856106) [ghstack-poisoned]
@dagitses has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Stack from ghstack:
This splits the Tensor-dependent code out into a cpp file.

A slight complicating factor is `scan_dim` using `copy_` to handle non-contiguous out arguments. So, I've moved that code into the caller, which does introduce some duplication, though it's only ~10 extra lines in total.
Differential Revision: D31856106