8000 Remove native_functions.yaml dependency from ScanKernels.cu by peterbell10 · Pull Request #66620 · pytorch/pytorch · GitHub

Remove native_functions.yaml dependency from ScanKernels.cu #66620


Closed
wants to merge 6 commits into from

Conversation

peterbell10
Collaborator
@peterbell10 peterbell10 commented Oct 14, 2021

This splits the Tensor-dependent code out into a cpp file.

A slight complicating factor is that `scan_dim` uses `copy_` to handle
non-contiguous out arguments, so I've moved that code into the
caller, which does introduce some duplication, though only ~10
extra lines in total.

[ghstack-poisoned]
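A minimal sketch of the split described above (the file layout and the `launch_cumsum_cuda_kernel` declaration are hypothetical illustrations; only `contiguous_out_arg` and the general shape come from this PR's diff): the `.cu` translation unit sees `TensorBase` only, while the `.cpp` caller owns the `copy_` epilogue for non-contiguous out arguments.

```cpp
// --- ScanKernels.h (hypothetical shared interface) --------------------------
// The launcher is declared against TensorBase only, so ScanKernels.cu can be
// built with TORCH_ASSERT_NO_OPERATORS and no native_functions.yaml dependency.
#include <ATen/core/TensorBase.h>
#include <cstdint>

void launch_cumsum_cuda_kernel(
    const at::TensorBase& result, const at::TensorBase& self, int64_t dim);

// --- ScanKernels.cpp (Tensor-level caller, ordinary C++) --------------------
// The non-contiguous "out" handling moves here: run the kernel into a
// contiguous temporary, then copy_ the values back if needed.
#include <ATen/core/Tensor.h>
#include <c10/util/MaybeOwned.h>

// Assumed signature for the helper that appears in the diff hunks below.
c10::MaybeOwned<at::Tensor> contiguous_out_arg(const at::Tensor& t);

void cumsum_cuda_kernel(const at::Tensor& result, const at::Tensor& self, int64_t dim) {
  auto result_ = contiguous_out_arg(result);       // contiguous temporary if `result` isn't
  launch_cumsum_cuda_kernel(*result_, self, dim);  // implemented in ScanKernels.cu
  if (!result.is_same(*result_)) {
    result.copy_(*result_);                        // the ~10 duplicated lines mentioned above
  }
}
```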
@pytorch-probot
pytorch-probot bot commented Oct 14, 2021
CI Flow Status

⚛️ CI Flow

Ruleset - Version: v1
Ruleset - File: https://github.com/pytorch/pytorch/blob/8dcbd8752b8f2eaf945be45502803febb6d47b40/.github/generated-ciflow-ruleset.json
PR ciflow labels: ciflow/default

| Workflow | Labels (bold enabled) | Status |
|----------|-----------------------|--------|
| **Triggered Workflows** | | |
| linux-bionic-py3.6-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/noarch, ciflow/xla | ✅ triggered |
| linux-vulkan-bionic-py3.6-clang9 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/vulkan | ✅ triggered |
| linux-xenial-cuda11.3-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3-clang5-mobile-build | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile | ✅ triggered |
| linux-xenial-py3-clang5-mobile-custom-build-dynamic | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile | ✅ triggered |
| linux-xenial-py3-clang5-mobile-custom-build-static | ciflow/all, ciflow/default, ciflow/linux, ciflow/mobile | ✅ triggered |
| linux-xenial-py3.6-clang7-asan | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/sanitizers | ✅ triggered |
| linux-xenial-py3.6-clang7-onnx | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux, ciflow/onnx | ✅ triggered |
| linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3.6-gcc7 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| linux-xenial-py3.6-gcc7-bazel-test | ciflow/all, ciflow/bazel, ciflow/cpu, ciflow/default, ciflow/linux | ✅ triggered |
| win-vs2019-cpu-py3 | ciflow/all, ciflow/cpu, ciflow/default, ciflow/win | ✅ triggered |
| win-vs2019-cuda11.3-py3 | ciflow/all, ciflow/cuda, ciflow/default, ciflow/win | ✅ triggered |
| **Skipped Workflows** | | |
| caffe2-linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux | 🚫 skipped |
| docker-builds | ciflow/all | 🚫 skipped |
| libtorch-linux-xenial-cuda10.2-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux | 🚫 skipped |
| libtorch-linux-xenial-cuda11.3-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux | 🚫 skipped |
| linux-bionic-cuda10.2-py3.9-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/slow | 🚫 skipped |
| linux-xenial-py3-clang5-mobile-code-analysis | ciflow/all, ciflow/linux, ciflow/mobile | 🚫 skipped |
| parallelnative-linux-xenial-py3.6-gcc5.4 | ciflow/all, ciflow/cpu, ciflow/linux | 🚫 skipped |
| periodic-libtorch-linux-xenial-cuda11.1-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/libtorch, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-linux-xenial-cuda10.2-py3-gcc7-slow-gradcheck | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled, ciflow/slow, ciflow/slow-gradcheck | 🚫 skipped |
| periodic-linux-xenial-cuda11.1-py3.6-gcc7 | ciflow/all, ciflow/cuda, ciflow/linux, ciflow/scheduled | 🚫 skipped |
| periodic-win-vs2019-cuda11.1-py3 | ciflow/all, ciflow/cuda, ciflow/scheduled, ciflow/win | 🚫 skipped |

You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun

# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and trigger the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow

For more information, please take a look at the CI Flow Wiki.

@facebook-github-bot
Contributor
facebook-github-bot commented Oct 14, 2021

🔗 Helpful links

💊 CI failures summary and remediations

As of commit 8dcbd87 (more details on the Dr. CI page):


💚 💚 Looks good so far! There are no failures yet. 💚 💚


This comment was automatically generated by Dr. CI.

@peterbell10 peterbell10 requested a review from dagitses October 15, 2021 00:07
facebook-github-bot pushed a commit that referenced this pull request Oct 15, 2021
Summary:
These stable sorts currently use a combination of `at::arange`, view ops and `tensor.copy_` to fill in the initial values for the indices before calling into `CUB` to do the actual sort. This is somewhat inefficient because it requires 2 to 4 kernel launches, and the copies all use strided kernels instead of the more efficient contiguous kernels. Instead, a fairly straightforward custom kernel is more efficient in terms of both CUDA and CPU runtime.

In a simple benchmark I profiled `a.sort(stable=True, dim=1)` for different shapes and singled out the kernel invocations for initializing the index tensors (i.e. the non-`cub` kernels). Note that when the batch dim is `<128` we call `segmented_sort_pairs_by_full_sort` instead of `segmented_sort_pairs`:

| shape        | Master (us) | This PR (us) |
|--------------|:-----------:|:------------:|
| (100, 1000)  |    5.000    |     2.300    |
| (1000, 100)  |    2.070    |     1.090    |
| (100, 10000) |    87.34    |     26.47    |
| (1000, 1000) |    28.63    |     20.27    |

Of course, for sufficiently large inputs the overall runtime is dominated by the actual sort. But I have another motive: removing the operator calls from the middle of this kernel launch code. This change makes it easier to split the kernel code that needs to be compiled with `nvcc` into its own file that doesn't include `Tensor.h`, similar to what I'm doing in #66620.
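
As an illustration of the custom index-initialization kernel described above, here is a minimal, hypothetical CUDA sketch (the kernel and launcher names are invented, not taken from #66668): each element of the index buffer is written in place with its position along the sort dimension, in one contiguous launch.

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Hypothetical kernel: out[i * nsort + j] = j for every segment i.
// One contiguous write per element replaces the arange + view + copy_
// sequence (and the strided copy kernels it implied).
__global__ void fill_segment_indices(int64_t* out, int64_t nsegments, int64_t nsort) {
  const int64_t total = nsegments * nsort;
  const int64_t idx = blockIdx.x * static_cast<int64_t>(blockDim.x) + threadIdx.x;
  if (idx < total) {
    out[idx] = idx % nsort;  // position within the segment
  }
}

// Host-side launch, sized to cover all elements on the given stream.
void launch_fill_segment_indices(int64_t* out, int64_t nsegments, int64_t nsort,
                                 cudaStream_t stream) {
  const int64_t total = nsegments * nsort;
  const int threads = 256;
  const auto blocks = static_cast<unsigned int>((total + threads - 1) / threads);
  fill_segment_indices<<<blocks, threads, 0, stream>>>(out, nsegments, nsort);
}
```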

Pull Request resolved: #66668

Reviewed By: H-Huang

Differential Revision: D31693722

Pulled By: ngimel

fbshipit-source-id: 5765926e4dbbc7a20d2940c098ed093b3de2204e
wconstab pushed a commit that referenced this pull request Oct 20, 2021
(same commit message as the facebook-github-bot entry above; Differential Revision: D31693722)
@peterbell10
Collaborator Author

@dagitses

@dagitses
Collaborator

@dagitses has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

Collaborator
@dagitses dagitses left a comment


This is a great approach, but I am concerned about how we communicate to other contributors that this ought to be done going forward, and how to do it.

The philosophy definitely needs to be documented, but the harder part is making sure everyone knows it, or that it is discoverable at the right time. Do you have any better ideas than linking it extensively in CUDA implementation files?

For example, I could see something like:
#define TORCH_ASSERT_NO_OPERATORS // see link to document

TensorArg input_arg{ self, "input", 3 };
checkAllSameGPU(__func__, {output_arg, indices_arg, input_arg});

auto values_ = contiguous_out_arg(values);
Collaborator


At head, the epilogue of this function exists only in a single body, so it's a slight step back to be duplicating it now.

We could easily templatize this implementation on the launcher (see the sketch below).

Don't do anything about this now; maybe there are better opportunities for addressing this when looking at all the changes.

}

void cumprod_cuda_kernel(const Tensor& result, const Tensor& self, int64_t dim) {
auto result_ = contiguous_out_arg(result);
Collaborator


Same deal with this scaffolding.
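
To make the templatize-on-the-launcher suggestion concrete, a hypothetical sketch (the `scan_cuda_impl` wrapper and the launcher declaration are invented for illustration; only `contiguous_out_arg` comes from the hunks above): a shared wrapper owns the contiguous-out epilogue once, and each op passes in its launcher.

```cpp
#include <ATen/core/Tensor.h>
#include <c10/util/MaybeOwned.h>
#include <cstdint>

// Assumed helper from the diff above.
c10::MaybeOwned<at::Tensor> contiguous_out_arg(const at::Tensor& t);
// Hypothetical launcher (would live in a shared header, taking TensorBase).
void launch_cumprod_cuda_kernel(
    const at::TensorBase& result, const at::TensorBase& self, int64_t dim);

// Shared wrapper owns the contiguous-out epilogue once, so the duplicated
// scaffolding doesn't have to be repeated in every scan kernel caller.
template <typename Launcher>
void scan_cuda_impl(const at::Tensor& result, const at::Tensor& self, int64_t dim,
                    Launcher launch) {
  auto result_ = contiguous_out_arg(result);  // contiguous temporary if needed
  launch(*result_, self, dim);
  if (!result.is_same(*result_)) {
    result.copy_(*result_);                   // write back for non-contiguous out args
  }
}

// Each op then reduces to a one-line forwarder.
void cumprod_cuda_kernel(const at::Tensor& result, const at::Tensor& self, int64_t dim) {
  scan_cuda_impl(result, self, dim, launch_cumprod_cuda_kernel);
}
```

Whether the shared epilogue lives in a template or a small helper function is a taste call; either way the write-back logic stays in one place.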

@@ -176,7 +176,7 @@ endif()
if(BUILD_SPLIT_CUDA)
# Splitting the source files that'll be in torch_cuda between torch_cuda_cu and torch_cuda_cpp
foreach(tmp ${Caffe2_GPU_SRCS})
if("${tmp}" MATCHES "(.*aten.*\\.cu|.*(b|B)las.*|.*((s|S)olver|Register.*CUDA|Legacy|THC|TensorShapeCUDA|BatchLinearAlgebra|ReduceOps|Equal|Activation).*\\.cpp)" AND NOT "${tmp}" MATCHES ".*(THC((CachingHost)?Allocator|General)).*")
if("${tmp}" MATCHES "(.*aten.*\\.cu|.*(b|B)las.*|.*((s|S)olver|Register.*CUDA|Legacy|THC|TensorShapeCUDA|BatchLinearAlgebra|ReduceOps|Equal|Activation|ScanKernels).*\\.cpp)" AND NOT "${tmp}" MATCHES ".*(THC((CachingHost)?Allocator|General)).*")
Collaborator


Is there a way to write this conditional so that it performs efficiently and also doesn't require continuously extending a long line?

Collaborator Author


Yes, I was thinking about having a list of exceptions rather than this massive regex. I can prioritize that change for this week.

Collaborator


I don't think you have to prioritize it, so long as we have a plan to avoid this potential maintenance problem.

This splits the Tensor-dependent code out into a cpp file.

A slight complicating factor is that `scan_dim` uses `copy_` to handle
non-contiguous out arguments, so I've moved that code into the
caller, which does introduce some duplication, though only ~10
extra lines in total.

Differential Revision: [D31856106](https://our.internmc.facebook.com/intern/diff/D31856106)

[ghstack-poisoned]
@dagitses
Collaborator
dagitses commented Nov 2, 2021

@dagitses has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.

@facebook-github-bot
Contributor

@dagitses merged this pull request in 7deb172.
