Remove native_functions.yaml dependency from ScanKernels.cu #66620
Conversation
This splits the Tensor-dependent code out into a cpp file. A slight complicating factor is `scan_dim` using `copy_` to handle non-contiguous out arguments, so I've moved that code into the caller, which does introduce some duplication, though it's only ~10 extra lines in total. [ghstack-poisoned]
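For illustration, the split could look roughly like this (a sketch with simplified names, not the exact code in this PR): the `.cu` side exposes a launch function that doesn't need `Tensor.h`, and the `.cpp` side keeps the `copy_` handling for non-contiguous out arguments in the caller.

```cpp
// ScanKernels.h -- sketch: declarations only, no Tensor.h, so the nvcc-compiled
// .cu file never pulls in the native_functions.yaml-generated operator headers.
namespace at { class TensorBase; }
namespace at { namespace native {
void launch_cumsum_cuda_kernel(const TensorBase& result, const TensorBase& self, int64_t dim);
}}

// ScanKernels.cpp -- sketch: compiled as ordinary C++, free to include Tensor.h
// and to use copy_ for the non-contiguous out case that used to live in scan_dim.
#include <ATen/ATen.h>
#include <ATen/native/cuda/ScanKernels.h>

namespace at { namespace native {
void cumsum_cuda_kernel(const Tensor& result, const Tensor& self, int64_t dim) {
  // Give the launcher a contiguous buffer; a real version would likely allocate
  // a fresh temporary rather than copying the stale out contents.
  auto result_ = result.is_contiguous() ? result : result.contiguous();
  launch_cumsum_cuda_kernel(result_, self, dim);
  if (!result.is_same(result_)) {
    result.copy_(result_);  // the duplicated ~10 lines mentioned above
  }
}
}} // namespace at::native
```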
CI Flow Status
⚛️ CI Flow Ruleset - Version:
You can add a comment to the PR and tag @pytorchbot with the following commands:
# ciflow rerun, "ciflow/default" will always be added automatically
@pytorchbot ciflow rerun
# ciflow rerun with additional labels "-l <ciflow/label_name>", which is equivalent to adding these labels manually and triggering the rerun
@pytorchbot ciflow rerun -l ciflow/scheduled -l ciflow/slow
For more information, please take a look at the CI Flow Wiki.
💊 CI failures summary and remediations
As of commit 8dcbd87 (more details on the Dr. CI page): 💚 💚 Looks good so far! There are no failures yet. 💚 💚
This comment was automatically generated by Dr. CI. Please report bugs/suggestions to the (internal) Dr. CI Users group.
Summary: These stable sorts currently use a combination of `at::arange`, view ops, and `tensor.copy_` to fill in the initial values for the indices before calling into `CUB` to do the actual sort. This is somewhat inefficient because it requires 2 to 4 kernel launches, and the copies all use strided kernels instead of the more efficient contiguous kernels. Instead, a fairly straightforward custom kernel is more efficient in terms of both CUDA and CPU runtime.

In a simple benchmark I profiled `a.sort(stable=True, dim=1)` for different shapes and singled out the kernel invocations for initializing the index tensors (i.e. the non-`cub` kernels). Note that when the batch dim is `<128` we call `segmented_sort_pairs_by_full_sort` instead of `segmented_sort_pairs`:

| shape        | Master (us) | This PR (us) |
|--------------|:-----------:|:------------:|
| (100, 1000)  | 5.000       | 2.300        |
| (1000, 100)  | 2.070       | 1.090        |
| (100, 10000) | 87.34       | 26.47        |
| (1000, 1000) | 28.63       | 20.27        |

Of course, for sufficiently large inputs the overall runtime is dominated by the actual sort. But I have another motive: wanting to remove the operator calls from the middle of this kernel launch code. This change makes it easier to split the kernel code that needs to be compiled with `nvcc` into its own file that doesn't include `Tensor.h`, similar to what I'm doing in #66620.

Pull Request resolved: #66668
Reviewed By: H-Huang
Differential Revision: D31693722
Pulled By: ngimel
fbshipit-source-id: 5765926e4dbbc7a20d2940c098ed093b3de2204e
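For context, a minimal sketch of the kind of index-filling kernel described above (illustrative names and launch parameters; not the kernel that actually landed in #66668):

```cuda
#include <cstdint>

// Fill a contiguous (nsegments, nsort) int64 buffer with 0..nsort-1 per segment
// in a single launch, replacing the at::arange + view + copy_ sequence.
__global__ void fill_segment_indices(int64_t* indices, int64_t nsegments, int64_t nsort) {
  const int64_t total = nsegments * nsort;
  // Grid-stride loop so any reasonable grid size covers the whole buffer.
  for (int64_t i = blockIdx.x * (int64_t)blockDim.x + threadIdx.x; i < total;
       i += (int64_t)gridDim.x * blockDim.x) {
    indices[i] = i % nsort;  // position of the element within its own segment
  }
}

// Host-side launch (error checking omitted):
//   fill_segment_indices<<<num_blocks, 256, 0, stream>>>(indices_ptr, nsegments, nsort);
```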
@dagitses has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
This is a great approach but I am concerned about how to communicate to other contributors that this ought to be done going forward and how to do it.
The philosophy definitely needs to be documented, but the harder part is making sure everyone knows it or it is discoverable at the right time. Do you have any better ideas than linking it extensively in CUDA implementation files?
For example, I could see something like:
#define TORCH_ASSERT_NO_OPERATORS // see link to document
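e.g. a sketch of how a file preamble might read (the link is a placeholder), relying on the fact that the macro has to appear before any ATen includes so the headers can flag accidental operator dependencies:

```cpp
// Keep this file free of native_functions.yaml operator dependencies so it can
// be compiled without the generated headers; see <link to the split-build doc>.
// Must be defined before any ATen include so the headers can catch violations.
#define TORCH_ASSERT_NO_OPERATORS

#include <ATen/native/cuda/ScanKernels.h>
```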
TensorArg input_arg{ self, "input", 3 };
checkAllSameGPU(__func__, {output_arg, indices_arg, input_arg});

auto values_ = contiguous_out_arg(values);
at head, the epilogue of this function exists only in a single body, so it's a slight step back to be duplicating it now.
We could easily templatize this implementation on the launcher.
Don't do anything about this now; maybe there are better opportunities for authoring this when looking at all the changes.
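For the record, one shape the templating could take (a sketch with assumed helper names such as `launch_cumprod_cuda_kernel`; not code from this PR):

```cpp
// Share the contiguous-out epilogue across the scan ops by templating on the
// launcher, so the copy_ fallback only exists in one place.
template <typename Launcher>
void scan_cuda_op(const Tensor& result, const Tensor& self, int64_t dim,
                  Launcher&& launch) {
  auto result_ = contiguous_out_arg(result);  // assumed: dereferences to a contiguous Tensor
  launch(*result_, self, dim);
  if (!result.is_same(*result_)) {
    result.copy_(*result_);  // non-contiguous out argument: copy back from the temporary
  }
}

// Usage sketch:
//   void cumprod_cuda_kernel(const Tensor& result, const Tensor& self, int64_t dim) {
//     scan_cuda_op(result, self, dim, launch_cumprod_cuda_kernel);
//   }
```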
}

void cumprod_cuda_kernel(const Tensor& result, const Tensor& self, int64_t dim) {
  auto result_ = contiguous_out_arg(result);
same deal with this scaffolding.
@@ -176,7 +176,7 @@ endif()
 if(BUILD_SPLIT_CUDA)
   # Splitting the source files that'll be in torch_cuda between torch_cuda_cu and torch_cuda_cpp
   foreach(tmp ${Caffe2_GPU_SRCS})
-    if("${tmp}" MATCHES "(.*aten.*\\.cu|.*(b|B)las.*|.*((s|S)olver|Register.*CUDA|Legacy|THC|TensorShapeCUDA|BatchLinearAlgebra|ReduceOps|Equal|Activation).*\\.cpp)" AND NOT "${tmp}" MATCHES ".*(THC((CachingHost)?Allocator|General)).*")
+    if("${tmp}" MATCHES "(.*aten.*\\.cu|.*(b|B)las.*|.*((s|S)olver|Register.*CUDA|Legacy|THC|TensorShapeCUDA|BatchLinearAlgebra|ReduceOps|Equal|Activation|ScanKernels).*\\.cpp)" AND NOT "${tmp}" MATCHES ".*(THC((CachingHost)?Allocator|General)).*")
is there a way to write this conditional so that it performs efficiently and also doesn't require continuously extending a long line?
Yes, I was thinking about having a list of exceptions rather than this massive regex. I can prioritize that change for this week.
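For illustration, the exception list could slot into the existing `foreach(tmp ${Caffe2_GPU_SRCS})` loop something like this (a sketch, not the eventual change):

```cmake
# Sketch: one entry per file pattern instead of a single ever-growing regex.
set(TORCH_CUDA_CPP_PATTERNS
    ReduceOps
    Equal
    Activation
    ScanKernels
    # new exceptions get their own line here
)
set(_matches_cpp FALSE)
foreach(pattern ${TORCH_CUDA_CPP_PATTERNS})
  if("${tmp}" MATCHES ".*${pattern}.*\\.cpp")
    set(_matches_cpp TRUE)
  endif()
endforeach()
```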
i don't think you have to prioritize it so long as we have a plan to avoid this potential maintenance problem.
This splits the Tensor-dependent code out into a cpp file. A slight complicating factor is `scan_dim` using `copy_` to handle non-contiguous out arguments, so I've moved that code into the caller, which does introduce some duplication, though it's only ~10 extra lines in total. Differential Revision: [D31856106](https://our.internmc.facebook.com/intern/diff/D31856106) [ghstack-poisoned]
@dagitses has imported this pull request. If you are a Facebook employee, you can view this diff on Phabricator.
Stack from ghstack:
This splits the Tensor-dependent code out into a cpp file.

A slight complicating factor is `scan_dim` using `copy_` to handle non-contiguous out arguments. So, I've moved that code into the caller, which does introduce some duplication, though it's only ~10 extra lines in total.
Differential Revision: D31856106