Tensor cores not utilised when using `iree-run-module --device=cuda` · Issue #11887 · iree-org/iree · GitHub

Tensor cores not utilised when using iree-run-module --device=cuda #11887

Closed
navdeepkk opened this issue Jan 18, 2023 · 15 comments
Labels
codegen/nvvm (NVVM code generation compiler backend), codegen (Shared code generation infrastructure and dialects)

Comments

@navdeepkk
navdeepkk commented Jan 18, 2023

Hi all,

Following the instructions here (https://iree-org.github.io/iree/deployment-configurations/gpu-cuda-rocm/), I am trying to run ResNet50 using the IREE command-line tools downloaded via pip. However, when profiling the model with Nsight Compute, I see that it is not using tensor cores.

Is there a flag or environment variable that needs to be set to enable tensor cores? Any suggestions would be appreciated.

Thanks

Package versions:

iree-compiler                 20220930.282
iree-runtime                  20220930.282
iree-tools-tf                 20220930.282
iree-tools-tflite             20220930.282
iree-tools-xla                20220930.282
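
For reference, a minimal sketch of the compile/run/profile flow being described; file names, the entry function, and the input shape are placeholders, and the exact iree-run-module flag spellings vary between IREE releases:

```shell
# Compile an imported ResNet50 module for the CUDA HAL backend (placeholder file names).
iree-compile resnet50.mlir \
  --iree-hal-target-backends=cuda \
  -o resnet50_cuda.vmfb

# Run the compiled module on the GPU.
iree-run-module --device=cuda \
  --module=resnet50_cuda.vmfb \
  --function=predict \
  --input="1x224x224x3xf32=0"

# Profile the dispatches with Nsight Compute and check whether any kernels
# issue tensor-core (mma/HMMA) instructions.
ncu --set full iree-run-module --device=cuda \
  --module=resnet50_cuda.vmfb \
  --function=predict \
  --input="1x224x224x3xf32=0"
```
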
@dcaballe added the codegen (Shared code generation infrastructure and dialects), awaiting-triage, and codegen/nvvm (NVVM code generation compiler backend) labels on Feb 1, 2023
@ThomasRaoux
Contributor

To use tensor cores you need to pass the right target, as IREE is meant for cross-compilation and doesn't query the target GPU. On an A100 you need to set --iree-hal-cuda-llvm-target-arch=sm_80 for matmul ops.

Currently, convolutions don't have a codegen path that uses tensor cores by default; for that to happen you need to set these flags to convert convolutions to matmuls:

--iree-flow-enable-padding-linalg-ops --iree-flow-linalg-ops-padding-size=32 
--iree-flow-enable-conv-img2col-transform

With all of those flags set, tensor cores will be used.
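
Putting those flags together, a sketch of a full iree-compile invocation for an A100 might look like the following; the input and output file names are placeholders, and the CUDA backend flag is assumed from the deployment docs linked above:

```shell
# Sketch only: target sm_80 so matmuls can use tensor cores, and convert
# convolutions to matmuls via img2col plus padding to tensor-core-friendly sizes.
iree-compile resnet50.mlir \
  --iree-hal-target-backends=cuda \
  --iree-hal-cuda-llvm-target-arch=sm_80 \
  --iree-flow-enable-padding-linalg-ops \
  --iree-flow-linalg-ops-padding-size=32 \
  --iree-flow-enable-conv-img2col-transform \
  -o resnet50_cuda_sm80.vmfb
```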

@rsuderman
Contributor
rsuderman commented Feb 1, 2023

I would recommend updating your IREE installation or building from head as well. Your current version is from September 30th, which is quite out of date by this point.
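
For example, upgrading the pip packages listed in the issue to their latest published releases might look like this sketch:

```shell
# Upgrade the IREE packages reported above to their newest released versions.
python -m pip install --upgrade \
  iree-compiler iree-runtime \
  iree-tools-tf iree-tools-tflite iree-tools-xla
```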

@navdeepkk-polymagelabs
navdeepkk-polymagelabs commented Feb 2, 2023

> I would recommend updating your IREE installation or building from head as well. Your current version is from September 30th, which is quite out of date by this point.

Thanks. We are doing a performance comparison, and using git HEAD isn't ideal. Is there a recommended recent commit that is stable and performs well?

Also, can the Python packages here please be updated: https://pypi.org/project/iree-tools-tf/? Although their release dates are shown as November, the version string itself says they are from 30/09/2022.

@navdeepkk-polymagelabs

> To use tensor cores you need to pass the right target, as IREE is meant for cross-compilation and doesn't query the target GPU. On an A100 you need to set --iree-hal-cuda-llvm-target-arch=sm_80 for matmul ops.
>
> Currently, convolutions don't have a codegen path that uses tensor cores by default; for that to happen you need to set these flags to convert convolutions to matmuls:
>
> --iree-flow-enable-padding-linalg-ops --iree-flow-linalg-ops-padding-size=32
> --iree-flow-enable-conv-img2col-transform
>
> With all of those flags set, tensor cores will be used.

Thanks. Is there a stability/performance reason these passes aren't enabled by default? The reason I'm asking is that we are doing a performance comparison and we'd like to use a uniform and standard set of flags across all models as much as possible.

@stellaraccident
Collaborator

The target-arch level will always be required in some fashion to generate code which correctly exploits a hardware generation.

The others represent temporary passes that we added while implementing more generic/proper support for various features. Specifically:

  • CUDA implicit GEMM is moving forward now (in addition to some other projects focused on convolution performance specifically).
  • More holistic data layout and padding support is being developed now, which will get more things aligned to exploit the fast paths by default.

We don't like to enable options by default that are partial implementations we are working to finish properly, and each of these would be subsumed by active projects. There isn't anything wrong with them that we know of, and people who are using this for real work do set them. But they are not general.

@navdeepkk-polymagelabs
Copy link

Thanks for clarifying this. Sounds good.

@navdeepkk-polymagelabs

> To use tensor cores you need to pass the right target, as IREE is meant for cross-compilation and doesn't query the target GPU. On an A100 you need to set --iree-hal-cuda-llvm-target-arch=sm_80 for matmul ops.
>
> Currently, convolutions don't have a codegen path that uses tensor cores by default; for that to happen you need to set these flags to convert convolutions to matmuls:
>
> --iree-flow-enable-padding-linalg-ops --iree-flow-linalg-ops-padding-size=32
> --iree-flow-enable-conv-img2col-transform
>
> With all of those flags set, tensor cores will be used.

Hi @ThomasRaoux, I am at 6a59ff602 and these flags are absent in iree-compile. The only flags I found with the flow keyword were:

  --iree-flow-demote-f32-to-f16                                         - Converts all f32 ops and values into f16 counterparts unconditionally before main flow conversions.
  --iree-flow-demote-f64-to-f32                                         - Converts all f64 ops and values into f32 counterparts unconditionally before main flow conversions.
  --iree-flow-demote-i64-to-i32                                         - Converts all i64 ops and values into i32 counterparts unconditionally before main flow conversions.
  --iree-flow-dispatch-generate-workload-region                         - Generate the workload region
  --iree-flow-dispatch-use-transform-dialect=<string>                   - mlir file containing a top-level module that specifies the transformations to apply to form dispatch regions.
  --iree-flow-dump-dispatch-graph                                       - Dump a dot graph for dispatches
  --iree-flow-dump-dispatch-graph-output-file=<string>                  - Output file name for a dispatch graph dump
  --iree-flow-enable-aggressive-fusion                                  - Enable the aggressive fusion heuristic to fuse multiuse ops and ops with reduction loops
  --iree-flow-enable-data-tiling                                        - Enable data tiling path
  --iree-flow-enable-fuse-padding-into-linalg-consumer-ops              - Enable fusing tensor.pad ops into Linalg consumer ops
  --iree-flow-export-benchmark-funcs                                    - Exports one function per original module entry point and unique flow.executable that dispatches with dummy arguments.
  --iree-flow-inline-constants-max-byte-length=<int>                    - Maximum byte-length of constant that can be inlined into a dispatch region
  --iree-flow-normalize-input-indexing-map                              - Enable normalizing input indexing map to identity
  --iree-flow-promote-f16-to-f32                                        - Converts all f16 ops and values into f32 counterparts unconditionally before main flow conversions.
  --iree-flow-split-matmul-reduction=<long>                             - split ratio
  --iree-flow-topk-split-reduction=<long>                               - comma separated list of split ratios
  --iree-flow-trace-dispatch-tensors                                    - Trace runtime input/output tensors for each dispatch function.
  --iree-flow-zero-fill-empty-tensors

@ThomasRaoux
Contributor

> Hi @ThomasRaoux, I am at 6a59ff602 and these flags are absent in iree-compile.

Those flags changed starting from yesterday's commit; sorry for the inconvenience. What you want to use on the latest IREE is:
--iree-preprocessing-pass-pipeline="func.func(iree-convert-conv2d-to-img2col,iree-pad-linalg-ops{pad-size=16})"
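
Combined with the earlier target-arch advice, an updated invocation might look like the sketch below; file names are placeholders and the pass-pipeline string is the one given in this comment:

```shell
# Sketch: newer IREE flag spelling. Preprocess conv2d to img2col and pad
# linalg ops, then target sm_80 so the resulting matmuls can use tensor cores.
iree-compile resnet50.mlir \
  --iree-hal-target-backends=cuda \
  --iree-hal-cuda-llvm-target-arch=sm_80 \
  --iree-preprocessing-pass-pipeline="func.func(iree-convert-conv2d-to-img2col,iree-pad-linalg-ops{pad-size=16})" \
  -o resnet50_cuda_sm80.vmfb
```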

@navdeepkk-polymagelabs
navdeepkk-polymagelabs commented Feb 2, 2023

> Those flags changed starting from yesterday's commit; sorry for the inconvenience. What you want to use on the latest IREE is: --iree-preprocessing-pass-pipeline="func.func(iree-convert-conv2d-to-img2col,iree-pad-linalg-ops{pad-size=16})"

Hi @ThomasRaoux, thanks. This is executing now. However, tensor cores are still not being used, and there is no difference in performance with or without these flags. I am running ResNet50V2 from Keras. I suspect this is due to the data type of the operands to the mhlo.convolution op: in the mhlo input, all operands are fp32. Is there another flag I missed that is required to cast the operands to fp16 and execute in mixed-precision mode on the tensor cores?

@ThomasRaoux
Contributor

> Is there another flag I missed that is required to cast the operands to fp16 and execute in mixed-precision mode on the tensor cores?

We don't have a flag to automatically demote operands from fp32 to fp16. Could you share the mhlo IR?

@navdeepkk-polymagelabs

Hi @ThomasRaoux, I am attaching two files: one with the default TensorFlow precision policy and the other with the mixed-float16 precision policy. The one with the mixed-float16 policy fails to lower with the following error:

iree_model.mlir:285:22: error: expected SSA operand
    %4 = mhlo.convert(%arg0) : (tensor<1x224x224x3xf32>) -> tensor<1x224x224x3xf16>

Attachments: Default policy, mixed-float16

@navdeepkk-polymagelabs
navdeepkk-polymagelabs commented Feb 7, 2023

Hi @ThomasRaoux, is there a way I can still use tensor cores for this IR?

@github-project-automation bot moved this to Not Started in (Deprecated) IREE on Feb 21, 2023
@julianwa moved this from Not Started to Inbox in (Deprecated) IREE on Apr 5, 2023
@allieculp

@ThomasRaoux Bumping this up, can you take a look?

@ThomasRaoux
Contributor

Sorry for missing this issue. We need implicit gemm support for this to happen.

@allieculp

@ThomasRaoux @mattwalsh Setting as a P2 since we don't yet have the implicit gemm support for this - please bump up when needed.

@allieculp moved this from Inbox to Needs Scheduling in (Deprecated) IREE on Apr 18, 2023
@ThomasRaoux removed their assignment on Mar 6, 2024
@benvanik closed this as not planned (won't fix, can't repro, duplicate, stale) on Apr 29, 2025