Tensor cores not utilised when using `iree-run-module --device=cuda` · Issue #11887 · iree-org/iree · GitHub

Tensor cores not utilised when using iree-run-module --device=cuda #11887

Closed
navdeepkk opened this issue Jan 18, 2023 · 15 comments
Labels
codegen/nvvm (NVVM code generation compiler backend), codegen (Shared code generation infrastructure and dialects)

Comments

@navdeepkk
navdeepkk commented Jan 18, 2023

Hi all,

Following the instructions here (https://iree-org.github.io/iree/deployment-configurations/gpu-cuda-rocm/), I am trying to run ResNet50 using the IREE command-line tools downloaded via pip. However, when profiling the model with Nsight Compute, I see that it is not using tensor cores.

Is there a flag or environment variable that needs to be set to enable tensor cores? Any suggestions would be appreciated.

Thanks

Package versions:

iree-compiler                 20220930.282
iree-runtime                  20220930.282
iree-tools-tf                 20220930.282
iree-tools-tflite             20220930.282
iree-tools-xla                20220930.282
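
For reference, a minimal sketch of the compile/run/profile flow being described; file names, the entry function, and the input shape are placeholders, and the exact iree-run-module flag spellings vary between IREE releases:

```shell
# Compile an imported ResNet50 module for the CUDA HAL backend (placeholder file names).
iree-compile resnet50.mlir \
  --iree-hal-target-backends=cuda \
  -o resnet50_cuda.vmfb

# Run the compiled module on the GPU.
iree-run-module --device=cuda \
  --module=resnet50_cuda.vmfb \
  --function=predict \
  --input="1x224x224x3xf32=0"

# Profile the dispatches with Nsight Compute and check whether any kernels
# issue tensor-core (mma/HMMA) instructions.
ncu --set full iree-run-module --device=cuda \
  --module=resnet50_cuda.vmfb \
  --function=predict \
  --input="1x224x224x3xf32=0"
```
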
@dcaballe added the codegen (Shared code generation infrastructure and dialects), awaiting-triage, and codegen/nvvm (NVVM code generation compiler backend) labels on Feb 1, 2023
@ThomasRaoux
Contributor

To use tensor cores you need to pass the right target, as IREE is meant for cross-compilation and doesn't query the target GPU. On an A100 you need to set --iree-hal-cuda-llvm-target-arch=sm_80 for matmul ops.

Currently, convolutions don't have a codegen path that uses tensor cores by default; for that to happen you need to set these flags to convert convolutions to matmuls:

--iree-flow-enable-padding-linalg-ops --iree-flow-linalg-ops-padding-size=32 
--iree-flow-enable-conv-img2col-transform

With all of those flags set, tensor cores will be used.
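
Putting those flags together, a sketch of a full iree-compile invocation for an A100 might look like the following; the input and output file names are placeholders, and the CUDA backend flag is assumed from the deployment docs linked above:

```shell
# Sketch only: target sm_80 so matmuls can use tensor cores, and convert
# convolutions to matmuls via img2col plus padding to tensor-core-friendly sizes.
iree-compile resnet50.mlir \
  --iree-hal-target-backends=cuda \
  --iree-hal-cuda-llvm-target-arch=sm_80 \
  --iree-flow-enable-padding-linalg-ops \
  --iree-flow-linalg-ops-padding-size=32 \
  --iree-flow-enable-conv-img2col-transform \
  -o resnet50_cuda_sm80.vmfb
```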

@rsuderman
Contributor
rsuderman commented Feb 1, 2023

I would recommend updating your IREE installation or building from head as well. Your current version is from September 30th, which is quite out of date by this point.
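
For example, upgrading the pip packages listed in the issue to their latest published releases might look like this sketch:

```shell
# Upgrade the IREE packages reported above to their newest released versions.
python -m pip install --upgrade \
  iree-compiler iree-runtime \
  iree-tools-tf iree-tools-tflite iree-tools-xla
```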

@navdeepkk-polymagelabs
navdeepkk-polymagelabs commented Feb 2, 2023

> I would recommend updating your IREE installation or building from head as well. Your current version is from September 30th, which is quite out of date by this point.

Thanks. We are doing a performance comparison, and using git HEAD isn't ideal. Is there a recommended recent commit that is stable and performs well?

Also, can the Python packages here please be updated: https://pypi.org/project/iree-tools-tf/? Although their release dates are shown as November, the version string itself says they are from 30/09/2022.

@navdeepkk-polymagelabs

> To use tensor cores you need to pass the right target, as IREE is meant for cross-compilation and doesn't query the target GPU. On an A100 you need to set --iree-hal-cuda-llvm-target-arch=sm_80 for matmul ops.
>
> Currently, convolutions don't have a codegen path that uses tensor cores by default; for that to happen you need to set these flags to convert convolutions to matmuls:
>
> --iree-flow-enable-padding-linalg-ops --iree-flow-linalg-ops-padding-size=32
> --iree-flow-enable-conv-img2col-transform
>
> With all of those flags set, tensor cores will be used.

Thanks. Is there a stability/performance reason these passes aren't enabled by default? The reason I'm asking is that we are doing a performance comparison and we'd like to use a uniform and standard set of flags across all models as much as possible.

@stellaraccident
Collaborator

The target-arch level will always be required in some fashion to generate code which correctly exploits a hardware generation.

The others represent temporary passes that we added while implementing more generic/proper support for various features. Specifically:

  • CUDA implicit GEMM is moving forward now (in addition to some other projects focused on convolution performance specifically).
  • More holistic data layout and padding support is being developed now, which will get more things aligned to exploit the fast paths by default.

We don't like to enable options by default that are partial implementations we are working to finish properly, and each of these would be subsumed by active projects. There isn't anything wrong with them that we know of, and people who are using this for real work do set them. But they are not general.

@navdeepkk-polymagelabs
Copy link

Thanks for clarifying this. Sounds good.

@navdeepkk-polymagelabs

> To use tensor cores you need to pass the right target, as IREE is meant for cross-compilation and doesn't query the target GPU. On an A100 you need to set --iree-hal-cuda-llvm-target-arch=sm_80 for matmul ops.
>
> Currently, convolutions don't have a codegen path that uses tensor cores by default; for that to happen you need to set these flags to convert convolutions to matmuls:
>
> --iree-flow-enable-padding-linalg-ops --iree-flow-linalg-ops-padding-size=32
> --iree-flow-enable-conv-img2col-transform
>
> With all of those flags set, tensor cores will be used.

Hi @ThomasRaoux, I am at 6a59ff602 and these flags are absent in iree-compile. The only flags I found with the flow keyword were:

  --iree-flow-demote-f32-to-f16                                         - Converts all f32 ops and values into f16 counterparts unconditionally before main flow conversions.
  --iree-flow-demote-f64-to-f32                                         - Converts all f64 ops and values into f32 counterparts unconditionally before main flow conversions.
  --iree-flow-demote-i64-to-i32                                         - Converts all i64 ops and values into i32 counterparts unconditionally before main flow conversions.
  --iree-flow-dispatch-generate-workload-region                         - Generate the workload region
  --iree-flow-dispatch-use-transform-dialect=<string>                   - mlir file containing a top-level module that specifies the transformations to apply to form dispatch regions.
  --iree-flow-dump-dispatch-graph                                       - Dump a dot graph for dispatches
  --iree-flow-dump-dispatch-graph-output-file=<string>                  - Output file name for a dispatch graph dump
  --iree-flow-enable-aggressive-fusion                                  - Enable the aggressive fusion heuristic to fuse multiuse ops and ops with reduction loops
  --iree-flow-enable-data-tiling                                        - Enable data tiling path
  --iree-flow-enable-fuse-padding-into-linalg-consumer-ops              - Enable fusing tensor.pad ops into Linalg consumer ops
  --iree-flow-export-benchmark-funcs                                    - Exports one function per original module entry point and unique flow.executable that dispatches with dummy arguments.
  --iree-flow-inline-constants-max-byte-length=<int>                    - Maximum byte-length of constant that can be inlined into a dispatch region
  --iree-flow-normalize-input-indexing-map                              - Enable normalizing input indexing map to identity
  --iree-flow-promote-f16-to-f32                                        - Converts all f16 ops and values into f32 counterparts unconditionally before main flow conversions.
  --iree-flow-split-matmul-reduction=<long>                             - split ratio
  --iree-flow-topk-split-reduction=<long>                               - comma separated list of split ratios
  --iree-flow-trace-dispatch-tensors                                    - Trace runtime input/output tensors for each dispatch function.
  --iree-flow-zero-fill-empty-tensors

@ThomasRaoux
Contributor

> Hi @ThomasRaoux, I am at 6a59ff602 and these flags are absent in iree-compile.

Those flags changed starting from yesterday's commit; sorry for the inconvenience. What you want to use on the latest IREE is:
--iree-preprocessing-pass-pipeline="func.func(iree-convert-conv2d-to-img2col,iree-pad-linalg-ops{pad-size=16})"
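
Combined with the earlier target-arch advice, an updated invocation might look like the sketch below; file names are placeholders and the pass-pipeline string is the one given in this comment:

```shell
# Sketch: newer IREE flag spelling. Preprocess conv2d to img2col and pad
# linalg ops, then target sm_80 so the resulting matmuls can use tensor cores.
iree-compile resnet50.mlir \
  --iree-hal-target-backends=cuda \
  --iree-hal-cuda-llvm-target-arch=sm_80 \
  --iree-preprocessing-pass-pipeline="func.func(iree-convert-conv2d-to-img2col,iree-pad-linalg-ops{pad-size=16})" \
  -o resnet50_cuda_sm80.vmfb
```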

@navdeepkk-polymagelabs
navdeepkk-polymagelabs commented Feb 2, 2023

> Those flags changed starting from yesterday's commit; sorry for the inconvenience. What you want to use on the latest IREE is: --iree-preprocessing-pass-pipeline="func.func(iree-convert-conv2d-to-img2col,iree-pad-linalg-ops{pad-size=16})"

Hi @ThomasRaoux, thanks. This is executing now. However, tensor cores are still not being used, and there is no difference in performance with or without these flags. I am running ResNet50V2 from Keras. I suspect this is due to the data type of the operands to the mhlo.convolution op: in the mhlo input, all operands are fp32. Is there another flag I missed that is required to cast the operands to fp16 and execute in mixed-precision mode on the tensor cores?

@ThomasRaoux
Contributor

> Is there another flag I missed that is required to cast the operands to fp16 and execute in mixed-precision mode on the tensor cores?

We don't have a flag to automatically demote operands from fp32 to fp16. Could you share the mhlo IR?

@navdeepkk-polymagelabs

Hi @ThomasRaoux, I am attaching two files: one with the default TensorFlow precision policy and the other with the mixed-float16 precision policy. The one with the mixed-float16 policy fails to lower with the following error:

iree_model.mlir:285:22: error: expected SSA operand
    %4 = mhlo.convert(%arg0) : (tensor<1x224x224x3xf32>) -> tensor<1x224x224x3xf16>

Attachments: Default policy, mixed-float16

@navdeepkk-polymagelabs
navdeepkk-polymagelabs commented Feb 7, 2023

Hi @ThomasRaoux, is there a way I can still use tensor cores for this IR?

@github-project-automation bot moved this to Not Started in (Deprecated) IREE on Feb 21, 2023
@julianwa moved this from Not Started to Inbox in (Deprecated) IREE on Apr 5, 2023
@allieculp

@ThomasRaoux Bumping this up, can you take a look?

@ThomasRaoux
Contributor

Sorry for missing this issue. We need implicit gemm support for this to happen.

@allieculp

@ThomasRaoux @mattwalsh Setting as a P2 since we don't yet have the implicit gemm support for this - please bump up when needed.

@allieculp moved this from Inbox to Needs Scheduling in (Deprecated) IREE on Apr 18, 2023
@ThomasRaoux removed their assignment on Mar 6, 2024
@benvanik closed this as not planned (won't fix, can't repro, duplicate, stale) on Apr 29, 2025