ONNX MGP-STR model >100x slower compared to onnxruntime and >3x slower on CUDA vs CPU #20775


Open
OvervCW opened this issue May 12, 2025 · 2 comments
Labels
bug 🐞 Something isn't working

OvervCW commented May 12, 2025

What happened?

We're excited about the prospect of replacing the ONNX runtime with IREE to simplify our production environment, but we are currently running into some performance challenges.

We're running a fine-tuned MGP-STR model that has been converted from PyTorch to ONNX.

We benchmark the ONNX performance using onnxruntime-gpu==1.18.1 like this:

import time

import numpy as np
from onnxruntime import InferenceSession

# ONNX Runtime session on the CUDA execution provider.
session = InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

for _ in range(10):
    t1 = time.monotonic()
    session.run(
        None,
        {
            "input": np.random.uniform(0, 255, size=(1, 3, 32, 128)).astype(np.float32),
        },
    )
    t2 = time.monotonic()
    print(t2 - t1, "seconds")

On the NVIDIA T4 16 GB that we're using, this results in an inference time of ~17 ms.

I converted this model to MLIR and then compiled it for CUDA like this:

iree-import-onnx model.onnx --opset-version 17 -o model.mlir
iree-compile --iree-opt-level=O2 --iree-hal-target-device=cuda --iree-cuda-target=turing model.mlir -o model_cuda.vmfb
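As a sanity check, the exported entry point name and signature (the main_graph used in the Python benchmark below) can be confirmed with a quick text search over the imported MLIR, for example:

grep "func.func" model.mlir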

I then benchmark it like this:

import time
from iree import runtime as ireert
import numpy as np
from pathlib import Path

# Load the compiled CUDA module into an IREE runtime context.
config = ireert.Config("cuda")
ctx = ireert.SystemContext(config=config)
vm_module = ireert.VmModule.copy_buffer(ctx.instance, Path("model_cuda.vmfb").read_bytes())
ctx.add_vm_module(vm_module)

# Look up the exported entry point by name.
f = ctx.modules.module["main_graph"]

for _ in range(10):
    arg0 = np.random.uniform(0, 255, size=(1, 3, 32, 128)).astype(np.float32)

    t1 = time.monotonic()
    results = f(arg0)
    t2 = time.monotonic()

    print(t2 - t1, "seconds")

With this code, executing f() takes ~3.1 seconds, and nvidia-smi shows 100% GPU utilization while it is running.

I compiled the same model for the CPU using:

iree-compile --iree-opt-level=O2 --iree-hal-target-device=local --iree-hal-local-target-device-backends=llvm-cpu --iree-llvmcpu-target-cpu=host model.mlir -o model_cpu.vmfb

I benchmarked it with similar code (using local-task as the config instead), and the inference time averages ~0.8 seconds with that model.
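For reference, a minimal sketch of that CPU benchmark; the only assumed differences from the CUDA script above are the driver name and the .vmfb path:

import time
from iree import runtime as ireert
import numpy as np
from pathlib import Path

# Same benchmark as above, but on the local-task (CPU) driver with the CPU-compiled module.
config = ireert.Config("local-task")
ctx = ireert.SystemContext(config=config)
vm_module = ireert.VmModule.copy_buffer(ctx.instance, Path("model_cpu.vmfb").read_bytes())
ctx.add_vm_module(vm_module)

f = ctx.modules.module["main_graph"]

for _ in range(10):
    arg0 = np.random.uniform(0, 255, size=(1, 3, 32, 128)).astype(np.float32)

    t1 = time.monotonic()
    f(arg0)
    t2 = time.monotonic()

    print(t2 - t1, "seconds")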

Is it expected that some architectures may currently be very slow because IREE is still a work in progress, or are we doing something wrong? What kind of profiling can we do to get better insight into what's going wrong?

Steps to reproduce your issue

See above.

What component(s) does this issue relate to?

No response

Version information

iree-base-compiler[onnx]==3.4.0
iree-base-runtime==3.4.0
onnxruntime-gpu==1.18.1

The host is running an NVIDIA T4 16 GB with NVIDIA driver 570.133.20 and CUDA version 12.8.

Additional context

No response

benvanik (Collaborator) commented

The best thing to do is set up a benchmark using iree-benchmark-module to separate Python/interop from the timing and avoid measuring startup time. From there you can use tools of your choice on that binary (which is a normal C application and easily profiled) or Tracy (what we have the best support for): https://iree.dev/developers/performance/profiling-with-tracy/. That will indicate whether you are bounded by general overheads or by particular dispatches that may be going down unhappy paths.
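For reference, such a run might look like the following (a sketch only; the flags are standard iree-benchmark-module options, and the function name and input shape are taken from the Python benchmark above):

iree-benchmark-module --module=model_cuda.vmfb --device=cuda --function=main_graph --input="1x3x32x128xf32=0"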

The common issues are untuned target hardware (something no one has used before and is doing something silly like executing everything scalar or on a single thread), "bad" (pathologically slow, over-decomposed, or under-decomposed) input lowerings, or unexpected types (sometimes e.g. input type propagation can cause matmuls to run in f32 instead of f16 - it's good to verify expectations).
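On the type question specifically, a low-tech first check (just a text search over the imported MLIR, assuming nothing about IREE internals) is to see which element types the model actually uses, for example:

grep -o "f16\|f32\|f64" model.mlir | sort | uniq -c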

OvervCW (Author) commented May 12, 2025

@benvanik Thanks for the helpful response, I'm going to do some debugging based on your suggestions.
