ONNX MGP-STR model >100x slower compared to onnxruntime and >3x slower on CUDA vs CPU #20775


Open
OvervCW opened this issue May 12, 2025 · 2 comments
Labels
bug 🐞 Something isn't working

OvervCW commented May 12, 2025

What happened?

We're excited about the prospect of replacing the ONNX runtime with IREE to simplify our production environment, but we are currently running into some performance challenges.

We're running a fine-tuned MGP-STR model that has been converted from PyTorch to ONNX.

We benchmark the ONNX performance using onnxruntime-gpu==1.18.1 like this:

import time

import numpy as np
from onnxruntime import InferenceSession

# ONNX Runtime session on the CUDA execution provider.
session = InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])

for _ in range(10):
    t1 = time.monotonic()
    session.run(
        None,
        {
            "input": np.random.uniform(0, 255, size=(1, 3, 32, 128)).astype(np.float32),
        },
    )
    t2 = time.monotonic()
    print(t2 - t1, "seconds")

On the NVIDIA T4 16 GB that we're using, this results in an inference time of ~17 ms.

I converted this model to MLIR and then compiled it for CUDA like this:

iree-import-onnx model.onnx --opset-version 17 -o model.mlir
iree-compile --iree-opt-level=O2 --iree-hal-target-device=cuda --iree-cuda-target=turing model.mlir -o model_cuda.vmfb
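As a sanity check, the exported entry point name and signature (the main_graph used in the Python benchmark below) can be confirmed with a quick text search over the imported MLIR, for example:

grep "func.func" model.mlir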

I then benchmark it like this:

import time
from iree import runtime as ireert
import numpy as np
from pathlib import Path

# Load the compiled CUDA module into an IREE runtime context.
config = ireert.Config("cuda")
ctx = ireert.SystemContext(config=config)
vm_module = ireert.VmModule.copy_buffer(ctx.instance, Path("model_cuda.vmfb").read_bytes())
ctx.add_vm_module(vm_module)

# Look up the exported entry point by name.
f = ctx.modules.module["main_graph"]

for _ in range(10):
    arg0 = np.random.uniform(0, 255, size=(1, 3, 32, 128)).astype(np.float32)

    t1 = time.monotonic()
    results = f(arg0)
    t2 = time.monotonic()

    print(t2 - t1, "seconds")

With this code, executing f() takes ~3.1 seconds, and nvidia-smi shows 100% GPU utilization while it is running.

I compiled the same model for the CPU using:

iree-compile --iree-opt-level=O2 --iree-hal-target-device=local --iree-hal-local-target-device-backends=llvm-cpu --iree-llvmcpu-target-cpu=host model.mlir -o model_cpu.vmfb

I benchmarked it with similar code (using local-task as the config instead), and the inference time averages ~0.8 seconds with that model.
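For reference, a minimal sketch of that CPU benchmark; the only assumed differences from the CUDA script above are the driver name and the .vmfb path:

import time
from iree import runtime as ireert
import numpy as np
from pathlib import Path

# Same benchmark as above, but on the local-task (CPU) driver with the CPU-compiled module.
config = ireert.Config("local-task")
ctx = ireert.SystemContext(config=config)
vm_module = ireert.VmModule.copy_buffer(ctx.instance, Path("model_cpu.vmfb").read_bytes())
ctx.add_vm_module(vm_module)

f = ctx.modules.module["main_graph"]

for _ in range(10):
    arg0 = np.random.uniform(0, 255, size=(1, 3, 32, 128)).astype(np.float32)

    t1 = time.monotonic()
    f(arg0)
    t2 = time.monotonic()

    print(t2 - t1, "seconds")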

Is it expected that some architectures may currently be very slow because IREE is still a work in progress, or are we doing something wrong? What kind of profiling can we do to get better insight into what's going wrong?

Steps to reproduce your issue

See above.

What component(s) does this issue relate to?

No response

Version information

iree-base-compiler[onnx]==3.4.0
iree-base-runtime==3.4.0
onnxruntime-gpu==1.18.1

The host is running an NVIDIA T4 16 GB with NVIDIA driver 570.133.20 and CUDA version 12.8.

Additional context

No response

benvanik (Collaborator) commented

The best thing to do is set up a benchmark using iree-benchmark-module to separate Python/interop from the timing and avoid measuring startup time. From there you can use tools of your choice on that binary (which is a normal C application and easily profiled) or Tracy (what we have the best support for): https://iree.dev/developers/performance/profiling-with-tracy/. That will indicate whether you are bounded by general overheads or by particular dispatches that may be going down unhappy paths.
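For reference, such a run might look like the following (a sketch only; the flags are standard iree-benchmark-module options, and the function name and input shape are taken from the Python benchmark above):

iree-benchmark-module --module=model_cuda.vmfb --device=cuda --function=main_graph --input="1x3x32x128xf32=0"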

The common issues are untuned target hardware (something no one has used before and is doing something silly like executing everything scalar or on a single thread), "bad" (pathologically slow, over-decomposed, or under-decomposed) input lowerings, or unexpected types (sometimes e.g. input type propagation can cause matmuls to run in f32 instead of f16 - it's good to verify expectations).
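On the type question specifically, a low-tech first check (just a text search over the imported MLIR, assuming nothing about IREE internals) is to see which element types the model actually uses, for example:

grep -o "f16\|f32\|f64" model.mlir | sort | uniq -c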

OvervCW (Author) commented May 12, 2025

@benvanik Thanks for the helpful response, I'm going to do some debugging based on your suggestions.
