ONNX MGP-STR model >100x slower compared to onnxruntime and >3x slower on CUDA vs CPU · Issue #20775 · iree-org/iree
What happened?
We're excited about the prospect of replacing the ONNX runtime with IREE to simplify our production environment, but we are currently running into some performance challenges.
We're running a fine-tuned MGP-STR model that has been converted from PyTorch to ONNX.
We benchmark the ONNX performance using `onnxruntime-gpu==1.18.1` like this:
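(A representative sketch of that benchmark; the model path, input name, and the 1x3x32x128 input shape stand in for our actual values.)

```python
import time
import numpy as np
import onnxruntime as ort

# Placeholder model path and input shape; substitute the real values.
session = ort.InferenceSession("mgp_str.onnx", providers=["CUDAExecutionProvider"])
input_name = session.get_inputs()[0].name
x = np.random.rand(1, 3, 32, 128).astype(np.float32)

session.run(None, {input_name: x})  # warmup

start = time.perf_counter()
for _ in range(100):
    session.run(None, {input_name: x})
print(f"{(time.perf_counter() - start) / 100 * 1000:.1f} ms per inference")
```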
On the NVIDIA T4 16 GB that we're using, this results in an inference time of ~17 ms.
I converted this model to MLIR and then compiled it for CUDA like this:
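(A representative form of those commands: `sm_75` matches the T4, and the exact flag spellings can differ between IREE releases.)

```shell
# Import the ONNX model to MLIR, then compile for CUDA.
# sm_75 is the T4's architecture; the target flag name varies by IREE release.
iree-import-onnx model.onnx -o model.mlir
iree-compile model.mlir \
  --iree-hal-target-backends=cuda \
  --iree-cuda-target=sm_75 \
  -o model_cuda.vmfb
```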
I then benchmark it like this:
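(Again a sketch: `main_graph` is an assumption for whatever entry point `iree-import-onnx` emitted, and the input shape is a stand-in.)

```python
import time
import numpy as np
import iree.runtime as ireert

config = ireert.Config("cuda")
ctx = ireert.SystemContext(config=config)
with open("model_cuda.vmfb", "rb") as fobj:
    vm_module = ireert.VmModule.copy_buffer(ctx.instance, fobj.read())
ctx.add_vm_module(vm_module)

# "main_graph" is a guess at the entry point name in the compiled module.
main = ctx.modules.module["main_graph"]
x = np.random.rand(1, 3, 32, 128).astype(np.float32)

def f():
    return main(x)

f()  # warmup
start = time.perf_counter()
f()
print(f"{time.perf_counter() - start:.2f} s")
```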
This code requires ~3.1 seconds to execute `f()`, and `nvidia-smi` shows 100% GPU utilization while this code is running.

I compiled the same model for the CPU using:
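(A sketch; `--iree-llvmcpu-target-cpu=host` tunes codegen for the host CPU, since the default CPU target is a very conservative baseline.)

```shell
# Compile the same MLIR for the local CPU, targeting the host's architecture.
iree-compile model.mlir \
  --iree-hal-target-backends=llvm-cpu \
  --iree-llvmcpu-target-cpu=host \
  -o model_cpu.vmfb
```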
And benchmarked it with similar code (except for using `local-task` as the config); the inference time averages ~0.8 seconds with that model.

Is it expected that some architectures may currently be very slow due to IREE being a work in progress, or are we doing something wrong? What kind of profiling can we do to get better insight into what's going wrong?
Steps to reproduce your issue
See above.
What component(s) does this issue relate to?
No response
Version information
The host is running an NVIDIA T4 16 GB with NVIDIA driver 570.133.20 and CUDA version 12.8.
Additional context
No response
The best thing to do is set up a benchmark using iree-benchmark-module to separate Python/interop from the timing and avoid measuring startup time. From there you can use tools of your choice on that binary (which is a normal C application and easily profiled) or Tracy (what we have the best support for): https://iree.dev/developers/performance/profiling-with-tracy/. That will indicate whether you are bounded by general overheads or by particular dispatches that may be going down unhappy paths.

The common issues are untuned target hardware (something no one has used before that is doing something silly like executing everything scalar or on a single thread), "bad" (pathologically slow, over-decomposed, or under-decomposed) input lowerings, or unexpected types (sometimes, e.g., input type propagation can cause matmuls to run in f32 instead of f16 - it's good to verify expectations).
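A minimal invocation of that tool (the function name and input shape are placeholders for the model's actual signature):

```shell
# Benchmark the compiled module directly, outside of Python.
iree-benchmark-module \
  --module=model_cuda.vmfb \
  --device=cuda \
  --function=main_graph \
  --input=1x3x32x128xf32 \
  --benchmark_repetitions=10
```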