Closed
Hey @yaoyaoding!
First of all, amazing work with Hidet!
I have recently been experimenting with Hidet to see if it can outperform ONNX Runtime (ORT).
Surprisingly, ORT with IO binding on an ONNX graph (BART, Pegasus, GPT2) without any graph optimisations outperforms Hidet's optimised flow graph, even with search space 2 (on an Nvidia A100).
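To make the comparison reproducible, here is a minimal sketch of a latency harness that could be used for both backends. It assumes each backend is wrapped in a zero-argument callable that blocks until the GPU has finished (for example by synchronising inside the callable). The name `benchmark` and its parameters are hypothetical and not part of Hidet or ORT.

```python
import statistics
import time


def benchmark(fn, warmup=10, repeats=100):
    """Return the median wall-clock latency of fn() in milliseconds.

    Assumption: fn() blocks until the work is done. For GPU inference the
    callable must synchronise the device itself, otherwise only the kernel
    launch time is measured.
    """
    # Warm-up runs absorb one-off costs (lazy initialisation, caches, tuning).
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)

    # The median is less sensitive to occasional outliers than the mean.
    return statistics.median(samples)
```

Passing the ORT run and the Hidet run through the same function, with identical inputs and the same warm-up and repeat counts, keeps the measurement method out of the comparison.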
Have you previously run any benchmark comparisons between Hidet and ORT? I would love to help debug this!
I have also experimented with transformer-deploy, which performs better than both vanilla ORT and Hidet. Replicating the optimisations from transformer-deploy could be a good next step, and I would love to help with that as well!