Description
This is a low-priority issue.
I've come across some simple Bayesian models for which Mooncake is significantly (roughly 4×) slower than Enzyme or an alternative, very limited, proof-of-concept Julia AD method (StanBlocksAD.jl). AFAICT, Mooncake should be able to match the performance of Enzyme/StanBlocksAD.jl. It's a bit unclear to me what exactly is "dragging Mooncake down".
Furthermore, for a batched version of that model, neither Enzyme nor Mooncake achieves the same scaling as StanBlocksAD.jl. To clarify/summarize, the timings relative to the scalar StanBlocksAD.jl/Enzyme.jl timing are roughly:
| BATCH_TYPE | `Float64` | `SReal{1, Float64}` | `SReal{2, Float64}` | `SReal{4, Float64}` | `SReal{8, Float64}` | `SReal{16, Float64}` |
| --- | --- | --- | --- | --- | --- | --- |
| Primal | 0.35 | 0.38 | 0.37 | 0.37 | 0.4 | 0.64 |
| StanBlocksAD | 1.0 | 0.99 | 1.1 | 1.2 | 1.6 | 2.7 |
| Mooncake | 4.5 | 4.6 | 4.6 | 12.0 | 21.0 | 35.0 |
| Enzyme | 1.0 | 2.7 | 3.1 | 3.4 | 4.5 | 7.8 |
Notebook with (slightly different) timings and potentially reproducible code: https://nsiccha.github.io/StanBlocksAD.jl/#why
I don't intend to continue developing StanBlocksAD.jl, but I find it interesting that there are apparently still performance gains to be had for something purely Julian. We can discuss what StanBlocksAD.jl does differently from Mooncake and what, if anything, could be ported to Mooncake. But this issue is mainly meant to record this link, and to be revisited at some later point.