Add optional affinity to `stream.timepoint.join` and partitioning pass.
#20817
Labels
compiler/dialects
Relating to the IREE compiler dialects (flow, hal, vm)
performance ⚡
Performance/optimization related work across the compiler and runtime
Today `stream.timepoint.join` can end up with far too many operands to be practical at runtime. Even if the count is under system limits, a HAL fence created by the join and provided to subsequent operations will still be tracking all backing semaphore timepoints on each operation.
A naive approach would be to identify `stream.timepoint.join` ops with "a lot" of operands and split them into multiple joins that eventually lower to `iree_hal_device_queue_barrier`. This is not currently possible, as joins are not expected to always be joining timepoints on the same device and timepoints do not carry a placement (and if they did, there's no guarantee it's knowable, such as when timepoints are provided as I/O).
If, however, an affinity analysis ran to identify known affinities for timepoints, we could split joins into one per set of fences from the same affinity and attribute each join to that affinity. When lowering into the HAL those affinity-carrying joins would be converted to queue barriers, while any remaining joins (mixed affinities, unknown affinities, etc.) would be fence joins as they are today.
In heterogeneous situations this could help reduce the amount of cross-device-type synchronization, and in pathological situations (thousands of join operands followed by many subsequent waits) it would reduce the long-term performance impact of the large joins. Queue barriers are also easier to split into multiples to avoid system limits on wait counts.
Instead of this:
We should be able to partition on operand affinity:
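As a rough illustration of the partitioning idea (a Python model, not IR and not the actual compiler pass), the grouping could look something like the sketch below. `Timepoint`, `Join`, and `partition_join` are hypothetical names invented for this example, and the affinities are assumed to have already been resolved by an analysis, with `None` standing in for "unknown".

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Timepoint:
    # Hypothetical stand-in for a timepoint SSA value.
    name: str
    affinity: Optional[str] = None  # None models an unknown affinity (e.g. I/O)


@dataclass(frozen=True)
class Join:
    # Hypothetical stand-in for a join op; operands are Timepoints or Joins.
    operands: tuple
    affinity: Optional[str] = None  # None models today's plain fence join


def partition_join(operands):
    """Split one wide join into per-affinity joins plus one residual join."""
    groups = {}   # affinity -> timepoints, in first-seen order
    unknown = []  # timepoints whose affinity could not be determined
    for tp in operands:
        if tp.affinity is None:
            unknown.append(tp)
        else:
            groups.setdefault(tp.affinity, []).append(tp)

    partial = []
    for affinity, tps in groups.items():
        # A lone timepoint needs no join of its own.
        partial.append(tps[0] if len(tps) == 1 else Join(tuple(tps), affinity))

    remaining = partial + unknown
    if len(remaining) == 1:
        # Everything shared one affinity: the join can carry that affinity.
        return remaining[0]
    # Mixed or unknown affinities: keep a plain join over the partial results.
    return Join(tuple(remaining), None)
```

In this model a join whose operands all share an affinity collapses to a single affinity-carrying join (the candidate for a queue barrier), while mixed inputs produce one small join per affinity feeding a residual join with no affinity.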
We could also assign mixed joins to consumers, if known, when there's more than one user/global store to reduce the wait overheads:
And if any particular join has a large number of operands chain those:
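A minimal sketch of the chaining step, again as a Python model rather than IR: each barrier waits on at most `max_operands` values, and every barrier after the first also waits on the previous barrier's result. The function name `chain_join`, the `max_operands` limit, and the `%barrierN` result names are assumptions made for illustration only.

```python
def chain_join(operands, max_operands):
    """Split one wide join into a chain of barriers with bounded operand counts.

    Returns a list of operand lists, one per barrier. The result of barrier i
    is referred to by the placeholder name "%barrier{i}" in later barriers.
    """
    if max_operands < 2:
        raise ValueError("max_operands must be at least 2 to make progress")
    operands = list(operands)
    if not operands:
        return []

    # The first barrier takes a full batch of the original operands.
    barriers = [operands[:max_operands]]
    rest = operands[max_operands:]

    while rest:
        # Later barriers reserve one slot for the previous barrier's result.
        take, rest = rest[:max_operands - 1], rest[max_operands - 1:]
        previous = f"%barrier{len(barriers) - 1}"
        barriers.append([previous] + take)
    return barriers
```

Waiting on the last barrier in the returned list is then equivalent to waiting on all of the original operands. A chain is used here because that is what the text describes; a tree of barriers would be another option with shorter dependency depth.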