Batch (some?) HAL queue operations. #20815
Labels

- `hal/api`: IREE's public C hardware abstraction layer API
- `performance ⚡`: Performance/optimization related work across the compiler and runtime
`alloca` and `dealloca` perform a single allocation and deallocation per submission. This works for most programs and is the easiest to implement throughout the stack (lacking variadic returns for new allocations, etc). Two new scenarios have arisen since then that can end up with a significant number of deallocations: tensor-parallel programs and non-gathered parameters.

A program that uses `iree_io_parameter_provider_load` to load 1000 parameters and repacks them during initialization will end up with 1000 dealloca operations to clean them up - each one a discrete queue submission with its own fence, plus a join that must operate over the entire set of 1000 fences. This far exceeds the design intent behind semaphores (which are more expensive to join than fork) and queue ordering, and hits practical limits on some implementations (e.g. Win32 has a maximum wait count of 64).

Today we skirt such performance pitfalls and system limits by not going super wide, though it's still possible to hit the limits in reasonable programs. #20765 introduces deallocations that trigger on things like parameters and can lead to the unreasonable case getting hit.
Variadic allocation could be supported similarly to how `iree_io_parameter_provider_load` works, by providing an enumerator for each returned allocation. Unfortunately, without variadic returns in the VM or C, transient host memory is required to return the results back to the program. The workaround of taking in a list and then fetching the items from it has its own performance issues, but would at least shift the burden to the host instead of the device as it is today with so many queue submissions.

Variadic deallocation is much simpler and likely the more common case.
`iree_hal_device_queue_dealloca` could take a list of buffers with the same affinity and release all of them at once. It does feel odd to have an API that can batch half of the combined operation but not the other half.

For now #20765 will work around this by adding a pass that splits large joins into smaller chunks with queue barriers so that the situation is not hit at runtime.