Batch (some?) HAL queue operations. #20815
Labels

- `hal/api`: IREE's public C hardware abstraction layer API
- `performance ⚡`: Performance/optimization related work across the compiler and runtime
`alloca` and `dealloca` perform a single allocation and deallocation per submission. This works for most programs and is the easiest to implement throughout the stack (lacking variadic returns for new allocations, etc). Two new scenarios have arisen since then that can end up with a significant number of deallocations: tensor-parallel programs and non-gathered parameters.

A program that uses `iree_io_parameter_provider_load` to load 1000 parameters and repacks them during initialization will end up with 1000 dealloca operations to clean them up - each one a discrete queue submission with its own fence, plus a join that must operate over the entire set of 1000 fences. This far exceeds the design intent behind semaphores (which are more expensive to join than fork) and queue ordering, and hits practical limits on some implementations (e.g. Win32 has a maximum wait count of 64).

Today we skirt such performance pitfalls and system limits by not going super wide, though it's still possible to hit the limits in reasonable programs. #20765 introduces deallocations that trigger on things like parameters and can lead to the unreasonable case getting hit.
Variadic allocation could be supported similarly to how `iree_io_parameter_provider_load` works, by providing an enumerator for each returned allocation. Unfortunately, without variadic returns in the VM or C, transient host memory is required to return the results back to the program. The workaround of taking in a list and then fetching the items from it has its own performance issues, but would at least shift the burden to the host instead of the device as it is today with so many queue submissions.

Variadic deallocation is much simpler and likely the more common case.
`iree_hal_device_queue_dealloca` could take a list of buffers with the same affinity and release all of them at once. It does feel odd to have an API that can batch half of the combined operation but not the other half.

For now #20765 will work around this by adding a pass that splits large joins into smaller chunks with queue barriers so that the situation is not hit at runtime.