Batch (some?) HAL queue operations. · Issue #20815 · iree-org/iree · GitHub

Batch (some?) HAL queue operations. #20815


Open
benvanik opened this issue May 14, 2025 · 0 comments
Assignees
Labels
hal/api IREE's public C hardware abstraction layer API performance ⚡ Performance/optimization related work across the compiler and runtime

Comments

@benvanik
Collaborator

alloca and dealloca each perform a single allocation or deallocation per submission. This works for most programs and is the easiest behavior to implement throughout the stack (lacking variadic returns for new allocations, etc.). Two new scenarios have since arisen that can produce a significant number of deallocations: tensor-parallel programs and non-gathered parameters.

A program that uses iree_io_parameter_provider_load to load 1000 parameters and repacks them during initialization will end up with 1000 dealloca operations to clean them up - each one a discrete queue submission with its own fence, plus a join that must operate over the entire set of 1000 fences. This far exceeds the design intent behind semaphores (which are more expensive to join than fork) and queue ordering, and it hits practical limits on some implementations (like Win32 having a max wait count of 64).

Today we skirt such performance pitfalls and system limits by not going super wide, though reasonable programs can still hit the limits. #20765 introduces deallocations that trigger on things like parameters and can push programs into the unreasonable case.

Variadic allocation could be supported similar to how iree_io_parameter_provider_load works by providing an enumerator for each returned allocation. Unfortunately, without variadic returns in the VM or C, transient host memory is required to return the results to the program. The workaround of taking in a list and then fetching the items from it has its own performance issues but would at least shift the burden to the host instead of the device, as it is today with so many queue submissions.

Variadic deallocation is much simpler and likely the more common case. iree_hal_device_queue_dealloca could take a list of buffers with the same affinity and release all of them at once. It does feel odd, though, to have an API where half of the combined operation can be batched and the other half cannot.

For now #20765 will work around this by adding a pass to split large joins into smaller chunks with queue barriers so that the situation is not hit at runtime.

@benvanik benvanik self-assigned this May 14, 2025
@benvanik benvanik added hal/api IREE's public C hardware abstraction layer API performance ⚡ Performance/optimization related work across the compiler and runtime labels May 14, 2025