Add optional affinity to `stream.timepoint.join` and partitioning pass. · Issue #20817 · iree-org/iree

Open
benvanik opened this issue May 15, 2025 · 0 comments
Labels
compiler/dialects Relating to the IREE compiler dialects (flow, hal, vm) performance ⚡ Performance/optimization related work across the compiler and runtime

Comments

@benvanik
Collaborator

Today stream.timepoint.join can end up with far too many operands to be practical at runtime. Even when the operand count is under system limits, a HAL fence created by the join and provided to subsequent operations will still be tracking all backing semaphore timepoints on each operation.

A naive approach would be to identify stream.timepoint.join ops with "a lot" of operands and split them into multiple joins that eventually lower to iree_hal_device_queue_barrier. This is not currently possible: joins are not guaranteed to be joining timepoints on the same device, and timepoints do not carry a placement (and if they did, there's no guarantee it's knowable, such as when timepoints are provided as I/O).

If, however, an affinity analysis ran to identify known affinities for timepoints, we could split each join into one join per set of fences from the same affinity and attribute each of those joins to that affinity. When lowering into the HAL, those affinity-carrying joins would be converted to queue barriers, while any remaining joins (mixed affinities, unknown affinities, etc.) would be fence joins as they are today.
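As a rough illustration of the grouping step described above, here is a minimal Python sketch (not IREE code). It assumes the affinity analysis has already produced a map from timepoint name to affinity, with missing or `None` entries meaning "unknown". The function name `partition_by_affinity` and the string representation of values are hypothetical and used only for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def partition_by_affinity(
    operands: List[str],
    affinities: Dict[str, Optional[str]],
) -> Tuple[Dict[str, List[str]], List[str]]:
    """Groups join operands by known affinity (hypothetical helper).

    Returns (groups, residual). `groups` maps each known affinity to its
    operands in their original order; `residual` holds operands whose
    affinity is unknown and that would stay on a plain fence join.
    """
    groups: Dict[str, List[str]] = defaultdict(list)
    residual: List[str] = []
    for operand in operands:
        affinity = affinities.get(operand)  # None if the analysis found nothing
        if affinity is None:
            residual.append(operand)
        else:
            groups[affinity].append(operand)
    return dict(groups), residual


# Example: two devices plus one timepoint provided as I/O (unknown affinity).
groups, residual = partition_by_affinity(
    ["%2", "%3", "%90", "%91", "%arg0"],
    {"%2": "@a", "%3": "@a", "%90": "@b", "%91": "@b"},
)
```

Each entry in `groups` would become one affinity-carrying join, and a final join would combine those results with everything in `residual`.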

In heterogeneous situations this could help reduce the amount of cross-device-type synchronization, and in pathological situations (thousands of join operands followed by many subsequent waits) it would reduce the long-term performance impact of the large join operands. Queue barriers are also easier to split into multiples to avoid system limits on wait counts.

Instead of this:

%436 = stream.timepoint.join max(%2, %3, %4, %5, %6, %7, %8, %9, %10, %11, %12, %13, %14, %15, %16, %17, %18, %19, %20, %21, %22, %23, %24, %25, %26, %27, %28, %29, %30, %31, %32, %33, %34, %35, %36, %37, %38, %39, %40, %41, %42, %43, %44, %45, %46, %47, %48, %49, %50, %51, %52, %53, %54, %55, %56, %57, %58, %59, %60, %61, %62, %63, %64, %65, %66, %67, %68, %69, %70, %71, %72, %73, %74, %75, %76, %77, %78, %79, %80, %81, %82, %83, %84, %85, %86, %87, %88, %89, %90, %91, %92, %93, %94, %95, %96, %97, %98, %99, %100, %101, %102, %103, %104, %105, %106, %107, %108, %109, %110, %111, %112, %113, %114, %115, %116, %117, %118, %119, %120, %121, %122, %123, %124, %125, %126, %127, %128, %129, %130, %131, %132, %133, %134, %135, %136, %137, %138, %139, %140, %141, %142, %143, %144, %145, %146, %147, %148, %149, %150, %151, %152, %153, %154, %155, %156, %157, %158, %159, %160, %161, %162, %163, %164, %165, %166, %167, %168, %169, %170, %171, %172, %173, %174, %175, %176, %177, %178, %179, %180, %181, %182, %183, %184, %185, %186, %187, %188, %189, %190, %191, %192, %193, %194, %195, %196, %197, %198, %199, %200, %201, %202, %203, %204, %205, %206, %207, %208, %209, %210, %211, %212, %213, %214, %215, %216, %217, %218, %219, %220, %221, %222, %223, %224, %225, %226, %227, %228, %229, %230, %231, %232, %233, %234, %235, %236, %237, %238, %239, %240, %241, %242, %243, %244, %245, %246, %247, %248, %249, %250, %251, %252, %253, %254, %255, %256, %257, %258, %259, %260, %261, %262, %263, %264, %265, %266, %267, %268, %269, %270, %271, %272, %273, %274, %275, %276, %277, %278, %279, %280, %281, %282, %283, %284, %285, %286, %287, %288, %289, %290, %291, %292, %293, %294, %295, %296, %297, %298, %299, %300, %301, %302, %303, %304, %305, %306, %307, %308, %309, %310, %311, %312, %313, %314, %315, %316, %317, %318, %319, %320, %321, %322, %323, %324, %325, %326, %327, %328, %329, %330, %331, %332, %333, %334, %335, %336, %337, %338, %339, %340, %341, %342, %343, %344, %345, %346, 
%347, %348, %349, %350, %351, %352, %353, %354, %355, %356, %357, %358, %359, %360, %361, %362, %363, %364, %365, %366, %367, %368, %369, %370, %371, %372, %373, %374, %375, %376, %377, %378, %379, %380, %381, %382, %383, %384, %385, %386, %387, %388, %389, %390, %391, %392, %393, %394, %395, %396, %397, %398, %399, %400, %401, %402, %403, %404, %405, %406, %407, %408, %409, %410, %411, %412, %413, %414, %415, %416, %417, %418, %419, %420, %421, %422, %423, %424, %425, %426, %427, %428, %429, %430, %431, %432, %433, %434, %435) => !stream.timepoint

We should be able to partition on operand affinity:

%join_a = stream.timepoint.join on(#hal.device.affinity<@a>) max(%2, %3, %4, %5, ...)
%join_b = stream.timepoint.join on(#hal.device.affinity<@b>) max(%90, %91, %92, %93, ...)
%join_ab = stream.timepoint.join max(%join_a, %join_b)

We could also assign mixed joins to consumers, if known, when there's more than one user/global store to reduce the wait overheads:

%join_a = stream.timepoint.join on(#hal.device.affinity<@a>) max(%2, %3, %4, %5, ...)
%join_b = stream.timepoint.join on(#hal.device.affinity<@b>) max(%90, %91, %92, %93, ...)
%join_ab = stream.timepoint.join on(#hal.device.affinity<@c>) max(%join_a, %join_b)
stream.cmd.execute on(#hal.device.affinity<@c>) await(%join_ab)
stream.cmd.execute on(#hal.device.affinity<@c>) await(%join_ab)
stream.cmd.execute on(#hal.device.affinity<@c>) await(%join_ab)
stream.cmd.execute on(#hal.device.affinity<@c>) await(%join_ab)
stream.cmd.execute on(#hal.device.affinity<@c>) await(%join_ab)
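One possible way to decide when a mixed join can be assigned to its consumers is sketched below in Python (not IREE code). It assumes the rule is "assign only when every user has the same known affinity", which is an assumption on top of the description above; the helper name `consumer_affinity` is hypothetical.

```python
from typing import Iterable, Optional


def consumer_affinity(user_affinities: Iterable[Optional[str]]) -> Optional[str]:
    """Returns the affinity shared by all users of a join, or None.

    None is returned when there are no users, when users disagree, or when
    any user's affinity is unknown (represented here as None).
    """
    seen = set(user_affinities)
    if len(seen) == 1:
        (only,) = seen
        return only  # still None if the single distinct value is "unknown"
    return None


# Five executes on @c, as in the example above.
shared = consumer_affinity(["@c"] * 5)
# Users on different devices: leave the join without an affinity.
mixed = consumer_affinity(["@c", "@d"])
```

A more aggressive policy (for example, picking the most common consumer affinity) is also conceivable, but the conservative rule keeps the unassigned case identical to today's fence join.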

And if any particular join has a large number of operands, chain those:

%join_a_0 = stream.timepoint.join on(#hal.device.affinity<@a>) max(%2, %3)
%join_a_1 = stream.timepoint.join on(#hal.device.affinity<@a>) max(%join_a_0, %4, %5, ...)
...
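
The chaining above could be sketched roughly as follows in Python (not IREE code). The operand limit `max_operands` is a hypothetical parameter, since no specific limit is given here, and `chain_joins` is an illustrative name. Each join after the first spends one operand slot on the previous join's result.

```python
from typing import List, Tuple


def chain_joins(
    operands: List[str],
    max_operands: int,
    prefix: str = "%join",
) -> List[Tuple[str, List[str]]]:
    """Splits one large join into a chain of joins (hypothetical helper).

    Returns a list of (result_name, operand_list) pairs in program order.
    No join has more than `max_operands` operands.
    """
    if max_operands < 2:
        raise ValueError("max_operands must be at least 2 to make progress")
    if not operands:
        return []
    joins: List[Tuple[str, List[str]]] = []
    current = list(operands[:max_operands])  # first join: only new timepoints
    rest = list(operands[max_operands:])
    index = 0
    while True:
        name = f"{prefix}_{index}"
        joins.append((name, current))
        if not rest:
            return joins
        take = max_operands - 1  # one slot is used by the previous result
        current = [name] + rest[:take]
        rest = rest[take:]
        index += 1


chained = chain_joins(["%2", "%3", "%4", "%5", "%6"], 3, prefix="%join_a")
```

A chain serializes the joins; a balanced tree of joins would be an alternative shape with the same per-join operand bound.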
@benvanik benvanik self-assigned this May 15, 2025
@benvanik benvanik added compiler/dialects Relating to the IREE compiler dialects (flow, hal, vm) performance ⚡ Performance/optimization related work across the compiler and runtime labels May 15, 2025