-
Notifications
You must be signed in to change notification settings - Fork 699
Correctness issue related to EmplaceAllocations #19355
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
do you have your snippets before/after the pass? the full IR is too big to parse |
Yeah, I pasted the snippets before the pass, but got lazy with making gists :P Let me cleanup the gists and get 2 sets of before and after. |
Here is a gist with two files with dumps of before and after: https://gist.github.com/qedawkins/8cb3864880c8285a203fb517d5b691e8 |
Here are both snippets when compiled to stream: https://gist.github.com/qedawkins/b6c8ebe9b99d5e5015549b7b3e13c9b3 Edit: previous example had been DCE'ing the first two dispatches, updated now. |
New set of dumps including:
https://gist.github.com/qedawkins/ee6437eb2fdb15bf7dac3791e8a88cc1 |
Ok I found something suspect looking at the dump for the whole model. stream.cmd.concurrent {
stream.cmd.copy %arg8[%c0], %arg3990[%c2560], %c131072 : !stream.resource<external>{%c131072} -> !stream.resource<transient>{%c344138752}
stream.cmd.copy %arg8[%c0], %arg3990[%c133632], %c131072 : !stream.resource<external>{%c131072} -> !stream.resource<transient>{%c344138752}
stream.cmd.dispatch @main$async_dispatch_0::@main$async_dispatch_0_elementwise_broadcast_2x160_f16xf32xf32xf32 {
ro %arg9[%c0 for %c2] : !stream.resource<external>{%c2},
ro %arg10[%c0 for %c319998976] : !stream.resource<constant>{%c319998976},
rw %arg3990[%c0 for %c344138752] : !stream.resource<transient>{%c344138752},
rw %arg3990[%c0 for %c344138752] : !stream.resource<transient>{%c344138752}
}
}
stream.cmd.concurrent {
stream.cmd.copy %arg3990[%c265984], %arg3990[%c0], %c1280 : !stream.resource<transient>{%c344138752} -> !stream.resource<transient>{%c344138752}
stream.cmd.copy %arg3990[%c264704], %arg3990[%c1280], %c1280 : !stream.resource<transient>{%c344138752} -> !stream.resource<transient>{%c344138752}
} and the outputs for dispatch 0 (arg2/arg3) look like this: func.func @main$async_dispatch_0_elementwise_broadcast_2x160_f16xf32xf32xf32(%arg0: !stream.binding {stream.alignment = 64 : index}, %arg1: !stream.binding {stream.alignment = 64 : index}, %arg2: !stream.binding {stream.alignment = 64 : index}, %arg3: !stream.binding {stream.alignment = 64 : index}) {
%c0 = arith.constant 0 : index
%c1280 = arith.constant 1280 : index
%0 = stream.binding.subspan %arg0[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<f16>>
%1 = stream.binding.subspan %arg1[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<160xf32>>
%2 = stream.binding.subspan %arg2[%c0] : !stream.binding -> !flow.dispatch.tensor<writeonly:tensor<160x2xf32>>
%3 = stream.binding.subspan %arg3[%c1280] : !stream.binding -> !flow.dispatch.tensor<writeonly:tensor<160x2xf32>> This looks to me like dispatch_0 is writing to %arg3990[%c0] for %c2560 across its two outputs, and then that is immediately being copied over in the next stream.cmd.concurrent. %0 = stream.binding.subspan %arg0[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<f16>>
%1 = stream.binding.subspan %arg1[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<160xf32>>
%2 = stream.binding.subspan %arg2[%c262144] : !stream.binding -> !flow.dispatch.tensor<writeonly:tensor<160x2xf32>>
%3 = stream.binding.subspan %arg3[%c264704] : !stream.binding -> !flow.dispatch.tensor<writeonly:tensor<160x2xf32>>
stream.cmd.concurrent {
stream.cmd.copy %arg3990[%c264704], %arg3990[%c263424], %c1280 : !stream.resource<transient>{%c344136192} -> !stream.resource<transient>{%c344136192}
stream.cmd.copy %arg3990[%c264704], %arg3990[%c265984], %c1280 : !stream.resource<transient>{%c344136192} -> !stream.resource<transient>{%c344136192}
stream.cmd.copy %arg3990[%c262144], %arg3990[%c267264], %c1280 : !stream.resource<transient>{%c344136192} -> !stream.resource<transient>{%c344136192}
} |
Maybe it's an issue with %11:2 = stream.async.concurrent with(%9 as %arg4037: !stream.resource<transient>{%c262144}, %arg8 as %arg4038: !stream.resource<external>{%c131072}, %arg9 as %arg4039: !stream.resource<external>{%c2}, %arg10 as %arg4040: !stream.resource<constant>{%c640}, %10 as %arg4041: !stream.resource<transient>{%c2560}) -> (%9{%c262144}, %10{%c2560}) {
%1077 = stream.async.update %arg4038, %arg4037[%c0 to %c131072] : !stream.resource<external>{%c131072} -> %arg4037 as !stream.resource<transient>{%c262144}
%1078 = stream.async.update %arg4038, %1077[%c131072 to %c262144] : !stream.resource<external>{%c131072} -> %1077 as !stream.resource<transient>{%c262144}
%1079:2 = s
8000
tream.async.dispatch @main$async_dispatch_0::@main$async_dispatch_0_elementwise_broadcast_2x160_f16xf32xf32xf32(%arg4039[%c0 to %c2 for %c2], %arg4040[%c0 to %c640 for %c640], %arg4041[%c0 to %c1280 for %c1280], %arg4041[%c1280 to %c2560 for %c1280]) : (!stream.resource<external>{%c2}, !stream.resource<constant>{%c640}, !stream.resource<transient>{%c2560}, !stream.resource<transient>{%c2560}) -> (%arg4041{%c2560}, %arg4041{%c2560})
stream.yield %1078, %1079#1 : !stream.resource<transient>{%c262144}, !stream.resource<transient>{%c2560}
}
%12 = stream.async.alloca : !stream.resource<transient>{%c2560}
%13 = stream.async.concurrent with(%12 as %arg4037: !stream.resource<transient>{%c2560}, %11#1 as %arg4038: !stream.resource<transient>{%c2560}) -> %12{%c2560} {
%1077 = stream.async.copy %arg4038[%c1280 to %c2560], %arg4037[%c0 to %c1280], %c1280 : !stream.resource<transient>{%c2560} -> %arg4037 as !stream.resource<transient>{%c2560}
%1078 = stream.async.copy %arg4038[%c0 to %c1280], %1077[%c1280 to %c2560], %c1280 : !stream.resource<transient>{%c2560} -> %1077 as !stream.resource<transient>{%c2560}
stream.yield %1078 : !stream.resource<transient>{%c2560}
} Which is writing the result of %8:1452 = stream.resource.pack on(#hal.device.affinity<@__device_0>) slices({
[2, 2] = %c2560, // #1
[0, 16] = %c262144, // #2
[1, 4] = %c2560, // #3
[3, 5] = %c2560, // #4
[5, 7] = %c1280, // #5
...
%24 = arith.addi %8#4, %c1280 : index
%25 = arith.addi %8#3, %c1280 : index
%26 = arith.addi %8#1, %c1280 : index
stream.cmd.concurrent {
stream.cmd.copy %arg8[%c0_4022], %arg4038[%8#2], %c131072 : !stream.resource<external>{%c131072} -> !stream.resource<transient>{%8#0}
stream.cmd.copy %arg8[%c0_4022], %arg4038[%27], %c131072 : !stream.resource<external>{%c131072} -> !stream.resource<transient>{%8#0}
stream.cmd.dispatch @main$async_dispatch_0::@main$async_dispatch_0_elementwise_broadcast_2x160_f16xf32xf32xf32 {
ro %arg9[%c0 for %c2] : !stream.resource<external>{%c2},
ro %arg10[%c0 for %c640] : !stream.resource<constant>{%c640},
rw %arg4038[%8#1 for %c1280] : !stream.resource<transient>{%8#0},
rw %arg4038[%26 for %c1280] : !stream.resource<transient>{%8#0}
}
}
stream.cmd.concurrent {
stream.cmd.copy %arg4038[%25], %arg4038[%8#4], %c1280 : !stream.resource<transient>{%8#0} -> !stream.resource<transient>{%8#0}
stream.cmd.copy %arg4038[%8#3], %arg4038[%24], %c1280 : !stream.resource<transient>{%8#0} -> !stream.resource<transient>{%8#0}
} Where it now appears to be copying a completely unrelated resource. Dispatch 0 writes to |
I pulled out an isolated test example: module {
util.func public @example(%arg0: !stream.timepoint, %arg1: !stream.resource<external>, %arg2: !stream.resource<external>, %arg3: !stream.resource<constant>) {
%c0 = arith.constant 0 : index
%c2 = arith.constant 2 : index
%c640 = arith.constant 640 : index
%c1280 = arith.constant 1280 : index
%c2560 = arith.constant 2560 : index
%c131072 = arith.constant 131072 : index
%c262144 = arith.constant 262144 : index
%results, %result_timepoint = stream.async.execute await(%arg0) => with(%arg1 as %arg4: !stream.resource<external>{%c131072}, %arg2 as %arg5: !stream.resource<external>{%c2}, %arg3 as %arg6: !stream.resource<constant>{%c640}) -> !stream.resource<transient>{%c2560} {
%1 = stream.async.alloca : !stream.resource<transient>{%c262144}
%2 = stream.async.alloca : !stream.resource<transient>{%c2560}
%3:2 = stream.async.concurrent with(%1 as %arg7: !stream.resource<transient>{%c262144}, %arg4 as %arg8: !stream.resource<external>{%c131072}, %arg5 as %arg9: !stream.resource<external>{%c2}, %arg6 as %arg10: !stream.resource<constant>{%c640}, %2 as %arg11: !stream.resource<transient>{%c2560}) -> (%1{%c262144}, %2{%c2560}) {
%6 = stream.async.update %arg8, %arg7[%c0 to %c131072] : !stream.resource<external>{%c131072} -> %arg7 as !stream.resource<transient>{%c262144}
%7 = stream.async.update %arg8, %6[%c131072 to %c262144] : !stream.resource<external>{%c131072} -> %6 as !stream.resource<transient>{%c262144}
%8:2 = stream.async.dispatch @main$async_dispatch_0::@main$async_dispatch_0_elementwise_broadcast_2x160_f16xf32xf32xf32(%arg9[%c0 to %c2 for %c2], %arg10[%c0 to %c640 for %c640], %arg11[%c0 to %c1280 for %c1280], %arg11[%c1280 to %c2560 for %c1280]) : (!stream.resource<external>{%c2}, !stream.resource<constant>{%c640}, !stream.resource<transient>{%c2560}, !stream.resource<transient>{%c2560}) -> (%arg11{%c2560}, %arg11{%c2560})
stream.yield %7, %8#1 : !stream.resource<transient>{%c262144}, !stream.resource<transient>{%c2560}
}
%4 = stream.async.alloca : !stream.resource<transient>{%c2560}
%5 = stream.async.concurrent with(%4 as %arg7: !stream.resource<transient>{%c2560}, %3#1 as %arg8: !stream.resource<transient>{%c2560}) -> %4{%c2560} {
%6 = stream.async.copy %arg8[%c1280 to %c2560], %arg7[%c0 to %c1280], %c1280 : !stream.resource<transient>{%c2560} -> %arg7 as !stream.resource<transient>{%c2560}
%7 = stream.async.copy %arg8[%c0 to %c1280], /// After ScheduleAllocation
module {
util.func public @example(%arg0: !stream.timepoint, %arg1: !stream.resource<external>, %arg2: !stream.resource<external>, %arg3: !stream.resource<constant>) {
%c0 = arith.constant 0 : index
%c2 = arith.constant 2 : index
%c640 = arith.constant 640 : index
%c1280 = arith.constant 1280 : index
%c2560 = arith.constant 2560 : index
%c131072 = arith.constant 131072 : index
%c262144 = arith.constant 262144 : index
%c0_0 = arith.constant 0 : index
%result, %result_timepoint = stream.resource.alloca uninitialized await(%arg0) => !stream.resource<transient>{%c2560} => !stream.timepoint
%0:4 = stream.resource.pack slices({
[2, 2] = %c2560,
[0, 2] = %c262144,
[1, 4] = %c2560
}) : index
%result_1, %result_timepoint_2 = stream.resource.alloca uninitialized await(%arg0) => !stream.resource<transient>{%0#0} => !stream.timepoint
%1 = stream.timepoint.join max(%arg0, %result_timepoint, %result_timepoint_2) => !stream.timepoint
%2 = arith.addi %0#3, %c1280 : index
%3 = arith.addi %0#1, %c1280 : index
%4 = arith.addi %0#2, %c131072 : index
%5 = stream.cmd.execute await(%1) => with(%arg1 as %arg4: !stream.resource<external>{%c131072}, %arg2 as %arg5: !stream.resource<external>{%c2}, %arg3 as %arg6: !stream.resource<constant>{%c640}, %result as %arg7: !stream.resource<transient>{%c2560}, %result_1 as %arg8: !stream.resource<transient>{%0#0}) {
stream.cmd.concurrent {
stream.cmd.copy %arg4[%c0_0], %arg8[%0#2], %c131072 : !stream.resource<external>{%c131072} -> !stream.resource<transient>{%0#0}
stream.cmd.copy %arg4[%c0_0], %arg8[%4], %c131072 : !stream.resource<external>{%c131072} -> !stream.resource<transient>{%0#0}
stream.cmd.dispatch @main$async_dispatch_0::@main$async_dispatch_0_elementwise_broadcast_2x160_f16xf32xf32xf32 {
ro %arg5[%c0 for %c2] : !stream.resource<external>{%c2},
ro %arg6[%c0 for %c640] : !stream.resource<constant>{%c640},
rw %arg8[%0#1 for %c1280] : !stream.resource<transient>{%0#0},
ro %arg8[%3 for %c1280] : !stream.resource<transient>{%0#0}
}
}
stream.cmd.concurrent {
stream.cmd.copy %arg8[%2], %arg7[%c0], %c1280 : !stream.resource<transient>{%0#0} -> !stream.resource<transient>{%c2560}
stream.cmd.copy %arg8[%0#3], %arg7[%c1280], %c1280 : !stream.resource<transient>{%0#0} -> !stream.resource<transient>{%c2560}
}
} => !stream.timepoint
%6 = stream.resource.dealloca await(%5) => %result_1 : !stream.resource<transient>{%0#0} => !stream.timepoint
%7 = stream.timepoint.join max(%6, %5) => !stream.timepoint
%8 = util.optimization_barrier %result : !stream.resource<transient>
util.return
}
} I could use a sanity check on the input here, but I think it's related to the fact that |
I think I located where the bug was introduced: https://github.com/iree-org/iree/pull/13748/files#diff-1e1b461041de3eed60748abf412f0dfec6092d1c87ca290fecee243964d3acf3R212 That patch made it so all unused transients inside concurrent regions are allocated their own buffer which effectively becomes
Here |
This patch (957ae60) revealed a correctness issue for SDXL when disabling
--iree-dispatch-creation-enable-fuse-horizontal-contractions
. Commenting outStream::EmplaceAllocationsPass
fixes the numerical issues. Additionally the results are non-deterministic but non-nan, and appear to rotate through a few possible sets of values.Dumping the IR before and after EmplaceAllocations with and without the above PR shows very minimal meaningful differences in the IR. Here is a snippet of the only differing part of the IR
Before 957ae: https://gist.github.com/qedawkins/56b861ada7c76fc7919f5fe12ca951cf // Correct
After 957ae: https://gist.github.com/qedawkins/2d6bcc8a4655330b6ac42149658f809b // Wrong
Full before and after dumps can be found here: https://gist.github.com/qedawkins/4fb2be9c832866ff5fb573bb3d92d4da + https://gist.github.com/qedawkins/316ad40b33a61096c5fb3051fdab5022
I'm not sure that EmplaceAllocations is the actual root cause of the bug, as scanning the IR by hand I could not find anything obviously wrong.
The artifacts for (f16) SDXL can be found here: https://github.com/iree-org/iree/blob/main/experimental/regression_suite/shark-test-suite-models/sdxl/test_unet.py#L107. Using the following compilation command can reproduce the full correctness issue:
The text was updated successfully, but these errors were encountered: