8000 Correctness issue related to EmplaceAllocations · Issue #19355 · iree-org/iree · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Correctness issue related to EmplaceAllocations #19355

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
qedawkins opened this issue Dec 3, 2024 · 9 comments
Closed

Correctness issue related to EmplaceAllocations #19355

qedawkins opened this issue Dec 3, 2024 · 9 comments
Labels
bug 🐞 Something isn't working

Comments

@qedawkins
Copy link
Contributor

This patch (957ae60) revealed a correctness issue for SDXL when disabling --iree-dispatch-creation-enable-fuse-horizontal-contractions. Commenting out Stream::EmplaceAllocationsPass fixes the numerical issues. Additionally the results are non-deterministic but non-nan, and appear to rotate through a few possible sets of values.

Dumping the IR before and after EmplaceAllocations with and without the above PR shows very minimal meaningful differences in the IR. Here is a snippet of the only differing part of the IR
Before 957ae: https://gist.github.com/qedawkins/56b861ada7c76fc7919f5fe12ca951cf // Correct
After 957ae: https://gist.github.com/qedawkins/2d6bcc8a4655330b6ac42149658f809b // Wrong

Full before and after dumps can be found here: https://gist.github.com/qedawkins/4fb2be9c832866ff5fb573bb3d92d4da + https://gist.github.com/qedawkins/316ad40b33a61096c5fb3051fdab5022

I'm not sure that EmplaceAllocations is the actual root cause of the bug, as scanning the IR by hand I could not find anything obviously wrong.

The artifacts for (f16) SDXL can be found here: https://github.com/iree-org/iree/blob/main/experimental/regression_suite/shark-test-suite-models/sdxl/test_unet.py#L107. Using the following compilation command can reproduce the full correctness issue:

iree-compile sdxl.mlir \
--iree-hal-target-backends=rocm \
--iree-hip-target=gfx942 \
--iree-vm-bytecode-module-output-format=flatbuffer-binary \
--iree-dispatch-creation-enable-aggressive-fusion \
--iree-dispatch-creation-enable-fuse-horizontal-contractions=false \
--iree-opt-aggressively-propagate-transposes=true \
--iree-codegen-llvmgpu-use-vector-distribution=true \
--iree-opt-outer-dim-concat=true \
--iree-opt-data-tiling=false \
--iree-hip-legacy-sync=true \
--iree-codegen-gpu-native-math-precision=true \
--iree-vm-target-truncate-unsupported-floats \
--iree-global-opt-propagate-transposes=true \
--iree-opt-const-eval=false \
--iree-llvmgpu-enable-prefetch=true \
--iree-execution-model=async-external \
--iree-preprocessing-pass-pipeline="builtin.module(util.func(iree-global-opt-raise-special-ops, iree-flow-canonicalize), iree-preprocessing-transpose-convolution-pipeline, iree-preprocessing-pad-to-intrinsics, util.func(iree-preprocessing-generalize-linalg-matmul-experimental))" \
--iree-codegen-transform-dialect-library=/home/qdawkins/data/current_spec.mlir \
-o sdxl.vmfb
@qedawkins qedawkins added the bug 🐞 Something isn't working label Dec 3, 2024
@benvanik
Copy link
Collaborator
benvanik commented Dec 3, 2024

do you have your snippets before/after the pass? the full IR is too big to parse

@qedawkins
Copy link
Contributor Author

Yeah, I pasted the snippets before the pass, but got lazy with making gists :P

Let me cleanup the gists and get 2 sets of before and after.

@qedawkins
Copy link
Contributor Author

Here is a gist with two files with dumps of before and after: https://gist.github.com/qedawkins/8cb3864880c8285a203fb517d5b691e8

@qedawkins
Copy link
Contributor Author
qedawkins commented Dec 3, 2024

Here are both snippets when compiled to stream: https://gist.github.com/qedawkins/b6c8ebe9b99d5e5015549b7b3e13c9b3

Edit: previous example had been DCE'ing the first two dispatches, updated now.

@qedawkins
Copy link
Contributor Author

New set of dumps including:

  • Both the good and bad snippets with dispatches
  • Dump before/after all when compiling to stream (with dispatches)
  • Final result of --compile-to=stream (with dispatches)

https://gist.github.com/qedawkins/ee6437eb2fdb15bf7dac3791e8a88cc1

@qedawkins
Copy link
Contributor Author

Ok I found something suspect looking at the dump for the whole model.
These are the first two stream.cmd.concurrent in the failing case:

      stream.cmd.concurrent {
        stream.cmd.copy %arg8[%c0], %arg3990[%c2560], %c131072 : !stream.resource<external>{%c131072} -> !stream.resource<transient>{%c344138752}
        stream.cmd.copy %arg8[%c0], %arg3990[%c133632], %c131072 : !stream.resource<external>{%c131072} -> !stream.resource<transient>{%c344138752}
        stream.cmd.dispatch @main$async_dispatch_0::@main$async_dispatch_0_elementwise_broadcast_2x160_f16xf32xf32xf32 {
          ro %arg9[%c0 for %c2] : !stream.resource<external>{%c2},
          ro %arg10[%c0 for %c319998976] : !stream.resource<constant>{%c319998976},
          rw %arg3990[%c0 for %c344138752] : !stream.resource<transient>{%c344138752},
          rw %arg3990[%c0 for %c344138752] : !stream.resource<transient>{%c344138752}
        }
      }
      stream.cmd.concurrent {
        stream.cmd.copy %arg3990[%c265984], %arg3990[%c0], %c1280 : !stream.resource<transient>{%c344138752} -> !stream.resource<transient>{%c344138752}
        stream.cmd.copy %arg3990[%c264704], %arg3990[%c1280], %c1280 : !stream.resource<transient>{%c344138752} -> !stream.resource<transient>{%c344138752}
      }

and the outputs for dispatch 0 (arg2/arg3) look like this:

      func.func @main$async_dispatch_0_elementwise_broadcast_2x160_f16xf32xf32xf32(%arg0: !stream.binding {stream.alignment = 64 : index}, %arg1: !stream.binding {stream.alignment = 64 : index}, %arg2: !stream.binding {stream.alignment = 64 : index}, %arg3: !stream.binding {stream.alignment = 64 : index}) {
        %c0 = arith.constant 0 : index
        %c1280 = arith.constant 1280 : index
        %0 = stream.binding.subspan %arg0[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<f16>>
        %1 = stream.binding.subspan %arg1[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<160xf32>>
        %2 = stream.binding.subspan %arg2[%c0] : !stream.binding -> !flow.dispatch.tensor<writeonly:tensor<160x2xf32>>
        %3 = stream.binding.subspan %arg3[%c1280] : !stream.binding -> !flow.dispatch.tensor<writeonly:tensor<160x2xf32>>

This looks to me like dispatch_0 is writing to %arg3990[%c0] for %c2560 across its two outputs, and then that is immediately being copied over in the next stream.cmd.concurrent.
The offsets look right to me in the working case:

        %0 = stream.binding.subspan %arg0[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<f16>>
        %1 = stream.binding.subspan %arg1[%c0] : !stream.binding -> !flow.dispatch.tensor<readonly:tensor<160xf32>>
        %2 = stream.binding.subspan %arg2[%c262144] : !stream.binding -> !flow.dispatch.tensor<writeonly:tensor<160x2xf32>>
        %3 = stream.binding.subspan %arg3[%c264704] : !stream.binding -> !flow.dispatch.tensor<writeonly:tensor<160x2xf32>>

      stream.cmd.concurrent {
        stream.cmd.copy %arg3990[%c264704], %arg3990[%c263424], %c1280 : !stream.resource<transient>{%c344136192} -> !stream.resource<transient>{%c344136192}
        stream.cmd.copy %arg3990[%c264704], %arg3990[%c265984], %c1280 : !stream.resource<transient>{%c344136192} -> !stream.resource<transient>{%c344136192}
        stream.cmd.copy %arg3990[%c262144], %arg3990[%c267264], %c1280 : !stream.resource<transient>{%c344136192} -> !stream.resource<transient>{%c344136192}
      }

@qedawkins
Copy link
Contributor Author

Maybe it's an issue with ScheduleAllocationPass. Before it I see

      %11:2 = stream.async.concurrent with(%9 as %arg4037: !stream.resource<transient>{%c262144}, %arg8 as %arg4038: !stream.resource<external>{%c131072}, %arg9 as %arg4039: !stream.resource<external>{%c2}, %arg10 as %arg4040: !stream.resource<constant>{%c640}, %10 as %arg4041: !stream.resource<transient>{%c2560}) -> (%9{%c262144}, %10{%c2560}) {
        %1077 = stream.async.update %arg4038, %arg4037[%c0 to %c131072] : !stream.resource<external>{%c131072} -> %arg4037 as !stream.resource<transient>{%c262144}
        %1078 = stream.async.update %arg4038, %1077[%c131072 to %c262144] : !stream.resource<external>{%c131072} -> %1077 as !stream.resource<transient>{%c262144}
        %1079:2 = s
8000
tream.async.dispatch @main$async_dispatch_0::@main$async_dispatch_0_elementwise_broadcast_2x160_f16xf32xf32xf32(%arg4039[%c0 to %c2 for %c2], %arg4040[%c0 to %c640 for %c640], %arg4041[%c0 to %c1280 for %c1280], %arg4041[%c1280 to %c2560 for %c1280]) : (!stream.resource<external>{%c2}, !stream.resource<constant>{%c640}, !stream.resource<transient>{%c2560}, !stream.resource<transient>{%c2560}) -> (%arg4041{%c2560}, %arg4041{%c2560})
        stream.yield %1078, %1079#1 : !stream.resource<transient>{%c262144}, !stream.resource<transient>{%c2560}
      }
      %12 = stream.async.alloca : !stream.resource<transient>{%c2560}
      %13 = stream.async.concurrent with(%12 as %arg4037: !stream.resource<transient>{%c2560}, %11#1 as %arg4038: !stream.resource<transient>{%c2560}) -> %12{%c2560} {
        %1077 = stream.async.copy %arg4038[%c1280 to %c2560], %arg4037[%c0 to %c1280], %c1280 : !stream.resource<transient>{%c2560} -> %arg4037 as !stream.resource<transient>{%c2560}
        %1078 = stream.async.copy %arg4038[%c0 to %c1280], %1077[%c1280 to %c2560], %c1280 : !stream.resource<transient>{%c2560} -> %1077 as !stream.resource<transient>{%c2560}
        stream.yield %1078 : !stream.resource<transient>{%c2560}
      }

Which is writing the result of dispatch_0 in the subsequent copies. After scheduling allocations I see (assuming I'm reading stream.resource.pack correctly)

    %8:1452 = stream.resource.pack on(#hal.device.affinity<@__device_0>) slices({
      [2, 2] = %c2560, // #1
      [0, 16] = %c262144, // #2
      [1, 4] = %c2560, // #3
      [3, 5] = %c2560, // #4
      [5, 7] = %c1280, // #5
      ...
    %24 = arith.addi %8#4, %c1280 : index
    %25 = arith.addi %8#3, %c1280 : index
    %26 = arith.addi %8#1, %c1280 : index
      stream.cmd.concurrent {
        stream.cmd.copy %arg8[%c0_4022], %arg4038[%8#2], %c131072 : !stream.resource<external>{%c131072} -> !stream.resource<transient>{%8#0}
        stream.cmd.copy %arg8[%c0_4022], %arg4038[%27], %c131072 : !stream.resource<external>{%c131072} -> !stream.resource<transient>{%8#0}
        stream.cmd.dispatch @main$async_dispatch_0::@main$async_dispatch_0_elementwise_broadcast_2x160_f16xf32xf32xf32 {
          ro %arg9[%c0 for %c2] : !stream.resource<external>{%c2},
          ro %arg10[%c0 for %c640] : !stream.resource<constant>{%c640},
          rw %arg4038[%8#1 for %c1280] : !stream.resource<transient>{%8#0},
          rw %arg4038[%26 for %c1280] : !stream.resource<transient>{%8#0}
        }
      }
      stream.cmd.concurrent {
        stream.cmd.copy %arg4038[%25], %arg4038[%8#4], %c1280 : !stream.resource<transient>{%8#0} -> !stream.resource<transient>{%8#0}
        stream.cmd.copy %arg4038[%8#3], %arg4038[%24], %c1280 : !stream.resource<transient>{%8#0} -> !stream.resource<transient>{%8#0}
      }

Where it now appears to be copying a completely unrelated resource. Dispatch 0 writes to %8#1 and %8#1 + 1280, while the copy is copying from %8#3 and %8#3 + 1280.

@qedawkins
Copy link
Contributor Author

I pulled out an isolated test example:

module {
  util.func public @example(%arg0: !stream.timepoint, %arg1: !stream.resource<external>, %arg2: !stream.resource<external>, %arg3: !stream.resource<constant>) {
    %c0 = arith.constant 0 : index
    %c2 = arith.constant 2 : index
    %c640 = arith.constant 640 : index
    %c1280 = arith.constant 1280 : index
    %c2560 = arith.constant 2560 : index
    %c131072 = arith.constant 131072 : index
    %c262144 = arith.constant 262144 : index
    %results, %result_timepoint = stream.async.execute await(%arg0) => with(%arg1 as %arg4: !stream.resource<external>{%c131072}, %arg2 as %arg5: !stream.resource<external>{%c2}, %arg3 as %arg6: !stream.resource<constant>{%c640}) -> !stream.resource<transient>{%c2560} {
      %1 = stream.async.alloca : !stream.resource<transient>{%c262144}
      %2 = stream.async.alloca : !stream.resource<transient>{%c2560}
      %3:2 = stream.async.concurrent with(%1 as %arg7: !stream.resource<transient>{%c262144}, %arg4 as %arg8: !stream.resource<external>{%c131072}, %arg5 as %arg9: !stream.resource<external>{%c2}, %arg6 as %arg10: !stream.resource<constant>{%c640}, %2 as %arg11: !stream.resource<transient>{%c2560}) -> (%1{%c262144}, %2{%c2560}) {
        %6 = stream.async.update %arg8, %arg7[%c0 to %c131072] : !stream.resource<external>{%c131072} -> %arg7 as !stream.resource<transient>{%c262144}
        %7 = stream.async.update %arg8, %6[%c131072 to %c262144] : !stream.resource<external>{%c131072} -> %6 as !stream.resource<transient>{%c262144}
        %8:2 = stream.async.dispatch @main$async_dispatch_0::@main$async_dispatch_0_elementwise_broadcast_2x160_f16xf32xf32xf32(%arg9[%c0 to %c2 for %c2], %arg10[%c0 to %c640 for %c640], %arg11[%c0 to %c1280 for %c1280], %arg11[%c1280 to %c2560 for %c1280]) : (!stream.resource<external>{%c2}, !stream.resource<constant>{%c640}, !stream.resource<transient>{%c2560}, !stream.resource<transient>{%c2560}) -> (%arg11{%c2560}, %arg11{%c2560})
        stream.yield %7, %8#1 : !stream.resource<transient>{%c262144}, !stream.resource<transient>{%c2560}
      }
      %4 = stream.async.alloca : !stream.resource<transient>{%c2560}
      %5 = stream.async.concurrent with(%4 as %arg7: !stream.resource<transient>{%c2560}, %3#1 as %arg8: !stream.resource<transient>{%c2560}) -> %4{%c2560} {
        %6 = stream.async.copy %arg8[%c1280 to %c2560], %arg7[%c0 to %c1280], %c1280 : !stream.resource<transient>{%c2560} -> %arg7 as !stream.resource<transient>{%c2560}
        %7 = stream.async.copy %arg8[%c0 to %c1280], %6[%c1280 to %c2560], %c1280 : !stream.resource<transient>{%c2560} -> %6 as !stream.resource<transient>{%c2560}
        stream.yield %7 : !stream.resource<transient>{%c2560}
      }
      stream.yield %5 : !stream.resource<transient>{%c2560}
    } => !stream.timepoint
    %0 = util.optimization_barrier %results : !stream.resource<transient>
    util.return
  }
}
/// After ScheduleAllocation
module {
  util.func public @example(%arg0: !stream.timepoint, %arg1: !stream.resource<external>, %arg2: !stream.resource<external>, %arg3: !stream.resource<constant>) {
    %c0 = arith.constant 0 : index
    %c2 = arith.constant 2 : index
    %c640 = arith.constant 640 : index
    %c1280 = arith.constant 1280 : index
    %c2560 = arith.constant 2560 : index
    %c131072 = arith.constant 131072 : index
    %c262144 = arith.constant 262144 : index
    %c0_0 = arith.constant 0 : index
    %result, %result_timepoint = stream.resource.alloca uninitialized await(%arg0) => !stream.resource<transient>{%c2560} => !stream.timepoint
    %0:4 = stream.resource.pack slices({
      [2, 2] = %c2560,
      [0, 2] = %c262144,
      [1, 4] = %c2560
    }) : index
    %result_1, %result_timepoint_2 = stream.resource.alloca uninitialized await(%arg0) => !stream.resource<transient>{%0#0} => !stream.timepoint
    %1 = stream.timepoint.join max(%arg0, %result_timepoint, %result_timepoint_2) => !stream.timepoint
    %2 = arith.addi %0#3, %c1280 : index
    %3 = arith.addi %0#1, %c1280 : index
    %4 = arith.addi %0#2, %c131072 : index
    %5 = stream.cmd.execute await(%1) => with(%arg1 as %arg4: !stream.resource<external>{%c131072}, %arg2 as %arg5: !stream.resource<external>{%c2}, %arg3 as %arg6: !stream.resource<constant>{%c640}, %result as %arg7: !stream.resource<transient>{%c2560}, %result_1 as %arg8: !stream.resource<transient>{%0#0}) {
      stream.cmd.concurrent {
        stream.cmd.copy %arg4[%c0_0], %arg8[%0#2], %c131072 : !stream.resource<external>{%c131072} -> !stream.resource<transient>{%0#0}
        stream.cmd.copy %arg4[%c0_0], %arg8[%4], %c131072 : !stream.resource<external>{%c131072} -> !stream.resource<transient>{%0#0}
        stream.cmd.dispatch @main$async_dispatch_0::@main$async_dispatch_0_elementwise_broadcast_2x160_f16xf32xf32xf32 {
          ro %arg5[%c0 for %c2] : !stream.resource<external>{%c2},
          ro %arg6[%c0 for %c640] : !stream.resource<constant>{%c640},
          rw %arg8[%0#1 for %c1280] : !stream.resource<transient>{%0#0},
          ro %arg8[%3 for %c1280] : !stream.resource<transient>{%0#0}
        }
      }
      stream.cmd.concurrent {
        stream.cmd.copy %arg8[%2], %arg7[%c0], %c1280 : !stream.resource<transient>{%0#0} -> !stream.resource<transient>{%c2560}
        stream.cmd.copy %arg8[%0#3], %arg7[%c1280], %c1280 : !stream.resource<transient>{%0#0} -> !stream.resource<transient>{%c2560}
      }
    } => !stream.timepoint
    %6 = stream.resource.dealloca await(%5) => %result_1 : !stream.resource<transient>{%0#0} => !stream.timepoint
    %7 = stream.timepoint.join max(%6, %5) => !stream.timepoint
    %8 = util.optimization_barrier %result : !stream.resource<transient>
    util.return
  }
}

I could use a sanity check on the input here, but I think it's related to the fact that %11#0 has no users while %11#1 does and aliases it.

@qedawkins
Copy link
Contributor Author

I think I located where the bug was introduced: https://github.com/iree-org/iree/pull/13748/files#diff-1e1b461041de3eed60748abf412f0dfec6092d1c87ca290fecee243964d3acf3R212

That patch made it so all unused transients inside concurrent regions are allocated their own buffer which effectively becomes /dev/null, however this doesn't work for cases like this:

      %3:2 = stream.async.concurrent with(%1 as %arg7: !stream.resource<transient>{%c262144}, %arg4 as %arg8: !stream.resource<external>{%c131072}, %arg5 as %arg9: !stream.resource<external>{%c2}, %arg6 as %arg10: !stream.resource<constant>{%c640}, %2 as %arg11: !stream.resource<transient>{%c2560}) -> (%1{%c262144}, %2{%c2560}) {
        %6 = stream.async.update %arg8, %arg7[%c0 to %c131072] : !stream.resource<external>{%c131072} -> %arg7 as !stream.resource<transient>{%c262144}
        %7 = stream.async.update %arg8, %6[%c131072 to %c262144] : !stream.resource<external>{%c131072} -> %6 as !stream.resource<transient>{%c262144}
        %8:2 = stream.async.dispatch @main$async_dispatch_0::@main$async_dispatch_0_elementwise_broadcast_2x160_f16xf32xf32xf32(%arg9[%c0 to %c2 for %c2], %arg10[%c0 to %c640 for %c640], %arg11[%c0 to %c1280 for %c1280], %arg11[%c1280 to %c2560 for %c1280]) : (!stream.resource<external>{%c2}, !stream.resource<constant>{%c640}, !stream.resource<transient>{%c2560}, !stream.resource<transient>{%c2560}) -> (%arg11{%c2560}, %arg11{%c2560})
        stream.yield %7, %8#1 : !stream.resource<transient>{%c262144}, !stream.resource<transient>{%c2560}
      }

Here %8#0 has no users while %8#1 does. As a result, the patch linked above will override the allocation %8#1 was going to get and instead (effectively) redirects it to /dev/null (because those two results alias one another). I think this needs a stronger check that none of the value's aliases escape the allocation region, not just that it has no users. I guess it's not surprising that the suspect code is commented with // HACK :P

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🐞 Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants
0