8000 [Integrate] Upstream narrow type emulation is breaking iree test · Issue #20645 · iree-org/iree · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

[Integrate] Upstream narrow type emulation is breaking iree test #20645 8000

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
raikonenfnu opened this issue Apr 26, 2025 · 16 comments
Open

[Integrate] Upstream narrow type emulation is breaking iree test #20645

raikonenfnu opened this issue Apr 26, 2025 · 16 comments
Assignees
Labels
bug 🐞 Something isn't working codegen Shared code generation infrastructure and dialects enhancement ➕ New feature or request onboarding/codegen Tasks suitable for new team member onboarding

Comments

@raikonenfnu
Copy link
Collaborator

What happened?

llvm/llvm-project#133231 seems to be breaking our subbyte emulation tests.

A simple repro is https://gist.github.com/raikonenfnu/1ea07b7e231d8997bfa1c29502df637d

original test https://github.com/iree-org/iree/blob/1a8d229431e62b50eb297e75ca4bf1dba3b67f65/tests/e2e/linalg/fp_to_subbyte.mlir

Based on the above test, currently IREE is generating code that has different trailing dim and "virtually"/fake non constant offset into the store vector.store %0[%2] where %2 is actually a constant, but since it is inside a branch region, %c0 becomes %2 which is a region argument and is now a non-constant.

The condition above leads our program into this code path where it cannot determine !foldedNumFrontPadElems and fails this lowering. as seen in:
https://github.com/llvm/llvm-project/blob/2de936b6eb38e7a37224a97c2a22aa79b9dfb9dc/mlir/lib/Dialect/Vector/Transforms/VectorEmulateNarrowType.cpp#L619-L629

IIUC, that PR is somewhat correct, as long as the trailing dim does not match, we may need partial stores.

I don't think I have enough context on the best way to solve this. We can:

  1. Fold away that region since the condition to the regions are constants or somehow constant fold S.T we have constant offset in vector.store
  2. Add some code in vector dialect's EmulateNarrowType vector::StoreOp conversion to handle non constant cases through series of bitcasting and memref.generic_atomic_rmws. (this seems much harder and require some more thinking)

Steps to reproduce your issue

  1. wget https://gist.githubusercontent.com/raikonenfnu/1ea07b7e231d8997bfa1c29502df637d/raw/6e3311f6794dd5afeaeac0f82422cbe498d67272/emulate_failure.mlir

  2. iree-opt --iree-codegen-emulate-narrow-type emulate_failure.mlir -o out.mlir

emulate.mlir:18:3: error: failed to legalize operation 'vector.store' that was explicitly marked illegal

What component(s) does this issue relate to?

No response

Version information

No response

Additional context

No response

@raikonenfnu raikonenfnu added the bug 🐞 Something isn't working label Apr 26, 2025
raikonenfnu added a commit to raikonenfnu/iree that referenced this issue Apr 26, 2025
Updated LIT test from landing llvm/llvm-project#136640
which folds linalg.index when size is unit dim (1).

This patch carries revert of

llvm/llvm-project#133231. This PR breaks fp_to_subbytes and
emulation_subbyte_types on llvm-cpu tests. iree-test-deps.
tracking issue in iree-org#20645.

llvm/llvm-project#137122. StableHLO and Torch-mlir needs to update their
usage of GreedyRewriteConfig to use fluent API. i.e
enableRegionSimplification = VS setRegionSimplificationLevel

llvm/llvm-project#135970. StableHLO has issue with VHLO_IntegerAttr and
APInt not being updated. StableHLO needs to be updated with that PR's
change for us to be able to integrate.

llvm/llvm-project#121389. Torch-MLIR needs to be
updated with that PR's change for us to be able to integrate.

Signed-off-by: Stanley Winata <stanley.winata@amd.com>
@lialan lialan self-assigned this Apr 26, 2025
raikonenfnu added a commit that referenced this issue Apr 26, 2025
Updated LIT test from landing llvm/llvm-project#136640 which folds
linalg.index when size is unit dim (1).

Added chipSet argument into populateGpuToROCDLConversionPatterns based
on changes in llvm/llvm-project#137360

This patch carries revert of

llvm/llvm-project#133231. This PR breaks fp_to_subbytes and
emulation_subbyte_types on llvm-cpu tests. iree-test-deps. tracker issue
in #20645.

llvm/llvm-project#137122. StableHLO and Torch-mlir needs to update their
usage of GreedyRewriteConfig to use fluent API. i.e
enableRegionSimplification = VS setRegionSimplificationLevel

llvm/llvm-project#135970. StableHLO has issue with VHLO_IntegerAttr and
APInt not being updated. StableHLO needs to be updated with that PR's
change for us to be able to integrate.

llvm/llvm-project#121389. Torch-MLIR needs to be
updated with that PR's change for us to be able to integrate.

---------

Signed-off-by: Stanley Winata <stanley.winata@amd.com>
@krzysz00
Copy link
Contributor

So, I think what should be done there is that the slow path should be guarded by a runtime condition as in llvm/llvm-project#135014

But, in addition, the narrow type emulation should be moved before scf-to-cf - heck, before OptimizeIntArithmetic - so that we can take advantage of divisibility analysis

@lialan
Copy link
Contributor
lialan commented Apr 28, 2025

So, I think what should be done there is that the slow path should be guarded by a runtime condition as in llvm/llvm-project#135014

But, in addition, the narrow type emulation should be moved before scf-to-cf - heck, before OptimizeIntArithmetic - so that we can take advantage of divisibility analysis

That is the way to go. Currently the transformation is also unable to handle some other cases, all needs runtime check.

@krzysz00
Copy link
Contributor

To clarify, which "other cases"?

@lialan
Copy link
Contributor
lialan commented Apr 29, 2025

To clarify, which "other cases"?

All the cases with a non-constant storing index.

@lialan
Copy link
Contributor
lialan commented Apr 29, 2025

@raikonenfnu Was the emulation in the case of non-constant/dynamic indexing already working before? I think it was not working as intended even before this problematic patch.

@krzysz00
Copy link
Contributor

I do think the patch was fixing a real correctness issue, but didn't do it in a performant way

IanWood1 added a commit that referenced this issue Apr 29, 2025
This patch carries 4 reverts from
#20646:

llvm/llvm-project#133231. This PR breaks fp_to_subbytes and
emulation_subbyte_types on llvm-cpu tests. iree-test-deps. tracker issue
in #20645.

llvm/llvm-project#137122. StableHLO and Torch-mlir needs to update their
usage of GreedyRewriteConfig to use fluent API. i.e
enableRegionSimplification = VS setRegionSimplificationLevel

llvm/llvm-project#135970. StableHLO has issue with VHLO_IntegerAttr and
APInt not being updated. StableHLO needs to be updated with that PR's
change for us to be able to integrate.

llvm/llvm-project#121389. Torch-MLIR needs to be
updated with that PR's change for us to be able to integrate.

Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
@banach-space
Copy link
Collaborator

Sorry that my patch is causing this issue - I'd like to see if I can help resolve it.

To start, I’m trying to better understand the underlying problem. When I lower the reproducer (with my patch reverted), I get the following IR:

  func.func @_f32_to_i4_1d_dispatch_0_elementwise_8_f32xi4() {
    %cst = arith.constant dense<4> : vector<2xi32>
    %cst_0 = arith.constant dense<15> : vector<2xi32>
    %c4 = arith.constant 4 : index
    %c8 = arith.constant 8 : index
    %c0 = arith.constant 0 : index
    %0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : memref<8xf32>
    memref.assume_alignment %0, 64 : memref<8xf32>
    %1 = hal.interface.binding.subspan layout(#pipeline_layout) binding(1) alignment(64) offset(%c0) flags(Indirect) : memref<4xi8>
    memref.assume_alignment %1, 64 : memref<4xi8>
    cf.br ^bb1(%c0 : index)
  ^bb1(%2: index):  // 2 preds: ^bb0, ^bb2
    %3 = arith.cmpi slt, %2, %c8 : index
    cf.cond_br %3, ^bb2, ^bb3
  ^bb2:  // pred: ^bb1
    %4 = vector.load %0[%2] : memref<8xf32>, vector<4xf32>
    %5 = arith.fptoui %4 : vector<4xf32> to vector<4xi32>
    %6 = affine.apply #map()[%2]
    %7 = vector.shuffle %5, %5 [0, 2] : vector<4xi32>, vector<4xi32>
    %8 = arith.andi %7, %cst_0 : vector<2xi32>
    %9 = vector.shuffle %5, %5 [1, 3] : vector<4xi32>, vector<4xi32>
    %10 = arith.andi %9, %cst_0 : vector<2xi32>
    %11 = arith.shli %10, %cst : vector<2xi32>
    %12 = arith.ori %8, %11 : vector<2xi32>
    %13 = arith.trunci %12 : vector<2xi32> to vector<2xi8>
    vector.store %13, %1[%6] : memref<4xi8>, vector<2xi8>
    %14 = arith.addi %2, %c4 : index
    cf.br ^bb1(%14 : index)
  ^bb3:  // pred: ^bb1
    return
  }

What stands out to me is that %2 (the branch+load/store index) doesn’t appear to be used to mask the input of the vector.store. I’m not yet sure what this constant (that's used for masking) represents ...

    %cst_0 = arith.constant dense<15> : vector<2xi32>

... but it doesn't look correct. So, it feels like the failing test is exercising an incorrect lowering path in upstream MLIR? And my patch effectively disables that.

Does this agree with your understanding?

IanWood1 added a commit that referenced this issue Apr 30, 2025
Carries the same 4 reverts as
#20657:

- llvm/llvm-project#133231. This PR breaks fp_to_subbytes and
emulation_subbyte_types on llvm-cpu tests. iree-test-deps. tracker issue
in #20645.
- llvm/llvm-project#137122. StableHLO and Torch-mlir needs to update
their usage of GreedyRewriteConfig to use fluent API. i.e
enableRegionSimplification = VS setRegionSimplificationLevel
- llvm/llvm-project#135970. StableHLO has issue with VHLO_IntegerAttr
and APInt not being updated. StableHLO needs to be updated with that
PR's change for us to be able to integrate.
- llvm/llvm-project#121389. Torch-MLIR needs to be
updated with that PR's change for us to be able to integrate.

Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
KyleHerndon pushed a commit to KyleHerndon/iree that referenced this issue May 7, 2025
Updated LIT test from landing llvm/llvm-project#136640 which folds
linalg.index when size is unit dim (1).

Added chipSet argument into populateGpuToROCDLConversionPatterns based
on changes in llvm/llvm-project#137360

This patch carries revert of

llvm/llvm-project#133231. This PR breaks fp_to_subbytes and
emulation_subbyte_types on llvm-cpu tests. iree-test-deps. tracker issue
in iree-org#20645.

llvm/llvm-project#137122. StableHLO and Torch-mlir needs to update their
usage of GreedyRewriteConfig to use fluent API. i.e
enableRegionSimplification = VS setRegionSimplificationLevel

llvm/llvm-project#135970. StableHLO has issue with VHLO_IntegerAttr and
APInt not being updated. StableHLO needs to be updated with that PR's
change for us to be able to integrate.

llvm/llvm-project#121389. Torch-MLIR needs to be
updated with that PR's change for us to be able to integrate.

---------

Signed-off-by: Stanley Winata <stanley.winata@amd.com>
KyleHerndon pushed a commit to KyleHerndon/iree that referenced this issue May 7, 2025
This patch carries 4 reverts from
iree-org#20646:

llvm/llvm-project#133231. This PR breaks fp_to_subbytes and
emulation_subbyte_types on llvm-cpu tests. iree-test-deps. tracker issue
in iree-org#20645.

llvm/llvm-project#137122. StableHLO and Torch-mlir needs to update their
usage of GreedyRewriteConfig to use fluent API. i.e
enableRegionSimplification = VS setRegionSimplificationLevel

llvm/llvm-project#135970. StableHLO has issue with VHLO_IntegerAttr and
APInt not being updated. StableHLO needs to be updated with that PR's
change for us to be able to integrate.

llvm/llvm-project#121389. Torch-MLIR needs to be
updated with that PR's change for us to be able to integrate.

Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
KyleHerndon pushed a commit to KyleHerndon/iree that referenced this issue May 7, 2025
Carries the same 4 reverts as
iree-org#20657:

- llvm/llvm-project#133231. This PR breaks fp_to_subbytes and
emulation_subbyte_types on llvm-cpu tests. iree-test-deps. tracker issue
in iree-org#20645.
- llvm/llvm-project#137122. StableHLO and Torch-mlir needs to update
their usage of GreedyRewriteConfig to use fluent API. i.e
enableRegionSimplification = VS setRegionSimplificationLevel
- llvm/llvm-project#135970. StableHLO has issue with VHLO_IntegerAttr
and APInt not being updated. StableHLO needs to be updated with that
PR's change for us to be able to integrate.
- llvm/llvm-project#121389. Torch-MLIR needs to be
updated with that PR's change for us to be able to integrate.

Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>
@hanhanW hanhanW changed the title Upstream narrow type emulation is breaking iree test [Integrate] Upstream narrow type emulation is breaking iree test May 29, 2025
@hanhanW
Copy link
Contributor
hanhanW commented May 30, 2025

I think I know what's happening. Below is the IR that I move the emulation before scf->cf lowering, which looks easier. It also trims down IREE specific ops.

When we do tiling in IREE, the tile sizes are driven by native vector size on CPU. For i4 type emulation, they are always aligned store (i.e., no partial store) because typically the tile sizes are multiple of 2. I.e., it is always aligned in the for-loop.

func.func @main(%arg0: memref<8xf32>, %arg1: memref<8xi4>) {
  %c4 = arith.constant 4 : index
  %c8 = arith.constant 8 : index
  %c0 = arith.constant 0 : index
  scf.for %arg2 = %c0 to %c8 step %c4 {
    %0 = vector.load %arg0[%arg2] : memref<8xf32>, vector<4xf32>
    %1 = arith.fptoui %0 : vector<4xf32> to vector<4xi32>
    %2 = arith.trunci %1 : vector<4xi32> to vector<4xi4>
    vector.store %2, %arg1[%arg2] : memref<8xi4>, vector<4xi4>
  }
  return
}

The upstream fix is reasonable, and it enables the support for unaligned cases. Here, I think we miss a hint for the pattern. This can either be:

  1. Add a mode to upstream patterns that assumes that stores are always aligned. Then it is user's responsibility to tile it correct.
  2. We need an integer range analysis that identify if the flattened index is an aligned case or not. If so, we can always convert it to vector.bitcast and do the store. Otherwise, we'll need to support it separately.

I don't know how to achieve (2) at the moment, maybe ValueRange analysis can help. I only use it for querying upper_bound. Let me take a look if it can check if a value is a multiple of something or not.

In IREE, we have util.int.assume op that tells you if it can be divisible by udiv or not.

%0:2 = util.assume.int
%m_in<umin = 16, umax = 4080, udiv = 16>,
%k2_in<umin = 16, umax = 4080, udiv = 32>
: index, index

@hanhanW
Copy link
Contributor
hanhanW commented May 30, 2025

Okay, I think upstream may not support this. To fix it properly in IREE, I think we should move some patterns back to IREE and use TensorDynamicDimAnalysis to get the information. We can prioritize the IREE patterns in this case.

@benvanik
Copy link
Collaborator

yeah, especially for stuff like this I just assume (hah) upstream doesn't support what we need in conjunction with our better type support and assume ops.

really good detective work on this - very tricky interactions!

@hanhanW
Copy link
Contributor
hanhanW commented May 30, 2025

@lialan may be busy on other stuff. Assigning it to me for now, and I can find someone or me to work on it. (cc @MaheshRavishankar )

@hanhanW hanhanW assigned hanhanW and unassigned lialan May 30, 2025
@hanhanW hanhanW added codegen Shared code generation infrastructure and dialects enhancement ➕ New feature or request onboarding/codegen Tasks suitable for new team member onboarding labels May 30, 2025
@krzysz00
Copy link
Contributor

We could also get around to adding align = N to vector.load and vector.store and such, which is @efric (and then an absent align = or an align = 0 would mean we need the unaligned case)

@krzysz00
Copy link
Contributor

Another thing that can be done upstream is to put the slow and the fast cases in two arms of an if statement and let integer range analysis clear off the slow path.

This is what got done for the buffer OOB stuff

@hanhanW
Copy link
Contributor
hanhanW commented May 30, 2025

We could also get around to adding align = N to vector.load and vector.store and such, which is @efric (and then an absent align = or an align = 0 would mean we need the unaligned case)

I'd prefer not using this approach atm, because we can get all the needed information from IR. I can be convinced if there are other use cases.

Another thing that can be done upstream is to put the slow and the fast cases in two arms of an if statement and let integer range analysis clear off the slow path.

I'm not sure how integrate range analysis works with upstream patterns. It seems like the analysis should only be run once because of efficiency. Then we collect the ops and apply the transform. Also, the analysis is only available in IREE, so I'm not sure how they can be connected.

Anyway, I think we have a solid plan, and we can fix it if someone pick this up.

@benvanik
Copy link
Collaborator

Agreed RE upstream analysis interaction - I think we could provide interface implementations if upstream uses them that distills the info (I don't think we do). The particular benefit of the util.assume.int op is the correlation between independent SSA values ("if A is udiv=4 then B is udiv=8" or "if A is umin=8 then B is umin=16"). We can do a lot more with that (@qedawkins specialization is an example - specializing just for the sets of assumptions vs the full combinatorial explosion of them). We should probably plan on keeping anything important relying on analysis on the IREE side but exposing what we can (the union/intersection of all our assume pairs exposed as a single value range).

@krzysz00
Copy link
Contributor
krzysz00 commented May 30, 2025

I'm not sure how integrate range analysis works with upstream patterns. It seems like the analysis should only be run once because of efficiency. Then we collect the ops and apply the transform. Also, the analysis is only available in IREE, so I'm not sure how they can be connected.

Integer range analysis lives upstream - the only thing that doesn't is divisibility analysis, and that should be fairly easily upstreamable

And even if integer range analysis doesn't work, having the pattern gener9ate

if (linearrI4Index % 2 == 0) { fastPath } else { slowPath }

instead of just generating slowPath in cases where the divisibility isn't statically obvious will allow LLVM to take care of it

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug 🐞 Something isn't working codegen Shared code generation infrastructure and dialects enhancement ➕ New feature or request onboarding/codegen Tasks suitable for new team member onboarding
Projects
None yet
Development

No branches or pull requests

6 participants
0