[Integrate] Upstream narrow type emulation is breaking iree test #20645

raikonenfnu · 2025-04-26T03:39:44Z

What happened?

llvm/llvm-project#133231 seems to be breaking our subbyte emulation tests.

A simple repro is https://gist.github.com/raikonenfnu/1ea07b7e231d8997bfa1c29502df637d

original test https://github.com/iree-org/iree/blob/1a8d229431e62b50eb297e75ca4bf1dba3b67f65/tests/e2e/linalg/fp_to_subbyte.mlir

Based on the above test, IREE currently generates code that has a different trailing dim and a "virtually" (fake) non-constant offset into the store: in vector.store %0[%2], %2 is actually a constant, but because the store is inside a branch region, %c0 becomes %2, which is a region argument and is therefore treated as non-constant.

The condition above leads our program into the code path where foldedNumFrontPadElems cannot be determined (the !foldedNumFrontPadElems check), so the lowering fails, as seen in:
https://github.com/llvm/llvm-project/blob/2de936b6eb38e7a37224a97c2a22aa79b9dfb9dc/mlir/lib/Dialect/Vector/Transforms/VectorEmulateNarrowType.cpp#L619-L629

IIUC, that PR is somewhat correct: as long as the trailing dim does not match, we may need partial stores.
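To make the failing condition concrete, here is a minimal Python sketch of the idea behind folding the number of front padding elements. It is not the upstream C++ code: the function name `fold_num_front_pad_elems` and the use of `None` to model a non-constant index are assumptions made for illustration. With i4 elements emulated in i8 containers, a constant element index can be folded to a padding count, while a block argument cannot.

```python
from typing import Optional

ELEM_BITS = 4                                      # i4 source elements
CONTAINER_BITS = 8                                 # emulated in i8 containers
ELEMS_PER_CONTAINER = CONTAINER_BITS // ELEM_BITS  # 2 elements per byte


def fold_num_front_pad_elems(linear_index: Optional[int]) -> Optional[int]:
    """Return how many sub-byte elements precede the store inside its first byte.

    `None` models an index that is not a compile-time constant (for example a
    block argument); in that case nothing can be folded.
    """
    if linear_index is None:
        return None
    return linear_index % ELEMS_PER_CONTAINER


print(fold_num_front_pad_elems(0))     # 0: constant index, byte aligned
print(fold_num_front_pad_elems(3))     # 1: constant index, starts mid-byte
print(fold_num_front_pad_elems(None))  # None: index is not a constant
```

In this model the `None` result corresponds to the case described above, where the pattern gives up even though the index would be aligned at runtime.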

I don't think I have enough context on the best way to solve this. We can:

1. Fold away that region, since the conditions of the regions are constants, or somehow constant fold such that we have a constant offset in vector.store.
2. Add some code to the vector dialect's EmulateNarrowType vector::StoreOp conversion to handle non-constant cases through a series of bitcasts and memref.generic_atomic_rmw ops. (This seems much harder and requires some more thinking.)

Steps to reproduce your issue

wget https://gist.githubusercontent.com/raikonenfnu/1ea07b7e231d8997bfa1c29502df637d/raw/6e3311f6794dd5afeaeac0f82422cbe498d67272/emulate_failure.mlir
iree-opt --iree-codegen-emulate-narrow-type emulate_failure.mlir -o out.mlir

emulate.mlir:18:3: error: failed to legalize operation 'vector.store' that was explicitly marked illegal

What component(s) does this issue relate to?

No response

Version information

No response

Additional context

No response

Updated LIT test from landing llvm/llvm-project#136640 which folds linalg.index when size is unit dim (1).

This patch carries revert of:
- llvm/llvm-project#133231. This PR breaks fp_to_subbytes and emulation_subbyte_types on llvm-cpu tests. iree-test-deps. tracking issue in iree-org#20645.
- llvm/llvm-project#137122. StableHLO and Torch-mlir needs to update their usage of GreedyRewriteConfig to use fluent API. i.e enableRegionSimplification = VS setRegionSimplificationLevel
- llvm/llvm-project#135970. StableHLO has issue with VHLO_IntegerAttr and APInt not being updated. StableHLO needs to be updated with that PR's change for us to be able to integrate.
- llvm/llvm-project#121389. Torch-MLIR needs to be updated with that PR's change for us to be able to integrate.

Signed-off-by: Stanley Winata <stanley.winata@amd.com>

Updated LIT test from landing llvm/llvm-project#136640 which folds linalg.index when size is unit dim (1). Added chipSet argument into populateGpuToROCDLConversionPatterns based on changes in llvm/llvm-project#137360

This patch carries revert of:
- llvm/llvm-project#133231. This PR breaks fp_to_subbytes and emulation_subbyte_types on llvm-cpu tests. iree-test-deps. tracker issue in #20645.
- llvm/llvm-project#137122. StableHLO and Torch-mlir needs to update their usage of GreedyRewriteConfig to use fluent API. i.e enableRegionSimplification = VS setRegionSimplificationLevel
- llvm/llvm-project#135970. StableHLO has issue with VHLO_IntegerAttr and APInt not being updated. StableHLO needs to be updated with that PR's change for us to be able to integrate.
- llvm/llvm-project#121389. Torch-MLIR needs to be updated with that PR's change for us to be able to integrate.

Signed-off-by: Stanley Winata <stanley.winata@amd.com>

krzysz00 · 2025-04-26T05:09:14Z

So, I think what should be done there is that the slow path should be guarded by a runtime condition as in llvm/llvm-project#135014

But, in addition, the narrow type emulation should be moved before scf-to-cf - heck, before OptimizeIntArithmetic - so that we can take advantage of divisibility analysis

lialan · 2025-04-28T14:58:59Z

> So, I think what should be done there is that the slow path should be guarded by a runtime condition as in llvm/llvm-project#135014
>
> But, in addition, the narrow type emulation should be moved before scf-to-cf - heck, before OptimizeIntArithmetic - so that we can take advantage of divisibility analysis

That is the way to go. Currently the transformation is also unable to handle some other cases, all of which need a runtime check.

krzysz00 · 2025-04-28T23:06:33Z

To clarify, which "other cases"?

lialan · 2025-04-29T15:39:57Z

> To clarify, which "other cases"?

All the cases with a non-constant storing index.

lialan · 2025-04-29T15:46:55Z

@raikonenfnu Was the emulation in the case of non-constant/dynamic indexing already working before? I think it was not working as intended even before this problematic patch.

krzysz00 · 2025-04-29T16:10:43Z

I do think the patch was fixing a real correctness issue, but didn't do it in a performant way

This patch carries 4 reverts from #20646:
- llvm/llvm-project#133231. This PR breaks fp_to_subbytes and emulation_subbyte_types on llvm-cpu tests. iree-test-deps. tracker issue in #20645.
- llvm/llvm-project#137122. StableHLO and Torch-mlir needs to update their usage of GreedyRewriteConfig to use fluent API. i.e enableRegionSimplification = VS setRegionSimplificationLevel
- llvm/llvm-project#135970. StableHLO has issue with VHLO_IntegerAttr and APInt not being updated. StableHLO needs to be updated with that PR's change for us to be able to integrate.
- llvm/llvm-project#121389. Torch-MLIR needs to be updated with that PR's change for us to be able to integrate.

Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>

banach-space · 2025-04-30T15:51:34Z

Sorry that my patch is causing this issue - I'd like to see if I can help resolve it.

To start, I’m trying to better understand the underlying problem. When I lower the reproducer (with my patch reverted), I get the following IR:

  func.func @_f32_to_i4_1d_dispatch_0_elementwise_8_f32xi4() {
    %cst = arith.constant dense<4> : vector<2xi32>
    %cst_0 = arith.constant dense<15> : vector<2xi32>
    %c4 = arith.constant 4 : index
    %c8 = arith.constant 8 : index
    %c0 = arith.constant 0 : index
    %0 = hal.interface.binding.subspan layout(#pipeline_layout) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : memref<8xf32>
    memref.assume_alignment %0, 64 : memref<8xf32>
    %1 = hal.interface.binding.subspan layout(#pipeline_layout) binding(1) alignment(64) offset(%c0) flags(Indirect) : memref<4xi8>
    memref.assume_alignment %1, 64 : memref<4xi8>
    cf.br ^bb1(%c0 : index)
  ^bb1(%2: index):  // 2 preds: ^bb0, ^bb2
    %3 = arith.cmpi slt, %2, %c8 : index
    cf.cond_br %3, ^bb2, ^bb3
  ^bb2:  // pred: ^bb1
    %4 = vector.load %0[%2] : memref<8xf32>, vector<4xf32>
    %5 = arith.fptoui %4 : vector<4xf32> to vector<4xi32>
    %6 = affine.apply #map()[%2]
    %7 = vector.shuffle %5, %5 [0, 2] : vector<4xi32>, vector<4xi32>
    %8 = arith.andi %7, %cst_0 : vector<2xi32>
    %9 = vector.shuffle %5, %5 [1, 3] : vector<4xi32>, vector<4xi32>
    %10 = arith.andi %9, %cst_0 : vector<2xi32>
    %11 = arith.shli %10, %cst : vector<2xi32>
    %12 = arith.ori %8, %11 : vector<2xi32>
    %13 = arith.trunci %12 : vector<2xi32> to vector<2xi8>
    vector.store %13, %1[%6] : memref<4xi8>, vector<2xi8>
    %14 = arith.addi %2, %c4 : index
    cf.br ^bb1(%14 : index)
  ^bb3:  // pred: ^bb1
    return
  }

What stands out to me is that %2 (the branch+load/store index) doesn’t appear to be used to mask the input of the vector.store. I’m not yet sure what this constant (that's used for masking) represents ...

    %cst_0 = arith.constant dense<15> : vector<2xi32>

... but it doesn't look correct. So, it feels like the failing test is exercising an incorrect lowering path in upstream MLIR? And my patch effectively disables that.

Does this agree with your understanding?
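For readers following the IR: one possible reading of the arith ops in ^bb2 is the usual packing of two 4-bit values into one byte, in which case dense<15> would be the 0xF mask that keeps the low nibble of each i32 lane before truncation, rather than a mask derived from the store index. The Python sketch below mirrors that reading op by op; the helper name `pack_i4_pairs` is hypothetical and this is an illustration, not the upstream lowering.

```python
def pack_i4_pairs(values):
    """Pack an even-length list of integers into bytes, two 4-bit values per byte."""
    if len(values) % 2 != 0:
        raise ValueError("expected an even number of values")
    # vector.shuffle [0, 2] followed by arith.andi with dense<15>
    low = [v & 15 for v in values[0::2]]
    # vector.shuffle [1, 3], arith.andi with dense<15>, arith.shli by dense<4>
    high = [(v & 15) << 4 for v in values[1::2]]
    # arith.ori, then arith.trunci to i8
    return bytes(lo | hi for lo, hi in zip(low, high))


packed = pack_i4_pairs([1, 2, 3, 4])
print(packed.hex())  # 2143
```

Under this reading, even-indexed lanes land in the low nibble and odd-indexed lanes in the high nibble of each output byte.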

Carries the same 4 reverts as #20657:
- llvm/llvm-project#133231. This PR breaks fp_to_subbytes and emulation_subbyte_types on llvm-cpu tests. iree-test-deps. tracker issue in #20645.
- llvm/llvm-project#137122. StableHLO and Torch-mlir needs to update their usage of GreedyRewriteConfig to use fluent API. i.e enableRegionSimplification = VS setRegionSimplificationLevel
- llvm/llvm-project#135970. StableHLO has issue with VHLO_IntegerAttr and APInt not being updated. StableHLO needs to be updated with that PR's change for us to be able to integrate.
- llvm/llvm-project#121389. Torch-MLIR needs to be updated with that PR's change for us to be able to integrate.

Signed-off-by: Ian Wood <ianwood2024@u.northwestern.edu>

hanhanW · 2025-05-30T04:18:44Z

I think I know what's happening. Below is the IR I get when I move the emulation before the scf->cf lowering, which is easier to read. It also trims away the IREE-specific ops.

When we do tiling in IREE, the tile sizes are driven by the native vector size on CPU. For i4 type emulation, the stores are always aligned (i.e., no partial stores) because the tile sizes are typically multiples of 2. I.e., the store is always aligned inside the for-loop.

func.func @main(%arg0: memref<8xf32>, %arg1: memref<8xi4>) {
  %c4 = arith.constant 4 : index
  %c8 = arith.constant 8 : index
  %c0 = arith.constant 0 : index
  scf.for %arg2 = %c0 to %c8 step %c4 {
    %0 = vector.load %arg0[%arg2] : memref<8xf32>, vector<4xf32>
    %1 = arith.fptoui %0 : vector<4xf32> to vector<4xi32>
    %2 = arith.trunci %1 : vector<4xi32> to vector<4xi4>
    vector.store %2, %arg1[%arg2] : memref<8xi4>, vector<4xi4>
  }
  return
}
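The alignment claim can be checked by hand for this loop. The Python sketch below enumerates the bit range touched by each vector.store and tests whether it starts and ends on a byte boundary. The helper names are made up for illustration, and the model assumes a densely packed i4 buffer.

```python
ELEM_BITS = 4   # i4 elements
BYTE_BITS = 8


def store_bit_ranges(lower, upper, step, vector_len):
    """Yield (start_bit, end_bit) for each store issued by the loop."""
    for i in range(lower, upper, step):
        yield i * ELEM_BITS, (i + vector_len) * ELEM_BITS


def all_stores_byte_aligned(lower, upper, step, vector_len):
    """True if every store covers whole bytes only, i.e. no partial store."""
    return all(
        start % BYTE_BITS == 0 and end % BYTE_BITS == 0
        for start, end in store_bit_ranges(lower, upper, step, vector_len)
    )


# The loop above: %c0 to %c8 step %c4, storing vector<4xi4>.
print(all_stores_byte_aligned(0, 8, 4, 4))  # True
# A hypothetical odd tile size would need partial stores.
print(all_stores_byte_aligned(0, 8, 3, 3))  # False
```

The first call covers bits [0, 16) and [16, 32), so both stores are whole-byte stores even though the index is a loop induction variable rather than a constant.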

The upstream fix is reasonable, and it enables support for unaligned cases. Here, I think the pattern is missing a hint. We can either:

1. Add a mode to the upstream patterns that assumes stores are always aligned. It is then the user's responsibility to tile correctly.
2. Use an integer range analysis that identifies whether the flattened index is an aligned case. If it is, we can always convert it to vector.bitcast and do the store. Otherwise, we'll need to support it separately.

I don't know how to achieve (2) at the moment; maybe ValueRange analysis can help. I have only used it for querying upper_bound. Let me check whether it can tell if a value is a multiple of something.

In IREE, we have the util.assume.int op, whose udiv field tells you what a value is divisible by.

iree/compiler/src/iree/compiler/Codegen/Common/test/block_dynamic_dims.mlir

Lines 14 to 17 in 5140464

    
  %0:2 = util.assume.int
      %m_in<umin = 16, umax = 4080, udiv = 16>,
      %k2_in<umin = 16, umax = 4080, udiv = 32>
    : index, index
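As a rough illustration of how such an assumption could answer the alignment question, the Python sketch below models one assumed value and checks whether it is provably a multiple of some factor. The `IntAssumption` class and `provably_multiple_of` helper are hypothetical and do not correspond to any IREE API; the only property relied on is that a value divisible by udiv is also divisible by every divisor of udiv.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IntAssumption:
    """Toy model of one operand of util.assume.int (hypothetical)."""
    umin: int
    umax: int
    udiv: int


def provably_multiple_of(assumption: IntAssumption, factor: int) -> bool:
    """True if every value divisible by `udiv` is also divisible by `factor`."""
    return assumption.udiv % factor == 0


m_in = IntAssumption(umin=16, umax=4080, udiv=16)
unknown = IntAssumption(umin=0, umax=4080, udiv=1)

# Two i4 elements fit in one byte, so an aligned index is a multiple of 2.
print(provably_multiple_of(m_in, 2))     # True
print(provably_multiple_of(unknown, 2))  # False
```

A check like this only proves alignment when the assumption is present; a False result means "unknown", not "unaligned".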

hanhanW · 2025-05-30T04:27:17Z

Okay, I think upstream may not support this. To fix it properly in IREE, I think we should move some patterns back to IREE and use TensorDynamicDimAnalysis to get the information. We can prioritize the IREE patterns in this case.

benvanik · 2025-05-30T04:30:16Z

yeah, especially for stuff like this I just assume (hah) upstream doesn't support what we need in conjunction with our better type support and assume ops.

really good detective work on this - very tricky interactions!

hanhanW · 2025-05-30T04:31:24Z

@lialan may be busy with other stuff. Assigning it to myself for now; either I will work on it or I will find someone who can. (cc @MaheshRavishankar )

krzysz00 · 2025-05-30T06:13:15Z

We could also get around to adding align = N to vector.load and vector.store and such, which is @efric (and then an absent align = or an align = 0 would mean we need the unaligned case)

krzysz00 · 2025-05-30T06:18:37Z

Another thing that can be done upstream is to put the slow and the fast cases in two arms of an if statement and let integer range analysis clear off the slow path.

This is what got done for the buffer OOB stuff

hanhanW · 2025-05-30T06:34:18Z

> We could also get around to adding align = N to vector.load and vector.store and such, which is @efric (and then an absent align = or an align = 0 would mean we need the unaligned case)

I'd prefer not using this approach atm, because we can get all the needed information from IR. I can be convinced if there are other use cases.

> Another thing that can be done upstream is to put the slow and the fast cases in two arms of an if statement and let integer range analysis clear off the slow path.

I'm not sure how integer range analysis works with upstream patterns. It seems like the analysis should only be run once, for efficiency. Then we collect the ops and apply the transform. Also, the analysis is only available in IREE, so I'm not sure how they can be connected.

Anyway, I think we have a solid plan, and we can fix it once someone picks this up.

benvanik · 2025-05-30T06:44:42Z

Agreed RE upstream analysis interaction - I think we could provide interface implementations that distill the info, if upstream uses them (I don't think we do today). The particular benefit of the util.assume.int op is the correlation between independent SSA values ("if A is udiv=4 then B is udiv=8" or "if A is umin=8 then B is umin=16"). We can do a lot more with that (@qedawkins's specialization is an example - specializing just for the sets of assumptions vs the full combinatorial explosion of them). We should probably plan on keeping anything important that relies on analysis on the IREE side, but exposing what we can (the union/intersection of all our assume pairs exposed as a single value range).

krzysz00 · 2025-05-30T06:53:46Z

> I'm not sure how integer range analysis works with upstream patterns. It seems like the analysis should only be run once, for efficiency. Then we collect the ops and apply the transform. Also, the analysis is only available in IREE, so I'm not sure how they can be connected.

Integer range analysis lives upstream - the only thing that doesn't is divisibility analysis, and that should be fairly easily upstreamable

And even if integer range analysis doesn't work, having the pattern generate

if (linearI4Index % 2 == 0) { fastPath } else { slowPath }

instead of just generating slowPath in cases where the divisibility isn't statically obvious will allow LLVM to take care of it
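A minimal Python sketch of that two-armed structure, modelling the i4 buffer as a bytearray with the even element in the low nibble (the same layout as the packed IR earlier in the thread). The function name `store_i4` is made up for illustration, the extra check on the number of values is an assumption added so the fast path only writes whole bytes, and the slow path here is a plain read-modify-write rather than the atomic one a real lowering might need.

```python
def store_i4(buf: bytearray, index: int, values: list) -> None:
    """Store 4-bit values into `buf`, starting at i4 element position `index`."""
    if index % 2 == 0 and len(values) % 2 == 0:
        # Fast path: the store covers whole bytes, so pack pairs and overwrite.
        for k in range(0, len(values), 2):
            packed = (values[k] & 0xF) | ((values[k + 1] & 0xF) << 4)
            buf[(index + k) // 2] = packed
    else:
        # Slow path: read-modify-write one nibble at a time so that the
        # neighbouring nibble in a partially covered byte is preserved.
        for k, value in enumerate(values):
            pos = index + k
            byte, shift = pos // 2, 4 * (pos % 2)
            keep = buf[byte] & ~(0xF << shift) & 0xFF
            buf[byte] = keep | ((value & 0xF) << shift)


aligned = bytearray(4)
store_i4(aligned, 4, [1, 2, 3, 4])
unaligned = bytearray(4)
store_i4(unaligned, 1, [1, 2, 3, 4])
print(aligned.hex(), unaligned.hex())  # 00002143 10320400
```

When the index is provably even, a compiler can drop the else arm entirely, which is the point of emitting both arms instead of only the slow path.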

raikonenfnu added the bug 🐞 Something isn't working label Apr 26, 2025

raikonenfnu mentioned this issue Apr 26, 2025

Integrate llvm/llvm-project@1c8e5e2 #20646

Merged

lialan self-assigned this Apr 26, 2025

IanWood1 mentioned this issue Apr 28, 2025

Integrate llvm/llvm-project@5953f19 #20657

Merged

IanWood1 mentioned this issue Apr 29, 2025

Integrate llvm/llvm-project@7b70fc7 #20674

Merged

hanhanW changed the title ~~Upstream narrow type emulation is breaking iree test~~ [Integrate] Upstream narrow type emulation is breaking iree test May 29, 2025

hanhanW assigned hanhanW and unassigned lialan May 30, 2025

hanhanW added codegen Shared code generation infrastructure and dialects enhancement ➕ New feature or request onboarding/codegen Tasks suitable for new team member onboarding labels May 30, 2025

hanhanW mentioned this issue May 30, 2025

Integrate llvm-project@587d6fcbb685e3a57 #20948

Open
