[GPU] compilation failure for alternative bwd grouped conv #20498
The compile fails. Here's the failing dispatch IR:

```mlir
func.func @conv_2d_float32_input_backward_128x24x48x384_nhwc_384x1x3x128_fhwc_nhwf_1x1s_0x1p_1x1d_3g$async_dispatch_0_conv_128x24x48x3x128x128x3_f32() {
  %c0 = arith.constant 0 : index
  %cst = arith.constant 0.000000e+00 : f32
  %c2 = arith.constant 2 : index
  %0 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<128x24x48x384xf32>>
  %1 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(1) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<384x1x3x128xf32>>
  %2 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(2) alignment(64) offset(%c0) flags(Indirect) : !flow.dispatch.tensor<writeonly:tensor<128x24x48x384xf32>>
  %3 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0, 0], sizes = [128, 24, 48, 384], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<128x24x48x384xf32>> -> tensor<128x24x48x384xf32>
  %4 = flow.dispatch.tensor.load %1, offsets = [0, 0, 0, 0], sizes = [384, 1, 3, 128], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<384x1x3x128xf32>> -> tensor<384x1x3x128xf32>
  %expanded = tensor.expand_shape %3 [[0], [1], [2], [3, 4]] output_shape [128, 24, 48, 3, 128] : tensor<128x24x48x384xf32> into tensor<128x24x48x3x128xf32>
  %padded = tensor.pad %expanded low[0, 0, 1, 0, 0] high[0, 0, 1, 0, 0] {
  ^bb0(%arg0: index, %arg1: index, %arg2: index, %arg3: index, %arg4: index):
    tensor.yield %cst : f32
  } : tensor<128x24x48x3x128xf32> to tensor<128x24x50x3x128xf32>
  %expanded_0 = tensor.expand_shape %4 [[0, 1], [2], [3], [4]] output_shape [3, 128, 1, 3, 128] : tensor<384x1x3x128xf32> into tensor<3x128x1x3x128xf32>
  %5 = tensor.empty() : tensor<3x128x1x3x128xf32>
  %6 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4) -> (d0, d1, d2, d3, d4)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel"]} outs(%5 : tensor<3x128x1x3x128xf32>) {
  ^bb0(%out: f32):
    %10 = linalg.index 0 : index
    %11 = linalg.index 2 : index
    %12 = linalg.index 1 : index
    %13 = arith.addi %11, %12 : index
    %14 = linalg.index 3 : index
    %15 = linalg.index 4 : index
    %16 = arith.subi %c2, %14 : index
    %extracted = tensor.extract %expanded_0[%10, %13, %c0, %16, %15] : tensor<3x128x1x3x128xf32>
    linalg.yield %extracted : f32
  } -> tensor<3x128x1x3x128xf32>
  %collapsed = tensor.collapse_shape %6 [[0], [1, 2], [3], [4]] : tensor<3x128x1x3x128xf32> into tensor<3x128x3x128xf32>
  %7 = tensor.empty() : tensor<128x24x48x3x128xf32>
  %8 = linalg.fill ins(%cst : f32) outs(%7 : tensor<128x24x48x3x128xf32>) -> tensor<128x24x48x3x128xf32>
  // The dump was cut off at this point in the report; the remainder below is
  // reconstructed from the working IR (see the gist links for the exact
  // failing source).
  %9 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d1, d2 + d6, d3, d5)>, affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d3, d5, d6, d4)>, affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d1, d2, d3, d4)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "reduction", "reduction"]} ins(%padded, %collapsed : tensor<128x24x50x3x128xf32>, tensor<3x128x3x128xf32>) outs(%8 : tensor<128x24x48x3x128xf32>) {
  ^bb0(%in: f32, %in_1: f32, %out: f32):
    %10 = arith.mulf %in, %in_1 : f32
    %11 = arith.addf %out, %10 : f32
    linalg.yield %11 : f32
  } -> tensor<128x24x48x3x128xf32>
  %collapsed_1 = tensor.collapse_shape %9 [[0], [1], [2], [3, 4]] : tensor<128x24x48x3x128xf32> into tensor<128x24x48x384xf32>
  flow.dispatch.tensor.store %collapsed_1, %2, offsets = [0, 0, 0, 0], sizes = [128, 24, 48, 384], strides = [1, 1, 1, 1] : tensor<128x24x48x384xf32> -> !flow.dispatch.tensor<writeonly:tensor<128x24x48x384xf32>>
  return
}
```

For reference, here's what the working IR looks like: https://gist.github.com/rkayaith/d322dd974cdc41fe3a3ea2d62981ce1c#file-2-working-exec-source-mlir-L11

The compilation IR dump of the failing IR is here: https://gist.github.com/rkayaith/d322dd974cdc41fe3a3ea2d62981ce1c#file-3-failing-ir-after-all-mlir
If I hack in a pattern that folds the filter-flip's `tensor.collapse_shape` into its producing `linalg.generic` (so the flip directly yields the collapsed 4-D shape), the dispatch compiles:

```mlir
func.func @conv_2d_float32_input_backward_128x24x48x384_nhwc_384x1x3x128_fhwc_nhwf_1x1s_0x1p_1x1d_3g$async_dispatch_0_conv_128x24x48x3x128x128x3_f32() {
  %c0 = arith.constant 0 : index
  %cst = arith.constant 0.000000e+00 : f32
  %c2 = arith.constant 2 : index
  %0 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(0) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<128x24x48x384xf32>>
  %1 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(1) alignment(64) offset(%c0) flags("ReadOnly|Indirect") : !flow.dispatch.tensor<readonly:tensor<384x1x3x128xf32>>
  %2 = hal.interface.binding.subspan layout(<bindings = [#hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, "ReadOnly|Indirect">, #hal.pipeline.binding<storage_buffer, Indirect>], flags = Indirect>) binding(2) alignment(64) offset(%c0) flags(Indirect) : !flow.dispatch.tensor<writeonly:tensor<128x24x48x384xf32>>
  %3 = flow.dispatch.tensor.load %0, offsets = [0, 0, 0, 0], sizes = [128, 24, 48, 384], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<128x24x48x384xf32>> -> tensor<128x24x48x384xf32>
  %4 = flow.dispatch.tensor.load %1, offsets = [0, 0, 0, 0], sizes = [384, 1, 3, 128], strides = [1, 1, 1, 1] : !flow.dispatch.tensor<readonly:tensor<384x1x3x128xf32>> -> tensor<384x1x3x128xf32>
  %expanded = tensor.expand_shape %3 [[0], [1], [2], [3, 4]] output_shape [128, 24, 48, 3, 128] : tensor<128x24x48x384xf32> into tensor<128x24x48x3x128xf32>
  %padded = tensor.pad %expanded low[0, 0, 1, 0, 0] high[0, 0, 1, 0, 0] {
  ^bb0(%arg0: index, %arg1: index, %arg2: index, %arg3: index, %arg4: index):
    tensor.yield %cst : f32
  } : tensor<128x24x48x3x128xf32> to tensor<128x24x50x3x128xf32>
  %expanded_0 = tensor.expand_shape %4 [[0, 1], [2], [3], [4]] output_shape [3, 128, 1, 3, 128] : tensor<384x1x3x128xf32> into tensor<3x128x1x3x128xf32>
  %5 = tensor.empty() : tensor<3x128x3x128xf32>
  %6 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3) -> (d0, d1, d2, d3)>], iterator_types = ["parallel", "parallel", "parallel", "parallel"]} outs(%5 : tensor<3x128x3x128xf32>) {
  ^bb0(%out: f32):
    %10 = linalg.index 0 : index
    %11 = linalg.index 1 : index
    %12 = linalg.index 2 : index
    %13 = linalg.index 3 : index
    %14 = arith.subi %c2, %12 : index
    %extracted = tensor.extract %expanded_0[%10, %11, %c0, %14, %13] : tensor<3x128x1x3x128xf32>
    linalg.yield %extracted : f32
  } -> tensor<3x128x3x128xf32>
  %7 = tensor.empty() : tensor<128x24x48x3x128xf32>
  %8 = linalg.fill ins(%cst : f32) outs(%7 : tensor<128x24x48x3x128xf32>) -> tensor<128x24x48x3x128xf32>
  %9 = linalg.generic {indexing_maps = [affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d1, d2 + d6, d3, d5)>, affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d3, d5, d6, d4)>, affine_map<(d0, d1, d2, d3, d4, d5, d6) -> (d0, d1, d2, d3, d4)>], iterator_types = ["parallel", "parallel", "parallel", "parallel", "parallel", "reduction", "reduction"]} ins(%padded, %6 : tensor<128x24x50x3x128xf32>, tensor<3x128x3x128xf32>) outs(%8 : tensor<128x24x48x3x128xf32>) {
  ^bb0(%in: f32, %in_1: f32, %out: f32):
    %10 = arith.mulf %in, %in_1 : f32
    %11 = arith.addf %out, %10 : f32
    linalg.yield %11 : f32
  } -> tensor<128x24x48x3x128xf32>
  %collapsed = tensor.collapse_shape %9 [[0], [1], [2], [3, 4]] : tensor<128x24x48x3x128xf32> into tensor<128x24x48x384xf32>
  flow.dispatch.tensor.store %collapsed, %2, offsets = [0, 0, 0, 0], sizes = [128, 24, 48, 384], strides = [1, 1, 1, 1] : tensor<128x24x48x384xf32> -> !flow.dispatch.tensor<writeonly:tensor<128x24x48x384xf32>>
  return
}
```

But the API for adding this pattern includes some other patterns which we may not want: https://github.com/llvm/llvm-project/blob/747d4a952bf7ed4adec72ddf3c9038aeff4fe8ee/mlir/lib/Dialect/Linalg/Transforms/ElementwiseOpFusion.cpp#L2259-L2264

@nirvedhmeshram any thoughts on how to proceed here?
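One possibility is a sketch like the one below. It assumes the rewrite we want is the collapse_shape-into-generic folding exposed by `linalg::populateFoldReshapeOpsByCollapsingPatterns`, and that its control function is enough to mask off the cases we don't want; the predicate and function name here are illustrative, not a proposed final design (and in newer MLIR the driver entry point is `applyPatternsGreedily`):

```cpp
// Hedged sketch: register only the reshape-by-collapsing folding, gated by a
// control function, rather than pulling in the full fusion pattern set.
#include "mlir/Dialect/Linalg/IR/Linalg.h"
#include "mlir/Dialect/Linalg/Transforms/Transforms.h"
#include "mlir/Transforms/GreedyPatternRewriteDriver.h"

using namespace mlir;

static LogicalResult foldFilterFlipReshape(Operation *root) {
  RewritePatternSet patterns(root->getContext());
  // Only fold a reshape whose producer is a linalg.generic (like the
  // filter-flip generic above); leave every other reshape alone.
  linalg::ControlFusionFn controlFn = [](OpOperand *fusedOperand) {
    Operation *producer = fusedOperand->get().getDefiningOp();
    return producer && isa<linalg::GenericOp>(producer);
  };
  linalg::populateFoldReshapeOpsByCollapsingPatterns(patterns, controlFn);
  return applyPatternsAndFoldGreedily(root, std::move(patterns));
}
```

If the control function can't isolate the folding cleanly, the alternative would be instantiating just the needed pattern class directly, provided it's exposed in a header.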
There is a missing pattern in the pipeline to do this folding.
Not yet, but with group convs going down the IGEMM path, this will get resolved.
No, we put it on hold until we enable grouped conv with IGEMM, as that might solve the problem, or at the very least change the issues.
This adds support for converting group convs to im2col, allowing them to go down the IGEMM path. Group dimensions are parallel iterator dims that index into the image, filter, and output. For im2col they are treated as a batch dimension. This also fixes iree-org#20498.
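To illustrate what "group dims treated as a batch dimension" means, here is a plain scalar sketch of im2col for a grouped NHWC conv; it is conceptual only, not the compiler's implementation, and the layout/names are assumptions:

```cpp
#include <cstddef>
#include <vector>

// input:  [N][H][W][G][Cg]  ->  columns: [G][N*OH*OW][KH*KW*Cg]
// The group dim G is hoisted out front exactly like a batch dim, so each
// group's column matrix feeds an independent (batched) GEMM with its filter.
std::vector<float> im2colGrouped(const std::vector<float> &input, int n, int h,
                                 int w, int g, int cg, int kh, int kw) {
  int oh = h - kh + 1, ow = w - kw + 1; // stride 1, no padding, no dilation
  std::vector<float> cols(size_t(g) * n * oh * ow * kh * kw * cg, 0.0f);
  auto in = [&](int ni, int hi, int wi, int gi, int ci) {
    return input[(((size_t(ni) * h + hi) * w + wi) * g + gi) * cg + ci];
  };
  for (int gi = 0; gi < g; ++gi)
    for (int ni = 0; ni < n; ++ni)
      for (int oy = 0; oy < oh; ++oy)
        for (int ox = 0; ox < ow; ++ox)
          for (int ky = 0; ky < kh; ++ky)
            for (int kx = 0; kx < kw; ++kx)
              for (int ci = 0; ci < cg; ++ci) {
                size_t m = (size_t(ni) * oh + oy) * ow + ox; // GEMM M index
                size_t k = (size_t(ky) * kw + kx) * cg + ci; // GEMM K index
                cols[(size_t(gi) * n * oh * ow + m) * (kh * kw * cg) + k] =
                    in(ni, oy + ky, ox + kx, gi, ci);
              }
  return cols;
}
```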
What happened?
Two similar sets of IR for backward grouped convolution are provided (see the dispatch dumps and gist links in the comments above). The first performs the grouped-dim expansion and collapse immediately around the conv, after the filter spatial-dim flips and the `dLdy` padding; it compiles. The second performs the expand/collapse at the function boundaries and applies the filter flipping and padding after the expansion; it fails to compile.
Steps to reproduce your issue
Try to compile both examples with `iree-compile`.
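The reporter's exact command wasn't captured; a plausible invocation (the GPU target, flags, and file name below are assumptions for illustration) would be:

```shell
# Hypothetical repro: compile the dispatch-level executable source directly.
iree-compile --compile-mode=hal-executable \
  --iree-hal-target-backends=rocm --iree-hip-target=gfx942 \
  failing-exec-source.mlir -o /dev/null
```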
What component(s) does this issue relate to?
Compiler
Version information
pip install:
iree-base-compiler 3.4.0rc20250408
Additional context
No response