Loop support: prefer `linalg` over generic `affine.for` · Issue #1569 · google/heir · GitHub

Loop support: prefer linalg over generic affine.for #1569

Open
ZenithalHourlyRate opened this issue Mar 16, 2025 · 4 comments

Comments

@ZenithalHourlyRate
Collaborator

HEIR has supported affine.for, but from the perspective of dataflow analysis, a generic affine.for with an arbitrary body is hard to understand (understanding loops has always been one of the hard parts of a compiler). I propose using linalg to describe what we have now.

I further argue (not 100% sure) that any affine.for that could be understood/accepted by an FHE-specific analysis can always be rewritten in linalg with primitives like elementwise/reduce, since the task of the analysis is precisely to figure out which part of the affine.for body is elementwise and how the loop-carried variable forms a reduction. This is somewhat similar to the map-reduce programming model.

affine.for usage

Semantically, there are two places where we use affine.for:

  1. User input (data semantic), like roberts_cross/box_blur, where the algorithm is naturally described in for loops.
  2. Tensors of ciphertexts (ciphertext semantic), like the Halevi-Shoup matmul.

Data semantic

For the data semantic, we already need to handle these loops in some way:

  1. As we already use full-loop-unroll and heir-vectorizer to handle these loops, the further lowering into the FHE world should not see them.
  2. Alternatively, the layout packing pass is responsible for transforming the data-semantic loop into a ciphertext-semantic loop, and that pass could also rewrite it in linalg.

So the problem reduces to loop support in ciphertext semantic.

Common ciphertext-semantic patterns in current HEIR codebase

There are quite a lot of patterns like

affine.for tensor<NxMx!ct>
  tensor.extract one element
  computation
  tensor.insert into loop-carried variable

or

affine.for memref
  memref.load one element
  computation
  memref.store into memory

which is actually an elementwise op on the input plus a reduction on the loop-carried variable. Instead of letting the lowering emit such a fused loop and making the analysis pull it apart, we can split it apart early, like

%res_tensor = linalg.elementwise_binary { body }
%res_scalar = linalg.reduce %res_tensor
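Concretely, the split might look like the following sketch, using the shorthand forms of upstream linalg.map/linalg.reduce (shapes and the tensor.empty init are illustrative, not taken from the HEIR codebase):

```mlir
// Elementwise multiply of two tensors, then a reduction over the result,
// written as two separate linalg ops instead of one fused affine.for.
%init = tensor.empty() : tensor<1024xf32>
%prod = linalg.map { arith.mulf }
          ins(%lhs, %rhs : tensor<1024xf32>, tensor<1024xf32>)
          outs(%init : tensor<1024xf32>)
%sum = linalg.reduce { arith.addf }
         ins(%prod : tensor<1024xf32>)
         outs(%acc : tensor<f32>)
         dimensions = [0]
```

The elementwise part and the reduction are now separate ops with explicit semantics, so an analysis never has to reverse-engineer them out of a loop body.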

Case study for the matmul kernel

The Halevi-Shoup lowering should lower from a data-semantic linalg.matmul to a ciphertext-semantic combination of linalg.some_op.

The current lowering in main looks like

      // %2 is the reduction variable (result)
      // %input0 is the input vector x
      // %arg3 is the progressive rotation of x
      %3:2 = affine.for %arg1 = 1 to 1024 iter_args(%arg2 = %2, %arg3 = %input0) -> (tensor<1x1024xf32>, tensor<1x1024xf32>) {
        %4 = tensor_ext.rotate %arg3, %c1 : tensor<1x1024xf32>, index
        %extracted_slice = tensor.extract_slice %cst_0[%arg1, 0] [1, 1024] [1, 1] : tensor<1024x1024xf32> to tensor<1x1024xf32>
        %5 = arith.mulf %4, %extracted_slice : tensor<1x1024xf32>
        %6 = arith.addf %arg2, %5 : tensor<1x1024xf32>
        affine.yield %6, %4 : tensor<1x1024xf32>, tensor<1x1024xf32>
      }

The elementwise part of it is clearly the mulf (with tensor splatting), and the reduction is the addf, but the rotate part is quite hard to understand. In each analysis we have to write an individual rule to recognize that it is a progressive rotation, and NoiseAnalysis is quite unhappy with it; see #1517 (comment)

Instead we could have

%rotated_tensor_of_input = linalg.elementwise { rotate } on %input_splatted
%mul_result = linalg.elementwise { mulf } on %weight and %rotated_tensor_of_input
%result = linalg.reduce %mul_result

And %rotated_tensor_of_input can be produced in many ways, like an 8-4-2-1 or BSGS (32-16-8-4/3,2,1) scheme, instead of rotating by each index / rotating progressively.
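For instance, since rotations compose, a rotation by 7 can be assembled from power-of-two rotations alone; a sketch reusing the tensor_ext.rotate spelling from the snippet above (the %c* constants are assumed defined):

```mlir
// rotate-by-7 = rotate-by-4, then rotate-by-2, then rotate-by-1,
// so only log2(n) rotation keys (for 1, 2, 4, ...) are needed,
// at the cost of chaining the rotations (and their noise) sequentially.
%r4 = tensor_ext.rotate %v, %c4 : tensor<1x1024xf32>, index
%r6 = tensor_ext.rotate %r4, %c2 : tensor<1x1024xf32>, index
%r7 = tensor_ext.rotate %r6, %c1 : tensor<1x1024xf32>, index
```

With the rotations materialized as a separate linalg op, a later pass is free to pick whichever decomposition best fits the key-material/noise budget.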

Analysis requirement

The following types of analysis in the HEIR FHE world must properly handle loops to produce a correct lowering:

  1. Secretness Analysis (for whether it is secret or not)
  2. Level analysis (for RNS level)
  3. Dimension Analysis (for ciphertext size)
  4. Scale Analysis (for CKKS/BGV scale management)
  5. Noise Analysis (for correctly determining noise bound and parameter selection)

With linalg clearly stating the batch behavior, instead of a generic affine.for, the lives of these analyses could be much easier. Otherwise each of them needs to separate the loop into its different semantic parts and understand it (quick reflection: how could an analysis understand the rotate %c1 in the Halevi-Shoup kernel above?).

Another point in favor of linalg: since if is hard to analyse, we already have convert-if-to-select, which turns it into something like .filter in map-reduce language. And that can be expressed in linalg.elementwise.
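A filter then stays inside the elementwise body as a select; a sketch using the full-region form of linalg.map (the precomputed %mask operand is illustrative):

```mlir
// A "filter" as arith.select inside an elementwise op: values failing the
// mask are replaced by zero, leaving no data-dependent control flow behind.
%zero = arith.constant 0.0 : f32
%init = tensor.empty() : tensor<1024xf32>
%filtered = linalg.map
    ins(%vals, %mask : tensor<1024xf32>, tensor<1024xi1>)
    outs(%init : tensor<1024xf32>)
    (%v: f32, %m: i1) {
      %sel = arith.select %m, %v, %zero : f32
      linalg.yield %sel : f32
    }
```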

Discussion on additional cost

Separating them incurs a much bigger memory cost, as map-reduce would: for now, all the partial results are stored instead of being reduced immediately.

We should seek optimizations from the loop-transformation world, like fusion/tiling, after the FHE lowering. I think this is the quite standard way: semantics first and optimization later, instead of writing an optimized affine.for early.

Discussion on lowering

We eventually need to lower it. I think that happens at the boundary of the backend, where linalg is expressed as C++ for loops.

Discussion on frontend

If the reasoning above is acceptable, we might encourage frontend users / kernel writers to use a map-reduce-ish language / functional-programming style to express things instead of for loops, as the former would be much clearer semantically.

@ZenithalHourlyRate ZenithalHourlyRate changed the title Loop support: prefer ilnalg over generic affine.for Loop support: prefer linalg over generic affine.for Mar 16, 2025
@asraa
Collaborator
asraa commented Mar 17, 2025

%rotated_tensor_of_input = linalg.elementwise { rotate } on %input_splatted

+1 on this too. I noticed in the MLP example when I had the for loops in, the noise would grow more because of the iterative rotation by 1. It would be nicer to be able to lower and implement the rotations in a custom way (e.g. by the initial + 1, initial + 2, initial + 3, etc for minimal noise growth).

@j2kun
Collaborator
j2kun commented Mar 17, 2025

After discussing this in office hours, I think we came to the following agreement:

  1. We should prefer using linalg ops when possible, because they are simpler for the noise analysis to understand.
  2. We want to try hard to make the noise analysis as generic as possible, in the interest of supporting general user input programs even in worst-case settings where the noise analysis is completely helpless. This means we should find a way to support loops in generality.
  3. Even with linalg ops, there are noise-impactful trade-offs that may not be easily expressed in linalg ops. For example, with Halevi-Shoup you need to perform n distinct rotations. You could do a linalg.elementwise where one operand is the (replicated) vector you want to rotate and one operand is the range {0..n-1} of shifts required. However, that can materialize in many different ways, the two extremes being (a) generate all n rotation keys and do the rotations in parallel and (b) generate the single rotation key for rotation by 1 and apply the rotations iteratively. (a) uses more key material but would be both faster and incur less noise growth, while (b) incurs the most noise growth but uses the least key material. Somewhere in between are techniques like baby-step-giant-step.

For (2) we talked about the idea of making a custom pass that handles affine.for loops and enforces iteration ciphertext management invariants on loop-carried dependencies. The basic idea is that because the loops are affine, you can enumerate the iteration space and track the noise growth per loop iteration, and use that to determine what to do next.

The simplest version of this pass would require the ciphertext dimension to be the same at the end of each loop iteration, and would require a bootstrap so that each iteration starts with freshly reset noise. A more sophisticated version could do two additional things: (a) determine how many initial loop iterations need to be peeled from the beginning or end of the loop when the invariant doesn't need to be maintained (e.g., this would solve the case of initialization of a loop carried dependency by a plaintext) and (b) determine how much to unroll a loop so that the invariants can be spread out more; e.g., you might only need to bootstrap after every three loop iterations.
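As a strawman, the simplest version of that pass might rewrite the loop as follows (the "fhe.bootstrap" op name is a placeholder for whatever bootstrap op the target scheme exposes, and the body op is purely illustrative):

```mlir
%res = affine.for %i = 0 to 1024 iter_args(%acc = %init) -> (tensor<1x1024xf32>) {
  // Illustrative body: some ciphertext-ciphertext op that grows noise.
  %step = arith.addf %acc, %acc : tensor<1x1024xf32>
  // Enforce the invariant: reset noise so that every iteration starts
  // from a fresh budget and the loop-carried state is iteration-uniform.
  %fresh = "fhe.bootstrap"(%step) : (tensor<1x1024xf32>) -> tensor<1x1024xf32>
  affine.yield %fresh : tensor<1x1024xf32>
}
```

The peeling and unrolling refinements would then move or thin out the bootstrap, rather than change the shape of this rewrite.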

github-actions bot commented Mar 21, 2025

This issue has 2 outstanding TODOs:

This comment was autogenerated by todo-backlinks



3 participants