Loop support: prefer `linalg` over generic `affine.for` · Issue #1569 · google/heir · GitHub

Loop support: prefer linalg over generic affine.for #1569

Open
ZenithalHourlyRate opened this issue Mar 16, 2025 · 4 comments

Comments

@ZenithalHourlyRate
Collaborator

HEIR has supported affine.for, but from the perspective of dataflow analysis, a generic affine.for with an arbitrary body is hard to understand (understanding loops has always been one of the hard parts of a compiler). I propose using linalg to describe what we have now.

I further argue (not 100% sure) that any affine.for that could be understood/accepted by an FHE-specific analysis can always be rewritten in linalg with primitives like elementwise/reduce, since the task of the analysis is precisely to figure out which part of the affine.for body is elementwise and how the loop-carried variable forms a reduction. This is somewhat similar to the map-reduce programming model.

affine.for usage

Semantically, there are two places where we use affine.for:

  1. User input (data semantic), like roberts_cross/box_blur, where the algorithm is naturally described in for loops.
  2. Tensors of ciphertexts (ciphertext semantic), like the Halevi-Shoup matmul.

Data semantic

For the data semantic, we already need to handle these loops in some way:

  1. As we already use full-loop-unroll and heir-vectorizer to handle these loops, the further lowering into the FHE world should not see them.
  2. Alternatively, the layout packing pass is responsible for transforming the data-semantic loop into a ciphertext-semantic loop, and that pass could also rewrite it in linalg.

So the problem reduces to loop support in ciphertext semantic.

Common ciphertext-semantic patterns in current HEIR codebase

There are quite a lot of patterns like

affine.for tensor<NxMx!ct>
  tensor.extract one element
  computation
  tensor.insert into loop-carried variable

or

affine.for memref
  memref.load one element
  computation
  memref.store into memory

which is actually an elementwise op on the input plus a reduction on the loop-carried variable. Instead of letting the lowering emit such a fused loop and making the analysis pull it apart, we can split it apart early, like

%res_tensor = linalg.elementwise_binary { body }
%res_scalar = linalg.reduce %res_tensor
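Concretely, the split might look like the following sketch, using the shorthand forms of upstream linalg.map/linalg.reduce (shapes and the tensor.empty init are illustrative, not taken from the HEIR codebase):

```mlir
// Elementwise multiply of two tensors, then a reduction over the result,
// written as two separate linalg ops instead of one fused affine.for.
%init = tensor.empty() : tensor<1024xf32>
%prod = linalg.map { arith.mulf }
          ins(%lhs, %rhs : tensor<1024xf32>, tensor<1024xf32>)
          outs(%init : tensor<1024xf32>)
%sum = linalg.reduce { arith.addf }
         ins(%prod : tensor<1024xf32>)
         outs(%acc : tensor<f32>)
         dimensions = [0]
```

The elementwise part and the reduction are now separate ops with explicit semantics, so an analysis never has to reverse-engineer them out of a loop body.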

Case study for the matmul kernel

The Halevi-Shoup lowering should lower from a data-semantic linalg.matmul to a ciphertext-semantic combination of linalg.some_op.

The current lowering in main looks like

      // %2 is the reduction variable (result)
      // %input0 is the input vector x
      // %arg3 is the progressive rotation of x
      %3:2 = affine.for %arg1 = 1 to 1024 iter_args(%arg2 = %2, %arg3 = %input0) -> (tensor<1x1024xf32>, tensor<1x1024xf32>) {
        %4 = tensor_ext.rotate %arg3, %c1 : tensor<1x1024xf32>, index
        %extracted_slice = tensor.extract_slice %cst_0[%arg1, 0] [1, 1024] [1, 1] : tensor<1024x1024xf32> to tensor<1x1024xf32>
        %5 = arith.mulf %4, %extracted_slice : tensor<1x1024xf32>
        %6 = arith.addf %arg2, %5 : tensor<1x1024xf32>
        affine.yield %6, %4 : tensor<1x1024xf32>, tensor<1x1024xf32>
      }

The elementwise part of it is clearly the mulf (with tensor splatting), and the reduction is the addf, but the rotate part is quite hard to understand. In each analysis we have to write an individual rule to recognize that it is a progressive rotation, and NoiseAnalysis is quite unhappy with it; see #1517 (comment)

Instead we could have

%rotated_tensor_of_input = linalg.elementwise { rotate } on %input_splatted
%mul_result = linalg.elementwise { mulf } on %weight and %rotated_tensor_of_input
%result = linalg.reduce %mul_result

And %rotated_tensor_of_input can be produced in many ways, like an 8-4-2-1 or BSGS (32-16-8-4/3,2,1) scheme, instead of rotating by each index / rotating progressively.
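For instance, since rotations compose, a rotation by 7 can be assembled from power-of-two rotations alone; a sketch reusing the tensor_ext.rotate spelling from the snippet above (the %c* constants are assumed defined):

```mlir
// rotate-by-7 = rotate-by-4, then rotate-by-2, then rotate-by-1,
// so only log2(n) rotation keys (for 1, 2, 4, ...) are needed,
// at the cost of chaining the rotations (and their noise) sequentially.
%r4 = tensor_ext.rotate %v, %c4 : tensor<1x1024xf32>, index
%r6 = tensor_ext.rotate %r4, %c2 : tensor<1x1024xf32>, index
%r7 = tensor_ext.rotate %r6, %c1 : tensor<1x1024xf32>, index
```

With the rotations materialized as a separate linalg op, a later pass is free to pick whichever decomposition best fits the key-material/noise budget.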

Analysis requirement

The following types of analysis in the HEIR FHE world must properly handle loops to produce a correct lowering:

  1. Secretness Analysis (for whether it is secret or not)
  2. Level analysis (for RNS level)
  3. Dimension Analysis (for ciphertext size)
  4. Scale Analysis (for CKKS/BGV scale management)
  5. Noise Analysis (for correctly determining noise bound and parameter selection)

With linalg clearly stating the batch behavior, instead of a generic affine.for, the lives of these analyses could be much easier. Otherwise each of them needs to separate the loop into its different semantic parts and understand it (quick reflection: how could an analysis understand the rotate %c1 in the Halevi-Shoup kernel above?).

Another point in favor of linalg: since if is hard to analyse, we already have convert-if-to-select, which turns it into something like .filter in map-reduce language. And that can be expressed in linalg.elementwise.
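A filter then stays inside the elementwise body as a select; a sketch using the full-region form of linalg.map (the precomputed %mask operand is illustrative):

```mlir
// A "filter" as arith.select inside an elementwise op: values failing the
// mask are replaced by zero, leaving no data-dependent control flow behind.
%zero = arith.constant 0.0 : f32
%init = tensor.empty() : tensor<1024xf32>
%filtered = linalg.map
    ins(%vals, %mask : tensor<1024xf32>, tensor<1024xi1>)
    outs(%init : tensor<1024xf32>)
    (%v: f32, %m: i1) {
      %sel = arith.select %m, %v, %zero : f32
      linalg.yield %sel : f32
    }
```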

Discussion on additional cost

Separating them incurs a much bigger memory cost, as map-reduce would: for now, all the partial results are stored instead of being reduced immediately.

We should seek optimizations from the loop-transformation world, like fusion/tiling, after the FHE lowering. I think this is the quite standard way: semantics first and optimization later, instead of writing an optimized affine.for early.

Discussion on lowering

We eventually need to lower it. I think that happens at the boundary of the backend, where linalg is expressed as C++ for loops.

Discussion on frontend

If the reasoning above is acceptable, we might encourage frontend users / kernel writers to use a map-reduce-ish language / functional-programming style to express things instead of for loops, as the former would be much clearer semantically.

@ZenithalHourlyRate ZenithalHourlyRate changed the title Loop support: prefer ilnalg over generic affine.for Loop support: prefer linalg over generic affine.for Mar 16, 2025
@asraa
Collaborator
asraa commented Mar 17, 2025

%rotated_tensor_of_input = linalg.elementwise { rotate } on %input_splatted

+1 on this too. I noticed in the MLP example when I had the for loops in, the noise would grow more because of the iterative rotation by 1. It would be nicer to be able to lower and implement the rotations in a custom way (e.g. by the initial + 1, initial + 2, initial + 3, etc for minimal noise growth).

@j2kun
Collaborator
j2kun commented Mar 17, 2025

After discussing this in office hours, I think we came to the following agreement:

  1. We should prefer using linalg ops when possible, because they are simpler for the noise analysis to understand.
  2. We want to try hard to make the noise analysis as generic as possible, in the interest of supporting general user input programs even in worst-case settings where the noise analysis is completely helpless. This means we should find a way to support loops in generality.
  3. Even with linalg ops, there are noise-impactful trade-offs that may not be easily expressed in linalg ops. For example, with Halevi-Shoup you need to perform n distinct rotations. You could do a linalg.elementwise where one operand is the (replicated) vector you want to rotate and one operand is the range {0..n-1} of shifts required. However, that can materialize in many different ways, the two extremes being (a) generate all n rotation keys and do the rotations in parallel and (b) generate the single rotation key for rotation by 1 and apply the rotations iteratively. (a) uses more key material but would be both faster and incur less noise growth, while (b) incurs the most noise growth but uses the least key material. Somewhere in between are techniques like baby-step-giant-step.

For (2) we talked about the idea of making a custom pass that handles affine.for loops and enforces iteration ciphertext management invariants on loop-carried dependencies. The basic idea is that because the loops are affine, you can enumerate the iteration space and track the noise growth per loop iteration, and use that to determine what to do next.

The simplest version of this pass would require the ciphertext dimension to be the same at the end of each loop iteration, and would require a bootstrap so that each iteration starts with freshly reset noise. A more sophisticated version could do two additional things: (a) determine how many initial loop iterations need to be peeled from the beginning or end of the loop when the invariant doesn't need to be maintained (e.g., this would solve the case of initialization of a loop carried dependency by a plaintext) and (b) determine how much to unroll a loop so that the invariants can be spread out more; e.g., you might only need to bootstrap after every three loop iterations.
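As a strawman, the simplest version of that pass might rewrite the loop as follows (the "fhe.bootstrap" op name is a placeholder for whatever bootstrap op the target scheme exposes, and the body op is purely illustrative):

```mlir
%res = affine.for %i = 0 to 1024 iter_args(%acc = %init) -> (tensor<1x1024xf32>) {
  // Illustrative body: some ciphertext-ciphertext op that grows noise.
  %step = arith.addf %acc, %acc : tensor<1x1024xf32>
  // Enforce the invariant: reset noise so that every iteration starts
  // from a fresh budget and the loop-carried state is iteration-uniform.
  %fresh = "fhe.bootstrap"(%step) : (tensor<1x1024xf32>) -> tensor<1x1024xf32>
  affine.yield %fresh : tensor<1x1024xf32>
}
```

The peeling and unrolling refinements would then move or thin out the bootstrap, rather than change the shape of this rewrite.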

github-actions bot commented Mar 21, 2025

This issue has 2 outstanding TODOs:

This comment was autogenerated by todo-backlinks



3 participants