Add AMDGPU dialect ops for scaled fp conversions #20890

krzysz00 · 2025-05-22T18:45:47Z

There are a bunch of intrinsics in the rocdl dialect for doing scaled conversions to/from the fp4/6/8 types: all the ones with scale32 in their name (though not the sr ones, which are stochastic rounding, which we don't use). However, they use different intrinsics for different types and have somewhat funky calling conventions.

In the AMDGPU dialect, we currently have operations like amdgpu.ext_packed_fp8 and packed_trunc_2xfp8 for regular conversions to/from fp8, which use the types of the input and output to distinguish the operation being performed.

We should add wrapper operations around those intrinsics for the scaled cases, which also have the implicit "pad with undef" semantics. For scaling_extf, we can just take up to N elements (32 for 6-bit, 4 for 8-bit, 8 for 4-bit) and a selector index that picks out the relevant byte (in all but the 6-bit case). We might need a special operation for the scaling f8 => f16 operations, which have a tied input, unlike the other extf-likes.
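To make the "selector index picks out the relevant byte" idea concrete, here is a rough Python model of what a scaled extension from packed fp4 might compute. This is a sketch of the semantics only, not the proposed op or the ROCDL intrinsic: the function names `decode_e2m1` and `scaling_extf_fp4` are hypothetical, the fp4 encoding is assumed to be E2M1, and the low-nibble-first element order is an assumption.

```python
# Magnitudes for the low 3 bits of an fp4 value, assuming the E2M1 encoding.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def decode_e2m1(nibble: int) -> float:
    """Decode one 4-bit E2M1 value (bit 3 is the sign)."""
    sign = -1.0 if nibble & 0x8 else 1.0
    return sign * E2M1_MAGNITUDES[nibble & 0x7]


def scaling_extf_fp4(packed: int, scale: float, byte_sel: int):
    """Hypothetical model: extend the two fp4 values in one byte of a 32-bit word.

    `packed` holds up to 8 fp4 elements; `byte_sel` (0..3) picks which byte
    to convert. Element order within the byte is assumed low nibble first.
    """
    if not 0 <= byte_sel < 4:
        raise ValueError("byte_sel must be in 0..3")
    byte = (packed >> (8 * byte_sel)) & 0xFF
    lo = decode_e2m1(byte & 0xF)
    hi = decode_e2m1(byte >> 4)
    # Scaling is modeled as a plain multiply after decoding.
    return (lo * scale, hi * scale)
```

For example, `scaling_extf_fp4(0x7A21, 2.0, 0)` decodes byte `0x21` and `scaling_extf_fp4(0x7A21, 2.0, 1)` decodes byte `0x7A`; the other bytes of the word are ignored.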

For the truncation operation, we'll likely want to have an operation for all the tied input cases (where you select a byte of the output to place the result into), and one for the 6-bit cases where it just does the truncation.
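The tied-input truncation can be modeled in the same spirit. The sketch below is self-contained and purely illustrative: `encode_e2m1` and `scaling_truncf_fp4` are hypothetical names, fp4 is assumed to be E2M1, scaling is modeled as a division by a nonzero scale, out-of-range values are assumed to saturate, and the tie-breaking in rounding is a simplification rather than the hardware's rounding mode.

```python
# Magnitudes for the low 3 bits of an fp4 value, assuming the E2M1 encoding.
E2M1_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]


def encode_e2m1(value: float) -> int:
    """Round a float to the nearest E2M1 code (simplified tie-breaking)."""
    sign = 0x8 if value < 0 else 0x0
    mag = min(abs(value), 6.0)  # assumption: saturate at the largest magnitude
    # min() returns the first nearest entry, so ties go to the smaller magnitude.
    code = min(range(8), key=lambda i: abs(E2M1_MAGNITUDES[i] - mag))
    return sign | code


def scaling_truncf_fp4(old: int, a: float, b: float, scale: float, byte_sel: int) -> int:
    """Hypothetical model: truncate two floats to fp4 and place them in one byte.

    `old` is the tied 32-bit input; only the byte chosen by `byte_sel` (0..3)
    is overwritten, and the other three bytes pass through unchanged.
    """
    if not 0 <= byte_sel < 4:
        raise ValueError("byte_sel must be in 0..3")
    byte = encode_e2m1(a / scale) | (encode_e2m1(b / scale) << 4)
    shift = 8 * byte_sel
    cleared = old & ~(0xFF << shift) & 0xFFFFFFFF
    return cleared | (byte << shift)
```

The 6-bit case would not need the `old` and `byte_sel` parameters, since it produces the whole result rather than updating one byte of an existing value.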

These ops should lower to the ROCDL intrinsics following the existing fp8 operation patterns.

tgymnich · 2025-05-27T06:54:42Z

llvm/llvm-project#141554

krzysz00 mentioned this issue May 28, 2025

Work breakdown for MXFP enablement #20920

Open

23 tasks
