There are a bunch of intrinsics in the rocdl dialect for doing scaled conversion to/from the fp4/6/8 types - it's all the ones with `scale32` in their name (though not the `sr` ones - those are stochastic rounding, which we don't use). However, they use different intrinsics for different types and have somewhat funky calling conventions.
In the AMDGPU dialect, we currently have operations like `amdgpu.ext_packed_fp8` and `packed_trunc_2xfp8` for regular conversions to/from fp8, which use the types of the input and output to distinguish the operation being performed.
We should add wrapper operations around those intrinsics for the scaled cases, which also have the implicit "pad with undef" semantics. For `scaling_extf`, we can just take up to 32 (6-bit), 4 (8-bit), or 8 (4-bit) elements, plus a selector index that picks out the relevant byte (in all but the 6-bit case). We might need a special operation for the scaling f8 => f16 operations, which have a tied input, unlike the other extf-likes.
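To make the element counts and the selector idea concrete, here is a rough Python model of the packing arithmetic only. It is a sketch, not the intrinsics' actual behavior: the names `MAX_ELEMS`, `packed_bits`, and `select_byte` are made up for illustration, and treating index 0 as the least significant byte is an assumption.

```python
# Illustrative model of the packing arithmetic; it does not call or
# reproduce any ROCDL intrinsic.

# Element bit width -> maximum element count, as listed above.
MAX_ELEMS = {8: 4, 4: 8, 6: 32}


def packed_bits(elem_bits: int) -> int:
    """Total bits occupied by a full input vector of this element width."""
    return elem_bits * MAX_ELEMS[elem_bits]


def select_byte(word: int, index: int) -> int:
    """Return byte `index` of a 32-bit word.

    Assumption: index 0 is the least significant byte.
    """
    if not 0 <= index < 4:
        raise ValueError("byte index must be in 0..3")
    return (word >> (8 * index)) & 0xFF
```

Under this model the full 8-bit and 4-bit inputs each fit in one 32-bit word, so a byte selector is meaningful there, while the full 6-bit input spans 192 bits and takes no selector.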
For the truncation operation, we'll likely want to have an operation for all the tied input cases (where you select a byte of the output to place the result into), and one for the 6-bit cases where it just does the truncation.
These ops should lower to the ROCDL intrinsics following the existing fp8 operation patterns.