Intel Mesa Code Lands Big Patch Series For Treating Convergent Values As SIMD8

  • Intel Mesa Code Lands Big Patch Series For Treating Convergent Values As SIMD8

    Phoronix: Intel Mesa Code Lands Big Patch Series For Treating Convergent Values As SIMD8

    A patch series six months in the making and consisting of 24 patches by longtime Intel Linux graphics engineer Ian Romanick was merged on Christmas Eve for Mesa 25.0...


  • #2
    What does "convergent values" mean in this context?



    • #3
      In this context, a value is "convergent" if it has the same value in every SIMD lane.

      Intel GPUs can execute SIMD 8, 16 or 32 lanes wide; before this change, convergent values had to be stored in a register as wide as the SIMD execution. After this change, they can be stored in an 8-lane register, and the GPU will automatically replicate it twice or four times for operations on 16 or 32 lanes.
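
      Roughly, in plain C terms, it is the difference between a value every lane shares and a value that varies per lane. A conceptual sketch only (the names push_constant and lane_id are made up for illustration, not Mesa or hardware terms):

      Code:
      /* Models one SIMD8 execution group in plain C. */
      #include <stdio.h>

      #define LANES 8

      int main(void)
      {
          unsigned push_constant = 42;      /* same in every lane -> convergent */
          unsigned per_lane[LANES];

          for (unsigned lane = 0; lane < LANES; lane++) {
              unsigned lane_id = lane;      /* differs per lane -> divergent */
              per_lane[lane] = lane_id * push_constant;
          }

          /* The convergent value needs only one storage slot; the divergent
           * result needs one slot per lane. */
          for (unsigned lane = 0; lane < LANES; lane++)
              printf("lane %u: %u\n", lane, per_lane[lane]);
          return 0;
      }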



      • #4
        Do such Mesa enhancements automatically benefit Rusticl?



        • #5
          Originally posted by farnz View Post
          In this context, a value is "convergent" if it has the same value in every SIMD lane.
          I assume this distinction is a static determination, no?

          Originally posted by The Patch
          Our register allocator is not clever enough to handle scalar allocations. Its fundamental unit of allocation is SIMD8. Start treating convergent values as SIMD8.
          Is this a hardware or software limitation? It still seems awfully wasteful to burn 256 bits on a 32-bit scalar, but I get that it's a lot better than 512 or 1024 bits.

          So, does the hardware actually have scalar registers and they're just not being used?

          Originally posted by farnz View Post
          Intel GPUs can execute SIMD 8, 16 or 32 lanes wide; before this change, convergent values had to be stored in a register as wide as the SIMD execution. After this change, they can be stored in an 8-lane register, and the GPU will automatically replicate it twice or four times for operations on 16 or 32 lanes.
          Thanks for the explanation. When you say "the GPU will automatically replicate it", does that mean adding a SIMD8 vector to a SIMD32 vector will cause the SIMD8 operand to be replicated automatically to match the width of the larger operand?



          • #6
            Originally posted by coder View Post
            I assume this distinction is a static determination, no?
            This is a patch for the Intel graphics compiler, as the article says, so I am pretty sure it is static: the compiler works out which values are convergent and records that information in the machine code (a sketch of the idea is at the end of this post).

            Originally posted by coder View Post
            Is this a hardware or software limitation? It still seems awfully wasteful to burn 256 bits on a 32-bit scalar, but I get that it's a lot better than 512 or 1024 bits.
            Software. Register allocation is a function of the compiler; the hardware doesn't decide which registers an instruction uses.

            The compiler's architecture can only allocate registers in groups of 8. They could change that, but it would require changing a lot of code.

            Originally posted by coder View Post
            So, does the hardware actually have scalar registers and they're just not being used?


            Modern GPUs have scalar ALUs but not scalar registers; the registers are shared with the SIMD ALUs.

            Originally posted by coder View Post
            Thanks for the explanation. When you say "the GPU will automatically replicate it", does that mean adding a SIMD8 vector to a SIMD32 vector will cause the SIMD8 operand to be replicated automatically to match the width of the larger operand?
            I have not seen the code, so I don't know how it is implemented.

            But it could work like that: store convergent values in a SIMD8 register to reduce register usage, then, when using one in an instruction, copy the value to more registers until you have a full SIMD32.
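
            A minimal sketch of that "compiler checks which values are convergent" idea, assuming a toy IR (not NIR or the actual Intel backend): a value is convergent if it is a uniform/constant, or if all of its sources are convergent.

            Code:
            #include <stdbool.h>
            #include <stdio.h>

            enum kind { KIND_UNIFORM, KIND_LANE_ID, KIND_ADD };

            struct value {
                enum kind kind;
                const struct value *src0, *src1;   /* NULL for leaf values */
            };

            static bool is_convergent(const struct value *v)
            {
                switch (v->kind) {
                case KIND_UNIFORM: return true;    /* same in every lane */
                case KIND_LANE_ID: return false;   /* differs per lane   */
                case KIND_ADD:     return is_convergent(v->src0) &&
                                          is_convergent(v->src1);
                }
                return false;
            }

            int main(void)
            {
                struct value u  = { KIND_UNIFORM, NULL, NULL };
                struct value id = { KIND_LANE_ID, NULL, NULL };
                struct value a  = { KIND_ADD, &u, &u };    /* uniform + uniform -> convergent */
                struct value b  = { KIND_ADD, &u, &id };   /* uniform + lane id -> divergent  */

                printf("a convergent: %d\nb convergent: %d\n",
                       is_convergent(&a), is_convergent(&b));
                return 0;
            }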



            • #7
              Originally posted by coder View Post
              Is this a hardware or software limitation? It still seems awfully wasteful to burn 256 bits on a 32-bit scalar, but I get that it's a lot better than 512 or 1024 bits.

              So, does the hardware actually have scalar registers and they're just not being used?
              No...all registers are 256-bit (8 lanes at 32-bit) until Xe2 (Lunarlake/Battlemage) when they start being 512-bit (16 lanes at 32-bit). You've got the right of it—using 8 lanes for a scalar is still pretty wasteful, but a lot better than wasting 16 or 32 lanes. Eventually, we plan to do SIMD1 scalars, where we use only 1 lane, and can pack things more tightly.

              But, as an incremental step, this let us figure out which values are convergent, teach consumers to handle that, and begin taking advantage of the information. We can then look at allocating scalars more efficiently in a second step.
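
              For concreteness, the numbers above work out like this (pure arithmetic, nothing Intel-specific):

              Code:
              /* Bits consumed by one 32-bit convergent value at each allocation width. */
              #include <stdio.h>

              int main(void)
              {
                  const unsigned value_bits = 32;
                  const unsigned widths[] = { 32, 16, 8, 1 };   /* SIMD32, SIMD16, SIMD8, SIMD1 */

                  for (unsigned i = 0; i < sizeof(widths) / sizeof(widths[0]); i++)
                      printf("SIMD%u allocation: %u bits\n", widths[i], widths[i] * value_bits);
                  return 0;   /* prints 1024, 512, 256 and 32 bits */
              }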

              Originally posted by coder View Post
              Thanks for the explanation. When you say "the GPU will automatically replicate it", does that mean adding a SIMD8 vector to a SIMD32 vector will cause the SIMD8 operand to be replicated automatically to match the width of the larger operand?
              Most instructions can implicitly replicate a scalar source out to all the lanes. For example,

              Code:
              add(16)   r2<1>UD    r4<16,16,1>UD    r8.7<0,1,0>UD
              Would add the 16 lanes in register r4 to r8's single lane 7 and store the result in r2. This is effectively free.
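
              In rough C terms (a sketch of the effect only, not of the actual EU region semantics), that instruction behaves like this:

              Code:
              #include <stdint.h>
              #include <stdio.h>

              int main(void)
              {
                  uint32_t r4[16], r8[8], r2[16];

                  for (int i = 0; i < 16; i++) r4[i] = i;         /* arbitrary per-lane data */
                  for (int i = 0; i < 8;  i++) r8[i] = 100 + i;   /* lane 7 holds 107        */

                  uint32_t scalar = r8[7];              /* the <0,1,0> region: one element, */
                  for (int lane = 0; lane < 16; lane++) /* fed to all 16 lanes              */
                      r2[lane] = r4[lane] + scalar;

                  for (int lane = 0; lane < 16; lane++)
                      printf("r2[%d] = %u\n", lane, r2[lane]);
                  return 0;
              }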
              Free Software Developer .:. Mesa and Xorg
              Opinions expressed in these forum posts are my own.

