Describe the bug
The gather operation is about 5 times slower when AVX2 instructions are enabled with -mavx2.
To Reproduce
- Set up Google Benchmark project
- Disable CPU frequency scaling with sudo cpupower frequency-set --governor performance
- Test the following code with -O3 and with -O3 -mavx2:
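#include <benchmark/benchmark.h>
#include <eve/module/core.hpp>
#include <eve/wide.hpp>

void run(benchmark::State& state) {
    float data[4] = {1, 2, 3, 4};
    for (auto _ : state) {
        // Gather data[2], data[3], data[0], data[1] into a 4-lane float vector.
        eve::wide<float, eve::fixed<4>> vec =
            eve::gather(data, eve::wide<unsigned char, eve::fixed<4>>{2, 3, 0, 1});
        benchmark::DoNotOptimize(vec);
    }
}
BENCHMARK(run);
BENCHMARK_MAIN();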
Without -mavx2:
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
run             0.198 ns        0.198 ns   3315565533
With -mavx2:
-----------------------------------------------------
Benchmark           Time             CPU   Iterations
-----------------------------------------------------
run              1.01 ns         1.01 ns    648857193
Setup:
- Compiler: g++ 14.2.1, clang++ 19.1.7
- OS: Gentoo Linux
- CPU: Ryzen 9 7940HS
- Instruction sets used: SSE, AVX2
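For reference, the gather in the reproducer should be equivalent to the scalar loop below. This is only a sketch of the expected semantics (out[i] = base[idx[i]]); `scalar_gather` is a hypothetical helper written for this report, not part of EVE. It might be useful as a baseline in the same benchmark under both flag sets.

```cpp
#include <array>
#include <cstddef>

// Hypothetical helper (not an EVE function): scalar equivalent of the
// gather used in the reproducer. Each output lane i reads base[idx[i]].
template <typename T, typename I, std::size_t N>
std::array<T, N> scalar_gather(const T* base, const std::array<I, N>& idx) {
    std::array<T, N> out{};
    for (std::size_t i = 0; i < N; ++i) {
        out[i] = base[static_cast<std::size_t>(idx[i])];
    }
    return out;
}
```

With data = {1, 2, 3, 4} and indices {2, 3, 0, 1}, as in the benchmark above, this yields {3, 4, 1, 2}.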