Benchmarks are compiled with -O2 and debug options · Issue #407 · linbox-team/fflas-ffpack · GitHub

Benchmarks are compiled with -O2 and debug options #407

Open
dlesnoff opened this issue Mar 21, 2025 · 3 comments

@dlesnoff

This issue simply asks for a clarification (I cannot add a label like "documentation issue" myself).

In the benchmarks' Makefile, one can find:

FLASFFPACK_CXXFLAGS =  -O2 -march=native -Wall -DNDEBUG -UDEBUG

Is there a reason to compile with -O2 and not with -O3?

Are the -DNDEBUG and -UDEBUG flags necessary? Could they hinder performance?

@vneiger
Member
vneiger commented Mar 21, 2025

Others should be able to say more than I can, but one point is that -O3 is not always beneficial (it can actually decrease performance). If SIMD and loop unrolling are already reasonably well exploited in the code, it is not clear to me what gain to expect from -O3 (did you have anything specific in mind?).

@dlesnoff
Author
dlesnoff commented Mar 21, 2025

I noticed low performance of benchmark-fgemm on matrix products with unbalanced dimensions. With $m$, $k$ and $n$ the dimensions of the input matrices, for $m=10923$, $k=32768$ and $n=32$, benchmarks/benchmark-fgemm.C gave me much lower performance (~30 GFops with (un)balanced double arithmetic and a small enough bitsize, that is $\textrm{bitsize}(p) \leq 21$) than my custom implementation on a slower CPU (~500 GFops with unbalanced double arithmetic and a similar bitsize on an Intel Xeon 6248).

I am asking myself these questions:
1. Is it a compilation issue (-O2 instead of -O3)? Changing the Makefile suffices to check.
2. Are the dimensions problematic for OpenBLAS? (I have some OpenBLAS dgemm benchmarks already written.)
3. Is too much time spent searching for the maximum norm of the input matrices?

I can easily answer the first two questions, and will provide information tomorrow.
Sadly, I do not know how to activate subtimers in FFLAS. Answering the third question will be harder.
If anything can be done to enable internal timers or to easily profile the program, it would tremendously help me!

EDIT: A few runs of OpenBLAS confirmed that the problem described above comes from OpenBLAS performance and not from FFLAS performance.

@vneiger
Member
vneiger commented Mar 23, 2025

On my side, using AMD-BLIS (https://github.com/amd/blis) there also is some slowdown when going towards unbalanced dimensions, but nothing that looks too surprising to me. For example:

8 threads:
Time: 0.732497 Gfops: 341.298 -q 131071 -m 5000 -k 5000 -n 5000 -w -1 -i 10 -p 1 -t 8 -b 8
Time: 0.13199 Gfops: 145.466 -q 131071 -m 10000 -k 30000 -n 32 -w -1 -i 10 -p 1 -t 8 -b 8

1 thread:
Time: 3.36598 Gfops: 74.2725 -q 131071 -m 5000 -k 5000 -n 5000 -w -1 -i 10 -p 0 -t 1 -b 1
Time: 0.354903 Gfops: 54.0993 -q 131071 -m 10000 -k 30000 -n 32 -w -1 -i 10 -p 0 -t 1 -b 1
