-
Notifications
You must be signed in to change notification settings - Fork 24.7k
Implement MKLGenerator #151218
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Implement MKLGenerator #151218
Conversation
🔗 Helpful Links🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151218
Note: Links to docs will display an error until the docs builds have been completed. ✅ You can merge normally! (1 Unrelated Failure)As of commit c91e817 with merge base cd995bf ( UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:
This comment was automatically generated by Dr. CI and updates every 15 minutes. |
@pytorchbot label "topic: not user facing" |
@pytorchbot label "ciflow/linux-aarch64" |
To add the ciflow label This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows. |
@pytorchbot label "ciflow/linux-aarch64" |
{ | ||
// See Note [Acquire lock when using random generators] | ||
std::lock_guard<std::mutex> lock(mklGenerator->mutex_); | ||
mklGenerator->get_stream_copy(stream); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Isn't get_stream_copy thread-safe ?
Those failures can not be a coincidence, are they? |
Hi @michalowski-arm @CaoE, the CI failure seems related. Could you take a look? |
@leslie-fang-intel Of course, and I am glad to help promote the progress of this PR. |
At least some of the failures are due to not handling |
@pytorchbot label "ciflow/linux-aarch64" |
To add these label(s) (ciflow/linux-aarch64) to the PR, please first approve the workflows that are awaiting approval (scroll to the bottom of this page). This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows. |
There are some errors in |
@pytorchbot rebase |
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
Successfully rebased |
2447019
to
4c664fe
Compare
@michalowski-arm can we try reproducing this locally on an x86 machine? You can pull the docker image from |
Yes, I've been looking into it |
@michalowski-arm I downloaded the image locally It seems that CI requires an upgraded version of MKL. |
@CaoE thank you for your insights!
do you guys have a preference as to which version we should update to? |
@fadara01 @abullabib I tried MKL 2025.1 and 2024.2 (used in torch release 2.7), and they can fix the error. Let's see the result of #154198. |
@malfet I have tired to upgrade MKL to 2024.2.0 in CI #154198, but there are many "Inconsistent configuration parameters" errors appears. The new MKL has made some changes to DFTI. The MKL version in pytorch releases is inconsistent with the MKL used by CI, which causes these errors to not be triggered in CI. |
I added the fix in this PR #154198. |
@pytorchbot rebase |
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
Rebase failed due to Command
Raised by https://github.com/pytorch/pytorch/actions/runs/15319034371 |
@pytorchbot rebase |
@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here |
Rebase failed due to Command
Raised by https://github.com/pytorch/pytorch/actions/runs/16070267154 |
* Implements MKLGeneratorImpl which uses MKL/OpenRNG to generate random variates and keeps a consistent global state. * Links MKLGenerator state change to CPUGenerator state changes
e9a6b42
to
c91e817
Compare
@nikhil-arm @fadara01 @agrawal-aka @malfet @CaoE The PR now passes all checks, could I get a review |
CI is passing, review comments are addressed. 1 Ci failure seems to be unrelated. |
This PR aims to fix the issue from #132395 by implementing a new
MKLGeneratorImpl
that stores a consistent, globalvslStream
for use in random numbers generation. This path was previously disabled due to a problem of repeating variates, caused by repeated reseeding of the MKL generator with variates from theCPUGenerator
. This new implementation only seeds theMKLGenerator
once using theCPUGenerator
, and then keeps reusing the samevslStream
, providing the full period of the RNG.For the sake of reproducibility, the saving and restoring of the
MKLGenerator
has been linked toCPUGenerator
state changes, and the former does not provide its ownget_state()
andset_state()
functionality. The point was to keep the user experience identical to before -- they do not need to handle a separateMKLGenerator
explicitly.There already exists a test to check for repetition based on the script from #132395. It can be found
test_distribution.py
astest_multinomial_sequential_draw()
. For the old (reseeded) implementation of the MKLvslStream
, this test showed 21 repetitions. For this new implementation, the test gives 0 repetitions as expected.cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames