Implement MKLGenerator by michalowski-arm · Pull Request #151218 · pytorch/pytorch · GitHub

Implement MKLGenerator #151218


Open · wants to merge 8 commits into main
Conversation

michalowski-arm
Contributor
@michalowski-arm michalowski-arm commented Apr 14, 2025

This PR aims to fix the issue from #132395 by implementing a new MKLGeneratorImpl that stores a consistent, global vslStream for use in random number generation. This path was previously disabled due to a problem of repeating variates, caused by repeated reseeding of the MKL generator with variates from the CPUGenerator. This new implementation seeds the MKLGenerator only once using the CPUGenerator, and then keeps reusing the same vslStream, providing the full period of the RNG.

For the sake of reproducibility, the saving and restoring of the MKLGenerator has been linked to CPUGenerator state changes, and the former does not provide its own get_state() and set_state() functionality. The point was to keep the user experience identical to before -- they do not need to handle a separate MKLGenerator explicitly.

There already exists a test to check for repetition, based on the script from #132395. It can be found in test_distributions.py as test_multinomial_sequential_draw(). For the old (reseeded) implementation of the MKL vslStream, this test showed 21 repetitions. With this new implementation, the test gives 0 repetitions, as expected.
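The seed-once design can be illustrated with a toy Python sketch (random.Random stands in for both generators; ToyMKLGenerator is an illustrative name, not the actual C++ class):

```python
import random

class ToyMKLGenerator:
    """Toy stand-in for the new MKLGeneratorImpl: seeded exactly once
    from a parent generator, then the same stream is reused forever."""

    def __init__(self, parent: random.Random):
        # Seed once from the parent (the CPUGenerator in the PR) instead
        # of reseeding on every sampling call, which caused the repeats.
        self._stream = random.Random(parent.getrandbits(64))

    def draw(self, n: int) -> list:
        # Every call continues the same stream, so the full period of
        # the underlying RNG is available.
        return [self._stream.random() for _ in range(n)]

parent = random.Random(0)
gen = ToyMKLGenerator(parent)
first, second = gen.draw(4), gen.draw(4)
assert first != second  # the stream advances rather than restarting
```

Reproducibility is preserved because the stream is derived deterministically from the parent's seed.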

cc @jgong5 @mingfeima @XiaobingSuper @sanchitintel @ashokei @jingxu10 @jerryzh168 @voznesenskym @penguinwu @EikanWang @Guobing-Chen @zhuhaozhe @blzheng @wenzhe-nrv @jiayisunx @chenyang78 @kadeng @chauhang @amjames

@pytorch-bot pytorch-bot bot added the module: cpu CPU specific problem (e.g., perf, algorithm) label Apr 14, 2025
pytorch-bot bot commented Apr 14, 2025

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/151218

Note: Links to docs will display an error until the docs builds have been completed.

✅ You can merge normally! (1 Unrelated Failure)

As of commit c91e817 with merge base cd995bf:

UNSTABLE - The following job is marked as unstable, possibly due to flakiness on trunk:

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@michalowski-arm
Contributor Author

@michalowski-arm michalowski-arm changed the title from "[Draft] Implement MKLGenerator" to "Implement MKLGenerator" Apr 14, 2025
@michalowski-arm
Contributor Author

@pytorchbot label "topic: not user facing"

@nikhil-arm
Collaborator

@pytorchbot label "ciflow/linux-aarch64"

@pytorch-bot pytorch-bot bot added the ciflow/linux-aarch64 linux aarch64 CI workflow label Apr 14, 2025
@nikhil-arm nikhil-arm requested a review from digantdesai April 14, 2025 11:17
pytorch-bot bot commented Apr 14, 2025

To add the ciflow label ciflow/linux-aarch64 please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@pytorch-bot pytorch-bot bot removed the ciflow/linux-aarch64 linux aarch64 CI workflow label Apr 14, 2025
@nikhil-arm nikhil-arm requested a review from malfet April 14, 2025 11:17
@nikhil-arm
Collaborator

@pytorchbot label "ciflow/linux-aarch64"

@pytorch-bot pytorch-bot bot added the ciflow/linux-aarch64 linux aarch64 CI workflow label Apr 14, 2025
{
// See Note [Acquire lock when using random generators]
std::lock_guard<std::mutex> lock(mklGenerator->mutex_);
mklGenerator->get_stream_copy(stream);
Collaborator

Isn't get_stream_copy thread-safe?
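For context, the convention behind PyTorch's Note [Acquire lock when using random generators] is that generator methods are not internally synchronized; the caller holds mutex_, as the lock_guard in the snippet above does. A minimal Python sketch of that caller-locks pattern (names are illustrative):

```python
import threading

class ToyGenerator:
    """Sketch of the caller-locks convention: methods do no locking of
    their own; callers are expected to hold mutex_ around calls."""

    def __init__(self) -> None:
        self.mutex_ = threading.Lock()
        self._state = [1, 2, 3]

    def get_stream_copy(self):
        # Not thread-safe by itself -- mirrors the reviewed snippet,
        # where the caller wraps this call in a lock_guard on mutex_.
        return list(self._state)

g = ToyGenerator()
with g.mutex_:                  # caller acquires the lock
    snapshot = g.get_stream_copy()
```

Keeping the lock at the call site lets one critical section cover several generator operations at once.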

@malfet
Contributor
malfet commented Apr 15, 2025

Those failures cannot be a coincidence, can they?

@mikaylagawarecki mikaylagawarecki added the triaged This issue has been looked at by a team member, and triaged and prioritized into an appropriate module label Apr 16, 2025
@leslie-fang-intel
Collaborator
leslie-fang-intel commented Apr 17, 2025

Hi @michalowski-arm @CaoE, the CI failure seems related. Could you take a look?

@CaoE
Collaborator
CaoE commented Apr 17, 2025

@leslie-fang-intel Of course; I'm glad to help move this PR forward.

@michalowski-arm
Contributor Author

At least some of the failures are due to not handling torch.manual_seed appropriately. It resets the CPUGenerator to the beginning of the input seed's sequence (and the tests expect that reset), but the MKLGenerator is unaffected and just continues its old sequence.
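A toy sketch of the coupling this implies (hypothetical names; the real change couples the C++ generators): manual_seed must reset the MKL-style stream along with the CPU generator:

```python
import random

class CoupledGenerators:
    """Toy model: resetting the CPU generator re-derives the MKL-style
    stream, so both restart together under manual_seed."""

    def __init__(self, seed: int) -> None:
        self.manual_seed(seed)

    def manual_seed(self, seed: int) -> None:
        self.cpu = random.Random(seed)
        # Without this re-derivation, the MKL stream would keep running
        # its old sequence after a reseed -- the failure described above.
        self.mkl = random.Random(self.cpu.getrandbits(64))

g = CoupledGenerators(42)
first = g.mkl.random()
g.mkl.random()                  # advance the MKL stream
g.manual_seed(42)               # reset: both generators restart
assert g.mkl.random() == first
```

In the toy model, omitting the re-derivation in manual_seed reproduces the failing behavior: the stream keeps advancing past the reset point.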

@pytorch-bot pytorch-bot bot removed the ciflow/linux-aarch64 linux aarch64 CI workflow label Apr 17, 2025
@michalowski-arm
Contributor Author

@pytorchbot label "ciflow/linux-aarch64"

Copy link
pytorch-bot bot commented Apr 17, 2025

To add these label(s) (ciflow/linux-aarch64) to the PR, please first approve the workflows that are awaiting approval (scroll to the bottom of this page).

This helps ensure we don't trigger CI on this PR until it is actually authorized to do so. Please ping one of the reviewers if you do not have access to approve and run workflows.

@CaoE
Collaborator
CaoE commented Apr 21, 2025

There are some errors in test_distributions.py, for example test_cdf_icdf_inverse. My local tests show that using VSL_BRNG_MT19937 can pass the test. There may be other tests that will fail and can be made to pass by modifying the sample length.

@fadara01
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Successfully rebased mklrng-base onto refs/remotes/origin/viable/strict, please pull locally before adding more changes (for example, via git checkout mklrng-base && git pull --rebase)


@fadara01
Collaborator
fadara01 commented May 8, 2025

@michalowski-arm can we try reproducing this locally on an x86 machine?
It might give us a longer stack trace or more insights about the failure.

You can pull the docker image from pytorch/pytorch-linux-jammy-py3-clang15-asan as done here, then download the build and test artifacts for linux-jammy-clang15-asan from https://hud.pytorch.org/pr/151218 and follow the "Test" steps in the pipeline.

@michalowski-arm
Contributor Author


Yes, I've been looking into it.

@CaoE
Collaborator
CaoE commented May 23, 2025

@michalowski-arm I downloaded the image locally (docker pull ghcr.io/pytorch/ci-image:pytorch-linux-jammy-py3-clang15-asan-90e2f0da05deb1b09040f7c1ee78ad0f34911877) and reproduced the Intel MKL function load error: __vslNewStreamEx, as done here.
I found that the MKL 2021.4.0 used in the conda environment py_3.10 of this image is relatively old. I manually deleted the old MKL, ran pip install mkl-static mkl-include, and then found that the issue was gone.

It seems that CI requires an upgraded version of MKL.

@fadara01
Collaborator

@CaoE thank you for your insights!

I found that the MKL 2021.4.0 used in the conda environment py_3.10

Do you have a preference as to which version we should update to?

@CaoE
Collaborator
CaoE commented May 23, 2025

@fadara01 @abullabib I tried MKL 2025.1 and 2024.2 (used in torch release 2.7), and they can fix the error. Let's see the result of #154198.

@CaoE CaoE mentioned this pull request May 23, 2025
@CaoE
Collaborator
CaoE commented May 28, 2025

@malfet I have tried to upgrade MKL to 2024.2.0 in CI #154198, but many "Inconsistent configuration parameters" errors appear. The new MKL has made some changes to DFTI. The MKL version in PyTorch releases is inconsistent with the MKL used by CI, which is why these errors were not triggered in CI before.
The torch FFT ops seem to need fixing for these "Inconsistent configuration parameters" issues (#154477).

@CaoE
Collaborator
CaoE commented May 28, 2025

I added the fix in this PR #154198.

@nikhil-arm
Collaborator

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/151218/head returned non-zero exit code 1

Rebasing (1/7)
Rebasing (2/7)
Rebasing (3/7)
Rebasing (4/7)
Rebasing (5/7)
Rebasing (6/7)
Rebasing (7/7)
Auto-merging benchmarks/dynamo/pr_time_benchmarks/expected_results.csv
CONFLICT (content): Merge conflict in benchmarks/dynamo/pr_time_benchmarks/expected_results.csv
error: could not apply 4c664feeebd... Update expected benchmark results
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply 4c664feeebd... Update expected benchmark results

Raised by https://github.com/pytorch/pytorch/actions/runs/15319034371

@michalowski-arm
Contributor Author

@pytorchbot rebase

@pytorchmergebot
Collaborator

@pytorchbot started a rebase job onto refs/remotes/origin/viable/strict. Check the current status here

@pytorchmergebot
Collaborator

Rebase failed due to Command git -C /home/runner/work/pytorch/pytorch rebase refs/remotes/origin/viable/strict pull/151218/head returned non-zero exit code 1

Rebasing (1/9)
Rebasing (2/9)
Rebasing (3/9)
Rebasing (4/9)
Rebasing (5/9)
Rebasing (6/9)
Rebasing (7/9)
Auto-merging benchmarks/dynamo/pr_time_benchmarks/expected_results.csv
CONFLICT (content): Merge conflict in benchmarks/dynamo/pr_time_benchmarks/expected_results.csv
error: could not apply 4c664feeebd... Update expected benchmark results
hint: Resolve all conflicts manually, mark them as resolved with
hint: "git add/rm <conflicted_files>", then run "git rebase --continue".
hint: You can instead skip this commit: run "git rebase --skip".
hint: To abort and get back to the state before "git rebase", run "git rebase --abort".
hint: Disable this message with "git config set advice.mergeConflict false"
Could not apply 4c664feeebd... Update expected benchmark results

Raised by https://github.com/pytorch/pytorch/actions/runs/16070267154

@michalowski-arm
Contributor Author

@nikhil-arm @fadara01 @agrawal-aka @malfet @CaoE The PR now passes all checks; could I get a review?

@nikhil-arm
Collaborator

CI is passing and review comments are addressed. The one CI failure seems to be unrelated.
Ready for merge.
