8000 [Util] Fix intermittent crash in OptimizeIntArithmeticPass by jtuyls · Pull Request #20637 · iree-org/iree · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

[Util] Fix intermittent crash in OptimizeIntArithmeticPass #20637

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Apr 25, 2025

Conversation

jtuyls
Copy link
Contributor
@jtuyls jtuyls commented Apr 25, 2025

We're seeing intermittent failures in 'sharktank_model_tests :: rocm_hip_w7900' tests on main:

The following issue contains more information on the crash and a reproducer: #20636

I am not entirely sure whether this PR contains a fix or a workaround. On one hand, it seems like correctly updating and invalidating all states manually in the transformations should lead to the same result without needing to recompute all states. On the other hand, the DataFlowSolver mentions that all analysis states should be erased when IR changes before a re-run (https://github.com/llvm/llvm-project/blob/12a31658ea36cda74157c6b4e6b6c031e39a19c0/mlir/include/mlir/Analysis/DataFlowFramework.h#L320):

/// Steps to re-run a data-flow analysis when IR changes:
/// 1. Erase all analysis states as they are no longer valid.
/// 2. Re-run the analysis using initializeAndRun.

Note: I didn't add a test because I don't have a deterministic reproducer. As the crash is intermittent, the reproducer needs to be run a large number of times and will fail once in 10-200 runs maybe.

Signed-off-by: Jorn Tuyls <jorn.tuyls@gmail.com>
@jtuyls jtuyls requested a review from benvanik as a code owner April 25, 2025 13:10
Copy link
Contributor
@krzysz00 krzysz00 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assuming this doesn't massively increase compile time, this is good as a workaround s d might even be a fix. Let's land it and sort things out later

(The entire silver structure is an awkward mess but not relevant here)

@jtuyls
Copy link
Contributor Author
jtuyls commented Apr 25, 2025

Assuming this doesn't massively increase compile time

Yeah, I was concerned about that as well and visually compared the compilation time of the previously failing test case (toy_llama) and it didn't feel any different. And the sharktank CI test wall clock times seem to be similar as well.

@krzysz00
Copy link
Contributor

Can you merge this or should I?

@jtuyls jtuyls merged commit d33dcb5 into iree-org:main Apr 25, 2025
153 of 162 checks passed
@jtuyls jtuyls deleted the fix-intermittent-cast-failure branch April 25, 2025 19:19
KyleHerndon pushed a commit to KyleHerndon/iree that referenced this pull request May 7, 2025
…20637)

We're seeing intermittent failures in 'sharktank_model_tests ::
rocm_hip_w7900' tests on main:
-
https://github.com/iree-org/iree/actions/runs/14600283542/job/40957129060?pr=20601
-
https://github.com/iree-org/iree/actions/runs/14636457142/job/41069048559
-
https://github.com/iree-org/iree/actions/runs/14630167718/job/41051068642

The following issue contains more information on the crash and a
reproducer: iree-org#20636

I am not entirely sure whether this PR contains a fix or a workaround.
On one hand, it seems like correctly updating and invalidating all
states manually in the transformations should lead to the same result
without needing to recompute all states. On the other hand, the
`DataFlowSolver` mentions that all analysis states should be erased when
IR changes before a re-run
(https://github.com/llvm/llvm-project/blob/12a31658ea36cda74157c6b4e6b6c031e39a19c0/mlir/include/mlir/Analysis/DataFlowFramework.h#L320):

> /// Steps to re-run a data-flow analysis when IR changes:
/// 1. Erase all analysis states as they are no longer valid.
/// 2. Re-run the analysis using `initializeAndRun`.

Note: I didn't add a test because I don't have a deterministic
reproducer. As the crash is intermittent, the reproducer needs to be run
a large number of times and will fail once in 10-200 runs maybe.

Signed-off-by: Jorn Tuyls <jorn.tuyls@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0