fix: TimeoutTicker returns wrong value/timeout pair when timeouts are scheduled at ~approximately the same time (backport #3092) by mergify[bot] · Pull Request #3106 · cometbft/cometbft · GitHub

fix: TimeoutTicker returns wrong value/timeout pair when timeouts are scheduled at ~approximately the same time (backport #3092) #3106


Merged
sergio-mena merged 5 commits into v0.38.x from mergify/bp/v0.38.x/pr-3092
May 22, 2024

Conversation

mergify[bot]
Contributor
@mergify mergify bot commented May 22, 2024

#3091

The problem is an edge case: we should drain the timer channel, but we "let it slide" in certain race conditions when two timeouts are scheduled close to each other. This means we can have unsafe timeout behavior, as demonstrated in the GitHub issue, and this likely affects more spots in consensus.

Notice that aside from NewTimer and OnStop, all timer accesses are from the same thread. In NewTimer we can block until the timer is drained (this happens very quickly, subject to goroutine scheduling). In OnStop we don't need to guarantee draining before the method returns; we can just send something into the channel that will kill it.

In the main timer goroutine, we can safely maintain a "timerActive" variable and force a drain when it's active. This removes the edge case.

The test I created does fail on main.


PR checklist

  • Tests written/updated
  • Changelog entry added in .changelog (we use unclog to manage our changelog)
  • Updated relevant documentation (docs/ or spec/) and code comments
  • Title follows the Conventional Commits spec

This is an automatic backport of pull request #3092 done by [Mergify](https://mergify.com).

… scheduled at ~approximately the same time (#3092)


(cherry picked from commit 153281a)

# Conflicts:
#	.changelog/v0.38.3/bug-fixes/3092-consensus-timeout-ticker-data-race.md
#	consensus/ticker.go
#	consensus/ticker_test.go
@mergify mergify bot requested a review from a team as a code owner May 22, 2024 13:35
@mergify mergify bot added the conflicts label May 22, 2024
Contributor Author
mergify bot commented May 22, 2024

Cherry-pick of 153281a has failed:

On branch mergify/bp/v0.38.x/pr-3092
Your branch is up to date with 'origin/v0.38.x'.

You are currently cherry-picking commit 153281af6.
  (fix conflicts and run "git cherry-pick --continue")
  (use "git cherry-pick --skip" to skip this patch)
  (use "git cherry-pick --abort" to cancel the cherry-pick operation)

Unmerged paths:
  (use "git add <file>..." to mark resolution)
	added by them:   .changelog/v0.38.3/bug-fixes/3092-consensus-timeout-ticker-data-race.md
	both modified:   consensus/ticker.go
	added by them:   consensus/ticker_test.go

no changes added to commit (use "git add" and/or "git commit -a")

To fix up this pull request, you can check it out locally. See documentation: https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/reviewing-changes-in-pull-requests/checking-out-pull-requests-locally

@sergio-mena sergio-mena self-assigned this May 22, 2024
@sergio-mena sergio-mena added the bug Something isn't working label May 22, 2024
sergio-mena and others added 2 commits May 22, 2024 15:39
…outs are scheduled at ~approximately the same time (#3092)"

This reverts commit 2d074fc.
… scheduled at ~approximately the same time (#3092)

@sergio-mena sergio-mena merged commit 01ca424 into v0.38.x May 22, 2024
21 checks passed
@sergio-mena sergio-mena deleted the mergify/bp/v0.38.x/pr-3092 branch May 22, 2024 14:34
sergio-mena added a commit that referenced this pull request May 22, 2024
… scheduled at ~approximately the same time (backport #3092) (#3106)


---------

Co-authored-by: Dev Ojha <ValarDragon@users.noreply.github.com>
Co-authored-by: Sergio Mena <sergio@informal.systems>