perf: Make mempool update async from block.Commit #3008
Conversation
state/execution.go (Outdated)

```go
	return res.RetainHeight, err
	// ...
	if err != nil {
		blockExec.logger.Error("client error during mempool.Update, flushing mempool", "err", err)
		blockExec.mempool.Flush()
	}
```
This is new, right? Why is it important to flush the mempool?
Before, we would halt; sorry, I forgot to flag that in a comment. I feel like we only have two options: dump the mempool, or replicate the existing functionality and make an error channel to halt consensus.
Any caller of function `Commit` in line 395 above will panic if `Commit` doesn't return a `nil` error. This is for safety and has been like that for years. I'd feel more comfortable if we didn't modify that behavior here, so if `err != nil` in line 445, I'd panic here with the error text of line 446.
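To make the suggestion concrete, here is a minimal, self-contained sketch of the panic-on-error behavior. The types and function names (`mempool`, `failingMempool`, `updateMempool`) are hypothetical stand-ins for illustration, not the real CometBFT ones:

```go
package main

import (
	"errors"
	"fmt"
)

// mempool is a hypothetical, simplified stand-in for the real
// CometBFT Mempool interface.
type mempool interface {
	Update(height int64) error
}

// failingMempool always fails its update, to exercise the error path.
type failingMempool struct{}

func (failingMempool) Update(int64) error {
	return errors.New("disk full")
}

// updateMempool mirrors the suggested behavior: treat a mempool.Update
// failure as unrecoverable and panic, matching how callers of Commit
// have historically panicked on a non-nil error.
func updateMempool(mp mempool, height int64) {
	if err := mp.Update(height); err != nil {
		panic(fmt.Sprintf("client error during mempool.Update: %v", err))
	}
}

// didPanicOnUpdateError reports whether updateMempool panicked.
func didPanicOnUpdateError() (panicked bool) {
	defer func() {
		if recover() != nil {
			panicked = true
		}
	}()
	updateMempool(failingMempool{}, 1)
	return
}

func main() {
	fmt.Println("panicked:", didPanicOnUpdateError())
}
```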
Sounds good!
What do you mean? In any case, it is probably
One problem with this approach is that it breaks the contract that we have regarding ABCI: https://docs.cometbft.com/v0.38/spec/abci/abci++_app_requirements#commit

By updating the mempool in parallel, this contract is clearly broken.
I would prefer to block this change while we address how we are handling the contract with the application as part of the ABCI Commit call.
state/execution.go (Outdated)

```go
	if updateData.err != nil {
		return
	}
	// ...
	err = blockExec.mempool.Update(
```
This part we can do in parallel.

> Once the new state is committed, the application can start performing the re-check of all transactions remaining in the mempool, using the updated state.
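A minimal sketch of this idea, using a hypothetical simplified `mempool` type (not the real CometBFT one): the mempool is locked before Commit, the update runs in a background goroutine, and the lock is released only once the update finishes, so a later CheckTx still serializes after the recheck:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// mempool is a hypothetical, simplified stand-in for the real type.
type mempool struct {
	mtx     sync.Mutex
	updated bool
}

// Update stands in for rechecking remaining txs against the new state.
func (mp *mempool) Update(height int64) error {
	time.Sleep(10 * time.Millisecond) // stand-in for recheck work
	mp.updated = true
	return nil
}

// commitWithAsyncUpdate mimics the shape of the change: the mempool is
// locked before the ABCI Commit call, and the update runs in a goroutine
// that releases the lock only when rechecking is done. The caller
// returns without waiting for the update.
func commitWithAsyncUpdate(mp *mempool, height int64) {
	mp.mtx.Lock() // CometBFT locks the mempool before invoking Commit
	// ... the ABCI Commit call would run here ...
	go func() {
		defer mp.mtx.Unlock()
		if err := mp.Update(height); err != nil {
			fmt.Println("mempool update error:", err)
		}
	}()
}

func main() {
	mp := &mempool{}
	commitWithAsyncUpdate(mp, 42)
	// A subsequent CheckTx would block on the lock until the async
	// update finishes; acquiring it here proves the update completed.
	mp.mtx.Lock()
	fmt.Println("updated:", mp.updated)
	mp.mtx.Unlock()
}
```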
Moreover, it is not clear to me why we need to call `Update` with a locked mempool...
As far as I understand, this is due to the current FIFO guarantee before you add a new tx. (You want to know you're done rechecking before you add more txs to the end.)
This really should be a lock owned within the mempool, though, rather than here.
After your fix of flushing the mempool before Commit, we are only breaking the contract in "When Commit returns, CometBFT unlocks the mempool." We do flush and take a lock during Commit; we just now hold the lock after Commit terminates. I'm not sure that application behavior would depend on this.
I'm still stuck on the create empty blocks tests. Going to try more tomorrow. But somehow playing around with the timeouts / expanding leniency isn't fixing this. (Testing via running) A priori, I'd just guess maybe we shaved 10ms or something in this configuration in edge cases. Going to try more with a debugger tomorrow.
@hvanz could you please help @ValarDragon here?
Looks good. I've also run the tests and they all pass.
Regarding the ABCI spec on Commit, I don't think this change will pose a problem. The important part of that paragraph is "Before invoking Commit, CometBFT locks the mempool and flushes the mempool connection." This is not broken.
But the line "When Commit returns, CometBFT unlocks the mempool." should be updated to reflect the new behaviour. I think it doesn't matter to the ABCI connections that the mempool is still locked after Commit returns.
I think the failing test cases are due to some race condition bug in the timeout ticker relating to ticks with negative durations. I added some logs, and in the passing cases I see logs like:
where the left number is ... In the failing test cases I see:
(So we're not getting the 200ms wait.) Trying to figure out what's the cause of this edge case. I think this has to be independent of my change, though.
Ok, just confirmed I get this same failure on main when I run the test in a loop, so it's not new to this PR (and is an edge case failure in the timeout ticker). Going to file this into an issue, and consider the test failure here in this PR (vs. not main) as a fluke. Will try to fix that separately in another PR. EDIT: Filed into issue / PR, #3091 / #3092. I think that should not block this PR, as this behavior exists in the same manner. Or perhaps we merge that in first, then rebase into this PR?
I favour that approach.
state/execution.go (Outdated)

```go
	state        State
	abciResponse *abci.FinalizeBlockResponse
}

// Commit locks the mempool, runs the ABCI Commit message, and updates the
```
Do we need to update this comment?
First step to fixing cometbft#2925

PR'ing this to see if we have any test failures. Note that this is safe in the happy path, as Reap and CheckTx both share this same lock. The functional behavior is that:

- Full nodes' and non-proposers' `timeout_prevote` beginning should not block on updating the mempool
- Block proposers get _very slight_ increased concurrency before reaping their next block. (Should be significantly fixed in subsequent PRs in #2925)
- Reap takes a lock on the mempool mutex, so there are no concurrency safety issues right now.
- Mempool errors will not halt consensus; instead they just log an error and call mempool flush. I actually think this may be better behavior? If we want to preserve the old behavior, we can thread a generic "consensus halt error" channel, perhaps?

I'm not sure how/where to best document this. Please also let me know if tests need creating. Seems like the create empty block tests sometimes hit failures; I'll investigate tomorrow.

Also, please feel free to take over this PR, just thought I'd make it to help us with performance improvements. Happy to get this into an experimental release to test on mainnets.

---

#### PR checklist

- [ ] Tests written/updated
- [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments
- [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---

Co-authored-by: Sergio Mena <sergio@informal.systems>
I just want to say that, if this works how I think it will, it actually looks like it may be a very significant piece of solving the P2P storms puzzle.
An update regarding this. We have finally decided to backport this to
So, benefits clearly outweigh the risks at this point.
This reverts commit 0246f4d.
perf: Make mempool update async from block.Commit (backport cometbft#3008)