-
Notifications
You must be signed in to change notification settings - Fork 645
mempool: panic when app's CheckTx returns an error while rechecking #2225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I've opened a hotfix PR (#2228) to tackle an immediate panic issue. However, a deeper solution is warranted, particularly concerning how the mempool processes the errors introduced recently. There's currently ambiguity in determining when to use CheckTxResponse.Code=1 versus returning an error. The critical question is: what role does returning an error play? In any case, encountering an error during a transaction's execution should not halt the recheck cursor's progression. Therefore, this PR should be valid and it directly addresses our problem. |
Are we addressing this? I agree with the previous comment that there is no proper semantic for |
Sorry for chiming in so late here. I apologize.
Which alternative to follow depends on the type of client the app uses (in local client both are equivalent, but with socket/gRPC the app may prefer to have CometBFT panic first). We have added an extra task in #2317 to explain this in the ABCI spec. |
) Closes #2225 If the app returns an error on an ABCI call (in particular CheckTx), CometBFT should stop, because the error is unrecoverable. --- #### PR checklist - [X] Tests written/updated - [X] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
) Closes #2225 If the app returns an error on an ABCI call (in particular CheckTx), CometBFT should stop, because the error is unrecoverable. --- #### PR checklist - [X] Tests written/updated - [X] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec (cherry picked from commit 98983fc)
) Closes #2225 If the app returns an error on an ABCI call (in particular CheckTx), CometBFT should stop, because the error is unrecoverable. --- #### PR checklist - [X] Tests written/updated - [X] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec (cherry picked from commit 98983fc) # Conflicts: # mempool/clist_mempool.go # mempool/clist_mempool_test.go
…ckport #2894) (#2903) Closes #2225 If the app returns an error on an ABCI call (in particular CheckTx), CometBFT should stop, because the error is unrecoverable. --- #### PR checklist - [X] Tests written/updated - [X] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #2894 done by [Mergify](https://mergify.com). Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com>
…ckport #2894) (#2904) Closes #2225 If the app returns an error on an ABCI call (in particular CheckTx), CometBFT should stop, because the error is unrecoverable. --- #### PR checklist - [X] Tests written/updated - [X] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #2894 done by [Mergify](https://mergify.com). --------- Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com> Co-authored-by: hvanz <hernan.vanzetto@gmail.com>
…2268) Fixes ~~#2225 and~~ #1827 (#2225 is now fixed in a separate PR, #2894) The bug: during rechecking, when the `CheckTxAsync` request for the last transaction fails, then the `resCbRecheck` callback on the response is not called, and the recheck variables end up in a wrong state (`recheckCursor != nil`, meaning that recheck has not finished). This will cause a panic next time a new transaction arrives, and the `CheckTx` response finds that rechecking hasn't finished. This problem only happens when using the non-local ABCI client, where `CheckTx` responses may arrive late or never, so the response won't be processed by the callback. We have two options to fix this. 1. When we call `CheckTxAsync`, block waiting for a response. If the response never arrives, it will block `Update` forever. 2. After sending all recheck requests, we flush the app connection and set a timer to wait for late recheck responses. After the timer expires, we finalise rechecking properly. If a CheckTx response arrives late, we consider that it is safe to ignore it. This PR implements option 2, as we cannot allow the risk to block the node forever waiting for a response. With the proposed changes, now when we reach the end of the rechecking process, all requests and responses will be processed or discared, and `recheckCursor` will always be `nil`. This PR also: - refactors all recheck logic to put it into a separate `recheck` struct. The fix to the bug described above is the only change in the recheck logic. - adds 4 new tests. --- #### PR checklist - [x] Tests written/updated - [ ] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Andy Nogueira <me@andynogueira.dev> Co-authored-by: Daniel <daniel.cason@informal.systems>
…2268) Fixes ~~#2225 and~~ #1827 (#2225 is now fixed in a separate PR, #2894) The bug: during rechecking, when the `CheckTxAsync` request for the last transaction fails, then the `resCbRecheck` callback on the response is not called, and the recheck variables end up in a wrong state (`recheckCursor != nil`, meaning that recheck has not finished). This will cause a panic next time a new transaction arrives, and the `CheckTx` response finds that rechecking hasn't finished. This problem only happens when using the non-local ABCI client, where `CheckTx` responses may arrive late or never, so the response won't be processed by the callback. We have two options to fix this. 1. When we call `CheckTxAsync`, block waiting for a response. If the response never arrives, it will block `Update` forever. 2. After sending all recheck requests, we flush the app connection and set a timer to wait for late recheck responses. After the timer expires, we finalise rechecking properly. If a CheckTx response arrives late, we consider that it is safe to ignore it. This PR implements option 2, as we cannot allow the risk to block the node forever waiting for a response. With the proposed changes, now when we reach the end of the rechecking process, all requests and responses will be processed or discared, and `recheckCursor` will always be `nil`. This PR also: - refactors all recheck logic to put it into a separate `recheck` struct. The fix to the bug described above is the only change in the recheck logic. - adds 4 new tests. --- #### PR checklist - [x] Tests written/updated - [ ] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Andy Nogueira <me@andynogueira.dev> Co-authored-by: Daniel <daniel.cason@informal.systems> (cherry picked from commit f3775f4)
…2268) Fixes ~~#2225 and~~ #1827 (#2225 is now fixed in a separate PR, #2894) The bug: during rechecking, when the `CheckTxAsync` request for the last transaction fails, then the `resCbRecheck` callback on the response is not called, and the recheck variables end up in a wrong state (`recheckCursor != nil`, meaning that recheck has not 8000 finished). This will cause a panic next time a new transaction arrives, and the `CheckTx` response finds that rechecking hasn't finished. This problem only happens when using the non-local ABCI client, where `CheckTx` responses may arrive late or never, so the response won't be processed by the callback. We have two options to fix this. 1. When we call `CheckTxAsync`, block waiting for a response. If the response never arrives, it will block `Update` forever. 2. After sending all recheck requests, we flush the app connection and set a timer to wait for late recheck responses. After the timer expires, we finalise rechecking properly. If a CheckTx response arrives late, we consider that it is safe to ignore it. This PR implements option 2, as we cannot allow the risk to block the node forever waiting for a response. With the proposed changes, now when we reach the end of the rechecking process, all requests and responses will be processed or discared, and `recheckCursor` will always be `nil`. This PR also: - refactors all recheck logic to put it into a separate `recheck` struct. The fix to the bug described above is the only change in the recheck logic. - adds 4 new tests. --- #### PR checklist - [x] Tests written/updated - [ ] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Andy Nogueira <me@andynogueira.dev> Co-authored-by: Daniel <daniel.cason@informal.systems> (cherry picked from commit f3775f4) # Conflicts: # .changelog/v0.38.3/bug-fixes/1827-fix-recheck-async.md # config/toml.go # docs/references/config/config.toml.md # mempool/clist_mempool.go # mempool/clist_mempool_test.go
…ackport #2268) (#3019) Fixes ~~#2225 and~~ #1827 (#2225 is now fixed in a separate PR, #2894) The bug: during rechecking, when the `CheckTxAsync` request for the last transaction fails, then the `resCbRecheck` callback on the response is not called, and the recheck variables end up in a wrong state (`recheckCursor != nil`, meaning that recheck has not finished). This will cause a panic next time a new transaction arrives, and the `CheckTx` response finds that rechecking hasn't finished. This problem only happens when using the non-local ABCI client, where `CheckTx` responses may arrive late or never, so the response won't be processed by the callback. We have two options to fix this. 1. When we call `CheckTxAsync`, block waiting for a response. If the response never arrives, it will block `Update` forever. 2. After sending all recheck requests, we flush the app connection and set a timer to wait for late recheck responses. After the timer expires, we finalise rechecking properly. If a CheckTx response arrives late, we consider that it is safe to ignore it. This PR implements option 2, as we cannot allow the risk to block the node forever waiting for a response. With the proposed changes, now when we reach the end of the rechecking process, all requests and responses will be processed or discared, and `recheckCursor` will always be `nil`. This PR also: - refactors all recheck logic to put it into a separate `recheck` struct. The fix to the bug described above is the only change in the recheck logic. - adds 4 new tests. --- #### PR checklist - [x] Tests written/updated - [X] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [X] Updated relevant documentation (`docs/` or `spec/`) and code comments - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #2268 done by [Mergify](https://mergify.com). Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com>
…ackport #2268) (#3020) Fixes ~~#2225 and~~ #1827 (#2225 is now fixed in a separate PR, #2894) The bug: during rechecking, when the `CheckTxAsync` request for the last transaction fails, then the `resCbRecheck` callback on the response is not called, and the recheck variables end up in a wrong state (`recheckCursor != nil`, meaning that recheck has not finished). This will cause a panic next time a new transaction arrives, and the `CheckTx` response finds that rechecking hasn't finished. This problem only happens when using the non-local ABCI client, where `CheckTx` responses may arrive late or never, so the response won't be processed by the callback. We have two options to fix this. 1. When we call `CheckTxAsync`, block waiting for a response. If the response never arrives, it will block `Update` forever. 2. After sending all recheck requests, we flush the app connection and set a timer to wait for late recheck responses. After the timer expires, we finalise rechecking properly. If a CheckTx response arrives late, we consider that it is safe to ignore it. This PR implements option 2, as we cannot allow the risk to block the node forever waiting for a response. With the proposed changes, now when we reach the end of the rechecking process, all requests and responses will be processed or discared, and `recheckCursor` will always be `nil`. This PR also: - refactors all recheck logic to put it into a separate `recheck` struct. The fix to the bug described above is the only change in the recheck logic. - adds 4 new tests. --- #### PR checklist - [x] Tests written/updated - [X] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [X] Updated relevant documentation (`docs/` or `spec/`) and code comments - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #2268 done by [Mergify](https://mergify.com). --------- Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com> Co-authored-by: hvanz <hernan.vanzetto@gmail.com>
Thanks @p4u for reporting this bug!
Bug Report
Setup
CometBFT version: from v0.38 and up
What happened?
Comet panics with the message "recheck cursor is not nil before resCbFirstTime", which is found in this line:
cometbft/mempool/clist_mempool.go
Line 313 in 1f430f5
Briefly, when we are using a local ABCI client and the app's
CheckTx
method returns a non-nil error, the rechecking process is broken at this line, leavingrecheckCursor
andrecheckEnd
in an inconsistent state.These are the steps to reach the panic line:
recheckTxs
, which:recheckCursor
to the first element of the transaction listtxs
txs
, it sends a CheckTx request to the app. We are using a local client, so the call toproxyAppConn.CheckTxAsync
is synchronous, meaning that it will theCheckTx
method on the app and immediately call theglobalCb
callback to process the response.CheckTx
returns an error, the functionrecheckTxs
returns without finishing the rechecking process and leavingrecheckCursor
with a non-nil value.recheckCursor
is set to nil again only inresCbRecheck
, when it's equal torecheckEnd
.CheckTx
request of typeCHECK_TX_TYPE_CHECK
is sent to the app for a new tx; a response is handled byglobalCb
, which will panic becauserecheckCursor
is not nil.Notes
On previous versions of CometBFT,
CheckTxAsync
didn't return an error, so this problem didn't exist. And when this new behaviour was introduced, the documentation was not updated.The text was updated successfully, but these errors were encountered: