mempool: panic when app's CheckTx returns an error while rechecking #2225

hvanz · 2024-02-01T23:30:33Z

Thanks @p4u for reporting this bug!

Bug Report

Setup

CometBFT version: from v0.38 and up

What happened?

Comet panics with the message "recheck cursor is not nil before resCbFirstTime", which is found in this line:

cometbft/mempool/clist_mempool.go

Line 313 in 1f430f5

panic("recheck cursor is not nil before resCbFirstTime")

Briefly, when we are using a local ABCI client and the app's CheckTx method returns a non-nil error, the rechecking process is broken at this line, leaving recheckCursor and recheckEnd in an inconsistent state.

These are the steps to reach the panic line:

Rechecking is started by recheckTxs, which:
- sets recheckCursor to the first element of the transaction list txs
- for every element in txs, it sends a CheckTx request to the app. We are using a local client, so the call to proxyAppConn.CheckTxAsync is synchronous, meaning that it will the CheckTx method on the app and immediately call the globalCb callback to process the response.
- If the app's CheckTx returns an error, the function recheckTxs returns without finishing the rechecking process and leaving recheckCursor with a non-nil value. recheckCursor is set to nil again only in resCbRecheck, when it's equal to recheckEnd.
A CheckTx request of type CHECK_TX_TYPE_CHECK is sent to the app for a new tx; a response is handled by globalCb, which will panic because recheckCursor is not nil.

Notes

On previous versions of CometBFT, CheckTxAsync didn't return an error, so this problem didn't exist. And when this new behaviour was introduced, the documentation was not updated.

The text was updated successfully, but these errors were encountered:

p4u · 2024-02-02T10:16:53Z

I've opened a hotfix PR (#2228) to tackle an immediate panic issue.

However, a deeper solution is warranted, particularly concerning how the mempool processes the errors introduced recently. There's currently ambiguity in determining when to use CheckTxResponse.Code=1 versus returning an error. The critical question is: what role does returning an error play?

In any case, encountering an error during a transaction's execution should not halt the recheck cursor's progression. Therefore, this PR should be valid and it directly addresses our problem.

cason · 2024-04-17T06:45:00Z

when this new behaviour was introduced, the documentation was not updated.

Are we addressing this?

I agree with the previous comment that there is no proper semantic for CheckTx returning an error. Since this was introduced recently and we don't have a defined semantics, we should ignored returned errors...

sergio-mena · 2024-04-24T17:42:22Z

Sorry for chiming in so late here. I apologize.
The main problem here is that the ABCI spec is missing an explanation of what it means when the app returns an error in the ABCI method call itself. The semantics, as explained here, are the following. The app returns an error in the ABCI call when it want to panic because it has detected and unrecoverable error. If that is the case, the app has two choices:

the app panics itself
the app returns an error and waits for CometBFT to panic

Which alternative to follow depends on the type of client the app uses (in local client both are equivalent, but with socket/gRPC the app may prefer to have CometBFT panic first).

We have added an extra task in #2317 to explain this in the ABCI spec.

) Closes #2225 If the app returns an error on an ABCI call (in particular CheckTx), CometBFT should stop, because the error is unrecoverable. --- #### PR checklist - [X] Tests written/updated - [X] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

) Closes #2225 If the app returns an error on an ABCI call (in particular CheckTx), CometBFT should stop, because the error is unrecoverable. --- #### PR checklist - [X] Tests written/updated - [X] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec (cherry picked from commit 98983fc)

) Closes #2225 If the app returns an error on an ABCI call (in particular CheckTx), CometBFT should stop, because the error is unrecoverable. --- #### PR checklist - [X] Tests written/updated - [X] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec (cherry picked from commit 98983fc) # Conflicts: # mempool/clist_mempool.go # mempool/clist_mempool_test.go

…ckport #2894) (#2903) Closes #2225 If the app returns an error on an ABCI call (in particular CheckTx), CometBFT should stop, because the error is unrecoverable. --- #### PR checklist - [X] Tests written/updated - [X] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #2894 done by [Mergify](https://mergify.com). Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com>

…ckport #2894) (#2904) Closes #2225 If the app returns an error on an ABCI call (in particular CheckTx), CometBFT should stop, because the error is unrecoverable. --- #### PR checklist - [X] Tests written/updated - [X] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #2894 done by [Mergify](https://mergify.com). --------- Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com> Co-authored-by: hvanz <hernan.vanzetto@gmail.com>

…2268) Fixes ~~#2225 and~~ #1827 (#2225 is now fixed in a separate PR, #2894) The bug: during rechecking, when the `CheckTxAsync` request for the last transaction fails, then the `resCbRecheck` callback on the response is not called, and the recheck variables end up in a wrong state (`recheckCursor != nil`, meaning that recheck has not finished). This will cause a panic next time a new transaction arrives, and the `CheckTx` response finds that rechecking hasn't finished. This problem only happens when using the non-local ABCI client, where `CheckTx` responses may arrive late or never, so the response won't be processed by the callback. We have two options to fix this. 1. When we call `CheckTxAsync`, block waiting for a response. If the response never arrives, it will block `Update` forever. 2. After sending all recheck requests, we flush the app connection and set a timer to wait for late recheck responses. After the timer expires, we finalise rechecking properly. If a CheckTx response arrives late, we consider that it is safe to ignore it. This PR implements option 2, as we cannot allow the risk to block the node forever waiting for a response. With the proposed changes, now when we reach the end of the rechecking process, all requests and responses will be processed or discared, and `recheckCursor` will always be `nil`. This PR also: - refactors all recheck logic to put it into a separate `recheck` struct. The fix to the bug described above is the only change in the recheck logic. - adds 4 new tests. --- #### PR checklist - [x] Tests written/updated - [ ] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Andy Nogueira <me@andynogueira.dev> Co-authored-by: Daniel <daniel.cason@informal.systems>

…2268) Fixes ~~#2225 and~~ #1827 (#2225 is now fixed in a separate PR, #2894) The bug: during rechecking, when the `CheckTxAsync` request for the last transaction fails, then the `resCbRecheck` callback on the response is not called, and the recheck variables end up in a wrong state (`recheckCursor != nil`, meaning that recheck has not finished). This will cause a panic next time a new transaction arrives, and the `CheckTx` response finds that rechecking hasn't finished. This problem only happens when using the non-local ABCI client, where `CheckTx` responses may arrive late or never, so the response won't be processed by the callback. We have two options to fix this. 1. When we call `CheckTxAsync`, block waiting for a response. If the response never arrives, it will block `Update` forever. 2. After sending all recheck requests, we flush the app connection and set a timer to wait for late recheck responses. After the timer expires, we finalise rechecking properly. If a CheckTx response arrives late, we consider that it is safe to ignore it. This PR implements option 2, as we cannot allow the risk to block the node forever waiting for a response. With the proposed changes, now when we reach the end of the rechecking process, all requests and responses will be processed or discared, and `recheckCursor` will always be `nil`. This PR also: - refactors all recheck logic to put it into a separate `recheck` struct. The fix to the bug described above is the only change in the recheck logic. - adds 4 new tests. --- #### PR checklist - [x] Tests written/updated - [ ] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Andy Nogueira <me@andynogueira.dev> Co-authored-by: Daniel <daniel.cason@informal.systems> (cherry picked from commit f3775f4)

…2268) Fixes ~~#2225 and~~ #1827 (#2225 is now fixed in a separate PR, #2894) The bug: during rechecking, when the `CheckTxAsync` request for the last transaction fails, then the `resCbRecheck` callback on the response is not called, and the recheck variables end up in a wrong state (`recheckCursor != nil`, meaning that recheck has not 8000 finished). This will cause a panic next time a new transaction arrives, and the `CheckTx` response finds that rechecking hasn't finished. This problem only happens when using the non-local ABCI client, where `CheckTx` responses may arrive late or never, so the response won't be processed by the callback. We have two options to fix this. 1. When we call `CheckTxAsync`, block waiting for a response. If the response never arrives, it will block `Update` forever. 2. After sending all recheck requests, we flush the app connection and set a timer to wait for late recheck responses. After the timer expires, we finalise rechecking properly. If a CheckTx response arrives late, we consider that it is safe to ignore it. This PR implements option 2, as we cannot allow the risk to block the node forever waiting for a response. With the proposed changes, now when we reach the end of the rechecking process, all requests and responses will be processed or discared, and `recheckCursor` will always be `nil`. This PR also: - refactors all recheck logic to put it into a separate `recheck` struct. The fix to the bug described above is the only change in the recheck logic. - adds 4 new tests. --- #### PR checklist - [x] Tests written/updated - [ ] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Andy Nogueira <me@andynogueira.dev> Co-authored-by: Daniel <daniel.cason@informal.systems> (cherry picked from commit f3775f4) # Conflicts: # .changelog/v0.38.3/bug-fixes/1827-fix-recheck-async.md # config/toml.go # docs/references/config/config.toml.md # mempool/clist_mempool.go # mempool/clist_mempool_test.go

…ackport #2268) (#3019) Fixes ~~#2225 and~~ #1827 (#2225 is now fixed in a separate PR, #2894) The bug: during rechecking, when the `CheckTxAsync` request for the last transaction fails, then the `resCbRecheck` callback on the response is not called, and the recheck variables end up in a wrong state (`recheckCursor != nil`, meaning that recheck has not finished). This will cause a panic next time a new transaction arrives, and the `CheckTx` response finds that rechecking hasn't finished. This problem only happens when using the non-local ABCI client, where `CheckTx` responses may arrive late or never, so the response won't be processed by the callback. We have two options to fix this. 1. When we call `CheckTxAsync`, block waiting for a response. If the response never arrives, it will block `Update` forever. 2. After sending all recheck requests, we flush the app connection and set a timer to wait for late recheck responses. After the timer expires, we finalise rechecking properly. If a CheckTx response arrives late, we consider that it is safe to ignore it. This PR implements option 2, as we cannot allow the risk to block the node forever waiting for a response. With the proposed changes, now when we reach the end of the rechecking process, all requests and responses will be processed or discared, and `recheckCursor` will always be `nil`. This PR also: - refactors all recheck logic to put it into a separate `recheck` struct. The fix to the bug described above is the only change in the recheck logic. - adds 4 new tests. --- #### PR checklist - [x] Tests written/updated - [X] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [X] Updated relevant documentation (`docs/` or `spec/`) and code comments - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #2268 done by [Mergify](https://mergify.com). Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com>

…ackport #2268) (#3020) Fixes ~~#2225 and~~ #1827 (#2225 is now fixed in a separate PR, #2894) The bug: during rechecking, when the `CheckTxAsync` request for the last transaction fails, then the `resCbRecheck` callback on the response is not called, and the recheck variables end up in a wrong state (`recheckCursor != nil`, meaning that recheck has not finished). This will cause a panic next time a new transaction arrives, and the `CheckTx` response finds that rechecking hasn't finished. This problem only happens when using the non-local ABCI client, where `CheckTx` responses may arrive late or never, so the response won't be processed by the callback. We have two options to fix this. 1. When we call `CheckTxAsync`, block waiting for a response. If the response never arrives, it will block `Update` forever. 2. After sending all recheck requests, we flush the app connection and set a timer to wait for late recheck responses. After the timer expires, we finalise rechecking properly. If a CheckTx response arrives late, we consider that it is safe to ignore it. This PR implements option 2, as we cannot allow the risk to block the node forever waiting for a response. With the proposed changes, now when we reach the end of the rechecking process, all requests and responses will be processed or discared, and `recheckCursor` will always be `nil`. This PR also: - refactors all recheck logic to put it into a separate `recheck` struct. The fix to the bug described above is the only change in the recheck logic. - adds 4 new tests. --- #### PR checklist - [x] Tests written/updated - [X] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [X] Updated relevant documentation (`docs/` or `spec/`) and code comments - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #2268 done by [Mergify](https://mergify.com). --------- Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com> Co-authored-by: hvanz <hernan.vanzetto@gmail.com>

hvanz added bug Something isn't working mempool labels Feb 1, 2024

hvanz added this to CometBFT Feb 1, 2024

github-project-automation bot moved this to Todo in CometBFT Feb 1, 2024

p4u mentioned this issue Feb 2, 2024

fix(mempool): panic when app's CheckTx returns an error while rechecking #2228

Closed

4 tasks

This was referenced Feb 8, 2024

fix(mempool): Fix data race when rechecking with async ABCI client #2268

Merged

mempool: Recheck improvements and fixes #2317

Closed

adizere added this to the 2024-Q2 milestone Apr 3, 2024

hvanz mentioned this issue Apr 25, 2024

fix(mempool): panic when the app returns error on CheckTx request #2894

Merged

4 tasks

hvanz closed this as completed in #2894 Apr 26, 2024

github-project-automation bot moved this from Todo to Done in CometBFT Apr 26, 2024

This was referenced Apr 26, 2024

fix(mempool): panic when the app returns error on CheckTx request (backport #2894) #2903

Merged

fix(mempool): panic when the app returns error on CheckTx request (backport #2894) #2904

Merged

mergify bot mentioned this issue May 7, 2024

fix(mempool): Fix data race when rechecking with async ABCI client (backport #2268) #3019

Merged

4 tasks

mergify bot mentioned this issue May 7, 2024

fix(mempool): Fix data race when rechecking with async ABCI client (backport #2268) #3020

Merged

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

mempool: panic when app's CheckTx returns an error while rechecking #2225

mempool: panic when app's CheckTx returns an error while rechecking #2225

Uh oh!

Uh oh!

Uh oh!

mempool: panic when app's CheckTx returns an error while rechecking #2225

mempool: panic when app's CheckTx returns an error while rechecking #2225

Comments

Bug Report

Setup

What happened?

These are the steps to reach the panic line:

Notes

Uh oh!

Uh oh!

Uh oh!