fix(mempool): panic when the app returns error on CheckTx request #2894

hvanz · 2024-04-25T08:37:14Z

If the app returns an error on an ABCI call (in particular CheckTx), CometBFT should stop, because the error is unrecoverable.

PR checklist

Tests written/updated
Changelog entry added in .changelog (we use unclog to manage our changelog)
Updated relevant documentation (docs/ or spec/) and code comments
Title follows the Conventional Commits spec

melekes

👍

) Closes #2225 If the app returns an error on an ABCI call (in particular CheckTx), CometBFT should stop, because the error is unrecoverable. --- #### PR checklist - [X] Tests written/updated - [X] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec (cherry picked from commit 98983fc)

) Closes #2225 If the app returns an error on an ABCI call (in particular CheckTx), CometBFT should stop, because the error is unrecoverable. --- #### PR checklist - [X] Tests written/updated - [X] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec (cherry picked from commit 98983fc) # Conflicts: # mempool/clist_mempool.go # mempool/clist_mempool_test.go

…ckport #2894) (#2903) Closes #2225 If the app returns an error on an ABCI call (in particular CheckTx), CometBFT should stop, because the error is unrecoverable. --- #### PR checklist - [X] Tests written/updated - [X] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #2894 done by [Mergify](https://mergify.com). Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com>

…ckport #2894) (#2904) Closes #2225 If the app returns an error on an ABCI call (in particular CheckTx), CometBFT should stop, because the error is unrecoverable. --- #### PR checklist - [X] Tests written/updated - [X] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [X] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #2894 done by [Mergify](https://mergify.com). --------- Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com> Co-authored-by: hvanz <hernan.vanzetto@gmail.com>

…2268) Fixes ~~#2225 and~~ #1827 (#2225 is now fixed in a separate PR, #2894) The bug: during rechecking, when the `CheckTxAsync` request for the last transaction fails, then the `resCbRecheck` callback on the response is not called, and the recheck variables end up in a wrong state (`recheckCursor != nil`, meaning that recheck has not finished). This will cause a panic next time a new transaction arrives, and the `CheckTx` response finds that rechecking hasn't finished. This problem only happens when using the non-local ABCI client, where `CheckTx` responses may arrive late or never, so the response won't be processed by the callback. We have two options to fix this. 1. When we call `CheckTxAsync`, block waiting for a response. If the response never arrives, it will block `Update` forever. 2. After sending all recheck requests, we flush the app connection and set a timer to wait for late recheck responses. After the timer expires, we finalise rechecking properly. If a CheckTx response arrives late, we consider that it is safe to ignore it. This PR implements option 2, as we cannot allow the risk to block the node forever waiting for a response. With the proposed changes, now when we reach the end of the rechecking process, all requests and responses will be processed or discared, and `recheckCursor` will always be `nil`. This PR also: - refactors all recheck logic to put it into a separate `recheck` struct. The fix to the bug described above is the only change in the recheck logic. - adds 4 new tests. --- #### PR checklist - [x] Tests written/updated - [ ] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Andy Nogueira <me@andynogueira.dev> Co-authored-by: Daniel <daniel.cason@informal.systems>

…2268) Fixes ~~#2225 and~~ #1827 (#2225 is now fixed in a separate PR, #2894) The bug: during rechecking, when the `CheckTxAsync` request for the last transaction fails, then the `resCbRecheck` callback on the response is not called, and the recheck variables end up in a wrong state (`recheckCursor != nil`, meaning that recheck has not finished). This will cause a panic next time a new transaction arrives, and the `CheckTx` response finds that rechecking hasn't finished. This problem only happens when using the non-local ABCI client, where `CheckTx` responses may arrive late or never, so the response won't be processed by the callback. We have two options to fix this. 1. When we call `CheckTxAsync`, block waiting for a response. If the response never arrives, it will block `Update` forever. 2. After sending all recheck requests, we flush the app connection and set a timer to wait for late recheck responses. After the timer expires, we finalise rechecking properly. If a CheckTx response arrives late, we consider that it is safe to ignore it. This PR implements option 2, as we cannot allow the risk to block the node forever waiting for a response. With the proposed changes, now when we reach the end of the rechecking process, all requests and responses will be processed or discared, and `recheckCursor` will always be `nil`. This PR also: - refactors all recheck logic to put it into a separate `recheck` struct. The fix to the bug described above is the only change in the recheck logic. - adds 4 new tests. --- #### PR checklist - [x] Tests written/updated - [ ] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Andy Nogueira <me@andynogueira.dev> Co-authored-by: Daniel <daniel.cason@informal.systems> (cherry picked from commit f3775f4)

…2268) Fixes ~~#2225 and~~ #1827 (#2225 is now fixed in a separate PR, #2894) The bug: during rechecking, when the `CheckTxAsync` request for the last transaction fails, then the `resCbRecheck` callback on the response is not called, and the recheck variables end up in a wrong state (`recheckCursor != nil`, meaning that recheck has not finished). This will cause a panic next time a new transaction arrives, and the `CheckTx` response finds that rechecking hasn't finished. This problem only happens when using the non-local ABCI client, where `CheckTx` responses may arrive late or never, so the response won't be processed by the callback. We have two options to fix this. 1. When we call `CheckTxAsync`, block waiting for a response. If the response never arrives, it will block `Update` forever. 2. After sending all recheck requests, we flush the app connection and set a timer to wait for late recheck responses. After the timer expires, we finalise rechecking properly. If a CheckTx response arrives late, we consider that it is safe to ignore it. This PR implements option 2, as we cannot allow the risk to block the node forever waiting for a response. With the proposed changes, now when we reach the end of the rechecking process, all requests and responses will be processed or discared, and `recheckCursor` will always be `nil`. This PR also: - refactors all recheck logic to put it into a separate `recheck` struct. The fix to the bug described above is the only change in the recheck logic. - adds 4 new tests. --- #### PR checklist - [x] Tests written/updated - [ ] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [ ] Updated relevant documentation (`docs/` or `spec/`) and code comments - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Andy Nogueira <me@andynogueira.dev> Co-authored-by: Daniel <daniel.cason@informal.systems> (cherry picked from commit f3775f4) # Conflicts: # .changelog/v0.38.3/bug-fixes/1827-fix-recheck-async.md # config/toml.go # docs/references/config/config.toml.md # mempool/clist_mempool.go # mempool/clist_mempool_test.go

…ackport #2268) (#3019) Fixes ~~#2225 and~~ #1827 (#2225 is now fixed in a separate PR, #2894) The bug: during rechecking, when the `CheckTxAsync` request for the last transaction fails, then the `resCbRecheck` callback on the response is not called, and the recheck variables end up in a wrong state (`recheckCursor != nil`, meaning that recheck has not finished). This will cause a panic next time a new transaction arrives, and the `CheckTx` response finds that rechecking hasn't finished. This problem only happens when using the non-local ABCI client, where `CheckTx` responses may arrive late or never, so the response won't be processed by the callback. We have two options to fix this. 1. When we call `CheckTxAsync`, block waiting for a response. If the response never arrives, it will block `Update` forever. 2. After sending all recheck requests, we flush the app connection and set a timer to wait for late recheck responses. After the timer expires, we finalise rechecking properly. If a CheckTx response arrives late, we consider that it is safe to ignore it. This PR implements option 2, as we cannot allow the risk to block the node forever waiting for a response. With the proposed changes, now when we reach the end of the rechecking process, all requests and responses will be processed or discared, and `recheckCursor` will always be `nil`. This PR also: - refactors all recheck logic to put it into a separate `recheck` struct. The fix to the bug described above is the only change in the recheck logic. - adds 4 new tests. --- #### PR checklist - [x] Tests written/updated - [X] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [X] Updated relevant documentation (`docs/` or `spec/`) and code comments - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #2268 done by [Mergify](https://mergify.com). Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com>

…ackport #2268) (#3020) Fixes ~~#2225 and~~ #1827 (#2225 is now fixed in a separate PR, #2894) The bug: during rechecking, when the `CheckTxAsync` request for the last transaction fails, then the `resCbRecheck` callback on the response is not called, and the recheck variables end up in a wrong state (`recheckCursor != nil`, meaning that recheck has not finished). This will cause a panic next time a new transaction arrives, and the `CheckTx` response finds that rechecking hasn't finished. This problem only happens when using the non-local ABCI client, where `CheckTx` responses may arrive late or never, so the response won't be processed by the callback. We have two options to fix this. 1. When we call `CheckTxAsync`, block waiting for a response. If the response never arrives, it will block `Update` forever. 2. After sending all recheck requests, we flush the app connection and set a timer to wait for late recheck responses. After the timer expires, we finalise rechecking properly. If a CheckTx response arrives late, we consider that it is safe to ignore it. This PR implements option 2, as we cannot allow the risk to block the node forever waiting for a response. With the proposed changes, now when we reach the end of the rechecking process, all requests and responses will be processed or discared, and `recheckCursor` will always be `nil`. This PR also: - refactors all recheck logic to put it into a separate `recheck` struct. The fix to the bug described above is the only change in the recheck logic. - adds 4 new tests. --- #### PR checklist - [x] Tests written/updated - [X] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - [X] Updated relevant documentation (`docs/` or `spec/`) and code comments - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #2268 done by [Mergify](https://mergify.com). --------- Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com> Co-authored-by: hvanz <hernan.vanzetto@gmail.com>

hvanz added 3 commits April 25, 2024 10:34

panic when app return error

5209097

add two tests

2ffd214

add changelog

ec701d3

hvanz added bug Something isn't working mempool backport-to-v0.38.x Tell Mergify to backport the PR to v0.38.x backport-to-v1.x Tell Mergify to backport the PR to v1.x labels Apr 25, 2024

hvanz self-assigned this Apr 25, 2024

hvanz requested review from a team as code owners April 25, 2024 08:37

no need to mock Flush

88f9d5c

melekes approved these changes Apr 25, 2024

View reviewed changes

hvanz mentioned this pull request Apr 25, 2024

fix(mempool): Fix data race when rechecking with async ABCI client #2268

Merged

4 tasks

hvanz added this pull request to the merge queue Apr 26, 2024

Merged via the queue into main with commit 98983fc Apr 26, 2024
38 checks passed

hvanz deleted the hvanz/mempool-recheck-panic-error-2225 branch April 26, 2024 08:52

This was referenced Apr 26, 2024

fix(mempool): panic when the app returns error on CheckTx request (backport #2894) #2903

Merged

fix(mempool): panic when the app returns error on CheckTx request (backport #2894) #2904

Merged

mergify bot mentioned this pull request May 7, 2024

fix(mempool): Fix data race when rechecking with async ABCI client (backport #2268) #3019

Merged

4 tasks

mergify bot mentioned this pull request May 7, 2024

fix(mempool): Fix data race when rechecking with async ABCI client (backport #2268) #3020

Merged

4 tasks

faddat mentioned this pull request Oct 15, 2024

chore: use latest cometbft-db in v0.38.x #4296

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

fix(mempool): panic when the app returns error on CheckTx request #2894

fix(mempool): panic when the app returns error on CheckTx request #2894

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

fix(mempool): panic when the app returns error on CheckTx request #2894

fix(mempool): panic when the app returns error on CheckTx request #2894

Uh oh!

Conversation

Uh oh!

PR checklist

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!