8000 mempool: panic when app's CheckTx returns an error while rechecking · Issue #2225 · cometbft/cometbft · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

mempool: panic when app's CheckTx returns an error while rechecking #2225

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
Tracked by #2317
hvanz opened this issue Feb 1, 2024 · 3 comments · Fixed by #2894
Closed
Tracked by #2317

mempool: panic when app's CheckTx returns an error while rechecking #2225

hvanz opened this issue Feb 1, 2024 · 3 comments · Fixed by #2894
Labels
bug Something isn't working mempool
Milestone

Comments

@hvanz
Copy link
Member
hvanz commented Feb 1, 2024

Thanks @p4u for reporting this bug!

Bug Report

Setup

CometBFT version: from v0.38 and up

What happened?

Comet panics with the message "recheck cursor is not nil before resCbFirstTime", which is found in this line:

panic("recheck cursor is not nil before resCbFirstTime")

Briefly, when we are using a local ABCI client and the app's CheckTx method returns a non-nil error, the rechecking process is broken at this line, leaving recheckCursor and recheckEnd in an inconsistent state.

These are the steps to reach the panic line:

  • Rechecking is started by recheckTxs, which:
    • sets recheckCursor to the first element of the transaction list txs
    • for every element in txs, it sends a CheckTx request to the app. We are using a local client, so the call to proxyAppConn.CheckTxAsync is synchronous, meaning that it will the CheckTx method on the app and immediately call the globalCb callback to process the response.
    • If the app's CheckTx returns an error, the function recheckTxs returns without finishing the rechecking process and leaving recheckCursor with a non-nil value. recheckCursor is set to nil again only in resCbRecheck, when it's equal to recheckEnd.
  • A CheckTx request of type CHECK_TX_TYPE_CHECK is sent to the app for a new tx; a response is handled by globalCb, which will panic because recheckCursor is not nil.

Notes

On previous versions of CometBFT, CheckTxAsync didn't return an error, so this problem didn't exist. And when this new behaviour was introduced, the documentation was not updated.

@p4u
Copy link
p4u commented Feb 2, 2024

I've opened a hotfix PR (#2228) to tackle an immediate panic issue.

However, a deeper solution is warranted, particularly concerning how the mempool processes the errors introduced recently. There's currently ambiguity in determining when to use CheckTxResponse.Code=1 versus returning an error. The critical question is: what role does returning an error play?

In any case, encountering an error during a transaction's execution should not halt the recheck cursor's progression. Therefore, this PR should be valid and it directly addresses our problem.

@cason
Copy link
Contributor
cason commented Apr 17, 2024

when this new behaviour was introduced, the documentation was not updated.

Are we addressing this?

I agree with the previous comment that there is no proper semantic for CheckTx returning an error. Since this was introduced recently and we don't have a defined semantics, we should ignored returned errors...

@sergio-mena
Copy link
Contributor

Sorry for chiming in so late here. I apologize.
The main problem here is that the ABCI spec is missing an explanation of what it means when the app returns an error in the ABCI method call itself. The semantics, as explained here, are the following. The app returns an error in the ABCI call when it want to panic because it has detected and unrecoverable error. If that is the case, the app has two choices:

  • the app panics itself
  • the app returns an error and waits for CometBFT to panic

Which alternative to follow depends on the type of client the app uses (in local client both are equivalent, but with socket/gRPC the app may prefer to have CometBFT panic first).

We have added an extra task in #2317 to explain this in the ABCI spec.

github-merge-queue bot pushed a commit that referenced this issue Apr 26, 2024
)

Closes #2225

If the app returns an error on an ABCI call (in particular CheckTx),
CometBFT should stop, because the error is unrecoverable.

---

#### PR checklist

- [X] Tests written/updated
- [X] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [X] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
@github-project-automation github-project-automation bot moved this from Todo to Done in CometBFT Apr 26, 2024
mergify bot pushed a commit that referenced this issue Apr 26, 2024
)

Closes #2225

If the app returns an error on an ABCI call (in particular CheckTx),
CometBFT should stop, because the error is unrecoverable.

---

#### PR checklist

- [X] Tests written/updated
- [X] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [X] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

(cherry picked from commit 98983fc)
mergify bot pushed a commit that referenced this issue Apr 26, 2024
)

Closes #2225

If the app returns an error on an ABCI call (in particular CheckTx),
CometBFT should stop, because the error is unrecoverable.

---

#### PR checklist

- [X] Tests written/updated
- [X] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [X] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

(cherry picked from commit 98983fc)

# Conflicts:
#	mempool/clist_mempool.go
#	mempool/clist_mempool_test.go
hvanz added a commit that referenced this issue Apr 26, 2024
…ckport #2894) (#2903)

Closes #2225

If the app returns an error on an ABCI call (in particular CheckTx),
CometBFT should stop, because the error is unrecoverable.

---

#### PR checklist

- [X] Tests written/updated
- [X] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [X] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request #2894 done by
[Mergify](https://mergify.com).

Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com>
hvanz added a commit that referenced this issue Apr 26, 2024
…ckport #2894) (#2904)

Closes #2225

If the app returns an error on an ABCI call (in particular CheckTx),
CometBFT should stop, because the error is unrecoverable.

---

#### PR checklist

- [X] Tests written/updated
- [X] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [X] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request #2894 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com>
Co-authored-by: hvanz <hernan.vanzetto@gmail.com>
github-merge-queue bot pushed a commit that referenced this issue May 7, 2024
…2268)

Fixes ~~#2225 and~~ #1827 

(#2225 is now fixed in a separate PR, #2894)

The bug: during rechecking, when the `CheckTxAsync` request for the last
transaction fails, then the `resCbRecheck` callback on the response is
not called, and the recheck variables end up in a wrong state
(`recheckCursor != nil`, meaning that recheck has not finished). This
will cause a panic next time a new transaction arrives, and the
`CheckTx` response finds that rechecking hasn't finished.

This problem only happens when using the non-local ABCI client, where
`CheckTx` responses may arrive late or never, so the response won't be
processed by the callback. We have two options to fix this.
1. When we call `CheckTxAsync`, block waiting for a response. If the
response never arrives, it will block `Update` forever.
2. After sending all recheck requests, we flush the app connection and
set a timer to wait for late recheck responses. After the timer expires,
we finalise rechecking properly. If a CheckTx response arrives late, we
consider that it is safe to ignore it.

This PR implements option 2, as we cannot allow the risk to block the
node forever waiting for a response.

With the proposed changes, now when we reach the end of the rechecking
process, all requests and responses will be processed or discared, and
`recheckCursor` will always be `nil`.

This PR also:
- refactors all recheck logic to put it into a separate `recheck`
struct. The fix to the bug described above is the only change in the
recheck logic.
- adds 4 new tests.

---

#### PR checklist

- [x] Tests written/updated
- [ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---------

Co-authored-by: Andy Nogueira <me@andynogueira.dev>
Co-authored-by: Daniel <daniel.cason@informal.systems>
mergify bot pushed a commit that referenced this issue May 7, 2024
…2268)

Fixes ~~#2225 and~~ #1827

(#2225 is now fixed in a separate PR, #2894)

The bug: during rechecking, when the `CheckTxAsync` request for the last
transaction fails, then the `resCbRecheck` callback on the response is
not called, and the recheck variables end up in a wrong state
(`recheckCursor != nil`, meaning that recheck has not finished). This
will cause a panic next time a new transaction arrives, and the
`CheckTx` response finds that rechecking hasn't finished.

This problem only happens when using the non-local ABCI client, where
`CheckTx` responses may arrive late or never, so the response won't be
processed by the callback. We have two options to fix this.
1. When we call `CheckTxAsync`, block waiting for a response. If the
response never arrives, it will block `Update` forever.
2. After sending all recheck requests, we flush the app connection and
set a timer to wait for late recheck responses. After the timer expires,
we finalise rechecking properly. If a CheckTx response arrives late, we
consider that it is safe to ignore it.

This PR implements option 2, as we cannot allow the risk to block the
node forever waiting for a response.

With the proposed changes, now when we reach the end of the rechecking
process, all requests and responses will be processed or discared, and
`recheckCursor` will always be `nil`.

This PR also:
- refactors all recheck logic to put it into a separate `recheck`
struct. The fix to the bug described above is the only change in the
recheck logic.
- adds 4 new tests.

---

#### PR checklist

- [x] Tests written/updated
- [ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---------

Co-authored-by: Andy Nogueira <me@andynogueira.dev>
Co-authored-by: Daniel <daniel.cason@informal.systems>
(cherry picked from commit f3775f4)
mergify bot pushed a commit that referenced this issue May 7, 2024
…2268)

Fixes ~~#2225 and~~ #1827

(#2225 is now fixed in a separate PR, #2894)

The bug: during rechecking, when the `CheckTxAsync` request for the last
transaction fails, then the `resCbRecheck` callback on the response is
not called, and the recheck variables end up in a wrong state
(`recheckCursor != nil`, meaning that recheck has not
8000
 finished). This
will cause a panic next time a new transaction arrives, and the
`CheckTx` response finds that rechecking hasn't finished.

This problem only happens when using the non-local ABCI client, where
`CheckTx` responses may arrive late or never, so the response won't be
processed by the callback. We have two options to fix this.
1. When we call `CheckTxAsync`, block waiting for a response. If the
response never arrives, it will block `Update` forever.
2. After sending all recheck requests, we flush the app connection and
set a timer to wait for late recheck responses. After the timer expires,
we finalise rechecking properly. If a CheckTx response arrives late, we
consider that it is safe to ignore it.

This PR implements option 2, as we cannot allow the risk to block the
node forever waiting for a response.

With the proposed changes, now when we reach the end of the rechecking
process, all requests and responses will be processed or discared, and
`recheckCursor` will always be `nil`.

This PR also:
- refactors all recheck logic to put it into a separate `recheck`
struct. The fix to the bug described above is the only change in the
recheck logic.
- adds 4 new tests.

---

#### PR checklist

- [x] Tests written/updated
- [ ] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [ ] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec

---------

Co-authored-by: Andy Nogueira <me@andynogueira.dev>
Co-authored-by: Daniel <daniel.cason@informal.systems>
(cherry picked from commit f3775f4)

# Conflicts:
#	.changelog/v0.38.3/bug-fixes/1827-fix-recheck-async.md
#	config/toml.go
#	docs/references/config/config.toml.md
#	mempool/clist_mempool.go
#	mempool/clist_mempool_test.go
hvanz added a commit that referenced this issue May 7, 2024
…ackport #2268) (#3019)

Fixes ~~#2225 and~~ #1827 

(#2225 is now fixed in a separate PR, #2894)

The bug: during rechecking, when the `CheckTxAsync` request for the last
transaction fails, then the `resCbRecheck` callback on the response is
not called, and the recheck variables end up in a wrong state
(`recheckCursor != nil`, meaning that recheck has not finished). This
will cause a panic next time a new transaction arrives, and the
`CheckTx` response finds that rechecking hasn't finished.

This problem only happens when using the non-local ABCI client, where
`CheckTx` responses may arrive late or never, so the response won't be
processed by the callback. We have two options to fix this.
1. When we call `CheckTxAsync`, block waiting for a response. If the
response never arrives, it will block `Update` forever.
2. After sending all recheck requests, we flush the app connection and
set a timer to wait for late recheck responses. After the timer expires,
we finalise rechecking properly. If a CheckTx response arrives late, we
consider that it is safe to ignore it.

This PR implements option 2, as we cannot allow the risk to block the
node forever waiting for a response.

With the proposed changes, now when we reach the end of the rechecking
process, all requests and responses will be processed or discared, and
`recheckCursor` will always be `nil`.

This PR also:
- refactors all recheck logic to put it into a separate `recheck`
struct. The fix to the bug described above is the only change in the
recheck logic.
- adds 4 new tests.

---

#### PR checklist

- [x] Tests written/updated
- [X] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [X] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request #2268 done by
[Mergify](https://mergify.com).

Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com>
hvanz added a commit that referenced this issue May 7, 2024
…ackport #2268) (#3020)

Fixes ~~#2225 and~~ #1827 

(#2225 is now fixed in a separate PR, #2894)

The bug: during rechecking, when the `CheckTxAsync` request for the last
transaction fails, then the `resCbRecheck` callback on the response is
not called, and the recheck variables end up in a wrong state
(`recheckCursor != nil`, meaning that recheck has not finished). This
will cause a panic next time a new transaction arrives, and the
`CheckTx` response finds that rechecking hasn't finished.

This problem only happens when using the non-local ABCI client, where
`CheckTx` responses may arrive late or never, so the response won't be
processed by the callback. We have two options to fix this.
1. When we call `CheckTxAsync`, block waiting for a response. If the
response never arrives, it will block `Update` forever.
2. After sending all recheck requests, we flush the app connection and
set a timer to wait for late recheck responses. After the timer expires,
we finalise rechecking properly. If a CheckTx response arrives late, we
consider that it is safe to ignore it.

This PR implements option 2, as we cannot allow the risk to block the
node forever waiting for a response.

With the proposed changes, now when we reach the end of the rechecking
process, all requests and responses will be processed or discared, and
`recheckCursor` will always be `nil`.

This PR also:
- refactors all recheck logic to put it into a separate `recheck`
struct. The fix to the bug described above is the only change in the
recheck logic.
- adds 4 new tests.

---

#### PR checklist

- [x] Tests written/updated
- [X] Changelog entry added in `.changelog` (we use
[unclog](https://github.com/informalsystems/unclog) to manage our
changelog)
- [X] Updated relevant documentation (`docs/` or `spec/`) and code
comments
- [x] Title follows the [Conventional
Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec
<hr>This is an automatic backport of pull request #2268 done by
[Mergify](https://mergify.com).

---------

Co-authored-by: Hernán Vanzetto <15466498+hvanz@users.noreply.github.com>
Co-authored-by: hvanz <hernan.vanzetto@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working mempool
Projects
No open projects
Status: Done
Development

Successfully merging a pull request may close this issue.

5 participants
0