blocksync: define the exact conditions for a node to attempt block syncing #3415

cason · 2024-07-03T19:42:04Z

Block sync is a protocol that allows a node joining an existing network to catch-up faster to the head (last block) of the blockchain. The same result, but at a substantially slower speed can be achieved by joining the consensus protocol and receiving from peers the blocks decided at each height of consensus, accompanied by a set of Precommit votes forming a commit for that block.

The conditions for a node to leave the Block Sync protocol and join the consensus protocol are encoded in the IsCaughtUp() method. There are somewhat complex but the rationale is to compare the local height, the latest block in the local copy of the blockchain, with the latest (highest) height reported by the node's peers as part of the Block Sync protocol. If the reported peers' heights is not more than 1 unit higher than the local height, the node is ready to join the consensus protocol.

There are some corner cases to consider, though. Essentially, the node should not block forever on Block Sync if it fails to retrieve block from its peers (after some reasonable amount of time). But the code at the moment requires the node to have peers that respond to the node's requests in a timely manner, which gives rise to other corner scenarios to be considered.

Of particular interest is the case in which the node running Block Sync is an active validator and it holds a substantial portion of the total validator voting power. In this scenario, most likely to be observed in testnets, the node might not need to receive any block from peers (i.e., to catch-up) because the network cannot commit blocks in the absence of the node, and its voting power as a validator. This particular scenario has been reported by users and is addressed by #3406.

The goal of this PR is to better formalize the conditions under which a node should remain attempting to retrieve blocks from its peers via Block Sync protocol, as we are likely to find out further corner cases not covered by the current implementation.

Notice that this issue is particularly relevant for networks from release v0.38.x, where block sync has became mandatory (see tendermint/tendermint#8433 for additional context). Up to release v0.37.x, node operators could just disable Block Sync in the configuration file if they realize that a node was getting stuck in this phase of the node's bootstrap/recovery procedure.

Definition of Done

Define, validate and formalize the conditions for a node to stay in the Block Sync protocol
- Stretch goal: write a specification and update/extend the documentation for this protocol
Adapt the implementation (internal/blocksync package) and the node setup (node package) accordingly
Produce test components (test units, e2e testbeds) to validate the behavior under the identified corner cases

The text was updated successfully, but these errors were encountered:

cason · 2024-07-03T19:50:05Z

The particular corner case addressed by #3406 is:

The node is an active validator
The network cannot progress in the absence of this node
So, no block could have been decided while this validator was offline
If this node starts from a clean state, or restart from a crash, it has no block at all to download from peers

The condition on the voting power of this validator was initially considered to be that it has 2/3+ of the total voting power, which means that it can run the network by itself, i.e., possible other validators are not needed to commit any block.

After some internal discussion, we realized that is enough a validator to have 1/3 (or more) of the total voting power to observe condition 2. The node may not be able to commit blocks by itself, but in its absence not block can be committed.

@cason

…king the chain (#3406) Partially addresses #3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- #### PR checklist - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Daniel <daniel.cason@informal.systems>

@cason

…king the chain (#3406) Partially addresses #3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- #### PR checklist - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Daniel <daniel.cason@informal.systems> (cherry picked from commit bd95579) # Conflicts: # internal/blocksync/reactor.go

@cason

…king the chain (#3406) Partially addresses #3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- #### PR checklist - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Daniel <daniel.cason@informal.systems> (cherry picked from commit bd95579) # Conflicts: # .changelog/v0.38.3/bug-fixes/3406-blocksync-dont-stall-if-blocking-chain.md # blocksync/reactor.go # blocksync/reactor_test.go # node/node.go

@cason

…king the chain (#3406) Partially addresses #3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- #### PR checklist - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Daniel <daniel.cason@informal.systems> (cherry picked from commit bd95579) # Conflicts: # blocksync/reactor_test.go # internal/blocksync/reactor.go # node/node.go # node/setup.go

@cason

…king the chain (#3406) Partially addresses #3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Daniel <daniel.cason@informal.systems>

@cason

…king the chain (backport #3406) (#3420) Partially addresses #3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- #### PR checklist - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #3406 done by [Mergify](https://mergify.com). --------- Co-authored-by: Sergio Mena <sergio@informal.systems> Co-authored-by: Daniel <daniel.cason@informal.systems>

@cason

…king the chain (#3406) Partially addresses #3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Daniel <daniel.cason@informal.systems>

@cason

…king the chain (#3406) Partially addresses #3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Daniel <daniel.cason@informal.systems>

@cason

…ing the chain (backport #3406) (#3421) Partially addresses #3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- #### PR checklist - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #3406 done by [Mergify](https://mergify.com). --------- Co-authored-by: Sergio Mena <sergio@informal.systems> Co-authored-by: Daniel <daniel.cason@informal.systems>

@cason

…ing the chain (backport #3406) (#3422) Partially addresses #3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- #### PR checklist - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #3406 done by [Mergify](https://mergify.com). --------- Co-authored-by: Sergio Mena <sergio@informal.systems> Co-authored-by: Daniel <daniel.cason@informal.systems>

Contributes to #3415 This is mainly refactoring to simplify `onlyValidatorIsUs` and `localNodeBlocksTheChain` (since the latter implies the former). It is a follow-up of #3406 (this is the part of #3406 that doesn't need to be backported) --- #### PR checklist - ~[ ] Tests written/updated~ - ~[ ] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog)~ - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ --------- Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>

@cason

…ing the chain (backport cometbft#3406) (cometbft#3421) Partially addresses cometbft#3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- #### PR checklist - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request cometbft#3406 done by [Mergify](https://mergify.com). --------- Co-authored-by: Sergio Mena <sergio@informal.systems> Co-authored-by: Daniel <daniel.cason@informal.systems>

cason added bug Something isn't working block-sync P:operator-experience Priority: Improve experience for operators labels Jul 3, 2024

cason added this to CometBFT Jul 3, 2024

github-project-automation bot moved this to Todo in CometBFT Jul 3, 2024

cason moved this from Todo to In Progress in CometBFT Jul 3, 2024

sergio-mena mentioned this issue Jul 4, 2024

fix(blocksync)!: don't block in blocksync if our voting power is blocking the chain #3406

Merged

3 tasks

mergify bot mentioned this issue Jul 4, 2024

fix(blocksync)!: don't block in blocksync if our voting power is blocking the chain (backport #3406) #3420

Merged

3 tasks

This was referenced Jul 4, 2024

fix(blocksync): don't block in blocksync if our voting power is blocking the chain (backport #3406) #3421

Merged

fix(blocksync): don't block in blocksync if our voting power is blocking the chain (backport #3406) #3422

Merged

sergio-mena mentioned this issue Aug 28, 2024

refactor(node): simplify 'node is blocking chain' logic #3885

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

blocksync: define the exact conditions for a node to attempt block syncing #3415

blocksync: define the exact conditions for a node to attempt block syncing #3415

Uh oh!

blocksync: define the exact conditions for a node to attempt block syncing #3415

blocksync: define the exact conditions for a node to attempt block syncing #3415

Comments

Definition of Done

Uh oh!