-
Notifications
You must be signed in to change notification settings - Fork 638
blocksync: define the exact conditions for a node to attempt block syncing #3415
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
The particular corner case addressed by #3406 is:
The condition on the voting power of this validator was initially considered to be that it has After some internal discussion, we realized that is enough a validator to have |
…king the chain (#3406) Partially addresses #3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- #### PR checklist - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Daniel <daniel.cason@informal.systems>
…king the chain (#3406) Partially addresses #3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- #### PR checklist - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Daniel <daniel.cason@informal.systems> (cherry picked from commit bd95579) # Conflicts: # internal/blocksync/reactor.go
…king the chain (#3406) Partially addresses #3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- #### PR checklist - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Daniel <daniel.cason@informal.systems> (cherry picked from commit bd95579) # Conflicts: # .changelog/v0.38.3/bug-fixes/3406-blocksync-dont-stall-if-blocking-chain.md # blocksync/reactor.go # blocksync/reactor_test.go # node/node.go
…king the chain (#3406) Partially addresses #3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- #### PR checklist - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Daniel <daniel.cason@informal.systems> (cherry picked from commit bd95579) # Conflicts: # blocksync/reactor_test.go # internal/blocksync/reactor.go # node/node.go # node/setup.go
…king the chain (#3406) Partially addresses #3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Daniel <daniel.cason@informal.systems>
…king the chain (backport #3406) (#3420) Partially addresses #3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- #### PR checklist - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #3406 done by [Mergify](https://mergify.com). --------- Co-authored-by: Sergio Mena <sergio@informal.systems> Co-authored-by: Daniel <daniel.cason@informal.systems>
…king the chain (#3406) Partially addresses #3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Daniel <daniel.cason@informal.systems>
…king the chain (#3406) Partially addresses #3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec --------- Co-authored-by: Daniel <daniel.cason@informal.systems>
…ing the chain (backport #3406) (#3421) Partially addresses #3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- #### PR checklist - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #3406 done by [Mergify](https://mergify.com). --------- Co-authored-by: Sergio Mena <sergio@informal.systems> Co-authored-by: Daniel <daniel.cason@informal.systems>
…ing the chain (backport #3406) (#3422) Partially addresses #3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- #### PR checklist - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request #3406 done by [Mergify](https://mergify.com). --------- Co-authored-by: Sergio Mena <sergio@informal.systems> Co-authored-by: Daniel <daniel.cason@informal.systems>
Contributes to #3415 This is mainly refactoring to simplify `onlyValidatorIsUs` and `localNodeBlocksTheChain` (since the latter implies the former). It is a follow-up of #3406 (this is the part of #3406 that doesn't need to be backported) --- #### PR checklist - ~[ ] Tests written/updated~ - ~[ ] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog)~ - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ --------- Co-authored-by: mergify[bot] <37929162+mergify[bot]@users.noreply.github.com> Co-authored-by: Anton Kaliaev <anton.kalyaev@gmail.com>
…ing the chain (backport cometbft#3406) (cometbft#3421) Partially addresses cometbft#3415 The a node has no peers, blocksync gets stuck without switching to consesnus, because it needs info from other peers to have an idea of maximum height. However, there is an edge case (mainly when testing) where a validator might have >2/3 of the voting power and other validators are not started. In this case, we know we are blocking the chain, so we don't need to stay in blockchain if the only condition is that we don't have peers. Moreover, in order to block a chain, 1/3 of the voting power is enough, so the reasoning of this fix is the following: * _I am a node and I am starting... shall I run blocksync?_ * _Well, looks like I have 1/3 of the voting power (or more) at my current height... so there's no way the chain could advance in my absence... so I don't need to blocksync"_ Explanation of commits: * Commit 1: `e2e` testbed reproducing the issue * Commit 2: commit with a trivial change to trigger `e2e` tests. Check the error: ❌ next to the commit hash (3fb1057) * Commit 3: Tentative fix. Although there is a ❌ next to the commit hash (16a46ea), if you click on it, you'll see that `e2e` are passing now. * Commit 4: revert commit2 * Commit 5: Move the check for "local node is blocking the chain" outside the pool, as suggested by @cason * Commit 6: Fixed unit tests All further commits: addressing other comments and tidying up the code --- #### PR checklist - [x] Tests written/updated - [x] Changelog entry added in `.changelog` (we use [unclog](https://github.com/informalsystems/unclog) to manage our changelog) - ~[ ] Updated relevant documentation (`docs/` or `spec/`) and code comments~ - [x] Title follows the [Conventional Commits](https://www.conventionalcommits.org/en/v1.0.0/) spec <hr>This is an automatic backport of pull request cometbft#3406 done by [Mergify](https://mergify.com). --------- Co-authored-by: Sergio Mena <sergio@informal.systems> Co-authored-by: Daniel <daniel.cason@informal.systems>
Block sync is a protocol that allows a node joining an existing network to catch-up faster to the head (last block) of the blockchain. The same result, but at a substantially slower speed can be achieved by joining the consensus protocol and receiving from peers the blocks decided at each height of consensus, accompanied by a set of
Precommit
votes forming a commit for that block.The conditions for a node to leave the Block Sync protocol and join the consensus protocol are encoded in the
IsCaughtUp() method
. There are somewhat complex but the rationale is to compare the local height, the latest block in the local copy of the blockchain, with the latest (highest) height reported by the node's peers as part of the Block Sync protocol. If the reported peers' heights is not more than 1 unit higher than the local height, the node is ready to join the consensus protocol.There are some corner cases to consider, though. Essentially, the node should not block forever on Block Sync if it fails to retrieve block from its peers (after some reasonable amount of time). But the code at the moment requires the node to have peers that respond to the node's requests in a timely manner, which gives rise to other corner scenarios to be considered.
Of particular interest is the case in which the node running Block Sync is an active validator and it holds a substantial portion of the total validator voting power. In this scenario, most likely to be observed in testnets, the node might not need to receive any block from peers (i.e., to catch-up) because the network cannot commit blocks in the absence of the node, and its voting power as a validator. This particular scenario has been reported by users and is addressed by #3406.
The goal of this PR is to better formalize the conditions under which a node should remain attempting to retrieve blocks from its peers via Block Sync protocol, as we are likely to find out further corner cases not covered by the current implementation.
Notice that this issue is particularly relevant for networks from release v0.38.x, where block sync has became mandatory (see tendermint/tendermint#8433 for additional context). Up to release v0.37.x, node operators could just disable Block Sync in the configuration file if they realize that a node was getting stuck in this phase of the node's bootstrap/recovery procedure.
Definition of Done
internal/blocksync
package) and the node setup (node
package) accordinglyThe text was updated successfully, but these errors were encountered: