Failsafes to prevent a consensus round from taking too long #5277

ximinez · 2025-02-05T04:14:24Z

High Level Overview of Change

This PR, if merged, will introduce two failsafes into the consensus logic to prevent a consensus round from remaining in the establish phase indefinitely.

Detects if the consensus process is "stalled". If it is, then we can declare a consensus and end successfully even if we do not have 80% agreement on our proposal. Details below in "Before / After".
If we have been in the establish phase for more than 10x the previous consensus establish phase's time, then consensus is considered "expired", and we will leave the round, which sends a partial validation (indicating that the node is moving on without validating). There are restrictions intended to avoid prematurely exiting, or having an extended exit in extreme situations. Details below in "Before / After".
1. When enough nodes leave the round, any remaining nodes will see they've fallen behind, and move on, too, generally before hitting the timeout. Any validations or partial validations sent during this time will help the consensus process bring the nodes back together.

Context of Change

At about 9:54pm UTC on 2/4/2025, the network successfully validated ledger 93927173, and started the consensus round for 93927174. That round did not end for over an hour.

The current evidence indicates that two things happened.

Some disputed transactions had just enough "yes" votes that validators voting "yes" saw the approval as just over 95%, while those voting "no" saw the approval as just under 95%. Thus, every node thought that it was doing the right thing, and no nodes changed their vote. While this is annoying, normally consensus will move on because at least 80% of the UNL validators will be in agreement over which transaction set to use, and so consensus moves on with that set. However,
The disputed transactions with the close approval rates were distributed such that there were several clumps of validators voting yes for different transactions than other clumps of validators. This led to a situation where no transaction set had 80% approval.

This led to a livelock situation where every node was waiting for some other node to make a change, while none of the nodes were willing to change.

This decision algorithm has been in place for at least 8 years, and possibly since the first release of rippled. The odds of it happening were thought to be 0, but it turns out they're just very, very small.

Type of Change

Bug fix (non-breaking change which fixes an issue)

This change is fully backward and forward compatible, and does not require an amendment.

Before / After

This is an outline of the changes in this PR.

In NetworkOPsImp::processTrustedProposal, if the proposal is from us, don't process it. This should be impossible, but the check will help confirm whether it ever happens.
Detects if the consensus process is "stalled". If it is, then we can declare a consensus and end successfully even if we do not have 80% agreement on our proposal. (checkConsensusReached's other restrictions, such as minimum proposers, still apply)
1. "Stalled" is distinct from "stuck" used as a consensus percentage state. Naming things is hard.
2. "Stalled" is defined as:
  1. We have a close time consensus
  2. Each disputed transaction is individually stalled:
    1. It has been in the final "stuck" 95% requirement for at least 2 (avMIN_ROUNDS) "inner rounds" of phaseEstablish,
    2. and either all of the other trusted proposers or this validator, if proposing, have had the same vote(s) for at least 4 (avSTALLED_ROUNDS) "inner rounds",
    3. and at least 80% of the validators (including this one, if appropriate) agree about the vote (whether yes or no).
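The "stalled" definition above can be sketched as a pair of predicates. This is only one reading of the description, not the code in this PR: `DisputeSnapshot`, `disputeStalled`, `consensusStalled`, and `agreementPct` are hypothetical names, the vote counts are assumed to already include this validator's vote when it is proposing, and treating a round with no disputes as "not stalled" is an assumption of the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical snapshot of one disputed transaction as seen by this node.
struct DisputeSnapshot
{
    bool inFinalStuckState;            // in the final "stuck" (95%) state
    std::size_t roundsInState;         // inner rounds spent in that state
    std::size_t peersUnchangedRounds;  // inner rounds with no peer vote change
    std::size_t ourUnchangedRounds;    // inner rounds with our vote unchanged
    std::size_t yays;                  // votes to include the transaction
    std::size_t nays;                  // votes to exclude it
};

constexpr std::size_t avMIN_ROUNDS = 2;
constexpr std::size_t avSTALLED_ROUNDS = 4;
constexpr std::size_t agreementPct = 80;  // hypothetical name for the 80%

// True if this single dispute meets all three per-transaction conditions.
bool
disputeStalled(DisputeSnapshot const& d, bool proposing)
{
    // 1. In the final "stuck" state for at least avMIN_ROUNDS.
    if (!d.inFinalStuckState || d.roundsInState < avMIN_ROUNDS)
        return false;

    // 2. Either the peers, or we (if proposing), have not changed votes
    //    for at least avSTALLED_ROUNDS.
    bool const peersSettled = d.peersUnchangedRounds >= avSTALLED_ROUNDS;
    bool const weSettled =
        proposing && d.ourUnchangedRounds >= avSTALLED_ROUNDS;
    if (!peersSettled && !weSettled)
        return false;

    // 3. At least 80% agree on the vote, whether yes or no.
    std::size_t const total = d.yays + d.nays;
    if (total == 0)
        return false;
    return d.yays * 100 >= agreementPct * total ||
        d.nays * 100 >= agreementPct * total;
}

// True if close time consensus exists and every dispute is stalled.
bool
consensusStalled(
    bool haveCloseTimeConsensus,
    std::vector<DisputeSnapshot> const& disputes,
    bool proposing)
{
    // Assumption of this sketch: no disputes means nothing to stall on.
    if (!haveCloseTimeConsensus || disputes.empty())
        return false;
    for (auto const& d : disputes)
        if (!disputeStalled(d, proposing))
            return false;
    return true;
}
```

For example, a dispute with 9 yes and 1 no vote that has sat in the final state for 2 inner rounds with no peer changes for 4 inner rounds counts as stalled, while a 6 to 4 split never does.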
If we have been in the establish phase for more than 10x the previous consensus round time, then consensus is considered "expired", and we will leave the round, which sends a partial validation. There are two restrictions intended to avoid prematurely exiting, or having an extended exit in extreme situations.
1. The 10x time is clamped to be within a range of 15s (ledgerMAX_CONSENSUS) to 120s (ledgerABANDON_CONSENSUS).
2. If consensus has not had an opportunity to walk through all percentage avalanche states (defined as not going through 8 "inner rounds" of phaseEstablish), then ConsensusState::Expired is treated as ConsensusState::No.
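The two restrictions on the "expired" check can be illustrated with a small sketch. It assumes durations are tracked in milliseconds and uses hypothetical names (`expiryTimeout`, `checkExpired`, `SketchState`, `minInnerRounds`); only the 10x factor, the 15s and 120s bounds, and the 8 inner rounds come from the description above.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>

using namespace std::chrono_literals;

constexpr std::chrono::milliseconds ledgerMAX_CONSENSUS = 15s;
constexpr std::chrono::milliseconds ledgerABANDON_CONSENSUS = 120s;
constexpr std::size_t minInnerRounds = 8;  // hypothetical name

// Simplified stand-in for the relevant ConsensusState values.
enum class SketchState { No, Expired };

// 10x the previous round time, clamped to [15s, 120s].
std::chrono::milliseconds
expiryTimeout(std::chrono::milliseconds previousRoundTime)
{
    assert(previousRoundTime >= 0ms);
    return std::clamp(
        previousRoundTime * 10, ledgerMAX_CONSENSUS, ledgerABANDON_CONSENSUS);
}

// Expired only if the timeout has been exceeded AND the avalanche states
// have had a chance to play out; otherwise the result is treated as No.
SketchState
checkExpired(
    std::chrono::milliseconds timeInEstablish,
    std::chrono::milliseconds previousRoundTime,
    std::size_t innerRounds)
{
    if (timeInEstablish <= expiryTimeout(previousRoundTime))
        return SketchState::No;
    if (innerRounds < minInnerRounds)
        return SketchState::No;
    return SketchState::Expired;
}
```

With a 3s previous round the timeout is 30s; with a 1s previous round the lower clamp raises it to 15s, and with a 60s previous round the upper clamp caps it at 120s.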
The close time avalanching defined in ConsensusParms.h has been rewritten as more of an explicit state machine. It's basically a map of states to their time cutoff, percentage cutoff, and next state, and a getNeededWeight function that will evaluate whether to move to the next state.
1. This function is used for both disputed transactions and close time consensus.
2. In addition to the "previous round time percentage" limits, disputed transactions will also be required to spend at least 2 (avMIN_ROUNDS) "inner rounds" in each state. Close time consensus does not have this restriction (but it could).
3. This map is more easily modifiable than the previous individual variables, so if we decide to change parameters, it only requires changing the map.
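A rough sketch of such a state machine follows. The struct layout, the function signature, and every cutoff number except the final 95% requirement are assumptions made for illustration; only the name getNeededWeight and the idea of a map from state to time cutoff, percentage cutoff, and next state come from the description above.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <optional>
#include <utility>

enum class AvalancheState { init, mid, late, stuck };

struct AvalancheCutoff
{
    int consensusTime;         // % of previous round time when state begins
    std::size_t consensusPct;  // required "yes" weight while in this state
    AvalancheState next;       // state to move to once its cutoff is reached
};

// Illustrative values; only the final 95% comes from the PR description.
std::map<AvalancheState, AvalancheCutoff> const avalancheCutoffs{
    {AvalancheState::init, {0, 50, AvalancheState::mid}},
    {AvalancheState::mid, {50, 65, AvalancheState::late}},
    {AvalancheState::late, {85, 70, AvalancheState::stuck}},
    {AvalancheState::stuck, {200, 95, AvalancheState::stuck}}};

// Returns the required weight and, if the state should advance, the new
// state. Pass minimumRounds = 0 for close time consensus, which has no
// minimum-round restriction.
std::pair<std::size_t, std::optional<AvalancheState>>
getNeededWeight(
    AvalancheState current,
    int percentTime,
    std::size_t currentRounds,
    std::size_t minimumRounds)
{
    auto const& cur = avalancheCutoffs.at(current);
    if (cur.next != current && currentRounds >= minimumRounds)
    {
        auto const& nxt = avalancheCutoffs.at(cur.next);
        // Later states must never start earlier than the current one.
        assert(nxt.consensusTime >= cur.consensusTime);
        if (percentTime >= nxt.consensusTime)
            return {nxt.consensusPct, cur.next};
    }
    return {cur.consensusPct, std::nullopt};
}
```

Changing a threshold then means editing one row of the map, which is the maintainability point made in item 3.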
Adds tests of the functionality that detects the "stalled" state.
Finally, it adds a simulation unit test that attempts to recreate the scenario that got us here in the first place. However, I'm not very familiar with the simulator, and we're still not 100% sure how we got into this state in the first place, so it's not very good.

codecov · 2025-02-05T04:39:22Z

Codecov Report

Attention: Patch coverage is 82.72727% with 19 lines in your changes missing coverage. Please review.

Project coverage is 78.1%. Comparing base (75a2019) to head (76ef214).
Report is 1 commits behind head on develop.

Files with missing lines	Patch %	Lines
src/xrpld/consensus/Consensus.h	58.3%	15 Missing ⚠️
src/xrpld/app/misc/NetworkOPs.cpp	66.7%	3 Missing ⚠️
src/xrpld/consensus/DisputedTx.h	97.0%	1 Missing ⚠️

Additional details and impacted files

@@           Coverage Diff           @@
##           develop   #5277   +/-   ##
=======================================
  Coverage     78.1%   78.1%           
=======================================
  Files          790     790           
  Lines        67911   67988   +77     
  Branches      8234    8252   +18     
=======================================
+ Hits         53025   53093   +68     
- Misses       14886   14895    +9

Files with missing lines	Coverage Δ
src/xrpld/consensus/Consensus.cpp	`76.6% <100.0%> (+2.7%)`	⬆️
src/xrpld/consensus/ConsensusParms.h	`100.0% <100.0%> (ø)`
src/xrpld/consensus/ConsensusTypes.h	`79.5% <ø> (ø)`
src/xrpld/consensus/DisputedTx.h	`99.0% <97.0%> (+3.0%)`	⬆️
src/xrpld/app/misc/NetworkOPs.cpp	`69.1% <66.7%> (-<0.1%)`	⬇️
src/xrpld/consensus/Consensus.h	`85.0% <58.3%> (-1.9%)`	⬇️

... and 3 files with indirect coverage changes


Bronek

Would be cool to have a unit test for the last part of DisputedTx.h ; not sure how realistic that request is. Approved in any case.

- Stable state means that neither we, nor any of our peers has changed a vote on a disputed transaction in a while. This is undesirable if an 80% consensus has not otherwise been reached. It will cause a validation to be sent, which will help get other (trusting) validators back on track using preferred ledger logic.

vlntb

The current version fails to build on macOS because the macOS standard library (libc++) drops the assignment operator for std::map pairs. There are two viable fixes:

Add an assignment operator to ConsensusParms:
We could add a custom assignment operator for ConsensusParms that, instead of assigning the avalancheCutoffs map directly (which triggers the error), manually copies its contents into the target map.
Make avalancheCutoffs const and remove the assignment:
By declaring the map as const, you avoid any assignment after its construction. I would prefer this option, but it means that we must update the unit test that does

peer->consensusParms = parms;

so that it no longer tries to perform such an assignment.
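For comparison, the first option (a custom assignment operator) might look roughly like the sketch below. `Parms` and `Cutoff` are hypothetical stand-ins for ConsensusParms and its map value type, and the const members are used only to model an element type that can be copy-constructed but not assigned; the actual cause of the build failure may differ.

```cpp
#include <cassert>
#include <map>

// Hypothetical element type: copy-constructible but not assignable.
struct Cutoff
{
    int const consensusTime;
    int const consensusPct;
};

// Hypothetical stand-in for ConsensusParms.
struct Parms
{
    std::map<int, Cutoff> avalancheCutoffs{{0, {0, 50}}, {1, {200, 95}}};

    Parms() = default;
    Parms(Parms const&) = default;

    // Rebuild the map element by element instead of calling
    // std::map::operator=, so no element is ever assigned.
    Parms&
    operator=(Parms const& other)
    {
        if (this != &other)
        {
            avalancheCutoffs.clear();
            for (auto const& [key, cutoff] : other.avalancheCutoffs)
                avalancheCutoffs.emplace(key, cutoff);
            assert(avalancheCutoffs.size() == other.avalancheCutoffs.size());
        }
        return *this;
    }
};
```

With this in place, an assignment like the one in the unit test keeps compiling, at the cost of maintaining a hand-written operator whenever members are added.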

Bronek

This is a good change. I like the unit test coverage (with one open question about the commented-out section), and I like how the new Expired state only kicks in when we are not stalled on all transactions (nice decoupling of the two states).

- Document "stalled" in checkConsensusReached. Also return early.
- Log stalled consensus at higher level than regular.
- const correctness (stalled in haveConsensus)
- Change type of AvalancheCutoff::consensusTime.
- Fix XRPL_ASSERT label
- Log tx ID on unchanged dispute vote

Bronek · 2025-03-13T11:18:08Z

src/xrpld/consensus/ConsensusParms.h

+        // See if enough time has passed to move on to the next.
+        XRPL_ASSERT(
+            nextCutoff.consensusTime >= currentCutoff.consensusTime,
+            "ripple::getNeededWeight next state valid");


Please put colon : between scope name and description

Suggested change

"ripple::getNeededWeight next state valid");

"ripple::getNeededWeight : next state valid");

Fixed

vlntb · 2025-03-13T14:09:00Z

This led to a deadlock-like situation where every node was waiting for some other node to make a change, while none of the nodes were willing to change.

Would the better name for the state be a livelock rather than a deadlock?
From wiki : "A livelock is similar to a deadlock, except that the states of the processes involved in the livelock constantly change with regard to one another, none progressing."

vlntb

LGTM. One minor suggestion for the PR description, quoted in my comment above: call the state a livelock rather than a deadlock.

ximinez · 2025-03-17T21:54:37Z

Proposed commit message:

Prevent consensus from getting stuck in the establish phase (#5277)

- Detects if the consensus process is "stalled". If it is, then we can declare a 
  consensus and end successfully even if we do not have 80% agreement on
  our proposal.
  - "Stalled" is defined as:
    - We have a close time consensus
    - Each disputed transaction is individually stalled:
      - It has been in the final "stuck" 95% requirement for at least 2
        (avMIN_ROUNDS) "inner rounds" of phaseEstablish,
      - and either all of the other trusted proposers or this validator, if proposing,
        have had the same vote(s) for at least 4 (avSTALLED_ROUNDS) "inner
        rounds", and at least 80% of the validators (including this one, if
        appropriate) agree about the vote (whether yes or no).
- If we have been in the establish phase for more than 10x the previous
  consensus establish phase's time, then consensus is considered "expired",
  and we will leave the round, which sends a partial validation (indicating
  that the node is moving on without validating). Two restrictions avoid
  prematurely exiting, or having an extended exit in extreme situations.
  - The 10x time is clamped to be within a range of 15s
    (ledgerMAX_CONSENSUS) to 120s (ledgerABANDON_CONSENSUS).
  - If consensus has not had an opportunity to walk through all avalanche
    states (defined as not going through 8 "inner rounds" of phaseEstablish),
    then ConsensusState::Expired is treated as ConsensusState::No.
- When enough nodes leave the round, any remaining nodes will see they've
  fallen behind, and move on, too, generally before hitting the timeout. Any
  validations or partial validations sent during this time will help the
  consensus process bring the nodes back together.

* Set version to 2.4.0 * refactor: Remove unused and add missing includes (#5293) The codebase is filled with includes that are unused, and which thus can be removed. At the same time, the files often do not include all headers that contain the definitions used in those files. This change uses clang-format and clang-tidy to clean up the includes, with minor manual intervention to ensure the code compiles on all platforms. * refactor: Calculate numFeatures automatically (#5324) Requiring manual updates of numFeatures is an annoying manual process that is easily forgotten, and leads to frequent merge conflicts. This change takes advantage of the `XRPL_FEATURE` and `XRPL_FIX` macros, and adds a new `XRPL_RETIRE` macro to automatically set `numFeatures`. * refactor: Improve ordering of headers with clang-format (#5343) Removes all manual header groupings from source and header files by leveraging clang-format options. * Rename "deadlock" to "stall" in `LoadManager` (#5341) What the LoadManager class does is stall detection, which is not the same as deadlock detection. In the condition of severe CPU starvation, LoadManager will currently intentionally crash rippled reporting `LogicError: Deadlock detected`. This error message is misleading as the condition being detected is not a deadlock. This change fixes and refactors the code in response. * Adds hub.xrpl-commons.org as a new Bootstrap Cluster (#5263) * fix: Error message for ledger_entry rpc (#5344) Changes the error to `malformedAddress` for `permissioned_domain` in the `ledger_entry` rpc, when the account is not a string. This change makes it more clear to a user what is wrong with their request. * fix: Handle invalid marker parameter in grpc call (#5317) The `end_marker` is used to limit the range of ledger entries to fetch. If `end_marker` is less than `marker`, a crash can occur. This change adds an additional check. 
* fix: trust line RPC no ripple flag (#5345) The Trustline RPC `no_ripple` flag gets set depending on `lsfDefaultRipple` flag, which is not a flag of a trustline but of the account root. The `lsfDefaultRipple` flag does not provide any insight if this particular trust line has `lsfLowNoRipple` or `lsfHighNoRipple` flag set, so it should not be used here at all. This change simplifies the logic. * refactor: Updates Conan dependencies: RocksDB (#5335) Updates RocksDB to version 9.7.3, the latest version supported in Conan 1.x. A patch for 9.7.4 that fixes a memory leak is included. * fix: Remove null pointer deref, just do abort (#5338) This change removes the existing undefined behavior from `LogicError`, so we can be certain that there will be always a stacktrace. De-referencing a null pointer is an old trick to generate `SIGSEGV`, which would typically also create a stacktrace. However it is also an undefined behaviour and compilers can do something else. A more robust way to create a stacktrace while crashing the program is to use `std::abort`, which we have also used in this location for a long time. If we combine the two, we might not get the expected behaviour - namely, the nullpointer deref followed by `std::abort`, as handled in certain compiler versions may not immediately cause a crash. We have observed stacktrace being wiped instead, and thread put in indeterminate state, then stacktrace created without any useful information. * chore: Add PR number to payload (#5310) This PR adds one more payload field to the libXRPL compatibility check workflow - the PR number itself. * chore: Update link to ripple-binary-codec (#5355) The link to ripple-binary-codec's definitions.json appears to be outdated. The updated link is also documented here: https://xrpl.org/docs/references/protocol/binary-format#definitions-file * Prevent consensus from getting stuck in the establish phase (#5277) - Detects if the consensus process is "stalled". 
If it is, then we can declare a consensus and end successfully even if we do not have 80% agreement on our proposal. - "Stalled" is defined as: - We have a close time consensus - Each disputed transaction is individually stalled: - It has been in the final "stuck" 95% requirement for at least 2 (avMIN_ROUNDS) "inner rounds" of phaseEstablish, - and either all of the other trusted proposers or this validator, if proposing, have had the same vote(s) for at least 4 (avSTALLED_ROUNDS) "inner rounds", and at least 80% of the validators (including this one, if appropriate) agree about the vote (whether yes or no). - If we have been in the establish phase for more than 10x the previous consensus establish phase's time, then consensus is considered "expired", and we will leave the round, which sends a partial validation (indicating that the node is moving on without validating). Two restrictions avoid prematurely exiting, or having an extended exit in extreme situations. - The 10x time is clamped to be within a range of 15s (ledgerMAX_CONSENSUS) to 120s (ledgerABANDON_CONSENSUS). - If consensus has not had an opportunity to walk through all avalanche states (defined as not going through 8 "inner rounds" of phaseEstablish), then ConsensusState::Expired is treated as ConsensusState::No. - When enough nodes leave the round, any remaining nodes will see they've fallen behind, and move on, too, generally before hitting the timeout. Any validations or partial validations sent during this time will help the consensus process bring the nodes back together. 
--------- Co-authored-by: Michael Legleux <mlegleux@ripple.com> Co-authored-by: Bart <bthomee@users.noreply.github.com> Co-authored-by: Ed Hennis <ed@ripple.com> Co-authored-by: Bronek Kozicki <brok@incorrekt.com> Co-authored-by: Darius Tumas <Tokeiito@users.noreply.github.com> Co-authored-by: Sergey Kuznetsov <skuznetsov@ripple.com> Co-authored-by: cyan317 <120398799+cindyyan317@users.noreply.github.com> Co-authored-by: Vlad <129996061+vvysokikh1@users.noreply.github.com> Co-authored-by: Alex Kremer <akremer@ripple.com>

@mtrippled

* refactor: Remove unused and add missing includes (#5293) The codebase is filled with includes that are unused, and which thus can be removed. At the same time, the files often do not include all headers that contain the definitions used in those files. This change uses clang-format and clang-tidy to clean up the includes, with minor manual intervention to ensure the code compiles on all platforms. * refactor: Calculate numFeatures automatically (#5324) Requiring manual updates of numFeatures is an annoying manual process that is easily forgotten, and leads to frequent merge conflicts. This change takes advantage of the `XRPL_FEATURE` and `XRPL_FIX` macros, and adds a new `XRPL_RETIRE` macro to automatically set `numFeatures`. * refactor: Improve ordering of headers with clang-format (#5343) Removes all manual header groupings from source and header files by leveraging clang-format options. * Rename "deadlock" to "stall" in `LoadManager` (#5341) What the LoadManager class does is stall detection, which is not the same as deadlock detection. In the condition of severe CPU starvation, LoadManager will currently intentionally crash rippled reporting `LogicError: Deadlock detected`. This error message is misleading as the condition being detected is not a deadlock. This change fixes and refactors the code in response. * Adds hub.xrpl-commons.org as a new Bootstrap Cluster (#5263) * fix: Error message for ledger_entry rpc (#5344) Changes the error to `malformedAddress` for `permissioned_domain` in the `ledger_entry` rpc, when the account is not a string. This change makes it more clear to a user what is wrong with their request. * fix: Handle invalid marker parameter in grpc call (#5317) The `end_marker` is used to limit the range of ledger entries to fetch. If `end_marker` is less than `marker`, a crash can occur. This change adds an additional check. 
* fix: trust line RPC no ripple flag (#5345) The Trustline RPC `no_ripple` flag gets set depending on `lsfDefaultRipple` flag, which is not a flag of a trustline but of the account root. The `lsfDefaultRipple` flag does not provide any insight if this particular trust line has `lsfLowNoRipple` or `lsfHighNoRipple` flag set, so it should not be used here at all. This change simplifies the logic. * refactor: Updates Conan dependencies: RocksDB (#5335) Updates RocksDB to version 9.7.3, the latest version supported in Conan 1.x. A patch for 9.7.4 that fixes a memory leak is included. * fix: Remove null pointer deref, just do abort (#5338) This change removes the existing undefined behavior from `LogicError`, so we can be certain that there will be always a stacktrace. De-referencing a null pointer is an old trick to generate `SIGSEGV`, which would typically also create a stacktrace. However it is also an undefined behaviour and compilers can do something else. A more robust way to create a stacktrace while crashing the program is to use `std::abort`, which we have also used in this location for a long time. If we combine the two, we might not get the expected behaviour - namely, the nullpointer deref followed by `std::abort`, as handled in certain compiler versions may not immediately cause a crash. We have observed stacktrace being wiped instead, and thread put in indeterminate state, then stacktrace created without any useful information. * chore: Add PR number to payload (#5310) This PR adds one more payload field to the libXRPL compatibility check workflow - the PR number itself. * chore: Update link to ripple-binary-codec (#5355) The link to ripple-binary-codec's definitions.json appears to be outdated. The updated link is also documented here: https://xrpl.org/docs/references/protocol/binary-format#definitions-file * Prevent consensus from getting stuck in the establish phase (#5277) - Detects if the consensus process is "stalled". 
If it is, then we can declare a consensus and end successfully even if we do not have 80% agreement on our proposal. - "Stalled" is defined as: - We have a close time consensus - Each disputed transaction is individually stalled: - It has been in the final "stuck" 95% requirement for at least 2 (avMIN_ROUNDS) "inner rounds" of phaseEstablish, - and either all of the other trusted proposers or this validator, if proposing, have had the same vote(s) for at least 4 (avSTALLED_ROUNDS) "inner rounds", and at least 80% of the validators (including this one, if appropriate) agree about the vote (whether yes or no). - If we have been in the establish phase for more than 10x the previous consensus establish phase's time, then consensus is considered "expired", and we will leave the round, which sends a partial validation (indicating that the node is moving on without validating). Two restrictions avoid prematurely exiting, or having an extended exit in extreme situations. - The 10x time is clamped to be within a range of 15s (ledgerMAX_CONSENSUS) to 120s (ledgerABANDON_CONSENSUS). - If consensus has not had an opportunity to walk through all avalanche states (defined as not going through 8 "inner rounds" of phaseEstablish), then ConsensusState::Expired is treated as ConsensusState::No. - When enough nodes leave the round, any remaining nodes will see they've fallen behind, and move on, too, generally before hitting the timeout. Any validations or partial validations sent during this time will help the consensus process bring the nodes back together. * test: enable TxQ unit tests work with variable reference fee (#5118) In preparation for a potential reference fee change we would like to verify that fee change works as expected. The first step is to fix all unit tests to be able to work with different reference fee values. * test: enable unit tests to work with variable reference fee (#5145) Fix remaining unit tests to be able to process reference fee values other than 10. 
* Intrusive SHAMap smart pointers for efficient memory use and lock-free synchronization (#5152) The main goal of this optimisation is memory reduction in SHAMapTreeNodes by introducing intrusive pointers instead of standard std::shared_ptr and std::weak_ptr. * refactor: Move integration tests from 'examples/' into 'tests/' (#5367) This change moves `examples/example` into `tests/conan` to make it clear it is an integration test, and adjusts the `conan` CI job accordingly * test: enable compile time param to change reference fee value (#5159) Adds an extra CI pipeline to perform unit tests using different values for fees. * Fix undefined uint128_t type on Windows non-unity builds (#5377) As part of import optimization, a transitive include had been removed that defined `BOOST_COMP_MSVC` on Windows. In unity builds, this definition was pulled in, but in non-unity builds it was not - causing a compilation error. An inspection of the Boost code revealed that we can just gate the statements by `_MS_VER` instead. A `#pragma message` is added to verify that the statement is only printed on Windows builds. * fix: uint128 ambiguousness breaking macos unity build (#5386) * Fix to correct memory ordering for compare_exchange_weak and wait in the intrusive reference counting logic (#5381) This change addresses a memory ordering assertion failure observed on one of the Windows test machines during the IntrusiveShared_test suite. * fix: disable `channel_authorize` when `signing_support` is disabled (#5385) * fix: Use the build image from ghcr.io (#5390) The ci pipelines are constantly hitting Docker Hub's public rate limiting since increasing the number of jobs we're running. This change switches over to images hosted in GitHub's registry. 
* Remove UNREACHABLE from `NetworkOPsImp::processTrustedProposal` (#5387) It’s possible for this to happen legitimately if a set of peers, including a validator, are connected in a cycle, and the latency and message processing time between those peers is significantly less than the latency between the validator and the last peer. It’s unlikely in the real world, but obviously easy to simulate with Antithesis. * Instrument proposal, validation and transaction messages (#5348) Adds metric counters for the following P2P message types: * Untrusted proposal and validation messages * Duplicate proposal, validation and transaction messages * refactor(trivial): reorganize ledger entry tests and helper functions (#5376) This PR splits out `ledger_entry` tests into its own file (`LedgerEntry_test.cpp`) and alphabetizes the helper functions in `LedgerEntry.cpp`. These commits were split out of #5237 to make that PR a little more manageable, since these basic trivial changes are most of the diff. There is no code change, just moving code around. * fix: `fixPayChanV1` (#4717) This change introduces a new fix amendment (`fixPayChanV1`) that prevents the creation of new `PaymentChannelCreate` transaction with a `CancelAfter` time less than the current ledger time. It piggy backs off of fix1571. Once the amendment is activated, creating a new `PaymentChannel` will require that if you specify the `CancelAfter` time/value, that value must be greater than or equal to the current ledger time. Currently users can create a payment channel where the `CancelAfter` time is before the current ledger time. This results in the payment channel being immediately closed on the next PaymentChannel transaction. 
* Fix: admin RPC webhook queue limit removal and timeout reduction (#5163)

  When using subscribe at admin RPC port to send webhooks for the transaction stream to a backend, on large(r) ledgers the endpoint receives fewer HTTP POSTs with TX information than the number of transactions in a ledger. This change removes the hardcoded queue length to avoid dropping TX notifications for the admin-only command. In addition, the per-request TTL for outgoing RPC HTTP calls has been reduced from 10 minutes to 30 seconds.

* fix: Adds CTID to RPC tx and updates error (#4738)

  This change fixes a number of issues involved with CTID:
  * CTID is not present on all RPC tx transactions.
  * rpcWRONG_NETWORK is missing in the ErrorCodes.cpp

* Temporary disable automatic triggering macOS pipeline (#5397)

  We temporarily disable running unit tests on macOS on the CI pipeline while we are investigating the delays.

* refactor: Clean up test logging to make it easier to search (#5396)

  This PR replaces the word `failed` with `failure` in any test names and renames some test files to fix MSVC warnings, so that it is easier to search through the test output to find tests that failed.

* chore: Run CI on PRs that are Ready or have the "DraftRunCI" label (#5400)

  - Avoids costly overhead for idle PRs where the CI results don't add any value.

* fix: CTID to use correct ledger_index (#5408)

* chore: Small clarification to lsfDefaultRipple comment (#5410)

* fix: Replaces random endpoint resolution with sequential (#5365)

  This change addresses an issue where `rippled` attempts to connect to an IPv6 address, even when the local network lacks IPv6 support, resulting in a "Network is unreachable" error. The fix replaces the custom endpoint selection logic with `boost::async_connect`, which sequentially attempts to connect to available endpoints until one succeeds or all fail.

* Improve transaction relay logic (#4985)

  Combines four related changes:

  1. "Decrease `shouldRelay` limit to 30s." Pretty self-explanatory. Currently, the limit is 5 minutes, by which point the `HashRouter` entry could have expired, making this transaction look brand new (and thus causing it to be relayed back to peers which have sent it to us recently).
  2. "Give a transaction more chances to be retried." Will put a transaction into `LedgerMaster`'s held transactions if the transaction gets a `ter`, `tel`, or `tef` result. Old behavior was just `ter`.
     * Additionally, to prevent a transaction from being repeatedly held indefinitely, it must meet some extra conditions. (Documented in a comment in the code.)
  3. "Pop all transactions with sequential sequences, or tickets." When a transaction is processed successfully, currently, one held transaction for the same account (if any) will be popped out of the held transactions list, and queued up for the next transaction batch. This change pops all transactions for the account, but only if they have sequential sequences (for non-ticket transactions) or use a ticket. This issue was identified from interactions with @mtrippled's #4504, which was merged, but unfortunately reverted later by #4852. When the batches were spaced out, it could potentially take a very long time for a large number of held transactions for an account to get processed through. However, whether batched or not, this change will help get held transactions cleared out, particularly if a missing earlier transaction is what held them up.
  4. "Process held transactions through existing NetworkOPs batching." In the current processing, at the end of each consensus round, all held transactions are directly applied to the open ledger, then the held list is reset. This bypasses all of the logic in `NetworkOPs::apply` which, among other things, broadcasts successful transactions to peers. This means that the transaction may not get broadcast to peers for a really long time (5 minutes in the current implementation, or 30 seconds with this first commit). If the node is a bottleneck (either due to network configuration, or because the transaction was submitted locally), the transaction may not be seen by any other nodes or validators before it expires or causes other problems.

* Enable passive squelching (#5358)

  This change updates the squelching logic to accept squelch messages for untrusted validators. As a result, servers will also squelch untrusted validator messages, reducing the duplicate traffic they generate. In particular:
  * Updates squelch message handling logic to squelch messages for all validators, not only trusted ones.
  * Updates the logic to send squelch messages to peers that don't squelch themselves.
  * Increases the threshold for the number of messages that a peer has to deliver to consider it as a candidate for validator messages.

* Add PermissionDelegation feature (#5354)

  This change implements the account permission delegation described in XLS-75d, see XRPLF/XRPL-Standards#257.
  * Introduces transaction-level and granular permissions that can be delegated to other accounts.
  * Adds `DelegateSet` transaction to grant specified permissions to another account.
  * Adds `ltDelegate` ledger object to maintain the permission list for delegating/delegated account pair.
  * Adds an optional `Delegate` field in common fields, allowing a delegated account to send transactions on behalf of the delegating account within the granted permission scope. The `Account` field remains the delegating account; the `Delegate` field specifies the delegated account. The transaction is signed by the delegated account.

* refactor: use east const convention (#5409)

  This change refactors the codebase to use the "east const convention", and adds a clang-format rule to follow this convention.

* fix: enable LedgerStateFix for delegation (#5427)

* Configure CODEOWNERS for changes to RPC code (#5266)

  To ensure changes to any RPC-related code are compatible with other services, such as Clio, the RPC team will be required to review them.

* fix: Ensure that coverage file generation is atomic. (#5426)

  When unit tests run in parallel, multiple threads can write into one file and corrupt the output, and then gcovr won't be able to parse the corrupted file. This change adds -fprofile-update=atomic as instructed by https://gcc.gnu.org/bugzilla/show_bug.cgi?id=68080.

* fix: Update validators-example.txt fix xrplf example URL (#5384)

* Fix: Resolve slow test on macOS pipeline (#5392)

  Using std::barrier performs extremely poorly (~1 hour vs ~1 minute to run the test suite) in certain macOS environments. To unblock our macOS CI pipeline, std::barrier has been replaced with a custom mutex-based barrier (Barrier) that significantly improves performance without compromising correctness.

* Set version to 2.5.0-b1

---------

Co-authored-by: Bart <bthomee@users.noreply.github.com>
Co-authored-by: Ed Hennis <ed@ripple.com>
Co-authored-by: Bronek Kozicki <brok@incorrekt.com>
Co-authored-by: Darius Tumas <Tokeiito@users.noreply.github.com>
Co-authored-by: Sergey Kuznetsov <skuznetsov@ripple.com>
Co-authored-by: cyan317 <120398799+cindyyan317@users.noreply.github.com>
Co-authored-by: Vlad <129996061+vvysokikh1@users.noreply.github.com>
Co-authored-by: Alex Kremer <akremer@ripple.com>
Co-authored-by: Valentin Balaschenko <13349202+vlntb@users.noreply.github.com>
Co-authored-by: Mayukha Vadari <mvadari@ripple.com>
Co-authored-by: Vito Tumas <5780819+Tapanito@users.noreply.github.com>
Co-authored-by: Denis Angell <dangell@transia.co>
Co-authored-by: Wietse Wind <w.wind@ipublications.net>
Co-authored-by: yinyiqian1 <yqian@ripple.com>
Co-authored-by: Jingchen <a1q123456@users.noreply.github.com>
Co-authored-by: brettmollin <brettmollin@users.noreply.github.com>
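
For readers following the relay changes in #4985, the "pop all transactions with sequential sequences, or tickets" rule can be illustrated with a minimal sketch. This is not the `LedgerMaster` code: `HeldTx` and `popSequential` are hypothetical names, the held list is assumed to be sorted by sequence, and the extra conditions mentioned in the commit message are not modeled.

```cpp
#include <cstdint>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a held transaction (not rippled's type).
struct HeldTx
{
    std::uint32_t sequence;               // account sequence; unused for ticket transactions
    std::optional<std::uint32_t> ticket;  // set when the transaction uses a ticket
};

// Assumption: `held` contains one account's held transactions, with the
// non-ticket entries sorted by ascending sequence. Pops every ticket
// transaction, plus the unbroken run of non-ticket transactions whose
// sequences start at `nextSeq`. Everything else stays in `held`.
inline std::vector<HeldTx>
popSequential(std::vector<HeldTx>& held, std::uint32_t nextSeq)
{
    std::vector<HeldTx> popped;
    std::vector<HeldTx> remaining;
    for (auto const& tx : held)
    {
        if (tx.ticket)
        {
            // Ticket transactions do not depend on sequence ordering.
            popped.push_back(tx);
        }
        else if (tx.sequence == nextSeq)
        {
            // Continues the sequential run; expect the following sequence next.
            popped.push_back(tx);
            ++nextSeq;
        }
        else
        {
            // A gap: this transaction still waits for a missing earlier one.
            remaining.push_back(tx);
        }
    }
    held = std::move(remaining);
    return popped;
}
```

With held sequences 5, 6, and 8 plus one ticket transaction, and a next expected sequence of 5, this sketch pops 5, 6, and the ticket transaction, and leaves 8 held until 7 arrives.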

ximinez changed the title ~~Drop out of consensus if the round takes too long~~ Failsafes to prevent a consensus round from taking too long Feb 5, 2025

ximinez requested review from Bronek, JoelKatz and vlntb February 5, 2025 19:01

Bronek reviewed Feb 5, 2025

src/xrpld/consensus/DisputedTx.h (outdated, resolved)

Bronek reviewed Feb 5, 2025

src/xrpld/consensus/Consensus.cpp (outdated, resolved)

ximinez force-pushed the ximinez/consensus branch from f07992d to 76c27a0 Compare February 5, 2025 22:05

ximinez marked this pull request as ready for review February 5, 2025 23:16

ximinez requested a review from Bronek February 6, 2025 00:02

Bronek approved these changes Feb 6, 2025

ximinez force-pushed the ximinez/consensus branch 4 times, most recently from 6e513d9 to 26ab221 Compare February 11, 2025 01:15

bthomee added this to the 2.4.0 (Q1 2025) milestone Feb 11, 2025

ximinez force-pushed the ximinez/consensus branch 2 times, most recently from 8dcbf91 to a6d3cea Compare February 12, 2025 04:12

ximinez added 2 commits February 11, 2025 23:13

Drop out of consensus if the round takes too long

60de826

ximinez force-pushed the ximinez/consensus branch from a6d3cea to 197356b Compare February 12, 2025 04:13

ximinez added 2 commits February 12, 2025 00:18

[WIP] Consensus tests

8ad2cb3

[WIP] Fix builds

7e2fa5c

vlntb requested changes Feb 12, 2025

vlntb reviewed Feb 12, 2025

src/xrpld/consensus/ConsensusParms.h (outdated, resolved)

vlntb reviewed Feb 12, 2025

src/test/consensus/Consensus_test.cpp (resolved)

ximinez added 4 commits February 12, 2025 11:35

Merge remote-tracking branch 'upstream/develop' into ximinez/consensus

196f6b6

* upstream/develop: chore: Rename missing-commits job, and combine nix job files (5268)

Update levelization

5108e55

Make ConsensusParms const, remove unnecessary copy assignment

5889dc5

Merge remote-tracking branch 'upstream/develop' into ximinez/consensus

69c1b00

* upstream/develop:
  fix: Replace charge() by fee_.update() in OnMessage functions (5269)
  docs: ensure build_type and CMAKE_BUILD_TYPE match (5274)
  chore: Fix small typos in protocol files (5279)

Bronek reviewed Mar 4, 2025

src/xrpld/consensus/Consensus.h (outdated, resolved)

Bronek self-requested a review March 4, 2025 15:36

bthomee requested a review from vlntb March 4, 2025 21:20

Bronek reviewed Mar 5, 2025

src/xrpld/consensus/DisputedTx.h (resolved)

Bronek reviewed Mar 5, 2025

src/xrpld/consensus/Consensus.cpp (outdated, resolved)

Bronek reviewed Mar 5, 2025

src/test/consensus/Consensus_test.cpp (outdated, resolved)

Bronek approved these changes Mar 5, 2025

Bronek reviewed Mar 5, 2025

src/xrpld/consensus/Consensus.cpp (outdated, resolved)

ximinez added 6 commits March 11, 2025 11:33

Merge remote-tracking branch 'upstream/develop' into ximinez/consensus

8a3341f

* upstream/develop:
  Set version to 2.4.0
  Set version to 2.4.0-rc4
  chore: Update XRPL Foundation Validator List URL (5326)

Merge remote-tracking branch 'upstream/develop' into ximinez/consensus

88cf631

* upstream/develop: refactor: Remove unused and add missing includes (5293)

Fix formatting

89c62b1

Merge branch 'develop' into ximinez/consensus

1d896c3

Fix formatting

75e0880

Bronek reviewed Mar 13, 2025

vlntb approved these changes Mar 13, 2025

Review feedback from @Bronek: fix assert message

081eee4

ximinez added and removed the "Ready to merge" label (*PR author* thinks it's ready to merge. Has passed code review. Perf sign-off may still be required.) Mar 17, 2025

ximinez added 2 commits March 17, 2025 16:23

Merge branch 'develop' into ximinez/consensus

0bcc318

Remove the "testDisjointNetwork" test

ab05d2b

ximinez added the "Ready to merge" label (*PR author* thinks it's ready to merge. Has passed code review. Perf sign-off may still be required.) Mar 17, 2025

ximinez added 2 commits March 18, 2025 20:44

Merge branch 'develop' into ximinez/consensus

c4baa7a

Merge branch 'develop' into ximinez/consensus

76ef214

ximinez merged commit d22a505 into develop Mar 20, 2025
24 checks passed

ximinez deleted the ximinez/consensus branch March 20, 2025 16:41

	"ripple::getNeededWeight next state valid");
	"ripple::getNeededWeight : next state valid");