8000 release-24.3: changefeedccl: fix premature shutdown due to schema change bug by andyyang890 · Pull Request #147031 · cockroachdb/cockroach · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

release-24.3: changefeedccl: fix premature shutdown due to schema change bug #147031

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
May 24, 2025

Conversation

andyyang890
Copy link
Collaborator

Backport 3/3 commits from #144004.

/cc @cockroachdb/release


Fixes #144108

Test failures on master:
Fixes #143976
Fixes #144045
Fixes #144219

Test failures on release branches (won't be fixed until PR is backported):
Informs #144287
Informs #144291
Informs #144352


changefeedccl: add more verbose logging around schema changes

This patch adds more verbose logging to the change aggregator around
receiving and emitting resolved spans to help debug recurring changefeed
schema change test flakes.

Release note: None


changefeedccl: fix premature shutdown due to schema change bug

This patch fixes a bug that could potentially cause a changefeed
to erroneously complete when one of its watched tables encounters a
schema change has been fixed.

The root cause for the bug was that if we happened to get a rangefeed
checkpoint at precisely ts.Prev() for some schema change timestamp
ts, the kv feed would deliver a resolved span with ts and a NONE
boundary to the change aggregator, which would advance its frontier;
then when the resolved span with ts and a RESTART boundary was
sent to the change aggregator, the frontier would not be advanced and
so would not be flushed to the change frontier. The change frontier
would then read a nil row from the change aggregator and shut the
changefeed down as if it had completed successfully.

This bug has been fixed by modifying the resolved s 8000 pan frontier
forwarding logic to consider forwarding to the current timestamp
but with a non-NONE boundary type to be advancing the frontier.

Two alternative solutions that were ruled out were:

  1. Unconditionally flushing the frontier when we get a non-NONE boundary
    instead of only flushing when the frontier is advanced.
    • Problem: We would flush the frontier O(spans) number of times.
  2. Making the kv feed not emit resolved events for rangefeed checkpoints
    that are at a schema change boundary.
    • Problem: We wouldn't be able to save the per-span progress at the
      schema change boundary.

Release note (bug fix): A bug that could potentially cause a changefeed
to erroneously complete when one of its watched tables encounters a
schema change has been fixed.


changefeedccl/resolvedspan: update assertion for forwarding boundary

This patch adds an assertion that the frontier will not forward the
boundary to a different type at the same time. This assertion is
important because if it's violated, the changefeed processors will not
shut down correctly during a schema change. One potential way this
could be violated in the future is a change to how boundary types are
determined on aggregators causing different boundaries to be sent from
aggregators in a mixed-version cluster for the same schema change.

Release note: None


Release justification: low-risk bug fix for high-priority bug

This patch adds more verbose logging to the change aggregator around
receiving and emitting resolved spans to help debug recurring changefeed
schema change test flakes.

Release note: None
Copy link
blathers-crl bot commented May 20, 2025

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Backports should only be created for serious
    issues
    or test-only changes.
  • Backports should not break backwards-compatibility.
  • Backports should change as little code as possible.
  • Backports should not change on-disk formats or node communication protocols.
  • Backports should not add new functionality (except as defined
    here).
  • Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
  • All backports must be reviewed by the owning areas TL. For more information as to how that review should be conducted, please consult the backport
    policy
    .
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed version test added).
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
  • Your backport must be accompanied by a post to the appropriate Slack
    channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this
backport.

@blathers-crl blathers-crl bot added the backport Label PR's that are backports to older release branches label May 20, 2025
Copy link
blathers-crl bot commented May 20, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@andyyang890 andyyang890 force-pushed the backport24.3-144004 branch from fedef9d to a4309b9 Compare May 20, 2025 20:37
This patch fixes a bug that could potentially cause a changefeed
to erroneously complete when one of its watched tables encounters a
schema change has been fixed.

The root cause for the bug was that if we happened to get a rangefeed
checkpoint at precisely `ts.Prev()` for some schema change timestamp
`ts`, the kv feed would deliver a resolved span with `ts` and a NONE
boundary to the change aggregator, which would advance its frontier;
then when the resolved span with `ts` and a RESTART boundary was
sent to the change aggregator, the frontier would not be advanced and
so would not be flushed to the change frontier. The change frontier
would then read a nil row from the change aggregator and shut the
changefeed down as if it had completed successfully.

This bug has been fixed by modifying the resolved span frontier
forwarding logic to consider forwarding to the current timestamp
but with a non-NONE boundary type to be advancing the frontier.

Two alternative solutions that were ruled out were:
1. Unconditionally flushing the frontier when we get a non-NONE boundary
   instead of only flushing when the frontier is advanced.
     - Problem: We would flush the frontier O(spans) number of times.
2. Making the kv feed not emit resolved events for rangefeed checkpoints
   that are at a schema change boundary.
     - Problem: We wouldn't be able to save the per-span progress at the
       schema change boundary.

Release note (bug fix): A bug that could potentially cause a changefeed
to erroneously complete when one of its watched tables encounters a
schema change has been fixed.
This patch adds an assertion that the frontier will not forward the
boundary to a different type at the same time. This assertion is
important because if it's violated, the changefeed processors will not
shut down correctly during a schema change. One potential way this
could be violated in the future is a change to how boundary types are
determined on aggregators causing different boundaries to be sent from
aggregators in a mixed-version cluster for the same schema change.

Release note: None
@andyyang890 andyyang890 force-pushed the backport24.3-144004 branch from a4309b9 to 70c49d1 Compare May 20, 2025 20:43
@andyyang890 andyyang890 requested a review from asg0451 May 21, 2025 12:31
@andyyang890 andyyang890 marked this pull request as ready for review May 21, 2025 12:31
@andyyang890 andyyang890 requested a review from a team as a code owner May 21, 2025 12:31
@andyyang890
Copy link
Collaborator Author

Used EngFlow to stress all the tests in the pkg/ccl/changefeedccl package 1000 times and no failures: https://tanzanite.cluster.engflow.com/invocation/7efcdfe2-f1ac-4b2b-8806-e39735ac75ed

@andyyang890 andyyang890 merged commit 70ae93b into cockroachdb:release-24.3 May 24, 2025
15 checks passed
@andyyang890 andyyang890 deleted the backport24.3-144004 branch May 24, 2025 02:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport Label PR's that are backports to older release branches target-release-24.3.15
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants
0