-
Notifications
You must be signed in to change notification settings - Fork 3.9k
release-24.3: changefeedccl: fix premature shutdown due to schema change bug #147031
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-24.3: changefeedccl: fix premature shutdown due to schema change bug #147031
Conversation
This patch adds more verbose logging to the change aggregator around receiving and emitting resolved spans to help debug recurring changefeed schema change test flakes. Release note: None
Thanks for opening a backport. Please check the backport criteria before merging:
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
Also, please add a brief release justification to the body of your PR to justify this |
It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR? 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
fedef9d
to
a4309b9
Compare
This patch fixes a bug that could potentially cause a changefeed to erroneously complete when one of its watched tables encounters a schema change has been fixed. The root cause for the bug was that if we happened to get a rangefeed checkpoint at precisely `ts.Prev()` for some schema change timestamp `ts`, the kv feed would deliver a resolved span with `ts` and a NONE boundary to the change aggregator, which would advance its frontier; then when the resolved span with `ts` and a RESTART boundary was sent to the change aggregator, the frontier would not be advanced and so would not be flushed to the change frontier. The change frontier would then read a nil row from the change aggregator and shut the changefeed down as if it had completed successfully. This bug has been fixed by modifying the resolved span frontier forwarding logic to consider forwarding to the current timestamp but with a non-NONE boundary type to be advancing the frontier. Two alternative solutions that were ruled out were: 1. Unconditionally flushing the frontier when we get a non-NONE boundary instead of only flushing when the frontier is advanced. - Problem: We would flush the frontier O(spans) number of times. 2. Making the kv feed not emit resolved events for rangefeed checkpoints that are at a schema change boundary. - Problem: We wouldn't be able to save the per-span progress at the schema change boundary. Release note (bug fix): A bug that could potentially cause a changefeed to erroneously complete when one of its watched tables encounters a schema change has been fixed.
This patch adds an assertion that the frontier will not forward the boundary to a different type at the same time. This assertion is important because if it's violated, the changefeed processors will not shut down correctly during a schema change. One potential way this could be violated in the future is a change to how boundary types are determined on aggregators causing different boundaries to be sent from aggregators in a mixed-version cluster for the same schema change. Release note: None
a4309b9
to
70c49d1
Compare
Used EngFlow to stress all the tests in the |
Backport 3/3 commits from #144004.
/cc @cockroachdb/release
Fixes #144108
Test failures on master:
Fixes #143976
Fixes #144045
Fixes #144219
Test failures on release branches (won't be fixed until PR is backported):
Informs #144287
Informs #144291
Informs #144352
changefeedccl: add more verbose logging around schema changes
This patch adds more verbose logging to the change aggregator around
receiving and emitting resolved spans to help debug recurring changefeed
schema change test flakes.
Release note: None
changefeedccl: fix premature shutdown due to schema change bug
This patch fixes a bug that could potentially cause a changefeed
to erroneously complete when one of its watched tables encounters a
schema change has been fixed.
The root cause for the bug was that if we happened to get a rangefeed
checkpoint at precisely
ts.Prev()
for some schema change timestampts
, the kv feed would deliver a resolved span withts
and a NONEboundary to the change aggregator, which would advance its frontier;
then when the resolved span with
ts
and a RESTART boundary wassent to the change aggregator, the frontier would not be advanced and
so would not be flushed to the change frontier. The change frontier
would then read a nil row from the change aggregator and shut the
changefeed down as if it had completed successfully.
This bug has been fixed by modifying the resolved s 8000 pan frontier
forwarding logic to consider forwarding to the current timestamp
but with a non-NONE boundary type to be advancing the frontier.
Two alternative solutions that were ruled out were:
instead of only flushing when the frontier is advanced.
that are at a schema change boundary.
schema change boundary.
Release note (bug fix): A bug that could potentially cause a changefeed
to erroneously complete when one of its watched tables encounters a
schema change has been fixed.
changefeedccl/resolvedspan: update assertion for forwarding boundary
This patch adds an assertion that the frontier will not forward the
boundary to a different type at the same time. This assertion is
important because if it's violated, the changefeed processors will not
shut down correctly during a schema change. One potential way this
could be violated in the future is a change to how boundary types are
determined on aggregators causing different boundaries to be sent from
aggregators in a mixed-version cluster for the same schema change.
Release note: None
Release justification: low-risk bug fix for high-priority bug