Fix frequent restart read required error in test automation framework from YB server by sanyamsinghal · Pull Request #2629 · yugabyte/yb-voyager · GitHub

Fix frequent restart read required error in test automation framework from YB server #2629


Merged: 2 commits into main from sanyam/automation-fix-restart-read-fix1 on May 22, 2025

Conversation

sanyamsinghal
Collaborator
@sanyamsinghal sanyamsinghal commented May 19, 2025

Describe the changes in this pull request

RCA
The pg-indexes test schema has 11 tables and 11 indexes, and multiple tables hold 100k rows each. For each test validation, we start a single transaction and run count(*) against every table, which can end up reading every tablet. Because the number of tablets is large, the probability of hitting the error within a single transaction is high. The transaction fails on a SELECT query because the tablet being read for that table had not yet updated its safe time (there are multiple possible reasons for this).

Error message:
psycopg2.errors.SerializationFailure: Restart read required at: { read: { physical: 1747314423169478 } local_limit: { physical: 1747314423169468 logical: 4095 } global_limit: <min> in_txn_limit: <max> serial_no: 0 } (query layer retry isn't possible because data was already sent, if this is the read committed isolation (or) the first statement in repeatable read/ serializable isolation

Here, read: { physical: 1747314423169478 } is the transaction read time and local_limit: { physical: 1747314423169468 logical: 4095 } is the safe time for the tablet, so read > local_limit.
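The comparison above can be checked directly from the error text. The following is a minimal sketch, not part of the PR: the helper name `physical_micros` is hypothetical, and treating the physical values as microseconds since the epoch is an assumption based on their magnitude.

```python
import re

# Relevant part of the error message quoted above.
ERROR = (
    "Restart read required at: { read: { physical: 1747314423169478 } "
    "local_limit: { physical: 1747314423169468 logical: 4095 } "
    "global_limit: <min> in_txn_limit: <max> serial_no: 0 }"
)


def physical_micros(message, field):
    """Return the physical component of `field` (hypothetical helper)."""
    pattern = r"\b" + re.escape(field) + r": \{ physical: (\d+)"
    match = re.search(pattern, message)
    if match is None:
        raise ValueError(f"{field} not found in message")
    return int(match.group(1))


read_time = physical_micros(ERROR, "read")
local_limit = physical_micros(ERROR, "local_limit")

# Assumed unit: microseconds. The read time is ahead of the tablet's limit.
gap_micros = read_time - local_limit
print(read_time > local_limit, gap_micros)
```

In this particular failure the two physical values differ by only 10 units.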

Fix

  1. Add a sleep of 1.1 sec before validations so that every tablet's safe time has been updated. [This assumes safe time updates every 1 sec; screenshot attached below.]
  2. Use statement-level transactions (autocommit=true). This is also, I believe, a general guideline for YB and the proper fix, but we decided to do it later because:
    • We want to validate our understanding of the safe time scenario first (test runs over a few days/weeks without failure would confirm it).
    • Existing tests were written with the assumption of autocommit=false and need to be updated.
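Fix 2 could look roughly like the sketch below. It is an illustration only, not the change made in this PR or in #2626: the function name `count_rows_per_table` is hypothetical, and `conn` is assumed to be a psycopg2 connection (or any object with the same `autocommit` attribute and `cursor()` context manager).

```python
def count_rows_per_table(conn, tables):
    """Run count(*) for each table as its own statement-level transaction."""
    # With autocommit enabled, each statement runs in its own transaction
    # instead of all counts sharing one transaction.
    conn.autocommit = True
    counts = {}
    with conn.cursor() as cur:
        for table in tables:
            # Table names are assumed to come from trusted test fixtures,
            # so they are interpolated directly into the query text.
            cur.execute(f"SELECT count(*) FROM {table}")
            counts[table] = cur.fetchone()[0]
    return counts
```

A test would call it with an open connection and the list of tables in the schema, for example `count_rows_per_table(conn, ["public.t1", "public.t2"])`, where the table names are placeholders.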

This PR contains fix 1 of the 2 fixes suggested above.
Refer to #2626 for the other change...

Sleep for 1100ms: spans at least one full 500ms Raft heartbeat plus 500ms skew,
so that tablet safe‐time ≥ (now – skew) and we avoid “restart read required”.
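
The arithmetic behind the 1100ms figure can be sketched as follows. The two 500ms values are the ones quoted in the comment above; the constant and function names are hypothetical, and the 100ms margin is an assumption that accounts for the remainder of the 1100ms total.

```python
import time

RAFT_HEARTBEAT_MS = 500  # value quoted in the comment above
MAX_CLOCK_SKEW_MS = 500  # value quoted in the comment above
SAFETY_MARGIN_MS = 100   # assumed margin that brings the total to 1100 ms


def safe_time_wait_ms():
    """Total wait: one heartbeat, plus skew, plus the assumed margin."""
    return RAFT_HEARTBEAT_MS + MAX_CLOCK_SKEW_MS + SAFETY_MARGIN_MS


def wait_for_safe_time(sleep=time.sleep):
    """Sleep before validations; `sleep` is injectable so tests can skip the wait."""
    wait_ms = safe_time_wait_ms()
    sleep(wait_ms / 1000)
    return wait_ms
```

Calling `wait_for_safe_time()` once before opening the validation transaction matches fix 1 as described in this PR.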

Describe if there are any user-facing changes

How was this pull request tested?

Does your PR have changes in callhome/yugabyted payloads? If so, is the payload version incremented?

Does your PR have changes that can cause upgrade issues?

Component Breaking changes?
MetaDB Yes/No
Name registry json Yes/No
Data File Descriptor Json Yes/No
Export Snapshot Status Json Yes/No
Import Data State Yes/No
Export Status Json Yes/No
Data .sql files of tables Yes/No
Export and import data queue Yes/No
Schema Dump Yes/No
AssessmentDB Yes/No
Sizing DB Yes/No
Migration Assessment Report Json Yes/No
Callhome Json Yes/No
YugabyteD Tables Yes/No
TargetDB Metadata Tables Yes/No

@sanyamsinghal sanyamsinghal self-assigned this May 19, 2025
@sanyamsinghal sanyamsinghal merged commit 4b7c783 into main May 22, 2025
62 of 63 checks passed
@sanyamsinghal sanyamsinghal deleted the sanyam/automation-fix-restart-read-fix1 branch May 22, 2025 06:13
2 participants