Fix frequent restart read required error in test automation framework from YB server #2629

sanyamsinghal · 2025-05-19T09:18:06Z

Describe the changes in this pull request

RCA
[pg-indexes test schema: 11 tables, 11 indexes, multiple tables with 100k rows each] For each test validation, we start a single transaction and run count(*) on every table, which can end up reading every tablet. Because there are many tablets, the probability of hitting the error within a single transaction is high. The transaction fails on a SELECT query because the tablet being read for that table has not yet had its safe time updated (there are multiple possible reasons for this).

Error message:
psycopg2.errors.SerializationFailure: Restart read required at: { read: { physical: 1747314423169478 } local_limit: { physical: 1747314423169468 logical: 4095 } global_limit: <min> in_txn_limit: <max> serial_no: 0 } (query layer retry isn't possible because data was already sent, if this is the read committed isolation (or) the first statement in repeatable read/ serializable isolation

Here, notice that read: { physical: 1747314423169478 } is the transaction read time and local_limit: { physical: 1747314423169468 logical: 4095 } is the safe time for the tablet, where read > local_limit.
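To make the comparison concrete, here is a minimal Python sketch that pulls the two timestamps out of the error text and computes the gap between them. The function name `parse_restart_read_times` is hypothetical, and the sketch assumes the `physical` values are microseconds since the Unix epoch, which is consistent with the date of the failure but is not stated in the error itself.

```python
import re

# Relevant part of the error text from this PR, copied verbatim.
ERROR_TEXT = (
    "Restart read required at: { read: { physical: 1747314423169478 } "
    "local_limit: { physical: 1747314423169468 logical: 4095 } "
    "global_limit: <min> in_txn_limit: <max> serial_no: 0 }"
)


def parse_restart_read_times(message: str) -> dict:
    """Extract the read time and the tablet's local_limit from the error text."""
    read = re.search(r"read: \{ physical: (\d+)", message)
    local = re.search(r"local_limit: \{ physical: (\d+)", message)
    if read is None or local is None:
        raise ValueError("message does not look like a 'Restart read required' error")
    return {
        "read_physical": int(read.group(1)),
        "local_limit_physical": int(local.group(1)),
    }


times = parse_restart_read_times(ERROR_TEXT)
# Assumption: both values are in microseconds, so this is a gap in microseconds.
gap = times["read_physical"] - times["local_limit_physical"]
print(f"read is ahead of local_limit by {gap} (assumed microseconds)")
```

For the error above the gap is only 10, so the read time is barely ahead of the tablet's limit.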

Fix

1. Add a sleep of 1.1 sec before validations so that every tablet's safe time has been updated. [assuming it updates every 1 sec, screenshot attached below]
2. Use statement-level transactions (autocommit=true). This is probably also a general guideline for YB and the proper fix, but we decided to do it later because:
- We first want to validate our understanding of the safe time scenario (test runs over a few days/weeks without failure would confirm that).
- The tests are currently written with the assumption of autocommit=false and would have to be updated.
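The two options can be sketched roughly as follows. This is an illustration only, not the code changed in this PR: the helper names `count_rows_single_txn` and `count_rows_autocommit` are hypothetical, and `conn` is assumed to be a psycopg2-style connection (context-manager cursors, `commit()`, and an `autocommit` attribute). Table names are interpolated directly, so they are assumed to be trusted identifiers from the test schema.

```python
import time

SAFE_TIME_WAIT_SECONDS = 1.1  # the sleep duration chosen in this PR


def count_rows_single_txn(conn, tables, wait_seconds=SAFE_TIME_WAIT_SECONDS,
                          sleep=time.sleep):
    """Fix 1: wait before validating, then count every table in one transaction."""
    sleep(wait_seconds)  # give tablets time to advance their safe time
    counts = {}
    with conn.cursor() as cur:
        for table in tables:
            cur.execute(f"SELECT count(*) FROM {table}")
            counts[table] = cur.fetchone()[0]
    conn.commit()  # end the single transaction that covered all the counts
    return counts


def count_rows_autocommit(conn, tables):
    """Fix 2: statement-level transactions, one per count(*)."""
    # With psycopg2 this must be set while no transaction is in progress.
    conn.autocommit = True
    counts = {}
    with conn.cursor() as cur:
        for table in tables:
            cur.execute(f"SELECT count(*) FROM {table}")
            counts[table] = cur.fetchone()[0]
    return counts
```

The `sleep` parameter exists only so that the wait can be replaced in unit tests. In the second variant each SELECT runs as its own statement-level transaction instead of sharing one transaction with all the other counts.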

This PR implements the first of the two fixes suggested above (the sleep).
Refer to #2626 for the other change.

Sleep for 1100ms: spans at least one full 500ms Raft heartbeat plus 500ms skew,
so that tablet safe‐time ≥ (now – skew) and we avoid “restart read required”.

Describe if there are any user-facing changes

How was this pull request tested?

Does your PR have changes in callhome/yugabyted payloads? If so, is the payload version incremented?

Does your PR have changes that can cause upgrade issues?

Component	Breaking changes?
MetaDB	Yes/No
Name registry json	Yes/No
Data File Descriptor Json	Yes/No
Export Snapshot Status Json	Yes/No
Import Data State	Yes/No
Export Status Json	Yes/No
Data .sql files of tables	Yes/No
Export and import data queue	Yes/No
Schema Dump	Yes/No
AssessmentDB	Yes/No
Sizing DB	Yes/No
Migration Assessment Report Json	Yes/No
Callhome Json	Yes/No
YugabyteD Tables	Yes/No
TargetDB Metadata Tables	Yes/No


sanyamsinghal self-assigned this May 19, 2025

Add comment explaining sleep

5b4eac4

sanyamsinghal marked this pull request as ready for review May 19, 2025 10:23

sanyamsinghal mentioned this pull request May 19, 2025

Potential fixes for Restart Required error in automation test #2626

Draft

sanyamsinghal requested review from makalaaneesh, hbhanawat and priyanshi-yb May 19, 2025 10:25

priyanshi-yb approved these changes May 22, 2025


sanyamsinghal merged commit 4b7c783 into main May 22, 2025
62 of 63 checks passed

sanyamsinghal deleted the sanyam/automation-fix-restart-read-fix1 branch May 22, 2025 06:13
