Fix frequent restart read required error in test automation framework from YB server #2629
Describe the changes in this pull request
RCA
[pg-indexes test schema: 11 tables, 11 indexes, multiple tables with 100k rows each] For each test validation, we start a single transaction and run count(*) on every table, which may end up reading every tablet. Because the number of tablets is large, the probability of hitting this error within a single transaction is high. The transaction fails on a SELECT query because the tablet being read for that table has not yet had its safe time updated (there are multiple possible reasons).
Error message:
psycopg2.errors.SerializationFailure: Restart read required at: { read: { physical: 1747314423169478 } local_limit: { physical: 1747314423169468 logical: 4095 } global_limit: <min> in_txn_limit: <max> serial_no: 0 } (query layer retry isn't possible because data was already sent, if this is the read committed isolation (or) the first statement in repeatable read/ serializable isolation
Here, note that read: { physical: 1747314423169478 } is the transaction read time and local_limit: { physical: 1747314423169468 logical: 4095 } is the safe time for the tablet, so read > local_limit.
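To make the gap concrete, the two physical timestamps can be pulled out of the error text and compared. This is only an illustrative sketch: `parse_restart_read` is a hypothetical helper, not part of the test framework, and the regular expressions assume the message layout shown in the error above.

```python
import re

def parse_restart_read(message):
    """Return (read_us, local_limit_us) parsed from the error text, or None.

    Hypothetical helper; assumes the message layout quoted in this PR.
    """
    read = re.search(r"read:\s*\{\s*physical:\s*(\d+)", message)
    limit = re.search(r"local_limit:\s*\{\s*physical:\s*(\d+)", message)
    if not read or not limit:
        return None
    return int(read.group(1)), int(limit.group(1))

msg = (
    "Restart read required at: { read: { physical: 1747314423169478 } "
    "local_limit: { physical: 1747314423169468 logical: 4095 } "
    "global_limit: <min> in_txn_limit: <max> serial_no: 0 }"
)
read_us, limit_us = parse_restart_read(msg)
gap_us = read_us - limit_us  # how far the read time is ahead of local_limit
print(gap_us)
```

For the message in this PR, the read time is ahead of local_limit by 10 in the physical component.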
Fix
This PR implements one of the two suggested fixes.
Refer to #2626 for the other change...
Sleep for 1100 ms: this spans at least one full 500 ms Raft heartbeat plus 500 ms of clock skew,
so that the tablet safe time ≥ (now - skew) and we avoid "restart read required".
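A minimal sketch of how such a sleep could sit in front of the validation transaction is shown below. It assumes a DB-API style connection such as the one psycopg2 provides. The function name `count_rows_after_settle`, the `SETTLE_SECONDS` constant, and the injectable `sleep` parameter are hypothetical and not taken from the actual change.

```python
import time

# 500 ms Raft heartbeat + 500 ms skew + 100 ms margin, per the PR description.
SETTLE_SECONDS = 1.1

def count_rows_after_settle(conn, tables, settle_seconds=SETTLE_SECONDS,
                            sleep=time.sleep):
    """Wait for the settle period, then count rows of every table in one transaction.

    Hypothetical helper: `conn` is any DB-API connection (for example psycopg2),
    and `tables` must contain trusted table names, since they are interpolated
    directly into the query text.
    """
    sleep(settle_seconds)  # wait before the transaction picks its read time
    counts = {}
    cur = conn.cursor()
    try:
        for table in tables:
            cur.execute(f"SELECT count(*) FROM {table}")
            counts[table] = cur.fetchone()[0]
    finally:
        cur.close()
    conn.commit()  # end the single validation transaction
    return counts
```

Passing `sleep` as a parameter keeps the wait easy to stub out in unit tests. The sleep only lowers the probability of the error and does not replace handling SerializationFailure where it can still occur.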
Describe if there are any user-facing changes
How was this pull request tested?
Does your PR have changes in callhome/yugabyted payloads? If so, is the payload version incremented?
Does your PR have changes that can cause upgrade issues?