-
Notifications
You must be signed in to change notification settings - Fork 9.1k
HDFS-17801. EC: Reading support retryCurrentNode to avoid transient errors cause application level failures. #7762
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: trunk
Are you sure you want to change the base?
Conversation
b91d1b9
to
6c144bb
Compare
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
6c144bb
to
f52be54
Compare
💔 -1 overall
This message was automatically generated. |
f52be54
to
bd73e6c
Compare
🎊 +1 overall
This message was automatically generated. |
Hi, @Hexiaoqiao @KeeProMise @zhangshuyan0 . Could you please help review this PR when you are free ? Thanks a lot. |
bd73e6c
to
b89b1d0
Compare
🎊 +1 overall
This message was automatically generated. |
b89b1d0
to
a44f887
Compare
…rrors cause application level failures.
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR introduces transient-error resilience to EC read paths by adding a retry mechanism modeled after the 3-replication read implementation. Key changes include:
- Adding unit tests to simulate long idle periods and validate the retry behavior.
- Updating StripeReader logic to conditionally reset chunk states and use additional reader info checks.
- Introducing and integrating a retryCurrentReaderFlags mechanism in DFSStripedInputStream to control reader retries.
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
File | Description |
---|---|
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestDFSStripedInputStream.java | Added new tests for EC read retry behavior. |
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java | Enhanced handling of missing blocks and integrated logic to account for transient errors. |
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSStripedInputStream.java | Introduced retryCurrentReaderFlags for controlling reader retry behavior in case of transient errors. |
Comments suppressed due to low confidence (2)
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/StripeReader.java:179
- Consider adding a comment explaining the rationale behind comparing countOfNullReaderInfos with parityBlkNum, clarifying how this check differentiates between transient errors and genuine missing blocks.
if (countOfNullReaderInfos(readerInfos) < parityBlkNum) {
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSStripedInputStream.java:233
- Add an inline comment to explain why a reader is skipped only when its corresponding retry flag is false, which will help improve code clarity and maintainability.
if (!retryCurrentReaderFlags[readerIndex]) {
a44f887
to
299330d
Compare
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Thanks for your contribution. cc @zhangshuyan0 Would you mind to review again?
Thanks @hfutatzhanghb report. |
Uh oh!
There was an error while loading. Please reload this page.
Description of PR
Refer to HDFS-17081.
Under the 3-replication read implementation, when an IOException occurs, there is the retryCurrentNode mechanism.
This is very useful to avoid application level failures due to transient errors (e.g. Datanode could have closed the connection because the client is idle for too long). Please refer to below codes :
hadoop/hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java
Lines 824 to 828 in 6eae158
We should make EC read also support this mechanism.
BTW, this issue is motivated by the failure of our cluster's applications failure when we change the data from 3-rep to EC policy.
How was this patch tested?
Add an unit test.