HDFS-17365. EC: Add extra redunency configuration in checkStreamerFailures to prevent data loss. #6517
base: trunk
96bf0b1
to
510fce0
Compare
@Hexiaoqiao @zhangshuyan0 @tasanuma @tomscut Hi, sir. Could you please help me review this PR when you are free? Thanks a lot!
💔 -1 overall
This message was automatically generated. |
510fce0
to
60c23a4
Compare
This is a good idea to me. I think we should allow different EC policies to use different sizes of `extraStreamerRedunency`. Maybe a structure like `ReplaceDatanodeOnFailure`? Or maybe just include the EC schema name in the configuration key?
Nice suggestion, sir. Give me some time to think about how to design it. I hope we can push this idea forward. Thanks a lot!
60c23a4
to
55fcce0
Compare
@@ -3908,6 +3908,18 @@
</description>
</property>

<property>
<name>dfs.client.ec.EXAMPLEECPOLICYNAME.checkstreamer.redunency</name>
@hfutatzhanghb Thanks for the PR. I think it's a good feature.
In my honest opinion, `dfs.client.ec.EXAMPLEECPOLICYNAME.checkstreamer.redunency` is counter-intuitive. I would prefer a setting whose values are interpreted the other way around, something like `dfs.client.ec.EXAMPLEECPOLICYNAME.failed.write.block.tolerated`: if the value is 0, no failures are tolerated, and if it is 3, up to 3 failed block writes are tolerated. If the setting is empty (which would be the default), failures are tolerated up to the number of parity blocks. This is just my personal view.
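The semantics proposed in this comment can be sketched in a few lines of Java. This is only an illustration, not code from the patch: the class name `ToleratedWriteFailures` and the method `resolve` are hypothetical, and clamping an out-of-range value into `[0, numParityBlocks]` is an assumption that the comment does not spell out.

```java
import java.util.OptionalInt;

public class ToleratedWriteFailures {

    /**
     * Resolves how many failed block writes the client would tolerate.
     *
     * @param configured      value of the proposed per-policy setting, empty if unset
     * @param numParityBlocks parity block count of the EC policy (3 for RS-6-3)
     */
    static int resolve(OptionalInt configured, int numParityBlocks) {
        if (!configured.isPresent()) {
            // Unset: tolerate failures up to the number of parity blocks.
            return numParityBlocks;
        }
        // Assumption: values outside [0, numParityBlocks] are clamped,
        // since tolerating more failures than parity blocks would lose data.
        return Math.max(0, Math.min(configured.getAsInt(), numParityBlocks));
    }

    public static void main(String[] args) {
        System.out.println(resolve(OptionalInt.empty(), 3)); // 3
        System.out.println(resolve(OptionalInt.of(0), 3));   // 0
        System.out.println(resolve(OptionalInt.of(2), 3));   // 2
        System.out.println(resolve(OptionalInt.of(5), 3));   // 3
    }
}
```

With this reading, a value of 0 makes any streamer failure fatal, while leaving the setting empty preserves the existing behavior.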
@tasanuma Sir, thanks a lot for your review and valuable opinion. I agree with you: we should make the configuration more intuitive.
@tasanuma @zhangshuyan0 Hi, sir. Sorry for the late response. I have uploaded a new patch that addresses the review comments. Please help me review this PR when you have bandwidth. Thanks a lot~
d73f2ca
to
d4701f0
Compare
68e46d9
to
aca8b62
Compare
The checkstyle warning can be ignored.
…lures to prevent data loss.
2a2bcd2
to
4daec24
Compare
@Hexiaoqiao @zhangshuyan0 @tasanuma Sir, sorry for pushing this PR forward too late. Could you please help review this PR when you are free? Thanks.
The checkstyle warning can be ignored.
Description of PR
Refer to HDFS-17365.
Currently, when we write EC files, the client can tolerate at most `numAllBlocks - numDataBlocks` failed streamers. However, this can lead to a block missing exception in some cases.
For example, with the RS-6-3-1024K EC policy, suppose only 6 of the 9 blocks in a block group have been written successfully. The block group is going to be reconstructed, but if one of those 6 blocks is lost before that happens, we may lose the data forever.
Or, if we restart a datanode at that time, the client will get a block missing exception.
So we had better add some redundancy here.
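To make the arithmetic concrete, here is a minimal standalone sketch of the check described above. It is not the Hadoop implementation: the class `StreamerFailureCheck`, the method `tooManyFailures`, and the `extraRedundancy` parameter are hypothetical names used only to illustrate how reserving extra redundancy lowers the number of tolerated streamer failures.

```java
public class StreamerFailureCheck {

    /**
     * Returns true if the write should be aborted.
     *
     * @param failedStreamers number of streamers that have failed so far
     * @param numAllBlocks    data + parity blocks in a block group (9 for RS-6-3)
     * @param numDataBlocks   data blocks in a block group (6 for RS-6-3)
     * @param extraRedundancy hypothetical number of additional healthy blocks
     *                        required beyond numDataBlocks (0 = current behavior)
     */
    static boolean tooManyFailures(int failedStreamers, int numAllBlocks,
                                   int numDataBlocks, int extraRedundancy) {
        // Never let the tolerated count go negative.
        int tolerated = Math.max(0, numAllBlocks - numDataBlocks - extraRedundancy);
        return failedStreamers > tolerated;
    }

    public static void main(String[] args) {
        // RS-6-3 with no extra redundancy: 3 failures are accepted,
        // leaving exactly 6 blocks and no margin for a further loss.
        System.out.println(tooManyFailures(3, 9, 6, 0)); // false
        // Requiring one extra healthy block: 3 failures now abort the write.
        System.out.println(tooManyFailures(3, 9, 6, 1)); // true
        // Two failures still leave 7 blocks, so the write continues.
        System.out.println(tooManyFailures(2, 9, 6, 1)); // false
    }
}
```

With `extraRedundancy = 1`, a successfully written RS-6-3 block group always has at least 7 blocks, so losing one more block before reconstruction does not make the data unrecoverable.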