HDFS-17381. Distcp of EC files should not be limited to DFS. #6551
Conversation
Don't do it this way; reflection is messy and brittle. It may work as a proof of concept, but really it is an attempt to fix distcp by adding yet another special case rather than making it generic.
You have the power to make changes in hadoop-common, and should do so where appropriate. But do it in a way which can be used by more than just Ozone. Ideally HDFS should be able to support it too, so that we can remove the HDFS special-case logic entirely. Do that and deployments on Azure, AWS and GCS no longer need the HDFS libraries on their classpath.
Look at
- HADOOP-17981. Resilient commit through etag validation (#3611): EtagSource, a new interface implemented by multiple FileStatus subclasses.
- HADOOP-18671. Add recoverLease(), setSafeMode(), isFileClosed() APIs to FileSystem (#5533). This was explicitly done for Ozone.
Also, createFile() has a builder API where options can be set using opt(k, v) and must(k, v) to declare options which may or must be understood by the implementing FS. All you should require at create time is a new option which takes an integer enum, boolean or string. In #3289 we added an s3a option "fs.s3a.create.performance" as an example of this. It is unique to s3a and does not require any changes in applications other than use of an API which has been present since hadoop 2.9.
If you need a new field in FileStatus/LocatedFileStatus then define a new interface with implementations in hdfs (it looks like the existing ones can't be pulled up as they use stuff in hdfs packages) and in ozone.
If you need new methods on the filesystem, add them similarly and implement them in the hdfs and ozone filesystem subclasses.
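As a minimal sketch, assuming names not fixed by this PR, such an interface could look roughly like the following; getErasureCodingPolicyName(FileStatus) matches the DistributedFileSystem override shown later in this review, while the setter is purely illustrative:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.Path;

/**
 * Sketch of an erasure-coding interface in hadoop-common, implemented by the
 * HDFS and Ozone filesystem classes. Only getErasureCodingPolicyName() is
 * taken from the diff in this review; the setter is an assumption.
 */
public interface WithErasureCoding {

  /** Return the EC policy name of the file, or null if it is not erasure coded. */
  String getErasureCodingPolicyName(FileStatus fileStatus);

  /** Set the EC policy on a path (hypothetical signature, for illustration only). */
  void setErasureCodingPolicy(Path path, String ecPolicyName) throws IOException;
}
```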
And if it is new createFile() options, look at the s3a createFile() support and do something similar. This should be straightforward, and it is precisely why we are using the builder API in so many of the filesystem operations now: to give filesystem implementations flexibility without requiring new changes in hadoop-common etc., and better cross-version support.
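For example, a minimal sketch of passing such options through the builder; "fs.s3a.create.performance" is the real s3a key mentioned above, while "example.ec.policy" is a made-up key used only to show the opt() versus must() semantics:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class CreateFileBuilderExample {

  public static FSDataOutputStream create(FileSystem fs, Path path) throws IOException {
    return fs.createFile(path)
        // must(): the filesystem has to understand this option or the build fails.
        .must("example.ec.policy", "RS-6-3-1024k")
        // opt(): a best-effort hint, ignored by stores that do not recognise it.
        .opt("fs.s3a.create.performance", "true")
        .build();
  }
}
```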
For all of this you will also need to define the semantics and add them to the filesystem specification. Work out what HDFS does, document it, and write contract tests for HDFS, Ozone and anything else in future.
What about just enhancing createFile() with a new option for the EC policy in org.apache.hadoop.fs.Options.OpenFileOptions: fs.option.openfile.ec.policy.
HDFS can use this in its override of createFile(); everything else can choose to ignore it (if set with opt()) or fail fast if it is passed in must().
You should also add a new hasPathCapability() probe for the filesystems which will support this (using the same option name) so apps can query its availability.
The advantage of this: it's an existing and stable API, and it allows code built against existing hadoop versions to request it, at least as an option.
The WithErasureCoding interface then becomes one to just get/set the policy, not to create files.
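Roughly, the probe-then-create pattern proposed here could look like the sketch below; the option key is the one named in this comment and should be treated as provisional until the specification lands:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public final class EcAwareCreate {

  // Provisional key from this discussion; the final constant would live in Options.OpenFileOptions.
  private static final String EC_POLICY_OPTION = "fs.option.openfile.ec.policy";

  static FSDataOutputStream create(FileSystem fs, Path path, String ecPolicyName)
      throws IOException {
    if (ecPolicyName != null && fs.hasPathCapability(path, EC_POLICY_OPTION)) {
      // The filesystem declares support, so require the policy via must().
      return fs.createFile(path).must(EC_POLICY_OPTION, ecPolicyName).build();
    }
    // Otherwise create normally; no erasure-coding request is made.
    return fs.createFile(path).build();
  }
}
```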
Sorry for the long silence here, just picked this up again. @steveloughran Thanks for the review comments; I have made the changes you suggested, could you please take a look and see if it looks OK? I will then test it with this.
Reviewed. Trying to make sure that the new interface can be defined as rigorously as possible, and that it is possible to take hadoop-hdfs off the classpath and have this all work between two Ozone stores.
(cherry picked from commit 3b663d7)
Force-pushed from db67ccf to 292e110.
That was before hadoop 3.0 shipped. It is now stable, although not used much in those libraries that still want to run on java2, because they're confident we aren't going to cut the old ones. The HDFS subclass provides many more options; S3A supports extra options such as "fs.s3a.create.performance", which turns off the safety check of a LIST for child objects. We've just been working hard to make those same libraries able to use the builder-based openFile() API, as it has tangible performance benefits.
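For reference, a sketch of what a builder-based openFile() call looks like, assuming a whole-file sequential read such as a distcp copy; the read-policy key shown is the one used in recent releases, and keys passed via opt() are ignored by stores that do not recognise them:

```java
import java.util.concurrent.CompletableFuture;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OpenFileExample {

  public static void readWholeFile(FileSystem fs, Path path) throws Exception {
    // The builder returns a future so object stores can start the open asynchronously.
    CompletableFuture<FSDataInputStream> future = fs.openFile(path)
        .opt("fs.option.openfile.read.policy", "whole-file") // hint; safely ignored elsewhere
        .build();
    try (FSDataInputStream in = future.get()) {
      // ... read the stream ...
    }
  }
}
```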
Can I encourage you to use git merge to resync with trunk once people have started reviewing? It's a lot easier to review with the "changes since your last review" option. Thanks.
Sure, I will use git merge instead of a force push next time. I have still maintained the commit history here even with the force push. For "changes since your last review", please refer to this commit.
Nearly done!
Other than a few checkstyle tweaks, just some tuning of that warning message, which is more likely to be viewed.
@@ -380,6 +380,8 @@ public FSDataInputStream open(PathHandle fd, int bufferSize)

  @Override
  public String getErasureCodingPolicyName(FileStatus fileStatus) {
    if (!(fileStatus instanceof HdfsFileStatus))
This has to have curly braces around it. Not for us, but for our successors maintaining this code.
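For illustration, how that guard clause reads with braces; the diff fragment above stops at the instanceof check, so everything after it here is an assumed body, not the actual patch:

```java
@Override
public String getErasureCodingPolicyName(FileStatus fileStatus) {
  if (!(fileStatus instanceof HdfsFileStatus)) {
    // Not an HDFS status: nothing to report (assumed fallback).
    return null;
  }
  ErasureCodingPolicy ecPolicy = ((HdfsFileStatus) fileStatus).getErasureCodingPolicy();
  return ecPolicy == null ? null : ecPolicy.getName();
}
```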
@@ -208,8 +208,8 @@ private long copyToFile(Path targetPath, FileSystem targetFS,

  String ecPolicyName = null;
  if (preserveEC && sourceStatus.isErasureCoded()
      && checkFSSupportsEC(sourceStatus.getPath(), sourceFS)
      && checkFSSupportsEC(targetPath, targetFS)) {
      && checkFSSupportsEC(sourceFS,sourceStatus.getPath())
nit: space after ,
  /**
   * Return true if the FS implements {@link WithErasureCoding} and
   * supports EC_POLICY option in {@link Options.OpenFileOptions}
- add a . at the end of this sentence
- add a second sentence saying that a message is logged when not available
      fs.hasPathCapability(path, Options.OpenFileOptions.FS_OPTION_OPENFILE_EC_POLICY)) {
    return true;
  }
  LOG.warn("FS with scheme {} does not support EC", fs.getScheme());
- prefer "Erasure Coding" to "EC"
- do you think we should print the full path?
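One way to address both nits, as a sketch (exact wording is the author's call):

```java
LOG.warn("Erasure Coding is not supported by the filesystem at {} (scheme {})",
    path, fs.getScheme());
```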
Thanks @steveloughran for the reviews, I have updated the patch.
+1, merging.
will take a backport PR to branch-3.4 too
@sadanand48 merged to trunk; if you can do a PR on hadoop branch-3.4 then once Yetus is happy we can merge it there too |
Thanks @steveloughran. HDFS-17376 is a dependency (good to have) that is not present in the 3.4 branch; I have raised #7073 for that cherry-pick. Once that goes in, I will create a backport for this PR too, or can we club them both in one?
Separately please; it's easier to track, revert, etc.
Contributed by Sadanand Shenoy (cherry picked from commit 49a4958)
Description of PR
Change to support distcp of EC files to and from Ozone FS
How was this patch tested?
Unit tests added