HADOOP-13327 Output Stream Specification #575

steveloughran · 2019-03-08T11:54:37Z

HADOOP-13327 Output Stream Specification

Change-Id: I1b6bc258a40a8bd57879d9edc3e5bb1303f0fff2

steveloughran · 2019-03-09T15:47:57Z

HDFS test have run out of memory


[ERROR] Tests run: 2, Failures: 0, Errors: 2, Skipped: 0, Time elapsed: 64.999 s <<< FAILURE! - in org.apache.hadoop.TestRefreshCallQueue
[ERROR] testRefreshCallQueueWithFairCallQueue(org.apache.hadoop.TestRefreshCallQueue)  Time elapsed: 4.751 s  <<< ERROR!
java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:717)
	at org.apache.hadoop.fs.FileSystem$Statistics.<clinit>(FileSystem.java:3707)
	at org.apache.hadoop.fs.FileSystem.getStatistics(FileSystem.java:4175)
	at org.apache.hadoop.fs.FileSystem.initialize(FileSystem.java:295)
	at org.apache.hadoop.fs.RawLocalFileSystem.initialize(RawLocalFileSystem.java:99)
	at org.apache.hadoop.fs.LocalFileSystem.initialize(LocalFileSystem.java:47)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3324)
	at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:136)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3373)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3341)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:491)
	at org.apache.hadoop.fs.FileSystem.getLocal(FileSystem.java:447)
	at org.apache.hadoop.hdfs.server.datanode.checker.StorageLocationChecker.check(StorageLocationChecker.java:159)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.makeInstance(DataNode.java:2785)
	at org.apache.hadoop.hdfs.server.datanode.DataNode.instantiateDataNode(DataNode.java:2698)
	at org.apache.hadoop.hdfs.MiniDFSCluster.startDataNodes(MiniDFSCluster.java:1697)
	at org.apache.hadoop.hdfs.MiniDFSCluster.initMiniDFSCluster(MiniDFSCluster.java:913)

steveloughran · 2019-03-09T15:48:32Z

I consider that to be an HDFS problem: not mine. This test is ready for review

bgaborg

This is the first pass of the review. I haven't run any integration tests yet.

hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/Syncable.java

hadoop-common-project/hadoop-common/src/site/markdown/filesystem/outputstream.md

...ct/hadoop-common/src/test/java/org/apache/hadoop/fs/contract/AbstractContractCreateTest.java

hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/CanSetDropBehind.java

...-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/StoreImplementationUtils.java

...p-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/StreamStateModel.java

busbey · 2019-03-11T16:12:24Z

hadoop-common-project/hadoop-common/src/site/markdown/filesystem/outputstream.md

+This algorithm works for filesystems which are consistent with metadata and
+data, as well as HDFS. What is important to know is that, in HDFS
+`getFileStatus(FS, path).getLen()==0` does not imply that `data(FS, path)` is
+empty.


only if the file is still open for writing, correct? Is it possible to tell if a given path is open for writing?

no way, I'm afraid.

hadoop-common-project/hadoop-common/src/site/markdown/filesystem/outputstream.md

hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ABlockOutputStream.java

hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md

ehiggs · 2019-03-11T08:54:21Z

hadoop-common-project/hadoop-common/src/site/markdown/filesystem/filesystem.md

+and that of filesystems, as it allows >1 client to create a file with `overwrite==false`,
+and potentially confuse file/directory logic. In particular, using `create()` to acquire
+an exclusive lock on a file (whoever creates the file without an error is considered
+the holder of the lock) is not a valid algorithm when working with object stores.


...is not a valid algorithm when working with object stores.

So shouldn't this be disallowed as the feature is not supported?

...ct/hadoop-common/src/test/java/org/apache/hadoop/fs/contract/AbstractFSContractTestBase.java

...p-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/StreamStateModel.java

ehiggs · 2019-03-11T09:01:04Z

hadoop-tools/hadoop-aws/src/main/java/org/apache/hadoop/fs/s3a/S3ABlockOutputStream.java

    if (len == 0) {
      return;
    }
+    acquireLock(true);


This can lock and then throw. Should it be inside the try so the finally block can unlock?

moving this all to a separate patch, eventually

hadoop-common-project/hadoop-common/src/site/markdown/filesystem/outputstream.md

ehiggs · 2019-03-11T22:52:00Z

hadoop-common-project/hadoop-common/src/site/markdown/filesystem/outputstream.md

+1. If a file was created with `overwrite = True`, the existing data my still
+be visible: `data(FS', path) = data(FS, path)`.
+
+1. The check for existing data in a `create()` call with `overwrite=False`, may


To be consistent with the overwrite== above in the document...

Suggested change

1. The check for existing data in a `create()` call with `overwrite=False`, may

1. The check for existing data in a `create()` call with `overwrite==false`, may

...p-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/StreamStateModel.java

steveloughran · 2019-06-07T12:06:08Z

just rebased to trunk; not tested. I see there's comments I've got to address -will do

hadoop-yetus · 2019-06-07T15:41:52Z

...p-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/StreamStateModel.java

+  }
+
+  /**
+   * Create for a path. 


whitespace:end of line

hadoop-yetus · 2019-06-07T15:41:54Z

...p-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/StreamStateModel.java

+  /**
+   * Create for a path. 
+   * @param path optional path for exception messages.
+   */  


whitespace:end of line

hadoop-yetus · 2019-06-07T15:41:56Z

...p-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/StreamStateModel.java

+/**
+ * Models a stream's state and can be used for checking this state before
+ * any operation.
+ * 


whitespace:end of line

hadoop-yetus · 2019-06-07T15:41:57Z

...p-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/StreamStateModel.java

+  private IOException exception;
+
+  /**
+   * Create for a path. 


whitespace:end of line

hadoop-yetus · 2019-07-19T18:05:36Z

...p-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/StreamStateModel.java

+  }
+
+  /**
+   * Create for a path. 


whitespace:end of line

hadoop-yetus · 2019-07-19T18:05:37Z

...p-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/StreamStateModel.java

+  /**
+   * Create for a path. 
+   * @param path optional path for exception messages.
+   */  


whitespace:end of line

hadoop-yetus · 2019-07-19T18:05:39Z

...p-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/StreamStateModel.java

+/**
+ * Models a stream's state and can be used for checking this state before
+ * any operation.
+ * 


whitespace:end of line

hadoop-yetus · 2019-07-19T18:05:40Z

...p-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/StreamStateModel.java

+  private IOException exception;
+
+  /**
+   * Create for a path. 


whitespace:end of line

…vro schema types Author: Aditya Toomula <atoomula@linkedin.com> Reviewers: Srinivasulu Punuru <spunuru@linkedin.com> Closes apache#575 from atoomula/bytes1 and squashes the following commits: 855a03d7 [Aditya Toomula] Fix SamzaSql AvroRelConverter to handle BYTES and FIXED avro schema types df4886d8 [Aditya Toomula] Fix SamzaSql AvroRelConverter to handle BYTES and FIXED avro schema types 80268fc1 [Aditya Toomula] Fix SamzaSql AvroRelConverter to handle BYTES and FIXED avro schema types

Change-Id: I1b6bc258a40a8bd57879d9edc3e5bb1303f0fff2

* Address code-side comments from reviewers; markdown is still WiP. * move all inner stream propagation of hasCapability() tests to the StoreImplementationUtils operation. Change-Id: Ic5be4ac09ddd36b8804eacfee5ad0587a819eaf0

steveloughran · 2019-11-01T14:48:14Z

Had some time to look at the comments here after too long break; more work is needed

Output stream spec

Needs more review of the comments, especiallys Sean's

Stream Model + impl.

I now suspect that I need to do a bit more resilience in the failure handling of S3ABlockOutputStream.

Specifically: if 1+ POST has failed in a multipart write, we should make a best effort attempt to abort the upload. Admittedly, the same network or authentication errors which cause the post to fail are likely to cause the abort operation to fail too -that should not stop us up from at least trying.

Proposed: close() will attempt to abort the upload if a failure occurred earlier.

This is going to be fun to test. Indeed, it's a good argument for WriteOperationHelper to be converted to an interface with a mocking implementation alongside the production one. I might play with some subclassing instead;

Plan

I think these two bits of work can be separated; the S3ABlockOutput stream model/locking/aborting is complex enough to merit isolation.

Will follow up with that in a separate PR while retaining this purely for the output stream spec and tests.

This one doesn't apply to trunk no more; once I have gone through the comments I may produce a new PR for the output stream changes too.

Change-Id: I92788063536a4bf15ee28f17187c7e09fa09940f

hadoop-yetus · 2019-11-01T19:49:40Z

...p-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/StreamStateModel.java

+  private IOException exception;
+
+  /**
+   * Create for a path. 


whitespace:end of line

hadoop-yetus · 2019-11-01T19:49:41Z

...p-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/StreamStateModel.java

+  }
+
+  /**
+   * Create for a path. 


whitespace:end of line

hadoop-yetus · 2019-11-01T19:49:43Z

...p-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/StreamStateModel.java

+  /**
+   * Create for a path. 
+   * @param path optional path for exception messages.
+   */  


whitespace:end of line

steveloughran · 2019-11-05T17:37:52Z

I'm closing this and opening up a peer without any of the S3A changes, #1694. It currently addresses most of the comments of people's reviews, excluding Sean's request about restructuring the model

Could this be in a section about visibility and not in the model definition? Maybe later. "here's the model, here's how that model works with creation, here's how it works when reading/writing" flows much better and visibility should be in that third part.

bgaborg requested changes Mar 11, 2019

View reviewed changes

busbey reviewed Mar 11, 2019

View reviewed changes

ehiggs reviewed Mar 11, 2019

View reviewed changes

ajfabbri reviewed Mar 14, 2019

View reviewed changes

...p-common-project/hadoop-common/src/main/java/org/apache/hadoop/fs/impl/StreamStateModel.java Show resolved Hide resolved

steveloughran force-pushed the filesystem/HADOOP-13327-outputstream-spec branch from 13dc093 to 8947901 Compare March 22, 2019 22:49

steveloughran mentioned this pull request Mar 27, 2019

HDFS-3246: pRead equivalent for direct read path #597

Merged

steveloughran force-pushed the filesystem/HADOOP-13327-outputstream-spec branch from 8947901 to 455203f Compare June 7, 2019 12:02

hadoop-yetus reviewed Jun 7, 2019

View reviewed changes

hadoop-yetus reviewed Jul 19, 2019

View reviewed changes

apache deleted a comment from hadoop-yetus Sep 3, 2019

steveloughran and others added 2 commits October 22, 2019 14:22

HADOOP-13327 Output Stream Specification

af80562

Change-Id: I1b6bc258a40a8bd57879d9edc3e5bb1303f0fff2

HADOOP-13327 update on the OutputStream specification

8a2154d

* Address code-side comments from reviewers; markdown is still WiP. * move all inner stream propagation of hasCapability() tests to the StoreImplementationUtils operation. Change-Id: Ic5be4ac09ddd36b8804eacfee5ad0587a819eaf0

steveloughran force-pushed the filesystem/HADOOP-13327-outputstream-spec branch from 455203f to 8a2154d Compare October 22, 2019 13:22

HADOOP-13327 review of past comments

75b2f01

Change-Id: I92788063536a4bf15ee28f17187c7e09fa09940f

hadoop-yetus reviewed Nov 1, 2019

View reviewed changes

steveloughran mentioned this pull request Nov 5, 2019

HADOOP-13327 Output Stream Specification. #1694

Closed

	1. The check for existing data in a `create()` call with `overwrite=False`, may
	1. The check for existing data in a `create()` call with `overwrite==false`, may

HADOOP-13327 Output Stream Specification #575

HADOOP-13327 Output Stream Specification #575

Uh oh!

Conversation

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Output stream spec

Stream Model + impl.

Plan

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!