
Add tests and support S3 with tensorboard-notf #1663

Merged: 7 commits from the add-s3-for-notf branch into tensorflow:master on Feb 22, 2019

Conversation

@orionr commented Dec 5, 2018

As part of building out tensorboard-notf, we need support for S3 (and our internal filesystems) in the tensorboard-notf build.

This is a continuation of #1418 that adds robust test coverage and S3 support.

Tested S3 with

pip install boto3
pip install moto
vi ~/.aws/credentials # setup per https://stackoverflow.com/questions/33297172/boto3-error-botocore-exceptions-nocredentialserror-unable-to-locate-credential
bazel build tensorboard:tensorboard-notf --verbose_failures
./bazel-bin/tensorboard/tensorboard-notf --logdir ~/local/sample_runs
./bazel-bin/tensorboard/tensorboard-notf --logdir s3://escapes/tensorboard-notf/sample_runs/runs/Jul05_11-15-47_limichael-mbp
./bazel-bin/tensorboard/tensorboard-notf --logdir s3://escapes/tensorboard-notf/sample_runs # Note that this will be slow to load

You can also run the following tests:

bazel test //tensorboard/compat/proto:proto_test --test_output=errors
bazel test //tensorboard/compat/tensorflow_stub:gfile_test --test_output=errors
bazel test //tensorboard/compat/tensorflow_stub:gfile_s3_test --test_output=errors

cc @nfelt and @lanpa

@orionr changed the title from "Support S3 and other filesystems with tensorboard-notf" to "[WIP] Support S3 and other filesystems with tensorboard-notf" on Dec 5, 2018

@orionr commented Dec 5, 2018

Still have some issues here - stay tuned.

@orionr force-pushed the add-s3-for-notf branch 2 times, most recently from 9be2718 to a3dd9d4 on December 6, 2018

@orionr changed the title from "[WIP] Support S3 and other filesystems with tensorboard-notf" to "Support S3 and other filesystems with tensorboard-notf" on Dec 6, 2018

@orionr commented Dec 6, 2018

Issues resolved - ready for comments and review!

@orionr force-pushed the add-s3-for-notf branch 2 times, most recently from 1d0b91b to 6e66248 on December 20, 2018

@orionr commented Jan 14, 2019

@nfelt (and others), can we get a review on this? It is blocking some of our work. Much appreciated.

@nfelt left a comment

Apologies for the delay getting to this.

I did an initial pass on the code, but I have a higher-level concern about the filesystem concept being introduced - I think the TF gfile interface is too large of an abstraction for us to maintain within TensorBoard. The original PR just had a subset of the actual gfile, but this is starting to introduce new compatibility layers that I'm not comfortable maintaining without at least some test coverage, and there is currently none. My original intention was that the tensorflow_stub code would be eliminated over time in favor of building new tightly-focused abstractions for just the functionality used by TensorBoard.

Concretely, since TensorBoard is almost entirely read-only, we shouldn't need most of the stateful gfile operations (like copy or rename) at all. And I'm also not sure that we even need a "GFile" file object abstraction for reading from disk. From what I can see, basically all uses of such an abstraction could be done via a one-shot read_file_as_string() function, except for the "PyRecordReader" case if we make it more efficient as described - and even that could probably use something like read_file_as_chunked_strings() returning an iterator of fixed-size chunks.

I think that's the right path forward here, rather than doubling down on gfile. I can flesh this out a little in a quick doc or GH issue if that sounds workable to you.
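For illustration, a minimal sketch of the read-only helpers suggested here might look like the following. The function names come from the comment above; the local-filesystem implementation and the chunk size are purely hypothetical, not code from this PR:

# Hypothetical sketch of the suggested read-only helpers (local files only).

def read_file_to_string(filename, binary_mode=False):
    """Read the whole file in one shot, as suggested above."""
    mode = "rb" if binary_mode else "r"
    with open(filename, mode) as f:
        return f.read()


def read_file_as_chunked_strings(filename, chunk_size=16 * 1024 * 1024):
    """Yield fixed-size chunks, e.g. for the PyRecordReader case."""
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk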

if not self.isdir(dirname):
    raise errors.NotFoundError(None, None, "Could not find directory")

# Handle the bytes vs str problem on different OSes
@nfelt commented

We should define our own list_directory() to have better unicode semantics than os.listdir(). I'd prefer to just always return unicode strings and either require that dirname be given as a unicode string, or perhaps autodecode it as UTF-8 if it's not (e.g. use compat.as_text()).
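A minimal sketch of that behavior, assuming a compat.as_text()-style helper (the _as_text() stand-in below is hypothetical, not the helper in the stub):

# Hypothetical sketch: always return unicode entries from list_directory().
import os


def _as_text(bytes_or_text, encoding="utf-8"):
    # Stand-in for compat.as_text(): decode bytes to unicode, pass str through.
    if isinstance(bytes_or_text, bytes):
        return bytes_or_text.decode(encoding)
    return bytes_or_text


def list_directory(dirname):
    dirname = _as_text(dirname)
    return [_as_text(entry) for entry in os.listdir(dirname)]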

# assert header_crc_calc == crc_header[0], \
# 'Header crc\'s dont match'
n += 8
crc_header_str = buf[n:n+4]
@nfelt commented

The CRC calculation for the header and the body is the same - it should be factored out into a helper method like read_and_check(buf, offset, num_bytes) -> (data, new_offset).
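A rough sketch of such a helper, assuming a masked_crc32c() function like the one TFRecord readers use (the helper name comes from the suggestion above; masked_crc32c() is an assumed dependency):

# Hypothetical sketch of read_and_check(); masked_crc32c() is assumed to
# exist elsewhere (the masked CRC-32C used by the TFRecord format).
import struct


def read_and_check(buf, offset, num_bytes):
    """Read num_bytes at offset and verify the trailing 4-byte masked CRC."""
    data = buf[offset:offset + num_bytes]
    offset += num_bytes
    (expected_crc,) = struct.unpack("<I", buf[offset:offset + 4])
    offset += 4
    if masked_crc32c(data) != expected_crc:
        raise ValueError("CRC mismatch while reading record")
    return data, offset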

@orionr commented Jan 15, 2019

@nfelt, great feedback. I agree that simplifying to a read-only (or near read-only) API makes a lot of sense. That being said, are you open to landing this with changes before we simplify things further? Let me know. In the meantime, I'll see if I can address the comments above - thanks.

@nfelt commented Jan 16, 2019

From an initial audit, I think the minimal subset of the gfile API needed for basic TB functionality would be implementations of:

  • exists()
  • glob()
  • isdir()
  • listdir()
  • stat()
  • walk()
  • read_file_to_string()

Scoping it down to just this would help a lot and I think should still get you what you need - it wouldn't support the projector plugin for now which makes use of gfile.GFile to read file contents, but I'd prefer not to generalize GFile now only to later get rid of it entirely.
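As a rough reference, a local-disk implementation of just that subset could be as small as the sketch below (an illustration only, not the shape of the code in this PR; an S3 backend would dispatch differently):

# Hypothetical local-disk sketch of the minimal read-only subset listed above.
import glob as _glob
import os


class LocalReadOnlyFilesystem(object):

    def exists(self, path):
        return os.path.exists(path)

    def glob(self, pattern):
        return _glob.glob(pattern)

    def isdir(self, path):
        return os.path.isdir(path)

    def listdir(self, dirname):
        return os.listdir(dirname)

    def stat(self, path):
        return os.stat(path)

    def walk(self, top):
        return os.walk(top)

    def read_file_to_string(self, path, binary_mode=False):
        with open(path, "rb" if binary_mode else "r") as f:
            return f.read()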

The other main concern is testing: if we're adding further code to the stub, it needs test coverage - submitting the stub code without tests was a special case since it was largely derived from TF's code, but not something we can continue doing. (As an aside, we also still need at least some smoke testing that TB works in no-TF mode, since I realized that a recent change migrated our gfile API calls from tf.gfile to the new tf.io.gfile with different symbol names, so the no-TF build is presumably non-functional right now.)

If we can tackle those two within the PR I'm open to merging, though I still think it may be more efficient overall to identify a better medium-term solution and then migrate directly towards that.

@orionr commented Jan 17, 2019

I agree - unit tests are a big gap right now. If you have any pointers on standard practice, please pass them my way. This will take me a bit longer, but it's worth the investment.

@nfelt commented Jan 24, 2019

Thanks! I'd say existing unit tests in the codebase are our best example of standard practice, e.g. something like this test for the io_wrapper functionality: https://github.com/tensorflow/tensorboard/blob/a67fc6c27e5c859eb7135e005eda5259b109638b/tensorboard/backend/event_processing/io_wrapper_test.py

I'm not sure what the best way to test the boto3 API calls would be. From a quick search I found https://github.com/spulec/moto which looked reasonable at first glance.
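A hedged sketch of what a moto-backed test could look like, combining moto's mock_s3 decorator with the stub gfile module; the bucket/key names and the exact assertion are made up for illustration:

# Hypothetical moto-backed test sketch; requires `pip install boto3 moto`.
import os
import unittest

import boto3
from moto import mock_s3

from tensorboard.compat.tensorflow_stub.io import gfile

# moto still wants *some* credentials in the environment; placeholders are fine.
os.environ.setdefault("AWS_ACCESS_KEY_ID", "foobar_key")
os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "foobar_secret")


class GFileS3Test(unittest.TestCase):

    @mock_s3
    def test_exists(self):
        client = boto3.client("s3", region_name="us-east-1")
        client.create_bucket(Bucket="test-bucket")
        client.put_object(
            Bucket="test-bucket", Key="run1/events.out", Body=b"data")
        self.assertTrue(gfile.exists("s3://test-bucket/run1/events.out"))


if __name__ == "__main__":
    unittest.main()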

@orionr commented Jan 24, 2019

I've cleaned up the GFile interface to be primarily read-only and to match the new tf.io.gfile location. I was also able to unify reading into one place and am now looking at chunking/buffering. Once I have that I'll push, and then work on tests. Thanks for the follow-up!

@lanpa commented Jan 24, 2019

FYI: something you might need in order to use moto (export ... in the shell, or os.environ[...] in Python): lanpa/tensorboardX@14e35a1

@orionr force-pushed the add-s3-for-notf branch 4 times, most recently from afb65bc to dee44eb on January 31, 2019

@orionr commented Jan 31, 2019

Rebased, although it looks like some of the refactors might not have played nicely... @nfelt, should I pull some of this out of the PR so we can land incrementally? In any case, we now have a robust test suite here.

@orionr changed the title from "Support S3 and other filesystems with tensorboard-notf" to "Add tests and support S3 with tensorboard-notf" on Jan 31, 2019

@orionr commented Feb 1, 2019

All tests pass. @nfelt, can we see how to land this sooner rather than later? Seems like things are changing underneath, so hopefully we can get some good tests in place. Thank you!

@stephanwlee left a comment

I just set up a Windows machine and the tests seem to break. It may be because I have a wrong setup, though. One of the failures reads:

  File ".../gifle.py", line 344, in _fill_buffer_to
    self.length = fs.stat(self.filename).length
AttributeError: 'NoneType' object has no attribute 'length'

.travis.yml (outdated)
@@ -125,6 +130,8 @@ script:
 - bazel fetch //tensorboard/...
 - bazel build //tensorboard/...
 - bazel test //tensorboard/...
+# Run manual S3 test
+- bazel test //tensorboard/compat/tensorflow_stub:gfile_s3_test
@stephanwlee commented

Can we achieve this by removing the "manual" tag from the S3 test?

@orionr commented

This test needs to be manual since it requires boto3 and we likely shouldn't require that everyone install it. @nfelt, thoughts?


for subdir in subdirs:
    try:
        joined_subdir = os.path.join(top, subdir)
@stephanwlee commented

Sorry for my unfamiliarity with boto, but does it handle Windows' os.path.sep?
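To illustrate the concern (a hypothetical sketch, not code from this PR): os.path.join uses the platform separator, which is "\" on Windows, while S3 object keys are always "/"-separated, so one common workaround is to join S3 paths with posixpath instead:

# Hypothetical illustration: keep "/" separators for S3 keys on every OS.
import os
import posixpath


def join_path(top, *parts):
    if top.startswith("s3://"):
        return posixpath.join(top, *parts)  # always "/"
    return os.path.join(top, *parts)  # platform separator for local paths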

@nfelt left a comment

Thanks again for adding tests and updating after the rebase, it's looking really good. Comments are mostly small code-level things that I think should be straightforward to address.


from tensorboard.compat.tensorflow_stub.io import gfile

os.environ.setdefault("AWS_ACCESS_KEY_ID", "foobar_key")
@nfelt commented

Are these just needed as placeholders? Presumably the actual values don't matter? A comment of some sort would be useful.

@nfelt commented

The comment says that these overwrite any local keys, but the code actually keeps the local keys if they're there, only setting the placeholders if the user has no keys. Am I missing something? Shouldn't these just be

os.environ["AWS_ACCESS_KEY_ID"] = "foobar_key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "foobar_secret"

@orionr commented Feb 12, 2019

@nfelt, I should have addressed all your concerns here. Let me know if I missed anything. Also, I moved the plugin tests themselves (so we can discuss them more) to #1829. Thanks.

raise NotImplementedError(
    "{} not supported by compat glob".format(filename))
if star_i != len(filename) - 1:
    # Just return empty so we can use glob from directory watcher
@orionr commented

I can see your point here. However, we want to use this for our internal distributed filesystem as well and doing a check for S3 in GetLogdirSubdirectories wouldn't be sufficient. Can we put a TODO here instead of making the larger change?
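For context, a hedged sketch of the restricted glob behavior being discussed; exists() and list_prefix() below are hypothetical helpers, and the single-trailing-"*" limitation mirrors the snippet quoted above:

# Hypothetical sketch of a compat glob that only supports one trailing "*".
def compat_glob(filename):
    star_i = filename.find("*")
    if star_i < 0:
        # No wildcard: behave like an existence check.
        return [filename] if exists(filename) else []
    if filename.count("*") > 1:
        raise NotImplementedError(
            "{} not supported by compat glob".format(filename))
    if star_i != len(filename) - 1:
        # TODO: support wildcards that are not the final character.
        # Just return empty so we can use glob from directory watcher.
        return []
    # Trailing "*": everything sharing the prefix matches.
    return list_prefix(filename[:star_i])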

@orionr commented Feb 13, 2019

@stephanwlee can you test this on Windows? Much appreciated.

@orionr commented Feb 20, 2019

@stephanwlee confirmed this passed on Windows with all changes. @nfelt do you need anything from me to land this? Thanks.

@nfelt left a comment

Thanks for your patience. Just a few remaining issues and then I'll merge this.

# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""File IO methods that wrap the C++ FileSystem API.
@nfelt commented

Can you update the docstring here? It can be as simple as something like "Python implementation of tf.io.gfile supporting multiple file systems."

raise NotImplementedError(
    "{} not supported by compat glob".format(filename))
if star_i != len(filename) - 1:
    # Just return empty so we can use glob from directory watcher
@nfelt commented

Ok, let's put a TODO.

@orionr commented Feb 22, 2019

All changes should have been made! @nfelt please take a look and thanks.

@nfelt merged commit a493612 into tensorflow:master on Feb 22, 2019
@nfelt commented Feb 22, 2019

Merged! Thanks for bearing with us.
