
Add tests and support S3 with tensorboard-notf #1663

Merged: 7 commits from the add-s3-for-notf branch into tensorflow:master on Feb 22, 2019

Conversation

@orionr commented Dec 5, 2018

As part of building out tensorboard-notf, we need support for S3 (and our internal filesystems) in the tensorboard-notf build.

This is a continuation of #1418 that adds robust test coverage and S3 support.

Tested S3 with

pip install boto3
pip install moto
vi ~/.aws/credentials # setup per https://stackoverflow.com/questions/33297172/boto3-error-botocore-exceptions-nocredentialserror-unable-to-locate-credential
bazel build tensorboard:tensorboard-notf --verbose_failures
./bazel-bin/tensorboard/tensorboard-notf --logdir ~/local/sample_runs
./bazel-bin/tensorboard/tensorboard-notf --logdir s3://escapes/tensorboard-notf/sample_runs/runs/Jul05_11-15-47_limichael-mbp
./bazel-bin/tensorboard/tensorboard-notf --logdir s3://escapes/tensorboard-notf/sample_runs # Note that this will be slow to load

You can also run the following tests:

bazel test //tensorboard/compat/proto:proto_test --test_output=errors
bazel test //tensorboard/compat/tensorflow_stub:gfile_test --test_output=errors
bazel test //tensorboard/compat/tensorflow_stub:gfile_s3_test --test_output=errors

cc @nfelt and @lanpa

@orionr changed the title from "Support S3 and other filesystems with tensorboard-notf" to "[WIP] Support S3 and other filesystems with tensorboard-notf" on Dec 5, 2018

@orionr commented Dec 5, 2018

Still have some issues here - stay tuned.

@orionr force-pushed the add-s3-for-notf branch 2 times, most recently from 9be2718 to a3dd9d4 on December 6, 2018

@orionr changed the title from "[WIP] Support S3 and other filesystems with tensorboard-notf" to "Support S3 and other filesystems with tensorboard-notf" on Dec 6, 2018

@orionr commented Dec 6, 2018

Issues resolved - ready for comments and review!

@orionr force-pushed the add-s3-for-notf branch 2 times, most recently from 1d0b91b to 6e66248 on December 20, 2018

@orionr commented Jan 14, 2019

@nfelt (and others), can we get a review on this? It is blocking some of our work. Much appreciated.

@nfelt left a comment

Apologies for the delay getting to this.

I did an initial pass on the code, but I have a higher-level concern about the filesystem concept being introduced - I think the TF gfile interface is too large of an abstraction for us to maintain within TensorBoard. The original PR just had a subset of the actual gfile, but this is starting to introduce new compatibility layers that I'm not comfortable maintaining without at least some test coverage, and there is currently none. My original intention was that the tensorflow_stub code would be eliminated over time in favor of building new tightly-focused abstractions for just the functionality used by TensorBoard.

Concretely, since TensorBoard is almost entirely read-only, we shouldn't need most of the stateful gfile operations (like copy or rename) at all. And I'm also not sure that we even need a "GFile" file object abstraction for reading from disk. From what I can see, basically all uses of such an abstraction could be done via a one-shot read_file_as_string() function, except for the "PyRecordReader" case if we make it more efficient as described - and even that could probably use something like read_file_as_chunked_strings() returning an iterator of fixed-size chunks.

I think that's the right path forward here, rather than doubling down on gfile. I can flesh this out a little in a quick doc or GH issue if that sounds workable to you.
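For illustration, a minimal sketch of the read-only helpers suggested here might look like the following. The function names come from the comment above; the local-filesystem implementation and the chunk size are purely hypothetical, not code from this PR:

# Hypothetical sketch of the suggested read-only helpers (local files only).

def read_file_to_string(filename, binary_mode=False):
    """Read the whole file in one shot, as suggested above."""
    mode = "rb" if binary_mode else "r"
    with open(filename, mode) as f:
        return f.read()


def read_file_as_chunked_strings(filename, chunk_size=16 * 1024 * 1024):
    """Yield fixed-size chunks, e.g. for the PyRecordReader case."""
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk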

if not self.isdir(dirname):
    raise errors.NotFoundError(None, None, "Could not find directory")

# Handle the bytes vs str problem on different OSes
@nfelt commented

We should define our own list_directory() to have better unicode semantics than os.listdir(). I'd prefer to just always return unicode strings and either require that dirname be given as a unicode string, or perhaps autodecode it as UTF-8 if it's not (e.g. use compat.as_text()).
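A minimal sketch of that behavior, assuming a compat.as_text()-style helper (the _as_text() stand-in below is hypothetical, not the helper in the stub):

# Hypothetical sketch: always return unicode entries from list_directory().
import os


def _as_text(bytes_or_text, encoding="utf-8"):
    # Stand-in for compat.as_text(): decode bytes to unicode, pass str through.
    if isinstance(bytes_or_text, bytes):
        return bytes_or_text.decode(encoding)
    return bytes_or_text


def list_directory(dirname):
    dirname = _as_text(dirname)
    return [_as_text(entry) for entry in os.listdir(dirname)]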

# assert header_crc_calc == crc_header[0], \
# 'Header crc\'s dont match'
n += 8
crc_header_str = buf[n:n+4]
@nfelt commented

The CRC calculation for the header and the body is the same - it should be factored out into a helper method like read_and_check(buf, offset, num_bytes) -> (data, new_offset).
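A rough sketch of such a helper, assuming a masked_crc32c() function like the one TFRecord readers use (the helper name comes from the suggestion above; masked_crc32c() is an assumed dependency):

# Hypothetical sketch of read_and_check(); masked_crc32c() is assumed to
# exist elsewhere (the masked CRC-32C used by the TFRecord format).
import struct


def read_and_check(buf, offset, num_bytes):
    """Read num_bytes at offset and verify the trailing 4-byte masked CRC."""
    data = buf[offset:offset + num_bytes]
    offset += num_bytes
    (expected_crc,) = struct.unpack("<I", buf[offset:offset + 4])
    offset += 4
    if masked_crc32c(data) != expected_crc:
        raise ValueError("CRC mismatch while reading record")
    return data, offset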

@orionr commented Jan 15, 2019

@nfelt, great feedback. I agree that simplifying to a read-only (or near read-only) API makes a lot of sense. That being said, are you open to landing this with changes before we simplify things further? Let me know. In the meantime, I'll see if I can address the comments above - thanks.

@nfelt commented Jan 16, 2019

From an initial audit, I think the minimal subset of the gfile API needed for basic TB functionality would be implementations of:

  • exists()
  • glob()
  • isdir()
  • listdir()
  • stat()
  • walk()
  • read_file_to_string()

Scoping it down to just this would help a lot and I think should still get you what you need - it wouldn't support the projector plugin for now which makes use of gfile.GFile to read file contents, but I'd prefer not to generalize GFile now only to later get rid of it entirely.
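As a rough reference, a local-disk implementation of just that subset could be as small as the sketch below (an illustration only, not the shape of the code in this PR; an S3 backend would dispatch differently):

# Hypothetical local-disk sketch of the minimal read-only subset listed above.
import glob as _glob
import os


class LocalReadOnlyFilesystem(object):

    def exists(self, path):
        return os.path.exists(path)

    def glob(self, pattern):
        return _glob.glob(pattern)

    def isdir(self, path):
        return os.path.isdir(path)

    def listdir(self, dirname):
        return os.listdir(dirname)

    def stat(self, path):
        return os.stat(path)

    def walk(self, top):
        return os.walk(top)

    def read_file_to_string(self, path, binary_mode=False):
        with open(path, "rb" if binary_mode else "r") as f:
            return f.read()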

The other main concern is testing: if we're adding further code to the stub, it needs test coverage - submitting the stub code without tests was a special case since it was largely derived from TF's code, but not something we can continue doing. (As an aside, we also still need at least some smoke testing that TB works in no-TF mode, since I realized that a recent change migrated our gfile API calls from tf.gfile to the new tf.io.gfile with different symbol names, so the no-TF build is presumably non-functional right now.)

If we can tackle those two within the PR I'm open to merging, though I still think it may be more efficient overall to identify a better medium-term solution and then migrate directly towards that.

@orionr commented Jan 17, 2019

I agree - unit tests are a big gap right now. If you have any pointers on standard practice, please pass them my way. This will take me a bit longer, but it's worth the investment.

@nfelt commented Jan 24, 2019

Thanks! I'd say existing unit tests in the codebase are our best example of standard practice, e.g. something like this test for the io_wrapper functionality: https://github.com/tensorflow/tensorboard/blob/a67fc6c27e5c859eb7135e005eda5259b109638b/tensorboard/backend/event_processing/io_wrapper_test.py

I'm not sure what the best way to test the boto3 API calls would be. From a quick search I found https://github.com/spulec/moto which looked reasonable at first glance.
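A hedged sketch of what a moto-backed test could look like, combining moto's mock_s3 decorator with the stub gfile module; the bucket/key names and the exact assertion are made up for illustration:

# Hypothetical moto-backed test sketch; requires `pip install boto3 moto`.
import os
import unittest

import boto3
from moto import mock_s3

from tensorboard.compat.tensorflow_stub.io import gfile

# moto still wants *some* credentials in the environment; placeholders are fine.
os.environ.setdefault("AWS_ACCESS_KEY_ID", "foobar_key")
os.environ.setdefault("AWS_SECRET_ACCESS_KEY", "foobar_secret")


class GFileS3Test(unittest.TestCase):

    @mock_s3
    def test_exists(self):
        client = boto3.client("s3", region_name="us-east-1")
        client.create_bucket(Bucket="test-bucket")
        client.put_object(
            Bucket="test-bucket", Key="run1/events.out", Body=b"data")
        self.assertTrue(gfile.exists("s3://test-bucket/run1/events.out"))


if __name__ == "__main__":
    unittest.main()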

@orionr commented Jan 24, 2019

I've cleaned up the GFile interface to be primarily read-only and to match the new tf.io.gfile location. I was also able to unify reading into one place and am now looking at chunking/buffering. Once I have that I'll push, and then work on tests. Thanks for the follow-up!

@lanpa commented Jan 24, 2019

FYI: something you might need in order to use moto (export ... in the shell, or os.environ[...] in Python): lanpa/tensorboardX@14e35a1

@orionr force-pushed the add-s3-for-notf branch 4 times, most recently from afb65bc to dee44eb on January 31, 2019

@orionr commented Jan 31, 2019

Rebased, although it looks like some of the refactors might not have played nicely... @nfelt, should I pull some of this out of the PR so we can land incrementally? In any case, we now have a robust test suite here.

@orionr changed the title from "Support S3 and other filesystems with tensorboard-notf" to "Add tests and support S3 with tensorboard-notf" on Jan 31, 2019

@orionr commented Feb 1, 2019

All tests pass. @nfelt, can we see how to land this sooner rather than later? Seems like things are changing underneath, so hopefully we can get some good tests in place. Thank you!

@stephanwlee left a comment

I just set up a Windows machine and the tests seem to break. It may be because I have a wrong setup, though. One of the failures reads:

  File ".../gifle.py", line 344, in _fill_buffer_to
    self.length = fs.stat(self.filename).length
AttributeError: 'NoneType' object has no attribute 'length'

.travis.yml (outdated)
@@ -125,6 +130,8 @@ script:
 - bazel fetch //tensorboard/...
 - bazel build //tensorboard/...
 - bazel test //tensorboard/...
+# Run manual S3 test
+- bazel test //tensorboard/compat/tensorflow_stub:gfile_s3_test
@stephanwlee commented

Can we achieve this by removing the "manual" tag from the S3 test?

@orionr commented

This test needs to be manual since it requires boto3 and we likely shouldn't require that everyone install it. @nfelt, thoughts?


for subdir in subdirs:
    try:
        joined_subdir = os.path.join(top, subdir)
@stephanwlee commented

Sorry for my unfamiliarity with boto, but does it handle Windows' os.path.sep?
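To illustrate the concern (a hypothetical sketch, not code from this PR): os.path.join uses the platform separator, which is "\" on Windows, while S3 object keys are always "/"-separated, so one common workaround is to join S3 paths with posixpath instead:

# Hypothetical illustration: keep "/" separators for S3 keys on every OS.
import os
import posixpath


def join_path(top, *parts):
    if top.startswith("s3://"):
        return posixpath.join(top, *parts)  # always "/"
    return os.path.join(top, *parts)  # platform separator for local paths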

@nfelt left a comment

Thanks again for adding tests and updating after the rebase, it's looking really good. Comments are mostly small code-level things that I think should be straightforward to address.


from tensorboard.compat.tensorflow_stub.io import gfile

os.environ.setdefault("AWS_ACCESS_KEY_ID", "foobar_key")
@nfelt commented

Are these just needed as placeholders? Presumably the actual values don't matter? A comment of some sort would be useful.

@nfelt commented

The comment says that these overwrite any local keys, but the code actually keeps the local keys if they're there, only setting the placeholders if the user has no keys. Am I missing something? Shouldn't these just be

os.environ["AWS_ACCESS_KEY_ID"] = "foobar_key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "foobar_secret"

@orionr commented Feb 12, 2019

@nfelt, I should have addressed all your concerns here. Let me know if I missed anything. Also, I moved the plugin tests themselves (so we can discuss them more) to #1829. Thanks.

raise NotImplementedError(
    "{} not supported by compat glob".format(filename))
if star_i != len(filename) - 1:
    # Just return empty so we can use glob from directory watcher
@orionr commented

I can see your point here. However, we want to use this for our internal distributed filesystem as well and doing a check for S3 in GetLogdirSubdirectories wouldn't be sufficient. Can we put a TODO here instead of making the larger change?
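For context, a hedged sketch of the restricted glob behavior being discussed; exists() and list_prefix() below are hypothetical helpers, and the single-trailing-"*" limitation mirrors the snippet quoted above:

# Hypothetical sketch of a compat glob that only supports one trailing "*".
def compat_glob(filename):
    star_i = filename.find("*")
    if star_i < 0:
        # No wildcard: behave like an existence check.
        return [filename] if exists(filename) else []
    if filename.count("*") > 1:
        raise NotImplementedError(
            "{} not supported by compat glob".format(filename))
    if star_i != len(filename) - 1:
        # TODO: support wildcards that are not the final character.
        # Just return empty so we can use glob from directory watcher.
        return []
    # Trailing "*": everything sharing the prefix matches.
    return list_prefix(filename[:star_i])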

@orionr commented Feb 13, 2019

@stephanwlee can you test this on Windows? Much appreciated.

@orionr commented Feb 20, 2019

@stephanwlee confirmed this passed on Windows with all changes. @nfelt do you need anything from me to land this? Thanks.

@nfelt left a comment

Thanks for your patience. Just a few remaining issues and then I'll merge this.

# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""File IO methods that wrap the C++ FileSystem API.
@nfelt commented

Can you update the docstring here? It can be as simple as something like "Python implementation of tf.io.gfile supporting multiple file systems."

raise NotImplementedError(
    "{} not supported by compat glob".format(filename))
if star_i != len(filename) - 1:
    # Just return empty so we can use glob from directory watcher
@nfelt commented

Ok, let's put a TODO.

@orionr commented Feb 22, 2019

All changes should have been made! @nfelt please take a look and thanks.

@nfelt merged commit a493612 into tensorflow:master on Feb 22, 2019
@nfelt commented Feb 22, 2019

Merged! Thanks for bearing with us.
