Add tests and support S3 with tensorboard-notf #1663
Conversation
Still have some issues here - stay tuned.

(force-pushed 9be2718 to a3dd9d4)
And issues resolved - ready for commentary and review!

(force-pushed a3dd9d4 to 064679d)
(force-pushed 1d0b91b to 6e66248)
@nfelt (and others) can we get a review on this? It is blocking us on a few things. Much appreciated.
Apologies for the delay getting to this.

I did an initial pass on the code, but I have a higher-level concern about the filesystem concept being introduced - I think the TF gfile interface is too large of an abstraction for us to maintain within TensorBoard. The original PR just had a subset of the actual gfile, but this is starting to introduce new compatibility layers that I'm not comfortable maintaining without at least some test coverage, and there is currently none. My original intention was that the tensorflow_stub code would be eliminated over time in favor of building new tightly-focused abstractions for just the functionality used by TensorBoard.

Concretely, since TensorBoard is almost entirely read-only, we shouldn't need most of the stateful gfile operations (like copy or rename) at all. And I'm also not sure that we even need a "GFile" file object abstraction for reading from disk. From what I can see, basically all uses of such an abstraction could be done via a one-shot read_file_as_string() function, except for the "PyRecordReader" case if we make it more efficient as described - and even that could probably use something like read_file_as_chunked_strings() returning an iterator of fixed-size chunks.

I think that's the right path forward here, rather than doubling down on gfile. I can flesh this out a little in a quick doc or GH issue if that sounds workable to you.
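For illustration, a minimal sketch of what those two proposed functions might look like. The names come from the comment above, but the signatures and the local-disk-only bodies are assumptions; a real version would dispatch to the appropriate filesystem (local, S3, etc.) rather than call open() directly:

```python
def read_file_as_string(filename):
    """Hypothetical one-shot read of an entire file as bytes."""
    with open(filename, "rb") as f:
        return f.read()


def read_file_as_chunked_strings(filename, chunk_size=1024 * 1024):
    """Hypothetical iterator yielding fixed-size chunks of a file."""
    with open(filename, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                return
            yield chunk
```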
```python
if not self.isdir(dirname):
    raise errors.NotFoundError(None, None, "Could not find directory")

# Handle the bytes vs str problem on different OSes
```
We should define our own list_directory() to have better unicode semantics than os.listdir(). I'd prefer to just always return unicode strings and either require that dirname be given as a unicode string, or perhaps autodecode it as UTF-8 if it's not (e.g. use compat.as_text()).
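A minimal sketch of that suggestion, assuming an as_text() helper that decodes bytes as UTF-8 - the helper below is a stand-in for the real compat.as_text(), not the actual compat module:

```python
import os


def as_text(s, encoding="utf-8"):
    # Stand-in for compat.as_text(): always return a unicode string.
    if isinstance(s, bytes):
        return s.decode(encoding)
    return s


def list_directory(dirname):
    # Decode the input up front so os.listdir() consistently returns
    # unicode entries regardless of OS or Python version.
    dirname = as_text(dirname)
    return [as_text(entry) for entry in os.listdir(dirname)]
```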
```python
# assert header_crc_calc == crc_header[0], \
#     'Header crc\'s dont match'
n += 8
crc_header_str = buf[n:n+4]
```
The CRC calculation for the header and the body is the same - it should be factored out into a helper method like read_and_check(buf, offset, num_bytes) -> (data, new_offset).
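A sketch of what that helper could look like, assuming each field in the record is followed by a 4-byte little-endian masked checksum. The masked_crc32c() below is a stand-in for whatever checksum routine the reader already uses (zlib.crc32 is a different polynomial than CRC32C, so it is illustrative only):

```python
import struct
import zlib


def masked_crc32c(data):
    # Stand-in for the reader's real masked CRC32C; same masking
    # arithmetic, but zlib.crc32 is not the Castagnoli polynomial.
    crc = zlib.crc32(data) & 0xFFFFFFFF
    return (((crc >> 15) | (crc << 17)) + 0xA282EAD8) & 0xFFFFFFFF


def read_and_check(buf, offset, num_bytes):
    """Read num_bytes from buf at offset, verify the trailing 4-byte
    CRC, and return (data, new_offset) pointing past the checksum."""
    data = buf[offset:offset + num_bytes]
    offset += num_bytes
    (expected_crc,) = struct.unpack("<I", buf[offset:offset + 4])
    offset += 4
    if masked_crc32c(data) != expected_crc:
        raise ValueError("CRC mismatch while reading record")
    return data, offset
```

Both the header and body reads would then collapse to a single call each.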
@nfelt, great feedback. I agree that simplifying to a read-only (or near read-only) API makes a lot of sense. That being said, are you open to landing this with changes before we simplify things further? Let me know. In the meantime, I'll see if I can address the comments above - thanks.
From an initial audit, I think the minimal subset of the gfile API needed for basic TB functionality would be implementations of:

Scoping it down to just this would help a lot and I think should still get you what you need - it wouldn't support the projector plugin for now, which makes use of gfile.GFile to read file contents, but I'd prefer not to generalize GFile now only to later get rid of it entirely.

The other main concern is testing: if we're adding further code to the stub, it needs test coverage - submitting the stub code without tests was a special case since it was largely derived from TF's code, but not something we can continue doing. (As an aside, we also still need at least some smoke testing that TB works in no-TF mode, since I realized that a recent change migrated our gfile API calls from tf.gfile to the new tf.io.gfile with different symbol names, so the no-TF build is presumably non-functional right now.)

If we can tackle those two within the PR I'm open to merging, though I still think it may be more efficient overall to identify a better medium-term solution and then migrate directly towards that.
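As an illustrative guess only (the exact subset isn't quoted above), a tightly-scoped read-only interface along those lines might look like the following - every method name here is hypothetical:

```python
class ReadOnlyFilesystem(object):
    """Hypothetical minimal read-only filesystem interface."""

    def exists(self, path):
        raise NotImplementedError()

    def isdir(self, path):
        raise NotImplementedError()

    def listdir(self, dirname):
        raise NotImplementedError()

    def glob(self, pattern):
        raise NotImplementedError()

    def stat(self, path):
        raise NotImplementedError()

    def read(self, path):
        raise NotImplementedError()
```

Local and S3 implementations would each subclass this, keeping the surface area far smaller than the full gfile API.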
I agree - unit tests are a big gap right now. If you have any pointers on standard practice, please pass them my way. This will take me a bit longer, but worth the investment.
Thanks! I'd say existing unit tests in the codebase are our best example of standard practice, e.g. something like this test for the io_wrapper functionality: https://github.com/tensorflow/tensorboard/blob/a67fc6c27e5c859eb7135e005eda5259b109638b/tensorboard/backend/event_processing/io_wrapper_test.py

I'm not sure what the best way to test the boto3 API calls would be. From a quick search I found https://github.com/spulec/moto which looked reasonable at first glance.
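A minimal sketch of how a moto-based test could look, assuming moto's mock_s3 decorator; the bucket and key names here are made up:

```python
import unittest

import boto3
from moto import mock_s3


class S3SmokeTest(unittest.TestCase):
    @mock_s3
    def test_round_trip(self):
        # Inside the decorator, boto3 talks to moto's in-memory S3,
        # so no real AWS credentials or network access are needed.
        client = boto3.client("s3", region_name="us-east-1")
        client.create_bucket(Bucket="test-bucket")
        client.put_object(
            Bucket="test-bucket", Key="dir/file.txt", Body=b"hello")
        body = client.get_object(
            Bucket="test-bucket", Key="dir/file.txt")["Body"].read()
        self.assertEqual(body, b"hello")


if __name__ == "__main__":
    unittest.main()
```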
I've cleaned up the GFile interface to be primarily read-only and matching the new tf.io.gfile API.
FYI: something you might need to use moto.
(force-pushed afb65bc to dee44eb)
Rebased, although it looks like some of the refactors might not have played nice... @nfelt, should I pull some of this out of the PR so we can land incrementally? In any case, we now have a robust test suite here.
(force-pushed af0e2e0 to 623eb58)
All tests pass. @nfelt, can we see how to land this sooner rather than later? Seems like things are changing underneath, so hopefully we can get some good tests in place. Thank you!
I just set up a Windows machine and the tests seem to break. It may be because my setup is wrong, though. One of the failures read:
File ".../gifle.py", line 344, in _fill_buffer_to
self.length = fs.stat(self.filename).length
AttributeError: 'NoneType' object has no attribute 'length'
.travis.yml (outdated):

```diff
@@ -125,6 +130,8 @@ script:
   - bazel fetch //tensorboard/...
   - bazel build //tensorboard/...
   - bazel test //tensorboard/...
+  # Run manual S3 test
+  - bazel test //tensorboard/compat/tensorflow_stub:gfile_s3_test
```
Can we achieve this by removing the `manual` tag on the S3 test?
This test needs to be manual since it requires boto3 and we likely shouldn't require that everyone install it. @nfelt, thoughts?
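For context, a Starlark BUILD sketch of the rule under discussion; the target name matches the .travis.yml snippet above, but the srcs and deps are assumptions:

```python
py_test(
    name = "gfile_s3_test",
    srcs = ["gfile_s3_test.py"],
    # "manual" excludes the target from wildcard expansion such as
    # `bazel test //tensorboard/...`, so it only runs when named
    # explicitly - useful when a test needs optional deps like boto3.
    tags = ["manual"],
    deps = ["//tensorboard/compat/tensorflow_stub"],  # assumed dep
)
```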
```python
for subdir in subdirs:
    try:
        joined_subdir = os.path.join(top, subdir)
```
Sorry for my unfamiliarity with boto, but does it handle Windows' os.path.sep?
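The concern, illustrated: os.path.join uses the platform separator, so on Windows it produces backslash-joined "keys" that S3 (which treats keys as flat strings delimited by "/") won't match. A minimal sketch of one way around it, assuming S3 keys always use forward slashes; join_s3_key() is a hypothetical helper:

```python
import posixpath


def join_s3_key(top, subdir):
    # posixpath.join always joins with "/", which is what S3 expects,
    # regardless of the host OS.
    return posixpath.join(top, subdir)

# On Windows: os.path.join("logs", "run1") -> "logs\\run1"
# Everywhere: join_s3_key("logs", "run1")  -> "logs/run1"
```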
Thanks again for adding tests and updating after the rebase, it's looking really good. Comments are mostly small code-level things that I think should be straightforward to address.
```python
from tensorboard.compat.tensorflow_stub.io import gfile

os.environ.setdefault("AWS_ACCESS_KEY_ID", "foobar_key")
```
Are these just needed as placeholders? Presumably the actual values don't matter? A comment of some sort would be useful.
The comment says that these overwrite any local keys, but the code actually keeps the local keys if they're there, only setting the placeholders if the user has no keys. Am I missing something? Shouldn't these just be

```python
os.environ["AWS_ACCESS_KEY_ID"] = "foobar_key"
os.environ["AWS_SECRET_ACCESS_KEY"] = "foobar_secret"
```
(force-pushed 623eb58 to b633513)
(force-pushed b633513 to f6bf852)
```python
raise NotImplementedError(
    "{} not supported by compat glob".format(filename))
if star_i != len(filename) - 1:
    # Just return empty so we can use glob from directory watcher
```
I can see your point here. However, we want to use this for our internal distributed filesystem as well and doing a check for S3 in GetLogdirSubdirectories wouldn't be sufficient. Can we put a TODO here instead of making the larger change?
@stephanwlee can you test this on Windows? Much appreciated.

@stephanwlee confirmed this passed on Windows with all changes. @nfelt do you need anything from me to land this? Thanks.
Thanks for your patience. Just a few remaining issues and then I'll merge this.
```python
# See the License for the specific language governing permissions and
# limitations under the License.
# ==============================================================================
"""File IO methods that wrap the C++ FileSystem API.
```
Can you update the docstring here? It can be as simple as something like "Python implementation of tf.io.gfile supporting multiple file systems."
```python
raise NotImplementedError(
    "{} not supported by compat glob".format(filename))
if star_i != len(filename) - 1:
    # Just return empty so we can use glob from directory watcher
```
Ok, let's put a TODO.
All changes have been made! @nfelt, please take a look - thanks.
Merged! Thanks for bearing with us.
As part of building out tensorboard-notf, we need support for S3 (and our internal filesystems) in that build.
This is a continuation of #1418 that adds robust test coverage and S3 support.
Tested S3 with
Also can run the following tests
cc @nfelt and @lanpa