Skip to content

Add in-memory I/O #417

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from
Closed

Conversation

pkrusche
Copy link

These changes allow to do in-memory I/O as described during the last GA4GH fileformats meeting.

The idea is to add a backend which can read/write in-memory buffers (e.g. for multi-threading or to work around file handle limits).

Here are slides some explaining how this works and what it is useful for:

https://docs.google.com/presentation/d/1xTcTFrWuDYqMgSIEk3qEucsRGH2JaElgpYeYd9IGvbE/edit?usp=sharing

It would be great to have this functionality bundled with htslib!


if(!fp->buffer_is_mine)
{
fprintf(stderr, "[E::mem_file] Cannot write to %s -- I don't own the buffer and can only read.\n",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This low level of htslib reports errors via errno rather than polluting stderr. So use errno = EINVAL; or similar, and maybe log the explanation with if (hts_verbose >= 5) or so if that failure is going to be confusing to the programmer.

{
new_buffer_size = (fp->offset + nbytes + 1023) & round_mask;
tmp = realloc(fp->buffer, new_buffer_size) ;
if(!tmp)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, realloc failure has already set errno to ENOMEM, so all this needs is if (tmp == NULL) return -1;.

}

static const struct hFILE_backend mem_backend = {
mem_read, mem_write, mem_seek, mem_flush, mem_close
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can just put NULL in the flush slot instead of having a dummy mem_flush() function.

size_t len;

const char *realfilename = strchr(filename, ':') + 1;
if(!realfilename)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Too late! You've already added 1 to NULL.

@jmarshall
Copy link
Member

Thanks for sharing the code, the details make what you were describing at the formats meeting clearer 😄 I've made a few specific comments on the code, but here's some more high-level comments.

It appears that hopen_mem() adds support for three separate things:

  1. The use-externally-provided-memory-buffer @filename functionality discussed at the meeting;
  2. Reading in a local file as per the hFILE_fd backend, but loading the whole file into memory during hopen() rather than holding a file descriptor open while the resulting hFILE is open;
  3. Creating a new file in a memory buffer (though this seems incomplete — I half-expected it to write the file out during mem_close(), but perhaps this is to be used with hfile_mem_get_buffer()?).

For another backend (data:), I'm considering adding support to base hFILE's buffering for a “fixed buffer” wherein the base hFILE's buffer/begin/end/limit contains the entire file contents rather than using backend read/write methods to fill and empty it. This simplifies some things and saves pointlessly memcopying between the backend in-memory buffer and the base buffer. So if hFILE_fd gained an option to do this, that would provide the (2) and possibly also (3) functionality, leaving a new hFILE_mem free to concentrate on (1).

@pkrusche
Copy link
Author

pkrusche commented Oct 7, 2016

Thanks for the feedback! I have added some changes based on your code review, please let me know if these are ok or need more work.

About the use cases:

(2) / reading a file into memory and closing the handle could probably be removed from this would duplicate the data backend functionality.

(3) The idea is to retrieve the buffer afterwards and not write things to disk -- the application for this is to have multiple threads write parts of a VCF/BCF file which stay in memory and can then be concatenated and written out in order as they become available.

@jmarshall
Copy link
Member

The “fixed buffer” construct and a varargs form of hopen() have since landed in HTSlib. This means that the functionality desired in this PR could be implemented perhaps more elegantly and maintainably using these facilities. Thus:

  1. Open a file using an externally-provided-memory-buffer via something like

     hopen("mem:", "r:", "buffer", ptr, len)
    

    which would wrap a fixed-buffer hFILE around the provided buffer.

  2. Preloading a file could be arranged with hopen(filename, "r:", "preload", 1) or so, which would tell hopen_fd (or a new function alongside it) to create a fixed buffer instead of the current mobile buffer.

These two I think provide a better interface to functionality similar to the PR's proposed lookup_function stuff.

  1. Retrieving a buffer after using an hFILE to write to it: it seems like the best way to implement this would be to fully implement writing for fixed buffers. Then something like

     hopen("mem:", "w:", "buffer-ref", &ptr, &len)
    

would tell the fixed-buffer hFILE to (also) update these pointers to the backing buffer; or an ioctl-like function like hfile_mem_get_buffer() could report them after the fact — it's unclear which is uglier. There are buffer ownership issues to be worked out — should the buffer be observable all the time, or only after hflush(), or only after hclose()?

@pkrusche
Copy link
Author

pkrusche commented Jun 7, 2017

In terms of observing the buffer, I think that while ugly, a solution with hfile_mem_get_buffer might be a good solution -- this function could make sure the buffer is observable until the next write (when it might need to be reallocated).

Maybe having buffers as separate entities would solve the question of ownership, i.e. having something like hfile_mem_free_buffer to free the buffer after the file was closed (s.t. users can call hts_close, which might do non-trivial flushing / writing before calling hclose?). Another idea could be a hclose callback that defaults to free()?

@jkbonfield
Copy link
Contributor

I haven't investigated this yet so maybe it's already there, but also consider the setvbuf interface from stdio. Normally fopen/fclose allocates and deallocates its own buffers, but it is possible to supply your own in which case memory management is up to you.

@ThomasHickman
Copy link
Contributor

Hi @pkrusche, I'm working on something that could do with this in htslib. What's the status of this pull request and is there anything that I could do to help with getting this merged?

ThomasHickman added a commit to wtsi-hgi/htslib that referenced this pull request Sep 8, 2017
Implements what is proposed in samtools#417
ThomasHickman added a commit to wtsi-hgi/htslib that referenced this pull request Sep 11, 2017
Implements what is proposed in samtools#417
@jkbonfield
Copy link
Contributor

jkbonfield commented Dec 14, 2017

Closed as replaced by #590 (now merged).

@jkbonfield jkbonfield closed this Dec 14, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants