Improve capabilities of CNV collections classes. #5716

Closed

samuelklee opened this issue Feb 25, 2019 · 8 comments

@samuelklee (Contributor) commented Feb 25, 2019

In order of priority:

  1. The ability to query and/or stream intervals for locatable collections might reduce the overhead of file localization in the germline workflows: even though we only run GermlineCNVCaller on a subset of intervals in any particular shard, we currently localize the entire read-count file. This could be enabled in the parent class to benefit all locatable files, but since it will probably require indexing, we should use it only when necessary.
  2. Memory requirements for some tools could be reduced by streaming records directly instead of first building an internally held intermediate list (see the sketch after this list).
  3. NIO streaming of entire files to/from buckets could easily be added to the relevant CSV/HDF5 read/write classes. Apart from the first item, I don't think this adds much, since the largest files are only ~1GB (and most seg files are much smaller) and are typically cheap to localize for single samples.
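
As a minimal sketch of items 2 and 3 (with a hypothetical `CountRecord` type and parsing; the real SimpleCount/SimpleCountCollection classes differ), the point is just that `java.nio.file.Path` plus `java.util.stream` lets us consume records lazily, and the same code would read from a bucket if an NIO filesystem provider for that scheme is on the classpath:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public final class LazyCountStreaming {
    // Hypothetical stand-in for SimpleCount, just to keep the sketch self-contained.
    public record CountRecord(String contig, int start, int end, int count) {}

    // Lazily stream records from a tab-separated counts file without building an intermediate list.
    // If an NIO filesystem provider for the bucket scheme (e.g. gs://) is installed, the same call
    // works for a remote Path constructed via that provider.
    public static Stream<CountRecord> streamCounts(final Path countsPath) {
        try {
            return Files.lines(countsPath)
                    .filter(line -> !line.startsWith("@") && !line.startsWith("CONTIG")) // skip SAM-style header and column names
                    .map(LazyCountStreaming::parse);
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static CountRecord parse(final String line) {
        final String[] f = line.split("\t");
        return new CountRecord(f[0], Integer.parseInt(f[1]), Integer.parseInt(f[2]), Integer.parseInt(f[3]));
    }

    public static void main(final String[] args) {
        // Example: total the counts on chr1 without ever holding the full collection in memory.
        try (final Stream<CountRecord> counts = streamCounts(Paths.get(args[0]))) {
            final long total = counts.filter(r -> r.contig().equals("chr1"))
                                     .mapToLong(CountRecord::count)
                                     .sum();
            System.out.println("chr1 total: " + total);
        }
    }
}
```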

See also #3976, #4004, #4717, and #5715 for context.

I think we should first demonstrate whether file localization (the first item) really dominates cost in the germline pipeline. If not, we should focus on optimizing inference instead. The other items are much lower priority.

@samuelklee (Contributor Author) commented Feb 28, 2019

@vruano @mwalker174 Do you have any cost estimates that could help determine the priority of item 1? Specifically, is the disk cost required to localize the entire count file for all samples a determining factor? If we can drastically reduce this cost, then we can dedicate more of the budget to increasing resolution, etc.

Here is a minimal set of fixes that could enable interval querying for GermlineCNVCaller (and, without much extra work, for DetermineGermlineContigPloidy, since we also subset intervals there) in the gCNV WGS pipeline only, without disrupting other interfaces:

  1. Write a Tribble SimpleCountCodec for the counts.tsv extension. I've already done this in a branch.
  2. Change GermlineCNVCaller and DetermineGermlineContigPloidy tools to accept paths.
  3. If an index is present for each count path, create a FeatureDataSource, merge the requested -L/-XL intervals, and query to perform the subset (a rough sketch follows this list). We will also need to stream the SAM header metadata. It should not require much code to extract all of this into a temporary IndexedSimpleCountCollection class. (Caveat: for now, this will work with the current gCNV convention of providing bins via -L/-XL. Technically, it will also work with the more conventional use of -L/-XL to denote contiguous regions, but we may have to check that bins overlapping adjacent shards are not duplicated, since querying a FeatureDataSource returns any bins that overlap the interval, rather than only those that are completely contained within it.)
  4. Index read-count TSVs in the gCNV WGS pipeline after collection and modify the DetermineGermlineContigPloidy and GermlineCNVCaller tasks to take read-count paths and indices, if necessary. These changes could be confined to the gCNV WGS WDL for now.
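
Roughly how item 3 might look, as a sketch rather than a final implementation: it assumes GATK's FeatureDataSource and SimpleInterval, that SimpleCount has been promoted to a Tribble Feature with a SimpleCountCodec registered for the counts.tsv extension (item 1), and that an index sits next to each counts file; exact package and method names here are from memory and may need adjusting.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.broadinstitute.hellbender.engine.FeatureDataSource;
import org.broadinstitute.hellbender.tools.copynumber.formats.records.SimpleCount;
import org.broadinstitute.hellbender.utils.SimpleInterval;

public final class IndexedCountQuery {
    // Query an indexed counts file for the records overlapping the given (already merged) -L/-XL intervals.
    public static List<SimpleCount> queryCounts(final String countsPath,
                                                final List<SimpleInterval> mergedIntervals) {
        final List<SimpleCount> result = new ArrayList<>();
        try (final FeatureDataSource<SimpleCount> source = new FeatureDataSource<>(countsPath)) {
            for (final SimpleInterval interval : mergedIntervals) {
                // query returns all bins overlapping the interval, not only those fully contained in it,
                // hence the caveat above about de-duplicating bins that straddle adjacent shards.
                final Iterator<SimpleCount> bins = source.query(interval);
                bins.forEachRemaining(result::add);
            }
        }
        return result;
    }
}
```

In a tool, the returned records plus the streamed SAM header would then back the temporary IndexedSimpleCountCollection mentioned above.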

I think that should do the trick. If this is high priority, I can implement now.

In the future, we might be able to promote all Locatable CNV Records to Features and write code to automatically pass the columns/encoders/decoders (currently held in the Collection corresponding to each Record) to a single Tribble codec. This codec should not depend upon the file extension.
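
To make that idea concrete, here is a rough, purely hypothetical sketch (plain JDK types, not the actual Tribble interfaces) of a single decoder parameterized by the column names and record decoder that each Collection already holds; a real implementation would wrap something like this in one Tribble codec whose canDecode checks the header rather than the file extension.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of a single, extension-agnostic decoder driven by the same
// column names and line-decoding function that each CNV Collection already holds for its Record type.
public final class GenericRecordDecoder<RECORD> {
    private final List<String> columns;                        // mandatory column names, as held by the Collection
    private final Function<String[], RECORD> recordFromFields; // the Collection's record decoder

    public GenericRecordDecoder(final List<String> columns,
                                final Function<String[], RECORD> recordFromFields) {
        this.columns = columns;
        this.recordFromFields = recordFromFields;
    }

    // Validate the observed column header against the expected columns.
    public void validateHeader(final String headerLine) {
        final List<String> observed = List.of(headerLine.split("\t"));
        if (!observed.equals(columns)) {
            throw new IllegalArgumentException("Unexpected columns: " + observed + ", expected: " + columns);
        }
    }

    // Decode a single data line into a record using the Collection-supplied decoder.
    public RECORD decode(final String line) {
        return recordFromFields.apply(line.split("\t"));
    }
}
```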

@samuelklee (Contributor Author) commented:
Hacked together a proof-of-principle in sl_indexed_counts. This enables streaming of indexed counts (currently triggered when they are specified by a bucket path, although we should change this to trigger on the presence of an index) for only the requested intervals in the gCNV step, which would cut disk costs to practically zero. We could enable this in the contig-ploidy step as well, although the cost benefit is not as great there, since that tool is short-running.

The relevant code clocks in at around +250 lines and could be made even smaller after some cleanup and extraction, but we'd need to update the documentation and the WDLs. All previous behavior is preserved.

@samuelklee (Contributor Author) commented:
@mwalker174 is this something we want to address before my leave?

@mwalker174 (Contributor) commented:
@samuelklee Let's try to get your sl_indexed_counts branch in since it will have a substantial impact on the cost of the clinical SV pipeline. It's fine if it just streams from a bucket, but we do want to update the WDLs to take advantage.

@samuelklee (Contributor Author) commented:
@mwalker174 Once you have the disk-cost estimate, maybe leave a note here for posterity?

@mwalker174 (Contributor) commented:
In the SV pipeline, for ~100 samples in case mode the cost breakdown was:

memory: $1.04
disk: $1.50
cpu: $4.59

It seems that we are over-provisioning for disk: the amount of disk actually used was < 5 GB (only ~$0.14 worth), but the default is 150 GB. So it doesn't seem that we would save much by streaming, but we should lower the default disk size in the WDL.

@samuelklee (Contributor Author) commented:
As we just discussed at our gCNV meeting, this disk cost is for 2kbp bins. So once 100bp bins are back on the table, we'll be at 20 x $0.14 = $2.80. In the meantime, no reason not to recoup that ~$1.36! Maybe even dump it into memory and decrease the number of shards accordingly?

For now, we should probably just set the appropriate runtime parameter in the clinical SV pipeline, but at some point (maybe when we update to WDL 1.0) we should go back and put more intelligent disk-size estimation into the CNV WDLs (see the sketch below for the shape of the calculation). Spec Ops did this for the somatic WDLs (where it probably doesn't make much of a difference), but we never did anything special for the gCNV WDLs.
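
For reference, a hedged sketch of the shape of that estimation (the real thing would live in the WDL runtime block, using size() and ceil() on the task inputs; the safety factor and baseline here are made-up numbers):

```java
public final class DiskSizeEstimate {
    // Request the inputs (scaled by a safety factor for intermediates) plus a small fixed baseline,
    // instead of a flat 150 GB default.
    public static int estimateDiskGb(final double totalInputGb, final double safetyFactor, final int baselineGb) {
        return (int) Math.ceil(totalInputGb * safetyFactor) + baselineGb;
    }

    public static void main(final String[] args) {
        // Roughly the usage observed above: ~100 samples totalling ~5 GB of read counts.
        System.out.println(estimateDiskGb(5.0, 2.0, 10)); // prints 20, versus the 150 GB default
    }
}
```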

@mwalker174 (Contributor) commented:
The bulk of this issue was addressed in #6266. I believe @samuelklee also did some detailed profiling and determined that the memory bottlenecks were inference-related.
