Improve capabilities of CNV collections classes. #5716

Closed

samuelklee opened this issue Feb 25, 2019 · 8 comments

@samuelklee (Contributor) commented Feb 25, 2019

In order of priority:

  1. The ability to query and/or stream intervals for locatable collections might reduce the overhead of file localization in the germline workflows: even though we only run GermlineCNVCaller on a subset of intervals in any particular shard, we currently localize the entire read-count file. This could be enabled in the parent class to benefit all locatable files, but since it will probably require indexing, we should use it only when necessary.
  2. Memory requirements for some tools could be reduced by streaming records directly instead of first building an internally held intermediate list (see the sketch after this list).
  3. NIO streaming of entire files to/from buckets could easily be added to the relevant CSV/HDF5 read/write classes. Apart from the first item, I don't think this adds much, since the largest files are only ~1GB (and most seg files are much smaller) and are typically cheap to localize for single samples.
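
As a minimal sketch of items 2 and 3 (with a hypothetical `CountRecord` type and parsing; the real SimpleCount/SimpleCountCollection classes differ), the point is just that `java.nio.file.Path` plus `java.util.stream` lets us consume records lazily, and the same code would read from a bucket if an NIO filesystem provider for that scheme is on the classpath:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

public final class LazyCountStreaming {
    // Hypothetical stand-in for SimpleCount, just to keep the sketch self-contained.
    public record CountRecord(String contig, int start, int end, int count) {}

    // Lazily stream records from a tab-separated counts file without building an intermediate list.
    // If an NIO filesystem provider for the bucket scheme (e.g. gs://) is installed, the same call
    // works for a remote Path constructed via that provider.
    public static Stream<CountRecord> streamCounts(final Path countsPath) {
        try {
            return Files.lines(countsPath)
                    .filter(line -> !line.startsWith("@") && !line.startsWith("CONTIG")) // skip SAM-style header and column names
                    .map(LazyCountStreaming::parse);
        } catch (final IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    private static CountRecord parse(final String line) {
        final String[] f = line.split("\t");
        return new CountRecord(f[0], Integer.parseInt(f[1]), Integer.parseInt(f[2]), Integer.parseInt(f[3]));
    }

    public static void main(final String[] args) {
        // Example: total the counts on chr1 without ever holding the full collection in memory.
        try (final Stream<CountRecord> counts = streamCounts(Paths.get(args[0]))) {
            final long total = counts.filter(r -> r.contig().equals("chr1"))
                                     .mapToLong(CountRecord::count)
                                     .sum();
            System.out.println("chr1 total: " + total);
        }
    }
}
```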

See also #3976, #4004, #4717, and #5715 for context.

I think we should first demonstrate whether file localization (the first item) really dominates cost in the germline pipeline. If not, we should focus on optimizing inference instead. The other items are much lower priority.

@samuelklee (Contributor Author) commented Feb 28, 2019

@vruano @mwalker174 Do you have any cost estimates that could help determine the priority of item 1? Specifically, is the disk cost required to localize the entire count file for all samples a determining factor? If we can drastically reduce this cost, then we can dedicate more of the budget to increasing resolution, etc.

Here is a minimal set of fixes that could enable interval querying for GermlineCNVCaller (and, without much extra work, for DetermineGermlineContigPloidy, since we also subset intervals there) in the gCNV WGS pipeline only, without disrupting other interfaces:

  1. Write a Tribble SimpleCountCodec for the counts.tsv extension. I've already done this in a branch.
  2. Change GermlineCNVCaller and DetermineGermlineContigPloidy tools to accept paths.
  3. If an index is present for each count path, create a FeatureDataSource, merge the requested -L/-XL intervals, and query to perform the subset (a rough sketch follows this list). We will also need to stream the SAM header metadata. It should not require much code to extract all of this into a temporary IndexedSimpleCountCollection class. (Caveat: for now, this will work with the current gCNV convention of providing bins via -L/-XL. Technically, it will also work with the more conventional use of -L/-XL to denote contiguous regions, but we may have to check that bins overlapping adjacent shards are not duplicated, since querying a FeatureDataSource returns any bins that overlap the interval, rather than only those that are completely contained within it.)
  4. Index read-count TSVs in the gCNV WGS pipeline after collection and modify the DetermineGermlineContigPloidy and GermlineCNVCaller tasks to take read-count paths and indices, if necessary. These changes could be confined to the gCNV WGS WDL for now.
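
Roughly how item 3 might look, as a sketch rather than a final implementation: it assumes GATK's FeatureDataSource and SimpleInterval, that SimpleCount has been promoted to a Tribble Feature with a SimpleCountCodec registered for the counts.tsv extension (item 1), and that an index sits next to each counts file; exact package and method names here are from memory and may need adjusting.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

import org.broadinstitute.hellbender.engine.FeatureDataSource;
import org.broadinstitute.hellbender.tools.copynumber.formats.records.SimpleCount;
import org.broadinstitute.hellbender.utils.SimpleInterval;

public final class IndexedCountQuery {
    // Query an indexed counts file for the records overlapping the given (already merged) -L/-XL intervals.
    public static List<SimpleCount> queryCounts(final String countsPath,
                                                final List<SimpleInterval> mergedIntervals) {
        final List<SimpleCount> result = new ArrayList<>();
        try (final FeatureDataSource<SimpleCount> source = new FeatureDataSource<>(countsPath)) {
            for (final SimpleInterval interval : mergedIntervals) {
                // query returns all bins overlapping the interval, not only those fully contained in it,
                // hence the caveat above about de-duplicating bins that straddle adjacent shards.
                final Iterator<SimpleCount> bins = source.query(interval);
                bins.forEachRemaining(result::add);
            }
        }
        return result;
    }
}
```

In a tool, the returned records plus the streamed SAM header would then back the temporary IndexedSimpleCountCollection mentioned above.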

I think that should do the trick. If this is high priority, I can implement now.

In the future, we might be able to promote all Locatable CNV Records to Features and write code to automatically pass the columns/encoders/decoders (currently held in the Collection corresponding to each Record) to a single Tribble codec. This codec should not depend upon the file extension.
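
To make that idea concrete, here is a rough, purely hypothetical sketch (plain JDK types, not the actual Tribble interfaces) of a single decoder parameterized by the column names and record decoder that each Collection already holds; a real implementation would wrap something like this in one Tribble codec whose canDecode checks the header rather than the file extension.

```java
import java.util.List;
import java.util.function.Function;

// Hypothetical sketch of a single, extension-agnostic decoder driven by the same
// column names and line-decoding function that each CNV Collection already holds for its Record type.
public final class GenericRecordDecoder<RECORD> {
    private final List<String> columns;                        // mandatory column names, as held by the Collection
    private final Function<String[], RECORD> recordFromFields; // the Collection's record decoder

    public GenericRecordDecoder(final List<String> columns,
                                final Function<String[], RECORD> recordFromFields) {
        this.columns = columns;
        this.recordFromFields = recordFromFields;
    }

    // Validate the observed column header against the expected columns.
    public void validateHeader(final String headerLine) {
        final List<String> observed = List.of(headerLine.split("\t"));
        if (!observed.equals(columns)) {
            throw new IllegalArgumentException("Unexpected columns: " + observed + ", expected: " + columns);
        }
    }

    // Decode a single data line into a record using the Collection-supplied decoder.
    public RECORD decode(final String line) {
        return recordFromFields.apply(line.split("\t"));
    }
}
```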

@samuelklee (Contributor Author) commented:
Hacked together a proof-of-principle in sl_indexed_counts. This enables streaming of indexed counts (currently triggered when they are specified by a bucket path, although we should change this to trigger on the presence of an index) for only the requested intervals in the gCNV step, which would cut disk costs to practically zero. We could enable this in the contig-ploidy step as well, although the cost benefit is not as great there, since that tool is short-running.

The relevant code clocks in at around +250 lines and could be made even smaller after some cleanup and extraction, but we'd need to update the documentation and the WDLs. All previous behavior is preserved.

@samuelklee (Contributor Author) commented:
@mwalker174 is this something we want to address before my leave?

@mwalker174 (Contributor) commented:
@samuelklee Let's try to get your sl_indexed_counts branch in since it will have a substantial impact on the cost of the clinical SV pipeline. It's fine if it just streams from a bucket, but we do want to update the WDLs to take advantage.

@samuelklee (Contributor Author) commented:
@mwalker174 Once you have the disk-cost estimate, maybe leave a note here for posterity?

@mwalker174 (Contributor) commented:
In the SV pipeline, for ~100 samples in case mode the cost breakdown was:

memory: $1.04
disk: $1.50
cpu: $4.59

It seems that we are over-provisioning for disk: the amount of disk actually used was < 5 GB (only ~$0.14 worth), but the default is 150 GB. So it doesn't seem that we would save much by streaming, but we should lower the default disk size in the WDL.

@samuelklee (Contributor Author) commented:
As we just discussed at our gCNV meeting, this disk cost is for 2kbp bins. So once 100bp bins are back on the table, we'll be at 20 x $0.14 = $2.80. In the meantime, no reason not to recoup that ~$1.36! Maybe even dump it into memory and decrease the number of shards accordingly?

For now, we should probably just set the appropriate runtime parameter in the clinical SV pipeline, but at some point (maybe when we update to WDL 1.0) we should go back and put more intelligent disk-size estimation into the CNV WDLs (see the sketch below for the shape of the calculation). Spec Ops did this for the somatic WDLs (where it probably doesn't make much of a difference), but we never did anything special for the gCNV WDLs.
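
For reference, a hedged sketch of the shape of that estimation (the real thing would live in the WDL runtime block, using size() and ceil() on the task inputs; the safety factor and baseline here are made-up numbers):

```java
public final class DiskSizeEstimate {
    // Request the inputs (scaled by a safety factor for intermediates) plus a small fixed baseline,
    // instead of a flat 150 GB default.
    public static int estimateDiskGb(final double totalInputGb, final double safetyFactor, final int baselineGb) {
        return (int) Math.ceil(totalInputGb * safetyFactor) + baselineGb;
    }

    public static void main(final String[] args) {
        // Roughly the usage observed above: ~100 samples totalling ~5 GB of read counts.
        System.out.println(estimateDiskGb(5.0, 2.0, 10)); // prints 20, versus the 150 GB default
    }
}
```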

@mwalker174 (Contributor) commented:
The bulk of this issue was addressed in #6266. I believe @samuelklee also did some detailed profiling and determined that the memory bottlenecks were inference-related.
