Improve capabilities of CNV collections classes. #5716
@vruano @mwalker174 Any estimates on cost that could help determine the priority of issue 1? Specifically, is the disk cost required to localize the entire count file for every sample a determining factor? If we can drastically reduce this cost, then we can dedicate more of the budget to increasing resolution, etc. Here is a minimal set of fixes that could enable interval querying for GermlineCNVCaller (and, without too much extra work, for DetermineGermlineContigPloidy, since we also subset intervals there) in the gCNV WGS pipeline only, without disrupting other interfaces:
I think that should do the trick. If this is high priority, I can implement now. In the future, we might be able to promote all Locatable CNV Records to Features and write code to automatically pass the columns/encoders/decoders (currently held in the Collection corresponding to each Record) to a single Tribble codec. This codec should not depend upon the file extension.
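For illustration only, here is a minimal sketch of what a Locatable count record promoted to a Tribble `Feature` with an extension-agnostic codec might look like. `SimpleCountFeature`, `SimpleCountCodec`, and the column layout are hypothetical placeholders, not the actual GATK classes; the real columns and header handling would come from the existing Collection/Record machinery rather than being hard-coded as here:

```java
import htsjdk.tribble.AsciiFeatureCodec;
import htsjdk.tribble.Feature;
import htsjdk.tribble.readers.LineIterator;

/** Hypothetical Locatable count record promoted to a Tribble Feature. */
final class SimpleCountFeature implements Feature {
    private final String contig;
    private final int start;
    private final int end;
    private final int count;

    SimpleCountFeature(final String contig, final int start, final int end, final int count) {
        this.contig = contig;
        this.start = start;
        this.end = end;
        this.count = count;
    }

    @Override public String getContig() { return contig; }
    @Override public int getStart() { return start; }
    @Override public int getEnd() { return end; }
    public int getCount() { return count; }
}

/** Hypothetical codec for tab-separated count records; deliberately ignores the file extension. */
public final class SimpleCountCodec extends AsciiFeatureCodec<SimpleCountFeature> {
    public SimpleCountCodec() {
        super(SimpleCountFeature.class);
    }

    @Override
    public SimpleCountFeature decode(final String line) {
        // assume "@"-prefixed header lines and a CONTIG/START/END/COUNT column line precede the records
        if (line.isEmpty() || line.startsWith("@") || line.startsWith("CONTIG")) {
            return null;
        }
        final String[] tokens = line.split("\t");
        return new SimpleCountFeature(tokens[0],
                Integer.parseInt(tokens[1]), Integer.parseInt(tokens[2]), Integer.parseInt(tokens[3]));
    }

    @Override
    public Object readActualHeader(final LineIterator reader) {
        // consume any leading header lines so decode() only ever sees records
        final StringBuilder header = new StringBuilder();
        while (reader.hasNext() && (reader.peek().startsWith("@") || reader.peek().startsWith("CONTIG"))) {
            header.append(reader.next()).append('\n');
        }
        return header.toString();
    }

    @Override
    public boolean canDecode(final String path) {
        return true;    // extension-agnostic, as suggested above
    }
}
```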
Hacked together a proof-of-principle in sl_indexed_counts. This enables streaming of indexed counts (when they are specified by a bucket, although we should change this to trigger on the presence of an index) only for the requested intervals in the gCNV step, which would cut disk costs to practically zero. We could enable this in the contig-ploidy step as well, although the cost benefit is not as great there since the tool is short-running. The relevant code clocks in at around +250 lines and could be even more minimal after some cleanup and extraction, but we'd need to update documentation and the WDLs. All previous behavior is preserved.
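As a rough sketch of the query pattern (reusing the hypothetical `SimpleCountFeature`/`SimpleCountCodec` above, and assuming a Tribble index sits next to the counts file), this uses htsjdk's generic indexed reader rather than whatever the branch actually wires through the engine, but the idea is the same: stream only the records overlapping the requested intervals instead of localizing whole files.

```java
import htsjdk.tribble.AbstractFeatureReader;
import htsjdk.tribble.CloseableTribbleIterator;
import htsjdk.tribble.readers.LineIterator;

import java.io.IOException;

public final class QueryIndexedCounts {
    public static void main(final String[] args) throws IOException {
        final String countsPath = args[0];  // local path or bucket URL to an indexed counts file
        // requireIndex = true forces use of the Tribble index
        try (final AbstractFeatureReader<SimpleCountFeature, LineIterator> reader =
                     AbstractFeatureReader.getFeatureReader(countsPath, new SimpleCountCodec(), true)) {
            // query() streams only the records overlapping the requested interval
            try (final CloseableTribbleIterator<SimpleCountFeature> it = reader.query("chr20", 1_000_000, 2_000_000)) {
                while (it.hasNext()) {
                    final SimpleCountFeature count = it.next();
                    System.out.printf("%s:%d-%d\t%d%n",
                            count.getContig(), count.getStart(), count.getEnd(), count.getCount());
                }
            }
        }
    }
}
```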
@mwalker174 is this something we want to address before my leave?
@samuelklee Let's try to get your
@mwalker174 Maybe a note here with the disk-cost estimate, once you have it, would be useful for posterity?
In the SV pipeline, for ~100 samples in case mode, the cost breakdown was: memory: $1.04. It seems that we are over-provisioning for disk: the amount of disk actually used was < 5 GB (only ~$0.14 worth of disk), but the default is 150 GB. Therefore it doesn't seem that we would save much by streaming, but we should lower the default disk size in the WDL.
As we just discussed at our gCNV meeting, this disk cost is for 2 kbp bins. So once 100 bp bins are back on the table (20× as many bins), we'll be at 20 × $0.14 = $2.80. In the meantime, no reason not to recoup that ~$1.36! Maybe even dump it into memory and decrease the number of shards accordingly? For now, we should probably just set the appropriate runtime parameter in the clinical SV pipeline, and then at some point (maybe when we update to WDL 1.0) go back and put some more intelligent disk-size estimation into the CNV WDLs. Spec Ops did this for the somatic WDLs at one point (where it probably doesn't make too much of a difference), but we didn't do anything special for the gCNV WDLs.
The bulk of this issue was addressed in #6266. I believe @samuelklee also did some detailed profiling and determined that the memory bottlenecks were inference-related. |
In order of priority:
See also #3976, #4004, #4717, and #5715 for context.
I think we should first demonstrate whether the first issue is really the dominating cost in the germline pipeline. If it is not, we should focus on optimizing inference instead. The other issues are much lower priority.