Skip to content

Added a force output sites argument to GenotypeGVCFs #6263

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Dec 3, 2019

Conversation

davidbenjamin
Copy link
Contributor

Closes #6239.

@ldgauthier @bbimber Here's what I settled on:

  • GenotypeGVCFs gets a new argument for force-genotyping intervals. This is parsed exactly like any other interval argument, so the following are all valid:
--force-output-intervals 20:45000
--force-output-intervals intervals.interval_list
--force-output-intervals dbsnp.vcf
--force-output-intervals 20:1000-2000 --force-output-intervals 20:3000-4000
  • Only the sites matter, so if a VCF is given the alleles are ignored.
  • If the site is absent in the input gVCF, there is no output. No data in yields no data out.
  • If the site overlaps a record in the input gVCF, there is output, regardless of whether the record is a variant, a REF block, or a spanning deletion.

@davidbenjamin
Copy link
Contributor Author

As far as I can tell the failures are just Travis acting up. GenotypeGVCFsIntegrationTest passes locally.

@ldgauthier
Copy link
Contributor

@davidbenjamin What's the difference between --force-output-intervals myIntervals.list and --include-non-variants -L myIntervals.list?

@bbimber
Copy link
Contributor

bbimber commented Nov 26, 2019

@ldgauthier I have not had a chance to check out this code in detail (I'll try to get someone in my lab to do this today); however, I would hope that the difference is that --force-output-intervals would output at those intervals (whether variant or not) AND include any additional variant sites. The latter, --include-non-variants and -L would only output the intervals from -L.

@ldgauthier
Copy link
Contributor

Yes, thank you for jogging my memory @bbimber. @davidbenjamin adding something to that extend in the BadArgumentException message would be helpful.

@davidbenjamin
Copy link
Contributor Author

@ldgauthier I wrote a detailed error message. Anything else?

Copy link
Contributor

@ldgauthier ldgauthier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! :shipit:

@davidbenjamin davidbenjamin merged commit 2f26904 into master Dec 3, 2019
@davidbenjamin davidbenjamin deleted the db_force_genotype_gvcfs branch December 3, 2019 18:49
@bbimber
Copy link
Contributor

bbimber commented Jan 18, 2020

@davidbenjamin Hello, we started using this feature more seriously and have a question. the output from GenotypeGVCFs with --force-output-sites seems to include sites with <NON_REF> as the ALT allele. Is this expected? I am guessing that these sites used to be filtered (no passing evidence of variation), and the point of this feature is the include sites based on whitelist?

The problem is that some downstream tools dont know what to do with this. There is probably some subtlety here, but I would think either a given sample has callable variation, it is REF, or it is no-call? Is <NON_REF> in the output by design?

@davidbenjamin
Copy link
Contributor Author

@bbimber Do I understand correctly that your PR #6406 resolves the issue you found?

@bbimber
Copy link
Contributor

bbimber commented Jan 22, 2020

yes, it seems to. it seems this was a simple bug where GenotypeGVCFsEngine.removeNonRefAlleles() wasnt working as intended for multi-allelic sites; however, I'm not sure I understand the entire genotyping process well enough to be certain on this. we can iterate on #6406

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

PR for GenotypeGVCFs and Non-variant Sites?
3 participants