-
Notifications
You must be signed in to change notification settings - Fork 602
Misleading StrandOddsRatio value when using --max-mnp-distance #5698
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
@tfenne the bam would definitely be useful for the MNP issue (although likely not necessary), but if you can give me just the 2x2 contingency table (the FORMAT SB) then I can take a look at that pseudo count issue. |
Actually the contingency tables for both variants in the split case and for the MNP would be super useful. You have to run in GVCF mode to generate them. |
Thanks @ldgauthier! I think this is what you're asking for:
I should also be able to share the BAM later today - I just need to sanitize it a little before sharing. |
Perfect, thanks! I'm pretty sure I've got it, in which case I don't need the bam. The contingency table is ref vs. not ref. For the 0|1 GT, those 57 reads are considered reference, but in the 1/2 case those 57 reads are considered alts. On one level, that seems fair to me because none of those reads match the reference at both positions. It does make for a less than informative comparison, though. But is measuring the bias of one alt against another informative for filtering? For MQRankSum I stand by comparing the alt mapping to the reference, but I have to think about this one some more. Using allele-specific annotations as I've implemented them isn't going to help here either because all the comparisons (strand bias or rank sum) are made against the reference. I definitely agree that the SOR for the homVar is going to be bad for filtering and I'll fix it. |
Thanks @ldgauthier, that's definitely thought provoking. Perhaps you are right that for het-non-ref (i.e. 1/2 and similar) genotypes we should consider skipping the strand bias test. I had been assuming, incorrectly, that in a single sample case with a Just thinking out loud about this, I wonder if computing SOR on the I'll read up on how the allele-specific strand bias works because I've never looked at that before. But I wonder if it's possible from the available annotations to get a SOR-like value for each allele that is that allele vs all other alleles? |
One more question related to this. In playing around I've noticed that if I run HC in GVCF mode with |
Hi @tfenne, it may be that https://hail.is/ will allow you to do what you wish to do. Might be worth checking out. |
@tfenne there's no commandline option now, but it would be easy to add one. The line that's removing AS_SB_TABLE is here: gatk/src/main/java/org/broadinstitute/hellbender/tools/walkers/annotator/VariantAnnotatorEngine.java Line 224 in 69f4006
You'll probably need to add a new field to VariantAnnotatorEngine to be set via GenotypeGVCFs, but that's easy peasy. I can toss this in with the F1R2 fix, but it might be a couple of days. |
Bug Report
Affected tool(s) or class(es)
HaplotypeCaller with
--max-mnp-distance
filter.Affected version(s)
Description
I think there's a problem with the StrandOddsRatio (SOR) annotation and the
--map-mnp-distance
flag. I'm looking at a small region of NA24143 (one of the GIAB samples). There's a pair of SNPs in very close proximity. When called without the MNP output I get a pair of variants as follows (some info removed for clarity), coordinates are HG19:I'm trying to get permission to share the BAM over this region, but the key information is that every single read that spans or is in proximity to these variants is on the R strand. There is zero F strand coverage.
This seems reasonable. It's a bit odd to me that the first SNP which is hom-var has a SOR value of 9.825, but it's homozygous so it's more or less irrelevant. Looking at the code, I think the problem here is that the code avoids divide-by-zero errors by adding pseudo-counts of
1.0
to the table, which for homozygous variants with no coverage on one strand creates a weird situation. I think it would be better to just detect if all coverage is on one strand and short-circuit the calculation, but I digress.The real problem comes when running with
--max-mnp-distance 5
. Then I get this single variant:Now I have a het variant with an SOR of 9.75. This seems really wrong to me - note how FS is 0.0. Again all coverage of all alleles is on one strand. And the het SNP that forms part of this MNP had an SOR of 0.983 when called independently. Since the first SNP is hom-var and the second is het, I would have expected the SOR value for the MNP call to closely mirror that of the het SNP.
My suspicion is that what's going on here is probably that the calculation is being run using the contingency table for the hom-var SNP that's first in the MNP, perhaps filtered to only reads that span the whole MNP, since the value is marginally lower.
Steps to reproduce
I will try and post a BAM snippet later, but essentially to reproduce I think a synthetic test case with the following properties would work:
--max-mnp-distance 5
Expected behavior
SOR value should be close in value to that of the het-snp called in isolation.
Actual behavior
SOR value is in fact similar to that of the HomVar snp.
The text was updated successfully, but these errors were encountered: