Skip to content

HaplotypeCaller doesn't detect alternate alleles with 1 bp intervals #6495

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jemunro opened this issue Mar 11, 2020 · 9 comments · Fixed by #6507
Closed

HaplotypeCaller doesn't detect alternate alleles with 1 bp intervals #6495

jemunro opened this issue Mar 11, 2020 · 9 comments · Fixed by #6507

Comments

@jemunro
Copy link

jemunro commented Mar 11, 2020

Bug Report

Affected tool(s) or class(es)

HaplotypeCaller

Affected version(s)

4.1.5.0

Description

HaplotypeCaller doesn't detect alternate alleles when 1 bp intervals are provided. With v4.1.4.1 on the same inputs, 361 sites with alternate alleles are detected. When no intervals are provided to v4.1.5.0, the sites are also detected.

Steps to reproduce

gatk HaplotypeCaller \
    --java-options "-Xmx5G -Djava.io.tmpdir=." \
    -R ref.fa \
    -I input.bam \
    -O output.g.vcf.gz \
    -L targets.list \
    --sample-ploidy 1 \
    --max-alternate-alleles 3 \
    -ERC BP_RESOLUTION

Content of targets.list (32652 sites):

head targets.list
Chromosome:29
Chromosome:237
Chromosome:371
Chromosome:380
Chromosome:467
Chromosome:490
Chromosome:782
Chromosome:1053
Chromosome:1126
Chromosome:1131

Expected behavior

zcat output.g.vcf.gz | grep -v '^#' | wc -l
32652
zcat output.g.vcf.gz  | grep -v '^#'  | grep ',<NON_REF>' | wc -l
361

Actual behavior

zcat output.g.vcf.gz | grep -v '^#' | wc -l
32652
zcat output.g.vcf.gz  | grep -v '^#' | grep ',<NON_REF>' | wc -l
0
@ldgauthier
Copy link
Contributor

Hi @jemunro,
Can you confirm that the genotypes for the sites output with alternate alleles have those alts called? I've seen some discrepancies in the past with how we treat alleles that were discovered but genotyped as homozygous reference versus reference blocks.

@jemunro
Copy link
Author

jemunro commented Mar 12, 2020

Hi @ldgauthier ,

I can confirm that the alt alleles are genotyped as '1' using v4.1.4.1, see output below.

zcat output.g.vcf.gz | grep -v '^#' | grep ',<NON_REF>' | head -n5
Chromosome	4013	.	T	C,<NON_REF>	820.01	.	DP=22;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQandDP=79200,22	GT:AD:DP:GQ:PL:SB	1:0,22,0:22:99:830,0,830:0,0,12,10
Chromosome	6140	.	G	T,<NON_REF>	669.01	.	DP=19;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQandDP=68400,19	GT:AD:DP:GQ:PL:SB	1:0,19,0:19:99:679,0,679:0,0,6,13
Chromosome	14251	.	G	A,<NON_REF>	1014.01	.	DP=28;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQandDP=100800,28	GT:AD:DP:GQ:PL:SB	1:0,28,0:28:99:1024,0,1024:0,0,19,9
Chromosome	17608	.	G	C,<NON_REF>	1045.01	.	DP=30;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQandDP=108000,30	GT:AD:DP:GQ:PL:SB	1:0,30,0:30:99:1055,0,1055:0,0,20,10
Chromosome	21795	.	G	A,<NON_REF>	1692.01	.	DP=46;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQandDP=165600,46	GT:AD:DP:GQ:PL:SB	1:0,46,0:46:99:1702,0,1702:0,0,26,20

Looking for the corresponding first site using the v4.1.5.0 output gives the following:

zcat output.g.vcf.gz | grep -v '^#' | grep -P '\t4013\t'
Chromosome	4013	.	T	<NON_REF>	.	.	.	GT:AD:DP:GQ:PL	0:0,28:28:0:0,0

Also note that adding --interval-padding 25 to v4.1.5.0 seems to return the correct output:

zcat output.g.vcf.gz | grep -v '^#' | grep -P '\t4013\t'
Chromosome	4013	.	T	C,<NON_REF>	820.04	.	DP=28;MLEAC=1,0;MLEAF=1.00,0.00;RAW_MQandDP=100800,28	GT:AD:DP:GQ:PL:SB	1:0,22,0:22:99:830,0,830:0,0,12,10

@ldgauthier
Copy link
Contributor

That definitely looks like a bug to me. I can poke around, but it will go much faster if you can upload a little snippet of the bam you're calling on to our FTP server: https://gatk.broadinstitute.org/hc/en-us/articles/360035889671

@jemunro
Copy link
Author

jemunro commented Mar 13, 2020

@ldgauthier I've uploaded the snippet to the FTP server in a file named 'gatk_issue_6495.tar.gz', let me know if you have any issues.

@ldgauthier
Copy link
Contributor

Great, I will try to take a look this week.

@ldgauthier
Copy link
Contributor

It looks like #6358 caused a regression. I'm going to try to figure it out, but we might have to enlist @davidbenjamin

@ldgauthier
Copy link
Contributor

It's this line here that's causing problems

final SimpleInterval paddedVariantSpan = new SimpleInterval(region.getContig(), minStart, maxEnd).intersect(region);

Because the specified intervals are just 1bp, the intersection yields a 1bp interval, which might be fine, but all the reads are being throw away later on because they're now only 1bp long.
This is probably the part of HaplotypeCaller that I'm the least familiar with, but I'll give it a shot.

@davidbenjamin
Copy link
Contributor

@ldgauthier Since this is most likely my fault, should it also be my responsibility? I'm not so incapacitated by school closings that I can't fix bugs.

@davidbenjamin
Copy link
Contributor

@ldgauthier I'm about to submit a bug fix PR. In the line you found the .intersect(region) should be .intersect(region.getPaddedSpan()). The intersection is to avoid a bug where the requested trimmed padded region is bigger than the original padded region, but intersect(region) causes it to lie within the original unpadded region, which is unnecessary and probably harmful to sensitivity (the trimming cigar didn't hurt sensitivity, but I wonder if this mistake may have offset a net benefit that it should have created).

I replicated @jemunro's error in the branch, fixed it (of course), and wrote an equivalent regression test that fails before and passes after the PR. The rest is just the usual annoying updating of integration test files, which as of right now I'm in the middle of.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants