Skip to content

GVCF output problem #365

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jimmykhchiu opened this issue Feb 28, 2025 · 4 comments
Closed

GVCF output problem #365

jimmykhchiu opened this issue Feb 28, 2025 · 4 comments
Labels
bug Something isn't working

Comments

@jimmykhchiu
Copy link

jimmykhchiu commented Feb 28, 2025

Hi Clair3 Team,

Thanks for your efforts in creating and maintaining this tool.

I got some problems when normalizing the generated GVCF files with bcftools. It complained two things:

  1. Incorrect no. of AF values for the identified alt alleles in the het (0/1) or hom alt (1/1) variants observed. After investigation, we found that the AF value for <NON_REF> is always missing. The VCF header defines AD and AF according to VCF spec as:
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=AF,Number=A,Type=Float,Description="Observed allele frequency in reads, for each ALT allele, in the same order as listed, or the REF allele for a RefCall">

where R and A for attribute Number are defined as:

R: The field has one value for each possible allele, including the reference. The order of the values must be the reference allele first, then the alternate alleles as listed in the ALT column.

A: The field has one value per alternate allele. The values must be in the same order as listed in the ALT column.

Based on this protocol, when there are three alt alleles and <NON_REF> being one of them, the AD field should consist of four integers (including REF allele) and the AF field should consist of three float values. However, what we found in the GVCF is that AF consists of only two while AD still has four, and the missing AF value belongs to <NON_REF>.

  1. Some REF alleles in chromosome Y are reported as 'N' which is not true when referring back to the actual reference sequence.

FYI I am using the latest version of Clair3 (v1.0.10)

Hope can hear from you soon.

Jimmy

@zhengzhenxian
Copy link
Collaborator

@jimmykhchiu

Great thanks for reporting this.

  1. Regarding the AF values, any hint on what value should be assigned to <NON_REF>?

  2. Are the "N" alleles exclusively observed in deletions, or can they also appear in SNP?

Thanks,
Zhenxian

@jimmykhchiu
Copy link
Author

jimmykhchiu commented Mar 11, 2025

Hi Zhenxian,

  1. It should be equal to: the last AD value divided by DP. This is because <NON_REF> is always the last ALT allele and the AD values should follow the same order as their underlying ALT alleles listed.

  2. From our observations so far the false "N" REF alleles appear only in some chromosome Y RefCalls in gvcf, e.g.

chrY	16365274	.	N	<NON_REF>	0	.	END=20428113	GT:GQ:MIN_DP:PL	./.:1:0:0,0,0

The above record suggests a very long "N" segments in chrY but this disagrees with the actual HG38 reference sequence.

Hope this helps.

Jimmy

@jimmykhchiu
Copy link
Author

jimmykhchiu commented Mar 14, 2025

Hi Zhenxian,

I guess the reason for (2) is due to the write_empty_pileup function defined here. When there is no pileup in a region, It simply writes 'N' ref to the gvcf regardless of the actual reference sequence for that region. This may be problematic for female sample where empty pileup is normal for chromosome Y. Of course whether exporting an 'N' ref or the actual sequence base is a bit trivial in this context, but at least bcftools still checks for that without considering the sex. Hope this helps.

Jimmy

@aquaskyline
Copy link
Member

Fixed in v1.0.11, please check.

@aquaskyline aquaskyline added the bug Something isn't working label Mar 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

3 participants