-
Notifications
You must be signed in to change notification settings - Fork 603
GenomicsDBImport not completing for mixed ploidy samples #6275
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Comments
I'm going to tag in our Genomics DB experts to take a look at this: @nalinigans @mlathara |
Link to user data: |
Thanks for the files. We've identified the issue as a bug in htslib - pretty straightforward fix. The mixed ploidy is not a problem in this case. But the confluence of multiploidy, number of samples and number of alternate alleles causes some locations in the combined gvcf to have very large arrays of FORMAT data. That triggers an (unforeseen) overflow in htslib. With Thanksgiving coming up, we won't be able to get the fix out in a released GenomicsDB this week. Can try for early next week, depending on when you're next doing a GATK release @droazen |
Has this been fixed in 4.1.4.1 or should I wait for the next release? Had the same issue with GenotypeGVCFs in 4.1.4.0 using a GenomicsDB database with five GVCFs with pooled samples (12 samples per pool). |
Not fixed in 4.1.4.1. Same issue. |
@marcopessoa, the fixes for this issue are not in master yet - here is the pending PR #6305. You could use the genomicsdb_120_1 branch to test out your scenario and post what you find to provide further validation for the changes in the PR. Thanks. |
I downloaded and built genomicsdb_120_1 from source. I deleted my previous GenomicsDB database, and built a new one with the following options:
I then ran GenotypeGVCFs with the following options for one of the chromosomes:
It failed at the same region it was failing before, with this error message:
HaplotypeCaller ran without errors, so I don't know if I should generate new GVCFs. Each input GVCF is a pool of 12 diploid plants. These are the INFO lines when I run GenomicsDBImport:
And GenotypeGVCFs:
|
Based on the messages you have provided. looks like you have built and run the GATK master branch. You should switch to the genomicsdb_120_1 branch (git checkout genomicsdb_120_1) and retry. You don't need to delete the imported workspace. You should see the version string 4.1.4.1-7-gdb054ab-SNAPSHOT in the GATK stdout. |
Apologies! Switched to the correct branch. I tested it with a troublesome region I identified in one of the chromosomes, using the following command:
Run info:
Run starts, a few variants are detected. Then It gets stuck in a region for about 30 minutes, and prints an out-of-memory error:
Changed heap size to 56 Gb and 128 Gb, and the problem persists. |
It could be that the large number of genotypes being returned by GenomicsDB causes the GenotypeGVCFs to consume a really large amount of memory. I think @mlathara is working on a possible fix in #6377. |
I ran the following command:
It might help to mention that I used the GATK 3.8 to rerun
Still, it manages to ignore these regions, genotype samples and output a valid VCF file. I assume these are the regions causing problems in both GenotypeGVCFs and SelectVariants in the GATK 4.0. Would high sample sequence coverage lead to such issues? Please let me know if you need any information about the experiment. |
@marcopessoa would you mind trying out a potential fix for us? If you could check out branch The current defaults for If the fix doesn't work, and you're willing to share a minimal set of your data that can reproduce the issue, we'll be happy to take a closer look. Thanks! |
I rebuilt the new branch and got the same Should I still see v4.1.4.1-7-gdb054ab-SNAPSHOT for this fix? Please let me know what data I should provide for you to reproduce this on your side. Thanks a lot for your help. |
You should see Try doing a |
Finally managed to build the correct snapshot. It ran successfully in the problem region! It also reported Sample/Callsets with too many genotypes in combined VCF. I will keep testing it in other regions, but I can see now that However, the current limit for genotype counts is set to 100000 (I could see that in the code for this PR in the file GenomicsDBOptions.java, guess that was hard coded), and it remained this way even if I set it to 1024 with the Not sure if this also fixes the SelectVariants issue, though. Once again, thanks a lot for your help. You guys are amazing! Can't wait to see this in a new release. |
Great to hear - my mistake on the hardcoded 100000. I was just trying to make that the default, will get that fixed to hew to the argument passed in. SelectVariants will still fail with this branch - we'll have to do something similar for that as well. |
Is the |
@marcopessoa, not yet. We will work on this early this week. |
@marcopessoa, I was hoping to get PR #6305 into the release this week, but it is still ongoing. Meanwhile, please use the genomicsdb_120_1 branch where you can specify |
Closing as resolved |
I was able to replicate the users error with GATK4.1.1.0 and the latest build on Nov21. The users data is on the FTP as livinlrg_Problem_Interval.
Running SelectVariants on the same data generates the same error as GenomicsDBImport
User Question (LINK)
I'm attempting to call variants on whole genomes for about 500 illumina paired-end samples with varying ploidy (haploid to tetraploid). I'm running a fairly standard uBam to GVCF pipeline with HaplotypeCaller passed the ploidy information (1,2,3, or 4) in -ERC GVCF mode. I then try to collect the GVCFs using GenomicsDBImport in a batch size of 50 and use GenotypeGVCFs on the combined database. My interval list that is passed to GenomicsDBImport is just each chromosome on a separate line. I'm using GATK v4.1.1.0
Command:
The text was updated successfully, but these errors were encountered: