Add warning to SelectVariants about poor performance if the samples are not sorted in the VCF header #7732

Closed
droazen opened this issue Mar 21, 2022 · 2 comments · Fixed by #7887
Labels: learn GATK (Suitable for GATK beginners)

Comments

@droazen
Contributor

droazen commented Mar 21, 2022

Apparently SelectVariants is ~10x slower when the samples in the VCF header are not sorted, due to the need to reorder the genotypes on output. We should at least warn the user when unsorted sample names are detected in the input.

(discovered by @epiercehoffman)
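For illustration, the check proposed above can be sketched outside GATK in a few lines of Python. This is a minimal sketch under stated assumptions, not the fix that was merged: the function names (`read_sample_names`, `samples_are_sorted`, `unsorted_samples_warning`) and the warning text are hypothetical, and "sorted" is assumed to mean plain ascending string order, which may differ from the ordering HTSJDK applies for non-ASCII names. It relies only on the VCF convention that sample names follow the FORMAT column on the `#CHROM` header line.

```python
def read_sample_names(vcf_lines):
    """Return the sample names from the #CHROM header line of a VCF.

    `vcf_lines` is any iterable of text lines (an open file, a list, ...).
    """
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            fields = line.rstrip("\r\n").split("\t")
            # Columns 1-8 are fixed and column 9 is FORMAT; samples follow.
            return fields[9:]
        if not line.startswith("#"):
            break  # reached the records without seeing a #CHROM line
    raise ValueError("no #CHROM header line found")


def samples_are_sorted(samples):
    """True if each name is <= the next one (assumed plain string order)."""
    return all(a <= b for a, b in zip(samples, samples[1:]))


def unsorted_samples_warning(samples):
    """Return a warning string for unsorted samples, or None if sorted."""
    if samples_are_sorted(samples):
        return None
    # Hypothetical wording; the real tool's message may differ.
    return (
        "Sample names in the VCF header are not sorted; "
        "tools that must reorder genotypes may run much slower."
    )


if __name__ == "__main__":
    header = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA12878\tNA12877\n",
    ]
    message = unsorted_samples_warning(read_sample_names(header))
    if message is not None:
        print("WARNING:", message)
```

Returning the message instead of logging it directly keeps the check easy to test and lets the caller decide how loudly to report it.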

@epiercehoffman
Contributor

epiercehoffman commented Mar 21, 2022

Thanks for opening this issue. I hope that this performance issue can be fixed in HTSJDK soon, but I agree a warning would be useful in the meantime.

To clarify, I believe this applies to any tool that loads a VCF but does not need to parse genotypes - not just SelectVariants. For instance, I saw a 5-10x slowdown in SVAnnotate with unsorted sample IDs for a VCF with ~2500 samples.

Another note: GATK did not reorder the sample IDs in the output VCF during my tests of SVAnnotate, but did reorder IDs during SelectVariants.

@mwalker174
Contributor

A warning message would be helpful, although I doubt most Terra users read their logs unless there's an error. What are the chances this can get addressed in htsjdk?
