Funcotator / Clinical Pipeline should move from ExAC to gnomAD #5259

Closed
jonn-smith opened this issue Oct 4, 2018 · 6 comments

@jonn-smith
Collaborator

Feature request

Tool(s) or class(es) involved

Funcotator

Description

Currently the data sources for the clinical pipeline work contain ExAC. This must be updated to use gnomAD.

The change will require a new release of the data sources which must be connected to the data source downloader tool.

Additionally, these new data sources must be validated in four ways:

  • By visually inspecting the gnomAD source file for correctness.
  • By verifying that the source file for gnomAD does not contain special characters.
  • By validating that the source file for gnomAD is a valid VCF (assuming it is in VCF format).
  • By running a large file and spot checking at least 10 variants for correctness.
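
The second and third checks above can be partly automated. The following is a minimal sketch, not the validation that was actually run: the function name `check_vcf_lines` is hypothetical, and treating anything outside printable ASCII plus tab as a "special character" is an assumption. It only checks basic structure, so a dedicated validator (for example GATK's ValidateVariants) would still be needed for a full VCF check.

```python
import re

# Assumption: "special characters" means anything outside printable ASCII plus tab.
_ALLOWED = re.compile(r"^[\x20-\x7E\t]*$")


def check_vcf_lines(lines):
    """Return a list of (line_number, problem) tuples for basic VCF sanity checks."""
    problems = []
    saw_fileformat = False
    saw_column_header = False
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not _ALLOWED.match(line):
            problems.append((number, "non-printable or non-ASCII character"))
        if line.startswith("##"):
            # The VCF spec requires the fileformat line to be the first line.
            if number == 1 and line.startswith("##fileformat=VCF"):
                saw_fileformat = True
            continue
        if line.startswith("#CHROM"):
            saw_column_header = True
            continue
        if not line:
            continue
        fields = line.split("\t")
        if len(fields) < 8:
            problems.append((number, "fewer than 8 tab-separated columns"))
        elif not fields[1].isdigit():
            problems.append((number, "POS is not an integer"))
    if not saw_fileformat:
        problems.append((1, "missing ##fileformat line"))
    if not saw_column_header:
        problems.append((0, "missing #CHROM header line"))
    return problems
```

An empty result means none of these basic checks failed; it does not prove the file is fully spec-compliant.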

@jonn-smith jonn-smith added this to the Funcotator 1.1 milestone Oct 4, 2018
@jonn-smith jonn-smith modified the milestones: Funcotator 1.1, Engine-Q42018 Oct 15, 2018
@jonn-smith jonn-smith self-assigned this Oct 16, 2018
@jonn-smith
Collaborator Author

This will seemingly make the clinical pipeline data sources too large to be usable. The gnomAD data for the whole genome is apparently 106 GB; the exome data is 16 GB, which would be OK but is at the upper limit (http://gnomad.broadinstitute.org/downloads).

@LeeTL1220 - what are your thoughts?

@jonn-smith
Collaborator Author

Sent an email to Niall and Alyssa asking about subsetting the gnomAD files.

@jonn-smith
Collaborator Author

Also, the gnomAD files are made from b37 data. A set lifted over to hg38 exists as well.

@jonn-smith
Collaborator Author

One potential plan:

  • filter out any variants that do not PASS
  • convert to TSV
  • only keep the AF/AF_* fields
  • use Exome data
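
As a rough illustration of the first three steps of that plan, here is a sketch that streams an uncompressed VCF, drops non-PASS records, and writes a TSV with only `AF` and `AF_*` columns. The function names are hypothetical, and it assumes the AF fields are declared in `##INFO` header lines and that each record has the eight standard columns.

```python
import csv
import re

_INFO_ID = re.compile(r"^##INFO=<ID=([^,>]+)")


def is_af_key(key):
    """True for the AF field and any AF_* field."""
    return key == "AF" or key.startswith("AF_")


def vcf_to_af_tsv(vcf_lines, out):
    """Write PASS variants to `out` as TSV, keeping only AF and AF_* INFO fields."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    af_keys = []  # TSV columns, collected from the ##INFO header lines
    for raw in vcf_lines:
        line = raw.rstrip("\n")
        if line.startswith("##"):
            match = _INFO_ID.match(line)
            if match and is_af_key(match.group(1)):
                af_keys.append(match.group(1))
            continue
        if line.startswith("#CHROM"):
            # All ## header lines precede #CHROM, so the column list is complete here.
            writer.writerow(["CHROM", "POS", "REF", "ALT"] + af_keys)
            continue
        if not line:
            continue
        chrom, pos, _id, ref, alt, _qual, filt, info = line.split("\t")[:8]
        if filt != "PASS":
            continue  # filter out any variant that does not PASS
        values = dict(e.split("=", 1) for e in info.split(";") if "=" in e)
        writer.writerow([chrom, pos, ref, alt] + [values.get(k, "") for k in af_keys])
```

Because the column list comes from the header, records can be written one at a time without holding the file in memory, which matters at the file sizes mentioned earlier in this thread.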

@jonn-smith
Collaborator Author

We will just read from gnomAD directly using the NIO updates for Funcotator data sources.

Should fix this at the same time as #5428 and #5429.

@jonn-smith
Collaborator Author

To facilitate the download of gnomAD records, I removed all INFO annotations that had nothing to do with allele frequency, leaving behind only the fields that we would want for our use case.
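
For illustration only, the kind of per-line trimming described above might look like the sketch below. It is not the WDL that was actually used: the function name `strip_non_af_info` is hypothetical, and the default rule for what counts as allele-frequency related (`AF` or `AF_*`) is an assumption, since the exact list of retained fields is not given here.

```python
def strip_non_af_info(line, keep=lambda key: key == "AF" or key.startswith("AF_")):
    """Return a VCF line with only the kept INFO annotations, or None to drop it.

    The default `keep` rule (AF and AF_*) is an assumption, not the real field list.
    """
    line = line.rstrip("\n")
    prefix = "##INFO=<ID="
    if line.startswith(prefix):
        key = line[len(prefix):].split(",", 1)[0].rstrip(">")
        # Drop the header declaration of any annotation that is being removed.
        return line if keep(key) else None
    if line.startswith("#") or not line:
        return line
    fields = line.split("\t")
    entries = [e for e in fields[7].split(";") if keep(e.split("=", 1)[0])]
    # "." is the VCF convention for an empty INFO column.
    fields[7] = ";".join(entries) if entries else "."
    return "\t".join(fields)
```

The FILTER column is never inspected, so filtered variants pass through unchanged apart from their INFO column.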

The WDLs/jsons are now in the funcotator/scripts/data_sources directory (separate PR).

All variants were kept, even those that were filtered.

The network IO slows down Funcotator significantly, but not enough to make it unusable. For this reason, and partly because reading gnomAD this way requires an internet connection, the gnomAD data sources are disabled by default.
