Funcotator / Clinical Pipeline should move from ExAC to gnomAD #5259

jonn-smith · 2018-10-04T16:51:57Z

Feature request

Tool(s) or class(es) involved

Funcotator

Description

Currently, the data sources for the clinical pipeline contain ExAC. They must be updated to use gnomAD.

The change will require a new release of the data sources, which must be connected to the data source downloader tool.

Additionally, these new data sources must be validated in four ways:

By visually inspecting the gnomAD source file for correctness.
By verifying that the source file for gnomAD does not contain special characters.
By validating that the source file for gnomAD is a valid VCF (assuming it is in VCF format).
By running a large file and spot checking at least 10 variants for correctness.
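
The two mechanical checks above (special characters and VCF validity) could be partly automated. The following is only a minimal sketch: it assumes "special characters" means control or non-ASCII characters, and the function names `find_non_ascii_lines` and `looks_like_vcf` are hypothetical, not part of Funcotator or GATK. A real validation would more likely use an existing VCF validator.

```python
# Hypothetical helpers for a quick sanity check of a VCF data source.
# Assumption: "special characters" means control or non-ASCII characters.

_VCF_FIXED_COLUMNS = ["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"]


def find_non_ascii_lines(lines):
    """Return 1-based numbers of lines holding control or non-ASCII characters."""
    bad = []
    for number, line in enumerate(lines, start=1):
        text = line.rstrip("\n")
        # Tabs are legal separators in a VCF; everything else below 32 is not.
        if any((ord(ch) < 32 and ch != "\t") or ord(ch) > 126 for ch in text):
            bad.append(number)
    return bad


def looks_like_vcf(lines):
    """Cheap structural check, not a full specification validation."""
    lines = [line.rstrip("\n") for line in lines]
    if not lines or not lines[0].startswith("##fileformat=VCF"):
        return False
    saw_header = False
    for line in lines[1:]:
        if line.startswith("##"):
            if saw_header:  # meta lines must precede the column header
                return False
            continue
        if line.startswith("#"):
            if saw_header or line.split("\t")[:8] != _VCF_FIXED_COLUMNS:
                return False
            saw_header = True
            continue
        if not saw_header:  # a record appeared before the column header
            return False
        fields = line.split("\t")
        if len(fields) < 8 or not fields[1].isdigit():
            return False
    return saw_header
```

Both functions accept any iterable of lines, so they can be fed an open file handle. Passing these checks does not replace the visual inspection or the spot check of annotated variants.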

jonn-smith · 2018-10-16T17:34:03Z

This will seemingly make the clinical pipeline data sources too large to be usable. The gnomAD data for the whole genome is apparently 106Gb; the exome data is 16Gb, which would be OK but is at the upper limit (http://gnomad.broadinstitute.org/downloads).

@LeeTL1220 - what are your thoughts?

jonn-smith · 2018-10-16T18:07:54Z

Sent an email to Niall and Alyssa asking about subsetting the gnomAD files.

jonn-smith · 2018-10-16T18:16:07Z

Also, the gnomAD files are made from B37 data. A set lifted over to HG38 exists as well.

jonn-smith · 2018-10-23T18:35:35Z

One potential plan:

filter out any variants that do not PASS
convert to TSV
only keep the AF/AF_* fields
use Exome data
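
A rough illustration of the first three steps of this plan, written as plain Python with no VCF library. This is a sketch under assumptions, not what was implemented: the function name `vcf_to_af_tsv` and the choice of output columns are hypothetical, and it buffers records in memory to discover the AF columns, which would not scale to the full exome file without a fixed column list.

```python
def vcf_to_af_tsv(vcf_lines):
    """Convert VCF lines to TSV lines, keeping PASS variants and AF/AF_* fields only."""
    records = []
    af_keys = []  # AF-related INFO keys, in order of first appearance
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip meta lines and the column header
        chrom, pos, _id, ref, alt, _qual, filt, info = line.split("\t")[:8]
        if filt != "PASS":
            continue  # step 1: drop variants that do not PASS
        afs = {}
        for entry in info.split(";"):
            key, sep, value = entry.partition("=")
            # step 3: keep only AF and AF_* annotations
            if sep and (key == "AF" or key.startswith("AF_")):
                afs[key] = value
                if key not in af_keys:
                    af_keys.append(key)
        records.append(([chrom, pos, ref, alt], afs))

    # step 2: emit TSV, with an empty cell where a record lacks a given key
    rows = [["CHROM", "POS", "REF", "ALT"] + af_keys]
    for fixed, afs in records:
        rows.append(fixed + [afs.get(key, "") for key in af_keys])
    return ["\t".join(row) for row in rows]
```

Multi-allelic sites are passed through unchanged here: a comma-separated ALT and its comma-separated AF values stay in a single row.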

jonn-smith · 2018-11-27T21:38:23Z

We will just read from gnomAD directly using the NIO updates for Funcotator data sources.

Should fix this at the same time as #5428 and #5429.

jonn-smith · 2019-01-25T19:51:39Z

To facilitate the download of gnomAD records, I removed all INFO annotations that had nothing to do with allele frequency, leaving behind only the fields relevant to our use case.

The WDLs/jsons are now in the funcotator/scripts/data_sources directory (separate PR).

All variants were kept, even those that were filtered.
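
For illustration, the kind of INFO stripping described here might look like the sketch below. The exact set of retained fields is not listed in this thread, so the rule used (keep `AF` and `AF_*` keys) is an assumption, and `keep_af_info` and `strip_vcf` are hypothetical names; the actual processing lives in the WDLs mentioned above.

```python
def keep_af_info(record_line):
    """Return a VCF record with only AF / AF_* entries left in the INFO column."""
    fields = record_line.rstrip("\n").split("\t")
    kept = []
    for entry in fields[7].split(";"):
        key = entry.partition("=")[0]
        if key == "AF" or key.startswith("AF_"):
            kept.append(entry)
    # An empty INFO column is written as "." in VCF.
    fields[7] = ";".join(kept) if kept else "."
    return "\t".join(fields)


def strip_vcf(lines):
    """Yield header lines unchanged and records with a reduced INFO column.

    No record is dropped, so filtered (non-PASS) variants are retained.
    """
    for line in lines:
        if line.startswith("#"):
            yield line.rstrip("\n")
        else:
            yield keep_af_info(line)
```

This sketch leaves the `##INFO` header lines for the removed annotations in place; a cleaner result would drop those as well.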

The network IO slows down Funcotator significantly, but not enough to make it unusable. For this reason, and partly because gnomAD requires an internet connection, the gnomAD data sources are disabled by default.

jonn-smith added this to the Funcotator 1.1 milestone Oct 4, 2018

jonn-smith added Funcotator FuncotatorClinicalPipeline labels Oct 4, 2018

jonn-smith added the Funcotator1.1 label Oct 15, 2018

jonn-smith modified the milestones: Funcotator 1.1, Engine-Q42018 Oct 15, 2018

jonn-smith self-assigned this Oct 16, 2018

jonn-smith added FuncotatorBetaBlocker and removed Funcotator 1.1 labels Nov 27, 2018

This was referenced Nov 27, 2018

Funcotator data sources are swapped #5428

Closed

Must add gnomad (disabled) to all data sources. #5429

Closed

jonn-smith mentioned this issue Jan 28, 2019

Updates for latest data source release. #5614

Merged

droazen closed this as completed in #5614 Jan 28, 2019
