Commit 079d34a

Adaptive pruning option for local assembly (#5473)
1 parent 6096004 commit 079d34a

24 files changed: +676 -317 lines changed

docs/local_assembly.pdf

-4.74 KB
Binary file not shown.

docs/local_assembly.tex

+6-1
Original file line number | Diff line number | Diff line change
@@ -75,7 +75,12 @@ \section{Building the graph} \label{graph-assembly}
7575
\section{Cleaning the Graph} \label{graph-cleaning}
7676
Before deciding on candidate haplotypes, the assembler simplifies the graph with the following heuristics to remove spurious paths and to merge variant paths that diverge from the reference.
7777
\begin{itemize}
78-
\item pruning: The assembler finds all maximal non-branching subgraphs and removes those that 1) do not share an edge with the reference path and 2) contain no edges with sufficient multiplicity\footnote{By default 2. This is controlled by the \code{minPruning} argument.}. While the default multiplicity threshold of 2 is quite permissive, it \textit{does} cause \Mutect~ to lose sensitivity for deletions occurring in a single read\footnote{While an SNV occurring on a single read would not yield a confident somatic variant call, a long deletion in a non-STR context could easily be supported by a single read, because the probability of its arising from sequencing error is tiny.}.
78+
\item pruning: The assembler finds all maximal non-branching subgraphs (``chains'') and removes those that 1) do not share an edge with the reference path and 2) contain no edges with sufficient multiplicity\footnote{By default 2. This is controlled by the \code{minPruning} argument.}. While the default multiplicity threshold of 2 is quite permissive, it \textit{does} cause \Mutect~ to lose sensitivity for deletions occurring in a single read\footnote{While an SNV occurring on a single read would not yield a confident somatic variant call, a long deletion in a non-STR context could easily be supported by a single read, because the probability of its arising from sequencing error is tiny.}.
79+
80+
The command line flag \code{--adaptive-pruning} turns on an adaptive pruning algorithm that adjusts itself to both the local depth of coverage and the observed sequencing error rate and removes chains based on a likelihood score. The score of a chain is the maximum of a left score and a right score, where the score on the left (right) end of the chain is the active region determination log likelihood from the \code{Mutect2Engine}, treating the first (last) edge of the chain as potential variant reads and all other outgoing (incoming) edges of the first (last) vertex in the chain as ref reads. The adaptive algorithm runs in two passes: the first pass identifies likely errors, from which it derives an empirical estimate of the error rate.
81+
82+
The adaptive pruning option is extremely useful for samples with high coverage, such as mitochondria and targeted panels, and for samples with variable coverage, such as exomes and RNA.
83+
7984
\item dangling tails: The assembler only outputs haplotypes that start and end with a reference kmer, so it attempts to rescue paths in the graph that do not. To rescue a ``dangling tail'' -- a path that ends in a non-reference kmer vertex -- the assembler first traverses the graph backwards from this vertex to a reference vertex. If during traversal it encounters a vertex with more than one incoming edge it gives up\footnote{as opposed to doing e.g.\ a depth-first search of all possible paths back to the reference.}. It also gives up if it encounters a vertex with more than one outgoing edge, that is, if the path branches again after diverging from the reference\footnote{It seems like this could be changed to increase sensitivity.}. Then it generates the Smith-Waterman alignment of the branching path versus the reference path after the vertex at which they diverge. If the alignment's CIGAR contains three or fewer elements, that is, if the alignment has at most one indel, the assembly engine attempts to merge the dangling tail back into the reference.
8085

8186
To merge the dangling tail back into the reference path, the assembler finds the beginning of the maximal common suffix of the dangling path and the reference path, that is, the point at which the sequences converge\footnote{this is \textit{not} where the \textit{paths in the graph} converge (they don't) because kmers in the suffix disagree with the ref at upstream bases.}, and adds an edge between the dangling path's vertex and the reference path's vertex at this position. This means that the graph is no longer a valid de Bruijn graph because the dangling vertex kmer and its succeeding reference vertex kmer do not overlap by $k - 1$ bases. Nonetheless, this graph yields valid haplotypes when we later ``zip'' the graph's chains (see below) by accumulating the last base of each kmer.
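
The chain score described in the adaptive pruning paragraph of this file can be sketched roughly as follows. This is a minimal illustration, not GATK code: the class and method names are hypothetical, the binomial likelihood ratio is a simple stand-in for the active region log likelihood in \code{Mutect2Engine} (whose exact form is not shown in this commit), the error rate is passed in rather than estimated in a first pass, and pruning a chain when its score falls below the threshold is an assumption about the direction of the comparison.

```java
/** Illustrative chain scoring. The likelihood function is a stand-in, not GATK's. */
final class ChainScoreSketch {

    /** Returns x * log10(y), using the convention 0 * log10(0) = 0. */
    private static double xLog10y(final int x, final double y) {
        return x == 0 ? 0.0 : x * Math.log10(y);
    }

    /**
     * Stand-in log10 likelihood ratio: the reads explained by a real variant at its
     * maximum-likelihood fraction versus the same reads explained by sequencing
     * error at rate errorRate.
     */
    static double log10LikelihoodRatio(final int refCount, final int altCount, final double errorRate) {
        final int total = refCount + altCount;
        if (total == 0) {
            return 0.0;
        }
        final double altFraction = (double) altCount / total;
        final double variantLog10 = xLog10y(altCount, altFraction) + xLog10y(refCount, 1 - altFraction);
        final double errorLog10 = xLog10y(altCount, errorRate) + xLog10y(refCount, 1 - errorRate);
        return variantLog10 - errorLog10;
    }

    /**
     * Score of a chain: the larger of the left-end and right-end scores.
     * Left end: the first edge counts as variant reads, the other outgoing edges
     * of the first vertex as ref reads. Right end: the last edge counts as variant
     * reads, the other incoming edges of the last vertex as ref reads.
     */
    static double chainScore(final int firstEdgeMultiplicity, final int otherOutgoingMultiplicity,
                             final int lastEdgeMultiplicity, final int otherIncomingMultiplicity,
                             final double errorRate) {
        final double left = log10LikelihoodRatio(otherOutgoingMultiplicity, firstEdgeMultiplicity, errorRate);
        final double right = log10LikelihoodRatio(otherIncomingMultiplicity, lastEdgeMultiplicity, errorRate);
        return Math.max(left, right);
    }

    /** Assumed pruning rule: drop the chain if its score is below the threshold. */
    static boolean shouldPrune(final double score, final double log10OddsThreshold) {
        return score < log10OddsThreshold;
    }

    public static void main(final String[] args) {
        final double errorRate = 0.001;   // matches the default initial error rate in this commit
        final double threshold = 1.0;     // matches the default pruning-lod-threshold in this commit
        final double singleRead = chainScore(1, 99, 1, 99, errorRate);
        final double wellSupported = chainScore(20, 80, 1, 99, errorRate);
        System.out.println("single-read chain pruned: " + shouldPrune(singleRead, threshold));
        System.out.println("well-supported chain pruned: " + shouldPrune(wellSupported, threshold));
    }
}
```

Because the score depends on the ratio of variant to reference support at each end of the chain rather than on a fixed multiplicity, the same threshold behaves sensibly at both low and high depth, which is the motivation the text gives for the adaptive option.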

src/main/java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/AssemblyBasedCallerArgumentCollection.java

+4-1
@@ -29,8 +29,11 @@ public abstract class AssemblyBasedCallerArgumentCollection extends StandardCall
2929
@ArgumentCollection
3030
public AssemblyRegionTrimmerArgumentCollection assemblyRegionTrimmerArgs = new AssemblyRegionTrimmerArgumentCollection();
3131

32+
protected boolean useMutectAssemblerArgumentCollection() { return false; }
33+
3234
@ArgumentCollection
33-
public ReadThreadingAssemblerArgumentCollection assemblerArgs = new ReadThreadingAssemblerArgumentCollection();
35+
public ReadThreadingAssemblerArgumentCollection assemblerArgs = useMutectAssemblerArgumentCollection() ?
36+
new MutectReadThreadingAssemblerArgumentCollection() : new HaplotypeCallerReadThreadingAssemblerArgumentCollection();
3437

3538
@ArgumentCollection
3639
public LikelihoodEngineArgumentCollection likelihoodArgs = new LikelihoodEngineArgumentCollection();
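
The hunk above picks the assembler argument collection through an overridable hook that is called from a field initializer. A minimal sketch of that pattern follows, using hypothetical stand-in classes (every name ending in `Sketch` or `Demo` is invented for illustration). It assumes that the Mutect2 argument collection overrides the hook to return true; that override is not part of the hunks shown here.

```java
/** Hypothetical stand-ins for the two assembler argument collections. */
abstract class AssemblerArgsSketch {
    abstract String describe();
}

final class HaplotypeCallerAssemblerArgsSketch extends AssemblerArgsSketch {
    @Override
    String describe() { return "HaplotypeCaller assembler args"; }
}

final class MutectAssemblerArgsSketch extends AssemblerArgsSketch {
    @Override
    String describe() { return "Mutect assembler args"; }
}

/** Mirrors the shape of the base caller argument collection in the hunk above. */
abstract class CallerArgsSketch {
    // Hook with a default; subclasses override it to change which collection is built.
    protected boolean useMutectAssemblerArgumentCollection() { return false; }

    // The initializer runs while the base class is being constructed, and dynamic
    // dispatch selects the subclass override. This is safe only because the hook
    // returns a constant and reads no subclass fields, which would still be unset.
    public AssemblerArgsSketch assemblerArgs = useMutectAssemblerArgumentCollection()
            ? new MutectAssemblerArgsSketch()
            : new HaplotypeCallerAssemblerArgsSketch();
}

final class HaplotypeCallerArgsSketch extends CallerArgsSketch { }

final class MutectArgsSketch extends CallerArgsSketch {
    @Override
    protected boolean useMutectAssemblerArgumentCollection() { return true; }
}

final class ArgumentSelectionDemo {
    public static void main(final String[] args) {
        System.out.println(new HaplotypeCallerArgsSketch().assemblerArgs.describe());
        System.out.println(new MutectArgsSketch().assemblerArgs.describe());
    }
}
```

The design keeps a single `assemblerArgs` field on the shared base class, so existing code that reads `args.assemblerArgs` continues to compile while each tool exposes its own set of command line flags.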

src/main/java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/AssemblyBasedCallerUtils.java

+1-10
@@ -189,19 +189,10 @@ public static ReadLikelihoodCalculationEngine createLikelihoodCalculationEngine(
189189

190190
public static ReadThreadingAssembler createReadThreadingAssembler(final AssemblyBasedCallerArgumentCollection args) {
191191
final ReadThreadingAssemblerArgumentCollection rtaac = args.assemblerArgs;
192-
final ReadThreadingAssembler assemblyEngine = new ReadThreadingAssembler(rtaac.maxNumHaplotypesInPopulation, rtaac.kmerSizes, rtaac.dontIncreaseKmerSizesForCycles, rtaac.allowNonUniqueKmersInRef, rtaac.numPruningSamples);
193-
assemblyEngine.setErrorCorrectKmers(rtaac.errorCorrectKmers);
194-
assemblyEngine.setPruneFactor(rtaac.minPruneFactor);
192+
final ReadThreadingAssembler assemblyEngine = rtaac.makeReadThreadingAssembler();
195193
assemblyEngine.setDebug(args.debug);
196-
assemblyEngine.setDebugGraphTransformations(rtaac.debugGraphTransformations);
197-
assemblyEngine.setRecoverDanglingBranches(!rtaac.doNotRecoverDanglingBranches);
198-
assemblyEngine.setMinDanglingBranchLength(rtaac.minDanglingBranchLength);
199194
assemblyEngine.setMinBaseQualityToUseInAssembly(args.minBaseQualityScore);
200195

201-
if ( rtaac.graphOutput != null ) {
202-
assemblyEngine.setGraphWriter(new File(rtaac.graphOutput));
203-
}
204-
205196
return assemblyEngine;
206197
}
207198

src/main/java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerEngine.java

+2-2
@@ -263,7 +263,7 @@ private void validateAndInitializeArgs() {
263263
hcArgs.setSampleContamination(AlleleBiasedDownsamplingUtils.loadContaminationFile(hcArgs.CONTAMINATION_FRACTION_FILE, hcArgs.CONTAMINATION_FRACTION, sampleSet, logger));
264264
}
265265

266-
if ( hcArgs.genotypingOutputMode == GenotypingOutputMode.GENOTYPE_GIVEN_ALLELES && hcArgs.assemblerArgs.consensusMode ) {
266+
if ( hcArgs.genotypingOutputMode == GenotypingOutputMode.GENOTYPE_GIVEN_ALLELES && hcArgs.assemblerArgs.consensusMode() ) {
267267
throw new UserException("HaplotypeCaller cannot be run in both GENOTYPE_GIVEN_ALLELES mode and in consensus mode at the same time. Please choose one or the other.");
268268
}
269269

@@ -604,7 +604,7 @@ public List<VariantContext> callRegion(final AssemblyRegion region, final Featur
604604
assemblyResult.getPaddedReferenceLoc(),
605605
regionForGenotyping.getSpan(),
606606
features,
607-
(hcArgs.assemblerArgs.consensusMode ? Collections.<VariantContext>emptyList() : givenAlleles),
607+
(hcArgs.assemblerArgs.consensusMode() ? Collections.<VariantContext>emptyList() : givenAlleles),
608608
emitReferenceConfidence(),
609609
hcArgs.maxMnpDistance,
610610
readsHeader,

src/main/java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/HaplotypeCallerReadThreadingAssemblerArgumentCollection.java

@@ -0,0 +1,61 @@
1+
package org.broadinstitute.hellbender.tools.walkers.haplotypecaller;
2+
3+
import org.broadinstitute.barclay.argparser.Advanced;
4+
import org.broadinstitute.barclay.argparser.Argument;
5+
import org.broadinstitute.barclay.argparser.Hidden;
6+
import org.broadinstitute.hellbender.tools.walkers.haplotypecaller.readthreading.ReadThreadingAssembler;
7+
8+
import java.io.File;
9+
10+
public class HaplotypeCallerReadThreadingAssemblerArgumentCollection extends ReadThreadingAssemblerArgumentCollection {
11+
private static final long serialVersionUID = 6520834L;
12+
/**
13+
* A single edge multiplicity cutoff for pruning doesn't work in samples with variable depths, for example exomes
14+
* and RNA. This parameter enables the probabilistic algorithm for pruning the assembly graph that considers the
15+
* likelihood that each chain in the graph comes from real variation.
16+
*/
17+
@Advanced
18+
@Argument(fullName="adaptive-pruning", doc = "Use Mutect2's adaptive graph pruning algorithm", optional = true)
19+
public boolean useAdaptivePruning = false;
20+
21+
/**
22+
* By default, the read threading assembler will attempt to recover dangling heads and tails. See the `minDanglingBranchLength` argument documentation for more details.
23+
*/
24+
@Hidden
25+
@Argument(fullName="do-not-recover-dangling-branches", doc="Disable dangling head and tail recovery", optional = true)
26+
public boolean doNotRecoverDanglingBranches = false;
27+
28+
/**
29+
* As of version 3.3, this argument is no longer needed because dangling end recovery is now the default behavior. See GATK 3.3 release notes for more details.
30+
*/
31+
@Deprecated
32+
@Argument(fullName="recover-dangling-heads", doc="This argument is deprecated since version 3.3", optional = true)
33+
public boolean DEPRECATED_RecoverDanglingHeads = false;
34+
35+
/**
36+
* This argument is specifically intended for 1000G consensus analysis mode. Setting this flag will inject all
37+
* provided alleles to the assembly graph but will not forcibly genotype all of them.
38+
*/
39+
@Advanced
40+
@Argument(fullName="consensus", doc="1000G consensus mode", optional = true)
41+
public boolean consensusMode = false;
42+
43+
@Override
44+
public ReadThreadingAssembler makeReadThreadingAssembler() {
45+
final ReadThreadingAssembler assemblyEngine = new ReadThreadingAssembler(maxNumHaplotypesInPopulation, kmerSizes,
46+
dontIncreaseKmerSizesForCycles, allowNonUniqueKmersInRef, numPruningSamples, useAdaptivePruning ? 0 : minPruneFactor,
47+
useAdaptivePruning, initialErrorRateForPruning, pruningLog10OddsThreshold, maxUnprunedVariants);
48+
assemblyEngine.setDebugGraphTransformations(debugGraphTransformations);
49+
assemblyEngine.setRecoverDanglingBranches(!doNotRecoverDanglingBranches);
50+
assemblyEngine.setMinDanglingBranchLength(minDanglingBranchLength);
51+
52+
if ( graphOutput != null ) {
53+
assemblyEngine.setGraphWriter(new File(graphOutput));
54+
}
55+
56+
return assemblyEngine;
57+
}
58+
59+
@Override
60+
public boolean consensusMode() { return consensusMode; }
61+
}

src/main/java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/MutectReadThreadingAssemblerArgumentCollection.java

@@ -0,0 +1,36 @@
1+
package org.broadinstitute.hellbender.tools.walkers.haplotypecaller;
2+
3+
import org.broadinstitute.barclay.argparser.Advanced;
4+
import org.broadinstitute.barclay.argparser.Argument;
5+
import org.broadinstitute.hellbender.tools.walkers.haplotypecaller.readthreading.ReadThreadingAssembler;
6+
7+
import java.io.File;
8+
9+
public class MutectReadThreadingAssemblerArgumentCollection extends ReadThreadingAssemblerArgumentCollection {
10+
private static final long serialVersionUID = 5304L;
11+
12+
/**
13+
* A single edge multiplicity cutoff for pruning doesn't work in samples with variable depths, for example exomes
14+
* and RNA. This parameter disables the probabilistic algorithm for pruning the assembly graph that considers the
15+
* likelihood that each chain in the graph comes from real variation, and instead uses a simple multiplicity cutoff.
16+
*/
17+
@Advanced
18+
@Argument(fullName="disable-adaptive-pruning", doc = "Disable the adaptive algorithm for pruning paths in the graph", optional = true)
19+
public boolean disableAdaptivePruning = false;
20+
21+
@Override
22+
public ReadThreadingAssembler makeReadThreadingAssembler() {
23+
final ReadThreadingAssembler assemblyEngine = new ReadThreadingAssembler(maxNumHaplotypesInPopulation, kmerSizes,
24+
dontIncreaseKmerSizesForCycles, allowNonUniqueKmersInRef, numPruningSamples, disableAdaptivePruning ? minPruneFactor : 0,
25+
!disableAdaptivePruning, initialErrorRateForPruning, pruningLog10OddsThreshold, maxUnprunedVariants);
26+
assemblyEngine.setDebugGraphTransformations(debugGraphTransformations);
27+
assemblyEngine.setRecoverDanglingBranches(true);
28+
assemblyEngine.setMinDanglingBranchLength(minDanglingBranchLength);
29+
30+
if ( graphOutput != null ) {
31+
assemblyEngine.setGraphWriter(new File(graphOutput));
32+
}
33+
34+
return assemblyEngine;
35+
}
36+
}
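
The two new argument collections wire the same pair of constructor arguments in opposite directions: HaplotypeCaller opts in to adaptive pruning, Mutect2 opts out. The sketch below restates just those two ternaries so the resulting settings are easy to compare. The class `EffectivePruningSketch` is hypothetical, and reading the two values as "prune factor" and "adaptive flag" is inferred from the variable names in the diff, since the `ReadThreadingAssembler` constructor signature is not shown here.

```java
/** Hypothetical holder for the two pruning-related values passed to the assembler. */
final class EffectivePruningSketch {
    final int pruneFactor;
    final boolean adaptive;

    private EffectivePruningSketch(final int pruneFactor, final boolean adaptive) {
        this.pruneFactor = pruneFactor;
        this.adaptive = adaptive;
    }

    /** HaplotypeCaller: adaptive pruning is opt-in via --adaptive-pruning. */
    static EffectivePruningSketch forHaplotypeCaller(final boolean useAdaptivePruning, final int minPruneFactor) {
        return new EffectivePruningSketch(useAdaptivePruning ? 0 : minPruneFactor, useAdaptivePruning);
    }

    /** Mutect2: adaptive pruning is on unless --disable-adaptive-pruning is given. */
    static EffectivePruningSketch forMutect(final boolean disableAdaptivePruning, final int minPruneFactor) {
        return new EffectivePruningSketch(disableAdaptivePruning ? minPruneFactor : 0, !disableAdaptivePruning);
    }

    @Override
    public String toString() {
        return "pruneFactor=" + pruneFactor + ", adaptive=" + adaptive;
    }

    public static void main(final String[] args) {
        final int defaultMinPruning = 2;  // default of min-pruning in the base collection
        System.out.println("HaplotypeCaller default: " + forHaplotypeCaller(false, defaultMinPruning));
        System.out.println("HaplotypeCaller adaptive: " + forHaplotypeCaller(true, defaultMinPruning));
        System.out.println("Mutect2 default: " + forMutect(false, defaultMinPruning));
        System.out.println("Mutect2 disabled: " + forMutect(true, defaultMinPruning));
    }
}
```

In both tools the fixed multiplicity cutoff is set to 0 whenever the adaptive algorithm is active, so the two pruning strategies are never applied together.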

src/main/java/org/broadinstitute/hellbender/tools/walkers/haplotypecaller/ReadThreadingAssemblerArgumentCollection.java

+29-29
@@ -4,14 +4,15 @@
44
import org.broadinstitute.barclay.argparser.Advanced;
55
import org.broadinstitute.barclay.argparser.Argument;
66
import org.broadinstitute.barclay.argparser.Hidden;
7+
import org.broadinstitute.hellbender.tools.walkers.haplotypecaller.readthreading.ReadThreadingAssembler;
78

89
import java.io.Serializable;
910
import java.util.List;
1011

1112
/**
1213
* Set of arguments related to the {@link org.broadinstitute.hellbender.tools.walkers.haplotypecaller.readthreading.ReadThreadingAssembler}
1314
*/
14-
public final class ReadThreadingAssemblerArgumentCollection implements Serializable {
15+
public abstract class ReadThreadingAssemblerArgumentCollection implements Serializable {
1516
private static final long serialVersionUID = 1L;
1617

1718
// -----------------------------------------------------------------------------------------------
@@ -48,20 +49,6 @@ public final class ReadThreadingAssemblerArgumentCollection implements Serializa
4849
@Argument(fullName="num-pruning-samples", doc="Number of samples that must pass the minPruning threshold", optional = true)
4950
public int numPruningSamples = 1;
5051

51-
/**
52-
* As of version 3.3, this argument is no longer needed because dangling end recovery is now the default behavior. See GATK 3.3 release notes for more details.
53-
*/
54-
@Deprecated
55-
@Argument(fullName="recover-dangling-heads", doc="This argument is deprecated since version 3.3", optional = true)
56-
public boolean DEPRECATED_RecoverDanglingHeads = false;
57-
58-
/**
59-
* By default, the read threading assembler will attempt to recover dangling heads and tails. See the `minDanglingBranchLength` argument documentation for more details.
60-
*/
61-
@Hidden
62-
@Argument(fullName="do-not-recover-dangling-branches", doc="Disable dangling head and tail recovery", optional = true)
63-
public boolean doNotRecoverDanglingBranches = false;
64-
6552
/**
6653
* When constructing the assembly graph we are often left with "dangling" branches. The assembly engine attempts to rescue these branches
6754
* by merging them back into the main graph. This argument describes the minimum length of a dangling branch needed for the engine to
@@ -71,13 +58,7 @@ public final class ReadThreadingAssemblerArgumentCollection implements Serializa
7158
@Argument(fullName="min-dangling-branch-length", doc="Minimum length of a dangling branch to attempt recovery", optional = true)
7259
public int minDanglingBranchLength = 4;
7360

74-
/**
75-
* This argument is specifically intended for 1000G consensus analysis mode. Setting this flag will inject all
76-
* provided alleles to the assembly graph but will not forcibly genotype all of them.
77-
*/
78-
@Advanced
79-
@Argument(fullName="consensus", doc="1000G consensus mode", optional = true)
80-
public boolean consensusMode = false;
61+
8162

8263
/**
8364
* The assembly graph can be quite complex, and could imply a very large number of possible haplotypes. Each haplotype
@@ -91,13 +72,6 @@ public final class ReadThreadingAssemblerArgumentCollection implements Serializa
9172
@Argument(fullName="max-num-haplotypes-in-population", doc="Maximum number of haplotypes to consider for your population", optional = true)
9273
public int maxNumHaplotypesInPopulation = 128;
9374

94-
/**
95-
* Enabling this argument may cause fundamental problems with the assembly graph itself.
96-
*/
97-
@Hidden
98-
@Argument(fullName="error-correct-kmers", doc = "Use an exploratory algorithm to error correct the kmers used during assembly", optional = true)
99-
public boolean errorCorrectKmers = false;
100-
10175
/**
10276
* Paths with fewer supporting kmers than the specified threshold will be pruned from the graph.
10377
*
@@ -111,6 +85,28 @@ public final class ReadThreadingAssemblerArgumentCollection implements Serializa
11185
@Argument(fullName="min-pruning", doc = "Minimum support to not prune paths in the graph", optional = true)
11286
public int minPruneFactor = 2;
11387

88+
/**
89+
* Initial base error rate guess for the probabilistic adaptive pruning model. Results are not very sensitive to this
90+
* parameter because it is only a starting point from which the algorithm discovers the true error rate.
91+
*/
92+
@Advanced
93+
@Argument(fullName="adaptive-pruning-initial-error-rate", doc = "Initial base error rate estimate for adaptive pruning", optional = true)
94+
public double initialErrorRateForPruning = 0.001;
95+
96+
/**
97+
* Log-10 likelihood ratio threshold for adaptive pruning algorithm.
98+
*/
99+
@Advanced
100+
@Argument(fullName="pruning-lod-threshold", doc = "Log-10 likelihood ratio threshold for adaptive pruning algorithm", optional = true)
101+
public double pruningLog10OddsThreshold = 1.0;
102+
103+
/**
104+
* The maximum number of variants in graph the adaptive pruner will allow
105+
*/
106+
@Advanced
107+
@Argument(fullName="max-unpruned-variants", doc = "Maximum number of variants in graph the adaptive pruner will allow", optional = true)
108+
public int maxUnprunedVariants = 100;
109+
114110
@Hidden
115111
@Argument(fullName="debug-graph-transformations", doc="Write DOT formatted graph files out of the assembler for only this graph size", optional = true)
116112
public boolean debugGraphTransformations = false;
@@ -137,4 +133,8 @@ public final class ReadThreadingAssemblerArgumentCollection implements Serializa
137133
@Hidden
138134
@Argument(fullName="min-observations-for-kmer-to-be-solid", doc = "A k-mer must be seen at least these times for it considered to be solid", optional = true)
139135
public int minObservationsForKmerToBeSolid = 20;
136+
137+
public abstract ReadThreadingAssembler makeReadThreadingAssembler();
138+
139+
public boolean consensusMode() { return false; }
140140
}
