Fix ReservoirDownsampler, PositionalDownsampler, and ReadsDownsamplingIterator. #5594
Changes from 1 commit
222332c
1933ea2
f12fc3d
b807bb8
aa2ca23
25b7e64
f1b5a19
efce411
4b4e30b
29bba52
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -1,13 +1,15 @@ | ||
package org.broadinstitute.hellbender.utils.downsampling; | ||
|
||
import htsjdk.samtools.SAMFileHeader; | ||
import org.broadinstitute.hellbender.exceptions.GATKException; | ||
import org.broadinstitute.hellbender.utils.Utils; | ||
import org.broadinstitute.hellbender.utils.read.GATKRead; | ||
import org.broadinstitute.hellbender.utils.read.ReadCoordinateComparator; | ||
import org.broadinstitute.hellbender.utils.read.ReadUtils; | ||
|
||
import java.util.ArrayList; | ||
import java.util.List; | ||
import java.util.Optional; | ||
|
||
|
||
/** | ||
|
@@ -71,16 +73,31 @@ private void handlePositionalChange( final GATKRead newRead ) { | |
// Use ReadCoordinateComparator to determine whether we've moved to a new start position. | ||
// ReadCoordinateComparator will correctly distinguish between purely unmapped reads and unmapped reads that | ||
// are assigned a nominal position. | ||
if ( previousRead != null && ReadCoordinateComparator.compareCoordinates(previousRead, newRead, header) != 0 ) { | ||
if ( reservoir.hasFinalizedItems() ) { | ||
finalizeReservoir(); | ||
if ( previousRead != null) { | ||
final int cmpDiff = ReadCoordinateComparator.compareCoordinates(previousRead, newRead, header); | ||
if (cmpDiff > 0) { | ||
throw new GATKException.ShouldNeverReachHereException( | ||
Review comment: This shouldn't be a […]
Reply: Right. BTW, I originally added this check because the […] |
||
String.format("Reads must be coordinate sorted (earlier %s later %s)", previousRead, newRead)); | ||
} | ||
if (cmpDiff != 0) { | ||
Review comment: […]
Reply: Done. |
||
reservoir.signalEndOfInput(); | ||
if ( reservoir.hasFinalizedItems() ) { | ||
finalizeReservoir(); | ||
Review comment: If the reservoir somehow had no finalized items after a positional change, wouldn't it get stuck in an "end of stream" state, since you're now calling signalEndOfInput() here? And wouldn't it then just explode on the subsequent reservoir.submit() call? Perhaps we should just unconditionally call into […]
Reply: I don't think it should be possible to have a position change and not have any finalized items, so the […] |
||
} | ||
} | ||
} | ||
} | ||
|
||
private void finalizeReservoir() { | ||
// We can't consume finalized reads from the reservoir unless we first signal EOI. | ||
// Once signalEndOfInput has been called and propagated to the ReservoirDownsampler, consumeFinalizedItems | ||
// must be called on the ReservoirDownsampler before any new items can be submitted to it, to reset its | ||
Review comment: […]
Reply: Done |
||
// state so it can be recycled/reused for the next downsampling position. | ||
reservoir.signalEndOfInput(); | ||
finalizedReads.addAll(reservoir.consumeFinalizedItems()); | ||
reservoir.clearItems(); | ||
Review comment: The call above to […]
Reply: Right. |
||
reservoir.resetStats(); | ||
previousRead = null; | ||
Review comment: Is it necessary to set […]
Reply: Yes, see my response comment above. |
||
} | ||
|
||
@Override | ||
|
@@ -97,9 +114,11 @@ public List<GATKRead> consumeFinalizedItems() { | |
|
||
@Override | ||
public boolean hasPendingItems() { | ||
// The finalized items in the ReservoirDownsampler are pending items from the perspective of the | ||
// enclosing PositionalDownsampler | ||
return reservoir.hasFinalizedItems(); | ||
// The ReservoirDownsampler accumulates pending items until signalEndOfInput has been called, at which | ||
// point all items that have survived downsampling become finalized. From the perspective of the enclosing | ||
// PositionalDownsampler, both finalized items and pending items in the ReservoirDownsampler are considered | ||
// pending. | ||
return reservoir.hasFinalizedItems() || reservoir.hasPendingItems(); | ||
Review comment: This seems right. |
||
} | ||
|
||
@Override | ||
|
@@ -109,9 +128,11 @@ public GATKRead peekFinalized() { | |
|
||
@Override | ||
public GATKRead peekPending() { | ||
// The finalized items in the ReservoirDownsampler are pending items from the perspective of the | ||
// enclosing PositionalDownsampler | ||
return reservoir.peekFinalized(); | ||
// The ReservoirDownsampler accumulates pending items until signalEndOfInput has been called, at which | ||
// point all items that have survived downsampling become finalized. From the perspective of the enclosing | ||
// PositionalDownsampler, both finalized items and pending items in the ReservoirDownsampler are considered | ||
// pending. | ||
return Optional.ofNullable(reservoir.peekFinalized()).orElse(reservoir.peekPending()); | ||
} | ||
|
||
@Override | ||
Review comment: When […]
Reply: Yes, I think it should, with the same protocol (either […] |
||
|
@@ -139,6 +160,7 @@ public boolean requiresCoordinateSortOrder() { | |
|
||
@Override | ||
public void signalNoMoreReadsBefore( final GATKRead read ) { | ||
Utils.nonNull(read, "Positional downsampler requires non-null reads"); | ||
handlePositionalChange(read); | ||
} | ||
} |
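To picture the positional-change handling discussed in this file, here is a minimal, self-contained sketch. It is not the GATK implementation: `SketchPositionalGrouper` and all of its methods are hypothetical, reads are reduced to integer start positions, and no downsampling is performed. It only illustrates the two ideas under review: rejecting out-of-order input, and finalizing the pending group whenever the start position changes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch: groups a coordinate-sorted stream of start positions, one group per position. */
final class SketchPositionalGrouper {
    private List<Integer> pendingGroup = new ArrayList<>();                // items at the current position
    private final List<List<Integer>> finalizedGroups = new ArrayList<>(); // completed positions
    private Integer previousStart;                                         // null until the first item arrives

    void submit(final int start) {
        if (previousStart != null) {
            final int cmp = Integer.compare(previousStart, start);
            // Only the sign of a comparison result is meaningful, so test "> 0" rather than "== 1".
            if (cmp > 0) {
                throw new IllegalArgumentException(String.format(
                        "Input must be coordinate sorted (earlier %d, later %d)", previousStart, start));
            }
            if (cmp != 0) {
                // The position changed, so nothing more can arrive for the previous position.
                finalizePendingGroup();
            }
        }
        pendingGroup.add(start);
        previousStart = start;
    }

    void signalEndOfInput() {
        if (!pendingGroup.isEmpty()) {
            finalizePendingGroup();
        }
    }

    boolean hasPendingItems() {
        return !pendingGroup.isEmpty();
    }

    List<List<Integer>> finalizedGroups() {
        return finalizedGroups;
    }

    private void finalizePendingGroup() {
        finalizedGroups.add(pendingGroup);
        pendingGroup = new ArrayList<>();
    }
}
```

In this sketch the last group stays pending until `signalEndOfInput()` is called, which mirrors the point made in the review: items at the current position cannot be treated as final while more input may still arrive.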
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -51,6 +51,13 @@ public final class ReservoirDownsampler extends ReadsDownsampler { | |
*/ | ||
private int totalReadsSeen; | ||
|
||
/** | ||
* In order to guarantee that all reads have equal probability of being discarded, we need to have consumed the | ||
* entire input stream before any items can become finalized. All submitted items (that survive downsampling) | ||
* remain pending until signalEndOfInput is called, at which point they become finalized. | ||
*/ | ||
private boolean endOfInputStream; | ||
|
||
|
||
/** | ||
* Construct a ReservoirDownsampler | ||
|
@@ -88,6 +95,9 @@ public ReservoirDownsampler(final int targetSampleSize ) { | |
@Override | ||
public void submit ( final GATKRead newRead ) { | ||
Utils.nonNull(newRead, "newRead"); | ||
// Once the end of the input stream has been seen, consumeFinalizedItems must be called to reset the state | ||
Review comment: […]
Reply: Done. |
||
// of the ReservoirDownsampler before more items can be submitted | ||
Utils.validate(endOfInputStream == false, "attempt to submit read after end of input stream has been seen"); | ||
Review comment: […]
Reply: […]
Reply: Done. |
||
|
||
// Only count reads that are actually eligible for discarding for the purposes of the reservoir downsampling algorithm | ||
totalReadsSeen++; | ||
|
@@ -110,35 +120,39 @@ public void submit ( final GATKRead newRead ) { | |
|
||
@Override | ||
public boolean hasFinalizedItems() { | ||
return ! reservoir.isEmpty(); | ||
// All items in the reservoir are pending until endOfInputStream is seen, at which point all items become finalized | ||
return endOfInputStream && !reservoir.isEmpty(); | ||
} | ||
|
||
@Override | ||
public List<GATKRead> consumeFinalizedItems() { | ||
Utils.validate(endOfInputStream == true, "signalEndOfInput must be called before finalized items can be consumed"); | ||
Review comment: If […]
Reply: Are you suggesting that this […]
Reply: Yes, I think it should be removed. The contract of […] |
||
if (hasFinalizedItems()) { | ||
// pass reservoir by reference rather than make a copy, for speed | ||
final List<GATKRead> downsampledItems = reservoir; | ||
clearItems(); | ||
return downsampledItems; | ||
} else { | ||
// if there's nothing here, don't bother allocating a new list | ||
clearItems(); | ||
Review comment: See above -- calling […]
Reply: This if-branch is the case where end of input has been signaled, but no items were ever submitted. Removing the […]
Reply: At some point we should consider adding a method to the […]
Reply: Most downsamplers don't get reused after […] |
||
return Collections.emptyList(); | ||
} | ||
} | ||
|
||
@Override | ||
public boolean hasPendingItems() { | ||
return false; | ||
// All items in the reservoir are pending until endOfInputStream is seen, at which point all items become finalized | ||
return !endOfInputStream && !reservoir.isEmpty(); | ||
} | ||
|
||
@Override | ||
public GATKRead peekFinalized() { | ||
return reservoir.isEmpty() ? null : reservoir.get(0); | ||
return hasFinalizedItems() ? reservoir.get(0) : null; | ||
} | ||
|
||
@Override | ||
public GATKRead peekPending() { | ||
return null; | ||
return hasPendingItems() ? reservoir.get(0) : null; | ||
} | ||
|
||
@Override | ||
|
@@ -148,7 +162,7 @@ public int size() { | |
|
||
@Override | ||
public void signalEndOfInput() { | ||
// NO-OP | ||
endOfInputStream = true; | ||
} | ||
|
||
/** | ||
|
@@ -164,6 +178,8 @@ public void clearItems() { | |
|
||
// an internal stat used by the downsampling process, so not cleared by resetStats() below | ||
totalReadsSeen = 0; | ||
|
||
endOfInputStream = false; | ||
} | ||
|
||
@Override | ||
|
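The pending/finalized state change introduced in this file can be summarized with a small stand-alone sketch. Everything here is an assumption made for illustration: `SketchReservoir` is a hypothetical generic class, not the GATK `ReservoirDownsampler`, and the seeded `Random` is only there to make the example repeatable. It shows items staying pending until end of input is signaled, and the state being reset when the finalized items are consumed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/** Hypothetical, simplified stand-in for a reservoir downsampler; not the GATK class. */
final class SketchReservoir<T> {
    private final int targetSampleSize;
    private final Random rng;
    private List<T> reservoir = new ArrayList<>();
    private int totalItemsSeen;
    private boolean endOfInputStream;

    SketchReservoir(final int targetSampleSize, final long seed) {
        if (targetSampleSize <= 0) {
            throw new IllegalArgumentException("targetSampleSize must be positive");
        }
        this.targetSampleSize = targetSampleSize;
        this.rng = new Random(seed);
    }

    void submit(final T item) {
        if (endOfInputStream) {
            throw new IllegalStateException("consumeFinalizedItems must be called before submitting again");
        }
        totalItemsSeen++;
        if (reservoir.size() < targetSampleSize) {
            reservoir.add(item);
        } else {
            // Classic reservoir sampling: the new item replaces a random slot
            // with probability targetSampleSize / totalItemsSeen.
            final int slot = rng.nextInt(totalItemsSeen);
            if (slot < targetSampleSize) {
                reservoir.set(slot, item);
            }
        }
    }

    void signalEndOfInput() {
        endOfInputStream = true;
    }

    // Items are pending until end of input, because any of them may still be replaced.
    boolean hasPendingItems() {
        return !endOfInputStream && !reservoir.isEmpty();
    }

    boolean hasFinalizedItems() {
        return endOfInputStream && !reservoir.isEmpty();
    }

    List<T> consumeFinalizedItems() {
        if (!endOfInputStream) {
            throw new IllegalStateException("signalEndOfInput must be called first");
        }
        final List<T> result = reservoir;
        // Reset all state so the instance can be reused for the next batch.
        reservoir = new ArrayList<>();
        totalItemsSeen = 0;
        endOfInputStream = false;
        return result;
    }
}
```

Whether `consumeFinalizedItems` should throw when called early, as it does in this sketch, or simply return an empty list is exactly the design question raised in the review comments; the sketch picks the stricter option only to make the protocol visible.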
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -9,6 +9,7 @@ | |
import org.testng.annotations.Test; | ||
|
||
import java.util.*; | ||
import java.util.stream.IntStream; | ||
|
||
public class ReadsDownsamplingIteratorUnitTest extends GATKBaseTest { | ||
|
||
|
@@ -132,6 +133,22 @@ public void testRemoveThrows() { | |
iter.remove(); | ||
} | ||
|
||
@Test | ||
public void testReadsDownsamplingIteratorWithReservoirDownsampler() { | ||
final int TOTAL_READ_COUNT = 100; | ||
final int TARGET_COVERAGE = 45; | ||
final List<GATKRead> reads = new ArrayList<>(TOTAL_READ_COUNT); | ||
IntStream.rangeClosed(1, TOTAL_READ_COUNT).forEach(i -> reads.add(readWithName(Integer.toString(i)))); | ||
|
||
final List<GATKRead> downsampledReads = new ArrayList<>(); | ||
final ReadsDownsamplingIterator downsamplingIter = new ReadsDownsamplingIterator(reads.iterator(), new ReservoirDownsampler(TARGET_COVERAGE)); | ||
for ( final GATKRead read : downsamplingIter ) { | ||
downsampledReads.add(read); | ||
} | ||
|
||
Assert.assertEquals(downsampledReads.size(), TARGET_COVERAGE); | ||
} | ||
Review comment: Can I suggest that you also modify the existing test […]
Reply: Yeah good idea. Done. |
||
|
||
private GATKRead readWithName( final String name ) { | ||
return ArtificialReadUtils.createArtificialRead(TextCigarCodec.decode("10M"), name); | ||
} | ||
|
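When sizing a test input with `IntStream`, it helps to keep in mind that `IntStream.range` excludes its upper bound while `IntStream.rangeClosed` includes it. The sketch below uses a hypothetical helper class, `SketchReadNames`, that is not part of the test in this diff, to show the difference in the number of names generated.

```java
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

/** Hypothetical helper illustrating exclusive vs. inclusive upper bounds. */
final class SketchReadNames {
    // Names "1" .. "n-1": the upper bound is excluded, so n - 1 names are produced.
    static List<String> exclusive(final int n) {
        return IntStream.range(1, n).mapToObj(Integer::toString).collect(Collectors.toList());
    }

    // Names "1" .. "n": the upper bound is included, so n names are produced.
    static List<String> inclusive(final int n) {
        return IntStream.rangeClosed(1, n).mapToObj(Integer::toString).collect(Collectors.toList());
    }
}
```

A downsampling assertion such as "the output has exactly the target size" passes with either input size as long as the input is larger than the target, so an off-by-one in the input count would not be caught by that assertion alone.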
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -66,24 +66,26 @@ public void testReservoirDownsampler(final ReservoirDownsamplerTest test ) { | |
|
||
downsampler.submit(test.createReads()); | ||
|
||
// after submit, but before signalEndOfInput, all reads are pending, none are finalized | ||
if ( test.totalReads > 0 ) { | ||
Assert.assertTrue(downsampler.hasFinalizedItems()); | ||
Assert.assertTrue(downsampler.peekFinalized() != null); | ||
Assert.assertFalse(downsampler.hasPendingItems()); | ||
Assert.assertTrue(downsampler.peekPending() == null); | ||
Assert.assertFalse(downsampler.hasFinalizedItems()); | ||
Assert.assertNull(downsampler.peekFinalized()); | ||
Assert.assertTrue(downsampler.hasPendingItems()); | ||
Assert.assertNotNull(downsampler.peekPending()); | ||
} | ||
else { | ||
Assert.assertFalse(downsampler.hasFinalizedItems() || downsampler.hasPendingItems()); | ||
Assert.assertTrue(downsampler.peekFinalized() == null && downsampler.peekPending() == null); | ||
} | ||
|
||
// after signalEndOfInput, no reads are pending, all are finalized | ||
Review comment: […]
Reply: Done. |
||
downsampler.signalEndOfInput(); | ||
|
||
if ( test.totalReads > 0 ) { | ||
Assert.assertTrue(downsampler.hasFinalizedItems()); | ||
Assert.assertTrue(downsampler.peekFinalized() != null); | ||
Assert.assertNotNull(downsampler.peekFinalized()); | ||
Assert.assertFalse(downsampler.hasPendingItems()); | ||
Assert.assertTrue(downsampler.peekPending() == null); | ||
Assert.assertNull(downsampler.peekPending()); | ||
} | ||
else { | ||
Assert.assertFalse(downsampler.hasFinalizedItems() || downsampler.hasPendingItems()); | ||
|
Review comment: SamplePartitioner also uses a ReservoirDownsampler directly -- did you check the usage there to ensure it's consistent with the new semantics?
Reply: I did, and just did so again to make sure. It looks like it follows the protocol pretty well and reliably calls signalEndOfInput before consuming items.
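The calling protocol mentioned in this thread (submit, then signal end of input, then consume) can be sketched from the caller's side. The types below are hypothetical and written only for this illustration: `SketchDownsampler` is a cut-down interface loosely modelled on the method names in the diff, `KeepFirstN` is a placeholder policy rather than a real sampling algorithm, and `SketchDownsamplingIterator` is not the GATK `ReadsDownsamplingIterator`.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

/** Hypothetical minimal downsampler contract. */
interface SketchDownsampler<T> {
    void submit(T item);
    void signalEndOfInput();
    boolean hasFinalizedItems();
    List<T> consumeFinalizedItems();
}

/** Placeholder policy: keeps the first maxItems items, all pending until end of input. */
final class KeepFirstN<T> implements SketchDownsampler<T> {
    private final int maxItems;
    private List<T> kept = new ArrayList<>();
    private boolean endOfInput;

    KeepFirstN(final int maxItems) { this.maxItems = maxItems; }

    public void submit(final T item) {
        if (endOfInput) throw new IllegalStateException("submit after end of input");
        if (kept.size() < maxItems) kept.add(item);
    }

    public void signalEndOfInput() { endOfInput = true; }

    public boolean hasFinalizedItems() { return endOfInput && !kept.isEmpty(); }

    public List<T> consumeFinalizedItems() {
        if (!endOfInput) throw new IllegalStateException("signalEndOfInput must be called first");
        final List<T> out = kept;
        kept = new ArrayList<>();
        endOfInput = false;
        return out;
    }
}

/** Hypothetical iterator that drives a downsampler and follows the protocol above. */
final class SketchDownsamplingIterator<T> implements Iterator<T> {
    private final Iterator<T> source;
    private final SketchDownsampler<T> downsampler;
    private final Deque<T> ready = new ArrayDeque<>();
    private boolean endSignaled;

    SketchDownsamplingIterator(final Iterator<T> source, final SketchDownsampler<T> downsampler) {
        this.source = source;
        this.downsampler = downsampler;
    }

    private void fill() {
        // Feed the downsampler until it releases something or the source runs dry.
        while (ready.isEmpty() && source.hasNext()) {
            downsampler.submit(source.next());
            if (downsampler.hasFinalizedItems()) {
                ready.addAll(downsampler.consumeFinalizedItems());
            }
        }
        // Source exhausted: signal end of input exactly once, then drain.
        if (ready.isEmpty() && !source.hasNext() && !endSignaled) {
            downsampler.signalEndOfInput();
            endSignaled = true;
            ready.addAll(downsampler.consumeFinalizedItems());
        }
    }

    @Override
    public boolean hasNext() { fill(); return !ready.isEmpty(); }

    @Override
    public T next() {
        if (!hasNext()) throw new NoSuchElementException();
        return ready.poll();
    }
}
```

With a downsampler that finalizes nothing before end of input, as in this sketch, a caller that never signals end of input would never see any output, which is why direct callers need to be checked against the new semantics.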