-
Notifications
You must be signed in to change notification settings - Fork 603
Added trimming to the Allele removal code in the genotyping engine #6044
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added trimming to the Allele removal code in the genotyping engine #6044
Conversation
if( call != null ) { | ||
|
||
readAlleleLikelihoods = prepareReadAlleleLikelihoodsForAnnotation(readLikelihoods, perSampleFilteredReadList, | ||
emitReferenceConfidence, alleleMapper, readAlleleLikelihoods, call); | ||
|
||
final VariantContext annotatedCall = makeAnnotatedCall(ref, refLoc, tracker, header, mergedVC, readAlleleLikelihoods, call); | ||
final VariantContext annotatedCall = makeAnnotatedCall(ref, refLoc, tracker, header, mergedVC, readAlleleLikelihoods, call, annotationEngine); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Note: this uses the mergedVC rather than the culledVC which in turn is used to construct the reference context used for annotation.
…handled unsafe Allele list operations
Codecov Report
@@ Coverage Diff @@
## master #6044 +/- ##
===============================================
+ Coverage 87.206% 87.207% +0.001%
- Complexity 32722 32724 +2
===============================================
Files 2011 2011
Lines 150967 150981 +14
Branches 16134 16134
===============================================
+ Hits 131653 131666 +13
Misses 13702 13702
- Partials 5612 5613 +1
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamesemery These changes fix the issue but if I understand correctly (and I might not) I think you may be able to rearrange things for clarity.
I noticed that you need to explicitly keep track of mergedAlleles
more than you might want -- it's not final
, you need mergedAlleles++
when added NON_REF_ALLELE
, and you have to pass it to makeAnnotatedCall
. My first instinct was "wait, why isn't this equal to mergedVC.getAlleles().size()
", but I soon realized that this was the old buggy code. Nonetheless, it would be really nice if that old buggy code worked. That is, call.getAlleles().size() == mergedVC.getAlleles().size()
really should be the criterion for trimming.
Now, if I understand right, the reason this fails is that about 20 lines upstream of makeAnnotatedCall
we have mergedVC = removeAltAllelesIfTooManyGenotypes(ploidy, alleleMapper, mergedVC)
and don't check whether trimming is necessary. I would prefer to check whether trimming is necessary inside removeAltAllelesIfTooManyGenotypes
and inside makeAnnotatedCall
in order to render both methods self-contained. The performance cost, if any, will be completely negligible.
Thank you for the speedy review @davidbenjamin. I agree with you that the obvious place to trim the alleles is in Another proposal would have been to just hold onto the To clarify, it is possible to move the check to the right place its likely to force me to write a non-library implementation of the trimming code that tracks what edits it made and I was trying to avoid doing that. |
@jamesemery In that case, I give a 👍 as far as correctness is concerned and I will leave it to the engine team to sort out the software engineering trade-offs. You can merge whenever you are satisfied. |
@@ -167,6 +168,7 @@ public CalledHaplotypes assignGenotypeLikelihoods(final List<Haplotype> haplotyp | |||
loc); | |||
|
|||
VariantContext mergedVC = AssemblyBasedCallerUtils.makeMergedVariantContext(eventsAtThisLocWithSpanDelsReplaced); | |||
int mergedAlleles = mergedVC.getAlleles().size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable name isn't helpful. Maybe something like "numberOfAllelesBeforeSubsetting" would be clearer
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@jamesemery I think there's a newly introduced NPE here. I have a bunch of comments about minor stuff too.
I might actually just pull the trimming out of this method and do it outside of the annotating.
@@ -365,14 +368,15 @@ static VariantContext removeExcessAltAllelesFromVC(final VariantContext inputVC, | |||
return vcb.make(); | |||
} | |||
|
|||
protected VariantContext makeAnnotatedCall(byte[] ref, SimpleInterval refLoc, FeatureContext tracker, SAMFileHeader header, VariantContext mergedVC, ReadLikelihoods<Allele> readAlleleLikelihoods, VariantContext call) { | |||
@VisibleForTesting | |||
static protected VariantContext makeAnnotatedCall(byte[] ref, SimpleInterval refLoc, FeatureContext tracker, SAMFileHeader header, VariantContext mergedVC, int mergedAllelesSize, ReadLikelihoods<Allele> readAlleleLikelihoods, VariantContext call, VariantAnnotatorEngine annotationEngine) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same comment about the new variable name in this method
@@ -365,14 +368,15 @@ static VariantContext removeExcessAltAllelesFromVC(final VariantContext inputVC, | |||
return vcb.make(); | |||
} | |||
|
|||
protected VariantContext makeAnnotatedCall(byte[] ref, SimpleInterval refLoc, FeatureContext tracker, SAMFileHeader header, VariantContext mergedVC, ReadLikelihoods<Allele> readAlleleLikelihoods, VariantContext call) { | |||
@VisibleForTesting | |||
static protected VariantContext makeAnnotatedCall(byte[] ref, SimpleInterval refLoc, FeatureContext tracker, SAMFileHeader header, VariantContext mergedVC, int mergedAllelesSize, ReadLikelihoods<Allele> readAlleleLikelihoods, VariantContext call, VariantAnnotatorEngine annotationEngine) { | |||
final SimpleInterval locus = new SimpleInterval(mergedVC.getContig(), mergedVC.getStart(), mergedVC.getEnd()); | |||
final SimpleInterval refLocInterval= new SimpleInterval(refLoc); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we make a copy of the refloc?
@@ -167,6 +168,7 @@ public CalledHaplotypes assignGenotypeLikelihoods(final List<Haplotype> haplotyp | |||
loc); | |||
|
|||
VariantContext mergedVC = AssemblyBasedCallerUtils.makeMergedVariantContext(eventsAtThisLocWithSpanDelsReplaced); | |||
int mergedAlleles = mergedVC.getAlleles().size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You deference mergedVC here, but immediately afterwards there is a check for mergedVC == null. I think you've introduced an NPE here.
@@ -365,14 +368,15 @@ static VariantContext removeExcessAltAllelesFromVC(final VariantContext inputVC, | |||
return vcb.make(); | |||
} | |||
|
|||
protected VariantContext makeAnnotatedCall(byte[] ref, SimpleInterval refLoc, FeatureContext tracker, SAMFileHeader header, VariantContext mergedVC, ReadLikelihoods<Allele> readAlleleLikelihoods, VariantContext call) { | |||
@VisibleForTesting | |||
static protected VariantContext makeAnnotatedCall(byte[] ref, SimpleInterval refLoc, FeatureContext tracker, SAMFileHeader header, VariantContext mergedVC, int mergedAllelesSize, ReadLikelihoods<Allele> readAlleleLikelihoods, VariantContext call, VariantAnnotatorEngine annotationEngine) { | |||
final SimpleInterval locus = new SimpleInterval(mergedVC.getContig(), mergedVC.getStart(), mergedVC.getEnd()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can simplify this:
final SimpleInterval locus = new SimpleInterval(mergedVC.getContig(), mergedVC.getStart(), mergedVC.getEnd()); | |
final SimpleInterval locus = new SimpleInterval(mergedVC); |
final SimpleInterval locus = new SimpleInterval(mergedVC.getContig(), mergedVC.getStart(), mergedVC.getEnd()); | ||
final SimpleInterval refLocInterval= new SimpleInterval(refLoc); | ||
final ReferenceDataSource refData = new ReferenceMemorySource(new ReferenceBases(ref, refLocInterval), header.getSequenceDictionary()); | ||
final ReferenceContext referenceContext = new ReferenceContext(refData, locus, refLocInterval); | ||
|
||
final VariantContext untrimmedResult = annotationEngine.annotateContext(call, tracker, referenceContext, readAlleleLikelihoods, a -> true); | ||
return call.getAlleles().size() == mergedVC.getAlleles().size() ? untrimmedResult | ||
return call.getAlleles().size() == mergedAllelesSize ? untrimmedResult |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be simpler if we passed in a boolean called trimAlleles
. This is also fine but the mergedAllelesSize variable needs to be named something clearer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You might also want to use the untrimmedResult size instead of call since that would be more foolproof against changes in the future.
final SimpleInterval locus = new SimpleInterval(mergedVC.getContig(), mergedVC.getStart(), mergedVC.getEnd()); | ||
final SimpleInterval refLocInterval= new SimpleInterval(refLoc); | ||
final ReferenceDataSource refData = new ReferenceMemorySource(new ReferenceBases(ref, refLocInterval), header.getSequenceDictionary()); | ||
final ReferenceContext referenceContext = new ReferenceContext(refData, locus, refLocInterval); | ||
|
||
final VariantContext untrimmedResult = annotationEngine.annotateContext(call, tracker, referenceContext, readAlleleLikelihoods, a -> true); | ||
return call.getAlleles().size() == mergedVC.getAlleles().size() ? untrimmedResult | ||
return call.getAlleles().size() == mergedAllelesSize ? untrimmedResult |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use call.getNAlleles() instead.
Also, I would add a comment explaining why you can't trim after the code that removes alleles... |
@lbergelson responded to your comments |
final SimpleInterval locus = new SimpleInterval(mergedVC.getContig(), mergedVC.getStart(), mergedVC.getEnd()); | ||
@VisibleForTesting | ||
static protected VariantContext makeAnnotatedCall(byte[] ref, SimpleInterval refLoc, FeatureContext tracker, SAMFileHeader header, VariantContext mergedVC, int mergedAllelesListSizeBeforePossibleTrimming, ReadLikelihoods<Allele> readAlleleLikelihoods, VariantContext call, VariantAnnotatorEngine annotationEngine) { | ||
final SimpleInterval locus = new SimpleInterval(mergedVC.getContig()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's with this getContig()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
He approved the PR in his subsequent comment
I have tested that this explicitly works on the users data. I decided it was simplest to just check for mis-trimming at the very last stage. I'm a little weary about the change of the locus for the ref context from being the culledVC to being the mergedVC.
Fixes #5994