-
Notifications
You must be signed in to change notification settings - Fork 602
Fix bug in MannWhitneyU on some JVMs. #5371
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -2,6 +2,7 @@ | |
|
||
import htsjdk.samtools.util.Histogram; | ||
import org.apache.commons.math3.distribution.NormalDistribution; | ||
import org.apache.commons.math3.util.FastMath; | ||
import org.apache.log4j.Logger; | ||
|
||
|
||
|
@@ -484,7 +485,10 @@ Set<List<Integer>> getPermutations(final Integer[] listToPermute, int numOfPermu | |
* @return P-value based on histogram with u calculated for every possible permutation of group tag. | ||
*/ | ||
public double permutationTest(final double[] series1, final double[] series2, final double testStatU) { | ||
final Histogram<Double> histo = new Histogram<>(); | ||
|
||
// note that Mann-Whitney U stats are always integer or half-integer (this is true even in the case of ties) | ||
// thus for safety we store a histogram of twice the Mann-Whitney values | ||
final Histogram<Long> histo = new Histogram<>(); | ||
final int n1 = series1.length; | ||
final int n2 = series2.length; | ||
|
||
|
@@ -525,7 +529,7 @@ public double permutationTest(final double[] series1, final double[] series2, fi | |
assert (series2End == n2); | ||
|
||
double newU = MathUtils.sum(newSeries1) - ((n1 * (n1 + 1)) / 2.0); | ||
histo.increment(newU); | ||
histo.increment(FastMath.round(2 * newU)); | ||
} | ||
|
||
/** | ||
|
@@ -534,10 +538,10 @@ public double permutationTest(final double[] series1, final double[] series2, fi | |
* and dividing by the total count of everything in the histogram. Just using getCumulativeDistribution() gives | ||
* a p-value of 1 in the most extreme case which doesn't result in a usable z-score. | ||
*/ | ||
double sumOfAllSmallerBins = histo.get(testStatU).getValue() / 2.0; | ||
double sumOfAllSmallerBins = histo.get(FastMath.round(2 * testStatU)).getValue() / 2.0; | ||
|
||
for (final Histogram.Bin<Double> bin : histo.values()) { | ||
if (bin.getId() < testStatU) sumOfAllSmallerBins += bin.getValue(); | ||
for (final Histogram.Bin<Long> bin : histo.values()) { | ||
if (bin.getId() < FastMath.round(2 * testStatU)) sumOfAllSmallerBins += bin.getValue(); | ||
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I'm probably missing something here, but can you explain why There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Because the value is just the count, and we only modified the key. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Ah yes, I got confused because the code above is purposefully taking "half of the count in the observed bin". 👍 merge away |
||
} | ||
|
||
return sumOfAllSmallerBins / histo.getCount(); | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just checking, but you shouldn't need to round, is that right?
2 * newU
should always be an integer so casting or floor would also work here?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm worried that with finite precision
2 * newU
would end up slightly below the correct integer, in which case casting or floor would give the wrong value. And maybe this isn't supposed to happen, but I'm paranoid because this whole bug is from somebody's JVM doing things that the JVM isn't supposed to do.