Skip to content

Fix regression in Array.Sort for floats/doubles #37941

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 3 commits into from
Jun 28, 2020

Conversation

stephentoub
Copy link
Member

@stephentoub stephentoub commented Jun 16, 2020

Several months back we moved the sorting logic for primtive types out of native code into managed. Doing so helped to make the logic reusable for spans and helped to reduce GC latency, and also actually helped with throughput in a variety of cases. But it ended up regressing throughput for sorting larger arrays of floating-point values, with float/double.CompareTo not getting inlined, and even if it were inlined, containing much more logic than was present in the native implementation. The native implementation did a pre-pass to move all NaNs to the front and then just used simple < and > comparison operations, so the managed implementation now does as well.

With the exception of a large array of already-sorted Int32 values where there is still a small regression after this PR, all of the cases I've tested are either as good or better than .NET Core 3.1.

@jkotas, @GrabYourPitchforks, @tannergooding, thanks for your offline suggestions on approaches here; I tried out a variety of them, including vectorized float/double.CompareTo as well as unsafe casts to wrapper types with customized IComparable implementations, and this ended up being the best overall. Thanks as well to @nietras for pointing out the regression.

Benchmark:

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Diagnostics.CodeAnalysis;
using System.Linq;

[MemoryDiagnoser]
public class Program
{
    static void Main(string[] args) => BenchmarkSwitcher.FromAssemblies(new[] { typeof(Program).Assembly }).Run(args);
}

public class DoubleSorting : Sorting<double> { protected override double GetNext() => _random.Next(); }
public class Int32Sorting : Sorting<int> { protected override int GetNext() => _random.Next(); }
public class StringSorting : Sorting<string>
{
    protected override string GetNext() => string.Create(_random.Next(1, 5), _random, (dest, r) =>
    {
        for (int i = 0; i < dest.Length; i++) dest[i] = (char)('a' + r.Next(26));
    });
}

public abstract class Sorting<T>
{
    protected Random _random;
    private T[] _orig, _array;

    [Params(10, 100_000)]
    public int Size { get; set; }

    protected abstract T GetNext();

    [GlobalSetup]
    public void Setup()
    {
        _random = new Random(42);
        _orig = Enumerable.Range(0, Size).Select(_ => GetNext()).ToArray();
        _array = (T[])_orig.Clone();
        Array.Sort(_array);
    }

    [Benchmark]
    public void Sorted() => Array.Sort(_array);

    [Benchmark]
    public void Random()
    {
        _orig.AsSpan().CopyTo(_array);
        Array.Sort(_array);
    }
}
Type Method Toolchain Size Mean Ratio
DoubleSorting Sorted netcore31 10 59.58 ns 2.13
DoubleSorting Sorted master 10 30.52 ns 1.09
DoubleSorting Sorted pr 10 28.01 ns 1.00
DoubleSorting Random netcore31 10 77.50 ns 1.71
DoubleSorting Random master 10 67.61 ns 1.49
DoubleSorting Random pr 10 45.44 ns 1.00
DoubleSorting Sorted netcore31 100000 990,325.98 ns 1.27
DoubleSorting Sorted master 100000 2,898,290.96 ns 3.73
DoubleSorting Sorted pr 100000 777,209.08 ns 1.00
DoubleSorting Random netcore31 100000 5,940,056.30 ns 1.08
DoubleSorting Random master 100000 7,880,560.62 ns 1.44
DoubleSorting Random pr 100000 5,473,335.58 ns 1.00
Int32Sorting Sorted netcore31 10 38.02 ns 2.31
Int32Sorting Sorted master 10 17.09 ns 1.04
Int32Sorting Sorted pr 10 16.49 ns 1.00
Int32Sorting Random netcore31 10 49.97 ns 1.63
Int32Sorting Random master 10 31.30 ns 1.02
Int32Sorting Random pr 10 30.63 ns 1.00
Int32Sorting Sorted netcore31 100000 572,908.37 ns 0.90
Int32Sorting Sorted master 100000 640,966.00 ns 1.00
Int32Sorting Sorted pr 100000 639,118.53 ns 1.00
Int32Sorting Random netcore31 100000 5,072,236.56 ns 1.06
Int32Sorting Random master 100000 5,024,276.17 ns 1.05
Int32Sorting Random pr 100000 4,801,833.82 ns 1.00
StringSorting Sorted netcore31 10 575.87 ns 1.24
StringSorting Sorted master 10 432.40 ns 0.93
StringSorting Sorted pr 10 465.86 ns 1.00
StringSorting Random netcore31 10 1,758.64 ns 1.15
StringSorting Random master 10 1,425.75 ns 0.93
StringSorting Random pr 10 1,532.01 ns 1.00
StringSorting Sorted netcore31 100000 83,774,554.44 ns 1.14
StringSorting Sorted master 100000 74,485,247.25 ns 1.01
StringSorting Sorted pr 100000 73,450,518.68 ns 1.00
StringSorting Random netcore31 100000 103,108,672.31 ns 1.14
StringSorting Random master 100000 89,373,058.33 ns 0.99
StringSorting Random pr 100000 90,577,543.59 ns 1.00

@stephentoub stephentoub added this to the 5.0.0 milestone Jun 16, 2020
Copy link
Member

@jkotas jkotas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice!

@stephentoub stephentoub reopened this Jun 16, 2020
@stephentoub stephentoub reopened this Jun 16, 2020
@stephentoub stephentoub reopened this Jun 17, 2020
@stephentoub stephentoub force-pushed the fixsortperf_utilslessthanref branch from 046f85c to 07dcfe1 Compare June 17, 2020 02:27
@stephentoub stephentoub reopened this Jun 17, 2020
@stephentoub stephentoub reopened this Jun 17, 2020
@stephentoub stephentoub force-pushed the fixsortperf_utilslessthanref branch from 07dcfe1 to 04e1399 Compare June 17, 2020 09:43
@nietras
Copy link
Contributor

nietras commented Jun 17, 2020

@stephentoub perhaps consider testing performance of following type too:

    public class ComparableClassInt32 
        : IComparable<ComparableClassInt32>
    {
        public readonly int Value;

        public ComparableClassInt32(int value) =>
            Value = value;

        public int CompareTo(ComparableClassInt32 other) => 
            Value.CompareTo(other.Value);
    }

basic reference type overhead. Since string compares are "slow" you won't see possible reference type regressions for this. Just a suggestion :)

@stephentoub
Copy link
Member Author

Just a suggestion :)

A knowing smile? 😉 Yes, this case regresses. It appears to be due to dictionary lookups when calling LessThan/GreaterThan, which also prevent inlining. Evaluating options...

@nietras
Copy link
Contributor

nietras commented Jun 17, 2020

knowing smile? 😉

Ha yeah 😉 Back in 2018 I went through all "stages of inlining": huh?, wuhuu! (AggressiveInlining), doh! (reference type regression), oh come on! (JIT-me-not issues) 😅

Looking forward to what you come up with. 😀

@stephentoub stephentoub added the NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) label Jun 18, 2020
@stephentoub
Copy link
Member Author

#38229 will hopefully be the solution here.

@nietras
Copy link
Contributor

nietras commented Jun 22, 2020

Nice 👍 Now if #35791/#10048 were resolved too, we would have something to talk about 😉 Happy to help with that if I could get some pointers 😀

@AndyAyersMS
Copy link
Member

Happy to help with that if I could get some pointers

I'm certainly open to reconsidering #10048. The changes are simple enough, though I might end up restricting it to AggressiveInlining callees since the jit underestimates the size impact of a delegate invoke.

But I still don't have a clear picture of how allowing this actually provides benefit -- it sounds like you think with this we can unify some code and either make it perform better or at least not lose any perf. So perhaps a benchmark along these lines would be instructive?

@stephentoub
Copy link
Member Author

it sounds like you think with this we can unify some code and either make it perform better or at least not lose any perf.

Not to put words in @nietras mouth, but I expect what he's hoping to do is improve the Array.Sort(..., IComparer<T>) code paths. Today we have one sort implementation that covers both providing an IComparer<T> and a Comparison<T> (a delegate); the former is implemented by creating a delegate to its Compare method, which means Array.Sort(..., IComparer<T>) is allocating. On top of that, we just recently added a span-based sort, and it's currently defined with a generic TComparer : IComparer<T> comparer, with the idea that you could provide a struct-based comparer and it wouldn't allocate... unfortunately, it's the worst of all worlds right now, in that we box the TComparer into an IComparer<T>, and then create a delegate from its Compare method. If all of these code quality issues got sorted out, you could imagine we just had a single code path for TComparer that was written generically to use its Compare sans boxing, and then if you used the Comparison<T> overload, we'd create a struct-based TComparer that just wrapped that Comparison<T> to invoke it. If we everything worked out really well, it could potentially even be unified with the separate, duplicate code path we currently have that targets T : IComparable<T>, using a TComparer comparer struct that delegated to the T : IComparable<T> implementation.

So perhaps a benchmark along these lines would be instructive?

Sounds like a good thing to add to dotnet/performance.

@nietras
Copy link
Contributor

nietras commented Jun 23, 2020

If all of these code quality issues got sorted out, you could imagine we just had a single code path for TComparer that was written generically to use its Compare sans boxing, and then if you used the Comparison overload, we'd create a struct-based TComparer that just wrapped that Comparison to invoke it. If we everything worked out really well, it could potentially even be unified with the separate, duplicate code path we currently have that targets T : IComparable, using a TComparer comparer struct that delegated to the T : IComparable implementation.

@stephentoub exactly. :) Although, I am not sure we could unify on TComparer as such, but potentially yes. This would be the fulfillment of the API I have proposed and the on/off work I have been doing on this for the last 3 years... failing due to the many issues around inlining. At the very minimum we could replace the Comparison<T> path with the TComparer path without duplicating code.

Note this is not just about sorting. Sorting, however, for me is a good example of where .NET comes short. I use the value type as a inlineable "functor" pattern for data processing algorithms. Think loops over millions of elements. Unfortunately, we can't unify on this pattern due to these kinds of issues, which Sort exemplifies very well. Lots of other devs use this pattern, but we often end up having to duplicate code, and since this is code that is combinatorial on rank, different kinds of transformations etc. it adds up. It's a lot of replicated code. I have talked about this before and don't want to sound like a broken record player 😅

with the idea that you could provide a struct-based comparer and it wouldn't allocate...

It's not just the allocation. It's so the compare can be inlined. So the "functor" can be applied inlined. To yield a customized loop for performance. As you probably know.

perhaps a benchmark along these lines would be instructive?

@AndyAyersMS I don't know what kind of benchmark you are thinking about but something simple like below shows the issue.

using BenchmarkDotNet.Attributes;
using BenchmarkDotNet.Running;
using System;
using System.Collections.Generic;
using System.Runtime.CompilerServices;

namespace CompareBenchmarking
{
    public class Program
    {
        static void Main(string[] args) => 
            BenchmarkSwitcher.FromAssemblies(new[] { typeof(Program).Assembly }).Run(args);
    }

    public class CompareFloat : Compare<float> { protected override float GetNext() => _random.Next(); }
    public class CompareInt32 : Compare<int> { protected override int GetNext() => _random.Next(); }

    public struct ComparisonComparer<T> : IComparer<T>
    {
        readonly Comparison<T> _comparison;

        public ComparisonComparer(Comparison<T> comparison) =>
            _comparison = comparison;

        [MethodImpl(MethodImplOptions.AggressiveInlining)]
        public int Compare(T x, T y) => _comparison(x, y);
    }

    [MemoryDiagnoser]
    [DisassemblyDiagnoser]
    public abstract class Compare<T>
        where T : IComparable<T>
    {
        static readonly Comparer<T> _comparer = Comparer<T>.Default;
        static readonly Comparison<T> _comparison = Comparer<T>.Default.Compare;
        readonly ComparisonComparer<T> _comparisonComparer = new ComparisonComparer<T>(_comparison);

        protected Random _random;
        T _x;
        T _y;

        protected abstract T GetNext();

        [GlobalSetup]
        public void Setup()
        {
            _random = new Random(42);
            _x = GetNext();
            _y = GetNext();
        }

        [Benchmark]
        public int CompareTo() => _x.CompareTo(_y);

        [Benchmark]
        public int Comparer() => _comparer.Compare(_x, _y);

        [Benchmark(Baseline = true)]
        public int Comparison() => _comparison(_x, _y);

        [Benchmark]
        public int ComparisonComparer() => _comparisonComparer.Compare(_x, _y);
    }
}

With the following results on .NET 5.0 Preview 2. The factor of 2.31x pretty much says it all although that would be pretty self-evident given the extra indirection and code generation issues. Just imagine this in a tight loop. :)

BenchmarkDotNet=v0.12.1, OS=Windows 10.0.19041.329 (2004/?/20H1)
Intel Core i7-8700 CPU 3.20GHz (Coffee Lake), 1 CPU, 12 logical and 6 physical cores
.NET Core SDK=5.0.100-preview.2.20176.6
  [Host]     : .NET Core 5.0.0 (CoreCLR 5.0.20.16006, CoreFX 5.0.20.16006), X64 RyuJIT
  DefaultJob : .NET Core 5.0.0 (CoreCLR 5.0.20.16006, CoreFX 5.0.20.16006), X64 RyuJIT

CompareInt32

Method Mean Error StdDev Ratio
CompareTo 0.4995 ns 0.0051 ns 0.0048 ns 0.38
Comparer 1.1854 ns 0.0032 ns 0.0028 ns 0.89
Comparison 1.3258 ns 0.0065 ns 0.0054 ns 1.00
ComparisonComparer 3.0613 ns 0.0096 ns 0.0089 ns 2.31

CompareFloat

Method Mean Error StdDev Ratio RatioSD
CompareTo 1.237 ns 0.0018 ns 0.0015 ns 0.60 0.00
Comparer 1.905 ns 0.0628 ns 0.0557 ns 0.93 0.03
Comparison 2.060 ns 0.0047 ns 0.0039 ns 1.00 0.00
ComparisonComparer 3.479 ns 0.0096 ns 0.0090 ns 1.69 0.01

If this is interesting I can make a PR to the benchmark repo.

@stephentoub stephentoub force-pushed the fixsortperf_utilslessthanref branch from 04e1399 to 356828c Compare June 25, 2020 22:27
@stephentoub stephentoub force-pushed the fixsortperf_utilslessthanref branch from 356828c to a3e3d79 Compare June 26, 2020 21:34
AndyAyersMS added a commit to AndyAyersMS/runtime that referenced this pull request Jun 26, 2020
RyuJit would not inline methods that contained delegate invokes. Remove
this limitation.

Closes dotnet#10048. See also dotnet#37941.
jkotas pushed a commit that referenced this pull request Jun 27, 2020
RyuJit would not inline methods that contained delegate invokes. Remove
this limitation.

Closes #10048. See also #37941.
Several months back we moved the sorting logic for primtive types out of native code into managed.  Doing so helped to make the logic reusable for spans and helped to reduce GC latency, and also actually helped with throughput in a variety of cases.  But it ended up regressing throughput for sorting larger arrays of floating-point values, with float/double.CompareTo not getting inlined, and even if it were inlined, containing much more logic than was present in the native implementation.  The native implementation did a pre-pass to move all NaNs to the front and then just used simple < and > comparison operations, so the managed implementation now does as well.
@stephentoub stephentoub force-pushed the fixsortperf_utilslessthanref branch from a3e3d79 to 2bec1bf Compare June 27, 2020 23:57
@stephentoub
Copy link
Member Author

I moved the comparer functions into the two generic helper classes, and with #38229 (thanks, @jkotas), the reference-type-Int32-wrapper case is good again, with everything else being appx what it was before as well.

I'll merge when this is green.

@stephentoub stephentoub merged commit e1c9ab4 into dotnet:master Jun 28, 2020
@stephentoub stephentoub deleted the fixsortperf_utilslessthanref branch June 28, 2020 01:57
@nietras
Copy link
Contributor

nietras commented Jul 16, 2020

@stephentoub should I try make a PR for the TComparer change? Is there interest in getting this in? The change itself to TComparer is easy enough, it's the whole sort helpers creation etc. that needs to change a lot, and to support reference type delegate comparer will likely require a "unsafe" cast of Comparison<T> to Comparison<object>. Is that acceptable?

Also thank you for the mention in your awesome Performance Improvements in .NET 5 :)

@jkotas
Copy link
Member

jkotas commented Jul 16, 2020

Hi @nietras,

@stephentoub is OOF.

I think it would be a good idea to start the PR for the TComparer changes so that we can start sorting out the code quality issues that it is likely to expose. Preferably, we would fix them instead of working around them by duplicating large amounts of code.

I have opened #39466 to have this tracked. For time line, I do not expect we would be able to get this change into .NET 5.

"unsafe" cast of Comparison<T> to Comparison<object>

Why would that be needed?

I am wondering whether it would make sense to delete the TComparer overloads from the public surface for now until we can get the right implementation for them in place. I think they can stay, but just wanted to bring it up.

cc @eiriktsarpalis

@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Runtime NO-MERGE The PR is not ready for merge yet (see discussion for detailed reasons) tenet-performance Performance related issue
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants