
Correct double return type for SDOT and Co with Apple Accelerate #65

Open
mmuetzel opened this issue Feb 4, 2025 · 15 comments

Comments

@mmuetzel
Contributor

mmuetzel commented Feb 4, 2025

Some BLAS functions have the wrong return type in their implementation in Apple Accelerate. E.g., SDOT should return a single-precision floating-point value, but it returns a double-precision floating-point value.
Essentially, all functions that return a single-precision floating-point value in the reference implementation are affected.

Other implementations of BLAS (e.g., OpenBLAS) follow the reference implementation on macOS as well. That makes it difficult to use FlexiBLAS to switch, e.g., between OpenBLAS and Apple Accelerate on macOS.

Would it be possible for FlexiBLAS to add shims that convert the return type of the affected functions from the (non-standard) double to float?

See also the discussion here:
https://octave.discourse.group/t/macos-accelerate-framework-single-precision-issue/6140
Or:
https://fortran-lang.discourse.group/t/how-many-blas-libraries-have-this-error/4454
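
For illustration, a small self-contained C program can reproduce the symptom. This is only a sketch: it assumes a gfortran-style symbol name (sdot_) and a 32-bit integer ABI, and it must be linked against the BLAS in question (e.g., Accelerate).

#include <stdio.h>

typedef int blasint;

/* Reference-conforming prototype: SDOT returns REAL, i.e., float. */
extern float sdot_(blasint* n, float* sx, blasint* incx, float* sy, blasint* incy);

int main(void)
{
  blasint n = 3, inc = 1;
  float x[3] = {1.0f, 2.0f, 3.0f};
  float r = sdot_(&n, x, &inc, x, &inc);
  /* Expected: 14.0. If the backend actually returns a double (as Apple
     Accelerate does), the value read back through this float prototype
     is garbage. */
  printf("sdot = %f (expected 14.0)\n", (double) r);
  return 0;
}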

@grisuthedragon
Member

First, I would say: blame Apple for not following the rules, or even for using f2c (which introduced this issue) in the first place. For example, for sdot, the way FlexiBLAS provides the function is

float FC_GLOBAL(sdot,SDOT)(blasint* n, float* sx, blasint* incx, float* sy, blasint* incy)

and you need

double FC_GLOBAL(sdot,SDOT)(blasint* n, float* sx, blasint* incx, float* sy, blasint* incy)

Did I get this right?

Since this is an ABI change, the only way I can think of (by now) is to provide a separate flexiblas library and to add some detection mechanism that checks, when loading the backend, whether the single-precision routines do the right thing.

Can you check if this is the complete list of functions:

  • sdot
  • sdsdot
  • snrm2

Are single-precision complex functions affected as well?
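
A detection probe along those lines could be quite small. A rough sketch, assuming the backend's sdot symbol has already been looked up (e.g., via dlsym); the names here are illustrative, not actual FlexiBLAS API:

#include <math.h>

typedef int blasint;
typedef double (*sdot_double_t)(blasint*, float*, blasint*, float*, blasint*);

/* Returns 1 if the backend's sdot returns double (f2c-style ABI),
   0 if it conforms to the reference BLAS and returns float. */
static int sdot_returns_double(void *sym)
{
  blasint n = 2, inc = 1;
  float x[2] = {1.0f, 2.0f};  /* dot(x, x) = 5.0 */
  double d = ((sdot_double_t) sym)(&n, x, &inc, x, &inc);
  /* Only a genuine double return reproduces the exact value; a float
     placed in the same return register reads back as garbage here. */
  return fabs(d - 5.0) < 1e-6;
}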

@mmuetzel
Contributor Author

mmuetzel commented Feb 5, 2025

Thank you for considering this.

I agree that this mess is Apple's fault.
I kind of understand how the Apple devs could have accidentally called f2c with the wrong flags at some point in the 70's or 80's. But I don't understand why they haven't fixed that since...

Ideally, FlexiBLAS would expose function prototypes that conform to the reference implementation even if the underlying implementation has this defect. I.e., FlexiBLAS would continue to expose the function with

float FC_GLOBAL(sdot,SDOT)(blasint* n, float* sx, blasint* incx, float* sy, blasint* incy)

even if the underlying (defective) implementation uses the following (non-conformant) prototype:

double FC_GLOBAL(sdot,SDOT)(blasint* n, float* sx, blasint* incx, float* sy, blasint* incy)

But it would use a small shim function that casts the wrong return type (double) to the correct type (float), so that programs using these functions see no issue with Apple Accelerate.

So, if I understood you correctly, that would be the other way round from what you suggested.

I don't have physical access to a macOS system. But I could ask on the Octave forum thread if someone could run some tests.

If I understand the situation correctly, all functions (not subroutines) that return float are affected. From what I can tell, functions returning single-precision complex values are not affected. But I will try to have that confirmed.

Grepping for real\s*function in the sources of the reference implementation of BLAS (and dropping the hits for complex functions), I end up with the following list:

  • sasum
  • scabs1
  • scasum
  • scnrm2
  • sdot
  • sdsdot
  • snrm2

I wouldn't be surprised if they (the Apple devs) made the same mistake in their implementations of LAPACK functions. If that is true (I'll try to have that confirmed), that would also affect the following LAPACK functions in Apple Accelerate:

  • scsum1
  • sladiv
  • slangb
  • slange
  • slangt
  • slanhs
  • slansb
  • slansf
  • slansp
  • slanst
  • slansy
  • slantb
  • slantp
  • slantr
  • slapy2
  • slapy3
  • sla_gbrcond
  • sla_gbrpvgrw
  • sla_gecond
  • sla_gerpvgrw
  • sla_porcond
  • sla_porpvgrw
  • sla_syrcond
  • sla_syrpvgrw

@mmuetzel
Contributor Author

mmuetzel commented Feb 5, 2025

Thinking about it, there is probably no reason to exclude the C* functions from that list as long as they return REAL. So, functions like CLANGB are probably also affected...
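
For illustration, a sketch of what that would mean for clangb (hidden string-length arguments omitted): the arguments are complex, but the standard return type is still float, so an f2c-style implementation would return double here as well.

float FC_GLOBAL(clangb,CLANGB)(char* norm, blasint* n, blasint* kl, blasint* ku, float _Complex* ab, blasint* ldab, float* work);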

@grisuthedragon
Member

As a first thought, I do not have a proper idea how to implement a shim layer that selects based on the return type without recompiling the application. If you have any idea, let me know.

The scabs1 function can be neglected; it is only for internal BLAS usage and, for example, is not exposed by ATLAS or by very old Accelerate frameworks (to my knowledge).

@mmuetzel
Contributor Author

mmuetzel commented Feb 5, 2025

I don't know how exactly FlexiBLAS forwards to the actual implementation of a function. But I'd imagine that it would grab a pointer to the function somehow.
If that is the case, a shim layer for sdot could look something like this:

float FC_GLOBAL(sdot,SDOT)(blasint* n, float* sx, blasint* incx, float* sy, blasint* incy)
{
  /* Grab a pointer to the backend's (non-conforming) sdot, which returns double. */
  double (*sdot_ptr) (blasint*, float*, blasint*, float*, blasint*) = get_sdot_ptr ();  // hypothetical lookup; I don't know how that is actually done in FlexiBLAS
  return (float) sdot_ptr (n, sx, incx, sy, incy);
}
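
The (float) cast rounds the backend's double result to single precision. That may differ in the last bits from a computation carried out natively in single precision, but the caller at least gets a value of the conforming type.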

@dasergatskov

dasergatskov commented Feb 5, 2025

I kind of understand how the Apple devs could have accidentally called f2c with the wrong flags at some point in the 70's or 80's. But I don't understand why they haven't fixed that since...

I suspect they are still using f2c as they do not have an in-house Fortran compiler. Hopefully, flang-new is getting better and maybe in a few years...

mmuetzel added a commit to mmuetzel/flexiblas that referenced this issue Feb 5, 2025
This is a very crude attempt to check whether BLAS functions that should
return single-precision floating point values actually do that.
Some implementations (like Apple Accelerate) don't do that. See mpimd-csc#65.

This is mainly meant for early feedback.
@mmuetzel
Contributor Author

mmuetzel commented Feb 5, 2025

I've opened #68 for a potential implementation idea.

Would something like that work?

mmuetzel added a commit to mmuetzel/flexiblas that referenced this issue Feb 6, 2025
This is a very crude attempt to check whether BLAS functions that should
return single-precision floating point values actually do that.
Some implementations (like Apple Accelerate) don't do that. See mpimd-csc#65.

This is mainly meant for early feedback.
@grisuthedragon
Member

Does cdotu return float _Complex or double _Complex? Regarding f2c, this should be float _Complex via the Intel ABI.

@mmuetzel
Contributor Author

mmuetzel commented Feb 10, 2025

CDOTU returns a float _Complex type value with Apple Accelerate.
See: https://octave.discourse.group/t/macos-accelerate-framework-single-precision-issue/6140/27

That test was with gfortran using FlexiBLAS with FLEXIBLAS=APPLE though, using complex, external :: cdotu as the expected return type in the Fortran sources.
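
A minimal C variant of that check might look like the following; it assumes a gfortran-style symbol name (cdotu_) and the by-value complex return convention that the test above observed:

#include <complex.h>
#include <stdio.h>

typedef int blasint;

extern float _Complex cdotu_(blasint* n, float _Complex* cx, blasint* incx, float _Complex* cy, blasint* incy);

int main(void)
{
  blasint n = 2, inc = 1;
  float _Complex x[2] = {1.0f + 1.0f * I, 2.0f};
  /* Unconjugated dot product: (1+i)^2 + 2^2 = 4 + 2i */
  float _Complex r = cdotu_(&n, x, &inc, x, &inc);
  printf("cdotu = %g%+gi (expected 4+2i)\n", crealf(r), cimagf(r));
  return 0;
}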

@mmuetzel
Contributor Author

(OT: Regarding the reference to f2c: I'm no longer certain that the difference in return type came to the Apple implementation via f2c. Reading f2c's manual, the -!R flag (the default) was meant to model f77.
Apparently, the DEC f77 Fortran compiler returned double-precision values for REAL FUNCTION. Maybe they modelled their implementation after that compiler (and f2c wasn't involved at all).
I don't have access to DEC f77 though, so I can't confirm whether that actually is the case.)

@mmuetzel
Contributor Author

@dasergatskov tested using CDOTU with Apple Accelerate with an example written in C++: The return value is indeed of type std::complex<float> (i.e., float complex in C99).

@grisuthedragon
Member

I am on the way to building this into a new code generator. Did anybody test whether the LAPACK routines returning float behave like the BLAS ones? For complex float, we already figured it out.

@mmuetzel
Contributor Author

Thanks for looking into this.

If I recall correctly, we didn't check that yet.
I've asked for a test here: https://octave.discourse.group/t/macos-accelerate-framework-single-precision-issue/6140/37

If you have access to a macOS device with Apple Accelerate, you might be able to test that with the example programs from that thread.
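
Such a test could be along the following lines. A sketch only, assuming a gfortran-style symbol name (scsum1_); SCSUM1 returns the sum of the (true) absolute values of a complex vector:

#include <complex.h>
#include <stdio.h>

typedef int blasint;

/* Deliberately declared with the suspected (f2c-style) double return type. */
extern double scsum1_(blasint* n, float _Complex* cx, blasint* incx);

int main(void)
{
  blasint n = 2, inc = 1;
  float _Complex x[2] = {3.0f + 4.0f * I, 0.0f};  /* |3+4i| + |0| = 5 */
  double r = scsum1_(&n, x, &inc);
  /* Reading back exactly 5.0 means the routine really returns double. */
  printf("scsum1 = %f (5.0 would confirm the f2c-style ABI)\n", r);
  return 0;
}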

@grisuthedragon
Member

In the LAPACKE sources, there is a hint that this is true for LAPACK as well.

See
https://github.com/Reference-LAPACK/lapack/blob/ca82e5ea5cd2eaa769ac44fe199e1dce1c15c19e/LAPACKE/include/lapack.h#L110
and following.

@mmuetzel
Contributor Author

mmuetzel commented Mar 25, 2025

Anything else would have been surprising. But one never knows...

@dasergatskov also confirmed that scsum1 from Apple Accelerate has the same issue. Thank you for that!

So, the test agrees with that comment in the LAPACKE header. (But I wouldn't be so general as to blame f2c without some qualification. After all, it does produce C code with the "correct" return type if it is used with the correct flags...)
