
use Eigen as a BLAS alternative #858


Merged 4 commits on May 27, 2019
Conversation

@borg323 (Member) commented May 23, 2019

This is a port of leela-zero/leela-zero#1692.

@oscardssmith (Contributor):
How does it compare performance-wise?

@borg323 (Member, Author) commented May 23, 2019

I did a quick test with a T35 net and it is about 8-10% quicker than OpenBLAS on my i5. More tests are of course welcome.
Edit: The above test was against OpenBLAS; MKL is about 2% faster than Eigen in the same test.

@lealgo (Contributor) commented May 23, 2019

Builds and runs on Android. Great news! Now we've got two back-ends that can be built via cross-compilation.

In my first tests it appears to be slightly slower than OpenBLAS:

HWHWI:/data/local/tmp $ ./lc0-blas benchmark -w /sdcard/lc0/36089 --threads=8 --max-prefetch=0 --minibatch-size=128                                                                                         
       _
|   _ | |
|_ |_ |_| v0.22.0-dev built May 20 2019
Loading weights file from: /sdcard/lc0/36089
Creating backend [blas]...
BLAS, maximum batch size set to 256
BLAS vendor: OpenBlas.
OpenBlas [OpenBLAS 0.3.6 NO_LAPACK NO_LAPACKE NO_AFFINITY ARMV8 MAX_THREADS=8].
OpenBlas found 8 ARMV8 core(s).
OpenBLAS using 1 core(s) for this backend.
BLAS max batch size is 256.
Benchmark time 134ms, 2 nodes, 14 nps, move e2e4
Benchmark time 199ms, 3 nodes, 15 nps, move e2e4
Benchmark time 327ms, 5 nodes, 15 nps, move e2e4
Benchmark time 503ms, 8 nodes, 15 nps, move e2e4
Benchmark time 692ms, 12 nodes, 17 nps, move e2e4
Benchmark time 940ms, 20 nodes, 21 nps, move e2e4
Benchmark time 1241ms, 28 nodes, 22 nps, move e2e4
Benchmark time 1445ms, 30 nodes, 20 nps, move e2e4
Benchmark time 1455ms, 33 nodes, 22 nps, move e2e4
Benchmark time 1660ms, 47 nodes, 28 nps, move e2e4
Benchmark time 1850ms, 61 nodes, 32 nps, move e2e4
Benchmark time 1862ms, 64 nodes, 34 nps, move e2e4
Benchmark time 2007ms, 72 nodes, 35 nps, move e2e4
Benchmark time 2113ms, 79 nodes, 37 nps, move e2e4
Benchmark time 2506ms, 107 nodes, 42 nps, move e2e4
Benchmark time 3062ms, 160 nodes, 52 nps, move e2e4
Benchmark time 3702ms, 211 nodes, 56 nps, move e2e4
Benchmark time 4374ms, 268 nodes, 61 nps, move e2e4
Benchmark time 5216ms, 334 nodes, 64 nps, move e2e4
Benchmark time 7007ms, 477 nodes, 68 nps, move e2e4
Benchmark time 7008ms, 487 nodes, 69 nps, move e2e4
Benchmark time 7607ms, 553 nodes, 72 nps, move e2e4
bestmove e2e4
Benchmark final time 8.38771s calculating 77.1367 nodes per second.

HWHWI:/data/local/tmp $ ./lc0-eigen benchmark -w /sdcard/lc0/36089 --threads=8 --max-prefetch=0 --minibatch-size=128                                                                                        
       _
|   _ | |
|_ |_ |_| v0.22.0-dev built May 23 2019
Loading weights file from: /sdcard/lc0/36089
Creating backend [blas]...
Using Eigen version 3.3.7
BLAS max batch size is 256.
Benchmark time 163ms, 2 nodes, 12 nps, move e2e4
Benchmark time 245ms, 3 nodes, 12 nps, move e2e4
Benchmark time 412ms, 5 nodes, 12 nps, move e2e4
Benchmark time 642ms, 8 nodes, 12 nps, move e2e4
Benchmark time 885ms, 13 nodes, 14 nps, move e2e4
Benchmark time 1271ms, 22 nodes, 17 nps, move e2e4
Benchmark time 1730ms, 34 nodes, 19 nps, move e2e4
Benchmark time 1858ms, 38 nodes, 20 nps, move e2e4
Benchmark time 1959ms, 40 nodes, 20 nps, move e2e4
Benchmark time 1980ms, 44 nodes, 22 nps, move d2d4
Benchmark time 2311ms, 67 nodes, 28 nps, move d2d4
Benchmark time 2791ms, 100 nodes, 35 nps, move d2d4
Benchmark time 3417ms, 129 nodes, 37 nps, move d2d4
Benchmark time 3705ms, 153 nodes, 41 nps, move d2d4
Benchmark time 3791ms, 166 nodes, 43 nps, move e2e4
Benchmark time 3936ms, 179 nodes, 45 nps, move e2e4
Benchmark time 4842ms, 251 nodes, 51 nps, move e2e4
Benchmark time 5679ms, 299 nodes, 52 nps, move e2e4
Benchmark time 5684ms, 307 nodes, 54 nps, move e2e4
Benchmark time 6885ms, 393 nodes, 57 nps, move e2e4
Benchmark time 8298ms, 518 nodes, 62 nps, move e2e4
bestmove e2e4
Benchmark final time 9.32481s calculating 63.0576 nodes per second.

@lealgo (Contributor) commented May 24, 2019

@@ -44,7 +56,7 @@ void Convolution1::Forward(const size_t batch_size, const size_t input_channels,

const float* batch_input = input + i * kSquares * input_channels;
float* batch_output = output + i * kSquares * output_channels;

#ifndef USE_EIGEN
Member review comment:

Could BLAS and Eigen in theory co-exist?
Then it would be better to have templated functions (e.g. <bool is_eigen>) which would show up as different backends.

See for example:
cudnn vs cudnn-fp16
https://github.com/LeelaChessZero/lc0/blob/master/src/neural/cuda/network_cudnn.cc#L805
#5 (comment)

tensorflow vs tensorflow-cpu
https://github.com/LeelaChessZero/lc0/blob/master/src/neural/network_tf.cc#L342

It may not be that straightforward, though, as different header files are needed (so #ifdefs will be needed in any case), and calling functions from a non-included header in if (false) {} won't compile. Calling them from non-instantiated function template specializations will work, though:

template <>
void DoStuff<false /* is_eigen */>() {
  known_functions();
}

template <>
void DoStuff<true /* is_eigen */>() {
  unknown_functions();  // Is fine
}

#if defined(HAVE_EIGEN)
REGISTER_NETWORK("eigen", MakeBlasNetwork<true>, 90)
#endif
#if defined(HAVE_BLAS)
REGISTER_NETWORK("blas", MakeBlasNetwork<false>, 90)
#endif

Not sure if it's easy enough to be worth bothering with, but it would be nicer that way.

Also there is this brute-force way to allow co-existence:

template <bool is_eigen> void MyFunction() {
#ifdef HAVE_EIGEN
  if (is_eigen) {
    // Do eigen stuff.
  }
#endif
#ifdef HAVE_BLAS
  if (!is_eigen) {
    // Do blas stuff.
  }
#endif
}

#if defined(HAVE_EIGEN)
REGISTER_NETWORK("eigen", MakeBlasNetwork<true>, 90)
#endif
#if defined(HAVE_BLAS)
REGISTER_NETWORK("blas", MakeBlasNetwork<false>, 90)
#endif

Reply from the Member Author:

I don't think there is a reason for this: Eigen is only suitable for the build machine, so BLAS is the way to go for redistributable CPU-only binaries. The only use case I can see is benchmarking Eigen against MKL on a specific machine - is there any other use case?

@borg323 borg323 merged commit 6028c05 into LeelaChessZero:master May 27, 2019
@borg323 borg323 deleted the eigen branch May 27, 2019 22:35