Open
Description
This is a big one.
The only reason we use BLAS is that we don't have an efficient implementation of matrix x matrix multiplication. Naively doing parallel dot products is not optimal. We need to implement some of the fundamental GEMM optimizations, such as block tiling, and we need to do this in a compact way that reuses the existing dot product code and supports all quantization types.
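To make the idea concrete, here is a rough sketch of what block tiling on top of a dot product kernel could look like. It is not the actual implementation: `vec_dot_fn`, `vec_dot_f32`, `gemm_tiled` and the `tile` parameter are all hypothetical names made up for illustration. It assumes both operands are stored as contiguous rows of length `k` (so the result is `A * B^T`), and it passes rows as opaque pointers with a byte stride so that a quantized type would only need its own dot product function.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical signature for a per-type dot product over k elements.
 * Rows are passed as opaque pointers so quantized types can plug in too. */
typedef float (*vec_dot_fn)(size_t k, const void *x, const void *y);

/* Plain f32 dot product, standing in for an existing optimized kernel. */
static float vec_dot_f32(size_t k, const void *x, const void *y) {
    const float *xf = (const float *) x;
    const float *yf = (const float *) y;
    float sum = 0.0f;
    for (size_t i = 0; i < k; ++i) {
        sum += xf[i] * yf[i];
    }
    return sum;
}

/* Computes C = A * B^T with C stored row-major as m x n floats.
 * A has m rows and B has n rows, each row holding k elements;
 * a_row_bytes and b_row_bytes are the row strides in bytes.
 * The output is walked in tile x tile blocks, so the same few rows of
 * A and B are reused while they are still hot in cache. */
static void gemm_tiled(size_t m, size_t n, size_t k,
                       const void *a, size_t a_row_bytes,
                       const void *b, size_t b_row_bytes,
                       float *c, size_t tile, vec_dot_fn dot) {
    assert(tile > 0);
    const char *ab = (const char *) a;
    const char *bb = (const char *) b;
    for (size_t i0 = 0; i0 < m; i0 += tile) {
        const size_t i1 = (i0 + tile < m) ? i0 + tile : m;
        for (size_t j0 = 0; j0 < n; j0 += tile) {
            const size_t j1 = (j0 + tile < n) ? j0 + tile : n;
            /* One output block: every A row in the tile against every B row. */
            for (size_t i = i0; i < i1; ++i) {
                for (size_t j = j0; j < j1; ++j) {
                    c[i * n + j] = dot(k, ab + i * a_row_bytes,
                                          bb + j * b_row_bytes);
                }
            }
        }
    }
}
```

For f32 data this would be called as `gemm_tiled(m, n, k, a, k * sizeof(float), b, k * sizeof(float), c, 32, vec_dot_f32)`. The tile size of 32 is only a placeholder; a real value would have to be tuned per cache size and data type. This sketch tiles only the output dimensions. A fuller version would probably also split along `k` and distribute the blocks across threads.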
More comments on this: