
Commit 43e8e45

gguf.md: add BF16
1 parent: bda6204

File tree

1 file changed: 4 additions, 3 deletions

docs/gguf.md

```diff
@@ -33,9 +33,10 @@ The components are:
    - `M`: Million parameters.
    - `K`: Thousand parameters.
 5. **Quantization**: This part specifies how the model parameters are quantized or compressed.
-   - Uncompressed formats:
-     - `F16`: 16-bit floats per weight
-     - `F32`: 32-bit floats per weight
+   - Floating-point representations:
+     - `BF16`: [bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format), [Google Brain](https://en.wikipedia.org/wiki/Google_Brain)'s 16-bit truncated form of 32-bit IEEE 754 (1 sign bit, 8 exponent bits, 7 fractional bits)
+     - `F32`: 32-bit IEEE 754 floats per weight (1 sign bit, 8 exponent bits, 23 fractional bits)
+     - `F16`: 16-bit IEEE 754 floats per weight (1 sign bit, 5 exponent bits, 10 fractional bits)
    - Quantization (Compression) formats:
      - `Q<X>`: X bits per weight, where `X` could be `4` (for 4 bits) or `8` (for 8 bits), etc.
    - Variants provide further details on how the quantized weights are interpreted:
```
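The BF16 layout described in the diff is just the top half of an F32: the sign and exponent fields are identical, so a bfloat16 can be produced by dropping the low 16 mantissa bits of a float32. A minimal Python sketch of that relationship (truncation only; production converters typically round to nearest even rather than truncate, so this is an illustration of the bit layout, not a drop-in converter):

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Truncate an IEEE 754 float32 to bfloat16 by keeping the top 16 bits
    (1 sign bit, 8 exponent bits, 7 fraction bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 16

def bf16_bits_to_f32(b: int) -> float:
    """Expand bfloat16 bits back to a float32 by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

b = f32_to_bf16_bits(3.14159)
print(f"{b:#06x}", bf16_bits_to_f32(b))  # → 0x4049 3.140625
```

The same arithmetic explains the `Q<X>` naming: at `X` bits per weight, a model with N parameters needs roughly N × X / 8 bytes for its weights (e.g. a 7B-parameter model at `Q4` is about 3.5 GB), plus some per-block quantization metadata on top.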
