
Commit 43e8e45

gguf.md: add BF16
1 parent: bda6204

File tree

1 file changed: 4 additions, 3 deletions

docs/gguf.md

```diff
@@ -33,9 +33,10 @@ The components are:
    - `M`: Million parameters.
    - `K`: Thousand parameters.
 5. **Quantization**: This part specifies how the model parameters are quantized or compressed.
-   - Uncompressed formats:
-     - `F16`: 16-bit floats per weight
-     - `F32`: 32-bit floats per weight
+   - Floating-point representations:
+     - `BF16`: [bfloat16](https://en.wikipedia.org/wiki/Bfloat16_floating-point_format), [Google Brain](https://en.wikipedia.org/wiki/Google_Brain)'s 16-bit truncated form of 32-bit IEEE 754 (1 sign bit, 8 exponent bits, 7 fractional bits)
+     - `F32`: 32-bit IEEE 754 floats per weight (1 sign bit, 8 exponent bits, 23 fractional bits)
+     - `F16`: 16-bit IEEE 754 floats per weight (1 sign bit, 5 exponent bits, 10 fractional bits)
    - Quantization (Compression) formats:
      - `Q<X>`: X bits per weight, where `X` could be `4` (for 4 bits) or `8` (for 8 bits), etc.
    - Variants provide further details on how the quantized weights are interpreted:
```
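The BF16 layout described in the diff is just the top half of an F32: the sign and exponent fields are identical, so a bfloat16 can be produced by dropping the low 16 mantissa bits of a float32. A minimal Python sketch of that relationship (truncation only; production converters typically round to nearest even rather than truncate, so this is an illustration of the bit layout, not a drop-in converter):

```python
import struct

def f32_to_bf16_bits(x: float) -> int:
    """Truncate an IEEE 754 float32 to bfloat16 by keeping the top 16 bits
    (1 sign bit, 8 exponent bits, 7 fraction bits)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 16

def bf16_bits_to_f32(b: int) -> float:
    """Expand bfloat16 bits back to a float32 by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

b = f32_to_bf16_bits(3.14159)
print(f"{b:#06x}", bf16_bits_to_f32(b))  # → 0x4049 3.140625
```

The same arithmetic explains the `Q<X>` naming: at `X` bits per weight, a model with N parameters needs roughly N × X / 8 bytes for its weights (e.g. a 7B-parameter model at `Q4` is about 3.5 GB), plus some per-block quantization metadata on top.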
