Rudy’s OBTF Rudolf Adamkovič

Floating-point number (FP) – Generalize the base + small binary/decimal notes


Precision (TODO ranges)

Precision  Bits              Identifiers            Note
Half       16 = 1 + 5 + 10   FP16, float16          IEEE 754 binary16
Single     32 = 1 + 8 + 23   FP32, float32, float   IEEE 754 binary32
Double     64 = 1 + 11 + 52  FP64, float64, double  IEEE 754 binary64
Brain      16 = 1 + 8 + 7    BF16, bfloat16         used in ANNs

where the bits are, in order: sign, exponent, and significand (excluding the hidden bit).
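
As a minimal sketch of how the bit counts in the table map onto an actual encoding, the following Python snippet splits a value into its sign, exponent, and stored significand fields. The names FORMATS, split_fields, and split_bfloat16 are hypothetical, introduced only for illustration. The bfloat16 case is approximated by truncating a binary32 encoding to its top 16 bits with no rounding, which is an assumption and not necessarily how any particular library converts.

```python
import struct

# Hypothetical lookup table: precision name ->
# (exponent bits, stored significand bits, float format, same-width unsigned format)
FORMATS = {
    "half": (5, 10, "<e", "<H"),
    "single": (8, 23, "<f", "<I"),
    "double": (11, 52, "<d", "<Q"),
}


def split_fields(value, precision):
    """Return (sign, biased exponent, stored significand) as integers."""
    exp_bits, frac_bits, float_fmt, int_fmt = FORMATS[precision]
    # Reinterpret the float's bytes as an unsigned integer of the same width.
    (raw,) = struct.unpack(int_fmt, struct.pack(float_fmt, value))
    sign = raw >> (exp_bits + frac_bits)
    exponent = (raw >> frac_bits) & ((1 << exp_bits) - 1)
    fraction = raw & ((1 << frac_bits) - 1)  # hidden bit is not stored
    return sign, exponent, fraction


def split_bfloat16(value):
    """Same split for bfloat16, assuming plain truncation of binary32."""
    (raw32,) = struct.unpack("<I", struct.pack("<f", value))
    raw = raw32 >> 16  # keep the top 16 bits: 1 + 8 + 7
    sign = raw >> 15
    exponent = (raw >> 7) & 0xFF
    fraction = raw & 0x7F
    return sign, exponent, fraction


print(split_fields(1.0, "half"))     # (0, 15, 0)
print(split_fields(-2.0, "single"))  # (1, 128, 0)
print(split_bfloat16(1.5))           # (0, 127, 64)
```

The exponent is returned in its biased form, so 1.0 in half precision shows 15, the bias for a 5-bit exponent field.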


© 2025 Rudolf Adamkovič under GNU General Public License (GPL) version 3 or later.
Made with Emacs and secret alien technologies of yesteryear.