Rudy’s OBTF Rudolf Adamkovič

Floating-point number (FP)


Format

In IEEE 754 binary16, binary32, and binary64:

\begin{equation*} x = \begin{cases} {(-1)}^{\alpha} 1.s \cdot 2^{e - b} & 0 < e < \max(e) && \text{normal} \\ {(-1)}^{\alpha} 0.s \cdot 2^{1 - b} & e = 0 && \text{subnormal, or zero if } s = 0 \\ {(-1)}^{\alpha} \infty & e = \max(e), s = 0 && \text{infinity} \\ \mathrm{NaN} & e = \max(e), s \neq 0 && \text{not a number} \\ \end{cases} \end{equation*}
  • \(\alpha\) is the sign bit
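  • \(s\) is the trailing significand (fraction) field
  • \(1.\) and \(0.\) show the implicit leading bit (\(1\) for normal, \(0\) for subnormal)
  • \(e\) is the \(E\)-bit biased exponent field, so \(\max(e) = 2^E - 1\)
  • \(b\) is the exponent bias, \(b = 2^{E-1} - 1\)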
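
The case analysis above can be checked by decoding raw bit patterns by hand. The following is a minimal Python sketch, not a reference implementation: the function name `decode` and the parameter `S` (the number of trailing significand bits, e.g. 10, 23, and 52 for binary16, binary32, and binary64) are assumptions introduced here for illustration and do not appear in the notes above.

```python
import math


def decode(bits: int, E: int, S: int) -> float:
    """Decode an IEEE 754 binary bit pattern, following the four cases above.

    E is the exponent width in bits; S (hypothetical name) is the width
    of the trailing significand field s.
    """
    alpha = (bits >> (E + S)) & 1        # sign bit
    e = (bits >> S) & ((1 << E) - 1)     # biased exponent field
    s = bits & ((1 << S) - 1)            # trailing significand field
    b = (1 << (E - 1)) - 1               # exponent bias, 2^(E-1) - 1
    e_max = (1 << E) - 1                 # max(e), all exponent bits set
    sign = -1.0 if alpha else 1.0

    if e == e_max:
        # infinity when s = 0, otherwise NaN (the payload is ignored here)
        return sign * math.inf if s == 0 else math.nan
    if e == 0:
        # subnormal: 0.s * 2^(1 - b); yields a signed zero when s = 0
        return sign * (s / 2**S) * 2.0 ** (1 - b)
    # normal: 1.s * 2^(e - b)
    return sign * (1 + s / 2**S) * 2.0 ** (e - b)
```

For example, `decode(0x3C00, 5, 10)` gives `1.0` (binary16 with \(e = 15 = b\) and \(s = 0\)), `decode(0x0001, 5, 10)` gives the smallest positive binary16 subnormal \(2^{-24}\), and `decode(0x3FC00000, 8, 23)` gives `1.5` in binary32. The result is a Python `float` (binary64), which represents every binary16, binary32, and binary64 value exactly.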

(IEEE 754 also specifies base-\(10\) decimal formats.)


© 2025 Rudolf Adamkovič under GNU General Public License (GPL) version 3 or later.
Made with Emacs and secret alien technologies of yesteryear.