Rudy's OBTF

Floating-point number (FP) – Generalize the base + small binary/decimal notes

Format

In IEEE 754 binary16, binary32, and binary64:

\begin{equation*} x = \begin{cases} {(-1)}^{\alpha} 1.s \cdot 2^{e - b} & 0 < e < \max(e) && \text{normal} \\ {(-1)}^{\alpha} 0.s \cdot 2^{1 - b} & e = 0 && \text{subnormal (signed zero when } s = 0 \text{)} \\ {(-1)}^{\alpha} \infty & e = \max(e), s = 0 && \text{infinity} \\ \mathrm{NaN} & e = \max(e), s \neq 0 && \text{not a number} \\ \end{cases} \end{equation*}

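\(\alpha\) is the sign bit
\(s\) is the trailing significand field (the fraction bits after the binary point)
\(1.\) or \(0.\) is the implicit leading bit, which is not stored
\(e\) is the \(E\)-bit biased exponent field, so \(\max(e) = 2^{E} - 1\)
\(b\) is the exponent bias, \(b = 2^{E-1} - 1\)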
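
As a rough illustration of the case analysis above, here is a minimal Python sketch that decodes a raw bit pattern into a value. The names `FORMATS` and `decode` are made up for this note, and the field widths (5/10, 8/23, 11/52 exponent/fraction bits) are the standard ones for binary16, binary32, and binary64. It returns a Python `float` (a binary64), which can represent every value of the three formats exactly; NaN payloads are not preserved.

```python
import math

# Hypothetical lookup table for this note: format name -> (E exponent bits, fraction bits).
FORMATS = {"binary16": (5, 10), "binary32": (8, 23), "binary64": (11, 52)}


def decode(bits: int, fmt: str = "binary32") -> float:
    """Decode an IEEE 754 binary bit pattern following the four cases above."""
    E, S = FORMATS[fmt]
    alpha = (bits >> (E + S)) & 1        # sign bit
    e = (bits >> S) & ((1 << E) - 1)     # biased exponent field
    s = bits & ((1 << S) - 1)            # trailing significand (fraction) field
    b = (1 << (E - 1)) - 1               # exponent bias, 2^(E-1) - 1
    e_max = (1 << E) - 1                 # all-ones exponent field
    sign = -1.0 if alpha else 1.0

    if e == e_max:                       # infinity or NaN
        return sign * math.inf if s == 0 else math.nan

    frac = s / (1 << S)                  # the ".s" part, exact in a double
    if e == 0:                           # subnormal (or signed zero): 0.s * 2^(1 - b)
        return math.ldexp(sign * frac, 1 - b)
    return math.ldexp(sign * (1.0 + frac), e - b)   # normal: 1.s * 2^(e - b)


print(decode(0x3F800000))                # 1.0
print(decode(0xC0000000))                # -2.0
print(decode(0x3C00, "binary16"))        # 1.0
print(decode(0x7F800000))                # inf
```

`math.ldexp(m, k)` computes `m * 2**k` directly, which avoids forming a separate power of two for the extreme exponents.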

(IEEE 754 also specifies base-\(10\) decimal formats.)