C fundamentals · lesson 12

Floating-point numbers

IEEE 754 isn't a detail — it's the contract your CPU makes with you about what arithmetic means. Breaking it causes subtle, silent, hard-to-reproduce bugs.

0.1 + 0.2 == 0.3 is false in C. Not because of a bug. Because 0.1 and 0.2 cannot be represented exactly in binary: they're infinite repeating fractions in base 2, like 1/3 in base 10. Each literal is rounded to the nearest value that fits in 32 or 64 bits, and those rounding errors carry into the sum. Almost every float operation rounds its result. Knowing exactly how good that approximation is, and when it breaks, is what this lesson is about.
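A minimal sketch of the comparison above, assuming double is the IEEE 754 64-bit format (true on mainstream platforms). The function names are made up for this illustration.

```c
#include <stdio.h>

/* Returns 1 if 0.1 + 0.2 compares equal to 0.3 in double precision, else 0.
   volatile keeps the compiler from folding the sum at compile time. */
int tenths_sum_equals_three_tenths(void) {
    volatile double a = 0.1, b = 0.2;
    volatile double sum = a + b;
    return sum == 0.3;
}

/* Prints both values with 17 significant digits, enough to tell any two
   distinct doubles apart. */
void print_tenths(void) {
    volatile double a = 0.1, b = 0.2;
    printf("0.1 + 0.2 = %.17g\n", a + b);
    printf("0.3       = %.17g\n", 0.3);
}
```

Under that assumption, print_tenths() shows 0.30000000000000004 for the sum and 0.29999999999999999 for the literal: two different doubles, one on each side of the real 0.3.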

💡
Prerequisite: lesson 02 covers the three-field layout (sign / exponent / mantissa) and the 0x4048F5C3 example for 3.14f. This lesson builds on that — read lesson 02 first if you haven't.

IEEE 754 decoded

A 32-bit float stores three fields packed into 32 bits:

field    S (sign)   exponent     mantissa
width    1 bit      8 bits       23 bits
bits     bit 31     bits 30–23   bits 22–0

For normal numbers (stored exponent 1 through 254) the value formula is: (-1)^S × 1.mantissa × 2^(exponent − 127)

The exponent is biased by 127 — add 127 to the true exponent to get the stored value. A stored exponent of 0b10000000 (128) means 2^(128−127) = 2^1 = 2. The mantissa has an implicit leading 1 that isn't stored — this is the "hidden bit" that gives you 24 effective bits of precision from 23 stored.

Decoding 3.14f by hand

3.14f is stored as 0x4048F5C3. Let's decode it:

hex 0x4048F5C3
binary 0 10000000 10010001111010111000011
sign 0 → positive
exponent 10000000 = 128 → 128 − 127 = 1, so 2^1 = 2
mantissa 1.10010001111010111000011 (implicit leading 1)
value 1.5700000... × 2 = 3.14000010... (not exactly 3.14)

The stored value is 3.14000010..., which is close to 3.14 but not exactly 3.14. Those last bits are rounding error baked in when the compiler converts the literal to binary.
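The hand decoding above can be sketched in code. This assumes float is the 32-bit IEEE 754 format; the struct and function names are hypothetical, chosen for this example only.

```c
#include <stdint.h>
#include <string.h>

_Static_assert(sizeof(float) == sizeof(uint32_t), "assumes a 32-bit float");

typedef struct {
    uint32_t sign;      /* 1 bit */
    uint32_t exponent;  /* 8 bits, still biased by 127 */
    uint32_t mantissa;  /* 23 stored bits, hidden leading 1 not included */
} float_fields;

/* Splits a float into its three fields. memcpy is the well-defined way
   to reinterpret the bits; a pointer cast would break aliasing rules. */
float_fields decode_float(float f) {
    uint32_t bits;
    float_fields out;
    memcpy(&bits, &f, sizeof bits);
    out.sign     = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.mantissa = bits & 0x7FFFFFu;
    return out;
}
```

For 3.14f this gives sign 0, exponent 128, and mantissa 0x48F5C3, matching the binary breakdown above.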

Special values

IEEE 754 reserves exponent patterns 0x00 and 0xFF for special values:

value      sign   exponent   mantissa   what triggers it
+Inf       0      0xFF       0          positive nonzero / 0, positive overflow
-Inf       1      0xFF       0          negative nonzero / 0, negative overflow
NaN        any    0xFF       ≠ 0        0/0, sqrt(−1), Inf − Inf
-0.0       1      0x00       0          negative underflow
denormal   any    0x00       ≠ 0        magnitudes smaller than FLT_MIN
c
#include <math.h>   // isnan, signbit
#include <stdio.h>

int main(void) {
    float a = 1.0f / 0.0f;   // +Inf: not a crash, not UB under IEEE 754
    float b = 0.0f / 0.0f;   // NaN
    float c = -1.0f / 0.0f;  // -Inf

    // NaN is not equal to anything, including itself
    printf("%d\n", b == b);          // prints 0; use isnan() instead
    printf("%d\n", isnan(b) != 0);   // prints 1 (isnan returns nonzero, not necessarily 1)

    // -0.0 compares equal to +0.0
    float neg_zero = -0.0f;
    printf("%d\n", neg_zero == 0.0f);         // prints 1
    printf("%d\n", signbit(neg_zero) != 0);   // prints 1: the sign bit differs

    (void)a; (void)c;  // only here to show the values
    return 0;
}
⚠️
Integer division by zero is UB. Float division by zero is not, on an implementation that follows IEEE 754. 1 / 0 is undefined behavior in C: the program may crash or do anything. 1.0f / 0.0f is defined by IEEE 754 (C Annex F) to produce +Inf. The two behave completely differently.

Machine epsilon

Machine epsilon is the gap between 1.0 and the next representable value above it. For float it's about 1.19 × 10⁻⁷ (FLT_EPSILON). For double it's about 2.22 × 10⁻¹⁶ (DBL_EPSILON).

This is not the smallest representable float. The spacing between neighboring values scales with magnitude: near 1000.0 the gap is 512 × epsilon, about 6.1 × 10⁻⁵. In the other direction, 1.0 + 1e-8 is indistinguishable from 1.0 in single precision, because 1e-8 is less than half the gap at 1.0.
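To see how the spacing grows with magnitude, here is a small sketch that measures the gap above a value by stepping to the next bit pattern. It assumes a 32-bit IEEE 754 float and a positive, finite input; gap_above is a name invented for this example. The standard library's nextafterf from <math.h> does the same job for any input.

```c
#include <float.h>
#include <stdint.h>
#include <string.h>

/* Gap between x and the next representable float above it.
   Assumption: x is positive and finite, so adding 1 to the bit pattern
   yields the next larger float. */
float gap_above(float x) {
    uint32_t bits;
    float next;
    memcpy(&bits, &x, sizeof bits);
    bits += 1;
    memcpy(&next, &bits, sizeof next);
    return next - x;
}
```

With this, gap_above(1.0f) equals FLT_EPSILON, and gap_above(1000.0f) equals 512 × FLT_EPSILON, since 1000 lies between 2⁹ and 2¹⁰.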

c
#include <float.h>
#include <math.h>   // fabsf

// WRONG: accumulated rounding means x may never land exactly on 1.0f,
// so the loop can run forever
for (float x = 0.0f; x != 1.0f; x += 0.1f) { /* ... */ }

// WRONG: comparing floats for exact equality is almost always wrong
if (result == 0.0f) { /* ... */ }

// RIGHT: compare within a tolerance chosen for the problem
if (fabsf(result) < 1e-6f) { /* ... */ }

// RIGHT: use an integer loop counter, compute the float from it
for (int i = 0; i < 10; i++) { float x = i * 0.1f; /* ... */ }
🔥
Never use == to compare floats unless you specifically mean "these exact bits." Use fabsf(a - b) < tolerance. Choosing the right tolerance depends on the magnitude of the values — FLT_EPSILON is correct near 1.0 but completely wrong near 1,000,000.
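One common way to make the tolerance follow the magnitude is to combine a relative and an absolute bound. The sketch below is one such scheme, not the only reasonable one; nearly_equal and its parameters are hypothetical names, and suitable tolerance values depend on the computation.

```c
#include <float.h>
#include <math.h>
#include <stdbool.h>

/* True if a and b are within abs_tol of each other (covers values near
   zero) or within rel_tol relative to the larger magnitude.
   NaN never compares nearly equal to anything. */
bool nearly_equal(float a, float b, float rel_tol, float abs_tol) {
    if (a == b) return true;              /* exact match, including equal infinities */
    float diff = fabsf(a - b);
    if (diff <= abs_tol) return true;
    float larger = fabsf(a) > fabsf(b) ? fabsf(a) : fabsf(b);
    return diff <= larger * rel_tol;
}
```

Near 1,000,000 two adjacent floats differ by 0.0625, so a fixed FLT_EPSILON tolerance would call them different, while nearly_equal(1000000.0f, 1000000.0625f, 4 * FLT_EPSILON, 1e-6f) returns true.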

Catastrophic cancellation

Subtract two nearly-equal floats and you lose almost all your significant digits. The error that was tiny relative to each number becomes large relative to the difference.

c
// Computing variance as E[x²] − E[x]²

// Naive "one-pass" formula: cancels catastrophically when the mean is large
float sum = 0.0f, sum_sq = 0.0f;
for (int i = 0; i < n; i++) {
    sum    += x[i];
    sum_sq += x[i] * x[i];
}
float var_naive = sum_sq / n - (sum / n) * (sum / n);  // can go negative!

// Welford's algorithm: numerically stable, same O(n) cost
float mean = 0.0f, M2 = 0.0f;
for (int i = 0; i < n; i++) {
    float delta = x[i] - mean;
    mean += delta / (i + 1);
    M2   += delta * (x[i] - mean);
}
float var_welford = M2 / n;  // population variance; never negative

Another classic example: computing the roots of a quadratic. The formula (−b ± sqrt(b² − 4ac)) / (2a) cancels catastrophically when b² ≫ 4ac: sqrt(b² − 4ac) is then almost equal to |b|, so one choice of sign subtracts two nearly equal numbers. The fix is to compute the root with the larger absolute value, r₁, using the sign that adds instead of subtracting, then derive the other root from c / (a × r₁), which avoids the subtraction.

Compiler reordering and -ffast-math

Floating-point addition is not associative. (a + b) + c ≠ a + (b + c) in general, because the rounding at each step is different. The C standard requires the compiler to respect the parenthesization you wrote. With GCC's or Clang's -ffast-math, you tell the compiler it can reorder, contract, and approximate freely: it may vectorize a summation loop, which changes the order of the additions, or fuse a multiply and an add into a single FMA instruction, which rounds once instead of twice.
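A small sketch makes the non-associativity concrete. It assumes 32-bit IEEE 754 floats; the function names are invented for this example, and the volatile temporaries are there only to force each intermediate sum to be rounded to float.

```c
/* Computes (a + b) + c, rounding the inner sum to float first. */
float sum_left_first(float a, float b, float c) {
    volatile float t = a + b;
    return t + c;
}

/* Computes a + (b + c), rounding the inner sum to float first. */
float sum_right_first(float a, float b, float c) {
    volatile float t = b + c;
    return a + t;
}
```

With a = 1e8f, b = -1e8f, c = 1.0f, the first version returns 1.0f. The second returns 0.0f, because floats near 1e8 are spaced 8 apart and -1e8f + 1.0f rounds back to -1e8f.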

💡
-ffast-math also enables -ffinite-math-only (assumes no NaN or Inf will occur) and -fno-rounding-math. This is often fine for graphics and ML workloads, but can silently break code that relies on NaN propagation for error detection, or that depends on exact IEEE rounding for reproducibility.

float vs double vs long double

type          size         decimal digits   use when
float         4 bytes      ~7               GPU/SIMD, large arrays, ML weights
double        8 bytes      ~15              general-purpose, scientific, default choice
long double   8–16 bytes   ~15–34           intermediate accumulation, compensated summation

Use double by default. Use float when memory bandwidth or SIMD throughput matters (processing millions of values) and you've verified the precision is sufficient. long double is platform-specific: 80-bit extended precision on x86 with GCC and Clang, 128-bit on AArch64 Linux, 64-bit (same as double) on MSVC. Don't rely on it for portability.

Fixed-point arithmetic

When you need exact decimal fractions, such as currency or sensor readings with known resolution, use integers and track the scale factor explicitly. A price of $12.34 stored as int cents = 1234 is exact, and adding or subtracting such values never introduces rounding error. Divide by 100 only for display.
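As a rough sketch of the cents approach, the helper below formats an integer amount for display without ever going through a float. The type and function names are hypothetical, and a 64-bit integer is assumed to be wide enough for the amounts involved.

```c
#include <stdint.h>
#include <stdio.h>

typedef int64_t cents_t;  /* amount in cents; 1234 means $12.34 */

/* Writes the amount as dollars and cents into buf, e.g. 1234 -> "12.34".
   Returns what snprintf returns. The split into dollars and cents uses
   integer division, so no rounding is involved. */
int format_cents(cents_t amount, char *buf, size_t n) {
    const char *sign = amount < 0 ? "-" : "";
    /* take the magnitude in unsigned arithmetic so INT64_MIN is safe */
    uint64_t mag = amount < 0 ? (uint64_t)0 - (uint64_t)amount
                              : (uint64_t)amount;
    return snprintf(buf, n, "%s%llu.%02llu", sign,
                    (unsigned long long)(mag / 100),
                    (unsigned long long)(mag % 100));
}
```

Adding ten 10-cent items as cents_t gives exactly 100, and format_cents(100, buf, sizeof buf) writes "1.00".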

c
#include <stdint.h>

// Fixed-point Q16.16: 16 bits integer, 16 bits fraction
typedef int32_t fixed;
#define FRAC_BITS 16
#define TO_FIXED(x)   ((fixed)((x) * (1 << FRAC_BITS)))
#define FROM_FIXED(x) ((float)(x) / (1 << FRAC_BITS))

// Multiply two Q16.16 values. The 64-bit product carries 32 fraction bits,
// so it must be shifted right by 16 to get back to Q16.16.
fixed fixed_mul(fixed a, fixed b) {
    return (fixed)(((int64_t)a * b) >> FRAC_BITS);
}
one-line takeaway

Every float operation is a rounded approximation — knowing the precision budget, the special values, and the cancellation traps is what separates float code that works from float code that merely compiles.