Floating-Point Numbers — C Fundamentals

0.1 + 0.2 == 0.3 is false in C. Not because of a bug. Because 0.1 and 0.2 cannot be represented exactly in binary: they're infinite repeating fractions in base 2, like 1/3 in base 10. Each one is rounded to the nearest value that fits in 32 or 64 bits, and those rounding errors accumulate across operations. Every float operation is an approximation. Knowing exactly how good that approximation is, and when it breaks, is what this lesson is about.
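The claim is easy to check directly. The sketch below is only an illustration; the function names are hypothetical, and the `volatile` qualifier is there to stop the compiler from folding the sum at compile time. It assumes `double` is the IEEE 754 64-bit format.

```c
#include <stdio.h>

/* Hypothetical helper: returns 1 if the double sum 0.1 + 0.2 equals 0.3. */
int sum_equals_point_three(void) {
    volatile double a = 0.1, b = 0.2;  /* volatile: force a run-time addition */
    double sum = a + b;
    return sum == 0.3;
}

/* Prints each value with 17 significant digits, enough to tell
   neighboring doubles apart. */
void print_stored_values(void) {
    printf("0.1       = %.17g\n", 0.1);        /* 0.10000000000000001 */
    printf("0.2       = %.17g\n", 0.2);        /* 0.20000000000000001 */
    printf("0.1 + 0.2 = %.17g\n", 0.1 + 0.2);  /* 0.30000000000000004 */
    printf("0.3       = %.17g\n", 0.3);        /* 0.29999999999999999 */
}
```

The sum and the literal 0.3 are two different neighboring doubles, which is why the comparison fails.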

💡

Prerequisite: lesson 02 covers the three-field layout (sign / exponent / mantissa) and the 0x4048F5C3 example for 3.14f. This lesson builds on that — read lesson 02 first if you haven't.

IEEE 754 decoded

A 32-bit float stores three fields packed into 32 bits:

field	sign (1 bit)	exponent (8 bits)	mantissa (23 bits)
position	bit 31	bits 30–23	bits 22–0

The value formula for normal numbers (stored exponent 1 to 254) is: (-1)^S × 1.mantissa × 2^(exponent − 127)

The exponent is biased by 127 — add 127 to the true exponent to get the stored value. A stored exponent of 0b10000000 (128) means 2^(128−127) = 2^1 = 2. The mantissa has an implicit leading 1 that isn't stored — this is the "hidden bit" that gives you 24 effective bits of precision from 23 stored.

Decoding 3.14f by hand

3.14f is stored as 0x4048F5C3. Let's decode it:

hex	0x4048F5C3
binary	0 10000000 10010001111010111000011
sign	0 → positive
exponent	10000000 = 128 → 128 − 127 = 1, so 2^1 = 2
mantissa	1.10010001111010111000011 (implicit leading 1)
value	1.5700000... × 2 = 3.14000010... (not exactly 3.14)

The stored value is 3.14000010..., close to 3.14 but not exactly equal to it. The difference is rounding error baked in when the compiler converts the decimal literal to binary.
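The hand decoding can be reproduced in code. This is a minimal sketch, assuming `float` is the 32-bit IEEE 754 format; `struct float_fields` and `decode_float` are hypothetical names introduced for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical container for the three fields of a 32-bit float. */
struct float_fields {
    uint32_t sign;      /* bit 31 */
    uint32_t exponent;  /* bits 30-23, still biased by 127 */
    uint32_t mantissa;  /* bits 22-0, hidden bit not included */
};

struct float_fields decode_float(float f) {
    uint32_t bits;
    struct float_fields out;
    memcpy(&bits, &f, sizeof bits);  /* copy the bit pattern without aliasing problems */
    out.sign     = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.mantissa = bits & 0x7FFFFFu;
    return out;
}
```

For 3.14f this yields sign 0, exponent 128 and mantissa 0x48F5C3, matching the table above. `memcpy` is used instead of a pointer cast because casting a `float *` to `uint32_t *` and dereferencing it violates the aliasing rules.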

Special values

IEEE 754 reserves exponent patterns 0x00 and 0xFF for special values:

value	exponent	mantissa	what triggers it
+Inf	0xFF	0	division by zero, overflow
-Inf	0xFF	0	negative overflow
NaN	0xFF	≠ 0	0/0, sqrt(−1), Inf−Inf
-0.0	0x00	0	negative underflow
denormal	0x00	≠ 0	nonzero magnitudes smaller than FLT_MIN

c

#include <math.h>   // isnan, signbit
#include <stdio.h>

float a = 1.0f / 0.0f;   // +Inf: not a crash on IEEE 754 implementations
float b = 0.0f / 0.0f;   // NaN
float c = -1.0f / 0.0f;  // -Inf

// NaN is not equal to anything, including itself
printf("%d\n", b == b);          // prints 0, so use isnan() instead
printf("%d\n", isnan(b) != 0);   // prints 1 (isnan only promises a nonzero result)

// -0.0 compares equal to +0.0
float neg_zero = -0.0f;
printf("%d\n", neg_zero == 0.0f);         // prints 1
printf("%d\n", signbit(neg_zero) != 0);   // prints 1: the sign bit differs
          

⚠️

Integer division by zero is UB. Float division by zero is not, on implementations that follow IEEE 754 (Annex F of the C standard). 1 / 0 is undefined behavior in C: the program may crash or do anything. 1.0f / 0.0f is defined by IEEE 754 to produce +Inf. The two behave completely differently.

Machine epsilon

Machine epsilon is the gap between 1.0 and the next representable value above it. For float it's 2⁻²³, about 1.19 × 10⁻⁷ (FLT_EPSILON). For double it's 2⁻⁵², about 2.22 × 10⁻¹⁶ (DBL_EPSILON).

This is not the smallest representable float (that is FLT_MIN, or smaller still for denormals). The spacing changes with magnitude: it doubles at every power of two, so near 1000.0 the gap is within a factor of two of 1000 × epsilon. A value of 1.0 + 1e-8 is indistinguishable from 1.0 in single precision.

c

#include <float.h>
#include <math.h>   // fabsf

// WRONG: ten additions of 0.1f land slightly above 1.0f, so x != 1.0f stays true and the loop never stops
for (float x = 0.0f; x != 1.0f; x += 0.1f) { ... }

// WRONG — comparing floats for exact equality is almost always wrong
if (result == 0.0f) { ... }

// RIGHT — compare within a tolerance
if (fabsf(result) < 1e-6f) { ... }

// RIGHT — use integer loop counter, compute float from it
for (int i = 0; i < 10; i++) {
    float x = i * 0.1f;
}
          

🔥

Never use == to compare floats unless you specifically mean "these exact bits." Use fabsf(a - b) < tolerance. Choosing the right tolerance depends on the magnitude of the values — FLT_EPSILON is correct near 1.0 but completely wrong near 1,000,000.

Catastrophic cancellation

Subtract two nearly-equal floats and you lose almost all your significant digits. The error that was tiny relative to each number becomes large relative to the difference.

c

// computing variance: E[x²] - E[x]²
// Textbook sum-of-squares formula: cancels catastrophically when the mean is large relative to the spread
float sum = 0, sum_sq = 0;
for (int i = 0; i < n; i++) {
    sum    += x[i];
    sum_sq += x[i] * x[i];
}
float var_naive = sum_sq / n - (sum / n) * (sum / n);  // can go negative!

// Welford's algorithm: numerically stable, still one pass and O(n)
float mean = 0, M2 = 0;
for (int i = 0; i < n; i++) {
    float delta = x[i] - mean;
    mean += delta / (i + 1);
    M2   += delta * (x[i] - mean);
}
float var_welford = M2 / n;  // population variance, non-negative
          

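The classic example: computing the roots of a quadratic. The formula (-b ± sqrt(b²−4ac)) / 2a cancels catastrophically when b² ≫ 4ac, because sqrt(b²−4ac) is then nearly equal to |b| and one choice of ± subtracts two almost identical numbers. The fix is to compute the root with the larger absolute value using the standard formula (pick the sign that adds magnitudes instead of subtracting them), then derive the other root from c / (a × r₁), which avoids the subtraction.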

Compiler reordering and -ffast-math

Floating-point addition is not associative. (a + b) + c ≠ a + (b + c) in general, because the rounding at each step is different. The C standard requires the compiler to respect the parenthesization you wrote. With -ffast-math, you tell the compiler it can reorder, contract, and approximate freely: it may vectorize a summation loop, which changes the order of the additions, or fuse a multiply and an add into one FMA instruction, and either can round differently than the scalar code you wrote.
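A three-term sum is enough to see the difference. In this sketch the function names are hypothetical, and the `volatile` temporaries force each intermediate result to be rounded to single precision. It assumes IEEE 754 `float` and default (non-fast-math) compilation.

```c
/* (a + b) + c, with the inner sum rounded to float first */
float sum_left(float a, float b, float c) {
    volatile float t = a + b;
    return t + c;
}

/* a + (b + c), with the inner sum rounded to float first */
float sum_right(float a, float b, float c) {
    volatile float t = b + c;
    return a + t;
}

/* sum_left(1e8f, -1e8f, 1.0f)  gives 1.0f: the big terms cancel first
   sum_right(1e8f, -1e8f, 1.0f) gives 0.0f: floats near 1e8 are 8 apart,
                                so -1e8f + 1.0f rounds back to -1e8f */
```

A compiler that is allowed to reassociate may turn one of these into the other, which is exactly the freedom -ffast-math grants.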

💡

-ffast-math also enables -ffinite-math-only (assumes no NaN or Inf will occur) and -fno-rounding-math. This is often fine for graphics and ML workloads, but can silently break code that relies on NaN propagation for error detection, or that depends on exact IEEE rounding for reproducibility.

float vs double vs long double

type	size	decimal digits	use when
float	4 bytes	~7	GPU/SIMD, large arrays, ML weights
double	8 bytes	~15	general-purpose, scientific, default choice
long double	8–16 bytes	~15–34	intermediate accumulation, compensated summation

Use double by default. Use float when memory bandwidth or SIMD throughput matters (processing millions of values) and you've verified the precision is sufficient. long double is platform-specific — 80-bit extended precision on x86, 128-bit on some ARM, 64-bit (same as double) on MSVC. Don't rely on it for portability.
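Because long double varies so much, it can help to ask the compiler what it actually provides. The sketch below uses only standard `<float.h>` macros; the two function names are hypothetical. Note that FLT_DIG and DBL_DIG are the guaranteed round-trip digit counts (6 and 15 for IEEE 754 formats), slightly below the approximate figures in the table.

```c
#include <float.h>
#include <stdio.h>

/* Returns 1 if long double has more mantissa bits than double here. */
int long_double_is_wider(void) {
    return LDBL_MANT_DIG > DBL_MANT_DIG;
}

/* Prints storage size, mantissa bits and guaranteed decimal digits
   for each type; the output is platform-dependent. */
void print_float_limits(void) {
    printf("float:       %zu bytes, %d mantissa bits, %d digits\n",
           sizeof(float), FLT_MANT_DIG, FLT_DIG);
    printf("double:      %zu bytes, %d mantissa bits, %d digits\n",
           sizeof(double), DBL_MANT_DIG, DBL_DIG);
    printf("long double: %zu bytes, %d mantissa bits, %d digits\n",
           sizeof(long double), LDBL_MANT_DIG, LDBL_DIG);
}
```

If `long_double_is_wider()` returns 0, switching an accumulator to long double buys nothing on that platform.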

Fixed-point arithmetic

When you need exact decimal fractions — currency, sensor readings with known resolution — use integers and track the scale factor explicitly. A price of $12.34 stored as int cents = 1234 never has rounding error. Divide by 100 only for display.
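The cents approach might be sketched as follows. `total_cents` and `format_cents` are hypothetical helpers, and the formatter assumes non-negative amounts to keep the example short.

```c
#include <stdint.h>
#include <stdio.h>

/* Sums prices held as integer cents; integer addition is exact. */
int64_t total_cents(const int64_t *prices, size_t count) {
    int64_t total = 0;
    for (size_t i = 0; i < count; i++) {
        total += prices[i];
    }
    return total;
}

/* Writes a non-negative amount as dollars.cents into buf,
   e.g. 1234 becomes "12.34". Returns snprintf's result. */
int format_cents(int64_t cents, char *buf, size_t size) {
    return snprintf(buf, size, "%lld.%02lld",
                    (long long)(cents / 100), (long long)(cents % 100));
}
```

The split into dollars and cents uses integer division and remainder, so no floating-point value is involved even at display time.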

c

// Fixed-point Q16.16: 16 bits integer, 16 bits fraction
// multiply two Q16.16 values (result must be shifted right by 16)
#include <stdint.h>

typedef int32_t fixed;
#define FRAC_BITS 16
#define TO_FIXED(x)   ((fixed)((x) * (1 << FRAC_BITS)))
#define FROM_FIXED(x) ((float)(x) / (1 << FRAC_BITS))

fixed fixed_mul(fixed a, fixed b) {
    // widen to 64 bits so the product cannot overflow before the shift;
    // >> on a negative value is implementation-defined (arithmetic on mainstream compilers)
    return (fixed)(((int64_t)a * b) >> FRAC_BITS);
}
          

one-line takeaway

Every float operation is a rounded approximation — knowing the precision budget, the special values, and the cancellation traps is what separates float code that works from float code that merely compiles.