0.1 + 0.2 == 0.3 is false in C. Not because of a bug. Because
0.1 and 0.2 cannot be represented exactly in binary —
they're infinite repeating fractions in base 2, like 1/3 in base 10.
Each is rounded to the nearest value that fits in 32 or 64 bits, and those rounding errors accumulate.
Every float operation is an approximation. Knowing exactly how good that approximation is
— and when it breaks — is what this lesson is about.
IEEE 754 decoded
A 32-bit float packs three fields into 32 bits: a 1-bit sign (S), an 8-bit exponent, and a 23-bit mantissa.
The value formula is: (-1)^S × 1.mantissa × 2^(exponent − 127)
The exponent is biased by 127 — add 127 to the true exponent to get the stored value.
A stored exponent of 0b10000000 (128) means 2^(128−127) = 2^1 = 2.
The mantissa has an implicit leading 1 that isn't stored — this is the "hidden bit" that
gives you 24 effective bits of precision from 23 stored.
Decoding 3.14f by hand
3.14f is stored as 0x4048F5C3. Let's decode it:
| field | value |
|---|---|
| hex | 0x4048F5C3 |
| binary | 0 10000000 10010001111010111000011 |
| sign | 0 → positive |
| exponent | 10000000 = 128 → 128 − 127 = 1, so 2^1 = 2 |
| mantissa | 1.10010001111010111000011 (implicit leading 1) |
| value | 1.5700000... × 2 = 3.14000010... (not exactly 3.14) |
The stored value is 3.14000010..., close to 3.14 but not exactly equal to it. Those last bits are rounding error baked in when the literal is converted to binary.
Special values
IEEE 754 reserves exponent patterns 0x00 and 0xFF for special values:
| value | sign | exponent | mantissa | what triggers it |
|---|---|---|---|---|
| +Inf | 0 | 0xFF | 0 | positive value / 0, positive overflow |
| -Inf | 1 | 0xFF | 0 | negative value / 0, negative overflow |
| NaN | either | 0xFF | ≠ 0 | 0/0, sqrt(−1), Inf−Inf |
| -0.0 | 1 | 0x00 | 0 | negative underflow |
| denormal | either | 0x00 | ≠ 0 | magnitudes smaller than FLT_MIN |
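The integer division 1 / 0 is undefined behavior in C: the program may crash or do anything.
The floating-point division 1.0f / 0.0f is defined by IEEE 754 to produce +Inf, and C implementations that follow IEEE 754 arithmetic honor that.
The two behave completely differently.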
Machine epsilon
Machine epsilon is the gap between 1.0 and the next representable value above it.
For float it's about 1.19 × 10⁻⁷ (FLT_EPSILON).
For double it's about 2.22 × 10⁻¹⁶ (DBL_EPSILON).
This is not the smallest representable float, which is far smaller (about 1.4 × 10⁻⁴⁵ for the smallest denormal). The spacing scales with magnitude: near 1000.0 the gap is 512 × epsilon, about 6.1 × 10⁻⁵. A value of 1.0 + 1e-8 is indistinguishable from 1.0 in single precision, because 1e-8 is less than half of epsilon and the sum rounds back to 1.0.
Never use == to compare floats unless you specifically mean
exact equality. Use fabsf(a - b) < tolerance instead. Choosing the right
tolerance depends on the magnitude of the values: FLT_EPSILON is reasonable
near 1.0 but useless near 1,000,000, where adjacent floats are 0.0625 apart.
Catastrophic cancellation
Subtract two nearly-equal floats and you lose almost all your significant digits. The error that was tiny relative to each number becomes large relative to the difference.
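The classic example: computing the roots of a quadratic. The formula
(-b ± sqrt(b²−4ac)) / (2a) cancels catastrophically when
b² ≫ 4ac: sqrt(b²−4ac) is then nearly equal to |b|, so one of the two ± choices subtracts nearly equal numbers. The fix is to compute the root with the larger
absolute value using the standard formula (pick the sign that makes −b and the square root add rather than cancel), then derive the other root from
c / (a × r₁), which avoids the subtraction because the product of the two roots is c / a.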
Compiler reordering and -ffast-math
Floating-point addition is not associative. (a + b) + c ≠ a + (b + c)
in general — the rounding at each step is different. The C standard requires the compiler
to respect the parenthesization you wrote. With -ffast-math, you tell the
compiler it can reorder, contract, and approximate freely — it may vectorize a loop
using FMA instructions that give different rounding than the scalar version.
-ffast-math also enables -ffinite-math-only (assumes no
NaN or Inf will occur) and -fno-rounding-math. This is often fine for
graphics and ML workloads, but can silently break code that relies on NaN propagation
for error detection, or that depends on exact IEEE rounding for reproducibility.
float vs double vs long double
| type | size | decimal digits | use when |
|---|---|---|---|
| float | 4 bytes | ~7 | GPU/SIMD, large arrays, ML weights |
| double | 8 bytes | ~15 | general-purpose, scientific, default choice |
| long double | 10–16 bytes | ~18–34 | intermediate accumulation, compensated summation |
Use double by default. Use float when memory bandwidth or
SIMD throughput matters (processing millions of values) and you've verified the
precision is sufficient. long double is platform-specific — 80-bit
extended precision on x86, 128-bit on some ARM, 64-bit (same as double) on MSVC.
Don't rely on it for portability.
Fixed-point arithmetic
When you need exact decimal fractions (currency, sensor readings with known resolution),
use integers and track the scale factor explicitly. A price of $12.34 stored as
int cents = 1234 is exact, and sums and differences of such values stay exact as long as they fit in the integer type. Divide by 100 only for display.
Every float operation is a rounded approximation — knowing the precision budget, the special values, and the cancellation traps is what separates float code that works from float code that merely compiles.