Floating Point Precision
or more precisely - imprecision!
Floats#
Interesting problem encountered during my internship O^O.
Problem Statement#
Consider a doubleToInt function used to normalise floats into integers for
subsequent computations:
constexpr std::size_t NUM_BITS_VALUE = 63;
constexpr std::uint64_t SCALE = 10'000'000'000'000ULL; // log2(10^13) ~= 44 bits to represent
constexpr std::uint64_t OFFSET = 1ULL << (NUM_BITS_VALUE - 1); // 2^62
constexpr std::uint64_t SCALE_FLAG = 1ULL << NUM_BITS_VALUE; // 2^63
std::uint64_t doubleToInt(double price) {
double adjustment = (price > 0) ? 0.5 : -0.5;
return (static_cast<std::int64_t>(price * SCALE + adjustment) + OFFSET) | SCALE_FLAG;
}cThis looks extremely innocent at first-glance.
However, when compiling with optimisations (non-Debug), e.g., -DCMAKE_BUILD_TYPE=RelWithDebInfo,
tests that are “unlucky” will fail.
This failure is due to a divergence in the computed value with/without optimisations. Take this test assertion as a concrete example:
// Note that these are adapted snippets, not the actual code.
auto parsedDouble = std::stod("110987.6543210987654");
auto computedIntPrice = doubleToInt(parsedDouble);
EXPECT_EQ(doubleToInt(110987.6543210987654), computedIntPrice);
// ...cWe would get this failure:
[ RUN ] MessageInterpreter.OrderBookSnapshot
test.cpp:1086: Failure
Expected equality of these values:
doubleToInt(110987.6543210987654)
Which is: 14944934598493151360
computedIntPrice
Which is: 14944934598493151232
[ FAILED ] MessageInterpreter.OrderBookSnapshot (1 ms)bashBackground (Theory)#
This part is a little heavy.
To understand why the computed values are different, we first have to look at how computers represent real numbers.
Basics#
This part will cover the basics of floats.
If you have taken CS2100 (and it’s still fresh in your head) in NUS, then this part can be skimmed.
How Real Numbers are Represented#
A double is 64-bits wide. This means that it can only represent exactly
unique values.
Because the set of Real Numbers are uncountably infinite, most (or infinite, actually… Cantor’s Diagonalisation!) real numbers cannot be represented exactly, and must be rounded to a representable value.
Converting base-10 Real Numbers to its base-2 Representation#
To refresh on the algorithm to represent a base-10 number in its IEEE-754
double-precision format:
The breakdown of a double’s 64-bit is as such:
| Sign | Biased-Exponent | Mantissa |
|---|---|---|
| 1 bit | 11 bits | 52 bits |
- Sign: 0 if the number is positive, 1 otherwise.
- Biased-Exponent: Represents the power to which the base is raised.
- A is added to be able to represent positive and negative exponents.
- Hence, the true exponent can represent the set of integers in the range instead of just .
- Mantissa: Stores the precision digits of the number.
- Given that non-zero numbers always have a set bit in their binary representation, we can actually represent 52+1 bits of precision, with the implicit leading 1-bit (floating point numbers are always normalised).
Using a simple example, :
- Convert the Magnitude to Binary
We convert the integer and fractional parts separately.
Integers can be directly converted into base-2:
- Integer():
For Fractions, a simple algorithm is used.
- We multiply the fraction by 2, then
- take and truncate the leading digit (either 0 or 1).
- Repeat this until we get 0, i.e., , OR we get a seen-before fraction. In this case, the number is not perfectly representable.
- Fraction():
- (take the 1)
- (take the 0)
- (take the 1)
- Result:
Combining both parts, we get: (of course, there are no decimal points in binary. See the next step!)
- Normalise.
Shift the decimal point left until there is only one non-zero digit to the left of it.
- ->
Hence, the true exponent is .
- Calculate each component
- Sign (S): The number is negative, so .
- Biased Exponent (E): Add the bias (recall: 1023 for double-precision floats)
- Mantissa (M): Take the bits after the decimal point from the
normalised number, and pad it to 52 bits.
- (not gonna show this…)
Assemble the components , we get:
Reconstructing the real number from its IEEE 754 representation is the same steps in reverse.
Left as a (trivial?) exercise for the reader.
Important Concepts#
In this section, I will cover the various concepts that I deem necessary to understand the problem explained at the start.
This assumes knowledge of how floats are represented in binary as explained above.
I will occasionally refer back to this snippet to explain:
Snippet#
#include <cmath>
#include <iostream>
#include <format>
#include <limits>
static_assert(std::numeric_limits<double>::digits == 52 + 1);
constexpr double MAX_REPR_INTEGER = (1ULL << std::numeric_limits<double>::digits);
static_assert(MAX_REPR_INTEGER == 9'007'199'254'740'992);
int main () {
auto prev = MAX_REPR_INTEGER - 1.0;
auto fromPrev = std::nextafter(prev, std::numeric_limits<double>::max());
auto next = std::nextafter(MAX_REPR_INTEGER, std::numeric_limits<double>::max());
auto roundMode = __builtin_flt_rounds();
// default to 1, "round to nearest, ties to even"
std::cout << std::format("Clang Rounding Mode: {}\n", roundMode);
// Difference should only be 1 => step size of 1.
std::cout << std::format("Previous: {}, Difference: {}\n", prev, MAX_REPR_INTEGER-prev);
// Step Size becomes 2, intermediate numbers cannot be represented.
std::cout << std::format("Next: {}, Difference: {}\n", next, next-MAX_REPR_INTEGER);
// Rounded down to the previous representable number: MAX_REPR_INTEGER.
std::cout << std::format("Add 1.0: {}\n", MAX_REPR_INTEGER+1.0);
// Rounded up to the next representable number: MAX_REPR_INTEGER+2
std::cout << std::format("Add 1.1: {}\n", MAX_REPR_INTEGER+1.1);
return 0;
}cRefer to this Compiler Explorer ↗ to verify the
output, or take my word comments for it!
std::numeric_limits<double>::digits10/max_digits10#
static_assert(std::numeric_limits<double>::digits10 == 15);
static_assert(std::numeric_limits<double>::max_digits10 == 17);cI believe there’s some lack of clarity as to what these constants represent.
While both are similarly named, they are in diametrically opposite perspectives.
-
digits10: This is the number of base-10 digits that can be represented in adoublewithout change due to rounding or overflow. This means that every 15-digit base-10 has a 1-1 bijective mapping to adouble.- Simply put, any 15-digit number can survive the text->
double->text round trip. - Mathematically, this is
- Simply put, any 15-digit number can survive the text->
-
max_digits10: This is the number of base-10 digits required to represent every possible values of adouble.- 17 digits are needed to ensure the number can survive the
double->text->doubleround trip. - Mathematically, this is
- 17 digits are needed to ensure the number can survive the
These digits are not arbitrary. They are the consequences of the limits of precision discussed in the following sections.
(cppreference is good :thumbsup:)
Precision#
In this part, we establish that a double has exactly 53 bits of precision,
where we have 52 bits from the Mantissa plus the 1 implicit leading bit from
normalisation.
Because (16 digits), exceed this base-10 number and we stop being able to distinguish every number uniquely.
This means that we start to have a gap between each consecutive, representable base-10 number.
Unit in the Last Place (ULP)#
To quantify the “gap” between representable numbers, we must introduce the concept of Unit in the Last Place (ULP).
- ULP is the distance between a given floating-point number and the very next representable floating-point number. In my snippet above, I use step-size colloquially.
- The gap grows: Because the exponent scales the mantissa, the physical distance (ULP) between consecutive representable floats gets larger as the magnitude of the number increases, doubling with every base-10 digit added.
Hence, if a mathematical result falls in the “gap” between two representable floats, it must be rounded.
std::nextafter (and other related stl functions) demonstrate this effect.
In the snippet above, we see that adding 1.0 to results in , instead of the expected .
Note that this is NOT considered buggy. It is well-defined.
See the following section on Rounding.
Rounding#
Because the set of representable floating-point numbers is finite and
unevenly spaced, almost every arithmetic operation (+,-,*,/) yield
a mathematical result that cannot be stored exactly.
The IEEE 754 standard dictates how these infinite-precision results are mapped to our 53-bit constraints.
Clang in particular defaults to Round to nearest, ties to even:
- If a value falls exactly in the middle of two representable floats, it tie-breaks by rounding to the float whose least significant bit is 0 (otherwise known as the “even” one).
- For all other values, it rounds to the nearest representable float.
See the above snippet and the clang documentations ↗
Tying Concepts Together#
Why does the ULP become >1 (2 to be exact) after ?
In binary, is represented as a 1 bit, followed by 53 zeros,
Recall that the Mantissa is only 52-bits long with an implicit leading 1 bit.
Focusing on the Mantissa component only, this is how is represented:
| Implicit Leading 1-bit | 52-bit Mantissa | Extra bit(s) |
|---|---|---|
| 1 | 0000 … 0000 | 0 |
The very next representable double is obtained by toggling the least significant bit in the Mantissa.
This bit does NOT represent , but , because of the implicit trailing extra bits.
Hence, the next representable double is .
The pattern in which the ULP increases is left as an exercise :).
What Breaks?#
In unoptimised code, every single operation (e.g., ) is
executed, rounded to a 64-bit double, and stored in the appropriate register.
Then the next operation () is calculated, then rounded again.
However, when compiler optimisations are turned on (-O1, -O2, -O3), the
compiler is allowed to use different CPU instructions to speed up math.
The culprit for divergence is the Fused Multiply-Add (FMA) instruction.
FMA computes in a single discrete step.
Crucially, FMA calculates the exact infinite-precision result of and only applies one single rounding step at the very end.
Because unoptimised code, or instructions executed at runtime without using FMA rounds twice and optimised code using FMA only round once, the last bits of the mantissa can easily diverge by 1-ULP.
Neither value is “wrong” per-se.
The optimised FMA result is mathematically closer to the true real number,
but it also breaks exact equality assertions like EXPECT_EQ in our test suites.
Example (Application of Theory)#
To prove that FMA is the culprit for the divergence, we can look at the assembly generated by the compiler under two different targeting modes.
For this section:
CXX_COMPILER=clang 18.1.0CXX_FLAGS=-std=c++23 -stdlib=libstdc++ -O3 -ffp-contract=on
We then examine the difference in asm instructions emitted additionally
specifying -march=native.
Minimal Reproducible Harness#
Likewise, see: Compiler Explorer ↗
#include <cstdint>
#include <cstddef>
#include <limits>
#include <format>
#include <iostream>
constexpr std::size_t NUM_BITS_VALUE = 63;
constexpr std::uint64_t SCALE = 10'000'000'000'000ULL;
constexpr std::uint64_t OFFSET = 1ULL << (NUM_BITS_VALUE - 1);
constexpr std::uint64_t SCALE_FLAG = 1ULL << NUM_BITS_VALUE;
constexpr std::uint64_t doubleToInt(double price) {
double adjustment = (price > 0) ? 0.5 : -0.5;
return (static_cast<std::int64_t>(price * SCALE + adjustment) + OFFSET) | SCALE_FLAG;
}
// only 13 sig figs, exact binary representation
constexpr double literal = 100'000.007'8125;
// Coerce compile-time evaluation
constexpr std::uint64_t getCompileTime() {
return doubleToInt(literal);
}
// Force runtime evaluation
volatile double runtimeDouble = literal;
[[gnu::noinline]] constexpr std::uint64_t getRunTime(double val) {
return doubleToInt(val);
}
int main() {
// Evaluated by compiler (FMA)
auto compileTimeResult = getCompileTime();
// Evaluated by CPU (Separate mulsd and addsd)
auto runtimeResult = getRunTime(runtimeDouble);
std::cout << std::format("{}\n", compileTimeResult);
std::cout << std::format("{}\n", runtimeResult);
return 0;
}cOutputs#
# No -march=native specified, i.e., -march=x86-64
14835058133407163776 # compile-time result
14835058133407163648 # runtime resultbash# -march=native specified
14835058133407163776 # compile-time result
14835058133407163776 # runtime resultbashWe can see that by specifying -march=native, both our results, in
particularly the runtime result are the same as the compile-time ones.
As such, I (reasonably) concluded that the instruction stream taken by
the compiler to fold the constants here, and in the (used-to) failed test cases
are the same, when -march=native is specified.
Instruction-by-Instruction Walkthrough#
Here, we look at the two different instruction streams. One without -march=native,
which uses Standard SSE2 (Streaming SIMD Extensions) Instructions, and the one
with -march=native (AVX/FMA Instructions).
We also use the variable double price = 100'000.007'8125; here for our calculations,
passing it into doubleToInt(price).
Without -march=native#
This is the assembly emitted (annotated by Gemini Pro 3.1).
getRunTime(double):
; --- Step 1: Determine the adjustment (price > 0 ? 0.5 : -0.5) ---
xorpd xmm1, xmm1 ; Zero out the xmm1 register (xmm1 = 0.0)
xor eax, eax ; Zero out the eax register (eax = 0)
ucomisd xmm0, xmm1 ; Compare the input price (xmm0) with 0.0 (xmm1)
seta al ; Set the lowest byte (al) to 1 if price > 0, else 0
; --- Step 2: The Math (Separate Multiply and Add) ---
mulsd xmm0, qword ptr [rip + .LCPI1_0]; MULTIPLY: xmm0 = price * 10,000,000,000,000.
; **[ROUNDING EVENT #1]** The 53-bit mantissa is rounded here.
lea rcx, [rip + .LCPI1_1] ; Load the base memory address of our adjustment constants [-0.5, 0.5]
addsd xmm0, qword ptr [rcx + 8*rax] ; ADD: xmm0 = xmm0 + adjustment (using rax as an index).
; **[ROUNDING EVENT #2]** The 53-bit mantissa is rounded AGAIN.
; --- Step 3: Typecast and Bitwise Operations ---
cvttsd2si rax, xmm0 ; Cast to integer: Convert scalar double to 64-bit integer with truncation.
movabs rcx, 9223372036854775807 ; Load 0x7FFFFFFFFFFFFFFF (63 bits of 1s) into rcx
and rcx, rax ; Mask the integer to exactly 63 bits
movabs rax, -4611686018427387904 ; Load 0xC000000000000000 (OFFSET | SCALE_FLAG)
xor rax, rcx ; Apply the offset and flag via XOR.
ret ; Return result in raxasmAssembly makes one dizzy, so let’s focus on when rounding happens.
mulsd xmm0 ...: This corresponds toprice * 10'000'000'000'000. In our failed tests, the results are larger than . Recall thatdoubleshave exactly 53 bits of precision. Going beyond triggers a rounding.
Evaluating the exact mathematical product:
This is in the range. At this magnitude, the ULP is exactly 128 (again, left as an exercise for the reader, hint: ).
Because the ULP is 128, every representable float in this neighbourhood must be a multiple of 128.
If we divide our mathematical result by the ULP:
The .5 indicates that our exact value falls perfectly in the middle of
two representable doubles:
- Lower (7,812,500,610,351,562th double): 1,000,000,078,124,999,936
- Upper (7,812,500,610,351,563th double): 1,000,000,078,125,000,064
Recall clang’s default rounding mode: Round to nearest, ties to even. Since we are in the middle, perfectly tied, the CPU rounds to the even, lower double.
Output: 1,000,000,078,124,999,936 (double)
addsd xmm0 ...: Here, we add our adjustment (0.5). This is simple. The mathematical result becomes 1,000,000,078,124,999,936.5, but it simply gets rounded to the nearest double, resulting in no changes.
Output: 1,000,000,078,124,999,936 (subsequently cast to int64_t)
- Bitwise Operations:
Finally, we apply our masks.
OFFSET=1ULL<<62()SCALE_FLAG=1ULL<<63()
Both of these are XOR-ed into the result, since we have an int64_t.
Hence, we get: 1004503677752370432 ^ OFFSET ^ SCALE_FLAG.
Plugging this into Compiler Explorer ↗,
and we get 14835058133407163648.
Familiar number!
Scroll back up to the Harness’s output to see for yourself!
With -march=native#
Now, let’s look at the instruction stream with FMA (also annotated by Gemini 3.1 Pro)!
getRunTime(double):
; --- Step 1: Determine the adjustment (Branchless Vector Operations) ---
vxorpd xmm1, xmm1, xmm1 ; Zero out xmm1 (xmm1 = 0.0)
vcmpltsd xmm1, xmm1, xmm0 ; Bitmask: 1s if 0.0 < price, else 0s
vmovddup xmm2, qword ptr [rip + .LCPI1_3] ; Load 0.5 into xmm2
vblendvpd xmm1, xmm2, xmmword ptr [rip + .LCPI1_1], xmm1 ; Select 0.5 or -0.5 based on mask. Result in xmm1.
; --- Step 2: The Math (FMA) ---
vfmadd231sd xmm1, xmm0, qword ptr [rip + .LCPI1_2] ; FUSED MULTIPLY ADD: xmm1 = (xmm0 * 10^13) + xmm1
; Computes to infinite precision internally.
; **[SINGLE ROUNDING EVENT]** Mantissa rounded ONCE.
; --- Step 3: Typecast and Bitwise Operations ---
vcvttsd2si rax, xmm1 ; Cast to integer with truncation.
mov cl, 63 ; Set cl register to 63
bzhi rcx, rax, rcx ; BMI2 Bit Extract: Zero out high bits from index 63
movabs rax, -4611686018427387904 ; Load (OFFSET | SCALE_FLAG)
xor rax, rcx ; Apply the offset and flag via XOR.
retasmvfmadd231sdHere, the CPU computes the exact same mathematical formula, but skipping the first rounding event.
FMA, as its name suggests, fuses the multiplication and addition into a single internal step with infinite precision:
Now, we must round to the nearest representable double (again, multiples of 128).
Calculating the distance to its neighbours:
Because we added 0.5 before rounding, the value is no longer a perfect tie.
It is strictly closer to the Upper representable double!
Output: 1,000,000,078,125,000,064
Doing the same final bitwise operations: 1,000,000,078,125,000,064 ^ OFFSET ^ SCALE_FLAG,
and plugging this into Compiler Explorer ↗,
we get: 14835058133407163776!
Another familiar number!
Conclusion#
- Floats are scary.
- Add
volatilein tests :)
Real Conclusion#
Use std::fma.
Or use std::llround.
Disclaimer#
AI-assisted for some parts, particularly the assembly walkthrough…