Eu Chang Xian

Back

Floats#

Interesting problem encountered during my internship O^O.

Problem Statement#

Consider a doubleToInt function used to normalise floats into integers for subsequent computations:

constexpr std::size_t NUM_BITS_VALUE = 63;
constexpr std::uint64_t SCALE = 10'000'000'000'000ULL; // log2(10^13) ~= 44 bits to represent
constexpr std::uint64_t OFFSET = 1ULL << (NUM_BITS_VALUE - 1); // 2^62
constexpr std::uint64_t SCALE_FLAG = 1ULL << NUM_BITS_VALUE;   // 2^63

std::uint64_t doubleToInt(double price) {
    double adjustment = (price > 0) ? 0.5 : -0.5;
    return (static_cast<std::int64_t>(price * SCALE + adjustment) + OFFSET) | SCALE_FLAG;
}
c

This looks extremely innocent at first-glance.

However, when compiling with optimisations (non-Debug), e.g., -DCMAKE_BUILD_TYPE=RelWithDebInfo, tests that are “unlucky” will fail.

This failure is due to a divergence in the computed value with/without optimisations. Take this test assertion as a concrete example:

// Note that these are adapted snippets, not the actual code.
auto parsedDouble = std::stod("110987.6543210987654");
auto computedIntPrice = doubleToInt(parsedDouble);
EXPECT_EQ(doubleToInt(110987.6543210987654), computedIntPrice);
// ...
c

We would get this failure:

[ RUN      ] MessageInterpreter.OrderBookSnapshot
test.cpp:1086: Failure
Expected equality of these values:
  doubleToInt(110987.6543210987654)
    Which is: 14944934598493151360
  computedIntPrice
    Which is: 14944934598493151232
[  FAILED  ] MessageInterpreter.OrderBookSnapshot (1 ms)
bash

Background (Theory)#

This part is a little heavy.

To understand why the computed values are different, we first have to look at how computers represent real numbers.

Basics#

This part will cover the basics of floats.

If you have taken CS2100 (and it’s still fresh in your head) in NUS, then this part can be skimmed.

How Real Numbers are Represented#

A double is 64-bits wide. This means that it can only represent exactly 2642^{64} unique values.

Because the set of Real Numbers are uncountably infinite, most (or infinite, actually… Cantor’s Diagonalisation!) real numbers cannot be represented exactly, and must be rounded to a representable value.

Converting base-10 Real Numbers to its base-2 Representation#

To refresh on the algorithm to represent a base-10 number in its IEEE-754 double-precision format: The breakdown of a double’s 64-bit is as such:

SignBiased-ExponentMantissa
1 bit11 bits52 bits
  • Sign: 0 if the number is positive, 1 otherwise.
  • Biased-Exponent: Represents the power to which the base is raised.
    • A bias=2101=1023\textbf{bias} = 2^{10} - 1 = 1023 is added to be able to represent positive and negative exponents.
    • Hence, the true exponent can represent the set of integers in the range [1022,1023][-1022, 1023] instead of just [0,2047][0, 2047].
  • Mantissa: Stores the precision digits of the number.
    • Given that non-zero numbers always have a set bit in their binary representation, we can actually represent 52+1 bits of precision, with the implicit leading 1-bit (floating point numbers are always normalised).

Exponent Range

Using a simple example, 6.625-6.625:

  1. Convert the Magnitude to Binary

We convert the integer and fractional parts separately.

Integers can be directly converted into base-2:

  • Integer(6106_{10}): 1102110_2

For Fractions, a simple algorithm is used.

  1. We multiply the fraction by 2, then
  2. take and truncate the leading digit (either 0 or 1).
  3. Repeat this until we get 0, i.e., 02=00*2 = 0, OR we get a seen-before fraction. In this case, the number is not perfectly representable.
  • Fraction(.62510.625_{10}):
    • 0.62510210=1.25100.625_{10} * 2_{10} = 1.25_{10} (take the 1)
    • 0.510210=0.5100.5_{10} * 2_{10} = 0.5_{10} (take the 0)
    • 0.510210=1.0100.5_{10} * 2_{10} = 1.0_{10} (take the 1)
    • Result: 1012101_{2}

Combining both parts, we get: 101.1012101.101_{2} (of course, there are no decimal points in binary. See the next step!)

  1. Normalise.

Shift the decimal point left until there is only one non-zero digit to the left of it.

  • 110.1012110.101_{2} -> 1.10101221021.10101_{2} * 2^{2}_{10}

Hence, the true exponent is 2102_{10}.

  1. Calculate each component
  • Sign (S): The number 6.62510-6.625_{10} is negative, so S=1S=1.
  • Biased Exponent (E): Add the bias (recall: 1023 for double-precision floats)
    • 210+102310=1025102_{10} + 1023_{10} = 1025_{10}
    • 102510=10 0000 000121025_{10} = 10\ 0000\ 0001_{2}
  • Mantissa (M): Take the bits after the decimal point from the normalised number, and pad it to 52 bits.
    • .101012=1010 1000 00002 ....10101_{2} = 1010\ 1000\ 0000_{2}\ ... (not gonna show this…)

Assemble the components SEMS|E|M, we get:

110 0000 00011010 1000 0000...1 | 10\ 0000\ 0001 | 1010\ 1000\ 0000...

Reconstructing the real number from its IEEE 754 representation is the same steps in reverse.

Left as a (trivial?) exercise for the reader.

Important Concepts#

In this section, I will cover the various concepts that I deem necessary to understand the problem explained at the start.

This assumes knowledge of how floats are represented in binary as explained above.

I will occasionally refer back to this snippet to explain:

Snippet#

Refer to this Compiler Explorer to verify the output, or take my word comments for it!

std::numeric_limits<double>::digits10/max_digits10#

static_assert(std::numeric_limits<double>::digits10 == 15);
static_assert(std::numeric_limits<double>::max_digits10 == 17);
c

I believe there’s some lack of clarity as to what these constants represent.

While both are similarly named, they are in diametrically opposite perspectives.

  • digits10: This is the number of base-10 digits that can be represented in a double without change due to rounding or overflow. This means that every 15-digit base-10 has a 1-1 bijective mapping to a double.

    • Simply put, any 15-digit number can survive the text->double->text round trip.
    • Mathematically, this is floor(53log10(2))=15floor(53 * log_{10}(2)) = 15
  • max_digits10: This is the number of base-10 digits required to represent every possible values of a double.

    • 17 digits are needed to ensure the number can survive the double->text->double round trip.
    • Mathematically, this is ceil(53log10(2)+1)=17ceil(53 * log_{10}(2) + 1) = 17

These digits are not arbitrary. They are the consequences of the limits of precision discussed in the following sections.

(cppreference is good :thumbsup:)

Precision#

In this part, we establish that a double has exactly 53 bits of precision, where we have 52 bits from the Mantissa plus the 1 implicit leading bit from normalisation.

Because 253=9,007,199,254,740,9922^{53} = 9,007,199,254,740,992 (16 digits), exceed this base-10 number and we stop being able to distinguish every number uniquely.

This means that we start to have a gap between each consecutive, representable base-10 number.

Unit in the Last Place (ULP)#

To quantify the “gap” between representable numbers, we must introduce the concept of Unit in the Last Place (ULP).

  • ULP is the distance between a given floating-point number and the very next representable floating-point number. In my snippet above, I use step-size colloquially.
  • The gap grows: Because the exponent scales the mantissa, the physical distance (ULP) between consecutive representable floats gets larger as the magnitude of the number increases, doubling with every base-10 digit added.

Hence, if a mathematical result falls in the “gap” between two representable floats, it must be rounded.

std::nextafter (and other related stl functions) demonstrate this effect.

In the snippet above, we see that adding 1.0 to 9,007,199,254,740,9929,007,199,254,740,99\textbf{2} results in 9,007,199,254,740,9949,007,199,254,740,99\textbf{4}, instead of the expected 9,007,199,254,740,9939,007,199,254,740,99\textbf{3}.

Note that this is NOT considered buggy. It is well-defined.

See the following section on Rounding.

Rounding#

Because the set of representable floating-point numbers is finite and unevenly spaced, almost every arithmetic operation (+,-,*,/) yield a mathematical result that cannot be stored exactly.

The IEEE 754 standard dictates how these infinite-precision results are mapped to our 53-bit constraints.

Clang in particular defaults to Round to nearest, ties to even:

  • If a value falls exactly in the middle of two representable floats, it tie-breaks by rounding to the float whose least significant bit is 0 (otherwise known as the “even” one).
  • For all other values, it rounds to the nearest representable float.

See the above snippet and the clang documentations

Tying Concepts Together#

Why does the ULP become >1 (2 to be exact) after 253=9,007,199,254,740,9922^{53} = 9,007,199,254,740,992?

In binary, 2532^{53} is represented as a 1 bit, followed by 53 zeros,

Recall that the Mantissa is only 52-bits long with an implicit leading 1 bit.

Focusing on the Mantissa component only, this is how 2532^{53} is represented:

Implicit Leading 1-bit52-bit MantissaExtra bit(s)
10000 … 00000

The very next representable double is obtained by toggling the least significant bit in the Mantissa.

This bit does NOT represent 20=12^0 = 1, but 21=22^1 = 2, because of the implicit trailing extra bits.

Hence, the next representable double is 253+22^{53} + 2.

The pattern in which the ULP increases is left as an exercise :).

What Breaks?#

In unoptimised code, every single operation (e.g., priceSCALEprice * SCALE) is executed, rounded to a 64-bit double, and stored in the appropriate register.

Then the next operation (+adjustment+ adjustment) is calculated, then rounded again.

However, when compiler optimisations are turned on (-O1, -O2, -O3), the compiler is allowed to use different CPU instructions to speed up math.

The culprit for divergence is the Fused Multiply-Add (FMA) instruction.

FMA computes (AB)+C(A * B) + C in a single discrete step.

Crucially, FMA calculates the exact infinite-precision result of (AB)+C(A * B) + C and only applies one single rounding step at the very end.

Because unoptimised code, or instructions executed at runtime without using FMA rounds twice and optimised code using FMA only round once, the last bits of the mantissa can easily diverge by 1-ULP.

Neither value is “wrong” per-se.

The optimised FMA result is mathematically closer to the true real number, but it also breaks exact equality assertions like EXPECT_EQ in our test suites.

Example (Application of Theory)#

To prove that FMA is the culprit for the divergence, we can look at the assembly generated by the compiler under two different targeting modes.

For this section:

  • CXX_COMPILER=clang 18.1.0
  • CXX_FLAGS=-std=c++23 -stdlib=libstdc++ -O3 -ffp-contract=on

We then examine the difference in asm instructions emitted additionally specifying -march=native.

Minimal Reproducible Harness#

Likewise, see: Compiler Explorer

Outputs#

# No -march=native specified, i.e., -march=x86-64
14835058133407163776 # compile-time result
14835058133407163648 # runtime result
bash
# -march=native specified
14835058133407163776 # compile-time result
14835058133407163776 # runtime result
bash

We can see that by specifying -march=native, both our results, in particularly the runtime result are the same as the compile-time ones.

As such, I (reasonably) concluded that the instruction stream taken by the compiler to fold the constants here, and in the (used-to) failed test cases are the same, when -march=native is specified.

Instruction-by-Instruction Walkthrough#

Here, we look at the two different instruction streams. One without -march=native, which uses Standard SSE2 (Streaming SIMD Extensions) Instructions, and the one with -march=native (AVX/FMA Instructions).

We also use the variable double price = 100'000.007'8125; here for our calculations, passing it into doubleToInt(price).

Without -march=native#

This is the assembly emitted (annotated by Gemini Pro 3.1).

Assembly makes one dizzy, so let’s focus on when rounding happens.

  1. mulsd xmm0 ...: This corresponds to price * 10'000'000'000'000. In our failed tests, the results are larger than 2532^{53}. Recall that doubles have exactly 53 bits of precision. Going beyond triggers a rounding.

Evaluating the exact mathematical product:

100,000.007812510,000,000,000,000=1,000,000,078,125,000,000100,000.0078125 * 10,000,000,000,000 = 1,000,000,078,125,000,000

This is in the 101810^{18} range. At this magnitude, the ULP is exactly 128 (again, left as an exercise for the reader, hint: 2592^{59}).

Because the ULP is 128, every representable float in this neighbourhood must be a multiple of 128.

If we divide our mathematical result by the ULP:

1,000,000,078,125,000,000/128=7,812,500,610,351,562.51,000,000,078,125,000,000 / 128 = 7,812,500,610,351,562.5

The .5 indicates that our exact value falls perfectly in the middle of two representable doubles:

  • Lower (7,812,500,610,351,562th double): 1,000,000,078,124,999,936
  • Upper (7,812,500,610,351,563th double): 1,000,000,078,125,000,064

Recall clang’s default rounding mode: Round to nearest, ties to even. Since we are in the middle, perfectly tied, the CPU rounds to the even, lower double.

Output: 1,000,000,078,124,999,936 (double)

  1. addsd xmm0 ...: Here, we add our adjustment (0.5). This is simple. The mathematical result becomes 1,000,000,078,124,999,936.5, but it simply gets rounded to the nearest double, resulting in no changes.

Output: 1,000,000,078,124,999,936 (subsequently cast to int64_t)

  1. Bitwise Operations:

Finally, we apply our masks.

  • OFFSET=1ULL<<62 (262=4,611,686,018,427,387,9042^{62} = 4,611,686,018,427,387,904)
  • SCALE_FLAG=1ULL<<63 (263=9,223,372,036,854,775,8082^{63} = 9,223,372,036,854,775,808)

Both of these are XOR-ed into the result, since we have an int64_t. Hence, we get: 1004503677752370432 ^ OFFSET ^ SCALE_FLAG. Plugging this into Compiler Explorer, and we get 14835058133407163648.

Familiar number!

Scroll back up to the Harness’s output to see for yourself!

With -march=native#

Now, let’s look at the instruction stream with FMA (also annotated by Gemini 3.1 Pro)!

  1. vfmadd231sd Here, the CPU computes the exact same mathematical formula, but skipping the first rounding event.

FMA, as its name suggests, fuses the multiplication and addition into a single internal step with infinite precision:

(100,000.007812510,000,000,000,000)+0.5=1,000,000,078,125,000,000.5(100,000.0078125 * 10,000,000,000,000) + 0.5 = 1,000,000,078,125,000,000.5

Now, we must round to the nearest representable double (again, multiples of 128). Calculating the distance to its neighbours:

ΔLower=1,000,000,078,125,000,000.51,000,000,078,124,999,936=64.5\Delta Lower = 1,000,000,078,125,000,000.5 - 1,000,000,078,124,999,936 = 64.5 ΔUpper=1,000,000,078,125,000,0641,000,000,078,125,000,000.5=63.5\Delta Upper = 1,000,000,078,125,000,064 - 1,000,000,078,125,000,000.5 = 63.5

Because we added 0.5 before rounding, the value is no longer a perfect tie. It is strictly closer to the Upper representable double!

Output: 1,000,000,078,125,000,064

Doing the same final bitwise operations: 1,000,000,078,125,000,064 ^ OFFSET ^ SCALE_FLAG, and plugging this into Compiler Explorer, we get: 14835058133407163776! Another familiar number!

Conclusion#

  • Floats are scary.
  • Add volatile in tests :)

Real Conclusion#

Use std::fma.

Or use std::llround.

Disclaimer#

AI-assisted for some parts, particularly the assembly walkthrough…

Floating Point Precision
Author Eu Chang Xian
Published at 11 March 2026