Eu Chang Xian

Clock Drift (WIP, ETA: Unknown!) ₍^. .^₎⟆

Wed, 20 May 2026 00:00:00 GMT


Clock Drift

Upcoming (maybe) interesting topic/problem encountered!

Floating Point Precision

Wed, 11 Mar 2026 00:00:00 GMT


Floats

Interesting problem encountered during my internship O^O.

Wherever I refer to floats, I am referring to double-precision, 64-bit floats specified by the IEEE-754 standard, corresponding to a double in C++, unless otherwise described.
Given that this problem emerged with our clang-compiled code, I will (mostly) be discussing clang only. gcc has a different default behaviour, but the theory/fundamentals are the same.

Problem Statement

Consider a doubleToInt function used to normalise floats into integers for subsequent computations:

constexpr std::size_t NUM_BITS_VALUE = 63;
constexpr std::uint64_t SCALE = 10'000'000'000'000ULL; // log2(10^13) ~= 44 bits to represent
constexpr std::uint64_t OFFSET = 1ULL << (NUM_BITS_VALUE - 1); // 2^62
constexpr std::uint64_t SCALE_FLAG = 1ULL << NUM_BITS_VALUE;   // 2^63

std::uint64_t doubleToInt(double price) {
    double adjustment = (price > 0) ? 0.5 : -0.5;
    return (static_cast<std::int64_t>(price * SCALE + adjustment) + OFFSET) | SCALE_FLAG;
}

This looks extremely innocent at first glance.

However, when compiling with optimisations (non-Debug), e.g., -DCMAKE_BUILD_TYPE=RelWithDebInfo, tests that are "unlucky" will fail.

This failure is due to a divergence in the computed value with/without optimisations. Take this test assertion as a concrete example:

// Note that these are adapted snippets, not the actual code.
auto parsedDouble = std::stod("110987.6543210987654");
auto computedIntPrice = doubleToInt(parsedDouble);
EXPECT_EQ(doubleToInt(110987.6543210987654), computedIntPrice);
// ...

We would get this failure:

[ RUN      ] MessageInterpreter.OrderBookSnapshot
test.cpp:1086: Failure
Expected equality of these values:
  doubleToInt(110987.6543210987654)
    Which is: 14944934598493151360
  computedIntPrice
    Which is: 14944934598493151232
[  FAILED  ] MessageInterpreter.OrderBookSnapshot (1 ms)

Note that neither of the computed values is wrong!

These values can actually be obtained through the IEEE-754 standard, and are not the result of some spurious failure.

Background (Theory)

This part is a little heavy.

To understand why the computed values are different, we first have to look at how computers represent real numbers.

Basics

This part will cover the basics of floats.

If you have taken CS2100 (and it's still fresh in your head) in NUS, then this part can be skimmed.

How Real Numbers are Represented

A double is 64 bits wide. This means that it can represent at most $2^{64}$ distinct values (slightly fewer in practice, since many bit patterns encode NaN).

Because the set of Real Numbers is uncountably infinite, most (or infinitely many, actually... Cantor's Diagonalisation!) real numbers cannot be represented exactly, and must be rounded to a representable value.

Converting base-10 Real Numbers to its base-2 Representation

To refresh on the algorithm to represent a base-10 number in its IEEE-754 double-precision format, the breakdown of a double's 64 bits is as follows:

| Sign  | Biased-Exponent | Mantissa |
| :---: | :-------------: | :------: |
| 1 bit |     11 bits     | 52 bits  |

Sign: 0 if the number is positive, 1 otherwise.
Biased-Exponent: Represents the power to which the base is raised.
- A $\textbf{bias} = 2^{10} - 1 = 1023$ is added to be able to represent positive and negative exponents.
- Hence, the true exponent can represent the set of integers in the range $[-1022, 1023]$ instead of just $[0, 2047]$.
Mantissa: Stores the precision digits of the number.
- Given that non-zero numbers always have a set bit in their binary representation, we can actually represent 52+1 bits of precision, with the implicit leading 1-bit (normal floating-point numbers are always normalised; the subnormals mentioned below are the exception).

Although the exponent field has 11 bits (giving $2^{11} = 2048$ possible values), not all exponent values represent normal numbers.

Two exponent patterns are reserved by the IEEE-754 standard:

| Stored Exponent      | Meaning                             |
| -------------------- | ----------------------------------- |
| 00000000000 (0)      | Used for subnormal numbers and zero |
| 11111111111 (2047)   | Used for Infinity and NaN           |

Therefore, normalised floating-point numbers only use exponent values:

$$ 1 \le E_{\text{stored}} \le 2046 $$

The true exponent is computed by subtracting the bias:

$$ E_{\text{true}} = E_{\text{stored}} - 1023 $$

This gives the usable exponent range:

$$ -1022 \le E_{\text{true}} \le 1023 $$

The smallest exponent (-1022) corresponds to the smallest normalised numbers, while the largest exponent (1023) corresponds to the largest finite numbers.

Using a simple example, $-6.625$:

Convert the Magnitude to Binary

We convert the integer and fractional parts separately.

Integers can be directly converted into base-2:

Integer($6_{10}$): $110_2$

For Fractions, a simple algorithm is used.

We multiply the fraction by 2, then
take and truncate the leading digit (either 0 or 1).
Repeat this until we get 0, i.e., $0*2 = 0$, OR we get a seen-before fraction. In this case, the number is not perfectly representable.

Fraction($.625_{10}$):
- $0.625_{10} * 2_{10} = 1.25_{10}$ (take the 1)
- $0.25_{10} * 2_{10} = 0.5_{10}$ (take the 0)
- $0.5_{10} * 2_{10} = 1.0_{10}$ (take the 1)
- Result: $101_{2}$
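The multiply-by-2 loop above can be sketched in a few lines. This is only an illustration: fractionBits is a hypothetical helper, and instead of detecting a seen-before fraction it simply stops after a maximum number of digits.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: binary digits of a fraction in [0, 1), most significant first.
// Stops when the remainder reaches 0, or after maxBits digits (a cap replaces the
// "seen-before fraction" check, to keep the sketch short).
std::string fractionBits(double fraction, std::size_t maxBits = 52) {
    std::string bits;
    while (fraction != 0.0 && bits.size() < maxBits) {
        fraction *= 2.0;          // move the next binary digit in front of the point
        if (fraction >= 1.0) {
            bits += '1';          // take the 1 ...
            fraction -= 1.0;      // ... and truncate it
        } else {
            bits += '0';          // take the 0
        }
    }
    return bits;
}

void checkFractionBits() {
    assert(fractionBits(0.625) == "101");        // terminates: perfectly representable
    assert(fractionBits(0.1, 8) == "00011001");  // never terminates, cut off at 8 digits
}
```

Note how 0.1 keeps producing digits: it is one of the numbers that is not perfectly representable.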

Combining both parts, we get: $110.101_{2}$ (of course, there are no decimal points in binary. See the next step!)

Normalise.

Shift the decimal point left until there is only one non-zero digit to the left of it.

$110.101_{2}$ -> $1.10101_{2} * 2^{2}_{10}$

Hence, the true exponent is $2_{10}$.

Calculate each component

Sign (S): The number $-6.625_{10}$ is negative, so $S=1$.
Biased Exponent (E): Add the bias (recall: 1023 for double-precision floats)
- $2_{10} + 1023_{10} = 1025_{10}$
- $1025_{10} = 100\ 0000\ 0001_{2}$
Mantissa (M): Take the bits after the decimal point from the normalised number, and pad it to 52 bits.
- $.10101_{2} = 1010\ 1000\ 0000_{2}\ ...$ (not gonna show this...)

Assemble the components $S|E|M$, we get:

$1 | 100\ 0000\ 0001 | 1010\ 1000\ 0000...$

Reconstructing the real number from its IEEE 754 representation is the same steps in reverse.

Left as a (trivial?) exercise for the reader.

Important Concepts

In this section, I will cover the various concepts that I deem necessary to understand the problem explained at the start.

This assumes knowledge of how floats are represented in binary as explained above.

I will occasionally refer back to this snippet to explain:

Snippet

#include <cmath>
#include <iostream>
#include <format>
#include <limits>

static_assert(std::numeric_limits<double>::digits == 52 + 1);
constexpr double MAX_REPR_INTEGER = (1ULL << std::numeric_limits<double>::digits);
static_assert(MAX_REPR_INTEGER == 9'007'199'254'740'992);


int main () {
    auto prev = MAX_REPR_INTEGER - 1.0;
    auto fromPrev = std::nextafter(prev, std::numeric_limits<double>::max());
    auto next = std::nextafter(MAX_REPR_INTEGER, std::numeric_limits<double>::max());

    auto roundMode = __builtin_flt_rounds();

    // default to 1, "round to nearest, ties to even"
    std::cout << std::format("Clang Rounding Mode: {}\n", roundMode);

    // Difference should only be 1 => step size of 1.
    std::cout << std::format("Previous: {}, Difference: {}\n", prev, MAX_REPR_INTEGER-prev);

    // Step Size becomes 2, intermediate numbers cannot be represented.
    std::cout << std::format("Next: {}, Difference: {}\n", next, next-MAX_REPR_INTEGER);

    // Rounded down to the previous representable number: MAX_REPR_INTEGER.
    std::cout << std::format("Add 1.0: {}\n", MAX_REPR_INTEGER+1.0);

    // Rounded up to the next representable number: MAX_REPR_INTEGER+2
    std::cout << std::format("Add 1.1: {}\n", MAX_REPR_INTEGER+1.1);
    return 0;
}

Refer to this Compiler Explorer to verify the output, or take my ~word~ comments for it!

`std::numeric_limits<double>::digits10/max_digits10`

static_assert(std::numeric_limits<double>::digits10 == 15);
static_assert(std::numeric_limits<double>::max_digits10 == 17);

I believe there's some lack of clarity as to what these constants represent.

While both are similarly named, they are in diametrically opposite perspectives.

digits10: This is the number of base-10 digits that can be represented in a double without change due to rounding or overflow. This means that every base-10 number with at most 15 significant digits maps to its own distinct double (the mapping is injective, not bijective: most doubles have no 15-digit preimage).
- Simply put, any 15-digit number can survive the text->double->text round trip.
- Mathematically, this is $floor((53 - 1) * log_{10}(2)) = 15$
max_digits10: This is the number of base-10 digits required to uniquely represent every possible value of a double.
- 17 digits are needed to ensure the number can survive the double->text->double round trip.
- Mathematically, this is $ceil(53 * log_{10}(2) + 1) = 17$
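To make the double->text->double round trip concrete, here is a small sketch using std::snprintf and std::strtod. roundTrips is a hypothetical helper, and the sketch assumes a standard library whose printing and parsing are correctly rounded (glibc's are, as far as I know).

```cpp
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Hypothetical helper: print value with the given number of significant digits,
// parse the text back, and report whether we got the same double.
bool roundTrips(double value, int significantDigits) {
    char buffer[64];
    std::snprintf(buffer, sizeof buffer, "%.*g", significantDigits, value);
    return std::strtod(buffer, nullptr) == value;
}

void checkRoundTrips() {
    const double sum = 0.1 + 0.2;   // 0.30000000000000004 when printed with 17 digits
    assert(roundTrips(sum, 17));    // max_digits10 digits always survive
    assert(!roundTrips(sum, 15));   // prints "0.3", which parses to a different double
}
```

15 digits are not enough here because 0.1 + 0.2 and 0.3 are two different doubles that agree in their first 15 significant digits.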

These digits are not arbitrary. They are the consequences of the limits of precision discussed in the following sections.

(cppreference is good :thumbsup:)

See the next section. It's not a magic number.

Precision

In this part, we establish that a double has exactly 53 bits of precision, where we have 52 bits from the Mantissa plus the 1 implicit leading bit from normalisation.

Because $2^{53} = 9,007,199,254,740,992$ (16 digits), once we exceed this number we can no longer represent every integer exactly.

This means that we start to have a gap between each consecutive, representable base-10 number.

Unit in the Last Place (ULP)

To quantify the "gap" between representable numbers, we must introduce the concept of Unit in the Last Place (ULP).

ULP is the distance between a given floating-point number and the very next representable floating-point number. In my snippet above, I use step-size colloquially.
The gap grows: Because the exponent scales the mantissa, the physical distance (ULP) between consecutive representable floats gets larger as the magnitude of the number increases, doubling every time the magnitude crosses a power of two (i.e., whenever the exponent increases by 1).

Hence, if a mathematical result falls in the "gap" between two representable floats, it must be rounded.

std::nextafter (and other related stl functions) demonstrate this effect.

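In the snippet above, we see that adding 1.0 to $9,007,199,254,740,99\textbf{2}$ leaves it unchanged at $9,007,199,254,740,99\textbf{2}$, while adding 1.1 results in $9,007,199,254,740,99\textbf{4}$. The expected $9,007,199,254,740,99\textbf{3}$ never shows up, because it is not representable.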

Note that this is NOT considered buggy. It is well-defined.

See the following section on Rounding.
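For a feel of how the gap grows, the ULP at any value can be measured directly with std::nextafter. A minimal sketch (ulpOf is my own helper name, and it is only meant for finite, positive inputs):

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Hypothetical helper: distance from x to the next representable double above it.
double ulpOf(double x) {
    return std::nextafter(x, std::numeric_limits<double>::infinity()) - x;
}

void checkUlp() {
    assert(ulpOf(1.0) == std::numeric_limits<double>::epsilon());  // 2^-52
    assert(ulpOf(9'007'199'254'740'991.0) == 1.0);                 // just below 2^53
    assert(ulpOf(9'007'199'254'740'992.0) == 2.0);                 // at 2^53
    assert(ulpOf(1e18) == 128.0);                                  // between 2^59 and 2^60
}
```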

Rounding

Because the set of representable floating-point numbers is finite and unevenly spaced, many arithmetic operations (+,-,*,/) yield a mathematical result that cannot be stored exactly.

The IEEE 754 standard dictates how these infinite-precision results are mapped to our 53-bit constraints.

Clang in particular defaults to Round to nearest, ties to even:

If a value falls exactly in the middle of two representable floats, it tie-breaks by rounding to the float whose least significant bit is 0 (otherwise known as the "even" one).
For all other values, it rounds to the nearest representable float.

See the above snippet and the clang documentation.
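The tie-break rule is easiest to see right at $2^{53}$, where the step size is 2 and every odd integer is an exact tie. A small compile-time sketch (kTwoPow53 is just a name I picked; this assumes the compiler folds constants with the default round-to-nearest, ties-to-even mode, which clang and gcc do as far as I am aware):

```cpp
constexpr double kTwoPow53 = 9'007'199'254'740'992.0;

// 2^53 + 1 is a tie between 2^53 and 2^53 + 2: the even mantissa belongs to 2^53.
static_assert(kTwoPow53 + 1.0 == kTwoPow53);

// 2^53 + 3 is a tie between 2^53 + 2 and 2^53 + 4: this time the even one is above.
static_assert(kTwoPow53 + 3.0 == kTwoPow53 + 4.0);

// Not a tie: 2^53 + 1.5 is simply closer to 2^53 + 2.
static_assert(kTwoPow53 + 1.5 == kTwoPow53 + 2.0);
```

So "ties to even" does not mean "round down": which neighbour wins depends on whose least significant mantissa bit is 0.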

Tying Concepts Together

Why does the ULP become >1 (2 to be exact) after $2^{53} = 9,007,199,254,740,992$?

In binary, $2^{53}$ is represented as a 1 bit, followed by 53 zeros.

Recall that the Mantissa is only 52-bits long with an implicit leading 1 bit.

Focusing on the Mantissa component only, this is how $2^{53}$ is represented:

| Implicit Leading 1-bit | 52-bit Mantissa | Extra bit(s) |
| ---------------------: | :-------------: | :----------- |
|                      1 |  0000 ... 0000  | 0            |

The very next representable double is obtained by toggling the least significant bit in the Mantissa.

This bit does NOT represent $2^0 = 1$, but $2^1 = 2$, because of the implicit trailing extra bits.

Hence, the next representable double is $2^{53} + 2$.

The pattern in which the ULP increases is left as an exercise :).
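For those who would rather check their answer: one way to write the pattern, for a normal double $x$ (stated here without proof, and using my own notation), is

$$ \text{ULP}(x) = 2^{\lfloor \log_2 |x| \rfloor - 52} $$

Sanity-checking against the numbers in this post: for $x = 1$ we get $2^{-52}$, for $x = 2^{53}$ we get $2^{53 - 52} = 2$, and for any $x$ between $2^{59}$ and $2^{60}$ we get $2^{59 - 52} = 128$.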

What Breaks?

In unoptimised code, every single operation (e.g., $price * SCALE$) is executed, rounded to a 64-bit double, and stored in the appropriate register.

Then the next operation ($+ adjustment$) is calculated, then rounded again.

However, when compiler optimisations are turned on (-O1, -O2, -O3), the compiler is allowed to use different CPU instructions to speed up math.

The culprit for divergence is the Fused Multiply-Add (FMA) instruction.

FMA computes $(A * B) + C$ in a single discrete step.

Crucially, FMA calculates the exact infinite-precision result of $(A * B) + C$ and only applies one single rounding step at the very end.

Because unoptimised code (or any code executed at runtime without FMA) rounds twice, while optimised code using FMA rounds only once, the last bits of the mantissa can easily diverge by 1 ULP.

Neither value is "wrong" per-se.

The optimised FMA result is mathematically closer to the true real number, but it also breaks exact equality assertions like EXPECT_EQ in our test suites.

Example (Application of Theory)

To prove that FMA is the culprit for the divergence, we can look at the assembly generated by the compiler under two different targeting modes.

For this section:

CXX_COMPILER=clang 18.1.0
CXX_FLAGS=-std=c++23 -stdlib=libstdc++ -O3 -ffp-contract=on

We then examine how the emitted asm instructions differ when -march=native is additionally specified.

Minimal Reproducible Harness

Likewise, see: Compiler Explorer

#include <cstdint>
#include <cstddef>
#include <limits>
#include <format>
#include <iostream>

constexpr std::size_t NUM_BITS_VALUE = 63;
constexpr std::uint64_t SCALE = 10'000'000'000'000ULL;
constexpr std::uint64_t OFFSET = 1ULL << (NUM_BITS_VALUE - 1);
constexpr std::uint64_t SCALE_FLAG = 1ULL << NUM_BITS_VALUE;

constexpr std::uint64_t doubleToInt(double price) {
    double adjustment = (price > 0) ? 0.5 : -0.5;
    return (static_cast<std::int64_t>(price * SCALE + adjustment) + OFFSET) | SCALE_FLAG;
}

// only 13 sig figs, exact binary representation
constexpr double literal = 100'000.007'8125;

// Coerce compile-time evaluation
constexpr std::uint64_t getCompileTime() {
    return doubleToInt(literal);
}


// Force runtime evaluation
volatile double runtimeDouble = literal;
[[gnu::noinline]] constexpr std::uint64_t getRunTime(double val) {
    return doubleToInt(val);
}

int main() {
    // Evaluated by compiler (FMA)
    auto compileTimeResult = getCompileTime();

    // Evaluated by CPU (Separate mulsd and addsd)
    auto runtimeResult = getRunTime(runtimeDouble);

    std::cout << std::format("{}\n", compileTimeResult);
    std::cout << std::format("{}\n", runtimeResult);
    return 0;
}

Outputs

# No -march=native specified, i.e., -march=x86-64
14835058133407163776 # compile-time result
14835058133407163648 # runtime result

# -march=native specified
14835058133407163776 # compile-time result
14835058133407163776 # runtime result

We can see that by specifying -march=native, both results agree; in particular, the runtime result now matches the compile-time one.

As such, I (reasonably) concluded that when -march=native is specified, the runtime instruction stream rounds the same way the compiler does when it folds the constants, both here and in the (previously) failing test cases.

Instruction-by-Instruction Walkthrough

Here, we look at the two different instruction streams. One without -march=native, which uses Standard SSE2 (Streaming SIMD Extensions) Instructions, and the one with -march=native (AVX/FMA Instructions).

We also use the variable double price = 100'000.007'8125; here for our calculations, passing it into doubleToInt(price).

Without `-march=native`

This is the assembly emitted (annotated by Gemini 3.1 Pro).

getRunTime(double):
; --- Step 1: Determine the adjustment (price > 0 ? 0.5 : -0.5) ---
xorpd xmm1, xmm1                      ; Zero out the xmm1 register (xmm1 = 0.0)
xor eax, eax                          ; Zero out the eax register (eax = 0)
ucomisd xmm0, xmm1                    ; Compare the input price (xmm0) with 0.0 (xmm1)
seta al                               ; Set the lowest byte (al) to 1 if price > 0, else 0

; --- Step 2: The Math (Separate Multiply and Add) ---
mulsd xmm0, qword ptr [rip + .LCPI1_0]; MULTIPLY: xmm0 = price * 10,000,000,000,000.
                                      ; **[ROUNDING EVENT #1]** The 53-bit mantissa is rounded here.
lea rcx, [rip + .LCPI1_1]             ; Load the base memory address of our adjustment constants [-0.5, 0.5]
addsd xmm0, qword ptr [rcx + 8*rax]   ; ADD: xmm0 = xmm0 + adjustment (using rax as an index).
                                      ; **[ROUNDING EVENT #2]** The 53-bit mantissa is rounded AGAIN.

; --- Step 3: Typecast and Bitwise Operations ---
cvttsd2si rax, xmm0                   ; Cast to integer: Convert scalar double to 64-bit integer with truncation.
movabs rcx, 9223372036854775807       ; Load 0x7FFFFFFFFFFFFFFF (63 bits of 1s) into rcx
and rcx, rax                          ; Mask the integer to exactly 63 bits
movabs rax, -4611686018427387904      ; Load 0xC000000000000000 (OFFSET | SCALE_FLAG)
xor rax, rcx                          ; Apply the offset and flag via XOR.
ret                                   ; Return result in rax

Assembly makes one dizzy, so let's focus on when rounding happens.

mulsd xmm0 ...: This corresponds to price * 10'000'000'000'000. In our failed tests, the results are larger than $2^{53}$. Recall that doubles have exactly 53 bits of precision. Going beyond triggers a rounding.

Evaluating the exact mathematical product:

$$ 100,000.0078125 * 10,000,000,000,000 = 1,000,000,078,125,000,000 $$

This is in the $10^{18}$ range. At this magnitude, the ULP is exactly 128 (again, left as an exercise for the reader, hint: $2^{59}$).

Because the ULP is 128, every representable float in this neighbourhood must be a multiple of 128.

If we divide our mathematical result by the ULP:

$$ 1,000,000,078,125,000,000 / 128 = 7,812,500,610,351,562.5 $$

The .5 indicates that our exact value falls perfectly in the middle of two representable doubles:

Lower (7,812,500,610,351,562 * 128): 1,000,000,078,124,999,936
Upper (7,812,500,610,351,563 * 128): 1,000,000,078,125,000,064

Recall clang's default rounding mode: Round to nearest, ties to even. Since we are in the middle, perfectly tied, the CPU rounds to the even, lower double.

Output: 1,000,000,078,124,999,936 (double)

addsd xmm0 ...: Here, we add our adjustment (0.5). This is simple. The mathematical result becomes 1,000,000,078,124,999,936.5, but it simply gets rounded to the nearest double, resulting in no changes.

Output: 1,000,000,078,124,999,936 (subsequently cast to int64_t)

Bitwise Operations:

Finally, we apply our masks.

OFFSET=1ULL<<62 ($2^{62} = 4,611,686,018,427,387,904$)
SCALE_FLAG=1ULL<<63 ($2^{63} = 9,223,372,036,854,775,808$)

The compiler folds both into a single XOR with OFFSET | SCALE_FLAG, after masking the int64_t down to its low 63 bits. Hence, we get: 1000000078124999936 ^ OFFSET ^ SCALE_FLAG. Plugging this into Compiler Explorer, we get 14835058133407163648.

Familiar number!

Scroll back up to the Harness's output to see for yourself!
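If the jump from "add OFFSET, then OR SCALE_FLAG" in the source to "mask, then XOR" in the assembly looks suspicious, here is a compile-time sketch showing that the two agree. viaAddOr and viaMaskXor are names I made up; the constants are the same as in the harness.

```cpp
#include <cstdint>

constexpr std::uint64_t OFFSET = 1ULL << 62;
constexpr std::uint64_t SCALE_FLAG = 1ULL << 63;

// What the source says: add the offset, then set the flag bit.
constexpr std::uint64_t viaAddOr(std::int64_t value) {
    return (static_cast<std::uint64_t>(value) + OFFSET) | SCALE_FLAG;
}

// What the assembly does: keep the low 63 bits, then XOR with 0xC000'0000'0000'0000.
constexpr std::uint64_t viaMaskXor(std::int64_t value) {
    return (static_cast<std::uint64_t>(value) & 0x7FFF'FFFF'FFFF'FFFFULL) ^ (OFFSET | SCALE_FLAG);
}

static_assert(viaAddOr(1'000'000'078'124'999'936) == 14'835'058'133'407'163'648ULL);
static_assert(viaMaskXor(1'000'000'078'124'999'936) == 14'835'058'133'407'163'648ULL);
static_assert(viaAddOr(-1) == viaMaskXor(-1));  // also agrees for negative inputs
```

Adding $2^{62}$ flips bit 62 and may carry into bit 63, but bit 63 is forced to 1 by the OR anyway, which is why a mask plus one XOR is enough.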

With -march=native

Now, let's look at the instruction stream with FMA (also annotated by Gemini 3.1 Pro)!

getRunTime(double):
; --- Step 1: Determine the adjustment (Branchless Vector Operations) ---
vxorpd xmm1, xmm1, xmm1                               ; Zero out xmm1 (xmm1 = 0.0)
vcmpltsd xmm1, xmm1, xmm0                             ; Bitmask: 1s if 0.0 < price, else 0s
vmovddup xmm2, qword ptr [rip + .LCPI1_3]             ; Load 0.5 into xmm2
vblendvpd xmm1, xmm2, xmmword ptr [rip + .LCPI1_1], xmm1 ; Select 0.5 or -0.5 based on mask. Result in xmm1.

; --- Step 2: The Math (FMA) ---
vfmadd231sd xmm1, xmm0, qword ptr [rip + .LCPI1_2]    ; FUSED MULTIPLY ADD: xmm1 = (xmm0 * 10^13) + xmm1
                                                      ; Computes to infinite precision internally.
                                                      ; **[SINGLE ROUNDING EVENT]** Mantissa rounded ONCE.

; --- Step 3: Typecast and Bitwise Operations ---
vcvttsd2si rax, xmm1                                  ; Cast to integer with truncation.
mov cl, 63                                            ; Set cl register to 63
bzhi rcx, rax, rcx                                    ; BMI2 BZHI: Zero out the bits from index 63 upwards (keep the low 63 bits)
movabs rax, -4611686018427387904                      ; Load (OFFSET | SCALE_FLAG)
xor rax, rcx                                          ; Apply the offset and flag via XOR.
ret

vfmadd231sd: Here, the CPU computes the exact same mathematical formula, but skips the first rounding event.

FMA, as its name suggests, fuses the multiplication and addition into a single internal step with infinite precision:

$$ (100,000.0078125 * 10,000,000,000,000) + 0.5 = 1,000,000,078,125,000,000.5 $$

Now, we must round to the nearest representable double (again, multiples of 128). Calculating the distance to its neighbours:

$$ \Delta Lower = 1,000,000,078,125,000,000.5 - 1,000,000,078,124,999,936 = 64.5 $$

$$ \Delta Upper = 1,000,000,078,125,000,064 - 1,000,000,078,125,000,000.5 = 63.5 $$

Because we added 0.5 before rounding, the value is no longer a perfect tie. It is strictly closer to the Upper representable double!

Output: 1,000,000,078,125,000,064

Doing the same final bitwise operations: 1,000,000,078,125,000,064 ^ OFFSET ^ SCALE_FLAG, and plugging this into Compiler Explorer, we get: 14835058133407163776! Another familiar number!
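Both rounding paths can also be reproduced without touching -march at all. The sketch below is only an approximation of what the two instruction streams do: mulThenAdd and fusedMulAdd are hypothetical names, the volatile store is my way of keeping the compiler from contracting the expression, and I am assuming default floating-point settings (no -ffast-math).

```cpp
#include <cassert>
#include <cmath>

// Two rounding events, like mulsd + addsd. The volatile store forces the
// product to be rounded to a double before the addition happens.
double mulThenAdd(double a, double b, double c) {
    volatile double product = a * b;  // rounding event #1
    return product + c;               // rounding event #2
}

// One rounding event, like vfmadd231sd.
double fusedMulAdd(double a, double b, double c) {
    return std::fma(a, b, c);
}

void checkRoundingPaths() {
    const double price = 100'000.007'8125;
    const double scale = 1e13;
    assert(mulThenAdd(price, scale, 0.5) == 1'000'000'078'124'999'936.0);   // lower neighbour
    assert(fusedMulAdd(price, scale, 0.5) == 1'000'000'078'125'000'064.0);  // upper neighbour
}
```

std::fma is specified to round once whether or not the CPU has an FMA instruction, so this should behave the same with and without -march=native.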

Conclusion

Floats are scary.
Add volatile in tests :)

Real Conclusion

Use std::fma.

Or use std::llround.
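Roughly, the two suggestions could look like this. This is a sketch, not the code we shipped: doubleToIntFma, doubleToIntRound, encode and kScale are names invented for this post, and the constants mirror the ones from the problem statement.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

constexpr double kScale = 1e13;  // same value as SCALE, already as a double
constexpr std::uint64_t OFFSET = 1ULL << 62;
constexpr std::uint64_t SCALE_FLAG = 1ULL << 63;

constexpr std::uint64_t encode(std::int64_t scaled) {
    return (static_cast<std::uint64_t>(scaled) + OFFSET) | SCALE_FLAG;
}

// Option 1: always fuse, so every build rounds exactly once.
std::uint64_t doubleToIntFma(double price) {
    const double adjustment = (price > 0) ? 0.5 : -0.5;
    return encode(static_cast<std::int64_t>(std::fma(price, kScale, adjustment)));
}

// Option 2: drop the manual adjustment. With no addition left in the
// expression, there is nothing for the compiler to contract into an FMA.
std::uint64_t doubleToIntRound(double price) {
    return encode(std::llround(price * kScale));
}

void checkDeterministicVariants() {
    assert(doubleToIntFma(100'000.007'8125) == 14'835'058'133'407'163'776ULL);
    assert(doubleToIntRound(100'000.007'8125) == 14'835'058'133'407'163'648ULL);
}
```

Each variant is consistent with itself across builds, but note that they do not agree with each other on this input, so pick one and use it everywhere.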

Disclaimer

AI-assisted for some parts, particularly the assembly walkthrough...

C++ Directed Acyclic Graphs (DAG) - Version 1

Sun, 15 Feb 2026 00:00:00 GMT


DAG

Problem Statement

I use Pipeline/Graph interchangeably in my blog posts.

This is because in practice, it's akin to a Pipeline, but in theory, it's like a Directed Acyclic Graph.

In Version 0, we explored how to utilise templates to compose operations at compile-time, along with the pros and cons of virtual vs templates.

Ultimately, this library is about C++ and templating. virtual is akin to inheritance, and its patterns can be found in other languages. Hence, focus on the compile-time variant!

Suppose we want to compose N operations:

using Pipeline = A<B<C<D<E<>>>>>;

Immediately, two problems are apparent:

Poor Ergonomics: The nesting of classes in template arguments is extremely prone to human error. Adding or removing a node requires careful matching of the angle brackets!
Obscured Data Flow: In a simple pipeline like this, reading left-to-right is simple enough. But eventually, when multiple control paths are introduced, the nested structure obscures the data flow.

Hence, what we really want is an intuitive interface:

using Pipeline = Graph<A<>, B<>, C<>, D<>, E<>>;

This clearly expresses: "data flows through A, then B, then..."

Using a more complex example: Suppose we have a Router with different Routes (akin to a switch in C++)

using HandleFoo = Graph<A<>, B<>>;
using HandleBar = Graph<C<>, D<>>;

using Pipeline = Graph<Router<
  Route<EventFoo, HandleFoo>,
  Route<EventBar, HandleBar>
  >,
  E<>
>;

Compare this with the nested-template equivalent:

using HandleFoo = A<B<>>;
using HandleBar = C<D<>>;

// TBH I do not even know how to express it with nested templates LOL.
// The Router needs to somehow wrap the Routes, AND chain to E<>...
using Pipeline = Router<
  Route<EventFoo, HandleFoo>,
  Route<EventBar, HandleBar>
>; // where do I even put E<>?

Template Rebinding

The core challenge is converting the flattened list of lists Graph<A<>, B<>, ...> (note the tree-like recursion!) into the nested A<B<...>> at compile-time.

This is achieved through template rebinding! It's a fairly common metaprogramming technique (found in Boost libraries too!) that recursively transforms type hierarchies.

Understanding the Node Structure

A Node (or Stage) in the Graph/Pipeline is merely a struct with a process method.

First, let's look at (simplified) nodes:

struct Successor {};

template <typename Then = Successor>
struct PassThrough : Then {
  template <typename... Args>
  auto process(Args&&... args) {
    return Then::process(std::forward<Args>(args)...);
  }
};

struct Sink {
  template <typename... Args>
  void process(Args&&...) {
    return;
  }
};

Each (internal) node:

Takes a template parameter Then representing the next node in the chain.
Inherits from Then to form a chain of types.
Implements process() which delegates to the next node via Then::process().

Nodes can be classified into two categories:

Internal Nodes: Class templates deriving from Successor (e.g., PassThrough<>).
Terminal Nodes: Plain classes that do NOT derive from Successor (e.g., Sink).

namespace meta {

template <typename T>
concept Internal = std::is_class_v<T> && std::is_base_of_v<Successor, T>;

template <typename T>
concept Terminal = std::is_class_v<T> && !std::is_base_of_v<Successor, T>;

template <typename T>
concept NodeLike = Internal<T> || Terminal<T>;

} // namespace meta

You may notice that the NodeLike concept doesn't verify the presence of a process() method (or at least, now you notice).

This is because each process() method can accept different types and numbers of arguments.

A specialised checker will need to be created, one that iterates through the chain and validates each node.

This wasn't so trivial, and I didn't prioritise it.

Rebind

The Rebind template performs a search-and-replace operation on type hierarchies:

template <typename Pattern, typename Replacement, typename Target>
using Rebind = typename detail::RebindImpl<Pattern, Replacement, Target>::Type;

In plain English (:>): "In Target (a Class Template), find all occurrences of Pattern and replace them with Replacement."

Rebind Implementation

The rebinding works through template specialisation:

// Base case: Target doesn't match Pattern, return as-is.
template <typename Pattern, typename Replacement, typename Target>
struct RebindImpl {
  using Type = Target;
};

// Match found: Replace Pattern with Replacement.
template <typename Pattern, typename Replacement>
struct RebindImpl<Pattern, Replacement, Pattern> {
  using Type = Replacement;
};

// Recursive case: Target is a class template, recurse into its arguments.
template <typename Pattern,
          typename Replacement,
          template <typename...> class Target,
          typename... Args>
struct RebindImpl<Pattern, Replacement, Target<Args...>> {
  using Type = Target<typename RebindImpl<Pattern, Replacement, Args>::Type...>;
};

When Target is a template instantiation like: PassThrough<Successor>, we decompose it into:

The template itself: PassThrough.
Its arguments: Successor.

We then recursively apply Rebind to each argument, rebuilding the type with transformed arguments.
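Rebind is not tied to nodes at all, which makes it easy to poke at in isolation. The sketch below repeats the implementation above and uses std::tuple and std::pair purely as stand-in targets (my choice for illustration, nothing to do with the library):

```cpp
#include <tuple>
#include <type_traits>
#include <utility>

namespace detail {

template <typename Pattern, typename Replacement, typename Target>
struct RebindImpl {
  using Type = Target;
};

template <typename Pattern, typename Replacement>
struct RebindImpl<Pattern, Replacement, Pattern> {
  using Type = Replacement;
};

template <typename Pattern, typename Replacement,
          template <typename...> class Target, typename... Args>
struct RebindImpl<Pattern, Replacement, Target<Args...>> {
  using Type = Target<typename RebindImpl<Pattern, Replacement, Args>::Type...>;
};

}  // namespace detail

template <typename Pattern, typename Replacement, typename Target>
using Rebind = typename detail::RebindImpl<Pattern, Replacement, Target>::Type;

// Base case: nothing to replace.
static_assert(std::is_same_v<Rebind<int, long, char>, char>);
// Match found: the target is the pattern itself.
static_assert(std::is_same_v<Rebind<int, long, int>, long>);
// Recursive case: every int is replaced, at any depth.
static_assert(std::is_same_v<
    Rebind<int, long, std::tuple<int, std::pair<int, char>>>,
    std::tuple<long, std::pair<long, char>>>);
```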

Building the Graph

The Graph implementation then uses Rebind to chain nodes together:

template <typename... Nodes>
struct Graph : GraphImpl<Nodes...>::Type {
  // Implementation inherits from the fully constructed graph
};

Looking at GraphImpl which contains the core logic:

template <typename... Nodes>
struct GraphImpl;

// Base case: Graph is a single, terminal node.
template <typename Leaf>
struct GraphImpl<Leaf> {
  using Type = Leaf;
};

// Recursive case: Process Head, recurse on Tail.
template <typename Head, typename... Tail>
struct GraphImpl<Head, Tail...> {
  // Magic happens here!
  // We rebind the ::Type of this Node (Head) to that of the Rebind-ed Tail.
  using Type = Rebind<Successor, typename GraphImpl<Tail...>::Type, Head>;
};

Step-by-Step Trace

Tracing Graph<A<>, B<>, Sink> (showing default template arguments explicitly as A<Successor>, B<Successor>):

Process Head A<Successor> with Tail B<Successor>, Sink.

// Pattern = Successor
// Replacement = GraphImpl<B<Successor>, Sink>::Type (i.e., rest of the graph)
// Target = A<Successor>
GraphImpl<A<Successor>, B<Successor>, Sink>::Type
  = Rebind<Successor, GraphImpl<B<Successor>, Sink>::Type, A<Successor>>

At this point, we need to evaluate GraphImpl<B<Successor>, Sink>::Type.

Recurse into Tail. Process Head B<Successor> with Tail Sink:

GraphImpl<B<Successor>, Sink>::Type
  = Rebind<Successor, GraphImpl<Sink>::Type, B<Successor>>
  = Rebind<Successor, Sink, B<Successor>> // Base case: GraphImpl<Sink>::Type = Sink

Apply Rebind:

Rebind<Successor, Sink, B<Successor>>

This matches the recursive case earlier! B<Successor> is our Target, a class template with arguments.

Decomposing:

Template: B
Arguments: Successor

Now, recursively rebind the arguments:

B<typename RebindImpl<Successor, Sink, Successor>::Type...>
  = B<Sink> // Pattern matched! Successor replaced with Sink

Unwinding from innermost to outermost Rebind:

GraphImpl<A<Successor>, B<Successor>, Sink>::Type
  = Rebind<Successor, B<Sink>, A<Successor>>

Again, this matches the recursive case. Decomposing A<Successor>:

Template: A
Arguments: Successor

Recursively rebind:

A<typename RebindImpl<Successor, B<Sink>, Successor>::Type...>
  = A<B<Sink>> // Pattern match! Successor replaced with B<Sink>

Visualising the Stack:

Input:  Graph<A<>, B<>, Sink>
        |
Step 1: GraphImpl recurses, rightmost is resolved first
        Sink (terminal, base case)
        |
Step 2: B<> gets its Successor replaced with Sink
        B<Successor> -> B<Sink>
        |
Step 3: A<> gets its Successor replaced with B<Sink>
        A<Successor> -> A<B<Sink>>
        |
Output: A<B<Sink>>
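To see the trace end-to-end, here is a compact sketch that puts Rebind and GraphImpl together and actually pushes a value through the result. AddOne, Twice and Return are toy nodes invented for this example (the real nodes look different), and process() is constexpr here only so the whole thing can be checked with static_assert.

```cpp
#include <type_traits>

struct Successor {};

namespace detail {
template <typename Pattern, typename Replacement, typename Target>
struct RebindImpl { using Type = Target; };

template <typename Pattern, typename Replacement>
struct RebindImpl<Pattern, Replacement, Pattern> { using Type = Replacement; };

template <typename Pattern, typename Replacement,
          template <typename...> class Target, typename... Args>
struct RebindImpl<Pattern, Replacement, Target<Args...>> {
  using Type = Target<typename RebindImpl<Pattern, Replacement, Args>::Type...>;
};
}  // namespace detail

template <typename Pattern, typename Replacement, typename Target>
using Rebind = typename detail::RebindImpl<Pattern, Replacement, Target>::Type;

template <typename... Nodes>
struct GraphImpl;

template <typename Leaf>
struct GraphImpl<Leaf> { using Type = Leaf; };

template <typename Head, typename... Tail>
struct GraphImpl<Head, Tail...> {
  using Type = Rebind<Successor, typename GraphImpl<Tail...>::Type, Head>;
};

template <typename... Nodes>
struct Graph : GraphImpl<Nodes...>::Type {};

// Toy internal nodes: transform the value, then delegate to the next node.
template <typename Then = Successor>
struct AddOne : Then {
  constexpr int process(int x) const { return Then::process(x + 1); }
};

template <typename Then = Successor>
struct Twice : Then {
  constexpr int process(int x) const { return Then::process(x * 2); }
};

// Toy terminal node: hands the value back to the caller.
struct Return {
  constexpr int process(int x) const { return x; }
};

using Pipeline = Graph<AddOne<>, Twice<>, Return>;

// The flat list was rebound into AddOne<Twice<Return>> ...
static_assert(std::is_same_v<GraphImpl<AddOne<>, Twice<>, Return>::Type, AddOne<Twice<Return>>>);
static_assert(std::is_base_of_v<AddOne<Twice<Return>>, Pipeline>);
// ... and data flows left-to-right: (3 + 1) * 2.
static_assert(Pipeline{}.process(3) == 8);
```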

How/Why this Works

This works for a multitude of reasons:

Default Template Arguments: Internal nodes default their Then parameter to Successor, giving us a consistent placeholder to Rebind.
Recursive Rebinding: Rebind then traverses the entire type hierarchy, finding and replacing all Successor placeholders at any depth.
Right-to-Left Construction: The recursion then unwinds, processing the nodes from innermost-to-outermost (i.e., right-to-left, tail-to-head), building the nested structure naturally.

Extended Example with Router

I haven't explained Router yet (at least, not fully! :D). Speedrunning!

Router is a little special in that it is composed of Nodes.

template <auto EventV, typename Node>
struct Route : Node {
  // Route inherits from Node
};

template <typename... Routes>
struct Router : Routes... {
  // Router inherits from all Routes
};

So, Router itself does not have a Then parameter. It only inherits from its Routes. To make Router chainable in a Graph, we then give each Route a Then parameter, by giving it a Graph as its Node argument:

using HandleFoo = Graph<A<>, B<>>;  // A<B<Successor>> - still open for chaining!
using HandleBar = Graph<C<>, D<>>;  // C<D<Successor>>

using Pipeline = Graph<
  PassThrough<>,
  Router<
    Route<EventFoo, HandleFoo>,  // Route inherits from A<B<Successor>>, inheriting Then parameter!
    Route<EventBar, HandleBar>   // Route inherits from C<D<Successor>>
  >,
  Sink  // Terminal
>;

When rebinding:

GraphImpl<Router<...>, Sink>::Type.
Router<...> is a class template with arguments.
The Rebind invocation looks like:

Rebind<Successor, Sink,
      Router<
        Route<EventFoo, A<B<Successor>>>,
        Route<EventBar, C<D<Successor>>>
        >
      >

Which then produces the nested structure:

Router<
  Route<EventFoo, A<B<Sink>>>,
  Route<EventBar, C<D<Sink>>>
>

Rebind traverses into the Router's template arguments (the Routes), and within each Route's template arguments (the Nodes), finds and replaces all Successor placeholders recursively!
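The Router case can be checked at the type level too. One assumption to flag loudly: in this sketch, Route takes its event as a type (EventFoo and EventBar are empty tag structs I made up). As far as I can tell, the variadic `template <typename...> class Target` pattern in RebindImpl does not match a template with a non-type parameter such as auto EventV, so the simplified Route shown above would need extra handling that I am not showing here.

```cpp
#include <type_traits>

struct Successor {};

namespace detail {
template <typename Pattern, typename Replacement, typename Target>
struct RebindImpl { using Type = Target; };

template <typename Pattern, typename Replacement>
struct RebindImpl<Pattern, Replacement, Pattern> { using Type = Replacement; };

template <typename Pattern, typename Replacement,
          template <typename...> class Target, typename... Args>
struct RebindImpl<Pattern, Replacement, Target<Args...>> {
  using Type = Target<typename RebindImpl<Pattern, Replacement, Args>::Type...>;
};
}  // namespace detail

template <typename Pattern, typename Replacement, typename Target>
using Rebind = typename detail::RebindImpl<Pattern, Replacement, Target>::Type;

// Placeholder nodes, bodies omitted: only the types matter for this check.
template <typename Then = Successor> struct A : Then {};
template <typename Then = Successor> struct B : Then {};
template <typename Then = Successor> struct C : Then {};
template <typename Then = Successor> struct D : Then {};
struct Sink {};

// ASSUMPTION: events are types here, not values.
struct EventFoo {};
struct EventBar {};

template <typename Event, typename Node>
struct Route : Node {};

template <typename... Routes>
struct Router : Routes... {};

using Open = Router<Route<EventFoo, A<B<Successor>>>,
                    Route<EventBar, C<D<Successor>>>>;

using Closed = Router<Route<EventFoo, A<B<Sink>>>,
                      Route<EventBar, C<D<Sink>>>>;

// Both Successor placeholders, two levels down, are replaced in one go.
static_assert(std::is_same_v<Rebind<Successor, Sink, Open>, Closed>);
```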

Conclusion

Template rebinding transforms an intuitive, flat syntax into the nested type hierarchy. By leveraging C++'s template metaprogramming extensively, we get:

Ergonomic API: Graph<A<>, B<>, C<>> is clear and maintainable.
Zero Cost: No runtime overhead (as explained in V0).
Type Safety: Invalid node combinations are caught by the Compiler, rather than failing at runtime, possibly catastrophically.
Composability: Functional-Programming-like!

All made possible by Rebind!

C++ Directed Acyclic Graphs (DAG) - Version 0

Mon, 19 Jan 2026 00:00:00 GMT


DAG

Problem Statement

When processing data, we very often want to (or need-to) chain a series of discrete operations: Transform, Filter, Collect, Fan-Out, etc.

In the simplest case, these are just sequential function calls:

int increment(int x) {
  return x + 1;
}

int shiftLeft(int x) {
  return x << 1;
}

int process(int x) {
  x = increment(x);
  x = shiftLeft(x);
  return x;
}

Scalability "Wall"

As requirements grow, simple function chaining becomes difficult for a whole bunch of reasons:

State Management: What if an operation requires maintaining an internal counter or a buffer (e.g., for collecting metrics)? Passing state through function parameters becomes unwieldy quickly (Parameter Drilling). Global State introduces thread-safety risks, and makes the code difficult to reason about and test.
Re-usability: If we want to use the shiftLeft logic in different process functions, we often run into conflicting requirements. Some use-cases require logging, others need access to a specific context. We inevitably end up with multiple versions of shiftLeft (e.g., shiftLeftWithLog, shiftLeftWithContext) with slightly different signatures to accommodate every edge case.
Dynamic Topology: In more complex systems, we may want to swap a "Fan-Out" stage for a "Pass-Through" stage based on configuration, without rewriting the core logic. For example, in Staging, we may want to pass data through because we only have one testing destination server. In Production, we may want to Fan-Out this data to multiple destinations.
Testing: This goes without saying. Free functions that hard-code the functions they call cannot easily be unit-tested in isolation.

Version 0: Run-Time (Virtual) -> Compile-Time (Templates)

In Version 0, we solve the problems by using Run-time Polymorphism, which most of us would be familiar with (virtual in C++, interface in Go, Inheritance in Java).

While this makes the code much more flexible and maintainable than free functions, it introduces a runtime "tax" that we want to avoid in high-performance systems.

Hence, we evolve this to Compile-time polymorphism later. Just because we can in C++.

Refer to Version0.cpp on my Github for the implementation details.

Key Concepts

Version 0 treats the pipeline as a bunch of Stages. Because every stage derives (or inherits) from a common interface Then, the stages do not need to know the specific type of the "Then" stage, only that it exists.

Decoupled Topology: Stages can be swapped out.
- In the run-time version, the Adder can be swapped for a Multiplier by changing the Constructor arguments.
- In the compile-time version, change the Template arguments.
Encapsulated State: If Adder needs to count how many messages it has processed, it can store a std::uint64_t count_ member internally, which is invisible to the rest of the pipeline. Note that this does not yet address the passing of State (Parameter Drilling).
Testability: We can mock Stages, allowing us to unit-test them in isolation.
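As a rough illustration of the last two points, the sketch below pairs a stage that keeps a private counter with a mock stage that records whatever reaches it. CountingAdder, RecordingStage and testCountingAdder are hypothetical names, and the Then interface is trimmed to process() only to keep the example short.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Trimmed version of the common interface: only process() is kept here.
struct Then {
  virtual void process(std::int64_t x) = 0;
};

// Hypothetical stage: adds one and privately counts the values it has seen.
class CountingAdder : public Then {
 public:
  explicit CountingAdder(Then* then) : then_(then) {}

  void process(std::int64_t x) override {
    ++count_;
    then_->process(x + 1);
  }

  std::uint64_t count() const { return count_; }

 private:
  Then* then_;
  std::uint64_t count_ = 0;  // invisible to the rest of the pipeline
};

// Hypothetical mock: records every value so a stage can be tested in isolation.
struct RecordingStage : Then {
  std::vector<std::int64_t> seen;
  void process(std::int64_t x) override { seen.push_back(x); }
};

// Unit-test style check: CountingAdder runs without any real downstream stage.
void testCountingAdder() {
  RecordingStage mock;
  CountingAdder adder{&mock};
  adder.process(1);
  adder.process(41);
  assert(adder.count() == 2);
  assert((mock.seen == std::vector<std::int64_t>{2, 42}));
}
```

Because CountingAdder only knows about Then*, the mock can stand in for any real stage without changing the stage under test.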

Run-Time

In the Run-Time version, we define an "interface" that allows us to define a pipeline that processes std::int64_ts:

struct Then {
  virtual void process(std::int64_t) = 0;
  virtual std::int64_t result() const = 0;
};

struct Pipeline {
  Store store;
  Doubler doubler{&store}; // Doubler wraps Store
  Adder adder{&doubler};   // Adder wraps Doubler
} storage;

Then* pipeline = &storage.adder; // gets the entry-point into the Pipeline.
pipeline->process(x); // adder.process -> doubler.process -> store.process
return pipeline->result(); // returns final value of x.

This is simple-enough. But we are using C++.

Trade-Offs

The flexibility afforded by run-time polymorphism (virtual) comes at the cost of indirection.

As the name run-time implies, the CPU has to do extra work at run-time to support this "plug-and-play" feature.

Memory Tax

In the code, sizeof(Then) is 8 bytes and sizeof(Doubler) is 16 bytes.

This assumes a 64-bit architecture (e.g., X86-64), where the CPU uses 64-bit memory addresses (i.e., 8 bytes).

On a 32-bit arch, sizeof(Then) would be 4 bytes instead!

Virtual Function Table Pointer (vptr): Since Doubler has virtual functions, the compiler adds a hidden pointer that points to its class's "Virtual Function Table" (vtable).
Member Pointer: The address of the next stage must be stored in a Then* then_ member.

At 48 bytes for a simple 3-stage pipeline, we are already close to exhausting a single 64-byte Cache Line.
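The sizes quoted above can be checked with a sketch of the three run-time stages. The member layout is inferred from the snippet and the numbers in this section, so treat it as an assumption rather than the exact contents of Version0.cpp. Storage and run are hypothetical names, and the size checks assume a common ABI where a polymorphic object carries exactly one hidden vptr.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

struct Then {
  virtual void process(std::int64_t) = 0;
  virtual std::int64_t result() const = 0;
};

// Terminal stage: vptr + the stored value.
struct Store : Then {
  std::int64_t result_ = 0;
  void process(std::int64_t x) override { result_ = x; }
  std::int64_t result() const override { return result_; }
};

// Intermediate stages: vptr + pointer to the next stage.
struct Doubler : Then {
  Then* then_;
  explicit Doubler(Then* then) : then_(then) { assert(then_ != nullptr); }
  void process(std::int64_t x) override { then_->process(x * 2); }
  std::int64_t result() const override { return then_->result(); }
};

struct Adder : Then {
  Then* then_;
  explicit Adder(Then* then) : then_(then) { assert(then_ != nullptr); }
  void process(std::int64_t x) override { then_->process(x + 1); }
  std::int64_t result() const override { return then_->result(); }
};

struct Storage {
  Store store;
  Doubler doubler{&store};  // Doubler wraps Store
  Adder adder{&doubler};    // Adder wraps Doubler
};

static_assert(sizeof(Then) == sizeof(void*), "only the hidden vptr");
static_assert(sizeof(Doubler) == 2 * sizeof(void*), "vptr + then_");
constexpr std::size_t kStorageBytes = sizeof(Storage);  // 48 with 8-byte pointers

// Hypothetical helper: pushes x through Adder -> Doubler -> Store.
std::int64_t run(std::int64_t x) {
  Storage storage;
  Then* pipeline = &storage.adder;
  pipeline->process(x);
  return pipeline->result();
}
```

Every stage pays for a vptr whether or not it holds state, which is where most of the 48 bytes go.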

When you access a memory address, the hardware proactively grabs the surrounding 64 bytes and loads them into the L1 cache.

This is Spatial Locality! If your data is packed tightly, the CPU gets everything it needs in one trip!

❯ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
─────┬────────────────────────────────────────────────────────────────────
     │ File: /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
─────┼────────────────────────────────────────────────────────────────────
   1 │ 64
─────┴────────────────────────────────────────────────────────────────────

Adding a few more stages, or some internal state (like a counter), will cause our pipeline to span multiple cache lines, which is a biiiig problem!

The CPU can no longer retrieve the entire Pipeline in a single fetch. It is now required to issue multiple fetches from memory to pull the Pipeline into its L1 Cache.
Because the stages are linked via pointers, the CPU cannot effectively prefetch! It must wait for the then_ pointer to be resolved, before it even knows which memory address to request next.

CPU Tax

When pipeline->process(x) is invoked, the CPU cannot simply jump to the next instruction (ideally, the equivalent of the addi and mult instructions in MIPS).

The CPU must:

Dereference the vptr to find the vtable.
Index into the table to find the address of the process function.
Call through the address (call instruction).

This is a form of Pointer-Chasing, which is difficult for the CPU's branch predictor to optimise, and prevents the Compiler from inlining the code. Each vtable access is also a potential cache miss.
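The three steps can be spelled out with a hand-written vtable. This is only an illustration of the mechanism: the Object and VTable layouts below are hypothetical and much simpler than what a real compiler and ABI generate.

```cpp
#include <cassert>
#include <cstdint>

struct Object;  // forward declaration

// Hand-written stand-in for a compiler-generated vtable.
using ProcessFn = std::int64_t (*)(Object*, std::int64_t);
struct VTable {
  ProcessFn process;  // slot 0
};

// Every "polymorphic" object starts with a pointer to its class's table.
struct Object {
  const VTable* vptr;
};

std::int64_t adderProcess(Object*, std::int64_t x) { return x + 1; }
std::int64_t doublerProcess(Object*, std::int64_t x) { return x * 2; }

const VTable kAdderVTable{&adderProcess};
const VTable kDoublerVTable{&doublerProcess};

// What obj->process(x) has to do when the concrete type is unknown.
std::int64_t dispatch(Object* obj, std::int64_t x) {
  assert(obj != nullptr && obj->vptr != nullptr);
  const VTable* vtable = obj->vptr;  // 1. load the vptr from the object
  ProcessFn fn = vtable->process;    // 2. load the function address from its slot
  return fn(obj, x);                 // 3. indirect call through that address
}
```

Steps 1 and 2 are dependent loads: the target of step 3 is not known until both have completed, which is what makes the call hard to inline.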

Compile-Time

In this version, we use C++ templates to achieve (more!!) flexibility without runtime overhead.

The key difference: instead of storing a pointer to the Then stage, the Then stage is encoded directly in the template parameter.

struct Store {
  std::int64_t result_;
  void process(std::int64_t x) { result_ = x; }
  std::int64_t result() const { return result_; }
};

template <typename Then>
struct Doubler : Then {
  void process(std::int64_t x) { Then::process(x*2); }
};

template <typename Then>
struct Adder : Then {
  void process(std::int64_t x) { Then::process(x+1); }
};

using Pipeline = Adder<Doubler<Store>>;

Pipeline pipeline;
pipeline.process(x);
return pipeline.result();

Given that Pipeline is now a type, not an Object with pointers, the compiler knows the exact type at compile-time.

Benefits (over Run-Time)

Zero Indirection: No vtable lookups. When pipeline.process(x) is called, the compiler knows exactly which function to call, and can inline the entire chain. The entire pipeline collapses into a few instructions. Refer to Compile-time vs Run-time instructions generated.
Reduced Memory Overhead: Since Doubler and Adder have no member variables and hold no state, sizeof(Pipeline) is just 8 bytes for the Store::result_ field, compared to 48 bytes for the run-time version, so it sits comfortably within a single cache line. It can even be optimised further to 1 byte (return the result instead of storing it). NOTE: It is inaccurate to call this optimisation EBCO (Empty Base Class Optimisation).

This optimisation is how EBCO works anyways!
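Both sizes can be verified at compile time. In the sketch below the stages return whatever the next stage returns, so the same templates work with the stateful Store and with a stateless terminal. Return, StatelessPipeline and runStateless are hypothetical names, and this is only one reading of "return the result instead of storing".

```cpp
#include <cstdint>
#include <type_traits>

// Stateful terminal from the post: one 8-byte field.
struct Store {
  std::int64_t result_ = 0;
  constexpr void process(std::int64_t x) { result_ = x; }
  constexpr std::int64_t result() const { return result_; }
};

// Hypothetical stateless terminal: hands the value straight back.
struct Return {
  constexpr std::int64_t process(std::int64_t x) { return x; }
};

// Stages forward the next stage's return value (void for Store).
template <typename Then>
struct Doubler : Then {
  constexpr auto process(std::int64_t x) { return Then::process(x * 2); }
};

template <typename Then>
struct Adder : Then {
  constexpr auto process(std::int64_t x) { return Then::process(x + 1); }
};

using Pipeline = Adder<Doubler<Store>>;
using StatelessPipeline = Adder<Doubler<Return>>;

// Only Store::result_ occupies space: no vptr, no then_ pointers.
static_assert(sizeof(Pipeline) == sizeof(std::int64_t));

// With no data members anywhere, the object shrinks to the 1-byte minimum.
static_assert(std::is_empty_v<StatelessPipeline>);
static_assert(sizeof(StatelessPipeline) == 1);

constexpr std::int64_t runStateless(std::int64_t x) {
  StatelessPipeline pipeline{};
  return pipeline.process(x);
}

static_assert(runStateless(3) == 8);  // (3 + 1) * 2
```

Since the whole chain is constexpr, the compiler can even evaluate the pipeline at compile time when the input is a constant.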

Trade-Offs

Static Topology: The pipeline structure is fixed at compile-time. Adder cannot be swapped for Multiplier based on a configuration file or runtime condition. Thankfully, we can fix this (later)!
Code-Bloat: For each Pipeline configuration, the compiler generates code for each instantiation. This increases binary size and can lead to instruction cache misses. However, we can minimize i-cache misses using techniques like cache warming and optimising for the Hot-Path where performance is critical. Most code paths (including the hot-path) will still see an improvement due to reduced data-cache misses.
Longer Compile Times: I guess, when run-time performance is vital, this is an acceptable trade-off.
Template Bloat: still learning. If it's interesting enough, maybe I will share it here :)

Compiler Explorer Comparison

Compiled using CXX_COMPILER=x86-64 gcc 15.2, CXX_FLAGS=-g -std=c++23 -O3. Irrelevant instructions (like main) are removed for brevity.

See source code on Compiler Explorer.

Simplified Abstract Pipeline

x -> x+1 -> (x+1) * 2

Compile-Time (Templates)

The entire pipeline is inlined, and optimised to a single lea instruction.

compiletime::process(long):
        lea     rax, [rdi+2+rdi]
        ret

The lea (Load Effective Address) instruction computes rdi + rdi + 2, which is equivalent to 2*x + 2 (equivalent to (x+1)*2).

The compiler recognised the entire pipeline's computation and collapsed it into pure arithmetic - a single instruction!

Run-Time (`virtual`)

Even with -O3 optimisations, these virtual function calls are not inlined: every call goes through a Then*, and the compiler cannot prove here which concrete functions will be called at runtime.

# Entry Point: Demonstrates vtable lookup overhead
runtime::process(long):
        # pipeline->process()
        sub     rsp, 8                                 # Allocate stack space
        mov     rsi, rdi                               # rsi = x (input parameter)
        mov     rdi, QWORD PTR runtime::pipeline[rip]  # load pipeline pointer
        mov     rax, QWORD PTR [rdi]                   # load vptr from object
        call    [QWORD PTR [rax]]                      # call through vtable (process function)

        # pipeline->result()
        mov     rdi, QWORD PTR runtime::pipeline[rip]  # reload pipeline pointer
        mov     rax, QWORD PTR [rdi]                   # reload vptr
        mov     rax, QWORD PTR [rax+8]                 # load vtable[8] (result function)
        add     rsp, 8                                 # clean up stack
        jmp     rax                                    # tail call to result()

# Each Stage requires its own function, with virtual function call overhead: pointer chasing
runtime::Store::process(long):
        mov     QWORD PTR [rdi+8], rsi                 # result_ = x (store)
        ret

runtime::Store::result() const:
        mov     rax, QWORD PTR [rdi+8]                 # Load result_ from memory
        ret

runtime::Doubler::process(long):
        mov     rdi, QWORD PTR [rdi+8]                 # Load then_ pointer
        add     rsi, rsi                               # x = x * 2 (actual work!)
        mov     rax, QWORD PTR [rdi]                   # load vptr from then_ object
        jmp     [QWORD PTR [rax]]                      # jump through vtable to then_->process()

runtime::Doubler::result() const:
        mov     rdi, QWORD PTR [rdi+8]                 # Load then_ pointer
        mov     rax, QWORD PTR [rdi]                   # Load vptr
        jmp     [QWORD PTR [rax+8]]                    # Jump through vtable to then_->result()

runtime::Adder::process(long):
        mov     rdi, QWORD PTR [rdi+8]                 # Load then_ pointer
        add     rsi, 1                                 # x = x + 1 (actual work done!)
        mov     rax, QWORD PTR [rdi]                   # load vptr from then_ object
        jmp     [QWORD PTR [rax]]                      # jump through vtable to then_->process

runtime::Adder::result() const:
        mov     rdi, QWORD PTR [rdi+8]                 # Load then_ pointer
        mov     rax, QWORD PTR [rdi]                   # Load vptr
        jmp     [QWORD PTR [rax+8]]                    # Jump through vtable to then_->result()

# main omitted

# Global Constructor - runs before main() to initialise each object's vptr and then_ pointer.
_GLOBAL__sub_I_runtime::storage:
        movq    xmm0, QWORD PTR .LC0[rip]
        mov     QWORD PTR runtime::storage[rip], OFFSET FLAT:vtable for runtime::Store+16
        movhps  xmm0, QWORD PTR .LC1[rip]
        movaps  XMMWORD PTR runtime::storage[rip+16], xmm0
        movq    xmm0, QWORD PTR .LC2[rip]
        movhps  xmm0, QWORD PTR .LC3[rip]
        movaps  XMMWORD PTR runtime::storage[rip+32], xmm0
        ret

# Runtime Type Information (RTTI) - enables dynamic_cast and typeid
typeinfo name for runtime::Then:
        .string "N7runtime4ThenE"
typeinfo for runtime::Then:
        .quad   vtable for __cxxabiv1::__class_type_info+16
        .quad   typeinfo name for runtime::Then
typeinfo name for runtime::Store:
        .string "N7runtime5StoreE"
typeinfo for runtime::Store:
        .quad   vtable for __cxxabiv1::__si_class_type_info+16
        .quad   typeinfo name for runtime::Store
        .quad   typeinfo for runtime::Then
typeinfo name for runtime::Doubler:
        .string "N7runtime7DoublerE"
typeinfo for runtime::Doubler:
        .quad   vtable for __cxxabiv1::__si_class_type_info+16
        .quad   typeinfo name for runtime::Doubler
        .quad   typeinfo for runtime::Then
typeinfo name for runtime::Adder:
        .string "N7runtime5AdderE"
typeinfo for runtime::Adder:
        .quad   vtable for __cxxabiv1::__si_class_type_info+16
        .quad   typeinfo name for runtime::Adder
        .quad   typeinfo for runtime::Then

# Virtual Function Tables - Array of Function Pointers for each class
vtable for runtime::Store:
        .quad   0                                      # Offset to top
        .quad   typeinfo for runtime::Store            # RTTI pointer
        .quad   runtime::Store::process(long)          # vtable[0]: process()
        .quad   runtime::Store::result() const         # vtable[8]: result()
vtable for runtime::Doubler:
        .quad   0                                      # Offset to top
        .quad   typeinfo for runtime::Doubler          # RTTI pointer
        .quad   runtime::Doubler::process(long)        # vtable[0]: process()
        .quad   runtime::Doubler::result() const       # vtable[8]: result()
vtable for runtime::Adder:
        .quad   0                                      # Offset to top
        .quad   typeinfo for runtime::Adder            # RTTI pointer
        .quad   runtime::Adder::process(long)          # vtable[0]: process()
        .quad   runtime::Adder::result() const         # vtable[8]: result()

# Global Data
runtime::pipeline:
        .quad   runtime::storage+32                 # Points to Adder (entry point)

runtime::storage:
        .zero   48                                  # 48 bytes: Store(16)+Doubler(16)+Adder(16)

# Constants for Initialisation
.LC0:
        .quad   vtable for runtime::Doubler+16
.LC1:
        .quad   runtime::storage
.LC2:
        .quad   vtable for runtime::Adder+16
.LC3:
        .quad   runtime::storage+16

Conclusion

~~Should be self-explanatory. virtual generates so many more instructions. So much overhead.~~ IT DEPENDS.

Credits

Inspired by my internship at Squarepoint on the Trading Controls (Core Trading Services) team.

This was originally called Pipeline, but I decided that naming it a Directed Acyclic Graph was more appealing to me, considering the ability to FanOut (and FanIn) (WIP).
