<?xml version="1.0" encoding="UTF-8"?><?xml-stylesheet href="/scripts/pretty-feed-v3.xsl" type="text/xsl"?><rss version="2.0" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:h="http://www.w3.org/TR/html4/"><channel><title>Eu Chang Xian</title><description>Personal website of Eu Chang Xian, a Computer Science Undergraduate at National University of Singapore (NUS), and a C++ Software Engineer</description><link>https://euchangxian.dev</link><item><title>Clock Drift (WIP, ETA: Unknown!) ₍^. .^₎⟆</title><link>https://euchangxian.dev/blog/clock-drift</link><guid isPermaLink="true">https://euchangxian.dev/blog/clock-drift</guid><description>Maybe (not)</description><pubDate>Wed, 20 May 2026 00:00:00 GMT</pubDate><content:encoded>
&lt;h1&gt;Clock Drift&lt;/h1&gt;
&lt;p&gt;Upcoming (maybe) interesting topic/problem encountered!&lt;/p&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>Floating Point Precision</title><link>https://euchangxian.dev/blog/floating-point-precision</link><guid isPermaLink="true">https://euchangxian.dev/blog/floating-point-precision</guid><description>or more precisely - imprecision!</description><pubDate>Wed, 11 Mar 2026 00:00:00 GMT</pubDate><content:encoded>
&lt;h1&gt;Floats&lt;/h1&gt;
&lt;p&gt;Interesting problem encountered during my internship O^O.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Wherever I refer to &lt;strong&gt;floats&lt;/strong&gt;, I am referring to &lt;em&gt;double&lt;/em&gt;-precision, 64-bit
floats specified by the IEEE-754 standard, corresponding to a &lt;code&gt;double&lt;/code&gt; in C++,
unless otherwise described.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Given that this problem emerged with our clang-compiled code, I will (mostly)
be discussing &lt;code&gt;clang&lt;/code&gt; only.
&lt;code&gt;gcc&lt;/code&gt; has a different default behaviour, but the theory/fundamentals are the same.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Problem Statement&lt;/h2&gt;
&lt;p&gt;Consider a &lt;code&gt;doubleToInt&lt;/code&gt; function used to normalise floats into integers for
subsequent computations:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;constexpr std::size_t NUM_BITS_VALUE = 63;
constexpr std::uint64_t SCALE = 10&apos;000&apos;000&apos;000&apos;000ULL; // log2(10^13) ~= 44 bits to represent
constexpr std::uint64_t OFFSET = 1ULL &amp;#x3C;&amp;#x3C; (NUM_BITS_VALUE - 1); // 2^62
constexpr std::uint64_t SCALE_FLAG = 1ULL &amp;#x3C;&amp;#x3C; NUM_BITS_VALUE;   // 2^63

std::uint64_t doubleToInt(double price) {
    double adjustment = (price &gt; 0) ? 0.5 : -0.5;
    return (static_cast&amp;#x3C;std::int64_t&gt;(price * SCALE + adjustment) + OFFSET) | SCALE_FLAG;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This looks extremely innocent at first glance.&lt;/p&gt;
&lt;p&gt;However, when compiling with optimisations (non-Debug), e.g., &lt;code&gt;-DCMAKE_BUILD_TYPE=RelWithDebInfo&lt;/code&gt;,
tests that are &lt;em&gt;&quot;unlucky&quot;&lt;/em&gt; will &lt;strong&gt;fail&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This failure is due to a &lt;strong&gt;divergence&lt;/strong&gt; in the computed value with/without optimisations.
Take this test assertion as a concrete example:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;// Note that these are adapted snippets, not the actual code.
auto parsedDouble = std::stod(&quot;110987.6543210987654&quot;);
auto computedIntPrice = doubleToInt(parsedDouble);
EXPECT_EQ(doubleToInt(110987.6543210987654), computedIntPrice);
// ...
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We would get this failure:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;[ RUN      ] MessageInterpreter.OrderBookSnapshot
test.cpp:1086: Failure
Expected equality of these values:
  doubleToInt(110987.6543210987654)
    Which is: 14944934598493151360
  computedIntPrice
    Which is: 14944934598493151232
[  FAILED  ] MessageInterpreter.OrderBookSnapshot (1 ms)
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Note that &lt;strong&gt;neither&lt;/strong&gt; of the computed values is wrong!&lt;/p&gt;
&lt;p&gt;Both values are legitimate results under IEEE-754 arithmetic, and are
&lt;em&gt;not&lt;/em&gt; the result of some spurious failure.&lt;/p&gt;
&lt;h2&gt;Background (Theory)&lt;/h2&gt;
&lt;p&gt;This part is a little heavy.&lt;/p&gt;
&lt;p&gt;To understand why the computed values are different, we first have to look at
how computers &lt;strong&gt;represent&lt;/strong&gt; real numbers.&lt;/p&gt;
&lt;h3&gt;Basics&lt;/h3&gt;
&lt;p&gt;This part will cover the basics of floats.&lt;/p&gt;
&lt;p&gt;If you have taken CS2100 (and it&apos;s still fresh in your head) in NUS, then this
part can be skimmed.&lt;/p&gt;
&lt;h4&gt;How Real Numbers are Represented&lt;/h4&gt;
&lt;p&gt;A &lt;code&gt;double&lt;/code&gt; is 64 bits wide. This means that it can represent &lt;strong&gt;at most
$2^{64}$ distinct values&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Because the set of Real Numbers is &lt;strong&gt;uncountably infinite&lt;/strong&gt;,
&lt;em&gt;almost all&lt;/em&gt; (infinitely many, actually... Cantor&apos;s Diagonalisation!) real numbers cannot be
represented &lt;strong&gt;exactly&lt;/strong&gt;, and must be &lt;strong&gt;rounded&lt;/strong&gt; to &lt;em&gt;a&lt;/em&gt; &lt;strong&gt;representable&lt;/strong&gt; value.&lt;/p&gt;
&lt;h4&gt;Converting a base-10 Real Number to its base-2 Representation&lt;/h4&gt;
&lt;p&gt;As a refresher, here is the algorithm for converting a base-10 number into its IEEE-754
double-precision format.
The 64 bits of a &lt;code&gt;double&lt;/code&gt; are broken down as follows:&lt;/p&gt;
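&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Sign&lt;/th&gt;&lt;th&gt;Biased-Exponent&lt;/th&gt;&lt;th&gt;Mantissa&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;1 bit&lt;/td&gt;&lt;td&gt;11 bits&lt;/td&gt;&lt;td&gt;52 bits&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;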
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Sign&lt;/strong&gt;: 0 if the number is positive, 1 otherwise.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Biased-Exponent&lt;/strong&gt;: Represents the power to which the base is raised.
&lt;ul&gt;
&lt;li&gt;A $\textbf{bias} = 2^{10} - 1 = 1023$ is added to be able to represent
positive &lt;strong&gt;and&lt;/strong&gt; negative exponents.&lt;/li&gt;
&lt;li&gt;Hence, the &lt;strong&gt;true&lt;/strong&gt; exponent can represent the set of integers in the range
$[-1022, 1023]$ instead of just $[0, 2047]$.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mantissa&lt;/strong&gt;: Stores the significant digits (the precision) of the number.
&lt;ul&gt;
&lt;li&gt;Given that &lt;strong&gt;non-zero&lt;/strong&gt; numbers always have a set bit in their binary
representation, we can actually represent &lt;strong&gt;52+1&lt;/strong&gt; bits of precision, with
the &lt;strong&gt;implicit&lt;/strong&gt; leading 1-bit (normal floating point numbers are always normalised;
subnormals are the exception).&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Although the exponent field has &lt;strong&gt;11 bits&lt;/strong&gt; (giving $2^{11} = 2048$ possible values),
&lt;strong&gt;not all exponent values represent normal numbers&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Two exponent patterns are &lt;strong&gt;reserved&lt;/strong&gt; by the IEEE-754 standard:&lt;/p&gt;
&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Stored Exponent&lt;/th&gt;&lt;th&gt;Meaning&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;00000000000&lt;/code&gt; (0)&lt;/td&gt;&lt;td&gt;Used for &lt;strong&gt;subnormal numbers&lt;/strong&gt; and &lt;strong&gt;zero&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;&lt;code&gt;11111111111&lt;/code&gt; (2047)&lt;/td&gt;&lt;td&gt;Used for &lt;strong&gt;Infinity&lt;/strong&gt; and &lt;strong&gt;NaN&lt;/strong&gt;&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;
&lt;p&gt;Therefore, &lt;strong&gt;normalised floating-point numbers&lt;/strong&gt; only use exponent values:&lt;/p&gt;
&lt;p&gt;$$
1 \le E_{\text{stored}} \le 2046
$$&lt;/p&gt;
&lt;p&gt;The &lt;strong&gt;true exponent&lt;/strong&gt; is computed by subtracting the bias:&lt;/p&gt;
&lt;p&gt;$$
E_{\text{true}} = E_{\text{stored}} - 1023
$$&lt;/p&gt;
&lt;p&gt;This gives the usable exponent range:&lt;/p&gt;
&lt;p&gt;$$
-1022 \le E_{\text{true}} \le 1023
$$&lt;/p&gt;
&lt;p&gt;The smallest exponent (&lt;code&gt;-1022&lt;/code&gt;) corresponds to the &lt;strong&gt;smallest normalised
numbers&lt;/strong&gt;, while the largest exponent (&lt;code&gt;1023&lt;/code&gt;) corresponds to the &lt;strong&gt;largest
finite numbers&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Using a simple example, $-6.625$:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Convert the &lt;strong&gt;Magnitude&lt;/strong&gt; to Binary&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;We convert the integer and fractional parts separately.&lt;/p&gt;
&lt;p&gt;Integers can be directly converted into base-2:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Integer($6_{10}$): $110_2$&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;For fractions, a simple algorithm is used.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Multiply the fraction by 2.&lt;/li&gt;
&lt;li&gt;Take the integer part of the result (either 0 or 1) as the next bit, and keep only the fractional part.&lt;/li&gt;
&lt;li&gt;Repeat until the fraction becomes 0, &lt;strong&gt;OR&lt;/strong&gt; we reach a &lt;em&gt;seen-before&lt;/em&gt;
fraction. In the latter case, the bits repeat forever and the number is not exactly representable.&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;Fraction($.625_{10}$):
&lt;ul&gt;
&lt;li&gt;$0.625_{10} * 2_{10} = 1.25_{10}$ (take the 1)&lt;/li&gt;
&lt;li&gt;$0.25_{10} * 2_{10} = 0.5_{10}$ (take the 0)&lt;/li&gt;
&lt;li&gt;$0.5_{10} * 2_{10} = 1.0_{10}$ (take the 1)&lt;/li&gt;
&lt;li&gt;Result: $.101_{2}$&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Combining both parts, we get: $110.101_{2}$ (of course, the binary point itself is
not stored in the bit pattern. See the next step!)&lt;/p&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;strong&gt;Normalise&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;&lt;em&gt;Shift&lt;/em&gt; the binary point left until there is only &lt;strong&gt;one non-zero&lt;/strong&gt; digit to
the left of it.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;$110.101_{2}$ -&gt; $1.10101_{2} * 2^{2}_{10}$&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Hence, the &lt;strong&gt;true&lt;/strong&gt; exponent is $2_{10}$.&lt;/p&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;Calculate each component&lt;/li&gt;
&lt;/ol&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Sign (S)&lt;/strong&gt;: The number $-6.625_{10}$ is negative, so $S=1$.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Biased Exponent (E)&lt;/strong&gt;: Add the bias (recall: 1023 for double-precision floats)
&lt;ul&gt;
&lt;li&gt;$2_{10} + 1023_{10} = 1025_{10}$&lt;/li&gt;
&lt;li&gt;$1025_{10} = 100\ 0000\ 0001_{2}$&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Mantissa (M)&lt;/strong&gt;: Take the bits &lt;strong&gt;after&lt;/strong&gt; the binary point from the
normalised number, and &lt;strong&gt;pad them with trailing zeros to 52 bits&lt;/strong&gt;.
&lt;ul&gt;
&lt;li&gt;$.10101_{2} = 1010\ 1000\ 0000_{2}\ ...$ (not gonna show this...)&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Assembling the components $S|E|M$, we get:&lt;/p&gt;
&lt;p&gt;$1 | 100\ 0000\ 0001 | 1010\ 1000\ 0000...$&lt;/p&gt;
&lt;p&gt;Reconstructing the real number from its IEEE 754 representation follows the same
steps in reverse.&lt;/p&gt;
&lt;p&gt;Left as a (trivial?) exercise for the reader.&lt;/p&gt;
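&lt;p&gt;For readers who would rather check the fields than decode them by hand, here is a minimal sketch (not part of the original code) that pulls the three components out of &lt;code&gt;-6.625&lt;/code&gt;. It assumes C++20 for &lt;code&gt;std::bit_cast&lt;/code&gt;, and the variable names are purely illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;#include &amp;#x3C;bit&gt;
#include &amp;#x3C;cstdint&gt;
#include &amp;#x3C;cstdio&gt;

int main() {
    // Reinterpret the 64 bits of the double as an unsigned integer (C++20).
    const auto bits = std::bit_cast&amp;#x3C;std::uint64_t&gt;(-6.625);

    const std::uint64_t sign = bits &gt;&gt; 63;                               // top bit
    const std::uint64_t exponent = (bits &gt;&gt; 52) &amp;#x26; 0x7FF;                 // next 11 bits (biased)
    const std::uint64_t mantissa = bits &amp;#x26; ((1ULL &amp;#x3C;&amp;#x3C; 52) - 1);   // low 52 bits

    // Prints: sign=1 exponent=1025 mantissa=0xa800000000000
    std::printf(&quot;sign=%llu exponent=%llu mantissa=0x%llx\n&quot;,
                static_cast&amp;#x3C;unsigned long long&gt;(sign),
                static_cast&amp;#x3C;unsigned long long&gt;(exponent),
                static_cast&amp;#x3C;unsigned long long&gt;(mantissa));
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The biased exponent prints as 1025, and the leading hex digits &lt;code&gt;a8&lt;/code&gt; of the mantissa are the bits &lt;code&gt;1010 1000&lt;/code&gt;, matching the hand-derived fields.&lt;/p&gt;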
&lt;h3&gt;Important Concepts&lt;/h3&gt;
&lt;p&gt;In this section, I will cover the various concepts that I deem necessary to understand
the problem explained at the start.&lt;/p&gt;
&lt;p&gt;This assumes knowledge of how floats are represented in binary as explained above.&lt;/p&gt;
&lt;p&gt;I will occasionally refer back to this snippet to explain:&lt;/p&gt;
&lt;h4&gt;Snippet&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;#include &amp;#x3C;cmath&gt;
#include &amp;#x3C;iostream&gt;
#include &amp;#x3C;format&gt;
#include &amp;#x3C;limits&gt;

static_assert(std::numeric_limits&amp;#x3C;double&gt;::digits == 52 + 1);
constexpr double MAX_REPR_INTEGER = (1ULL &amp;#x3C;&amp;#x3C; std::numeric_limits&amp;#x3C;double&gt;::digits);
static_assert(MAX_REPR_INTEGER == 9&apos;007&apos;199&apos;254&apos;740&apos;992);


int main () {
    auto prev = MAX_REPR_INTEGER - 1.0;
    auto fromPrev = std::nextafter(prev, std::numeric_limits&amp;#x3C;double&gt;::max());
    auto next = std::nextafter(MAX_REPR_INTEGER, std::numeric_limits&amp;#x3C;double&gt;::max());

    auto roundMode = __builtin_flt_rounds();

    // default to 1, &quot;round to nearest, ties to even&quot;
    std::cout &amp;#x3C;&amp;#x3C; std::format(&quot;Clang Rounding Mode: {}\n&quot;, roundMode);

    // Difference should only be 1 =&gt; step size of 1.
    std::cout &amp;#x3C;&amp;#x3C; std::format(&quot;Previous: {}, Difference: {}\n&quot;, prev, MAX_REPR_INTEGER-prev);

    // Step Size becomes 2, intermediate numbers cannot be represented.
    std::cout &amp;#x3C;&amp;#x3C; std::format(&quot;Next: {}, Difference: {}\n&quot;, next, next-MAX_REPR_INTEGER);

    // Exact tie between 2^53 and 2^53 + 2: ties-to-even rounds down to MAX_REPR_INTEGER.
    std::cout &amp;#x3C;&amp;#x3C; std::format(&quot;Add 1.0: {}\n&quot;, MAX_REPR_INTEGER+1.0);

    // Rounded up to the next representable number: MAX_REPR_INTEGER+2
    std::cout &amp;#x3C;&amp;#x3C; std::format(&quot;Add 1.1: {}\n&quot;, MAX_REPR_INTEGER+1.1);
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Refer to this &lt;a href=&quot;https://godbolt.org/z/rv1nd6K3E&quot;&gt;Compiler Explorer&lt;/a&gt; to verify the
output, or take my &lt;del&gt;word&lt;/del&gt; comments for it!&lt;/p&gt;
&lt;h4&gt;&lt;code&gt;std::numeric_limits&amp;#x3C;double&gt;::digits10/max_digits10&lt;/code&gt;&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;static_assert(std::numeric_limits&amp;#x3C;double&gt;::digits10 == 15);
static_assert(std::numeric_limits&amp;#x3C;double&gt;::max_digits10 == 17);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;I believe there&apos;s some confusion as to what these constants represent.&lt;/p&gt;
&lt;p&gt;While both are similarly named, they describe &lt;em&gt;opposite&lt;/em&gt; directions of conversion.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;digits10&lt;/code&gt;: This is the number of significant base-10 digits that can be represented in a
&lt;code&gt;double&lt;/code&gt; without change due to &lt;strong&gt;rounding&lt;/strong&gt; or &lt;strong&gt;overflow&lt;/strong&gt;.
This means that &lt;strong&gt;every&lt;/strong&gt; distinct 15-digit base-10 number maps to a &lt;strong&gt;distinct&lt;/strong&gt;
&lt;code&gt;double&lt;/code&gt; (the mapping is injective, not bijective).&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Simply put, any 15-digit number can survive the text-&gt;&lt;code&gt;double&lt;/code&gt;-&gt;text round trip.&lt;/li&gt;
&lt;li&gt;Mathematically, this is $\lfloor (53 - 1) \cdot \log_{10}(2) \rfloor = 15$&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;code&gt;max_digits10&lt;/code&gt;: This is the number of base-10 digits &lt;strong&gt;required&lt;/strong&gt; to uniquely
identify every possible value of a &lt;code&gt;double&lt;/code&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;17 digits are needed to ensure the number can survive the &lt;code&gt;double&lt;/code&gt;-&gt;text-&gt;&lt;code&gt;double&lt;/code&gt; round trip.&lt;/li&gt;
&lt;li&gt;Mathematically, this is $\lceil 53 \cdot \log_{10}(2) + 1 \rceil = 17$&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
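&lt;p&gt;As a quick sanity check of the two constants, the arithmetic can be written out using the formulas cppreference gives for a radix-2 type with $p = 53$ digits. The symbol $p$ and the approximation $\log_{10} 2 \approx 0.30103$ are introduced here only for this check:&lt;/p&gt;
&lt;p&gt;$$
\lfloor (p - 1) \log_{10} 2 \rfloor = \lfloor 52 \times 0.30103 \rfloor = \lfloor 15.65 \rfloor = 15
$$&lt;/p&gt;
&lt;p&gt;$$
\lceil p \log_{10} 2 + 1 \rceil = \lceil 53 \times 0.30103 + 1 \rceil = \lceil 16.95 \rceil = 17
$$&lt;/p&gt;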
&lt;p&gt;These digits are &lt;strong&gt;not&lt;/strong&gt; arbitrary magic numbers. They are the consequences of the limits of
precision discussed in the following sections.&lt;/p&gt;
&lt;p&gt;(cppreference is good 👍)&lt;/p&gt;
&lt;h4&gt;Precision&lt;/h4&gt;
&lt;p&gt;In this part, we establish that a &lt;code&gt;double&lt;/code&gt; has exactly &lt;strong&gt;53&lt;/strong&gt; bits of precision:
52 bits from the Mantissa plus the 1 &lt;em&gt;implicit leading bit&lt;/em&gt; from
normalisation.&lt;/p&gt;
&lt;p&gt;Every integer up to $2^{53} = 9,007,199,254,740,992$ (16 digits) is exactly representable. Beyond this
value, we &lt;strong&gt;stop&lt;/strong&gt; being able to represent every integer &lt;em&gt;uniquely&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;This means that we start to have a &lt;strong&gt;gap&lt;/strong&gt; larger than 1 between &lt;strong&gt;&lt;em&gt;consecutive, representable&lt;/em&gt;&lt;/strong&gt;
numbers.&lt;/p&gt;
&lt;h4&gt;Unit in the Last Place (ULP)&lt;/h4&gt;
&lt;p&gt;To quantify the &quot;gap&quot; between representable numbers, we must introduce the
concept of &lt;strong&gt;Unit in the Last Place (ULP)&lt;/strong&gt;.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;ULP&lt;/strong&gt; is the distance between a given floating-point number and the &lt;em&gt;very next&lt;/em&gt;
representable floating-point number.
In my snippet &lt;a href=&quot;#snippet&quot;&gt;above&lt;/a&gt;, I use step-size colloquially.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;The gap grows&lt;/strong&gt;: Because the exponent scales the mantissa, the physical
distance (ULP) between consecutive representable floats gets larger as the
magnitude of the number increases, &lt;strong&gt;doubling&lt;/strong&gt; every time the magnitude crosses a power of two.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Hence, if a &lt;strong&gt;mathematical&lt;/strong&gt; result falls in the &quot;gap&quot; between two representable
floats, it must be &lt;strong&gt;rounded&lt;/strong&gt;.&lt;/p&gt;
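&lt;p&gt;The size of the gap can be written down directly. As a sketch, for a normal &lt;code&gt;double&lt;/code&gt; $x$ whose true exponent is $e$ (both symbols are introduced here purely for illustration):&lt;/p&gt;
&lt;p&gt;$$
|x| \in [2^{e}, 2^{e+1}) \implies \text{ULP}(x) = 2^{e - 52}
$$&lt;/p&gt;
&lt;p&gt;For example, $e = 52$ gives a gap of $2^{0} = 1$, while $e = 53$ gives a gap of $2^{1} = 2$.&lt;/p&gt;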
&lt;p&gt;&lt;code&gt;std::nextafter&lt;/code&gt; (and other related standard library functions) demonstrate this effect.&lt;/p&gt;
&lt;p&gt;In the snippet &lt;a href=&quot;#snippet&quot;&gt;above&lt;/a&gt;, we see that adding 1.0 to $9,007,199,254,740,99\textbf{2}$ leaves it
unchanged, while adding 1.1 results in $9,007,199,254,740,99\textbf{4}$. The expected $9,007,199,254,740,99\textbf{3}$ is never produced.&lt;/p&gt;
&lt;p&gt;Note that this is &lt;strong&gt;NOT&lt;/strong&gt; considered buggy. It is &lt;em&gt;well-defined&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;See the following section on &lt;strong&gt;Rounding&lt;/strong&gt;.&lt;/p&gt;
&lt;h4&gt;Rounding&lt;/h4&gt;
&lt;p&gt;Because the set of &lt;strong&gt;representable&lt;/strong&gt; floating-point numbers is &lt;strong&gt;finite&lt;/strong&gt; and
&lt;strong&gt;unevenly spaced&lt;/strong&gt;, most arithmetic operations (&lt;code&gt;+&lt;/code&gt;,&lt;code&gt;-&lt;/code&gt;,&lt;code&gt;*&lt;/code&gt;,&lt;code&gt;/&lt;/code&gt;) yield
a mathematical result that &lt;strong&gt;cannot&lt;/strong&gt; be stored &lt;em&gt;exactly&lt;/em&gt;.&lt;/p&gt;
&lt;p&gt;The IEEE 754 standard dictates how these infinite-precision results are
mapped to our 53-bit constraints.&lt;/p&gt;
&lt;p&gt;The default mode, which clang-compiled code runs under unless it is changed, is &lt;strong&gt;Round to nearest, ties to even&lt;/strong&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;If a value falls &lt;em&gt;exactly&lt;/em&gt; in the middle of two representable floats, it
&lt;strong&gt;tie-breaks&lt;/strong&gt; by rounding to the float whose least significant bit is
0 (otherwise known as the &quot;even&quot; one).&lt;/li&gt;
&lt;li&gt;For all other values, it rounds to the &lt;strong&gt;nearest&lt;/strong&gt; representable float.&lt;/li&gt;
&lt;/ul&gt;
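&lt;p&gt;A small worked example of the tie-break, using integers just above $2^{53}$, where only even integers are representable (the two cases are chosen purely for illustration):&lt;/p&gt;
&lt;p&gt;$$
2^{53} + 1 \to 2^{53} \quad \text{(tie between } 2^{53} \text{ and } 2^{53} + 2 \text{, and } 2^{53} \text{ has the even mantissa)}
$$&lt;/p&gt;
&lt;p&gt;$$
2^{53} + 3 \to 2^{53} + 4 \quad \text{(tie between } 2^{53} + 2 \text{ and } 2^{53} + 4 \text{, and } 2^{53} + 4 \text{ has the even mantissa)}
$$&lt;/p&gt;
&lt;p&gt;Ties therefore alternate between rounding down and rounding up, which avoids a systematic bias in either direction.&lt;/p&gt;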
&lt;p&gt;See the above snippet and the &lt;a href=&quot;https://clang.llvm.org/docs/LanguageExtensions.html#builtin-flt-rounds-and-builtin-set-flt-rounds&quot;&gt;clang documentation&lt;/a&gt;.&lt;/p&gt;
&lt;h4&gt;Tying Concepts Together&lt;/h4&gt;
&lt;p&gt;Why does the ULP become &gt;1 (2 to be exact) after $2^{53} = 9,007,199,254,740,992$?&lt;/p&gt;
&lt;p&gt;In binary, $2^{53}$ is represented as a 1 bit, followed by &lt;em&gt;53&lt;/em&gt; zeros.&lt;/p&gt;
&lt;p&gt;Recall that the Mantissa is only 52-bits long with an implicit leading 1 bit.&lt;/p&gt;
&lt;p&gt;Focusing on the Mantissa component only, this is how $2^{53}$ is represented:&lt;/p&gt;
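&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Implicit Leading 1-bit&lt;/th&gt;&lt;th&gt;52-bit Mantissa&lt;/th&gt;&lt;th&gt;Extra bit(s)&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;1&lt;/td&gt;&lt;td&gt;0000 ... 0000&lt;/td&gt;&lt;td&gt;0&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;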
&lt;p&gt;The very next &lt;strong&gt;representable&lt;/strong&gt; double is obtained by toggling the &lt;strong&gt;least
significant bit&lt;/strong&gt; in the Mantissa.&lt;/p&gt;
&lt;p&gt;This bit does &lt;strong&gt;NOT&lt;/strong&gt; represent $2^0 = 1$, but $2^1 = 2$: with a true exponent of 53, the
52 mantissa bits only reach down to $2^{53-52} = 2^1$, and anything below that falls into the
trailing extra bits that are not stored.&lt;/p&gt;
&lt;p&gt;Hence, the next representable double is $2^{53} + 2$.&lt;/p&gt;
&lt;p&gt;The pattern in which the ULP increases is left as an exercise :).&lt;/p&gt;
&lt;h2&gt;What Breaks?&lt;/h2&gt;
&lt;p&gt;In unoptimised code, every single operation (e.g., $price * SCALE$) is
executed, &lt;strong&gt;rounded&lt;/strong&gt; to a 64-bit &lt;code&gt;double&lt;/code&gt;, and stored in the appropriate register.&lt;/p&gt;
&lt;p&gt;Then the next operation ($+ adjustment$) is calculated, then &lt;strong&gt;rounded&lt;/strong&gt; again.&lt;/p&gt;
&lt;p&gt;However, when compiler optimisations are turned on (&lt;code&gt;-O1&lt;/code&gt;, &lt;code&gt;-O2&lt;/code&gt;, &lt;code&gt;-O3&lt;/code&gt;), the
compiler is allowed to use &lt;strong&gt;different&lt;/strong&gt; CPU instructions to speed up math.&lt;/p&gt;
&lt;p&gt;The culprit for divergence is the &lt;strong&gt;Fused Multiply-Add (FMA)&lt;/strong&gt; instruction.&lt;/p&gt;
&lt;p&gt;FMA computes $(A * B) + C$ in a &lt;strong&gt;single discrete&lt;/strong&gt; step.&lt;/p&gt;
&lt;p&gt;Crucially, FMA calculates the exact infinite-precision result of $(A * B) + C$
and only applies &lt;strong&gt;one single rounding&lt;/strong&gt; step at the very end.&lt;/p&gt;
&lt;p&gt;Because unoptimised code (or any code executed at runtime without FMA)
&lt;strong&gt;rounds twice&lt;/strong&gt;, while optimised code using FMA &lt;strong&gt;rounds only once&lt;/strong&gt;,
the last bits of the mantissa can easily diverge by 1 ULP.&lt;/p&gt;
&lt;p&gt;Neither value is &quot;wrong&quot; per se.&lt;/p&gt;
&lt;p&gt;The optimised FMA result is &lt;em&gt;mathematically&lt;/em&gt; closer to the true real number,
but it also breaks exact equality assertions like &lt;code&gt;EXPECT_EQ&lt;/code&gt; in our test suites.&lt;/p&gt;
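&lt;p&gt;The difference between rounding twice and rounding once can be reproduced in isolation. The following is a minimal sketch, not the production code: it borrows the input value from the harness later in this post, uses a &lt;code&gt;volatile&lt;/code&gt; intermediate (an assumption made here) to stop the compiler from fusing the two steps, and calls &lt;code&gt;std::fma&lt;/code&gt; for the single-rounding path. The variable names are illustrative:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;#include &amp;#x3C;cmath&gt;
#include &amp;#x3C;cstdio&gt;

int main() {
    const double price = 100000.0078125;  // exactly representable: 100000 + 2^-7
    const double scale = 1e13;            // exactly representable

    // Two roundings: the volatile store forces the product to be rounded to a
    // double before the addition, so the two steps cannot be fused.
    volatile double product = price * scale;
    const double twoRoundings = product + 0.5;

    // One rounding: std::fma rounds only the final value of price * scale + 0.5.
    const double oneRounding = std::fma(price, scale, 0.5);

    // Prints: 1000000078124999936 vs 1000000078125000064
    std::printf(&quot;%lld vs %lld\n&quot;,
                static_cast&amp;#x3C;long long&gt;(twoRoundings),
                static_cast&amp;#x3C;long long&gt;(oneRounding));
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The two results differ by exactly 128, which is one ULP at this magnitude.&lt;/p&gt;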
&lt;h2&gt;Example (Application of Theory)&lt;/h2&gt;
&lt;p&gt;To prove that FMA is the culprit for the divergence, we can look at the assembly
generated by the compiler under two different targeting modes.&lt;/p&gt;
&lt;p&gt;For this section:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;CXX_COMPILER=clang 18.1.0&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;&lt;code&gt;CXX_FLAGS=-std=c++23 -stdlib=libstdc++ -O3 -ffp-contract=on&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We then examine how the emitted assembly differs when &lt;code&gt;-march=native&lt;/code&gt; is
additionally specified.&lt;/p&gt;
&lt;h3&gt;Minimal Reproducible Harness&lt;/h3&gt;
&lt;p&gt;Likewise, see: &lt;a href=&quot;https://godbolt.org/z/ccoPbhfed&quot;&gt;Compiler Explorer&lt;/a&gt;&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;#include &amp;#x3C;cstdint&gt;
#include &amp;#x3C;cstddef&gt;
#include &amp;#x3C;limits&gt;
#include &amp;#x3C;format&gt;
#include &amp;#x3C;iostream&gt;

constexpr std::size_t NUM_BITS_VALUE = 63;
constexpr std::uint64_t SCALE = 10&apos;000&apos;000&apos;000&apos;000ULL;
constexpr std::uint64_t OFFSET = 1ULL &amp;#x3C;&amp;#x3C; (NUM_BITS_VALUE - 1);
constexpr std::uint64_t SCALE_FLAG = 1ULL &amp;#x3C;&amp;#x3C; NUM_BITS_VALUE;

constexpr std::uint64_t doubleToInt(double price) {
    double adjustment = (price &gt; 0) ? 0.5 : -0.5;
    return (static_cast&amp;#x3C;std::int64_t&gt;(price * SCALE + adjustment) + OFFSET) | SCALE_FLAG;
}

// only 13 sig figs, exact binary representation
constexpr double literal = 100&apos;000.007&apos;8125;

// Coerce compile-time evaluation
constexpr std::uint64_t getCompileTime() {
    return doubleToInt(literal);
}


// Force runtime evaluation
volatile double runtimeDouble = literal;
[[gnu::noinline]] constexpr std::uint64_t getRunTime(double val) {
    return doubleToInt(val);
}

int main() {
    // Evaluated by compiler (FMA)
    auto compileTimeResult = getCompileTime();

    // Evaluated by CPU (Separate mulsd and addsd)
    auto runtimeResult = getRunTime(runtimeDouble);

    std::cout &amp;#x3C;&amp;#x3C; std::format(&quot;{}\n&quot;, compileTimeResult);
    std::cout &amp;#x3C;&amp;#x3C; std::format(&quot;{}\n&quot;, runtimeResult);
    return 0;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Outputs&lt;/h4&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# No -march=native specified, i.e., -march=x86-64
14835058133407163776 # compile-time result
14835058133407163648 # runtime result
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;# -march=native specified
14835058133407163776 # compile-time result
14835058133407163776 # runtime result
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;We can see that by specifying &lt;code&gt;-march=native&lt;/code&gt;, the runtime result
becomes the &lt;strong&gt;same&lt;/strong&gt; as the compile-time one.&lt;/p&gt;
&lt;p&gt;As such, I (reasonably) concluded that, when &lt;code&gt;-march=native&lt;/code&gt; is specified,
the runtime instruction stream rounds the same way the compiler does when it folds the constants,
both here and in the previously failing test cases.&lt;/p&gt;
&lt;h3&gt;Instruction-by-Instruction Walkthrough&lt;/h3&gt;
&lt;p&gt;Here, we look at the two different instruction streams. One without &lt;code&gt;-march=native&lt;/code&gt;,
which uses Standard SSE2 (Streaming SIMD Extensions) Instructions, and the one
with &lt;code&gt;-march=native&lt;/code&gt; (AVX/FMA Instructions).&lt;/p&gt;
&lt;p&gt;We also use the variable &lt;code&gt;double price = 100&apos;000.007&apos;8125;&lt;/code&gt; here for our calculations,
passing it into &lt;code&gt;doubleToInt(price)&lt;/code&gt;.&lt;/p&gt;
&lt;h4&gt;Without &lt;code&gt;-march=native&lt;/code&gt;&lt;/h4&gt;
&lt;p&gt;This is the assembly emitted (annotated by Gemini 3.1 Pro).&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-asm&quot;&gt;getRunTime(double):
; --- Step 1: Determine the adjustment (price &gt; 0 ? 0.5 : -0.5) ---
xorpd xmm1, xmm1                      ; Zero out the xmm1 register (xmm1 = 0.0)
xor eax, eax                          ; Zero out the eax register (eax = 0)
ucomisd xmm0, xmm1                    ; Compare the input price (xmm0) with 0.0 (xmm1)
seta al                               ; Set the lowest byte (al) to 1 if price &gt; 0, else 0

; --- Step 2: The Math (Separate Multiply and Add) ---
mulsd xmm0, qword ptr [rip + .LCPI1_0]; MULTIPLY: xmm0 = price * 10,000,000,000,000.
                                      ; **[ROUNDING EVENT #1]** The 53-bit mantissa is rounded here.
lea rcx, [rip + .LCPI1_1]             ; Load the base memory address of our adjustment constants [-0.5, 0.5]
addsd xmm0, qword ptr [rcx + 8*rax]   ; ADD: xmm0 = xmm0 + adjustment (using rax as an index).
                                      ; **[ROUNDING EVENT #2]** The 53-bit mantissa is rounded AGAIN.

; --- Step 3: Typecast and Bitwise Operations ---
cvttsd2si rax, xmm0                   ; Cast to integer: Convert scalar double to 64-bit integer with truncation.
movabs rcx, 9223372036854775807       ; Load 0x7FFFFFFFFFFFFFFF (63 bits of 1s) into rcx
and rcx, rax                          ; Mask the integer to exactly 63 bits
movabs rax, -4611686018427387904      ; Load 0xC000000000000000 (OFFSET | SCALE_FLAG)
xor rax, rcx                          ; Apply the offset and flag via XOR.
ret                                   ; Return result in rax
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Assembly makes one dizzy, so let&apos;s focus on when rounding happens.&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;mulsd xmm0 ...&lt;/code&gt;: This corresponds to &lt;code&gt;price * 10&apos;000&apos;000&apos;000&apos;000&lt;/code&gt;.
In our failed tests, the products are larger than $2^{53}$.
Recall that a &lt;code&gt;double&lt;/code&gt; has exactly 53 bits of precision.
A product that needs more bits than that must be rounded.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Evaluating the exact mathematical product:&lt;/p&gt;
&lt;p&gt;$$
100,000.0078125 * 10,000,000,000,000 = 1,000,000,078,125,000,000
$$&lt;/p&gt;
&lt;p&gt;This is in the $10^{18}$ range. At this magnitude, the ULP is exactly
&lt;strong&gt;128&lt;/strong&gt; (again, left as an exercise for the reader, hint: $2^{59}$).&lt;/p&gt;
&lt;p&gt;Because the ULP is 128, every representable float in this neighbourhood must be
a multiple of 128.&lt;/p&gt;
&lt;p&gt;If we divide our mathematical result by the ULP:&lt;/p&gt;
&lt;p&gt;$$
1,000,000,078,125,000,000 / 128 = 7,812,500,610,351,562.5
$$&lt;/p&gt;
&lt;p&gt;The &lt;code&gt;.5&lt;/code&gt; indicates that our exact value falls &lt;strong&gt;perfectly in the middle&lt;/strong&gt; of
two representable doubles:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Lower ($7,812,500,610,351,562 * 128$): 1,000,000,078,124,999,936&lt;/li&gt;
&lt;li&gt;Upper ($7,812,500,610,351,563 * 128$): 1,000,000,078,125,000,064&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Recall clang&apos;s default rounding mode: &lt;strong&gt;Round to nearest, ties to even&lt;/strong&gt;.
Since we are in the middle, perfectly tied, the CPU rounds to the even, lower double.&lt;/p&gt;
&lt;p&gt;Output: &lt;strong&gt;1,000,000,078,124,999,936&lt;/strong&gt; (&lt;code&gt;double&lt;/code&gt;)&lt;/p&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;&lt;code&gt;addsd xmm0 ...&lt;/code&gt;: Here, we add our adjustment (&lt;code&gt;0.5&lt;/code&gt;). This is simple.
The mathematical result becomes 1,000,000,078,124,999,936.5, but it simply
gets rounded to the nearest double, resulting in &lt;em&gt;no changes&lt;/em&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Output: &lt;strong&gt;1,000,000,078,124,999,936&lt;/strong&gt; (subsequently cast to &lt;code&gt;int64_t&lt;/code&gt;)&lt;/p&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;Bitwise Operations:&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Finally, we apply our masks.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;code&gt;OFFSET=1ULL&amp;#x3C;&amp;#x3C;62&lt;/code&gt; ($2^{62} = 4,611,686,018,427,387,904$)&lt;/li&gt;
&lt;li&gt;&lt;code&gt;SCALE_FLAG=1ULL&amp;#x3C;&amp;#x3C;63&lt;/code&gt; ($2^{63} = 9,223,372,036,854,775,808$)&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Since the integer is non-negative and smaller than $2^{62}$, adding &lt;code&gt;OFFSET&lt;/code&gt; and
OR-ing in &lt;code&gt;SCALE_FLAG&lt;/code&gt; is equivalent to XOR-ing both into the result.
Hence, we get: &lt;code&gt;1000000078124999936 ^ OFFSET ^ SCALE_FLAG&lt;/code&gt;.
Plugging this into &lt;a href=&quot;https://godbolt.org/z/6bMGEdE4E&quot;&gt;Compiler Explorer&lt;/a&gt;,
we get &lt;strong&gt;14835058133407163648&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;Familiar number!&lt;/p&gt;
&lt;p&gt;Scroll back up to the Harness&apos;s &lt;a href=&quot;#outputs&quot;&gt;output&lt;/a&gt; to see for yourself!&lt;/p&gt;
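&lt;p&gt;The arithmetic behind that number can also be checked by hand. Because the rounded integer is below $2^{62}$, the XOR simply sets bits 62 and 63, which is the same as adding both constants:&lt;/p&gt;
&lt;p&gt;$$
2^{62} + 2^{63} = 13,835,058,055,282,163,712
$$&lt;/p&gt;
&lt;p&gt;$$
1,000,000,078,124,999,936 + 13,835,058,055,282,163,712 = 14,835,058,133,407,163,648
$$&lt;/p&gt;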
&lt;h4&gt;With &lt;code&gt;-march=native&lt;/code&gt;&lt;/h4&gt;
&lt;p&gt;Now, let&apos;s look at the instruction stream with FMA (also annotated by Gemini 3.1 Pro)!&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-asm&quot;&gt;getRunTime(double):
; --- Step 1: Determine the adjustment (Branchless Vector Operations) ---
vxorpd xmm1, xmm1, xmm1                               ; Zero out xmm1 (xmm1 = 0.0)
vcmpltsd xmm1, xmm1, xmm0                             ; Bitmask: 1s if 0.0 &amp;#x3C; price, else 0s
vmovddup xmm2, qword ptr [rip + .LCPI1_3]             ; Load 0.5 into xmm2
vblendvpd xmm1, xmm2, xmmword ptr [rip + .LCPI1_1], xmm1 ; Select 0.5 or -0.5 based on mask. Result in xmm1.

; --- Step 2: The Math (FMA) ---
vfmadd231sd xmm1, xmm0, qword ptr [rip + .LCPI1_2]    ; FUSED MULTIPLY ADD: xmm1 = (xmm0 * 10^13) + xmm1
                                                      ; Computes to infinite precision internally.
                                                      ; **[SINGLE ROUNDING EVENT]** Mantissa rounded ONCE.

; --- Step 3: Typecast and Bitwise Operations ---
vcvttsd2si rax, xmm1                                  ; Cast to integer with truncation.
mov cl, 63                                            ; Set cl register to 63
bzhi rcx, rax, rcx                                    ; BMI2 BZHI: zero the bits from index 63 upwards
movabs rax, -4611686018427387904                      ; Load (OFFSET | SCALE_FLAG)
xor rax, rcx                                          ; Apply the offset and flag via XOR.
ret
&lt;/code&gt;&lt;/pre&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;vfmadd231sd&lt;/code&gt;
Here, the CPU computes the exact same mathematical formula, but &lt;strong&gt;skips the
first&lt;/strong&gt; rounding event.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;FMA, as its name suggests, fuses the multiplication and addition into a
&lt;strong&gt;single internal step&lt;/strong&gt; with infinite precision:&lt;/p&gt;
&lt;p&gt;$$
(100,000.0078125 * 10,000,000,000,000) + 0.5 = 1,000,000,078,125,000,000.5
$$&lt;/p&gt;
&lt;p&gt;Now, we must round to the nearest representable &lt;code&gt;double&lt;/code&gt; (again, multiples of 128).
Calculating the distance to its neighbours:&lt;/p&gt;
&lt;p&gt;$$
\Delta Lower = 1,000,000,078,125,000,000.5 - 1,000,000,078,124,999,936 = 64.5
$$&lt;/p&gt;
&lt;p&gt;$$
\Delta Upper = 1,000,000,078,125,000,064 - 1,000,000,078,125,000,000.5 = 63.5
$$&lt;/p&gt;
&lt;p&gt;Because we added &lt;code&gt;0.5&lt;/code&gt; before rounding, the value is no longer a perfect tie.
It is strictly closer to the &lt;strong&gt;Upper&lt;/strong&gt; representable &lt;code&gt;double&lt;/code&gt;!&lt;/p&gt;
&lt;p&gt;Output: &lt;strong&gt;1,000,000,078,125,000,064&lt;/strong&gt;&lt;/p&gt;
&lt;p&gt;Doing the same final bitwise operations: &lt;code&gt;1,000,000,078,125,000,064 ^ OFFSET ^ SCALE_FLAG&lt;/code&gt;,
and plugging this into &lt;a href=&quot;https://godbolt.org/z/xYWEWde1d&quot;&gt;Compiler Explorer&lt;/a&gt;,
we get: &lt;strong&gt;14835058133407163776&lt;/strong&gt;!
Another familiar number!&lt;/p&gt;
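&lt;p&gt;As a sanity check on the two constants used in this walkthrough (the spacing of 128 and the final XOR), here is the arithmetic spelled out.
It assumes the usual 52 explicit mantissa bits of a &lt;code&gt;double&lt;/code&gt;, and that &lt;code&gt;OFFSET | SCALE_FLAG&lt;/code&gt; is the &lt;code&gt;movabs&lt;/code&gt; constant from the listing, i.e. bits 62 and 63 set:&lt;/p&gt;
&lt;p&gt;$$
2^{59} \le 1,000,000,078,125,000,000 \lt 2^{60} \implies \text{spacing} = 2^{59 - 52} = 2^{7} = 128
$$&lt;/p&gt;
&lt;p&gt;$$
x = 1,000,000,078,125,000,064 \lt 2^{62} \implies x \oplus (2^{63} + 2^{62}) = x + 13,835,058,055,282,163,712 = 14,835,058,133,407,163,776
$$&lt;/p&gt;
&lt;p&gt;Since bits 62 and 63 of the truncated value are both zero, the XOR behaves exactly like an addition here.&lt;/p&gt;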
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;ul&gt;
&lt;li&gt;Floats are scary.&lt;/li&gt;
&lt;li&gt;Add &lt;code&gt;volatile&lt;/code&gt; in tests :)&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Real Conclusion&lt;/h2&gt;
&lt;p&gt;Use &lt;code&gt;std::fma&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;Or use &lt;code&gt;std::llround&lt;/code&gt;.&lt;/p&gt;
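&lt;p&gt;A minimal sketch of what those two suggestions could look like.
The function names and the ±0.5 adjustment are assumptions modelled on the assembly above, not the original &lt;code&gt;getRunTime&lt;/code&gt; source:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;#include &amp;#x3C;cmath&gt;
#include &amp;#x3C;cstdint&gt;

// Hypothetical helper: a single rounding, whether or not the compiler emits FMA.
inline std::int64_t scaleWithFma(double price) {
  const double adjustment = price &gt; 0.0 ? 0.5 : -0.5;
  // std::fma computes price * 1e13 + adjustment and rounds exactly once.
  return static_cast&amp;#x3C;std::int64_t&gt;(std::fma(price, 1e13, adjustment));
}

// Hypothetical helper: the product is rounded to a double first, then
// std::llround rounds that double to the nearest integer (halves away from zero).
inline std::int64_t scaleWithLlround(double price) {
  return std::llround(price * 1e13);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Neither version depends on &lt;code&gt;-march&lt;/code&gt;, but they do not agree with each other on the tie case from this post:
for &lt;code&gt;100000.0078125&lt;/code&gt;, the first should give &lt;code&gt;1000000078125000064&lt;/code&gt; and the second &lt;code&gt;1000000078124999936&lt;/code&gt;.
Pick one and use it everywhere.&lt;/p&gt;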
&lt;h2&gt;Disclaimer&lt;/h2&gt;
&lt;p&gt;AI-assisted for some parts, particularly the assembly walkthrough...&lt;/p&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>C++ Directed Acyclic Graphs (DAG) - Version 1</title><link>https://euchangxian.dev/blog/dag-v1</link><guid isPermaLink="true">https://euchangxian.dev/blog/dag-v1</guid><description>Advanced Template Metaprogramming!</description><pubDate>Sun, 15 Feb 2026 00:00:00 GMT</pubDate><content:encoded>
&lt;h1&gt;DAG&lt;/h1&gt;
&lt;h2&gt;Problem Statement&lt;/h2&gt;
&lt;p&gt;I use &lt;strong&gt;Pipeline&lt;/strong&gt;/&lt;strong&gt;Graph&lt;/strong&gt; &lt;strong&gt;interchangeably&lt;/strong&gt; in my blog posts.&lt;/p&gt;
&lt;p&gt;This is because in practice, it&apos;s akin to a Pipeline, but in theory, it&apos;s
like a Directed Acyclic Graph.&lt;/p&gt;
&lt;p&gt;In &lt;a href=&quot;/blog/dag-v0&quot;&gt;Version 0&lt;/a&gt;, we explored how to utilise templates
to compose operations at compile-time, along with the pros and cons of &lt;code&gt;virtual&lt;/code&gt;
vs templates.&lt;/p&gt;
&lt;p&gt;Ultimately, this library is about C++ and templating.
&lt;code&gt;virtual&lt;/code&gt; is akin to inheritance, and its patterns can be found in other languages.
Hence, we focus on the compile-time variant!&lt;/p&gt;
&lt;p&gt;Suppose we want to compose N operations:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;using Pipeline = A&amp;#x3C;B&amp;#x3C;C&amp;#x3C;D&amp;#x3C;E&amp;#x3C;&gt;&gt;&gt;&gt;&gt;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Immediately, two problems are apparent:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Poor Ergonomics&lt;/strong&gt;: The nesting of classes in template arguments is
extremely prone to human error. Adding or removing a node requires careful
matching of the angle brackets!&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Obscured Data Flow&lt;/strong&gt;: In a simple pipeline like this, reading left-to-right
is simple enough. But eventually, when multiple control paths are introduced,
the nested structure obscures the data flow.&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;Hence, what we really want is an &lt;strong&gt;intuitive interface&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;using Pipeline = Graph&amp;#x3C;A&amp;#x3C;&gt;, B&amp;#x3C;&gt;, C&amp;#x3C;&gt;, D&amp;#x3C;&gt;, E&amp;#x3C;&gt;&gt;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This clearly expresses: &quot;data flows through A, then B, then...&quot;&lt;/p&gt;
&lt;p&gt;Using a more complex example:
Suppose we have a &lt;code&gt;Router&lt;/code&gt; with different &lt;code&gt;Routes&lt;/code&gt; (akin to a &lt;code&gt;switch&lt;/code&gt; in C++):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;using HandleFoo = Graph&amp;#x3C;A&amp;#x3C;&gt;, B&amp;#x3C;&gt;&gt;;
using HandleBar = Graph&amp;#x3C;C&amp;#x3C;&gt;, D&amp;#x3C;&gt;&gt;;

using Pipeline = Graph&amp;#x3C;Router&amp;#x3C;
  Route&amp;#x3C;EventFoo, HandleFoo&gt;,
  Route&amp;#x3C;EventBar, HandleBar&gt;
  &gt;,
  E&amp;#x3C;&gt;
&gt;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Compare this with an attempt at the same pipeline using nested templates:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;using HandleFoo = A&amp;#x3C;B&amp;#x3C;&gt;&gt;;
using HandleBar = C&amp;#x3C;D&amp;#x3C;&gt;&gt;;

// TBH I do not even know how to express it with nested templates LOL.
// The Router needs to somehow wrap the Routes, AND chain to E&amp;#x3C;&gt;...
using Pipeline = Router&amp;#x3C;
  Route&amp;#x3C;EventFoo, HandleFoo&gt;,
  Route&amp;#x3C;EventBar, HandleBar&gt;
&gt;; // where do I even put E&amp;#x3C;&gt;?
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Template Rebinding&lt;/h2&gt;
&lt;p&gt;The core challenge is converting the flattened list of lists &lt;code&gt;Graph&amp;#x3C;A&amp;#x3C;&gt;, B&amp;#x3C;&gt;, ...&gt;&lt;/code&gt;
(note the &lt;em&gt;tree-like recursion&lt;/em&gt;!) into the nested &lt;code&gt;A&amp;#x3C;B&amp;#x3C;...&gt;&gt;&lt;/code&gt; at compile-time.&lt;/p&gt;
&lt;p&gt;This is achieved through &lt;strong&gt;template rebinding&lt;/strong&gt;! It&apos;s a fairly common metaprogramming
technique (found in Boost libraries too!) that recursively transforms type hierarchies.&lt;/p&gt;
&lt;h3&gt;Understanding the Node Structure&lt;/h3&gt;
&lt;p&gt;A Node (or Stage) in the Graph/Pipeline is merely a struct with a &lt;code&gt;process&lt;/code&gt;
method.&lt;/p&gt;
&lt;p&gt;First, let&apos;s look at (simplified) nodes:&lt;/p&gt;
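&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;#include &amp;#x3C;utility&gt;  // std::forward

// Placeholder for &quot;the next node&quot;; replaced with the real successor later.
struct Successor {};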

template &amp;#x3C;typename Then = Successor&gt;
struct PassThrough : Then {
  template &amp;#x3C;typename... Args&gt;
  auto process(Args&amp;#x26;&amp;#x26;... args) {
    return Then::process(std::forward&amp;#x3C;Args&gt;(args)...);
  }
};

struct Sink {
  template &amp;#x3C;typename... Args&gt;
  void process(Args&amp;#x26;&amp;#x26;...) {
    return;
  }
};
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Each (internal) node:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Takes a template parameter &lt;code&gt;Then&lt;/code&gt; representing the next node in the chain.&lt;/li&gt;
&lt;li&gt;Inherits from &lt;code&gt;Then&lt;/code&gt; to form a &lt;em&gt;chain of types&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Implements &lt;code&gt;process()&lt;/code&gt; which &lt;em&gt;delegates&lt;/em&gt; to the next node via &lt;code&gt;Then::process()&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Nodes can be classified into two categories:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Internal&lt;/strong&gt; Nodes: &lt;strong&gt;Class templates&lt;/strong&gt; deriving from &lt;code&gt;Successor&lt;/code&gt; (e.g., &lt;code&gt;PassThrough&amp;#x3C;&gt;&lt;/code&gt;).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Terminal&lt;/strong&gt; Nodes: &lt;strong&gt;Plain&lt;/strong&gt; classes that do &lt;strong&gt;NOT&lt;/strong&gt; derive from &lt;code&gt;Successor&lt;/code&gt; (e.g., &lt;code&gt;Sink&lt;/code&gt;).&lt;/li&gt;
&lt;/ul&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;#include &amp;#x3C;type_traits&gt;  // std::is_class_v, std::is_base_of_v

namespace meta {

template &amp;#x3C;typename T&gt;
concept Internal = std::is_class_v&amp;#x3C;T&gt; &amp;#x26;&amp;#x26; std::is_base_of_v&amp;#x3C;Successor, T&gt;;

template &amp;#x3C;typename T&gt;
concept Terminal = std::is_class_v&amp;#x3C;T&gt; &amp;#x26;&amp;#x26; !std::is_base_of_v&amp;#x3C;Successor, T&gt;;

template &amp;#x3C;typename T&gt;
concept NodeLike = Internal&amp;#x3C;T&gt; || Terminal&amp;#x3C;T&gt;;

} // namespace meta
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;You may notice that the &lt;code&gt;NodeLike&lt;/code&gt; concept doesn&apos;t verify the presence of a
&lt;code&gt;process()&lt;/code&gt; method (or at least, now you notice).&lt;/p&gt;
&lt;p&gt;This is because each &lt;code&gt;process()&lt;/code&gt; method can accept different types and numbers of
arguments.&lt;/p&gt;
&lt;p&gt;A specialised checker would need to be created that iterates through the chain
and validates each node.&lt;/p&gt;
&lt;p&gt;This wasn&apos;t so trivial, and I didn&apos;t prioritise it.&lt;/p&gt;
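&lt;p&gt;The per-node check is the easy half.
A minimal sketch, assuming we already know the argument types a node will be called with.
&lt;code&gt;CanProcess&lt;/code&gt;, &lt;code&gt;CanProcessImpl&lt;/code&gt; and &lt;code&gt;IntSink&lt;/code&gt; are hypothetical names, not part of the library, and with C++20 the same check could be written as a &lt;code&gt;requires&lt;/code&gt;-expression:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;#include &amp;#x3C;type_traits&gt;
#include &amp;#x3C;utility&gt;

// Hypothetical detector: the primary template is chosen when the call
// expression in the specialisation below is ill-formed.
template &amp;#x3C;typename Void, typename Node, typename... Args&gt;
struct CanProcessImpl : std::false_type {};

// Chosen when node.process(args...) is well-formed for these argument types.
template &amp;#x3C;typename Node, typename... Args&gt;
struct CanProcessImpl&amp;#x3C;
    std::void_t&amp;#x3C;decltype(std::declval&amp;#x3C;Node&amp;#x26;&gt;().process(std::declval&amp;#x3C;Args&gt;()...))&gt;,
    Node,
    Args...&gt; : std::true_type {};

template &amp;#x3C;typename Node, typename... Args&gt;
inline constexpr bool CanProcess = CanProcessImpl&amp;#x3C;void, Node, Args...&gt;::value;

// Illustrative node for the checks below.
struct IntSink {
  void process(int) {}
};

static_assert(CanProcess&amp;#x3C;IntSink, int&gt;);
static_assert(!CanProcess&amp;#x3C;IntSink, const char*&gt;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The hard half is knowing &lt;code&gt;Args...&lt;/code&gt; for every node, since they depend on what the previous node forwards.
A &lt;code&gt;process()&lt;/code&gt; with a deduced &lt;code&gt;auto&lt;/code&gt; return type (like &lt;code&gt;PassThrough&lt;/code&gt;) also has to be instantiated to answer the question, so a mismatch there becomes a hard error rather than a clean &lt;code&gt;false&lt;/code&gt;.&lt;/p&gt;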
&lt;h3&gt;Rebind&lt;/h3&gt;
&lt;p&gt;The &lt;code&gt;Rebind&lt;/code&gt; template performs a search-and-replace operation on type hierarchies:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;template &amp;#x3C;typename Pattern, typename Replacement, typename Target&gt;
using Rebind = typename detail::RebindImpl&amp;#x3C;Pattern, Replacement, Target&gt;::Type;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;In plain English (:&gt;): &quot;In &lt;code&gt;Target&lt;/code&gt; (a &lt;strong&gt;Class Template&lt;/strong&gt;), find all occurrences
of &lt;code&gt;Pattern&lt;/code&gt; and replace them with &lt;code&gt;Replacement&lt;/code&gt;.&quot;&lt;/p&gt;
&lt;h4&gt;Rebind Implementation&lt;/h4&gt;
&lt;p&gt;The rebinding works through &lt;strong&gt;template specialisation&lt;/strong&gt;:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;// Base case: Target doesn&apos;t match Pattern, return as-is.
template &amp;#x3C;typename Pattern, typename Replacement, typename Target&gt;
struct RebindImpl {
  using Type = Target;
};

// Match found: Replace Pattern with Replacement.
template &amp;#x3C;typename Pattern, typename Replacement&gt;
struct RebindImpl&amp;#x3C;Pattern, Replacement, Pattern&gt; {
  using Type = Replacement;
};

// Recursive case: Target is a class template, recurse into its arguments.
template &amp;#x3C;typename Pattern,
          typename Replacement,
          template &amp;#x3C;typename...&gt; class Target,
          typename... Args&gt;
struct RebindImpl&amp;#x3C;Pattern, Replacement, Target&amp;#x3C;Args...&gt;&gt; {
  using Type = Target&amp;#x3C;typename RebindImpl&amp;#x3C;Pattern, Replacement, Args&gt;::Type...&gt;;
};
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When &lt;code&gt;Target&lt;/code&gt; is a template instantiation like: &lt;code&gt;PassThrough&amp;#x3C;Successor&gt;&lt;/code&gt;, we
decompose it into:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The template itself: &lt;code&gt;PassThrough&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;Its arguments: &lt;code&gt;Successor&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;We then &lt;strong&gt;recursively&lt;/strong&gt; apply Rebind to each &lt;strong&gt;argument&lt;/strong&gt;, rebuilding the type
with transformed arguments.&lt;/p&gt;
&lt;h4&gt;Building the Graph&lt;/h4&gt;
&lt;p&gt;The &lt;code&gt;Graph&lt;/code&gt; implementation then uses &lt;code&gt;Rebind&lt;/code&gt; to chain nodes together:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;template &amp;#x3C;typename... Nodes&gt;
struct Graph : GraphImpl&amp;#x3C;Nodes...&gt;::Type {
  // Implementation inherits from the fully constructed graph
};
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Looking at &lt;code&gt;GraphImpl&lt;/code&gt; which contains the core logic:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;template &amp;#x3C;typename... Nodes&gt;
struct GraphImpl;

// Base case: Graph is a single, terminal node.
template &amp;#x3C;typename Leaf&gt;
struct GraphImpl&amp;#x3C;Leaf&gt; {
  using Type = Leaf;
};

// Recursive case: Process Head, recurse on Tail.
template &amp;#x3C;typename Head, typename... Tail&gt;
struct GraphImpl&amp;#x3C;Head, Tail...&gt; {
  // Magic happens here!
  // Replace Head&apos;s Successor placeholder with the already-built chain for Tail.
  using Type = Rebind&amp;#x3C;Successor, typename GraphImpl&amp;#x3C;Tail...&gt;::Type, Head&gt;;
};
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Step-by-Step Trace&lt;/h4&gt;
&lt;p&gt;Tracing &lt;code&gt;Graph&amp;#x3C;A&amp;#x3C;&gt;, B&amp;#x3C;&gt;, Sink&gt;&lt;/code&gt; (showing default template
arguments explicitly as &lt;code&gt;A&amp;#x3C;Successor&gt;&lt;/code&gt;, &lt;code&gt;B&amp;#x3C;Successor&gt;&lt;/code&gt;):&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Process Head &lt;code&gt;A&amp;#x3C;Successor&gt;&lt;/code&gt; with Tail &lt;code&gt;B&amp;#x3C;Successor&gt;, Sink&lt;/code&gt;.&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;// Pattern = Successor
// Replacement = GraphImpl&amp;#x3C;B&amp;#x3C;Successor&gt;, Sink&gt;::Type (i.e., rest of the graph)
// Target = A&amp;#x3C;Successor&gt;
GraphImpl&amp;#x3C;A&amp;#x3C;Successor&gt;, B&amp;#x3C;Successor&gt;, Sink&gt;::Type
  = Rebind&amp;#x3C;Successor, GraphImpl&amp;#x3C;B&amp;#x3C;Successor&gt;, Sink&gt;::Type, A&amp;#x3C;Successor&gt;&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;At this point, we need to evaluate &lt;code&gt;GraphImpl&amp;#x3C;B&amp;#x3C;Successor&gt;, Sink&gt;::Type&lt;/code&gt;.&lt;/p&gt;
&lt;ol start=&quot;2&quot;&gt;
&lt;li&gt;Recurse into Tail. Process Head &lt;code&gt;B&amp;#x3C;Successor&gt;&lt;/code&gt; with Tail &lt;code&gt;Sink&lt;/code&gt;:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;GraphImpl&amp;#x3C;B&amp;#x3C;Successor&gt;, Sink&gt;::Type
  = Rebind&amp;#x3C;Successor, GraphImpl&amp;#x3C;Sink&gt;::Type, B&amp;#x3C;Successor&gt;&gt;
  = Rebind&amp;#x3C;Successor, Sink, B&amp;#x3C;Successor&gt;&gt; // Base case: GraphImpl&amp;#x3C;Sink&gt;::Type = Sink
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;3&quot;&gt;
&lt;li&gt;Apply Rebind:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;Rebind&amp;#x3C;Successor, Sink, B&amp;#x3C;Successor&gt;&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This matches the recursive case earlier! &lt;code&gt;B&amp;#x3C;Successor&gt;&lt;/code&gt; is our Target, a
&lt;strong&gt;class template&lt;/strong&gt; with arguments.&lt;/p&gt;
&lt;p&gt;Decomposing:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Template: &lt;code&gt;B&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Arguments: &lt;code&gt;Successor&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Now, recursively rebind the arguments:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;B&amp;#x3C;typename RebindImpl&amp;#x3C;Successor, Sink, Successor&gt;::Type...&gt;
  = B&amp;#x3C;Sink&gt; // Pattern matched! Successor replaced with Sink
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;Unwinding from innermost to outermost Rebind:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;GraphImpl&amp;#x3C;A&amp;#x3C;Successor&gt;, B&amp;#x3C;Successor&gt;, Sink&gt;::Type
  = Rebind&amp;#x3C;Successor, B&amp;#x3C;Sink&gt;, A&amp;#x3C;Successor&gt;&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Again, this matches the recursive case.
Decomposing &lt;code&gt;A&amp;#x3C;Successor&gt;&lt;/code&gt;:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Template: &lt;code&gt;A&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Arguments: &lt;code&gt;Successor&lt;/code&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;Recursively rebind:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;A&amp;#x3C;typename RebindImpl&amp;#x3C;Successor, B&amp;#x3C;Sink&gt;, Successor&gt;::Type...&gt;
  = A&amp;#x3C;B&amp;#x3C;Sink&gt;&gt; // Pattern match! Successor replaced with B&amp;#x3C;Sink&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Visualising the Stack:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-txt&quot;&gt;Input:  Graph&amp;#x3C;A&amp;#x3C;&gt;, B&amp;#x3C;&gt;, Sink&gt;
        |
Step 1: GraphImpl recurses, resolving the rightmost node first
        Sink (terminal, base case)
        |
Step 2: B&amp;#x3C;&gt; gets its Successor replaced with Sink
        B&amp;#x3C;Successor&gt; -&gt; B&amp;#x3C;Sink&gt;
        |
Step 3: A&amp;#x3C;&gt; gets its Successor replaced with B&amp;#x3C;Sink&gt;
        A&amp;#x3C;Successor&gt; -&gt; A&amp;#x3C;B&amp;#x3C;Sink&gt;&gt;
        |
Output: A&amp;#x3C;B&amp;#x3C;Sink&gt;&gt;
&lt;/code&gt;&lt;/pre&gt;
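&lt;p&gt;The trace can also be checked by the compiler.
A self-contained sketch that repeats the simplified &lt;code&gt;RebindImpl&lt;/code&gt; and &lt;code&gt;GraphImpl&lt;/code&gt; from above; the empty &lt;code&gt;A&lt;/code&gt;, &lt;code&gt;B&lt;/code&gt; and &lt;code&gt;Sink&lt;/code&gt; are stand-ins, since only their types matter here:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;#include &amp;#x3C;type_traits&gt;

struct Successor {};

namespace detail {
template &amp;#x3C;typename Pattern, typename Replacement, typename Target&gt;
struct RebindImpl { using Type = Target; };

template &amp;#x3C;typename Pattern, typename Replacement&gt;
struct RebindImpl&amp;#x3C;Pattern, Replacement, Pattern&gt; { using Type = Replacement; };

template &amp;#x3C;typename Pattern, typename Replacement,
          template &amp;#x3C;typename...&gt; class Target, typename... Args&gt;
struct RebindImpl&amp;#x3C;Pattern, Replacement, Target&amp;#x3C;Args...&gt;&gt; {
  using Type = Target&amp;#x3C;typename RebindImpl&amp;#x3C;Pattern, Replacement, Args&gt;::Type...&gt;;
};
}  // namespace detail

template &amp;#x3C;typename Pattern, typename Replacement, typename Target&gt;
using Rebind = typename detail::RebindImpl&amp;#x3C;Pattern, Replacement, Target&gt;::Type;

template &amp;#x3C;typename... Nodes&gt;
struct GraphImpl;

template &amp;#x3C;typename Leaf&gt;
struct GraphImpl&amp;#x3C;Leaf&gt; { using Type = Leaf; };

template &amp;#x3C;typename Head, typename... Tail&gt;
struct GraphImpl&amp;#x3C;Head, Tail...&gt; {
  using Type = Rebind&amp;#x3C;Successor, typename GraphImpl&amp;#x3C;Tail...&gt;::Type, Head&gt;;
};

// Stand-in nodes (hypothetical): only their types matter for this check.
template &amp;#x3C;typename Then = Successor&gt; struct A : Then {};
template &amp;#x3C;typename Then = Successor&gt; struct B : Then {};
struct Sink {};

// The flat list collapses into the nested chain from the trace.
static_assert(std::is_same_v&amp;#x3C;GraphImpl&amp;#x3C;A&amp;#x3C;&gt;, B&amp;#x3C;&gt;, Sink&gt;::Type, A&amp;#x3C;B&amp;#x3C;Sink&gt;&gt;&gt;);
// Without a terminal node, the chain stays open at the innermost Successor.
static_assert(std::is_same_v&amp;#x3C;GraphImpl&amp;#x3C;A&amp;#x3C;&gt;, B&amp;#x3C;&gt;&gt;::Type, A&amp;#x3C;B&amp;#x3C;Successor&gt;&gt;&gt;);
&lt;/code&gt;&lt;/pre&gt;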
&lt;h4&gt;How/Why this Works&lt;/h4&gt;
&lt;p&gt;This works for a multitude of reasons:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Default Template Arguments&lt;/strong&gt;: &lt;strong&gt;Internal&lt;/strong&gt; nodes default their &lt;code&gt;Then&lt;/code&gt;
parameter to &lt;code&gt;Successor&lt;/code&gt;, giving us a consistent placeholder to &lt;em&gt;Rebind&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Recursive Rebinding&lt;/strong&gt;: &lt;code&gt;Rebind&lt;/code&gt; then traverses the entire type hierarchy,
finding and replacing all &lt;code&gt;Successor&lt;/code&gt; placeholders at any depth.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Right-to-Left&lt;/strong&gt; Construction: The recursion then unwinds, processing the
nodes from innermost-to-outermost (i.e., right-to-left, tail-to-head),
building the nested structure naturally.&lt;/li&gt;
&lt;/ol&gt;
&lt;h4&gt;Extended Example with Router&lt;/h4&gt;
&lt;p&gt;I haven&apos;t explained &lt;code&gt;Router&lt;/code&gt; yet (at least, not fully! :D).
Speedrunning!&lt;/p&gt;
&lt;p&gt;Router is a little special in that it is composed of Nodes.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;template &amp;#x3C;auto EventV, typename Node&gt;
struct Route : Node {
  // Route inherits from Node
};

template &amp;#x3C;typename... Routes&gt;
struct Router : Routes... {
  // Router inherits from all Routes
};
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;So, &lt;code&gt;Router&lt;/code&gt; itself does not have a &lt;code&gt;Then&lt;/code&gt; parameter. It only inherits from its
Routes.
To make Router chainable in a Graph, we then give each &lt;code&gt;Route&lt;/code&gt; a &lt;code&gt;Then&lt;/code&gt;
parameter, by giving it a &lt;code&gt;Graph&lt;/code&gt; as its &lt;code&gt;Node&lt;/code&gt; argument:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;using HandleFoo = Graph&amp;#x3C;A&amp;#x3C;&gt;, B&amp;#x3C;&gt;&gt;;  // A&amp;#x3C;B&amp;#x3C;Successor&gt;&gt; - still open for chaining!
using HandleBar = Graph&amp;#x3C;C&amp;#x3C;&gt;, D&amp;#x3C;&gt;&gt;;  // C&amp;#x3C;D&amp;#x3C;Successor&gt;&gt;

using Pipeline = Graph&amp;#x3C;
  PassThrough&amp;#x3C;&gt;,
  Router&amp;#x3C;
    Route&amp;#x3C;EventFoo, HandleFoo&gt;,  // Route inherits from A&amp;#x3C;B&amp;#x3C;Successor&gt;&gt;, inheriting Then parameter!
    Route&amp;#x3C;EventBar, HandleBar&gt;   // Route inherits from C&amp;#x3C;D&amp;#x3C;Successor&gt;&gt;
  &gt;,
  Sink  // Terminal
&gt;;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;When rebinding:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;code&gt;GraphImpl&amp;#x3C;Router&amp;#x3C;...&gt;, Sink&gt;::Type&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;&lt;code&gt;Router&amp;#x3C;...&gt;&lt;/code&gt; is a class template with arguments.&lt;/li&gt;
&lt;li&gt;The Rebind invocation looks like:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;Rebind&amp;#x3C;Successor, Sink,
      Router&amp;#x3C;
        Route&amp;#x3C;EventFoo, A&amp;#x3C;B&amp;#x3C;Successor&gt;&gt;&gt;,
        Route&amp;#x3C;EventBar, C&amp;#x3C;D&amp;#x3C;Successor&gt;&gt;&gt;
        &gt;
      &gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;ol start=&quot;4&quot;&gt;
&lt;li&gt;Which then produces the nested structure:&lt;/li&gt;
&lt;/ol&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;Router&amp;#x3C;
  Route&amp;#x3C;EventFoo, A&amp;#x3C;B&amp;#x3C;Sink&gt;&gt;&gt;,
  Route&amp;#x3C;EventBar, C&amp;#x3C;D&amp;#x3C;Sink&gt;&gt;&gt;
&gt;
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Rebind traverses into the Router&apos;s template arguments (the Routes), and within
each Route&apos;s template arguments (the Nodes), &lt;strong&gt;finds and replaces&lt;/strong&gt; all Successor
placeholders &lt;em&gt;recursively&lt;/em&gt;!&lt;/p&gt;
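&lt;p&gt;One detail the simplified &lt;code&gt;RebindImpl&lt;/code&gt; glosses over: a &lt;code&gt;template &amp;#x3C;typename...&gt; class&lt;/code&gt; parameter only matches templates whose parameters are all types.
A template with a non-type parameter, such as &lt;code&gt;Route&lt;/code&gt; with its &lt;code&gt;auto EventV&lt;/code&gt;, would fall through to the base case and be returned unchanged.
A minimal sketch of one way to handle this, assuming an extra partial specialisation (this is an illustration, not taken from the library; the &lt;code&gt;Event&lt;/code&gt; enum and the empty nodes are stand-ins):&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;#include &amp;#x3C;type_traits&gt;

struct Successor {};

namespace detail {
template &amp;#x3C;typename Pattern, typename Replacement, typename Target&gt;
struct RebindImpl { using Type = Target; };

template &amp;#x3C;typename Pattern, typename Replacement&gt;
struct RebindImpl&amp;#x3C;Pattern, Replacement, Pattern&gt; { using Type = Replacement; };

template &amp;#x3C;typename Pattern, typename Replacement,
          template &amp;#x3C;typename...&gt; class Target, typename... Args&gt;
struct RebindImpl&amp;#x3C;Pattern, Replacement, Target&amp;#x3C;Args...&gt;&gt; {
  using Type = Target&amp;#x3C;typename RebindImpl&amp;#x3C;Pattern, Replacement, Args&gt;::Type...&gt;;
};

// Assumed extra case: a leading non-type argument (like Route&apos;s EventV) is
// kept as-is, and only the type arguments after it are rebound.
template &amp;#x3C;typename Pattern, typename Replacement,
          template &amp;#x3C;auto, typename...&gt; class Target, auto Value, typename... Args&gt;
struct RebindImpl&amp;#x3C;Pattern, Replacement, Target&amp;#x3C;Value, Args...&gt;&gt; {
  using Type = Target&amp;#x3C;Value, typename RebindImpl&amp;#x3C;Pattern, Replacement, Args&gt;::Type...&gt;;
};
}  // namespace detail

template &amp;#x3C;typename Pattern, typename Replacement, typename Target&gt;
using Rebind = typename detail::RebindImpl&amp;#x3C;Pattern, Replacement, Target&gt;::Type;

// Stand-in types (hypothetical) mirroring the shapes used above.
enum class Event { Foo, Bar };
template &amp;#x3C;typename Then = Successor&gt; struct A : Then {};
template &amp;#x3C;typename Then = Successor&gt; struct B : Then {};
template &amp;#x3C;auto EventV, typename Node&gt; struct Route : Node {};
template &amp;#x3C;typename... Routes&gt; struct Router : Routes... {};
struct Sink {};

using Open = Router&amp;#x3C;Route&amp;#x3C;Event::Foo, A&amp;#x3C;B&amp;#x3C;&gt;&gt;&gt;, Route&amp;#x3C;Event::Bar, B&amp;#x3C;A&amp;#x3C;&gt;&gt;&gt;&gt;;
using Closed = Router&amp;#x3C;Route&amp;#x3C;Event::Foo, A&amp;#x3C;B&amp;#x3C;Sink&gt;&gt;&gt;, Route&amp;#x3C;Event::Bar, B&amp;#x3C;A&amp;#x3C;Sink&gt;&gt;&gt;&gt;;

// Every Successor inside every Route is replaced, at any depth.
static_assert(std::is_same_v&amp;#x3C;Rebind&amp;#x3C;Successor, Sink, Open&gt;, Closed&gt;);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The sketch writes the open chains out by hand (&lt;code&gt;A&amp;#x3C;B&amp;#x3C;&gt;&gt;&lt;/code&gt;) to keep the focus on the extra specialisation.&lt;/p&gt;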
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;Template rebinding transforms an intuitive, flat syntax into the nested type
hierarchy.
By leveraging C++&apos;s template metaprogramming extensively, we get:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Ergonomic API&lt;/strong&gt;: &lt;code&gt;Graph&amp;#x3C;A&amp;#x3C;&gt;, B&amp;#x3C;&gt;, C&amp;#x3C;&gt;&gt;&lt;/code&gt; is clear and maintainable.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Zero Cost&lt;/strong&gt;: No runtime overhead (as explained in V0).&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Type Safety&lt;/strong&gt;: Invalid node combinations are caught by the Compiler, rather
than failing at runtime, possibly catastrophically.&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Composability&lt;/strong&gt;: Functional-Programming-like!&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;All made possible by &lt;code&gt;Rebind&lt;/code&gt;!&lt;/p&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item><item><title>C++ Directed Acyclic Graphs (DAG) - Version 0</title><link>https://euchangxian.dev/blog/dag-v0</link><guid isPermaLink="true">https://euchangxian.dev/blog/dag-v0</guid><description>Compile Time Pipelines!</description><pubDate>Mon, 19 Jan 2026 00:00:00 GMT</pubDate><content:encoded>
&lt;h1&gt;DAG&lt;/h1&gt;
&lt;h2&gt;Problem Statement&lt;/h2&gt;
&lt;p&gt;When processing data, we very often want to (or need-to) chain a series of discrete
operations: Transform, Filter, Collect, Fan-Out, etc.&lt;/p&gt;
&lt;p&gt;In the simplest case, these are just sequential function calls:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;int increment(int x) {
  return x + 1;
}

int shiftLeft(int x) {
  return x &amp;#x3C;&amp;#x3C; 1;
}

int process(int x) {
  x = increment(x);
  x = shiftLeft(x);
  return x;
}
&lt;/code&gt;&lt;/pre&gt;
&lt;h3&gt;Scalability &quot;Wall&quot;&lt;/h3&gt;
&lt;p&gt;As requirements grow, simple function chaining becomes difficult for a whole bunch
of reasons:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;State Management: What if an operation requires maintaining an internal counter, a buffer
(e.g. collecting metrics)?
Passing state through function parameters becomes unwieldy quickly (Parameter Drilling).
Global State introduces thread-safety risks, and makes the code difficult to reason about and test.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Re-usability: If we want to use the &lt;code&gt;shiftLeft&lt;/code&gt; logic in different &lt;code&gt;process&lt;/code&gt; functions,
we often run into conflicting requirements.
Some use-cases require logging, others need access to a specific context.
We inevitably end up with multiple versions of &lt;code&gt;shiftLeft&lt;/code&gt; (e.g., &lt;code&gt;shiftLeftWithLog&lt;/code&gt;, &lt;code&gt;shiftLeftWithContext&lt;/code&gt;)
with slightly different signatures to accommodate every edge case.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Dynamic Topology: In more complex systems, we may want to swap a &quot;Fan-Out&quot; stage for a
&quot;Pass-Through&quot; stage based on configuration, without rewriting the core logic.
For example, in Staging, we may want to pass-through data because we only have one testing
destination server.
In Production, we may want to Fan-Out this data to multiple destinations.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Testing: This goes without saying. A chain of hard-wired free functions cannot easily be unit-tested in isolation, since the callees cannot be swapped for mocks.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;Version 0: Run-Time (Virtual) -&gt; Compile-Time (Templates)&lt;/h2&gt;
&lt;p&gt;In Version 0, we solve the problems by using Run-time Polymorphism, which most of us
would be familiar with (&lt;code&gt;virtual&lt;/code&gt; in C++, &lt;code&gt;interface&lt;/code&gt; in Go, Inheritance in Java).&lt;/p&gt;
&lt;p&gt;While this makes the code much more flexible and maintainable than free functions,
it introduces a runtime &quot;tax&quot; that we want to avoid in high-performance systems.&lt;/p&gt;
&lt;p&gt;Hence, we evolve this to Compile-time polymorphism later. Just because we can in C++.&lt;/p&gt;
&lt;p&gt;Refer to &lt;a href=&quot;https://github.com/euchangxian/Goose/blob/main/src/dag/examples/Version0.cpp&quot;&gt;Version0.cpp on my Github&lt;/a&gt; for the implementation details.&lt;/p&gt;
&lt;h3&gt;Key Concepts&lt;/h3&gt;
&lt;p&gt;Version 0 treats the pipeline as a bunch of Stages.
Because every stage derives (or inherits) from a common interface &lt;code&gt;Then&lt;/code&gt;, the
stages do not need to know the specific type of the &quot;Then&quot; stage, only that it exists.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Decoupled Topology: Stages can be swapped out.&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;In the run-time version, the &lt;code&gt;Adder&lt;/code&gt; can be swapped for a &lt;code&gt;Multiplier&lt;/code&gt; by changing the Constructor arguments&lt;/li&gt;
&lt;li&gt;In the compile-time version, change the Template arguments.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Encapsulated State: If &lt;code&gt;Adder&lt;/code&gt; needs to count how many messages it has processed,
it can store a &lt;code&gt;std::uint64_t count_&lt;/code&gt; member internally, which is invisible to the
rest of the pipeline.
Note that this does not yet address the passing of State (Parameter Drilling).&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Testability: We can mock Stages, allowing us to unit-test them in isolation.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Run-Time&lt;/h3&gt;
&lt;p&gt;In the Run-Time version, we define an &quot;interface&quot; that allows us to define a
pipeline that processes &lt;code&gt;std::int64_t&lt;/code&gt;s:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;struct Then {
  virtual void process(std::int64_t) = 0;
  virtual std::int64_t result() const = 0;
};
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;struct Pipeline {
  Store store;
  Doubler doubler{&amp;#x26;store}; // Doubler wraps Store
  Adder adder{&amp;#x26;doubler};   // Adder wraps Doubler
} storage;
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;Then* pipeline = &amp;#x26;storage.adder; // gets the entry-point into the Pipeline.
pipeline-&gt;process(x); // adder.process -&gt; doubler.process -&gt; store.process
return pipeline-&gt;result(); // returns final value of x.
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;This is simple enough. But we are using C++.&lt;/p&gt;
&lt;h4&gt;Trade-Offs&lt;/h4&gt;
&lt;p&gt;The flexibility afforded by run-time polymorphism (&lt;code&gt;virtual&lt;/code&gt;) comes at the cost
of &lt;strong&gt;indirection&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;As the name &lt;strong&gt;run-time&lt;/strong&gt; implies, the CPU has to do extra work at run-time to
support this &quot;plug-and-play&quot; feature.&lt;/p&gt;
&lt;h5&gt;Memory Tax&lt;/h5&gt;
&lt;p&gt;In the code, &lt;code&gt;sizeof(Then)&lt;/code&gt; is &lt;strong&gt;8 bytes&lt;/strong&gt;. &lt;code&gt;sizeof(Doubler)&lt;/code&gt; is &lt;strong&gt;16 bytes&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;This assumes a &lt;strong&gt;64-bit&lt;/strong&gt; architecture (e.g., X86-64), where the CPU uses 64-bit
memory addresses (i.e., &lt;strong&gt;8 bytes&lt;/strong&gt;).&lt;/p&gt;
&lt;p&gt;On a 32-bit arch, &lt;code&gt;sizeof(Then)&lt;/code&gt; would be &lt;strong&gt;4 bytes&lt;/strong&gt; instead!&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Virtual Function Table Pointer (vptr)&lt;/strong&gt;: Since &lt;code&gt;Doubler&lt;/code&gt; has virtual functions,
the compiler adds a hidden pointer that points to its class&apos;s &quot;Virtual Function Table&quot; (&lt;strong&gt;vtable&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;Member &lt;strong&gt;Pointer&lt;/strong&gt;: The address of the Then stage: &lt;code&gt;Then* then_&lt;/code&gt; must be stored.&lt;/li&gt;
&lt;/ol&gt;
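&lt;p&gt;These sizes can be checked at compile time.
A minimal sketch, assuming stage definitions shaped like the list above (the bodies of &lt;code&gt;Store&lt;/code&gt; and &lt;code&gt;Doubler&lt;/code&gt; and the virtual destructor are assumptions for illustration; the actual code is in Version0.cpp), and a common ABI where the vptr is pointer-sized:&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;#include &amp;#x3C;cstdint&gt;

struct Then {
  virtual void process(std::int64_t) = 0;
  virtual std::int64_t result() const = 0;
  virtual ~Then() = default;  // Assumed addition; does not change the size.
};

// Assumed shape of the final stage: one vptr plus the stored value.
struct Store : Then {
  void process(std::int64_t x) override { result_ = x; }
  std::int64_t result() const override { return result_; }
  std::int64_t result_ = 0;
};

// Assumed shape of an internal stage: one vptr plus one pointer member.
struct Doubler : Then {
  explicit Doubler(Then* then) : then_(then) {}
  void process(std::int64_t x) override { then_-&gt;process(x * 2); }
  std::int64_t result() const override { return then_-&gt;result(); }
  Then* then_;
};

// vptr only: 8 bytes on a 64-bit target, 4 bytes on a 32-bit one.
static_assert(sizeof(Then) == sizeof(void*));
// vptr + then_: 16 bytes on a 64-bit target.
static_assert(sizeof(Doubler) == 2 * sizeof(void*));
&lt;/code&gt;&lt;/pre&gt;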
&lt;p&gt;At &lt;strong&gt;48 bytes&lt;/strong&gt; for a simple &lt;strong&gt;3-stage&lt;/strong&gt; pipeline, we are already close to
exhausting a single &lt;strong&gt;64-byte Cache Line&lt;/strong&gt;.&lt;/p&gt;
&lt;p&gt;When you access a memory address, the hardware proactively grabs the
surrounding 64 bytes and loads them into the L1 cache.&lt;/p&gt;
&lt;p&gt;This is &lt;strong&gt;Spatial Locality&lt;/strong&gt;!
If your data is packed tightly, the CPU gets everything it needs in one trip!&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-bash&quot;&gt;❯ cat /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
─────┬────────────────────────────────────────────────────────────────────
     │ File: /sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size
─────┼────────────────────────────────────────────────────────────────────
   1 │ 64
─────┴────────────────────────────────────────────────────────────────────
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Adding a few more stages or some internal state (like a counter) will cause our
pipeline to span &lt;strong&gt;multiple&lt;/strong&gt; cache lines, which is a &lt;strong&gt;biiiig&lt;/strong&gt; problem!&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;The CPU can no longer retrieve the entire Pipeline in a single fetch.
It is now required to issue &lt;strong&gt;multiple&lt;/strong&gt; fetches from &lt;strong&gt;memory&lt;/strong&gt; to pull the
Pipeline into its L1 Cache.&lt;/li&gt;
&lt;li&gt;Because the stages are linked via pointers, the CPU cannot effectively &lt;strong&gt;prefetch&lt;/strong&gt;!
It must wait for the &lt;code&gt;then_&lt;/code&gt; pointer to be resolved, before it even knows which memory
address to request next.&lt;/li&gt;
&lt;/ul&gt;
&lt;h5&gt;CPU Tax&lt;/h5&gt;
&lt;p&gt;When &lt;code&gt;pipeline-&gt;process(x)&lt;/code&gt; is invoked, the CPU cannot simply jump to the next
instruction (ideally, the equivalent of &lt;code&gt;addi&lt;/code&gt;, &lt;code&gt;mul&lt;/code&gt; instructions in MIPS).&lt;/p&gt;
&lt;p&gt;The CPU must:&lt;/p&gt;
&lt;ol&gt;
&lt;li&gt;Dereference the &lt;strong&gt;vptr&lt;/strong&gt; to find the &lt;strong&gt;vtable&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Index into the table to find the address of the &lt;code&gt;process&lt;/code&gt; function.&lt;/li&gt;
&lt;li&gt;Call through the address (&lt;code&gt;call&lt;/code&gt; instruction)&lt;/li&gt;
&lt;/ol&gt;
&lt;p&gt;This is a form of &lt;strong&gt;Pointer-Chasing&lt;/strong&gt;, which is difficult for the CPU&apos;s branch
predictor to optimise, and prevents the Compiler from inlining the code.
Each vtable access is also a potential cache miss.&lt;/p&gt;
&lt;h3&gt;Compile-Time&lt;/h3&gt;
&lt;p&gt;In this version, we use C++ templates to achieve (&lt;strong&gt;more!!&lt;/strong&gt;) flexibility without runtime overhead.&lt;/p&gt;
&lt;p&gt;The key difference: instead of storing a pointer to the &lt;code&gt;Then&lt;/code&gt; stage, the &lt;code&gt;Then&lt;/code&gt; stage is encoded directly in the template parameter.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;struct Store {
  std::int64_t result_;
  void process(std::int64_t x) { result_ = x; }
  std::int64_t result() const { return result_; }
};

template &amp;#x3C;typename Then&gt;
struct Doubler : Then {
  void process(std::int64_t x) { Then::process(x*2); }
};

template &amp;#x3C;typename Then&gt;
struct Adder : Then {
  void process(std::int64_t x) { Then::process(x+1); }
};

using Pipeline = Adder&amp;#x3C;Doubler&amp;#x3C;Store&gt;&gt;;
&lt;/code&gt;&lt;/pre&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;Pipeline pipeline;
pipeline.process(x);
return pipeline.result();
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Given that &lt;code&gt;Pipeline&lt;/code&gt; is now a type, not an Object with pointers, the compiler
knows the exact type at compile-time.&lt;/p&gt;
&lt;h4&gt;Benefits (over Run-Time)&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Zero&lt;/strong&gt; Indirection: No vtable lookups. When &lt;code&gt;pipeline.process(x)&lt;/code&gt; is called,
the compiler knows exactly which function to call, and can inline the entire chain.
The entire pipeline collapses into a few instructions.
Refer to &lt;a href=&quot;#compiler-explorer-comparison&quot;&gt;Compile-time vs Run-time instructions generated&lt;/a&gt;.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Reduced Memory Overhead&lt;/strong&gt;: Since &lt;code&gt;Doubler&lt;/code&gt; and &lt;code&gt;Adder&lt;/code&gt; have no member variables
and hold no state, &lt;code&gt;sizeof(Pipeline)&lt;/code&gt; is just &lt;strong&gt;8 bytes&lt;/strong&gt; for the &lt;code&gt;Store::result_&lt;/code&gt; field.
It can even be optimised further to &lt;strong&gt;1 byte&lt;/strong&gt; (return the result instead of storing it).
Either way, it sits comfortably within a single cache line,
a vast difference compared to the &lt;strong&gt;48 bytes&lt;/strong&gt; of the run-time version.
NOTE: Strictly speaking, it is inaccurate to call this optimisation
EBCO (&lt;strong&gt;E&lt;/strong&gt;mpty &lt;strong&gt;B&lt;/strong&gt;ase &lt;strong&gt;C&lt;/strong&gt;lass &lt;strong&gt;O&lt;/strong&gt;ptimisation):
the base class &lt;code&gt;Store&lt;/code&gt; is not empty here, it is the derived stages that add nothing to it.&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;The underlying idea is the same one EBCO relies on anyways: a class that holds no state need not take up extra space in the object!&lt;/p&gt;
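&lt;p&gt;The size claims can be checked with a few &lt;code&gt;static_assert&lt;/code&gt;s. A minimal sketch, with two assumptions that are not in the code above: the stages use a deduced (&lt;code&gt;auto&lt;/code&gt;) return type so that they can forward either &lt;code&gt;void&lt;/code&gt; or a value, and &lt;code&gt;Return&lt;/code&gt; and &lt;code&gt;StatelessPipeline&lt;/code&gt; are hypothetical names for the &quot;return instead of store&quot; variant. The 1-byte result holds on mainstream ABIs, where an empty class has size 1.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;#include &amp;#x3C;cstdint&gt;

struct Store {
  std::int64_t result_;
  void process(std::int64_t x) { result_ = x; }
  std::int64_t result() const { return result_; }
};

// Hypothetical stateless terminal stage: returns the value instead of storing it.
struct Return {
  std::int64_t process(std::int64_t x) { return x; }
};

// Same stages as above, except that the return type is deduced so that
// they forward whatever the next stage returns (void for Store).
template &amp;#x3C;typename Then&gt;
struct Doubler : Then {
  auto process(std::int64_t x) { return Then::process(x * 2); }
};

template &amp;#x3C;typename Then&gt;
struct Adder : Then {
  auto process(std::int64_t x) { return Then::process(x + 1); }
};

using Pipeline = Adder&amp;#x3C;Doubler&amp;#x3C;Store&gt;&gt;;
using StatelessPipeline = Adder&amp;#x3C;Doubler&amp;#x3C;Return&gt;&gt;;

// The stateless stages add nothing on top of the single field in Store.
static_assert(sizeof(Store) == sizeof(std::int64_t));
static_assert(sizeof(Pipeline) == sizeof(Store));

// With no state anywhere, only the mandatory non-zero object size remains.
static_assert(sizeof(StatelessPipeline) == 1);
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;Both variants compute the same value; only where the result lives differs.&lt;/p&gt;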
&lt;h4&gt;Trade-Offs&lt;/h4&gt;
&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;Static Topology: The pipeline structure is fixed at compile-time. &lt;code&gt;Adder&lt;/code&gt; cannot
be swapped for &lt;code&gt;Multiplier&lt;/code&gt; based on a configuration file or runtime condition.
Thankfully, we can fix this (later)!&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Code-Bloat: The compiler generates separate code for every distinct Pipeline
instantiation.
This increases binary size and can lead to instruction cache misses.
However, we can reduce &lt;strong&gt;i-cache&lt;/strong&gt; misses using techniques like cache warming and
optimising for the Hot-Path where performance is critical.
Most code paths (including the hot-path) will still see an improvement due to
reduced &lt;strong&gt;data-cache&lt;/strong&gt; misses.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Longer Compile Times: I guess, when run-time performance is vital, this is an
acceptable trade-off.&lt;/p&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;p&gt;Template Bloat: still learning. If it&apos;s interesting enough, maybe I will share it here :)&lt;/p&gt;
&lt;/li&gt;
&lt;/ul&gt;
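&lt;p&gt;To make the Static Topology trade-off concrete, here is one common workaround, sketched under assumptions: a closed set of pipelines is instantiated at compile-time, and &lt;code&gt;std::variant&lt;/code&gt; picks between them at run-time. This is not necessarily the fix alluded to above; &lt;code&gt;AnyPipeline&lt;/code&gt;, &lt;code&gt;make_pipeline&lt;/code&gt; and &lt;code&gt;run&lt;/code&gt; are hypothetical names, and the &lt;code&gt;bool&lt;/code&gt; stands in for a configuration value.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-c++&quot;&gt;#include &amp;#x3C;cstdint&gt;
#include &amp;#x3C;variant&gt;

struct Store {
  std::int64_t result_;
  void process(std::int64_t x) { result_ = x; }
  std::int64_t result() const { return result_; }
};

template &amp;#x3C;typename Then&gt;
struct Doubler : Then {
  void process(std::int64_t x) { Then::process(x * 2); }
};

template &amp;#x3C;typename Then&gt;
struct Adder : Then {
  void process(std::int64_t x) { Then::process(x + 1); }
};

using AddThenDouble = Adder&amp;#x3C;Doubler&amp;#x3C;Store&gt;&gt;;  // (x + 1) * 2
using DoubleThenAdd = Doubler&amp;#x3C;Adder&amp;#x3C;Store&gt;&gt;;  // (x * 2) + 1

// Closed set of topologies; each one is still fully known at compile-time.
using AnyPipeline = std::variant&amp;#x3C;AddThenDouble, DoubleThenAdd&gt;;

// Hypothetical run-time switch, e.g. driven by a configuration file.
AnyPipeline make_pipeline(bool add_first) {
  if (add_first) return AddThenDouble{};
  return DoubleThenAdd{};
}

std::int64_t run(AnyPipeline pipeline, std::int64_t x) {
  // One branch on the active alternative; inside it the chain can still be inlined.
  return std::visit(
      [x](auto p) {
        p.process(x);
        return p.result();
      },
      pipeline);
}
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The cost moves from one indirect call per stage to a single dispatch per pipeline, but every allowed topology must be listed up front, which adds to the Code-Bloat above.&lt;/p&gt;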
&lt;h3&gt;Compiler Explorer Comparison&lt;/h3&gt;
&lt;p&gt;Compiled using &lt;code&gt;CXX_COMPILER=x86-64 gcc 15.2&lt;/code&gt;, &lt;code&gt;CXX_FLAGS=-g -std=c++23 -O3&lt;/code&gt;.
Irrelevant instructions (like &lt;code&gt;main&lt;/code&gt;) are removed for brevity.&lt;/p&gt;
&lt;p&gt;See source code on &lt;a href=&quot;https://godbolt.org/z/3GfMGKWP7&quot;&gt;Compiler Explorer&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;Simplified Abstract Pipeline&lt;/h3&gt;
&lt;pre&gt;&lt;code&gt;x -&gt; x+1 -&gt; (x+1) * 2
&lt;/code&gt;&lt;/pre&gt;
&lt;h4&gt;Compile-Time (Templates)&lt;/h4&gt;
&lt;p&gt;The entire pipeline is inlined, and optimised to a single &lt;code&gt;lea&lt;/code&gt; instruction.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-asm&quot;&gt;compiletime::process(long):
        lea     rax, [rdi+2+rdi]
        ret
&lt;/code&gt;&lt;/pre&gt;
&lt;p&gt;The &lt;code&gt;lea&lt;/code&gt; (Load Effective Address) instruction computes &lt;code&gt;rdi + rdi + 2&lt;/code&gt;,
i.e. &lt;code&gt;2*x + 2&lt;/code&gt;, which equals &lt;code&gt;(x+1)*2&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;The compiler recognised the entire pipeline&apos;s computation and collapsed it
into pure arithmetic - a single instruction!&lt;/p&gt;
&lt;h4&gt;Run-Time (&lt;code&gt;virtual&lt;/code&gt;)&lt;/h4&gt;
&lt;p&gt;Even with &lt;code&gt;-O3&lt;/code&gt; optimisations, the virtual function calls are not inlined here because
the compiler cannot prove which concrete functions will be called through the global &lt;code&gt;pipeline&lt;/code&gt; pointer at runtime.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&quot;language-asm&quot;&gt;# Entry Point: Demonstrates vtable lookup overhead
runtime::process(long):
        # pipeline-&gt;process()
        sub     rsp, 8                                 # Allocate stack space
        mov     rsi, rdi                               # rsi = x (input parameter)
        mov     rdi, QWORD PTR runtime::pipeline[rip]  # load pipeline pointer
        mov     rax, QWORD PTR [rdi]                   # load vptr from object
        call    [QWORD PTR [rax]]                      # call through vtable (process function)

        # pipeline-&gt;result()
        mov     rdi, QWORD PTR runtime::pipeline[rip]  # reload pipeline pointer
        mov     rax, QWORD PTR [rdi]                   # reload vptr
        mov     rax, QWORD PTR [rax+8]                 # load vtable[8] (result function)
        add     rsp, 8                                 # clean up stack
        jmp     rax                                    # tail call to result()

# Each Stage requires its own function, with virtual function call overhead: pointer chasing
runtime::Store::process(long):
        mov     QWORD PTR [rdi+8], rsi                 # result_ = x (store)
        ret

runtime::Store::result() const:
        mov     rax, QWORD PTR [rdi+8]                 # Load result_ from memory
        ret

runtime::Doubler::process(long):
        mov     rdi, QWORD PTR [rdi+8]                 # Load then_ pointer
        add     rsi, rsi                               # x = x * 2 (actual work!)
        mov     rax, QWORD PTR [rdi]                   # load vptr from then_ object
        jmp     [QWORD PTR [rax]]                      # jump through vtable to then_-&gt;process()

runtime::Doubler::result() const:
        mov     rdi, QWORD PTR [rdi+8]                 # Load then_ pointer
        mov     rax, QWORD PTR [rdi]                   # Load vptr
        jmp     [QWORD PTR [rax+8]]                    # Jump through vtable to then_-&gt;result()

runtime::Adder::process(long):
        mov     rdi, QWORD PTR [rdi+8]                 # Load then_ pointer
        add     rsi, 1                                 # x = x + 1 (actual work done!)
        mov     rax, QWORD PTR [rdi]                   # load vptr from then_ object
        jmp     [QWORD PTR [rax]]                      # jump through vtable to then_-&gt;process

runtime::Adder::result() const:
        mov     rdi, QWORD PTR [rdi+8]                 # Load then_ pointer
        mov     rax, QWORD PTR [rdi]                   # Load vptr
        jmp     [QWORD PTR [rax+8]]                    # Jump through vtable to then_-&gt;result()

# main omitted

# Global Constructor - runs before main() to construct the stage objects (sets their vptrs and then_ pointers).
_GLOBAL__sub_I_runtime::storage:
        movq    xmm0, QWORD PTR .LC0[rip]
        mov     QWORD PTR runtime::storage[rip], OFFSET FLAT:vtable for runtime::Store+16
        movhps  xmm0, QWORD PTR .LC1[rip]
        movaps  XMMWORD PTR runtime::storage[rip+16], xmm0
        movq    xmm0, QWORD PTR .LC2[rip]
        movhps  xmm0, QWORD PTR .LC3[rip]
        movaps  XMMWORD PTR runtime::storage[rip+32], xmm0
        ret

# Runtime Type Information (RTTI) - enables dynamic_cast and typeid
typeinfo name for runtime::Then:
        .string &quot;N7runtime4ThenE&quot;
typeinfo for runtime::Then:
        .quad   vtable for __cxxabiv1::__class_type_info+16
        .quad   typeinfo name for runtime::Then
typeinfo name for runtime::Store:
        .string &quot;N7runtime5StoreE&quot;
typeinfo for runtime::Store:
        .quad   vtable for __cxxabiv1::__si_class_type_info+16
        .quad   typeinfo name for runtime::Store
        .quad   typeinfo for runtime::Then
typeinfo name for runtime::Doubler:
        .string &quot;N7runtime7DoublerE&quot;
typeinfo for runtime::Doubler:
        .quad   vtable for __cxxabiv1::__si_class_type_info+16
        .quad   typeinfo name for runtime::Doubler
        .quad   typeinfo for runtime::Then
typeinfo name for runtime::Adder:
        .string &quot;N7runtime5AdderE&quot;
typeinfo for runtime::Adder:
        .quad   vtable for __cxxabiv1::__si_class_type_info+16
        .quad   typeinfo name for runtime::Adder
        .quad   typeinfo for runtime::Then

# Virtual Function Tables - Array of Function Pointers for each class
vtable for runtime::Store:
        .quad   0                                      # Offset to top
        .quad   typeinfo for runtime::Store            # RTTI pointer
        .quad   runtime::Store::process(long)          # vtable[0]: process()
        .quad   runtime::Store::result() const         # vtable[8]: result()
vtable for runtime::Doubler:
        .quad   0                                      # Offset to top
        .quad   typeinfo for runtime::Doubler          # RTTI pointer
        .quad   runtime::Doubler::process(long)        # vtable[0]: process()
        .quad   runtime::Doubler::result() const       # vtable[8]: result()
vtable for runtime::Adder:
        .quad   0                                      # Offset to top
        .quad   typeinfo for runtime::Adder            # RTTI pointer
        .quad   runtime::Adder::process(long)          # vtable[0]: process()
        .quad   runtime::Adder::result() const         # vtable[8]: result()

# Global Data
runtime::pipeline:
        .quad   runtime::storage+32                 # Points to Adder (entry point)

runtime::storage:
        .zero   48                                  # 48 bytes: Store(16)+Doubler(16)+Adder(16)

# Constants for Initialisation
.LC0:
        .quad   vtable for runtime::Doubler+16
.LC1:
        .quad   runtime::storage
.LC2:
        .quad   vtable for runtime::Adder+16
.LC3:
        .quad   runtime::storage+16
&lt;/code&gt;&lt;/pre&gt;
&lt;h2&gt;Conclusion&lt;/h2&gt;
&lt;p&gt;&lt;del&gt;Should be self-explanatory. &lt;code&gt;virtual&lt;/code&gt; generates so many more instructions. So much overhead.&lt;/del&gt;
&lt;strong&gt;IT DEPENDS.&lt;/strong&gt;&lt;/p&gt;
&lt;h2&gt;Credits&lt;/h2&gt;
&lt;p&gt;Inspired by my internship at Squarepoint on the Trading Controls (Core Trading Services) team.&lt;/p&gt;
&lt;p&gt;Originally called Pipeline, I decided naming it a Directed Acyclic Graph was more appealing to me, considering the ability to FanOut (and FanIn) (&lt;strong&gt;WIP&lt;/strong&gt;).&lt;/p&gt;</content:encoded><h:img src="undefined"/><enclosure url="undefined"/></item></channel></rss>