Eu Chang Xian

Back

DAG#

euchangxian / goose

Waiting for api.github.com...

???
???
???
?????

Problem Statement#

When processing data, we very often want to (or need-to) chain a series of discrete operations: Transform, Filter, Collect, Fan-Out, etc.

In the simplest case, these are just sequential function calls:

int increment(int x) {
  return x + 1;
}

int shiftLeft(int x) {
  return x << 1;
}

int process(int x) {
  x = increment(x);
  x = shiftLeft(x);
  return x;
}
c

Scalability “Wall”#

As requirements grow, simple function chaining becomes difficult for a whole bunch of reasons:

  • State Management: What if an operation requires maintaining an internal counter, a buffer (e.g. collecting metrics)? Passing state through function parameters become unwieldy quickly (Parameter Drilling). Global State introduces thread-safety risk, makes the code difficult to reason about, and test.

  • Re-usability: If we want to use the shiftLeft logic in different process functions, we often run into conflicting requirements. Some use-cases require logging, others need access to a specific context. We inevitably end up with multiple versions of shiftLeft (e.g., shiftLeftWithLog, shiftLeftWithContext) with slightly different signatures to accommodate every edge cases.

  • Dynamic Topology: In more complex systems, we may want to “Fan-Out” data from a Stage to a “Pass-Through” stage based on configuration without rewriting the core logic. For example, in Staging, we may want to pass-through data because we only have one testing destination server. In Production, we may want to Fan-Out this data to multiple destinations.

  • Testing: This comes without saying. Free-functions cannot be unit-tested easily.

Version 0: Run-Time (Virtual) -> Compile-Time (Templates)#

In Version 0, we solve the problems by using Run-time Polymorphism, which most of us would be familiar with (virtual in C++, interface in Go, Inheritance in Java).

While this makes the code much more flexible and maintainable than free functions, it introduces a runtime “tax” that we want to avoid in high-performance systems.

Hence, we evolve this to Compile-time polymorphism later. Just because we can in C++.

Refer to Version0.cpp on my Github for the implementation details.

Key Concepts#

Version 0 treats the pipeline as a bunch of Stages. Because every stage derives (or inherits) from a common interface Then, the stages do not need to know the specific type of the “Then” stage, only that it exists.

  • Decoupled Topology: Stages can be swapped out.

    • In the run-time version, the Adder can be swapped for a Multiplier by changing the Constructor arguments
    • In the compile-time version, change the Template arguments.
  • Encapsulated State: If Adder needs to count how many messages it has processed, it can store a std::uint64_t count_ member internally, and is invisible to the rest of the pipeline. Note that this does not yet address the passing of State (Parameter Drilling).

  • Testability: We can mock Stages, allowing us to unit-test them in isolation.

Run-Time#

In the Run-Time version, we define an “interface” that allows us to define a pipeline that processes std::int64_ts:

struct Then {
  virtual void process(std::int64_t) = 0;
  virtual std::int64_t result() const = 0;
};
c
struct Pipeline {
  Store store;
  Doubler doubler{&store}; // Doubler wraps Store
  Adder adder{&doubler};   // Adder wraps Doubler
} storage;
c
Then* pipeline = &storage.adder; // gets the entry-point into the Pipeline.
pipeline->process(x); // adder.process -> doubler.process -> store.process
return pipeline->result(); // returns final value of x.
c

This is simple-enough. But we are using C++.

Trade-Offs#

The flexibility afforded by run-time polymorphism (virtual) comes at the cost of indirection.

As the name run-time implies, the CPU has to do extra work at run-time to support this “plug-and-play” feature.

Memory Tax#

In the code, sizeof(Then) is 8 bytes. sizeof(Doubler) is 16 bytes

Why 8 bytes?

  1. Virtual Function Table Pointer (vptr): Since Doubler has virtual functions, the compiler adds a hidden pointer that points to its class’s “Virtual Function Table” (vtable).
  2. Member Pointer: The address of the Then stage: Then* then_ must be stored.

At 48 bytes for a simple 3-stage pipeline, we are already close to exhausting a single 64-byte Cache Line.

Why 64-byte Cache Lines?

Adding a few more stages or some internal state (like a counter), will cause our pipeline to span multiple cache lines, which is a biiiig problem!

  • The CPU can no longer retrieve the entire Pipeline in a single fetch. It is now required to issue multiple fetches from memory to pull the Pipeline into its L1 Cache.
  • Because the stages are linked via pointers, the CPU cannot effectively prefetch! It must wait for the then_ pointer to be resolved, before it even knows which memory address to request next.
CPU Tax#

When pipeline->process(x) is invoked, the CPU cannot simply jump to the next instruction (ideally, the equivalent of addi, multi instructions in MIPS).

The CPU must:

  1. Dereference the vptr to find the vtable.
  2. Index into the table to find the address of the process function.
  3. Call through the address (call instruction)

This is a form of Pointer-Chasing, which is difficult for the CPU’s branch predictor to optimise, and prevents the Compiler from inlining the code. Each vtable access is also a potential cache miss.

Compile-Time#

In this version, we use C++ templates to achieve (more!!) flexibility without runtime overhead.

The key difference: instead of storing a pointer to the Then stage, the Then stage is encoded directly in the template parameter.

Pipeline pipeline;
pipeline.process(x);
return pipeline.result();
c

Given that Pipeline is now a type, not an Object with pointers, the compiler knows the exact type at compile-time.

Benefits (over Run-Time)#

  • Zero Indirection: No vtable lookups. When pipeline.process(x) is called, the compiler knows exactly which function to call, and can inline the entire chain. The entire pipeline collapse into a few instructions. Refer to Compile-time vs Run-time instructions generated

  • Reduced Memory Overhead: Since Doubler and Adder have no member variables and hold no state, sizeof(Pipeline) is just 8 bytes for the Store::result_ field, and can even be optimised further to 1 byte (return the result instead of storing), a vast difference compared to the run-time version of 48 bytes, sitting comfortably within a single cache line. NOTE: It is inaccurate to call this optimisation EBCO (Empty Base Class Optimisation)

Why not EBCO?

Trade-Offs#

  • Static Topology: The pipeline structure is fixed at compile-time. Adder cannot be swapped for Multiplier based on a configuration file or runtime condition. Thankfully, we can fix this (later)!

  • Code-Bloat: For each Pipeline configuration, the compiler generates code for each instantiation. This increases binary size and can lead to instruction cache misses. However, we can minimize i-cache misses using techniques like cache warming and optimising for the Hot-Path where performance is critical. Most code paths (including the hot-path) will still see an improvement due to reduced data-cache misses.

  • Longer Compile Times: I guess, when run-time performance is vital, this is an acceptable trade-off.

  • Template Bloat: still learning. If it’s interesting enough, maybe I will share it here :)

Compiler Explorer Comparison#

Compiled using CXX_COMPILER=x86-64 gcc 15.2, CXX_FLAGS=-g -std=c++23 -O3. Irrelevant instructions (like main) are removed for brevity.

See source code on Compiler Explorer.

Simplified Abstract Pipeline#

x -> x+1 -> (x+1) * 2
plaintext

Compile-Time (Templates)#

The entire pipeline is inlined, and optimised to a single lea instruction

compiletime::process(long):
        lea     rax, [rdi+2+rdi]
        ret
asm

The lea (Load Effective Address) instruction computes rdi + rdi + 2, which is equivalent to 2*x + 2 (equivalent to (x+1)*2).

The compiler recognised the entire pipeline’s computation and collapsed it into pure arithmetic - a single instruction!

Run-Time (virtual)#

Even with -O3 optimisations, virtual function calls cannot be inlined because the compiler does not know which concrete functions will be called at runtime.

Conclusion#

Should be self-explanatory. virtual generates so much more instructions. So much overhead. IT DEPENDS.

Credits#

Inspired by my internship at Squarepoint on the Trading Controls (Core Trading Services) team.

Originally called Pipeline, I decided naming it a Directed Acyclic Graph was more appealing to me, considering the ability to FanOut (and FanIn) (WIP).

C++ Directed Acyclic Graphs (DAG) - Version 0
Author Eu Chang Xian
Published at 19 January 2026