DAG#
Waiting for api.github.com...
Problem Statement#
When processing data, we very often want to (or need-to) chain a series of discrete operations: Transform, Filter, Collect, Fan-Out, etc.
In the simplest case, these are just sequential function calls:
int increment(int x) {
return x + 1;
}
int shiftLeft(int x) {
return x << 1;
}
int process(int x) {
x = increment(x);
x = shiftLeft(x);
return x;
}cScalability “Wall”#
As requirements grow, simple function chaining becomes difficult for a whole bunch of reasons:
-
State Management: What if an operation requires maintaining an internal counter, a buffer (e.g. collecting metrics)? Passing state through function parameters become unwieldy quickly (Parameter Drilling). Global State introduces thread-safety risk, makes the code difficult to reason about, and test.
-
Re-usability: If we want to use the
shiftLeftlogic in differentprocessfunctions, we often run into conflicting requirements. Some use-cases require logging, others need access to a specific context. We inevitably end up with multiple versions ofshiftLeft(e.g.,shiftLeftWithLog,shiftLeftWithContext) with slightly different signatures to accommodate every edge cases. -
Dynamic Topology: In more complex systems, we may want to “Fan-Out” data from a Stage to a “Pass-Through” stage based on configuration without rewriting the core logic. For example, in Staging, we may want to pass-through data because we only have one testing destination server. In Production, we may want to Fan-Out this data to multiple destinations.
-
Testing: This comes without saying. Free-functions cannot be unit-tested easily.
Version 0: Run-Time (Virtual) -> Compile-Time (Templates)#
In Version 0, we solve the problems by using Run-time Polymorphism, which most of us
would be familiar with (virtual in C++, interface in Go, Inheritance in Java).
While this makes the code much more flexible and maintainable than free functions, it introduces a runtime “tax” that we want to avoid in high-performance systems.
Hence, we evolve this to Compile-time polymorphism later. Just because we can in C++.
Refer to Version0.cpp on my Github ↗ for the implementation details.
Key Concepts#
Version 0 treats the pipeline as a bunch of Stages.
Because every stage derives (or inherits) from a common interface Then, the
stages do not need to know the specific type of the “Then” stage, only that it exists.
-
Decoupled Topology: Stages can be swapped out.
- In the run-time version, the
Addercan be swapped for aMultiplierby changing the Constructor arguments - In the compile-time version, change the Template arguments.
- In the run-time version, the
-
Encapsulated State: If
Adderneeds to count how many messages it has processed, it can store astd::uint64_t count_member internally, and is invisible to the rest of the pipeline. Note that this does not yet address the passing of State (Parameter Drilling). -
Testability: We can mock Stages, allowing us to unit-test them in isolation.
Run-Time#
In the Run-Time version, we define an “interface” that allows us to define a
pipeline that processes std::int64_ts:
struct Then {
virtual void process(std::int64_t) = 0;
virtual std::int64_t result() const = 0;
};cstruct Pipeline {
Store store;
Doubler doubler{&store}; // Doubler wraps Store
Adder adder{&doubler}; // Adder wraps Doubler
} storage;cThen* pipeline = &storage.adder; // gets the entry-point into the Pipeline.
pipeline->process(x); // adder.process -> doubler.process -> store.process
return pipeline->result(); // returns final value of x.cThis is simple-enough. But we are using C++.
Trade-Offs#
The flexibility afforded by run-time polymorphism (virtual) comes at the cost
of indirection.
As the name run-time implies, the CPU has to do extra work at run-time to support this “plug-and-play” feature.
Memory Tax#
In the code, sizeof(Then) is 8 bytes. sizeof(Doubler) is 16 bytes
- Virtual Function Table Pointer (vptr): Since
Doublerhas virtual functions, the compiler adds a hidden pointer that points to its class’s “Virtual Function Table” (vtable). - Member Pointer: The address of the Then stage:
Then* then_must be stored.
At 48 bytes for a simple 3-stage pipeline, we are already close to exhausting a single 64-byte Cache Line.
Adding a few more stages or some internal state (like a counter), will cause our pipeline to span multiple cache lines, which is a biiiig problem!
- The CPU can no longer retrieve the entire Pipeline in a single fetch. It is now required to issue multiple fetches from memory to pull the Pipeline into its L1 Cache.
- Because the stages are linked via pointers, the CPU cannot effectively prefetch!
It must wait for the
then_pointer to be resolved, before it even knows which memory address to request next.
CPU Tax#
When pipeline->process(x) is invoked, the CPU cannot simply jump to the next
instruction (ideally, the equivalent of addi, multi instructions in MIPS).
The CPU must:
- Dereference the vptr to find the vtable.
- Index into the table to find the address of the
processfunction. - Call through the address (
callinstruction)
This is a form of Pointer-Chasing, which is difficult for the CPU’s branch predictor to optimise, and prevents the Compiler from inlining the code. Each vtable access is also a potential cache miss.
Compile-Time#
In this version, we use C++ templates to achieve (more!!) flexibility without runtime overhead.
The key difference: instead of storing a pointer to the Then stage, the Then stage is encoded directly in the template parameter.
struct Store {
std::int64_t result_;
void process(std::int64_t x) { result_ = x; }
std::int64_t result() const { return result_; }
};
template <typename Then>
struct Doubler : Then {
void process(std::int64_t x) { Then::process(x*2); }
};
template <typename Then>
struct Adder : Then {
void process(std::int64_t x) { Then::process(x+1); }
};
using Pipeline = Adder<Doubler<Store>>;cPipeline pipeline;
pipeline.process(x);
return pipeline.result();cGiven that Pipeline is now a type, not an Object with pointers, the compiler
knows the exact type at compile-time.
Benefits (over Run-Time)#
-
Zero Indirection: No vtable lookups. When
pipeline.process(x)is called, the compiler knows exactly which function to call, and can inline the entire chain. The entire pipeline collapse into a few instructions. Refer to Compile-time vs Run-time instructions generated -
Reduced Memory Overhead: Since
DoublerandAdderhave no member variables and hold no state,sizeof(Pipeline)is just 8 bytes for theStore::result_field, and can even be optimised further to 1 byte (return the result instead of storing), a vast difference compared to the run-time version of 48 bytes, sitting comfortably within a single cache line. NOTE: It is inaccurate to call this optimisation EBCO (Empty Base Class Optimisation)
Trade-Offs#
-
Static Topology: The pipeline structure is fixed at compile-time.
Addercannot be swapped forMultiplierbased on a configuration file or runtime condition. Thankfully, we can fix this (later)! -
Code-Bloat: For each Pipeline configuration, the compiler generates code for each instantiation. This increases binary size and can lead to instruction cache misses. However, we can minimize i-cache misses using techniques like cache warming and optimising for the Hot-Path where performance is critical. Most code paths (including the hot-path) will still see an improvement due to reduced data-cache misses.
-
Longer Compile Times: I guess, when run-time performance is vital, this is an acceptable trade-off.
-
Template Bloat: still learning. If it’s interesting enough, maybe I will share it here :)
Compiler Explorer Comparison#
Compiled using CXX_COMPILER=x86-64 gcc 15.2, CXX_FLAGS=-g -std=c++23 -O3.
Irrelevant instructions (like main) are removed for brevity.
See source code on Compiler Explorer ↗.
Simplified Abstract Pipeline#
x -> x+1 -> (x+1) * 2plaintextCompile-Time (Templates)#
The entire pipeline is inlined, and optimised to a single lea instruction
compiletime::process(long):
lea rax, [rdi+2+rdi]
retasmThe lea (Load Effective Address) instruction computes rdi + rdi + 2,
which is equivalent to 2*x + 2 (equivalent to (x+1)*2).
The compiler recognised the entire pipeline’s computation and collapsed it into pure arithmetic - a single instruction!
Run-Time (virtual)#
Even with -O3 optimisations, virtual function calls cannot be inlined because
the compiler does not know which concrete functions will be called at runtime.
# Entry Point: Demonstrates vtable lookup overhead
runtime::process(long):
# pipeline->process()
sub rsp, 8 # Allocate stack space
mov rsi, rdi # rsi = x (input parameter)
mov rdi, QWORD PTR runtime::pipeline[rip] # load pipeline pointer
mov rax, QWORD PTR [rdi] # load vptr from object
call [QWORD PTR [rax]] # call through vtable (process function)
# pipeline->result()
mov rdi, QWORD PTR runtime::pipeline[rip] # reload pipeline pointer
mov rax, QWORD PTR [rdi] # reload vptr
mov rax, QWORD PTR [rax+8] # load vtable[8] (result function)
add rsp, 8 # clean up stack
jmp rax # tail call to result()
# Each Stage requires its own function, with virtual function call overhead: pointer chasing
runtime::Store::process(long):
mov QWORD PTR [rdi+8], rsi # result_ = x (store)
ret
runtime::Store::result() const:
mov rax, QWORD PTR [rdi+8] # Load result_ from memory
ret
runtime::Doubler::process(long):
mov rdi, QWORD PTR [rdi+8] # Load then_ pointer
add rsi, rsi # x = x * 2 (actual work!)
mov rax, QWORD PTR [rdi] # load vptr from then_ object
jmp [QWORD PTR [rax]] # jump through vtable to then_->process()
runtime::Doubler::result() const:
mov rdi, QWORD PTR [rdi+8] # Load then_ pointer
mov rax, QWORD PTR [rdi] # Load vptr
jmp [QWORD PTR [rax+8]] # Jump through vtable to then_->result()
runtime::Adder::process(long):
mov rdi, QWORD PTR [rdi+8] # Load then_ pointer
add rsi, 1 # x = x + 1 (actual work done!)
mov rax, QWORD PTR [rdi] # load vptr from then_ object
jmp [QWORD PTR [rax]] # jump through vtable to then_->process
runtime::Adder::result() const:
mov rdi, QWORD PTR [rdi+8] # Load then_ pointer
mov rax, QWORD PTR [rdi] # Load vptr
jmp [QWORD PTR [rax+8]] # Jump through vtable to then_->result()
# main omitted
# Global Constructor - runs before main() to initialise vtables and objects.
_GLOBAL__sub_I_runtime::storage:
movq xmm0, QWORD PTR .LC0[rip]
mov QWORD PTR runtime::storage[rip], OFFSET FLAT:vtable for runtime::Store+16
movhps xmm0, QWORD PTR .LC1[rip]
movaps XMMWORD PTR runtime::storage[rip+16], xmm0
movq xmm0, QWORD PTR .LC2[rip]
movhps xmm0, QWORD PTR .LC3[rip]
movaps XMMWORD PTR runtime::storage[rip+32], xmm0
ret
# Runtime Type Information (RTTI) - enables dynamic_cast and typeid
typeinfo name for runtime::Then:
.string "N7runtime4ThenE"
typeinfo for runtime::Then:
.quad vtable for __cxxabiv1::__class_type_info+16
.quad typeinfo name for runtime::Then
typeinfo name for runtime::Store:
.string "N7runtime5StoreE"
typeinfo for runtime::Store:
.quad vtable for __cxxabiv1::__si_class_type_info+16
.quad typeinfo name for runtime::Store
.quad typeinfo for runtime::Then
typeinfo name for runtime::Doubler:
.string "N7runtime7DoublerE"
typeinfo for runtime::Doubler:
.quad vtable for __cxxabiv1::__si_class_type_info+16
.quad typeinfo name for runtime::Doubler
.quad typeinfo for runtime::Then
typeinfo name for runtime::Adder:
.string "N7runtime5AdderE"
typeinfo for runtime::Adder:
.quad vtable for __cxxabiv1::__si_class_type_info+16
.quad typeinfo name for runtime::Adder
.quad typeinfo for runtime::Then
# Virtual Function Tables - Array of Function Pointers for each class
vtable for runtime::Store:
.quad 0 # Offset to top
.quad typeinfo for runtime::Store # RTTI pointer
.quad runtime::Store::process(long) # vtable[0]: process()
.quad runtime::Store::result() const # vtable[8]: result()
vtable for runtime::Doubler:
.quad 0 # Offset to top
.quad typeinfo for runtime::Doubler # RTTI pointer
.quad runtime::Doubler::process(long) # vtable[0]: process()
.quad runtime::Doubler::result() const # vtable[8]: result()
vtable for runtime::Adder:
.quad 0 # Offset to top
.quad typeinfo for runtime::Adder # RTTI pointer
.quad runtime::Adder::process(long) # vtable[0]: process()
.quad runtime::Adder::result() const # vtable[8]: result()
# Global Data
runtime::pipeline:
.quad runtime::storage+32 # Points to Adder (entry point)
runtime::storage:
.zero 48 # 48 bytes: Store(16)+Doubler(16)+Adder(16)
# Constants for Initialisation
.LC0:
.quad vtable for runtime::Doubler+16
.LC1:
.quad runtime::storage
.LC2:
.quad vtable for runtime::Adder+16
.LC3:
.quad runtime::storage+16asmConclusion#
Should be self-explanatory.
IT DEPENDS.virtual generates so much more instructions. So much overhead.
Credits#
Inspired by my internship at Squarepoint on the Trading Controls (Core Trading Services) team.
Originally called Pipeline, I decided naming it a Directed Acyclic Graph was more appealing to me, considering the ability to FanOut (and FanIn) (WIP).