The Backtest-to-Live Gap Is a Cost Model Problem

April 20, 2026

Every quant team has had this conversation. The backtest says +2.1% per month, clean Sharpe, controlled drawdown. The strategy goes live. A month in, the live account is on pace for +1.4% a month. Someone asks, "Is the alpha decaying?" and the answer is almost always: no.

The culprit isn't the signal. It's the cost model.

This post is about why the backtest-to-live gap is, in practice, a cost accounting problem — and what closing it takes on a hybrid venue footprint like the one Dnalyaw runs on. The specifics are trading-flavored. The underlying lesson — that your replay needs to know about every cost your live execution actually pays — generalizes to any system where simulation is supposed to predict reality.

The Easy Part: Bid-Ask Slippage

Most quant frameworks stop at bid-ask slippage. You cross the spread; the backtest models that by assuming you always fill at the ask (for buys) or the bid (for sells). If you're using an Almgren-Chriss square-root impact model, you also add a term proportional to σ√(Q/V) for self-induced price movement, where Q is your order size and V the daily volume. These are the costs that every textbook covers and every mature backtest engine includes.
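To make the square-root form concrete, here is a minimal sketch. The functional form is the standard square-root law; the coefficient `eta` and the example numbers are illustrative assumptions, not fitted values from any real calibration.

```python
import math

def sqrt_impact_bps(sigma_daily: float, order_qty: float, adv: float,
                    eta: float = 0.5) -> float:
    """Square-root market-impact estimate, in basis points.

    impact ≈ eta * sigma * sqrt(Q / V), where sigma is daily
    volatility, Q the order size, and V average daily volume.
    `eta` is a fitted constant per instrument class; 0.5 here
    is purely illustrative.
    """
    return eta * sigma_daily * math.sqrt(order_qty / adv) * 1e4

# 2% daily vol, order is 1% of ADV:
impact = sqrt_impact_bps(sigma_daily=0.02, order_qty=1_000, adv=100_000)
# 0.5 * 0.02 * sqrt(0.01) * 1e4 = 10 bps
```

The point of the sketch is the shape, not the number: impact grows with the square root of participation, so doubling the order size costs less than double the impact, but it never goes to zero.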

They are also rarely the source of the meaningful backtest-to-live gap. The gap isn't in the mid-to-fill spread. It's in the three compounding layers below — each one easy to miss in isolation, devastating in combination.

The Hard Part: Three Compounding Layers

Illustrative monthly P&L decomposition (magnitudes vary by strategy, venue, and instrument class):

  • Gross alpha: +2.1% (what the backtest reports)
  • Bid-ask slippage: −0.2% (usually modeled correctly · the easy gap)
  • Commission timing: −0.3% (async callbacks · execId binding · frequently wrong)
  • FX normalization: −0.2% (HKD commissions, USD P&L · single-currency backtest fails)
  • Microstructure: −0.1% (gap risk · stale quotes · queue position · hard to model)
  • Realistic live P&L: +1.4% (what the live account actually returns)

The shape matters more than the specific percentages. The structure of the gap — small-but-modeled slippage, large-and-routinely-miscounted commission timing, large-and-routinely-ignored FX, small-but-real microstructure — is what most teams have in common. Modeling one layer (bid-ask slippage) while ignoring the others produces the classic "backtest says profitable, live says break-even" result.

Each of the three red layers is its own engineering problem. All three compound.

Layer 1: Commission Timing

The brokerage-fee model looks simple on paper: X bps per share, or a fixed fee per trade. The operational reality is not simple.

On Interactive Brokers — the quant-native venue most systematic funds use for US equities — a fill and its commission arrive on different callbacks. The execution report comes first, keyed by execId. The commission report follows later, also keyed by execId. The two must be joined after the fact. If your system reads the fill and immediately closes the trade loop — computing P&L, updating position tracking, feeding into risk — it does so with zero commission information. The commission is booked when it arrives, asynchronously.

That asymmetry is not a "late callback" inconvenience. It is a race condition on accounting state: the fill event triggers the P&L pipeline, the commission event mutates that pipeline's inputs after, and any consumer that reads the intermediate state sees structurally wrong numbers. Risk sees a position whose realized cost is understated. Downstream strategy sees a Sharpe with a systematic positive bias. The backtest, which books synchronously, sees neither of these hazards — so its P&L quietly diverges from the broker statement every day.

A naive implementation attributes commissions to the wrong fill, or misses them entirely when the callback arrives during a reconnect. We wrote about this under the framing that verification layers need explicit state models, not timeouts — commission binding is the same shape of problem: an implicit "it'll just show up" assumption that breaks under load, and a state-model-based fix that makes the late binding explicit and auditable.

The timing, roughly:

  • T+0ms: order sent
  • T+50ms: fill arrives
  • T+2300ms: commission arrives (separate callback, execId-keyed)

That ~2.25s blind window is where a naive P&L reads the fill as complete while the commission sits silently unbooked.

Two seconds doesn't sound like much. At high frequency it's billions of fills per year across the industry, and the commission figure matters for every single one. At our frequency, it's the difference between a P&L reconciliation that closes every day and one that drifts for weeks before anyone notices.

Futu's equivalent callback has different timing and a different key structure, but the asymmetry is identical: fills are fast, fees are asynchronous. Any cost model that assumes synchronous fee booking is going to disagree with the broker statement at month-end — and the broker is the one that's right.

Layer 2: FX Normalization

Run one venue in one currency and you can ignore FX. Run two venues in two currencies and you cannot.

Dnalyaw's hybrid footprint has this built in: IB US fills in USD, Futu HK fills in HKD, and — when the instrument is HK-listed but cleared in JPY-mapped accounts — FX needs to cross two exchange rates before settling in the portfolio base currency. A backtest written assuming "everything is in USD" is right by accident when your positions happen to be US-only, and wrong every other time.

The non-obvious part: it's not just P&L that needs the conversion. Every consumer — the risk engine computing exposure limits, the order manager sizing position budgets, the backtest replay calculating realized returns — has to normalize. Getting this right in one place is easy. Getting it right in every place is the work. The Rust risk engine, Go OMS, and Python research pipeline each had their own FX pathway that had to be audited and unified.
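What "getting it right in one place" looks like, as a minimal sketch: a single normalization point that every consumer converts through, rather than each subsystem carrying its own FX pathway. The rates and class name are illustrative; live rates come from the market-data feed, not constants:

```python
class FxNormalizer:
    """Single conversion point into the portfolio base currency.

    Risk, OMS, and replay all convert through this one object,
    so an exposure computed anywhere is in the same currency
    as the limits it is checked against.
    """
    def __init__(self, base: str, rates_to_base: dict[str, float]):
        self.base = base
        self.rates = dict(rates_to_base)
        self.rates[base] = 1.0  # identity for the base currency

    def to_base(self, amount: float, ccy: str) -> float:
        # KeyError on an unknown currency is deliberate: failing loudly
        # beats silently treating HKD as if it were USD.
        return amount * self.rates[ccy]

# Illustrative rates, not live quotes:
fx = FxNormalizer("USD", {"HKD": 0.128, "JPY": 0.0067})
exposure_usd = fx.to_base(1_000_000, "HKD")  # ~128,000 USD, not 1,000,000
```

The design choice worth defending is the deliberate KeyError: a risk engine that crashes on an unconverted currency is annoying; one that silently passes a 5% HKD position through USD limit checks is the failure mode described below.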

The failure mode when you miss one: the risk engine sees a 5% HKD position as if it were 5% of a USD portfolio, lets it through limit checks, then P&L realizes in USD and shows something different. The backtest never saw the violation because its FX assumptions were uniform. The live account discovers it the hard way.

Layer 3: Cash Seeding and Fee Schedules

The quietest layer, the one no post-mortem talks about, is that every broker seeds cash differently.

Paper accounts frequently start with wrong initial balances, non-standard rounding, or missing currency sub-accounts. Live accounts start clean but with broker-specific fee schedules that your backtest has probably never loaded. Commissions on IB US are tiered; on Futu HK they're both per-trade and per-dollar-traded; on both, there are additional regulatory fees (SEC, TAF, FINRA on US; levies on HK) that are small individually and consequential cumulatively.

None of this is exotic. All of it is easy to forget. The failure mode is that for the first few weeks of live trading your P&L drifts from your backtest by a few basis points a day, and each day you reassure yourself that you're "within expected noise," and then at the month-end reconciliation you discover that "within expected noise" was actually a systematic bias you missed.

Fixing this is tedious, not clever: load real fee schedules per venue, normalize cash seeding against known broker conventions, and — critically — feed the broker statement back into a reconciliation report every day, not every month.
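The "load real fee schedules per venue" step is tedious in exactly this shape. A sketch of the two schedule structures the post describes — tiered per-share on the US side, per-trade-plus-per-notional on the HK side. Every rate, minimum, and cap below is an assumption for illustration; production loads the venues' published schedules, which change on their own cadence:

```python
def us_tiered_commission(shares: float, price: float,
                         per_share: float = 0.005,
                         min_fee: float = 1.0,
                         max_pct_of_value: float = 0.01) -> float:
    """IB-US-style tiered per-share schedule (rates are illustrative):
    per-share rate, floored at a per-order minimum, capped as a
    percentage of trade value."""
    fee = max(shares * per_share, min_fee)
    return min(fee, shares * price * max_pct_of_value)

def hk_notional_commission(notional_hkd: float,
                           rate: float = 0.0003,
                           min_fee_hkd: float = 3.0) -> float:
    """Futu-HK-style per-notional schedule (rates are illustrative):
    basis points of traded value with a per-trade minimum. Regulatory
    levies are additional line items on top of this, not shown here."""
    return max(notional_hkd * rate, min_fee_hkd)
```

A 100-share trade at $50 hits the $1 minimum on the US side; a small HKD trade hits the per-trade minimum on the HK side. Both minimums are invisible to a backtest that charges flat basis points, and both bite hardest on exactly the small orders a strategy sends most often.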

The Hybrid Venue Amplifier

All three cost layers get harder on a hybrid venue footprint — and more valuable to get right.

A strategy that runs only on IB has one commission schedule, one FX pair, one set of regulatory fees to load. A strategy that runs on IB and Futu HK has two of each, with callback timings that differ, currency sub-accounts that need reconciliation, and fee schedules that update on independent cadences. The cost model has to be a unified abstraction — one where adding Futu doesn't require rewriting the replay engine, and one where the shadow tracker treats USD and HKD P&L with the same accounting discipline.

The upside: when you get it right, the same system serves both venues with one set of invariants. That's not just a nicety — in the quant business, multi-venue access is strategic. IB is where institutional liquidity lives; Futu HK is where Asia retail-and-prop flow lives. A platform that can run identically across both, with identical cost-model correctness on both, is a platform that can express strategies neither venue alone can support.

Which is also why a calibration pipeline is non-negotiable. Building a unified cost abstraction across two venues without a drift tracker is building the abstraction blind — you have no way to tell whether it's actually correct until a reconciliation fails at month-end. The pipeline is what lets the abstraction earn its keep.

The Fix: Shadow Execution

Closing the gap is a pipeline problem, not a modeling problem. You need two P&L traces running in parallel — one from the live account, one from the backtest replay with the same data the live system saw — and a drift tracker that flags divergence in real time.

The pipeline: market data is the shared input, feeding both live execution (real venue · real costs · real fills) and a shadow replay (the simulator fed the real fee schedule and microstructure). A drift tracker compares the two on P&L, Sharpe, fills, and slippage. Within tolerance: graduate, advancing to the next allocation tier. Out of tolerance: freeze, blocking the tier-up. Capital deployment is staged: validation → calibration → limited → full.

Concretely: the modeled P&L uses the same commission figures the live execution paid, the same FX rates live quoted, the same fill quality live observed. When backtest and live agree within tolerance, the strategy earns the right to more capital. When they disagree, the model has a bug that needs to be found before the capital increase.

This is why capital allocation in Dnalyaw is staged: validation → calibration → limited → full. Each stage has explicit quantitative graduation criteria along four dimensions — daily P&L, Sharpe ratio, fill rates, and average slippage. Strategies don't advance because a backtest looked good. They advance because real-time execution matches modeled execution within tolerance for a sustained window.
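The graduation gate reduces to a small predicate over the four dimensions. A sketch under assumed metric names and thresholds (both illustrative; the real criteria are per-strategy and the real check runs over a sustained window, not a single snapshot):

```python
from dataclasses import dataclass

@dataclass
class DriftTolerance:
    pnl_bps: float       # max |live − shadow| daily P&L divergence, bps
    sharpe: float        # max Sharpe-ratio divergence
    fill_rate: float     # max fill-rate divergence
    slippage_bps: float  # max average-slippage divergence, bps

def may_graduate(live: dict, shadow: dict, tol: DriftTolerance) -> bool:
    """True only if live and shadow agree within tolerance on ALL
    four dimensions. Any single excursion blocks the tier-up; there
    is no averaging across dimensions."""
    return (
        abs(live["pnl_bps"] - shadow["pnl_bps"]) <= tol.pnl_bps
        and abs(live["sharpe"] - shadow["sharpe"]) <= tol.sharpe
        and abs(live["fill_rate"] - shadow["fill_rate"]) <= tol.fill_rate
        and abs(live["slippage_bps"] - shadow["slippage_bps"]) <= tol.slippage_bps
    )
```

The conjunction is the design choice: a strategy whose P&L matches but whose fill rate diverges has a cost-model bug waiting to surface, so a good score on three dimensions never buys forgiveness on the fourth.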

The shadow replay is the thing that makes this loop possible. Without it, you're comparing a backtest from last quarter against a live statement from this month and hoping the market regime didn't change in between. With it, you're comparing today's live fills against today's shadow on the same tape.

What This Doesn't Solve

Being honest about the remaining gap:

  • Market impact is still a model. The Almgren-Chriss framework — or whatever impact model you use — approximates reality. Real market impact is path-dependent, depends on hidden order-book state, and is sensitive to who else is trading at the same moment. The replay uses the best model we have; it doesn't know the ground truth because nobody does.
  • Fee schedules change. IB, Futu, and every venue we onboard update their pricing on their own cadence. The normalization pipeline has to track these updates; miss one and a quiet bias creeps in over weeks.
  • Tax and settlement accounting is downstream. Broker commissions are part of the cost model; tax treatment of gains and losses is not — that's the portfolio accounting layer. The two have to agree at year-end, but they're solved by different systems.
  • Market regime change breaks the comparison. Shadow replay assumes the structural cost model holds. A regime where bid-ask spreads widen or commissions rise eats the same proportional bite out of both live and shadow — the P&L difference stays small while both fall. Shadow tracking is a calibration tool, not a regime detector.

Documenting these, rather than hiding them inside "realistic assumptions," is how we avoid the trap of shadow replay quietly becoming another thing that silently papers over the gap.

The Lesson

The research layer is visible; everyone sees the strategy. The execution layer is where correctness gets earned — one cost layer at a time, mostly out of sight.

If your backtest predicts live within 10 basis points per month, your signal is fine and your cost model is solid. If it predicts within 50 basis points, your signal is probably fine and your cost model has work. If it predicts within 200 basis points or worse, don't touch the signal yet — fix the cost model first.

The specific fixes — execId late-binding on IB, FX normalization across Rust/Go/Python consumers, cash seeding normalization, venue-specific fee schedule loading — are Dnalyaw-specific. The generalizable instinct is: if backtest-live drift is your problem, the cost model is where to look first.

Concretely, if you're auditing an existing platform's calibration pipeline: the first question to ask is whether the backtest replay knows the commission arrived 2.3 seconds after the fill, or whether it booked everything synchronously. If it's the latter, you've found a months-long source of silent bias. Everything else is secondary until that's fixed.

This is the last in a four-post series on the engineering discipline behind vertical integration of research and execution — from flatten verifiers to prompt-cache stability to agent checkpoint design to the cost-model pipeline here. Different systems, same pattern: the edge lives in the invariants you enforce, not the models you train.

Flatten, cache, turn, cost. Each of them the place an ordinary PR can silently turn a good system into a broken one. And each of them the place a small amount of discipline compounds into the kind of correctness that shows up on the monthly reconciliation.