Zero-Downtime Hot Reload for Live Trading Systems

Every restart is a gap. In live trading, gaps cost money.

The default deployment model — stop the process, swap the code, restart — is cargo-culted from web services where a 200ms downtime window is invisible. In a trading bot executing against time-boxed prediction markets with 5-minute resolution windows, that restart window is a missed entry, a skipped settlement, or a position left unmanaged during volatility. The bot running my Polymarket strategy has a 91% win rate. Every forced outage is a direct tax on that edge.

PR #124 shipped on March 2nd and eliminated that tax entirely. The bot now reloads strategy logic, configuration, and behavioral parameters at runtime — without stopping execution, without interrupting open positions, without dropping the heartbeat.

Here's why the architecture works, and why most systems don't bother until it's too late.

Stopping the Process Is a Cognitive Default, Not an Engineering Requirement

The restart-to-deploy pattern persists because it's simple to reason about. You get a clean slate. No state bleed. No partially-loaded modules. Engineers reach for it the same way junior tacticians reach for frontal assaults — it's obvious, it's predictable, and it costs more than necessary.

The actual constraint is narrower: you need new code to execute future logic without corrupting current state. That's a scoping problem, not a lifecycle problem. Restarting the entire process to solve a scoping problem is the same as rebooting a server to fix a misconfigured route.

In all fighting, the direct method may be used for joining battle, but indirect methods will be needed in order to secure victory.
— Sun Tzu · The Art of War

The indirect method here: don't touch the process boundary. Operate inside it. The bot maintains a running event loop, active WebSocket connections, open position tracking, and a 30-second heartbeat writing state to bot_state. Interrupting that loop doesn't just pause the bot; it severs context that takes minutes to rebuild: reconnection latency, re-authentication, state reconstruction from disk. In a 5-minute market window, that's the whole window.

In-process reload is the architectural alternative. Reload the module graph inside the running process, swap out strategy callables, preserve everything the process already owns.

The Reload Surface Has to Be Deliberately Bounded

The naive hot reload implementation reloads everything. That's worse than a restart in most cases — you get partial state corruption, stale closures holding references to old class instances, and race conditions between the reload event and active execution paths.
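The stale-reference failure mode is easy to reproduce in isolation. The sketch below is illustrative only: demo_strategy is a hypothetical throwaway module written to a temporary directory, not one of the bot's files. It shows that a function reference captured before importlib.reload() keeps running the old code, while a lookup through the module object sees the new code.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def demo_stale_reference() -> tuple[str, str]:
    """Return (result via reference held before reload, result via fresh lookup)."""
    with tempfile.TemporaryDirectory() as tmp:
        # Hypothetical module name, created only for this demonstration.
        module_path = Path(tmp) / "demo_strategy.py"
        module_path.write_text("def signal():\n    return 'v1'\n")
        sys.path.insert(0, tmp)
        try:
            import demo_strategy

            held = demo_strategy.signal  # captured before the reload

            # The new source differs in length, so the cached bytecode cannot
            # be mistaken for current even if both writes share a timestamp.
            module_path.write_text("def signal():\n    return 'v2-reloaded'\n")
            importlib.invalidate_caches()
            importlib.reload(demo_strategy)

            # `held` still points at the old function object.
            return held(), demo_strategy.signal()
        finally:
            sys.path.remove(tmp)
            sys.modules.pop("demo_strategy", None)
```

Calling demo_stale_reference() returns ('v1', 'v2-reloaded'). The practical consequence is that long-lived code should resolve strategy callables through the module (or a registry) at call time rather than holding direct references across a reload.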

The discipline is in scoping the reload surface aggressively.

The bot's architecture separates concerns across a clean tiered structure. Tier 1 is the execution core: order management, position tracking, the claimer, fee logic. That layer never reloads. It holds live state — open bets, capital allocation, settlement tracking. Touching it mid-run is surgery on a beating heart.

Tier 2 is strategy logic: strategy.py, momentum_strategy.py, ta_engine.py, candle_engine.py, the overreaction detector. These are pure analytical functions. They take market data in and return signals out. No persistent state, no open connections, no side effects that would corrupt if swapped mid-cycle.

◈INSIGHT

The reload boundary isn't defined by what you want to change — it's defined by what carries no mutable runtime state. If a module owns a connection, a lock, or an open position reference, it's outside the reload surface. Everything else is fair game.

The implementation uses Python's importlib.reload() against a controlled module list, triggered by a file watch on a control JSON file. The bot's control panel (PR #81) already established the pattern of runtime behavioral control via shared JSON files: control.json for commands, status.json for state reporting. Hot reload slots into that same control plane. The file watcher detects a version bump in control.json, the reload handler fires between execution cycles, and the new strategy callables are live for the next market evaluation pass.
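A minimal sketch of a version-triggered reload handler along these lines. The field name reload_version, the class name ReloadHandler, and the exact allowlist contents are assumptions for illustration; the post does not show the real handler or the control.json schema.

```python
import importlib
import json
import sys
from pathlib import Path

# Hypothetical allowlist: only stateless strategy-tier modules are eligible.
RELOADABLE = ("strategy", "momentum_strategy", "ta_engine", "candle_engine")


class ReloadHandler:
    """Reload allowlisted modules when the control file's version changes."""

    def __init__(self, control_path: Path, modules=RELOADABLE):
        self.control_path = Path(control_path)
        self.modules = tuple(modules)
        # Start from the current version so startup does not trigger a reload.
        self.seen_version = self.read_version()

    def read_version(self):
        """Return the version field, or None if the file is missing or invalid."""
        try:
            data = json.loads(self.control_path.read_text())
        except (OSError, ValueError):
            return None
        # "reload_version" is an assumed field name.
        return data.get("reload_version") if isinstance(data, dict) else None

    def maybe_reload(self) -> list[str]:
        """Reload already-imported allowlisted modules on a version bump."""
        version = self.read_version()
        if version is None or version == self.seen_version:
            return []
        reloaded = []
        for name in self.modules:
            module = sys.modules.get(name)
            if module is not None:  # never import something new here
                importlib.reload(module)
                reloaded.append(name)
        self.seen_version = version
        return reloaded
```

The handler only touches modules that are both on the allowlist and already imported, so a typo in the control file cannot pull execution-core code into the reload surface.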

Crucially, the reload fires between cycles — not during one. The execution loop has natural synchronization points: after a position evaluation completes, before the next market scan begins. That's the insertion window. No locks needed. No async coordination overhead. The cycle boundary is the synchronization primitive.
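The cycle-boundary idea can be sketched as a single-threaded loop in which the reload hook is called at exactly one point. The names run_cycles, get_strategy, and on_cycle_boundary are hypothetical, and the sketch assumes all strategy evaluation happens on one thread; it is not the bot's actual loop.

```python
from typing import Callable, Iterable


def run_cycles(
    markets_per_cycle: Iterable[list],
    get_strategy: Callable[[], Callable],
    on_cycle_boundary: Callable[[], None],
) -> list:
    """Evaluate each cycle with one strategy; allow swaps only between cycles."""
    signals = []
    for markets in markets_per_cycle:
        # Resolve the strategy once per cycle so a whole cycle sees one version.
        strategy = get_strategy()
        for market in markets:
            signals.append(strategy(market))
        # The only point where a reload may run: nothing is mid-evaluation.
        on_cycle_boundary()
    return signals


# Usage: a swap requested at the boundary takes effect on the next cycle.
holder = {"fn": lambda market: ("v1", market)}


def swap_strategy() -> None:
    holder["fn"] = lambda market: ("v2", market)


results = run_cycles([["a", "b"], ["c"]], lambda: holder["fn"], swap_strategy)
```

Here results is [('v1', 'a'), ('v1', 'b'), ('v2', 'c')]: both markets in the first cycle are evaluated by the old callable, and the new one takes over only after the boundary.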

Runtime Control Is an Architectural Commitment, Not a Feature

PR #81 didn't just add a UI; it established a control plane that hot reload depends on. Eight backend endpoints in bot_control.py handle runtime commands: pause, resume, adjust position sizing, toggle strategy modes, inspect current state. The frontend mission control panel surfaces these to me as the operator without requiring a terminal session.

This matters architecturally because hot reload is not a deployment mechanism in isolation. It's a capability that only makes sense when the system has a broader runtime control model.

Operator-in-the-loop deployment is the pattern: the operator has full visibility into bot state before issuing a reload, can pause execution first if the reload surface extends to borderline modules, and can verify that status.json reports a healthy state post-reload. Compare that to a blind restart: you fire the process, watch logs, and hope state reconstructs cleanly.

Bot win rate (active): 91% on Polymarket directional calls, live money.

The control plane architecture also separates the executor from the operator interface. The bot is a principal that executes. Mission Control is the command surface for the operator. Neither needs to be the same process, the same repo, or (given a shared filesystem) even the same machine. They communicate through the shared JSON control files, a deliberately simple IPC mechanism that's inspectable, loggable, and trivially debuggable. No message broker. No RPC layer. A file on disk that both sides read and write, with writes kept atomic at the OS level. In practice that means writing a temporary file and renaming it over the target, since a plain in-place write can be observed half-finished.
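One common way to get atomic control-file writes is to write to a temporary file in the same directory and then rename it over the target with os.replace(). The sketch below assumes that approach; the function name write_json_atomic is hypothetical and the post does not say how the real writer is implemented.

```python
import json
import os
import tempfile
from pathlib import Path


def write_json_atomic(path: Path, payload: dict) -> None:
    """Write JSON so a reader sees the old file or the new one, never a partial."""
    path = Path(path)
    # The temp file lives in the target directory so the rename stays on one
    # filesystem, which os.replace() needs in order to be atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(payload, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_name, path)  # atomically swap the new file into place
    except BaseException:
        # Clean up the temp file if anything failed before the rename.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A call such as write_json_atomic(Path("control.json"), {"reload_version": 2}) would then never expose a half-written file to the bot's per-cycle read.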

That simplicity is a deliberate design decision. The alternative — a WebSocket command channel, a Redis pub/sub layer, an internal gRPC interface — introduces failure modes that matter in a live trading context. If the control channel goes down, can the bot still execute? With file-based control, yes. The bot reads control.json on each cycle. If the file is stale, it runs with the last known good configuration. Graceful degradation by default.
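The last-known-good behavior can be sketched as a small reader that keeps the most recent valid configuration in memory. ControlReader and the default values passed to it are assumptions for illustration, and this sketch treats a missing or unparseable file the same way as a stale one.

```python
import json
from pathlib import Path


class ControlReader:
    """Read the control file each cycle, falling back to the last valid config."""

    def __init__(self, path: Path, defaults: dict):
        self.path = Path(path)
        self.last_good = dict(defaults)

    def read(self) -> dict:
        try:
            data = json.loads(self.path.read_text())
        except (OSError, ValueError):
            # Missing, unreadable, or malformed file: keep running on what we have.
            return self.last_good
        if isinstance(data, dict):
            self.last_good = data
        return self.last_good
```

Because read() never raises on a bad control file, a broken or absent control channel degrades to "no new commands" rather than to a stopped bot.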

What This Pattern Generalizes To

The hot reload architecture for the trading bot is a specific instance of a broader principle: systems that operate continuously require a different deployment mental model than systems that tolerate interruption.

Most web services tolerate interruption. Load balancers route around restarting instances. Retry logic handles the gap. The cost of a restart is invisible to users. Engineers optimize for simplicity of deployment over continuity of execution, and that's the right call for that context.

Systems that cannot tolerate interruption — trading bots, real-time data pipelines, monitoring agents, anything with open financial positions or time-sensitive execution windows — need deployment designed around the constraint from the beginning, not retrofitted after the first missed trade.

⚔DOCTRINE

Design for continuity first. A system that cannot restart cleanly under pressure will also not reload cleanly. The hot reload capability forced cleaner module boundaries, clearer state ownership, and a more disciplined separation between execution core and strategy logic. The deployment requirement improved the architecture.

Continuity-first design means the question "how do we deploy this without stopping it" is asked before the first line of strategy code, not after the bot goes live. The module boundary discipline required for hot reload is the same discipline that makes the system testable, the same discipline that makes strategy iteration fast, the same discipline that makes the 91% win rate reproducible rather than brittle.

The bot running against live Polymarket markets now gets strategy updates the same way a warship gets new orders — continuously, without returning to port. The mission doesn't pause because the doctrine evolved.

