Project

Reinforcement Learning Market-Making Agent

Live App: rlmarketmaker-7yho3khdjxpcsppuwappuas.streamlit.app · GitHub: github.com/rcodeborg2311/rl-market-maker

Overview

A PPO-trained reinforcement learning agent that learns to quote bid/ask spreads on BTC-USD from live Coinbase Advanced Trade WebSocket L2 order book data. It outperforms a TWAP baseline on both PnL and inventory risk across 100 out-of-sample evaluation episodes.

Technical Highlights

  • PPO Agent Performance: Engineered a PPO-trained market-making agent in PyTorch achieving 2.4× higher mean episode PnL ($0.031 vs $0.013) and 8% lower inventory exposure than a TWAP baseline across 100 out-of-sample evaluation episodes.

  • Avellaneda-Stoikov Reward Function: Implemented an AS-inspired reward function r = spread_pnl − γq²σ² with a scaled inventory penalty and a 20-dimensional microstructure state vector including Kyle's lambda (price impact), realized volatility, order flow imbalance, spread z-score, and depth slope, using signals modeled on those employed by professional market makers such as Citadel Securities and Virtu Financial.
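The reward above can be sketched in a few lines. This is a minimal illustration of the stated formula r = spread_pnl − γq²σ²; the function name, argument names, and default γ are hypothetical, not taken from the repository.

```python
def as_reward(spread_pnl: float, inventory: float, volatility: float,
              gamma: float = 0.1) -> float:
    """Avellaneda-Stoikov-inspired reward: spread capture minus a
    scaled quadratic inventory-risk penalty, gamma * q^2 * sigma^2."""
    inventory_penalty = gamma * inventory ** 2 * volatility ** 2
    return spread_pnl - inventory_penalty
```

With zero inventory the penalty vanishes and the agent keeps its full spread capture; as |q| grows, the quadratic term pushes it toward flatter positions, especially in volatile regimes.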

  • Custom Gymnasium Environment: Built a custom Gymnasium environment with a causal fill simulator (next-snapshot trade matching); training uses GAE advantage estimation (λ=0.95), PPO clipping (ε=0.2), and LayerNorm Actor-Critic networks (256→128) with orthogonal weight initialization and a tanh-squashed Gaussian action distribution for bounded continuous spread control.
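The GAE step above follows a standard backward recursion, A_t = δ_t + γλ·A_{t+1} with δ_t = r_t + γV_{t+1} − V_t. A minimal NumPy sketch (function and argument names are illustrative, not the repo's API):

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation via backward recursion:
    delta_t = r_t + gamma * V_{t+1} - V_t
    A_t     = delta_t + gamma * lam * A_{t+1}"""
    values = np.append(values, last_value)  # bootstrap with V(s_T)
    advantages = np.zeros(len(rewards))
    adv = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        adv = delta + gamma * lam * adv
        advantages[t] = adv
    return advantages
```

With λ=0.95 this interpolates between one-step TD advantages (λ=0) and full Monte Carlo returns (λ=1), which is the shape invariant the test suite mentioned below checks.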

  • First-Principles Inventory Skewing: Derived dynamic inventory skewing from Kyle (1985) and Avellaneda-Stoikov (2008). The agent learns asymmetric bid/ask offsets conditioned on signed inventory and realized volatility, replicating the closed-form reservation price adjustment r* = m − qγσ²(T−t) without analytical assumptions.
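The closed-form adjustment the agent replicates can be written directly. A sketch of quoting around the Avellaneda-Stoikov reservation price r* = m − qγσ²(T−t); the function and parameter names are hypothetical:

```python
def reservation_quotes(mid, inventory, gamma, sigma, time_left, half_spread):
    """Skew quotes around the AS reservation price rather than the mid:
    r* = m - q * gamma * sigma^2 * (T - t).
    Positive inventory shifts both quotes down, making the bid less
    aggressive and the ask more aggressive, so inventory mean-reverts."""
    reservation = mid - inventory * gamma * sigma ** 2 * time_left
    return reservation - half_spread, reservation + half_spread
```

The RL agent learns this skew implicitly from signed inventory and realized volatility in its state vector, without assuming the closed-form model's dynamics.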

  • Bloomberg-Style Trading Dashboard: Delivered a Plotly Dash dashboard (500ms refresh) with real-time order book depth visualization, agent quote overlays, dual-axis PnL/inventory time series with drawdown fill, spread z-score history, and inventory risk bars. It falls back gracefully to a TWAP baseline when no trained model is present.

Test Coverage

26/26 pytest unit tests pass across 4 test modules covering OBI edge cases, fill-simulator partial fills, GAE shape invariants, and end-to-end PPO update correctness. Runs are seeded for full reproducibility (torch.manual_seed(42), np.random.seed(42)).

Key Results

Metric             | RL Agent            | TWAP Baseline
Mean episode PnL   | $0.031              | $0.013
Inventory exposure | Baseline − 8%       | Baseline
Eval episodes      | 100 out-of-sample   | 100
Test suite         | 26/26 passing       | N/A

Architecture

Component    | Detail
Algorithm    | PPO (clipped, ε=0.2)
Networks     | LayerNorm Actor-Critic (256→128)
State space  | 20-dim (OBI, Kyle-λ, vol, spread z-score, depth slope)
Advantage    | GAE (λ=0.95)
Action space | Continuous bid/ask offset (tanh-squashed Gaussian)
Data feed    | Coinbase Advanced Trade WebSocket L2
Dashboard    | Plotly Dash, 500ms refresh
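One of the state features above, order flow imbalance (OBI), has a standard definition that is easy to sketch. A minimal version over the top levels of the book; the function name and the `levels` default are illustrative assumptions, not the repository's exact implementation:

```python
def order_book_imbalance(bid_sizes, ask_sizes, levels=5):
    """OBI in [-1, 1]: (bid depth - ask depth) / (bid depth + ask depth)
    over the top `levels` price levels; returns 0.0 for an empty book
    (one of the edge cases the test suite covers)."""
    bid_depth = sum(bid_sizes[:levels])
    ask_depth = sum(ask_sizes[:levels])
    total = bid_depth + ask_depth
    return 0.0 if total == 0 else (bid_depth - ask_depth) / total
```

Values near +1 indicate buy-side pressure and near −1 sell-side pressure, which is why OBI is a useful short-horizon signal for skewing quotes.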