Project

Reinforcement Learning Market-Making Agent

Live App: rlmarketmaker-7yho3khdjxpcsppuwappuas.streamlit.app · GitHub: github.com/rcodeborg2311/rl-market-maker

Overview

A PPO-trained reinforcement learning agent that learns to quote bid/ask spreads on BTC-USD from live Coinbase Advanced Trade WebSocket L2 order book data. It outperforms a TWAP baseline on both PnL and inventory risk across 100 out-of-sample evaluation episodes.

Technical Highlights

  • PPO Agent Performance: Engineered a PPO-trained market-making agent in PyTorch achieving 2.4× higher mean episode PnL ($0.031 vs $0.013) and 8% lower inventory exposure than a TWAP baseline across 100 out-of-sample evaluation episodes.

  • Avellaneda-Stoikov Reward Function: Implemented an AS-inspired reward function r = spread_pnl − γq²σ² with a scaled inventory penalty and a 20-dimensional microstructure state vector including Kyle's lambda (price impact), realized volatility, order flow imbalance, spread z-score, and depth slope, using signals modeled on those employed by professional market makers such as Citadel Securities and Virtu Financial.
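The reward above can be sketched in a few lines. This is a minimal illustration of the stated formula r = spread_pnl − γq²σ²; the function name, argument names, and default γ are hypothetical, not taken from the repository.

```python
def as_reward(spread_pnl: float, inventory: float, volatility: float,
              gamma: float = 0.1) -> float:
    """Avellaneda-Stoikov-inspired reward: spread capture minus a
    scaled quadratic inventory-risk penalty, gamma * q^2 * sigma^2."""
    inventory_penalty = gamma * inventory ** 2 * volatility ** 2
    return spread_pnl - inventory_penalty
```

With zero inventory the penalty vanishes and the agent keeps its full spread capture; as |q| grows, the quadratic term pushes it toward flatter positions, especially in volatile regimes.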

  • Custom Gymnasium Environment: Built a custom Gymnasium environment with a causal fill simulator (next-snapshot trade matching); training uses GAE advantage estimation (λ=0.95), PPO clipping (ε=0.2), and LayerNorm Actor-Critic networks (256→128) with orthogonal weight initialization and a tanh-squashed Gaussian action distribution for bounded continuous spread control.
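The GAE step above follows a standard backward recursion, A_t = δ_t + γλ·A_{t+1} with δ_t = r_t + γV_{t+1} − V_t. A minimal NumPy sketch (function and argument names are illustrative, not the repo's API):

```python
import numpy as np

def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation via backward recursion:
    delta_t = r_t + gamma * V_{t+1} - V_t
    A_t     = delta_t + gamma * lam * A_{t+1}"""
    values = np.append(values, last_value)  # bootstrap with V(s_T)
    advantages = np.zeros(len(rewards))
    adv = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        adv = delta + gamma * lam * adv
        advantages[t] = adv
    return advantages
```

With λ=0.95 this interpolates between one-step TD advantages (λ=0) and full Monte Carlo returns (λ=1), which is the shape invariant the test suite mentioned below checks.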

  • First-Principles Inventory Skewing: Derived dynamic inventory skewing from Kyle (1985) and Avellaneda-Stoikov (2008). The agent learns asymmetric bid/ask offsets conditioned on signed inventory and realized volatility, replicating the closed-form reservation price adjustment r* = m − qγσ²(T−t) without analytical assumptions.
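The closed-form adjustment the agent replicates can be written directly. A sketch of quoting around the Avellaneda-Stoikov reservation price r* = m − qγσ²(T−t); the function and parameter names are hypothetical:

```python
def reservation_quotes(mid, inventory, gamma, sigma, time_left, half_spread):
    """Skew quotes around the AS reservation price rather than the mid:
    r* = m - q * gamma * sigma^2 * (T - t).
    Positive inventory shifts both quotes down, making the bid less
    aggressive and the ask more aggressive, so inventory mean-reverts."""
    reservation = mid - inventory * gamma * sigma ** 2 * time_left
    return reservation - half_spread, reservation + half_spread
```

The RL agent learns this skew implicitly from signed inventory and realized volatility in its state vector, without assuming the closed-form model's dynamics.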

  • Bloomberg-Style Trading Dashboard: Delivered a Plotly Dash dashboard (500ms refresh) with real-time order book depth visualization, agent quote overlays, dual-axis PnL/inventory time series with drawdown fill, spread z-score history, and inventory risk bars. It falls back gracefully to a TWAP baseline when no trained model is present.

Test Coverage

26/26 pytest unit tests pass across 4 test modules covering OBI edge cases, fill-simulator partial fills, GAE shape invariants, and end-to-end PPO update correctness. Runs are seeded for full reproducibility (torch.manual_seed(42), np.random.seed(42)).

Key Results

Metric             | RL Agent            | TWAP Baseline
Mean episode PnL   | $0.031              | $0.013
Inventory exposure | Baseline − 8%       | Baseline
Eval episodes      | 100 out-of-sample   | 100
Test suite         | 26/26 passing       | N/A

Architecture

Component    | Detail
Algorithm    | PPO (clipped, ε=0.2)
Networks     | LayerNorm Actor-Critic (256→128)
State space  | 20-dim (OBI, Kyle-λ, vol, spread z-score, depth slope)
Advantage    | GAE (λ=0.95)
Action space | Continuous bid/ask offset (tanh-squashed Gaussian)
Data feed    | Coinbase Advanced Trade WebSocket L2
Dashboard    | Plotly Dash, 500ms refresh
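One of the state features above, order flow imbalance (OBI), has a standard definition that is easy to sketch. A minimal version over the top levels of the book; the function name and the `levels` default are illustrative assumptions, not the repository's exact implementation:

```python
def order_book_imbalance(bid_sizes, ask_sizes, levels=5):
    """OBI in [-1, 1]: (bid depth - ask depth) / (bid depth + ask depth)
    over the top `levels` price levels; returns 0.0 for an empty book
    (one of the edge cases the test suite covers)."""
    bid_depth = sum(bid_sizes[:levels])
    ask_depth = sum(ask_sizes[:levels])
    total = bid_depth + ask_depth
    return 0.0 if total == 0 else (bid_depth - ask_depth) / total
```

Values near +1 indicate buy-side pressure and near −1 sell-side pressure, which is why OBI is a useful short-horizon signal for skewing quotes.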