hybrid-perps-spec

Launch & Rollback Manual (MVP)

This document standardizes the canary strategy, emergency response, control matrix, and rollback procedures for the perpetual futures platform (Approach 2). Its goal is to ensure smooth Phase 2 and Phase 3 launches with reliable fault recovery.


Canary Strategy

Phase 2 Canary (Routing + Hedging)

Objective: 7 days of full HL routing with zero failures and verified risk-calculation precision

Cycle: 7 days (3 stages)

Stage 1: Full HL Routing (Days 1-3)

Stage 2: Risk Monitoring (Days 4-5)

Stage 3: Simulated Hedging (Days 6-7)

Pass Criteria

Failure: error rate >1% → roll back to Phase 1, fix, and re-run the canary
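
The failure rule above can be read as a simple gate on the observed error rate. The following is a minimal sketch, assuming the error rate is failed requests divided by total requests over the canary window; the function name `phase2_canary_verdict` and the return labels are hypothetical and not defined by this manual.

```python
def phase2_canary_verdict(total_requests: int, failed_requests: int,
                          max_error_rate: float = 0.01) -> str:
    """Return "rollback" when the error rate exceeds the threshold, else "continue".

    Assumption: error rate = failed / total over the canary window.
    """
    if total_requests < 0 or failed_requests < 0 or failed_requests > total_requests:
        raise ValueError("counts must satisfy 0 <= failed <= total")
    if total_requests == 0:
        # No traffic yet: nothing to judge, so keep the canary running.
        return "continue"
    error_rate = failed_requests / total_requests
    # The rule is a strict ">1%", so exactly 1% does not trigger a rollback.
    return "rollback" if error_rate > max_error_rate else "continue"
```

A verdict of "rollback" corresponds to returning to Phase 1, fixing the issue, and starting the canary again.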


Phase 3 Canary (Internalization Launch)

Objective: 1-week internal test with BTC/ETH internalization open, followed by a gradual ramp to production

Cycle: 10 days (4 stages)

Stage 1: Ultra-Conservative (Days 1-3)

Stage 2: Low Risk (Days 4-5)

Stage 3: Target Config (Days 6-7)

Stage 4: Full Open (Days 8-10)
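
The stage calendars for both canaries can be kept as data so that tooling can tell which stage a given day belongs to. This is an illustrative sketch only: the day ranges and stage names are copied from this manual, while `CANARY_SCHEDULE` and `stage_for_day` are hypothetical names.

```python
from typing import Dict, List, Tuple

# (first_day, last_day, stage_name), copied from the Phase 2 and Phase 3 stage lists.
CANARY_SCHEDULE: Dict[int, List[Tuple[int, int, str]]] = {
    2: [(1, 3, "Full HL Routing"),
        (4, 5, "Risk Monitoring"),
        (6, 7, "Simulated Hedging")],
    3: [(1, 3, "Ultra-Conservative"),
        (4, 5, "Low Risk"),
        (6, 7, "Target Config"),
        (8, 10, "Full Open")],
}

def stage_for_day(phase: int, day: int) -> str:
    """Return the stage name for a 1-based canary day of the given phase."""
    if phase not in CANARY_SCHEDULE:
        raise ValueError(f"unknown phase: {phase}")
    for first, last, name in CANARY_SCHEDULE[phase]:
        if first <= day <= last:
            return name
    raise ValueError(f"day {day} is outside the Phase {phase} canary cycle")
```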

Pass Criteria

Failure:


Control Matrix (Risk Manager Controlled)

All switches live in the config center (Consul/etcd) and take effect globally in under 10 s, with no restart required.

Switch | Type | Default | Scope | Permission | Rollback | Latency
------ | ---- | ------- | ----- | ---------- | -------- | -------
global_betting_enabled | bool | false | Global internalization toggle | Super Admin | Switch → HL_MODE | <1s
routing_mode | enum | HL_MODE | Routing mode (HL_MODE/NORMAL/BETTING) | Risk Manager | Back to HL_MODE | <1s
normal_threshold | money | $10,000 | NORMAL_MODE size limit | Risk Manager | Increase/set to $0 | <5s
betting_threshold | money | $50,000 | BETTING_MODE size limit | Risk Manager | Lower/disable | <5s
auto_mode_switch | bool | false | Auto mode switching | Risk Manager | Disable | <5s
hedge_enabled | bool | true | Hedge engine toggle | Risk Manager | Disable → manual | <1s
hedge_max_leverage | float | 3.0x | Max hedge leverage | Risk Manager | Lower to 1.5x | <5s
per_symbol_betting | list | BTC,ETH | Symbols allowed for internalization | Risk Manager | Clear (ban all) | <5s
liquidation_enabled | bool | true | Liquidation engine | Super Admin | Disable (manual) | <1s
liquidation_check_interval_ms | int | 500 | Liquidation detection frequency | Risk Manager | Increase to 2000 | <5s
withdrawal_enabled | bool | true | Withdrawal toggle (run-on defense) | Super Admin | Disable | <1s
max_internal_exposure_usd | money | $500,000 | Max internal net exposure/symbol | Risk Manager | Lower | <5s
hedge_threshold_tier1 | money | $100,000 | Tier 1 hedge trigger (50%) | Risk Manager | Raise to $500K | <5s
hedge_threshold_tier2 | money | $500,000 | Tier 2 hedge trigger (80%) | Risk Manager | Raise to $1M | <5s
risk_reserve_min_usd | money | $200,000 | Reserve minimum (pause if below) | Risk Manager | Raise to $300K | <5s
daily_net_loss_limit | money | $500,000 | Daily net loss circuit breaker | Risk Manager | Raise/disable | <5s
max_order_size_internal | money | $10,000 | Max single internal order | Risk Manager | Lower to $1K | <5s
force_hl_volatility_pct | float | 5.0% | Force HL if volatility >X | Risk Manager | Raise to 10% | <5s
force_hl_latency_ms | int | 500 | Force HL if HL latency >X | Risk Manager | Raise to 1000 | <5s
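
A switch registry that carries the permission and rollback columns makes it possible to check who may change a switch and to apply its rollback value in one step. The sketch below covers only rows whose rollback target is a concrete value in the matrix. The `Switch` class, `can_change`, and `apply_rollback` are hypothetical names, and the rule that a Super Admin may also change Risk Manager switches is an assumption, not something this manual states.

```python
from dataclasses import dataclass
from typing import Any, Dict

@dataclass(frozen=True)
class Switch:
    default: Any
    permission: str        # "Super Admin" or "Risk Manager", as in the matrix
    rollback: Any = None   # concrete rollback value, when the matrix names one

# Subset of the control matrix; values are copied from the rows above.
SWITCHES: Dict[str, Switch] = {
    "global_betting_enabled": Switch(False, "Super Admin"),
    "routing_mode": Switch("HL_MODE", "Risk Manager", "HL_MODE"),
    "hedge_max_leverage": Switch(3.0, "Risk Manager", 1.5),
    "liquidation_enabled": Switch(True, "Super Admin", False),
    "liquidation_check_interval_ms": Switch(500, "Risk Manager", 2000),
    "withdrawal_enabled": Switch(True, "Super Admin", False),
    "max_order_size_internal": Switch(10_000, "Risk Manager", 1_000),
    "force_hl_volatility_pct": Switch(5.0, "Risk Manager", 10.0),
    "force_hl_latency_ms": Switch(500, "Risk Manager", 1000),
}

# Assumption: roles are ordered, so Super Admin can do anything a Risk Manager can.
ROLE_RANK = {"Risk Manager": 1, "Super Admin": 2}

def can_change(role: str, name: str) -> bool:
    """True if `role` is allowed to modify switch `name`."""
    return ROLE_RANK.get(role, 0) >= ROLE_RANK[SWITCHES[name].permission]

def apply_rollback(config: Dict[str, Any], name: str, role: str) -> Dict[str, Any]:
    """Return a copy of `config` with switch `name` set to its rollback value."""
    switch = SWITCHES[name]
    if not can_change(role, name):
        raise PermissionError(f"{role} may not change {name}")
    if switch.rollback is None:
        raise ValueError(f"{name} has no fixed rollback value in this sketch")
    updated = dict(config)
    updated[name] = switch.rollback
    return updated
```

In a real deployment the returned config would be written to the config center; that write is deliberately left out here because the manual does not specify the client or key layout.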

Permission Levels:

Change Audit:


Alert System & Emergency Response

P0 Alerts (5-min response, on-call must respond)

Alert | Trigger | Immediate Action | Further Action
----- | ------- | ---------------- | --------------
HL disconnection | WS/REST down >1min | Slack/PagerDuty, pause new orders | Reconnect; if >5min → manual
HL account margin <150% | margin_ratio < 150% | Pause all HL new opens | Top up to >200%, reduce leverage
Hedge account margin <200% | margin_ratio < 200% | Same, reduce to 2x leverage | Emergency top-up/closeout
Liquidation engine failure | Detection fails 5+ times or >10% execution failure | liquidation_enabled=false, manual liquidate | Debug, fix, restart
Hot wallet <$100K | Available <$100K | Pause large withdrawals (>$10K) | Transfer from cold wallet
Single-order deviation >5% | Price discrepancy >5% | Pause symbol HL routing → internal | Debug cause
User asset mismatch >0.1% | Discrepancy >0.1% | Pause trading, start manual reconciliation | Manual adjustment
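
The trigger column above can be evaluated mechanically against a metrics snapshot. The following is a minimal sketch: the thresholds are taken from the table, but the function name `p0_alerts` and every metric key (for example `hl_down_seconds`) are hypothetical, and ratios are assumed to be expressed as fractions (1.5 means 150%).

```python
from typing import Dict, List

def p0_alerts(m: Dict[str, float]) -> List[str]:
    """Return the names of P0 alerts whose trigger condition holds for snapshot `m`."""
    alerts: List[str] = []
    if m["hl_down_seconds"] > 60:                 # WS/REST down >1min
        alerts.append("HL disconnection")
    if m["hl_margin_ratio"] < 1.50:               # margin_ratio < 150%
        alerts.append("HL account margin <150%")
    if m["hedge_margin_ratio"] < 2.00:            # margin_ratio < 200%
        alerts.append("Hedge account margin <200%")
    if (m["liq_detection_failures"] >= 5          # detection fails 5+ times
            or m["liq_execution_failure_rate"] > 0.10):
        alerts.append("Liquidation engine failure")
    if m["hot_wallet_available_usd"] < 100_000:   # available <$100K
        alerts.append("Hot wallet <$100K")
    if m["order_price_deviation"] > 0.05:         # price discrepancy >5%
        alerts.append("Single-order deviation >5%")
    if m["user_asset_mismatch"] > 0.001:          # discrepancy >0.1%
        alerts.append("User asset mismatch >0.1%")
    return alerts
```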

SLA: confirm within 5 min, act within 10 min. No response after 5 min → escalate.
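
The SLA can be tracked as a small state function over alert timestamps. This is a sketch under stated assumptions: both the 5-minute confirmation window and the 10-minute action window are measured from the time the alert was raised, and the state labels and the name `sla_state` are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional

CONFIRM_SLA = timedelta(minutes=5)   # on-call must confirm within 5 min
ACTION_SLA = timedelta(minutes=10)   # mitigation must start within 10 min

def sla_state(raised_at: datetime, now: datetime,
              acked_at: Optional[datetime] = None,
              acted_at: Optional[datetime] = None) -> str:
    """Classify a P0 alert against the confirm/act SLA.

    Assumption: both windows start at `raised_at`.
    """
    elapsed = now - raised_at
    if acked_at is None:
        # No response after 5 min means the alert is escalated.
        return "escalate" if elapsed > CONFIRM_SLA else "awaiting_ack"
    if acted_at is None:
        return "action_overdue" if elapsed > ACTION_SLA else "awaiting_action"
    return "handled"
```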


Rollback Decision Tree

Incident
    ↓
[Severity Assessment]
    ├─ Critical Path Failure → Immediate Rollback
    ├─ Data Inconsistency → 5-min Rollback
    └─ Non-Critical → Monitor/Fix
    ↓
[Rollback Level]
    ├─ Single Module → Disable that switch + Fix
    ├─ Cross-Module → HL_MODE + disable internalization
    └─ Full System → Degradation Level 3
    ↓
[Execute Rollback]
    └─ Modify config + observe 5min + confirm
    ↓
[Fix & Restart]
    └─ Fix + unit tests + 5% canary → full
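
The first two branches of the tree can be encoded as lookup tables, which keeps runbook tooling and the diagram in sync. This is an illustrative sketch: the category keys (`critical_path_failure`, `single_module`, and so on) and the function `rollback_plan` are hypothetical, and treating a non-critical incident as requiring no rollback action is an assumption drawn from the "Monitor/Fix" branch.

```python
from typing import Dict, Optional

# Severity assessment → how fast to roll back.
TIMING = {
    "critical_path_failure": "immediate rollback",
    "data_inconsistency": "rollback within 5 min",
    "non_critical": "monitor and fix",
}

# Rollback level → what to change.
ACTIONS = {
    "single_module": "disable the affected switch, then fix",
    "cross_module": "set routing_mode=HL_MODE and disable internalization",
    "full_system": "enter Degradation Level 3",
}

def rollback_plan(severity: str, scope: Optional[str] = None) -> Dict[str, Optional[str]]:
    """Map an incident's severity and scope to a timing and a rollback action."""
    if severity not in TIMING:
        raise ValueError(f"unknown severity: {severity}")
    if severity == "non_critical":
        # Assumption: non-critical incidents are monitored and fixed without rollback.
        return {"timing": TIMING[severity], "action": None}
    if scope not in ACTIONS:
        raise ValueError(f"unknown scope: {scope}")
    return {"timing": TIMING[severity], "action": ACTIONS[scope]}
```

After the chosen action is applied, the tree still calls for a 5-minute observation period and confirmation before the fix-and-restart step.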

Degradation Levels

Level 1: Partial Function (Controllable Risk)

Trigger: Internalization/Liquidation failure (HL OK)

Available:

Switch: global_betting_enabled=false, hedge_enabled=false, routing_mode=NORMAL_MODE


Level 2: Read-Only + Limited (Asset Protection)

Trigger: Liquidation completely broken / Multi-module failures

Available:

Switch: global_betting_enabled=false, routing_mode=HL_MODE, liquidation_enabled=false


Level 3: Maintenance Mode (Emergency Stop)

Trigger: Full outage / Database unavailable / Assets at risk

Available: None (static maintenance page)

Switch: global_api_enabled=false → 503 Service Unavailable

Permission: Super Admin only
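
The three degradation levels reduce to switch presets, so entering a level can be a single config merge. The sketch below copies the switch values from the "Switch:" lines of each level; `DEGRADATION_PRESETS` and `apply_degradation` are hypothetical names, and only the Level 3 "Super Admin only" rule is enforced because it is the only permission this section states.

```python
from typing import Any, Dict

# Switch values copied from the "Switch:" line of each degradation level.
DEGRADATION_PRESETS: Dict[int, Dict[str, Any]] = {
    1: {"global_betting_enabled": False,
        "hedge_enabled": False,
        "routing_mode": "NORMAL_MODE"},
    2: {"global_betting_enabled": False,
        "routing_mode": "HL_MODE",
        "liquidation_enabled": False},
    3: {"global_api_enabled": False},  # API answers 503 Service Unavailable
}

def apply_degradation(config: Dict[str, Any], level: int, role: str) -> Dict[str, Any]:
    """Return a copy of `config` with the preset for `level` merged in."""
    if level not in DEGRADATION_PRESETS:
        raise ValueError(f"unknown degradation level: {level}")
    if level == 3 and role != "Super Admin":
        raise PermissionError("Level 3 (maintenance mode) is Super Admin only")
    # Switches not named by the preset keep their current values.
    return {**config, **DEGRADATION_PRESETS[level]}
```

For example, applying Level 2 to a config with `routing_mode` set to `BETTING_MODE` leaves unrelated switches such as `withdrawal_enabled` untouched while forcing `routing_mode` to `HL_MODE`.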


Rollback Checklist

Before any rollback:


Implementation Highlights

1. Centralized Switch Management

2. Canary Infrastructure

3. Failure Detection & Auto-Recovery

4. Comprehensive Testing

5. Monitoring & Notifications

6. Documentation & Training