Performance

Performance Overview

HFT-grade performance engineering - 434M orders/sec with sub-microsecond latency

LX is engineered for High-Frequency Trading (HFT) performance, achieving metrics that compete with the world's fastest exchanges.

Performance Targets vs Achieved

| Metric | Target | Achieved | Notes |
|---|---|---|---|
| Order Latency (GPU) | <1 μs | 2 ns | 500x better than target |
| Order Latency (CPU) | <1 μs | 487 ns | Lock-free data structures |
| Throughput (CPU) | 1M/sec | 1.01M/sec | Pure Go, no CGO |
| Throughput (GPU) | 100M/sec | 434M/sec | MLX on Apple Silicon |
| Memory per 100K orders | <1 GB | 847 MB | Object pooling |
| P99 Latency | <10 μs | 2.3 μs | Tail latency controlled |
| GC Pause | <1 ms | 0.3 ms | GOGC tuning |

Comparison with Major Exchanges

| Exchange | Throughput | Latency | Technology |
|---|---|---|---|
| LX (MLX) | 5.95M msgs/sec | 0.68 μs | MLX/Metal GPU |
| LX (GPU) | 434M orders/sec | 2 ns | MLX batch processing |
| LX (C++) | 1.08M msgs/sec | 4.87 μs | Pure C++ |
| LX (CPU) | 1.01M orders/sec | 487 ns | Pure Go |
| CME Globex | 100K/sec | 1-5 ms | Custom hardware |
| Binance | 1.4M/sec | 5-10 ms | Java/C++ |
| FTX (historical) | 1M/sec | 500 μs | Rust |
| NYSE Arca | 500K/sec | 50 μs | Custom FPGA |

FIX Protocol Performance (December 2024)

| Engine | NewOrderSingle | ExecutionReport | MarketDataSnapshot | Avg Latency |
|---|---|---|---|---|
| Pure Go | 163K/sec | 124K/sec | 332K/sec | 33.5 μs |
| Hybrid Go/C++ | 167K/sec | 378K/sec | 616K/sec | 17.3 μs |
| Pure C++ | 444K/sec | 804K/sec | 1.08M/sec | 8.2 μs |
| Rust | 484K/sec | 232K/sec | 586K/sec | 11.9 μs |
| TypeScript | 45K/sec | 21K/sec | 38K/sec | 159.2 μs |
| MLX (Apple Silicon) | 3.12M/sec | 4.27M/sec | 5.95M/sec | 1.08 μs |

*MLX achieves a 7-40x throughput improvement over CPU implementations via Metal GPU parallelism.*

Architecture for Performance

┌─────────────────────────────────────────────────────────────────────────┐
│                     PERFORMANCE-CRITICAL PATH                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│   Network Layer (0.5-2 us)                                              │
│   ┌───────────────────────────────────────────────────────────────┐    │
│   │  Kernel Bypass (io_uring) │ TCP_NODELAY │ SO_BUSY_POLL        │    │
│   │  Zero-Copy Receive │ Multicast Groups │ DPDK (optional)       │    │
│   └───────────────────────────────────────────────────────────────┘    │
│                              ↓                                          │
│   Protocol Layer (100-500 ns)                                          │
│   ┌───────────────────────────────────────────────────────────────┐    │
│   │  FlatBuffers (zero-copy) │ Protobuf (pooled) │ Binary codec   │    │
│   │  Pre-allocated buffers │ Arena allocation │ No reflection     │    │
│   └───────────────────────────────────────────────────────────────┘    │
│                              ↓                                          │
│   Matching Engine (2 ns - 487 ns)                                      │
│   ┌───────────────────────────────────────────────────────────────┐    │
│   │  Lock-free orderbook │ SIMD price comparison │ CPU pinning    │    │
│   │  NUMA-aware allocation │ Cache-line alignment │ Branch-free   │    │
│   └───────────────────────────────────────────────────────────────┘    │
│                              ↓                                          │
│   Persistence Layer (async, non-blocking)                              │
│   ┌───────────────────────────────────────────────────────────────┐    │
│   │  Write-ahead log │ Async commit │ Memory-mapped │ Batching    │    │
│   └───────────────────────────────────────────────────────────────┘    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Performance Principles

1. Zero Allocation on Hot Path

// BAD: Allocates on every order
func (ob *OrderBook) Match(order *Order) *Trade {
    trade := &Trade{} // Heap allocation
    return trade
}

// GOOD: Pool-allocated, zero heap allocations
func (ob *OrderBook) Match(order *Order) *Trade {
    trade := ob.tradePool.Get() // Pre-allocated pool
    trade.Reset()
    return trade
}

2. Lock-Free Data Structures

// Lock-free price level using atomic operations
type PriceLevel struct {
    price    atomic.Int64
    quantity atomic.Int64
    orders   atomic.Pointer[OrderList]
}

func (pl *PriceLevel) AddQuantity(qty int64) int64 {
    return pl.quantity.Add(qty)
}
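On top of the same struct, the `atomic.Pointer` can be updated without a lock via copy-on-write and compare-and-swap; the contents of `OrderList` here are an assumption:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// OrderList is an illustrative immutable snapshot of resting order IDs.
type OrderList struct {
	IDs []uint64
}

type PriceLevel struct {
	price    atomic.Int64
	quantity atomic.Int64
	orders   atomic.Pointer[OrderList]
}

// AddOrder publishes a new snapshot with compare-and-swap, retrying on
// contention instead of blocking (copy-on-write, lock-free).
func (pl *PriceLevel) AddOrder(id uint64, qty int64) {
	for {
		old := pl.orders.Load()
		var ids []uint64
		if old != nil {
			ids = append(ids, old.IDs...)
		}
		ids = append(ids, id)
		if pl.orders.CompareAndSwap(old, &OrderList{IDs: ids}) {
			pl.quantity.Add(qty)
			return
		}
		// Another goroutine won the race; loop reloads the fresh snapshot.
	}
}

func main() {
	var pl PriceLevel
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(n uint64) { defer wg.Done(); pl.AddOrder(n, 1) }(uint64(i))
	}
	wg.Wait()
	fmt.Println(len(pl.orders.Load().IDs), pl.quantity.Load()) // 100 100
}
```

The copy-per-insert trades memory churn for wait-freedom on readers, which is why readers on the hot path never see a partially updated list.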

3. Cache-Line Optimization

// Padded out to exactly one 64-byte cache line so adjacent Orders in a
// slice never share a line (prevents false sharing)
type Order struct {
    ID       uint64
    Price    int64
    Quantity int64
    Side     uint8
    _        [39]byte // 25 bytes of fields + 39 bytes padding = 64
}

4. NUMA-Aware Memory

// Pin the worker goroutine's thread to one CPU so its allocations stay
// on the local NUMA node (Linux-only; uses golang.org/x/sys/unix)
func (e *Engine) StartWorker(cpuID int) {
    runtime.LockOSThread()
    var set unix.CPUSet
    set.Zero()
    set.Set(cpuID)
    if err := unix.SchedSetaffinity(0, &set); err != nil { // 0 = current thread
        log.Fatalf("pin to CPU %d: %v", cpuID, err)
    }

    // Allocations by this thread now land on the local NUMA node
    e.localOrderBook = NewOrderBook()
}

Quick Performance Audit

# Full benchmark suite
make bench

# CPU profiling
go test -cpuprofile=cpu.prof -bench=BenchmarkOrderBook ./pkg/lx/
go tool pprof -http=:8080 cpu.prof

# Memory profiling
go test -memprofile=mem.prof -bench=BenchmarkOrderBook ./pkg/lx/
go tool pprof -http=:8081 mem.prof

# Trace analysis
go test -trace=trace.out -bench=BenchmarkOrderBook ./pkg/lx/
go tool trace trace.out

# Linux perf (kernel-level)
perf stat -e cycles,instructions,cache-misses ./lxd --benchmark
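The commands above target a `BenchmarkOrderBook`. A minimal self-contained shape for such a benchmark, with stub types standing in for the real `pkg/lx` ones, driven here via `testing.Benchmark` so it runs as a plain program:

```go
package main

import (
	"fmt"
	"testing"
)

// Stub order and book so the example compiles on its own; the real
// types live in pkg/lx.
type Order struct {
	ID              uint64
	Price, Quantity int64
	Side            uint8
}

type OrderBook struct{ matched uint64 }

func (ob *OrderBook) Match(o *Order) { ob.matched++ }

func main() {
	ob := &OrderBook{}
	// testing.Benchmark drives b.N exactly like `go test -bench` would,
	// so the same function body can live in a *_test.go file.
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs() // surfaces hot-path allocations in the result
		for i := 0; i < b.N; i++ {
			o := Order{ID: uint64(i), Price: 100, Quantity: 1}
			ob.Match(&o)
		}
	})
	fmt.Println("allocs/op:", res.AllocsPerOp())
}
```

`b.ReportAllocs()` is the cheapest regression guard for the zero-allocation principle: any nonzero allocs/op on the match path shows up in every benchmark run.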

Performance Tuning Checklist

Operating System

# Increase network buffers
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
sysctl -w net.ipv4.tcp_wmem="4096 87380 134217728"

# Note: Nagle's algorithm (TCP_NODELAY) is disabled per socket in code,
# not via sysctl

# Enable busy polling
sysctl -w net.core.busy_read=50
sysctl -w net.core.busy_poll=50

# Huge pages (2MB pages reduce TLB misses)
echo 1024 > /proc/sys/vm/nr_hugepages

Go Runtime

# Optimal GC settings for low latency
export GOGC=400              # Less frequent GC
export GOMEMLIMIT=8GiB       # Soft memory limit (Go 1.19+)
export GOMAXPROCS=8          # Match physical cores

# Memory ballast (reduces GC frequency; largely superseded by GOMEMLIMIT on Go 1.19+)
# In code: var ballast = make([]byte, 10<<30) // 10GB ballast

Hardware Requirements

| Component | Minimum | Recommended | Ultra |
|---|---|---|---|
| CPU | 8 cores | 16+ cores | 64+ cores |
| RAM | 32 GB | 128 GB | 512 GB |
| Network | 10 Gbps | 25 Gbps | 100 Gbps |
| Storage | NVMe SSD | NVMe RAID | Optane |
| GPU | - | Apple M2 | M2 Ultra |

Performance Documentation

Key Metrics to Monitor

// Critical performance metrics
type Metrics struct {
    OrderLatencyP50  time.Duration `metric:"order_latency_p50"`
    OrderLatencyP99  time.Duration `metric:"order_latency_p99"`
    OrderLatencyP999 time.Duration `metric:"order_latency_p999"`

    MatchThroughput float64 `metric:"match_throughput_per_sec"`
    OrdersInFlight  int64   `metric:"orders_in_flight"`

    GCPauseP99 time.Duration `metric:"gc_pause_p99"`
    HeapInUse  uint64        `metric:"heap_in_use_bytes"`

    NetworkRTT     time.Duration `metric:"network_rtt_us"`
    TCPRetransmits int64         `metric:"tcp_retransmits"`
}

Performance SLOs

| Metric | Target | Alert Threshold |
|---|---|---|
| Order Latency P99 | <10 μs | >50 μs |
| Match Latency P99 | <1 μs | >5 μs |
| GC Pause P99 | <1 ms | >5 ms |
| Throughput | >1M/sec | <500K/sec |
| Error Rate | <0.001% | >0.01% |
| Network RTT | <100 μs | >1 ms |