Performance
Performance Overview: HFT-grade performance engineering - 434M orders/sec with sub-microsecond latency
LX is engineered for High-Frequency Trading (HFT) performance, achieving metrics that compete with the world's fastest exchanges.
| Metric | Target | Achieved | Notes |
|---|---|---|---|
| Order Latency (GPU) | <1 μs | 2 ns | 500x better than target |
| Order Latency (CPU) | <1 μs | 487 ns | Lock-free data structures |
| Throughput (CPU) | 1M/sec | 1.01M/sec | Pure Go, no CGO |
| Throughput (GPU) | 100M/sec | 434M/sec | MLX on Apple Silicon |
| Memory per 100K orders | <1 GB | 847 MB | Object pooling |
| P99 Latency | <10 μs | 2.3 μs | Tail latency controlled |
| GC Pause | <1 ms | 0.3 ms | GOGC tuning |
| Exchange | Throughput | Latency | Technology |
|---|---|---|---|
| LX (MLX) | 5.95M msgs/sec | 0.68 μs | MLX/Metal GPU |
| LX (GPU) | 434M orders/sec | 2 ns | MLX batch processing |
| LX (C++) | 1.08M msgs/sec | 4.87 μs | Pure C++ |
| LX (CPU) | 1.01M orders/sec | 487 ns | Pure Go |
| CME Globex | 100K/sec | 1-5 ms | Custom hardware |
| Binance | 1.4M/sec | 5-10 ms | Java/C++ |
| FTX (historical) | 1M/sec | 500 μs | Rust |
| NYSE Arca | 500K/sec | 50 μs | Custom FPGA |
| Engine | NewOrderSingle | ExecutionReport | MarketDataSnapshot | Avg Latency |
|---|---|---|---|---|
| Pure Go | 163K/sec | 124K/sec | 332K/sec | 33.5 μs |
| Hybrid Go/C++ | 167K/sec | 378K/sec | 616K/sec | 17.3 μs |
| Pure C++ | 444K/sec | 804K/sec | 1.08M/sec | 8.2 μs |
| Rust | 484K/sec | 232K/sec | 586K/sec | 11.9 μs |
| TypeScript | 45K/sec | 21K/sec | 38K/sec | 159.2 μs |
| MLX (Apple Silicon) | 3.12M/sec | 4.27M/sec | 5.95M/sec | 1.08 μs |
*MLX achieves a 7-40x throughput improvement over CPU implementations via Metal GPU parallelism.*
┌─────────────────────────────────────────────────────────────────────────┐
│ PERFORMANCE-CRITICAL PATH │
├─────────────────────────────────────────────────────────────────────────┤
│ │
│ Network Layer (0.5-2 us) │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Kernel Bypass (io_uring) │ TCP_NODELAY │ SO_BUSY_POLL │ │
│ │ Zero-Copy Receive │ Multicast Groups │ DPDK (optional) │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ Protocol Layer (100-500 ns) │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ FlatBuffers (zero-copy) │ Protobuf (pooled) │ Binary codec │ │
│ │ Pre-allocated buffers │ Arena allocation │ No reflection │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ Matching Engine (2 ns - 487 ns) │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Lock-free orderbook │ SIMD price comparison │ CPU pinning │ │
│ │ NUMA-aware allocation │ Cache-line alignment │ Branch-free │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ ↓ │
│ Persistence Layer (async, non-blocking) │
│ ┌───────────────────────────────────────────────────────────────┐ │
│ │ Write-ahead log │ Async commit │ Memory-mapped │ Batching │ │
│ └───────────────────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────────────────────┘
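The persistence layer stays off the matching hot path by design: completed trades are handed to an asynchronous writer that batches appends and commits on an interval. A minimal sketch of that pattern (the `wal.Writer` type, buffer size, and commit interval are illustrative, not the LX implementation):

```go
package wal

import (
    "os"
    "time"
)

// Writer batches append-only log records off the matching hot path:
// the matching thread only does a buffered channel send, and fsync
// happens once per commit interval in the background.
type Writer struct {
    ch   chan []byte
    file *os.File
}

func NewWriter(path string) (*Writer, error) {
    f, err := os.OpenFile(path, os.O_CREATE|os.O_WRONLY|os.O_APPEND, 0o644)
    if err != nil {
        return nil, err
    }
    w := &Writer{ch: make(chan []byte, 1<<16), file: f}
    go w.loop()
    return w, nil
}

// Append hands a serialized record to the background writer; it never
// blocks the matching engine unless the 64K-entry buffer is full.
func (w *Writer) Append(record []byte) { w.ch <- record }

func (w *Writer) loop() {
    tick := time.NewTicker(time.Millisecond) // commit interval
    defer tick.Stop()
    for {
        select {
        case rec := <-w.ch:
            w.file.Write(rec) // error handling elided for brevity
        case <-tick.C:
            w.file.Sync() // batched commit: one fsync per tick
        }
    }
}
```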
// BAD: Allocates a new Trade on the heap for every order
func (ob *OrderBook) Match(order *Order) *Trade {
    trade := &Trade{} // heap allocation on the hot path
    return trade
}

// GOOD: Pool-allocated, zero heap allocations
func (ob *OrderBook) Match(order *Order) *Trade {
    trade := ob.tradePool.Get() // pre-allocated pool
    trade.Reset()               // clear stale fields before reuse
    return trade
}
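A minimal sketch of how such a pool can be wired up with `sync.Pool` (the `tradePool` type, the `Trade` fields, and `Reset` shown here are illustrative; the repository's actual types may differ):

```go
package lx

import "sync"

// Trade is shown with illustrative fields; Reset clears them so a
// recycled Trade never leaks data from a previous match.
type Trade struct {
    Price, Quantity int64
}

func (t *Trade) Reset() { *t = Trade{} }

// tradePool hands out reusable *Trade values instead of allocating per order.
type tradePool struct{ p sync.Pool }

func newTradePool() *tradePool {
    return &tradePool{p: sync.Pool{New: func() any { return new(Trade) }}}
}

func (tp *tradePool) Get() *Trade  { return tp.p.Get().(*Trade) }
func (tp *tradePool) Put(t *Trade) { tp.p.Put(t) }
```

Trades are returned with `Put` once they have been published downstream, so steady-state matching performs no heap allocations.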
// Lock-free price level using atomic operations (sync/atomic)
type PriceLevel struct {
    price    atomic.Int64
    quantity atomic.Int64
    orders   atomic.Pointer[OrderList]
}

// AddQuantity adjusts resting quantity without taking a lock
func (pl *PriceLevel) AddQuantity(qty int64) int64 {
    return pl.quantity.Add(qty)
}
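Updates to the `orders` pointer follow the standard lock-free publish pattern: build a new value, attempt a compare-and-swap, and retry on contention. A sketch assuming `OrderList` is an immutable singly linked node (the actual LX layout may differ):

```go
// OrderList is assumed here to be an immutable singly linked node.
type OrderList struct {
    Order *Order
    Next  *OrderList
}

// AddOrder prepends lock-free: build the new head, publish it with CAS,
// and retry if another writer won the race.
func (pl *PriceLevel) AddOrder(o *Order) {
    for {
        old := pl.orders.Load()
        head := &OrderList{Order: o, Next: old}
        if pl.orders.CompareAndSwap(old, head) {
            pl.quantity.Add(o.Quantity)
            return
        }
    }
}
```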
// Padded out to a full 64-byte cache line so adjacent orders do not share
// a line (prevents false sharing): 25 bytes of fields + 39 bytes of padding
type Order struct {
    ID       uint64
    Price    int64
    Quantity int64
    Side     uint8
    _        [39]byte // pad to 64 bytes
}
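The padding is easy to break silently when a field is added; a small layout test (illustrative, not taken from the repository) can guard it:

```go
package lx

import (
    "testing"
    "unsafe"
)

// TestOrderCacheLineSize fails if Order stops filling exactly one cache line.
func TestOrderCacheLineSize(t *testing.T) {
    const cacheLine = 64
    if size := unsafe.Sizeof(Order{}); size != cacheLine {
        t.Fatalf("Order is %d bytes; want exactly %d (one cache line)", size, cacheLine)
    }
}
```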
// Bind the worker's OS thread to a CPU core (golang.org/x/sys/unix); with
// Linux's first-touch policy, later allocations land on the local NUMA node
func (e *Engine) StartWorker(cpuID int) {
    runtime.LockOSThread()
    var cpus unix.CPUSet
    cpus.Zero()
    cpus.Set(cpuID)
    if err := unix.SchedSetaffinity(0, &cpus); err != nil {
        panic(err) // refuse to run an unpinned hot-path worker
    }
    // All allocations from here on are served from the local NUMA node
    e.localOrderBook = NewOrderBook()
}
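Workers are then typically launched one per core; a usage sketch (the `engine` variable is illustrative, and GOMAXPROCS is set to the physical core count in the tuning section below):

```go
// One pinned worker goroutine per core exposed to the Go scheduler.
for cpu := 0; cpu < runtime.GOMAXPROCS(0); cpu++ {
    go engine.StartWorker(cpu)
}
```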
# Full benchmark suite
make bench
# CPU profiling
go test -cpuprofile=cpu.prof -bench=BenchmarkOrderBook ./pkg/lx/
go tool pprof -http=:8080 cpu.prof
# Memory profiling
go test -memprofile=mem.prof -bench=BenchmarkOrderBook ./pkg/lx/
go tool pprof -http=:8081 mem.prof
# Trace analysis
go test -trace=trace.out -bench=BenchmarkOrderBook ./pkg/lx/
go tool trace trace.out
# Linux perf (kernel-level)
perf stat -e cycles,instructions,cache-misses ./lxd --benchmark
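The profiling commands above assume a benchmark along these lines in ./pkg/lx/ (a sketch; the actual benchmark names and constructor signatures in the repository may differ):

```go
package lx

import "testing"

// BenchmarkOrderBook drives the hot path with alternating buy/sell orders
// so -cpuprofile/-memprofile capture the matching code, not setup.
func BenchmarkOrderBook(b *testing.B) {
    ob := NewOrderBook()
    b.ReportAllocs()
    b.ResetTimer()
    for i := 0; i < b.N; i++ {
        ob.Match(&Order{
            ID:       uint64(i),
            Price:    100_000,
            Quantity: 1,
            Side:     uint8(i & 1), // 0 = buy, 1 = sell (assumed encoding)
        })
    }
}
```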
# Increase network buffers
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
sysctl -w net.ipv4.tcp_wmem="4096 87380 134217728"
# Nagle's algorithm is disabled with TCP_NODELAY, a per-socket option set in
# application code (there is no net.ipv4.tcp_nodelay sysctl); see the sketch
# after this tuning block
# Enable busy polling
sysctl -w net.core.busy_read=50
sysctl -w net.core.busy_poll=50
# Huge pages (2MB pages reduce TLB misses)
echo 1024 > /proc/sys/vm/nr_hugepages
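The per-socket half of this tuning happens in application code. A sketch, assuming Linux and the golang.org/x/sys/unix package (the `TuneConn` helper and the 50 μs busy-poll budget are illustrative):

```go
package netopt

import (
    "net"

    "golang.org/x/sys/unix"
)

// TuneConn applies per-socket options that complement the sysctls above:
// TCP_NODELAY disables Nagle's algorithm, and SO_BUSY_POLL busy-waits on
// the socket for up to 50 µs before sleeping.
func TuneConn(conn *net.TCPConn) error {
    if err := conn.SetNoDelay(true); err != nil { // TCP_NODELAY
        return err
    }
    raw, err := conn.SyscallConn()
    if err != nil {
        return err
    }
    var sockErr error
    ctrlErr := raw.Control(func(fd uintptr) {
        sockErr = unix.SetsockoptInt(int(fd), unix.SOL_SOCKET, unix.SO_BUSY_POLL, 50)
    })
    if ctrlErr != nil {
        return ctrlErr
    }
    return sockErr
}
```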
# Optimal GC settings for low latency
export GOGC=400        # Less frequent GC
export GOMEMLIMIT=8GiB # Hard memory limit
export GOMAXPROCS=8    # Match physical cores
# Enable memory ballast (reduces GC frequency)
# In code: var ballast = make([]byte, 10<<30) // 10GB ballast
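A minimal sketch of the ballast referenced above. Note that on Go 1.19+ the GOMEMLIMIT setting largely supersedes ballast, and the ballast counts toward the heap goal, so keep it well below any configured memory limit:

```go
package main

import "runtime"

// ballast reserves virtual address space that is never written, so no
// physical pages are touched, but the GC sees a larger live heap and
// therefore triggers far less often.
var ballast = make([]byte, 10<<30) // 10 GiB

func main() {
    // ... start the engine ...
    runtime.KeepAlive(ballast) // keep the ballast reachable for the process lifetime
}
```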
| Component | Minimum | Recommended | Ultra |
|---|---|---|---|
| CPU | 8 cores | 16+ cores | 64+ cores |
| RAM | 32 GB | 128 GB | 512 GB |
| Network | 10 Gbps | 25 Gbps | 100 Gbps |
| Storage | NVMe SSD | NVMe RAID | Optane |
| GPU | - | Apple M2 | M2 Ultra |
// Critical performance metrics
type Metrics struct {
    OrderLatencyP50  time.Duration `metric:"order_latency_p50"`
    OrderLatencyP99  time.Duration `metric:"order_latency_p99"`
    OrderLatencyP999 time.Duration `metric:"order_latency_p999"`
    MatchThroughput  float64       `metric:"match_throughput_per_sec"`
    OrdersInFlight   int64         `metric:"orders_in_flight"`
    GCPauseP99       time.Duration `metric:"gc_pause_p99"`
    HeapInUse        uint64        `metric:"heap_in_use_bytes"`
    NetworkRTT       time.Duration `metric:"network_rtt_us"`
    TCPRetransmits   int64         `metric:"tcp_retransmits"`
}
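The percentile fields are typically fed from a latency histogram sampled on the match path. An illustrative recorder, assuming the github.com/HdrHistogram/hdrhistogram-go package (not necessarily what LX uses); the histogram is not goroutine-safe, so keep one per pinned worker:

```go
package lx

import (
    "time"

    hdrhistogram "github.com/HdrHistogram/hdrhistogram-go"
)

// LatencyRecorder tracks order latencies from 1 ns to 10 s at 3 significant
// figures; keep one per worker since the histogram is not goroutine-safe.
type LatencyRecorder struct {
    hist *hdrhistogram.Histogram
}

func NewLatencyRecorder() *LatencyRecorder {
    return &LatencyRecorder{hist: hdrhistogram.New(1, int64(10*time.Second), 3)}
}

// Observe records one order's end-to-end latency in nanoseconds.
func (r *LatencyRecorder) Observe(d time.Duration) {
    _ = r.hist.RecordValue(d.Nanoseconds())
}

// Snapshot fills the percentile fields of the Metrics struct above.
func (r *LatencyRecorder) Snapshot(m *Metrics) {
    m.OrderLatencyP50 = time.Duration(r.hist.ValueAtQuantile(50))
    m.OrderLatencyP99 = time.Duration(r.hist.ValueAtQuantile(99))
    m.OrderLatencyP999 = time.Duration(r.hist.ValueAtQuantile(99.9))
}
```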
| Metric | Target | Alert Threshold |
|---|---|---|
| Order Latency P99 | <10 μs | >50 μs |
| Match Latency P99 | <1 μs | >5 μs |
| GC Pause P99 | <1 ms | >5 ms |
| Throughput | >1M/sec | <500K/sec |
| Error Rate | <0.001% | >0.01% |
| Network RTT | <100 μs | >1 ms |