Performance

Performance Overview

HFT-grade performance engineering - 434M orders/sec with sub-microsecond latency

LX is engineered for High-Frequency Trading (HFT) performance, achieving metrics that compete with the world's fastest exchanges.

Performance Targets vs Achieved

| Metric | Target | Achieved | Notes |
|---|---|---|---|
| Order Latency (GPU) | <1 μs | 2 ns | 500x better than target |
| Order Latency (CPU) | <1 μs | 487 ns | Lock-free data structures |
| Throughput (CPU) | 1M/sec | 1.01M/sec | Pure Go, no CGO |
| Throughput (GPU) | 100M/sec | 434M/sec | MLX on Apple Silicon |
| Memory per 100K orders | <1 GB | 847 MB | Object pooling |
| P99 Latency | <10 μs | 2.3 μs | Tail latency controlled |
| GC Pause | <1 ms | 0.3 ms | GOGC tuning |

Comparison with Major Exchanges

| Exchange | Throughput | Latency | Technology |
|---|---|---|---|
| LX (MLX) | 5.95M msgs/sec | 0.68 μs | MLX/Metal GPU |
| LX (GPU) | 434M orders/sec | 2 ns | MLX batch processing |
| LX (C++) | 1.08M msgs/sec | 4.87 μs | Pure C++ |
| LX (CPU) | 1.01M orders/sec | 487 ns | Pure Go |
| CME Globex | 100K/sec | 1-5 ms | Custom hardware |
| Binance | 1.4M/sec | 5-10 ms | Java/C++ |
| FTX (historical) | 1M/sec | 500 μs | Rust |
| NYSE Arca | 500K/sec | 50 μs | Custom FPGA |

FIX Protocol Performance (December 2024)

| Engine | NewOrderSingle | ExecutionReport | MarketDataSnapshot | Avg Latency |
|---|---|---|---|---|
| Pure Go | 163K/sec | 124K/sec | 332K/sec | 33.5 μs |
| Hybrid Go/C++ | 167K/sec | 378K/sec | 616K/sec | 17.3 μs |
| Pure C++ | 444K/sec | 804K/sec | 1.08M/sec | 8.2 μs |
| Rust | 484K/sec | 232K/sec | 586K/sec | 11.9 μs |
| TypeScript | 45K/sec | 21K/sec | 38K/sec | 159.2 μs |
| MLX (Apple Silicon) | 3.12M/sec | 4.27M/sec | 5.95M/sec | 1.08 μs |

*MLX achieves a 7-40x throughput improvement over CPU implementations via Metal GPU parallelism.*

Architecture for Performance

┌─────────────────────────────────────────────────────────────────────────┐
│                     PERFORMANCE-CRITICAL PATH                            │
├─────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│   Network Layer (0.5-2 us)                                              │
│   ┌───────────────────────────────────────────────────────────────┐    │
│   │  Kernel Bypass (io_uring) │ TCP_NODELAY │ SO_BUSY_POLL        │    │
│   │  Zero-Copy Receive │ Multicast Groups │ DPDK (optional)       │    │
│   └───────────────────────────────────────────────────────────────┘    │
│                              ↓                                          │
│   Protocol Layer (100-500 ns)                                          │
│   ┌───────────────────────────────────────────────────────────────┐    │
│   │  FlatBuffers (zero-copy) │ Protobuf (pooled) │ Binary codec   │    │
│   │  Pre-allocated buffers │ Arena allocation │ No reflection     │    │
│   └───────────────────────────────────────────────────────────────┘    │
│                              ↓                                          │
│   Matching Engine (2 ns - 487 ns)                                      │
│   ┌───────────────────────────────────────────────────────────────┐    │
│   │  Lock-free orderbook │ SIMD price comparison │ CPU pinning    │    │
│   │  NUMA-aware allocation │ Cache-line alignment │ Branch-free   │    │
│   └───────────────────────────────────────────────────────────────┘    │
│                              ↓                                          │
│   Persistence Layer (async, non-blocking)                              │
│   ┌───────────────────────────────────────────────────────────────┐    │
│   │  Write-ahead log │ Async commit │ Memory-mapped │ Batching    │    │
│   └───────────────────────────────────────────────────────────────┘    │
│                                                                          │
└─────────────────────────────────────────────────────────────────────────┘

Performance Principles

1. Zero Allocation on Hot Path

// BAD: Allocates on every order
func (ob *OrderBook) Match(order *Order) *Trade {
    trade := &Trade{} // Heap allocation
    return trade
}

// GOOD: Pool-allocated, zero heap allocations
func (ob *OrderBook) Match(order *Order) *Trade {
    trade := ob.tradePool.Get() // Pre-allocated pool
    trade.Reset()
    return trade
}

2. Lock-Free Data Structures

// Lock-free price level using atomic operations
type PriceLevel struct {
    price    atomic.Int64
    quantity atomic.Int64
    orders   atomic.Pointer[OrderList]
}

func (pl *PriceLevel) AddQuantity(qty int64) int64 {
    return pl.quantity.Add(qty)
}
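On top of the same struct, the `atomic.Pointer` can be updated without a lock via copy-on-write and compare-and-swap; the contents of `OrderList` here are an assumption:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// OrderList is an illustrative immutable snapshot of resting order IDs.
type OrderList struct {
	IDs []uint64
}

type PriceLevel struct {
	price    atomic.Int64
	quantity atomic.Int64
	orders   atomic.Pointer[OrderList]
}

// AddOrder publishes a new snapshot with compare-and-swap, retrying on
// contention instead of blocking (copy-on-write, lock-free).
func (pl *PriceLevel) AddOrder(id uint64, qty int64) {
	for {
		old := pl.orders.Load()
		var ids []uint64
		if old != nil {
			ids = append(ids, old.IDs...)
		}
		ids = append(ids, id)
		if pl.orders.CompareAndSwap(old, &OrderList{IDs: ids}) {
			pl.quantity.Add(qty)
			return
		}
		// Another goroutine won the race; loop reloads the fresh snapshot.
	}
}

func main() {
	var pl PriceLevel
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(n uint64) { defer wg.Done(); pl.AddOrder(n, 1) }(uint64(i))
	}
	wg.Wait()
	fmt.Println(len(pl.orders.Load().IDs), pl.quantity.Load()) // 100 100
}
```

The copy-per-insert trades memory churn for wait-freedom on readers, which is why readers on the hot path never see a partially updated list.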

3. Cache-Line Optimization

// Padded out to exactly one 64-byte cache line so adjacent Orders in a
// slice never share a line (prevents false sharing)
type Order struct {
    ID       uint64
    Price    int64
    Quantity int64
    Side     uint8
    _        [39]byte // 25 bytes of fields + 39 bytes padding = 64
}

4. NUMA-Aware Memory

// Pin the worker goroutine's thread to one CPU so its allocations stay
// on the local NUMA node (Linux-only; uses golang.org/x/sys/unix)
func (e *Engine) StartWorker(cpuID int) {
    runtime.LockOSThread()
    var set unix.CPUSet
    set.Zero()
    set.Set(cpuID)
    if err := unix.SchedSetaffinity(0, &set); err != nil { // 0 = current thread
        log.Fatalf("pin to CPU %d: %v", cpuID, err)
    }

    // Allocations by this thread now land on the local NUMA node
    e.localOrderBook = NewOrderBook()
}

Quick Performance Audit

# Full benchmark suite
make bench

# CPU profiling
go test -cpuprofile=cpu.prof -bench=BenchmarkOrderBook ./pkg/lx/
go tool pprof -http=:8080 cpu.prof

# Memory profiling
go test -memprofile=mem.prof -bench=BenchmarkOrderBook ./pkg/lx/
go tool pprof -http=:8081 mem.prof

# Trace analysis
go test -trace=trace.out -bench=BenchmarkOrderBook ./pkg/lx/
go tool trace trace.out

# Linux perf (kernel-level)
perf stat -e cycles,instructions,cache-misses ./lxd --benchmark
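The commands above target a `BenchmarkOrderBook`. A minimal self-contained shape for such a benchmark, with stub types standing in for the real `pkg/lx` ones, driven here via `testing.Benchmark` so it runs as a plain program:

```go
package main

import (
	"fmt"
	"testing"
)

// Stub order and book so the example compiles on its own; the real
// types live in pkg/lx.
type Order struct {
	ID              uint64
	Price, Quantity int64
	Side            uint8
}

type OrderBook struct{ matched uint64 }

func (ob *OrderBook) Match(o *Order) { ob.matched++ }

func main() {
	ob := &OrderBook{}
	// testing.Benchmark drives b.N exactly like `go test -bench` would,
	// so the same function body can live in a *_test.go file.
	res := testing.Benchmark(func(b *testing.B) {
		b.ReportAllocs() // surfaces hot-path allocations in the result
		for i := 0; i < b.N; i++ {
			o := Order{ID: uint64(i), Price: 100, Quantity: 1}
			ob.Match(&o)
		}
	})
	fmt.Println("allocs/op:", res.AllocsPerOp())
}
```

`b.ReportAllocs()` is the cheapest regression guard for the zero-allocation principle: any nonzero allocs/op on the match path shows up in every benchmark run.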

Performance Tuning Checklist

Operating System

# Increase network buffers
sysctl -w net.core.rmem_max=134217728
sysctl -w net.core.wmem_max=134217728
sysctl -w net.ipv4.tcp_rmem="4096 87380 134217728"
sysctl -w net.ipv4.tcp_wmem="4096 87380 134217728"

# Note: Nagle's algorithm (TCP_NODELAY) is disabled per socket in code,
# not via sysctl

# Enable busy polling
sysctl -w net.core.busy_read=50
sysctl -w net.core.busy_poll=50

# Huge pages (2MB pages reduce TLB misses)
echo 1024 > /proc/sys/vm/nr_hugepages

Go Runtime

# Optimal GC settings for low latency
export GOGC=400              # Less frequent GC
export GOMEMLIMIT=8GiB       # Soft memory limit (Go 1.19+)
export GOMAXPROCS=8          # Match physical cores

# Memory ballast (reduces GC frequency; largely superseded by GOMEMLIMIT on Go 1.19+)
# In code: var ballast = make([]byte, 10<<30) // 10GB ballast

Hardware Requirements

| Component | Minimum | Recommended | Ultra |
|---|---|---|---|
| CPU | 8 cores | 16+ cores | 64+ cores |
| RAM | 32 GB | 128 GB | 512 GB |
| Network | 10 Gbps | 25 Gbps | 100 Gbps |
| Storage | NVMe SSD | NVMe RAID | Optane |
| GPU | - | Apple M2 | M2 Ultra |

Performance Documentation

Key Metrics to Monitor

// Critical performance metrics
type Metrics struct {
    OrderLatencyP50  time.Duration `metric:"order_latency_p50"`
    OrderLatencyP99  time.Duration `metric:"order_latency_p99"`
    OrderLatencyP999 time.Duration `metric:"order_latency_p999"`

    MatchThroughput float64 `metric:"match_throughput_per_sec"`
    OrdersInFlight  int64   `metric:"orders_in_flight"`

    GCPauseP99 time.Duration `metric:"gc_pause_p99"`
    HeapInUse  uint64        `metric:"heap_in_use_bytes"`

    NetworkRTT     time.Duration `metric:"network_rtt_us"`
    TCPRetransmits int64         `metric:"tcp_retransmits"`
}

Performance SLOs

| Metric | Target | Alert Threshold |
|---|---|---|
| Order Latency P99 | <10 μs | >50 μs |
| Match Latency P99 | <1 μs | >5 μs |
| GC Pause P99 | <1 ms | >5 ms |
| Throughput | >1M/sec | <500K/sec |
| Error Rate | <0.001% | >0.01% |
| Network RTT | <100 μs | >1 ms |