OpenClaw: Advanced Memory Architectures
Lecture 6

Caches and Crates: Optimizing for Modern Hardware

Transcript

An Intel Haswell processor can push 512 bytes through its pipeline every single clock cycle, but its connection to main RAM delivers just 10 bytes per cycle. That gap, documented in TU Munich's research on data processing on modern hardware, is not a bug. It is the defining constraint of every game engine running today. DRAM, the memory filling your gigabytes of system RAM, stores each bit in a capacitor that must be refreshed every 64 milliseconds or it loses state. SRAM, the memory inside CPU caches, uses bistable latches (six transistors per cell versus DRAM's one) and drives its output lines actively, giving you nearly instantaneous access. The physics of those two technologies creates a hierarchy you cannot negotiate around.

In this lecture we focus on optimizing cache hierarchies and data-oriented design, moving beyond the memory-leak discussion from the previous lecture. That working-set discipline still matters here, because what lives in cache is determined entirely by what you recently touched. Modern processors mitigate DRAM latency with deep cache hierarchies that reward spatial locality and prefetching. L1 cache sits closest to the core, typically 32 kilobytes per core on Haswell, and is kept deliberately small because larger caches mean longer signal lines, higher capacitive loads, and slower circuits. That size-versus-speed tradeoff is fundamental, not incidental. L2 and L3 caches scale up in size and down in speed at each level. A full DRAM access on Haswell costs roughly 200 CPU cycles; an L1 hit costs single-digit cycles. A gap of that size means one cache miss can erase the benefit of dozens of fast operations.

Cache lines are the unit of transfer between levels, the smallest granularity the memory hierarchy moves. When you access one byte, the entire cache line comes with it. This is where OpenClaw's pool allocators pay a second dividend beyond allocation speed.
When all active projectile objects live in a contiguous pool, iterating them for physics updates means sequential memory access: the CPU's hardware prefetcher reads ahead automatically, pulling cache lines in before the core requests them. Prefetching hides memory latency by issuing requests early, at the cost of some extra bandwidth and power, but the frame-time savings dominate. Write coalescing, batching multiple writes into a single transaction before committing them, compounds this by reducing per-request overhead on the memory bus. Scattered heap allocations destroy both advantages simultaneously.

It is crucial to understand that larger caches alone do not guarantee better performance; efficient access patterns are key. A fully associative cache requires a comparator for every cache line, so hardware cost scales with size and access time grows with signal length. The real lever is the access pattern. Data-oriented design, which groups related fields contiguously rather than bundling them inside object hierarchies, can outperform traditional object-oriented layouts dramatically, not because the cache got bigger, but because the working set got tighter. Spatial locality, keeping related data physically close in memory, is the most effective tool you have for extracting performance from modern hardware. The cache is already fast. Your job is to stop wasting it.