The Compute Moat: Why Infrastructure Is the New AI Frontier

Lecture 2

Inside Colossus: 220,000 GPUs and Your API Limits

LECTURE 1 • 4 min

The New Compute Moat: Why Your AI Tools Are Throttled

LECTURE 2 • 7 min

LECTURE 3 • 5 min

The $100 Billion Grid: Amazon, Google, and Sovereign AI

LECTURE 4 • 7 min

Hitting the Wall: Why the Next AI Might Be Orbital

Transcript

SPEAKER_1: Alright, so last time we established that AI companies are spending over 80% of their capital on compute: not researchers, not algorithms, compute. And that physical reality flows straight down to the rate limits users hit every day. I want to get specific now, because there's a cluster that keeps coming up in this conversation: Colossus.

SPEAKER_2: Good place to pick up. Colossus is xAI's supercomputer, built in Memphis, and it became operational with over 200,000 H100 GPUs, making it the world's largest AI training cluster as of 2025. That number is not a rounding error. It's a genuine infrastructure milestone.

SPEAKER_1: So what does 200,000 H100s actually mean in practice? Our listener might be wondering: is that just a big number, or does it translate into something tangible?

SPEAKER_2: It's very tangible. That cluster has enough compute to train a GPT-3 scale model in under two hours. For context, that same job used to take weeks on earlier infrastructure. xAI used Colossus, running at roughly ten times the compute of prior state-of-the-art systems, to train Grok 3. The scale changes what's even possible to build.

SPEAKER_1: And xAI isn't stopping there, right? There's an expansion in the works.

SPEAKER_2: Correct. The Abilene site (hosting Crusoe, Stargate, and OpenAI) is projected to hold 400,000 to 500,000 Blackwell chips in NVL72 racks in 2026, at an estimated cost of $22 to $35 billion. That's a different order of magnitude again. Those NVL72 racks are specifically designed for the kind of compute needed to train models requiring around four times ten to the twenty-seventh floating point operations.

SPEAKER_1: Okay, so training is one thing. But what our listener is actually feeling day-to-day is inference, the model answering their question. How does this hardware picture connect to rate limits?

SPEAKER_2: That's exactly the right question. Flagship models like Gemini 3 Pro and Opus 4.5 have multiple trillions of parameters.
Serving those models to users requires gigawatt-scale inference compute with substantial high-bandwidth memory. Nvidia's GB200 and GB300 NVL72 systems offer 14 to 20 terabytes of HBM per scale-up world. Google's Trillium TPUv6e provides 8 terabytes per 256-chip pod. These aren't luxury specs; they're the minimum to serve trillion-parameter models without choking.

SPEAKER_1: So if a company doesn't have enough of that hardware... they throttle.

SPEAKER_2: Precisely. OpenAI, for instance, lacks sufficient Nvidia hardware in 2026 to serve roughly six-trillion-parameter models to all users without restrictions. The math is brutal: a Bloomberg report indicated only 16,000 GB200 chips at Abilene by summer 2025, and 64,000 by end of 2026. Meanwhile, Colossus with 200,000 H100 and H200 GPUs already exceeds that projected total. xAI will have enough NVL72 systems by January 2026 to serve its six-trillion-parameter models efficiently. That gap is why some tools feel fast and others feel like they're rationing access.

SPEAKER_1: And Anthropic's Claude models, where do they sit in this picture?

SPEAKER_2: Claude faces real serving constraints right now. Insufficient hardware is directly causing rate limits and higher prices for Anthropic's users. That's not a policy choice; it's a hardware shortage. Nvidia's systems are estimated to lag roughly a year behind for efficiently serving trillion-parameter models, until 2028 or 2029 improvements arrive. So the throttling Alina or any developer hits when using Claude Code isn't arbitrary. It's a physical ceiling.

SPEAKER_1: How does the cost side of this work? Because one gigawatt of compute sounds abstract until someone puts a dollar figure on it.

SPEAKER_2: One gigawatt of AI compute costs approximately $12 billion per year to operate. Three months on one gigawatt can produce 300 trillion output tokens, for tasks like reinforcement learning from verifiable rewards.
API providers charge around $10 per million output tokens for large model inference. So the economics are tight. A single RLVR task may require 300,000 tokens across 16 runs just to evaluate success. The compute bill accumulates fast, and companies without the infrastructure to absorb it push costs onto users through pricing or restrict access through rate limits.

SPEAKER_1: So if I'm following this correctly, the companies that secured the most hardware early are now able to offer better access at lower friction, while everyone else is either raising prices or throttling demand.

SPEAKER_2: That's the dynamic exactly. Training compute has seen a 14x increase over the two years leading into 2026 frontier systems. The companies that anticipated that curve and locked in capacity are now structurally ahead. The ones that didn't are managing scarcity.

SPEAKER_1: There's a centralization risk here too, isn't there? Concentrating this much compute in one physical location (Memphis, Abilene), that's a vulnerability as much as it's a strength.

SPEAKER_2: Absolutely. A single point of failure (power grid disruption, cooling failure, regulatory action) can affect millions of users simultaneously. The same concentration that makes Colossus powerful makes it fragile. And geopolitically, it means the tools available to someone in Europe or Southeast Asia are entirely dependent on infrastructure decisions made in a handful of American cities.

SPEAKER_1: So for our listener trying to make sense of why their AI tools behave so differently, what's the one thing they should carry forward from this?

SPEAKER_2: The responsiveness gap between AI tools is not about which company has smarter engineers. It's about who has the hardware to serve trillion-parameter models at scale without rationing. Anthropic's move to access Colossus-class infrastructure is a direct attempt to close that gap, to remove the throttling that has pushed developers away from high-end models like Opus.
The compute race isn't background noise. It's the mechanism that determines which tools remain reliable and which ones keep hitting walls.
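
The cost figures quoted in this lecture can be checked with a few lines of arithmetic. The sketch below is a back-of-the-envelope calculation, not a cost model. It uses only numbers stated in the transcript (about $12 billion per gigawatt-year, about 300 trillion output tokens from three months on one gigawatt, about $10 per million output tokens, 300,000 tokens per RLVR task). It assumes that "three months" means exactly a quarter of a year and that the 300,000 tokens are the total across all 16 runs; the transcript does not say either explicitly. All function and variable names are illustrative.

```python
# Back-of-the-envelope check of the token economics quoted in the transcript.
# Inputs are the speakers' round figures; names are illustrative only.

GW_YEAR_COST_USD = 12e9            # ~$12 billion per year to operate 1 GW
TOKENS_PER_GW_QUARTER = 300e12     # ~300 trillion output tokens in three months
API_PRICE_PER_MILLION_USD = 10.0   # ~$10 per million output tokens
RLVR_TOKENS_PER_TASK = 300_000     # ASSUMPTION: total across all 16 runs


def cost_per_million_tokens(gw_year_cost: float, tokens_per_quarter: float) -> float:
    """Operating cost per million output tokens.

    ASSUMPTION: three months is treated as exactly one quarter of a year.
    """
    quarter_cost = gw_year_cost / 4
    return quarter_cost * 1e6 / tokens_per_quarter


def rlvr_task_cost(tokens_per_task: int, usd_per_million: float) -> float:
    """Cost of evaluating one RLVR task at a given per-million-token rate."""
    return tokens_per_task * usd_per_million / 1e6


serving_cost = cost_per_million_tokens(GW_YEAR_COST_USD, TOKENS_PER_GW_QUARTER)
margin = API_PRICE_PER_MILLION_USD - serving_cost
task_cost = rlvr_task_cost(RLVR_TOKENS_PER_TASK, serving_cost)
tasks_per_quarter = TOKENS_PER_GW_QUARTER / RLVR_TOKENS_PER_TASK

print(f"Operating cost per million output tokens: ${serving_cost:.2f}")
print(f"Quoted API price minus operating cost:    ${margin:.2f}")
print(f"Cost to evaluate one RLVR task:           ${task_cost:.2f}")
print(f"RLVR tasks per gigawatt-quarter:          {tasks_per_quarter:.0f}")
```

Under these assumptions the operating cost comes to about $10 per million output tokens, the same as the quoted API price, which is one way to read the remark that the economics are tight. If the 300,000 tokens were instead meant per run rather than in total, the per-task cost would be 16 times higher.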