
The DeepSeek Revolution: Architecture, Economy, and the New AI Order
The New Challenger: Who Is DeepSeek?
The Engine Under the Hood: Mixture-of-Experts (MoE)
Efficiency First: Multi-Head Latent Attention (MLA)
The Benchmark Battle: DeepSeek vs. The Giants
The Economic Impact: Disrupting the Token Economy
Mastering Logic: The Rise of DeepSeek-R1
AI Sovereignty and the Global Shift
The Road Ahead: What DeepSeek Means for the Future
SPEAKER_1: Alright, so last time we got deep into MLA, the compression trick that lets DeepSeek move less data and still run faster. That 15x cache reduction number really stuck. But today I want to shift from architecture to performance. Because the real question everyone's been sitting with is: does any of this actually show up in the benchmarks?

SPEAKER_2: That's exactly the right pivot. And the short answer is yes, decisively in some areas. But let's be precise about where, because the picture is more nuanced than the headlines suggest.

SPEAKER_1: So where does DeepSeek actually beat the giants? Give me the specific numbers.

SPEAKER_2: Start with math. On AIME 2024, one of the hardest high school math competitions used as an AI benchmark, DeepSeek R1 scored 79.8%; OpenAI's o1 scored 79.2%. On MATH-500, a broader mathematical reasoning suite, R1 hit 97.3% versus o1's 96.4%. These aren't rounding differences. R1 is genuinely ahead on mathematical reasoning.

SPEAKER_1: And coding? Because that's where a lot of developers are paying close attention.

SPEAKER_2: On Codeforces, a competitive programming benchmark, R1 reached a rating of 1,962 versus o1's 1,891. And DeepSeek V3 scores 54.8% on SWE-bench, which tests real-world software engineering tasks. GPT-5 is at 63%, so there's still a gap there, but V3 is competitive at a fraction of the cost.

SPEAKER_1: So for someone like Yunying tracking this space: DeepSeek isn't just cheaper, it's actually ahead in specific domains.

SPEAKER_2: Right. Math and competitive programming are where R1 leads. The gap narrows in general software engineering, and it widens again in open-ended creative tasks. The architecture optimizes for structured reasoning, problems with verifiable answers, and that's exactly where the benchmarks show it pulling ahead.

SPEAKER_1: Why does that specialization happen? How does the architecture produce that pattern?

SPEAKER_2: It comes back to how R1 was trained.
Reinforcement learning with chain-of-thought reasoning means the model learns to verify its own outputs against ground truth. Math and code have ground truth: the answer is right or wrong. Creative writing doesn't. So the training signal is much stronger for structured domains, and the model's internal verification loop is more useful there.

SPEAKER_1: That's a clean explanation. So the hallucination question, which everyone asks, is actually tied to this same mechanism?

SPEAKER_2: Exactly. DeepSeek's approach to hallucination isn't a separate filter; it's structural. When the model reasons through a chain of thought and checks intermediate steps against verifiable criteria, it catches more of its own errors before output. That's different from post-hoc filtering. It doesn't eliminate hallucination, but it reduces it in domains where verification is possible.

SPEAKER_1: What's the common misconception people have when they first look at these benchmarks?

SPEAKER_2: That benchmark performance equals general capability. DeepSeek R1 beating o1 on AIME doesn't mean it's a better model in every context. The misconception is treating benchmarks as a single leaderboard. They're more like sport-specific rankings: a sprinter and a marathon runner can both be world-class athletes.

SPEAKER_1: Fair. Now the cost side, because this is where the disruption really hits. What are we talking about in terms of inference pricing?

SPEAKER_2: R1's API is priced at $0.55 per million input tokens and $2.19 per million output tokens. That's about 96% cheaper than OpenAI's o1. And DeepSeek's inference costs overall run 20 to 50 times cheaper than comparable frontier offerings. That gap forced OpenAI, Anthropic, and Google to cut their own reasoning model prices after R1 launched.

SPEAKER_1: So DeepSeek didn't just compete, it repriced the entire market.

SPEAKER_2: That's the right framing.
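The verification loop described above can be made concrete. DeepSeek has not published its reward code, so the sketch below is purely illustrative: a rule-based reward that checks a model's final boxed answer against ground truth for math, and a test-suite pass rate for code. The function names and the boxed-answer convention are assumptions for illustration.

```python
import re

def math_reward(model_output: str, ground_truth: str) -> float:
    """Rule-based reward for a math problem: 1.0 only if the final
    \\boxed{...} answer matches ground truth exactly.

    Illustrative stand-in for an R1-style verifiable reward; the real
    reward functions are not public.
    """
    # Take the LAST boxed expression as the model's final answer,
    # since chain-of-thought text may contain intermediate ones.
    answers = re.findall(r"\\boxed\{([^}]*)\}", model_output)
    if not answers:
        return 0.0  # unparseable output earns no reward
    return 1.0 if answers[-1].strip() == ground_truth.strip() else 0.0

def code_reward(tests_passed: int, tests_total: int) -> float:
    """For code, the verifier is a test suite: reward is the pass rate."""
    return tests_passed / tests_total if tests_total else 0.0
```

The point of the sketch is the asymmetry SPEAKER_2 describes: both rewards are computed mechanically from a verifiable criterion, which is exactly what creative-writing outputs lack, so the RL signal concentrates in math and code.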
And consider the operational context: DeepSeek runs with roughly 100 people, no venture capital funding, and around 10,000 GPUs. They matched frontier AI performance at one-tenth the training cost of the giants. The $6 million training figure for V3 isn't a one-off; it reflects a systematic cost discipline built into every architectural decision.

SPEAKER_1: The environmental angle is something I didn't expect to be relevant here. Is that real?

SPEAKER_2: It's real and it's significant. GPT-4o at full scale emits CO₂ equivalent to roughly 380,000 cars annually. DeepSeek R1 is estimated at around 115,000 cars. That's not a marginal difference; it's a structural consequence of running fewer active parameters per token, which is the MoE efficiency we covered in lecture two.

SPEAKER_1: And looking forward, what's the trajectory? Because V4 numbers are starting to circulate.

SPEAKER_2: The circulating specs put DeepSeek V4 at 1 trillion total parameters, with only 50 to 60 billion active per token, so total capacity grows while per-token compute stays modest. It pushes the context window to 1 million tokens and introduces new mechanisms, including multi-head conditional attention and DSA sparse attention. Early projections suggest it will surpass GPT-5 and Gemini 3 Ultra on benchmarks.

SPEAKER_1: So for our listener trying to hold the big picture: what's the one thing they should carry forward from this benchmark battle?

SPEAKER_2: That DeepSeek models don't just compete on cost; they outperform industry leaders in the specific domains that matter most for developers: coding, mathematics, and logical reasoning. The assumption that you need the biggest lab, the most funding, and the most chips to win on benchmarks has been empirically disproven. That's not a marketing claim. The numbers are public, the methodology is reproducible, and the market already responded: Nvidia lost $589 billion in market value in a single day when R1 dropped.
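The active-versus-total parameter distinction that keeps coming up can be checked with back-of-envelope arithmetic. The V4 figures below are the unconfirmed estimates quoted in the conversation, and the 2-FLOPs-per-parameter rule is the standard rough approximation for per-token transformer inference, not a DeepSeek-specific number.

```python
# Back-of-envelope: inference cost scales with ACTIVE parameters, not total.
# V4 figures are the circulating estimates quoted above, not confirmed specs.
total_params = 1.0e12    # ~1 trillion total parameters
active_params = 55e9     # midpoint of the 50-60 billion active-per-token range

# Rough rule: ~2 FLOPs per active parameter per generated token.
flops_per_token_moe = 2 * active_params
flops_per_token_dense = 2 * total_params  # a dense model of the same size

active_fraction = active_params / total_params
print(f"Active fraction per token: {active_fraction:.1%}")
print(f"Per-token compute vs equal-size dense model: "
      f"{flops_per_token_moe / flops_per_token_dense:.1%}")
```

Both ratios come out around 5.5%, which is the whole MoE argument in one number: a model with 1 trillion parameters of capacity paying roughly the per-token compute bill of a ~55-billion-parameter dense model, and hence the cost and emissions figures cited in the episode.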