The DeepSeek Revolution: Architecture, Economy, and the New AI Order
Lecture 6

Mastering Logic: The Rise of DeepSeek-R1

Transcript

A model trained with zero supervised examples, no labeled answers and no human demonstrations, taught itself to solve competition mathematics, jumping from a 15.6% score on AIME 2024 to 71.0% purely through reinforcement learning. The DeepSeek team documented this in their January 2025 technical paper on arXiv, and the number is not a typo. That model is DeepSeek-R1-Zero, the raw precursor to R1, and what it revealed about machine reasoning changed the assumptions of an entire field.

Last lecture established that DeepSeek's pricing reset the entire token economy, forcing OpenAI and Google to cut costs within weeks of R1's release. But the economic disruption only happened because the underlying model was genuinely competitive, and understanding why requires looking at how R1 was actually built.

R1 is a first-generation reasoning model trained primarily through large-scale reinforcement learning rather than traditional supervised fine-tuning. R1-Zero learns by attempting problems, receiving feedback, and iterating, a significant departure from models that rely on labeled examples. No human-curated reasoning chains. No step-by-step demonstrations. Just trial, feedback, and adaptation at scale. The training pipeline for the full R1 model adds a cold-start phase, a small set of curated examples that establishes readable answer formatting, before the reinforcement learning begins; the RL itself uses Group Relative Policy Optimization (GRPO), which scores each sampled answer relative to the other answers in its group rather than training a separate value model. The result: R1 achieves 79.8% Pass@1 on AIME 2024, edging past OpenAI o1's 79.2%, and 97.3% on MATH-500, matching o1 while costing 90 to 95% less to run at inference.

The key mechanism here is Chain-of-Thought reasoning. R1 breaks a problem into explicit intermediate steps, checks each step against the constraints of the problem, and revises when a step fails. This is structurally different from retrieval. It is computation.
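The group-relative scoring at the heart of GRPO can be sketched in a few lines. This is a minimal illustration of the advantage computation only, not DeepSeek's implementation: the function name and the rewards-in, advantages-out interface are simplifications, and a real trainer would feed these advantages into a clipped policy-gradient update over token log-probabilities.

```python
def grpo_advantages(rewards):
    """Sketch of GRPO's group-relative advantage (hypothetical helper).

    For one prompt, the policy samples a group of answers and each gets a
    scalar reward (e.g. 1.0 if the final answer is correct, else 0.0).
    The advantage of each answer is its reward standardized against the
    group's mean and standard deviation, so no value network is needed.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    # Small epsilon guards against a group where every reward is equal.
    return [(r - mean) / (std + 1e-8) for r in rewards]

# Two correct and two incorrect answers in a group of four:
# the correct ones get positive advantage, the incorrect ones negative.
print(grpo_advantages([1.0, 0.0, 0.0, 1.0]))
```

Answers that beat their own group's average are reinforced and the rest are suppressed, which is how the model improves purely from outcome feedback.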
And here is what makes it counterintuitive, Yunying: during RL training, researchers witnessed an "aha moment," a phase where R1-Zero spontaneously began extending its reasoning chains on harder problems and re-evaluating its own intermediate steps. No programmer wrote that behavior. It emerged.

R1 also generalizes well beyond math: an 87.6% length-controlled win rate on AlpacaEval 2.0, 92.3% on ArenaHard, and strong performance on MMLU, GPQA Diamond, and the FRAMES long-context QA benchmark. Reasoning patterns from R1 can even be distilled into smaller models, spreading the capability without the full compute cost.

The takeaway, Yunying, is precise: DeepSeek-R1 proved that reinforcement learning alone, without supervised imitation, can produce a model that genuinely thinks through problems, self-corrects, and generalizes. The path to reasoning does not require more data. It requires better feedback.
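The distillation mentioned above is, per the R1 report, plain supervised fine-tuning on teacher-generated reasoning traces: the student simply minimizes next-token cross-entropy on tokens the teacher produced. A minimal sketch of that objective, with hypothetical names and toy two-token vocabularies standing in for a real tokenizer and model:

```python
import math

def softmax(logits):
    """Numerically stable softmax over one position's logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def distill_loss(student_logits, teacher_token_ids):
    """Average next-token cross-entropy of the student on a teacher trace.

    student_logits: one logit vector per position, from the student model.
    teacher_token_ids: the token the teacher (e.g. R1) emitted at each
    position. Minimizing this loss pushes the student to reproduce the
    teacher's reasoning chain, token by token.
    """
    total = 0.0
    for logits, target in zip(student_logits, teacher_token_ids):
        probs = softmax(logits)
        total -= math.log(probs[target])
    return total / len(teacher_token_ids)

# A student that is indifferent between two tokens pays log(2) nats
# per position; training lowers this toward zero.
print(distill_loss([[0.0, 0.0]], [0]))
```

The design point is that no reward signal or RL is needed at this stage: the reasoning behavior was discovered once, expensively, by the teacher, and is then transferred cheaply as ordinary supervised data.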