Evaluation: Beyond the Benchmarks
A model scored near-perfectly on every standard benchmark, then failed catastrophically in deployment. This highlights the critical need for evaluation that aligns with business goals, a point at the center of the Missing Science of AI Evaluation talk scheduled for April 8, 2026, at UC Berkeley. Researchers there argue that current benchmarks routinely test unintended abilities, lack predictive power for unseen tasks, and suffer from performance saturation: top models cluster so tightly near the ceiling that the benchmark can no longer differentiate them.

The last lecture established that drift is the default trajectory of any deployed model, and that the right response is observability and automated triggers. That framing matters here, Yuan, because evaluation is the mechanism that makes observability meaningful. Without valid metrics, you are monitoring noise.

The core problem with benchmarks is structural: they cannot capture strategic business impacts such as revenue implications, user satisfaction, or regulatory compliance. Humanity's Last Exam, a 2,500-question benchmark built by 1,000 experts and released March 13, 2026, exposed this directly. Its designers removed every question current AI could solve, revealing capability gaps that standard leaderboards had hidden entirely.

So what does rigorous evaluation actually look like? Slice-based evaluation is one of the most powerful and underused tools available. Instead of reporting a single aggregate accuracy number, you segment your data by meaningful subgroups (demographic cohorts, geographic regions, edge-case categories) and measure performance on each slice independently. A model with a strong overall F1 score can be quietly failing on a specific user segment that represents your highest-value customers; that failure is invisible in aggregate metrics and only surfaces when you cut the data. The ADeLe framework, published in 2026 by Cambridge Judge Business School, takes this further: it profiles both benchmarks and models by capability dimensions, enabling transferable performance predictions across diverse AI ecosystems and on tasks the model has never seen.

Here is where business alignment becomes non-negotiable, Yuan. Evaluation must measure revenue impact, user retention, and regulatory compliance, metrics that directly affect business outcomes and stakeholder interests. Risk-tailored evaluation, as defined in Stanford's Law and AI Evals framework, measures system-level behavior: real-world benefits, risks, and business outcomes, compared against baselines such as previous model versions, industry standards, or human performance. Effective AI governance, that same framework argues, requires ongoing, legally grounded evaluation, not episodic benchmark runs before a launch.

A/B testing is the operational bridge between offline evaluation and real-world validation. In health AI specifically, online A/B testing post-deployment catches unpredictable failure modes that no benchmark surfaces, a finding highlighted at QCon London in March 2026. The logistical challenges are real: you need sufficient traffic volume to reach statistical significance, clean experiment isolation to avoid contamination between variants, and guardrails to limit user exposure to underperforming models.

Human-in-the-loop evaluation is the final layer, essential for edge cases where automated metrics break down. The AI-SQE workshop in April 2026 formalized agent-as-judge approaches alongside human review as complementary methods, not alternatives. The short sketches that follow make slice-based evaluation, baseline comparison, A/B analysis, and human review routing concrete.
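To make slice-based evaluation concrete, here is a minimal sketch in Python. It assumes predictions and ground-truth labels have already been collected into a pandas DataFrame; the column names, the toy data, and the 0.75 per-slice threshold are illustrative assumptions, not taken from any of the frameworks above.

    import pandas as pd
    from sklearn.metrics import f1_score

    def evaluate_by_slice(df: pd.DataFrame, slice_col: str, min_f1: float = 0.75) -> pd.DataFrame:
        """Compute F1 per slice and flag slices below the acceptance threshold."""
        rows = []
        for slice_value, group in df.groupby(slice_col):
            f1 = f1_score(group["label"], group["prediction"])
            rows.append({"slice": slice_value, "n": len(group), "f1": f1, "failing": f1 < min_f1})
        return pd.DataFrame(rows).sort_values("f1")

    # Toy data where the aggregate F1 is about 0.86, but the enterprise slice sits at 0.50.
    df = pd.DataFrame({
        "label":      [1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0],
        "prediction": [1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0],
        "segment":    ["retail"] * 8 + ["enterprise"] * 4,
    })
    print("overall F1:", f1_score(df["label"], df["prediction"]))
    print(evaluate_by_slice(df, "segment"))

The aggregate number looks healthy; only the per-slice report reveals that the smaller (and possibly higher-value) segment is failing.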
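The baseline comparison at the heart of risk-tailored evaluation can be sketched as a simple release gate. The metric names, tolerances, and numbers below are hypothetical placeholders for whatever business KPIs and guardrails your own framework defines; they are not drawn from the Stanford framework itself.

    # A minimal release gate: the candidate must beat the baseline on the primary
    # business KPI and must not regress beyond tolerance on any guardrail metric.
    # Higher is assumed to be better for every metric listed here.

    def passes_release_gate(candidate: dict, baseline: dict,
                            primary: str, guardrails: dict) -> bool:
        if candidate[primary] <= baseline[primary]:
            return False
        for metric, tolerance in guardrails.items():
            if candidate[metric] < baseline[metric] - tolerance:
                return False
        return True

    baseline  = {"revenue_per_session": 1.84, "retention_30d": 0.62, "compliance_pass_rate": 0.991}
    candidate = {"revenue_per_session": 1.91, "retention_30d": 0.61, "compliance_pass_rate": 0.993}

    # Retention may dip by at most 0.02; compliance may not dip at all.
    print(passes_release_gate(candidate, baseline,
                              primary="revenue_per_session",
                              guardrails={"retention_30d": 0.02, "compliance_pass_rate": 0.0}))

The same comparison works against any baseline mentioned above: the previous model version, an industry standard, or measured human performance.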
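For the online side, a two-proportion z-test is one common way to decide whether an A/B difference in a binary success metric is statistically meaningful. This sketch assumes a simple per-user outcome such as task completion; the counts and the 0.05 significance level are illustrative only.

    import math
    from statistics import NormalDist

    def two_proportion_z_test(success_a: int, total_a: int,
                              success_b: int, total_b: int) -> tuple[float, float]:
        """Return (z, two-sided p-value) for the difference between two proportions."""
        p_a, p_b = success_a / total_a, success_b / total_b
        pooled = (success_a + success_b) / (total_a + total_b)
        se = math.sqrt(pooled * (1 - pooled) * (1 / total_a + 1 / total_b))
        z = (p_b - p_a) / se
        p_value = 2 * (1 - NormalDist().cdf(abs(z)))
        return z, p_value

    # Control (current model) vs. treatment (candidate), one binary outcome per user.
    z, p = two_proportion_z_test(success_a=412, total_a=5000, success_b=468, total_b=5000)
    print(f"z = {z:.2f}, p = {p:.4f}")
    if p < 0.05:
        print("Difference is statistically significant at the 5% level.")
    else:
        print("Not significant yet; keep collecting traffic before deciding.")

In practice the same harness would also enforce the guardrails mentioned above, for example by capping the share of traffic routed to the candidate until early results clear a safety bar.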
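Finally, one way to treat agent-as-judge scoring and human review as complementary, in the spirit described above, is to escalate only the cases where the automated judge is uncertain or disagrees with the model. The confidence threshold and the shape of the review queue are assumptions for illustration, not part of the AI-SQE proposal.

    from dataclasses import dataclass, field

    @dataclass
    class ReviewQueue:
        """Collects cases that automated evaluation cannot settle on its own."""
        items: list = field(default_factory=list)

        def escalate(self, case_id: str, reason: str) -> None:
            self.items.append({"case_id": case_id, "reason": reason})

    def triage(case_id: str, model_verdict: bool, judge_verdict: bool,
               judge_confidence: float, queue: ReviewQueue,
               min_confidence: float = 0.8) -> bool | None:
        """Return an automated verdict, or None when the case goes to a human."""
        if judge_confidence < min_confidence:
            queue.escalate(case_id, "judge confidence below threshold")
            return None
        if model_verdict != judge_verdict:
            queue.escalate(case_id, "model and judge disagree")
            return None
        return judge_verdict

    queue = ReviewQueue()
    print(triage("case-001", model_verdict=True,  judge_verdict=True,  judge_confidence=0.95, queue=queue))
    print(triage("case-002", model_verdict=True,  judge_verdict=False, judge_confidence=0.92, queue=queue))
    print(triage("case-003", model_verdict=False, judge_verdict=False, judge_confidence=0.55, queue=queue))
    print(queue.items)  # the two escalated cases awaiting human review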
Together, offline slice-based analysis, A/B testing, and human review form a complete evaluation lifecycle that benchmarks alone cannot replicate. The takeaway is this, Yuan: real-world evaluation must prioritize business KPIs and user experience over benchmark leaderboards. A model that excels in public rankings but harms your core business metrics is a liability, not an asset. Build evaluation into every stage — offline before deployment, online after it — and treat human judgment as a required input, not a fallback. The benchmark tells you the model can play the game. Only production evaluation tells you whether it can win the one that matters.