Research

We benchmarked SUN across quality, cost, and speed against NotebookLM and Spotify (Opus 4.7)

A three-judge blind benchmark of AI-generated podcast scripts across SUN Course, SUN Podcast, NotebookLM, and Save-to-Spotify.

SUN AI & Research Team

May 21, 2026

AI can generate a podcast in seconds. But generating one that actually works as audio, something a listener can follow on a commute, is much harder and still not fully solved.

We wanted to understand how SUN's scripts compare with the strongest publicly available alternatives, so we ran a fair blind benchmark.

What we tested

We compared four pipelines on 100 matched topics spanning nine content categories, with four target durations (5, 15, 30, and 45 minutes) distributed uniformly across the topics. That yields 600 blind pairwise comparisons in a full round-robin, with no cached judgments. The pipelines were SUN Course (single-speaker), SUN Podcast (two-speaker), Google's NotebookLM, and the Save-to-Spotify pipeline running on Anthropic's Claude Opus 4.7 with OpenAI TTS.
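To make the round count concrete, here is a minimal sketch of how such a schedule could be built. It is illustrative only, not the benchmark's actual code: the `build_rounds` helper, the hash key format, and the use of the first digest byte to pick sides are all assumptions, loosely following the SHA-256 side assignment described under "How the test was run".

```python
import hashlib
from itertools import combinations

PIPELINES = ["SUN Course", "SUN Podcast", "NotebookLM", "Save-to-Spotify"]

def build_rounds(n_topics=100):
    """Return one (topic_id, podcast_a, podcast_b) tuple per pairwise round."""
    rounds = []
    for topic in range(1, n_topics + 1):
        topic_id = f"C-{topic:03d}"
        # Full round-robin: every unordered pair of pipelines meets once per topic.
        for left, right in combinations(PIPELINES, 2):
            # Assumed scheme: hash the topic and pair, then use the parity of
            # the first digest byte to decide which pipeline is "Podcast A".
            key = f"{topic_id}|{left}|{right}".encode("utf-8")
            swap = hashlib.sha256(key).digest()[0] % 2 == 1
            a, b = (right, left) if swap else (left, right)
            rounds.append((topic_id, a, b))
    return rounds

rounds = build_rounds()
print(len(rounds))  # 6 pairs x 100 topics = 600
```

Because the side is derived from a hash rather than a random number generator, rerunning the schedule gives the same assignment every time, which makes the setup reproducible.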

Each pair was judged by three large language models acting as blinded listeners: Claude Opus 4.7, Gemini 3.1 Pro Preview, and GPT-5.5 with high reasoning.

The judges only saw "Podcast A" and "Podcast B." They never saw product names. For each round, the Arena result was decided by a 2-out-of-3 majority vote across the three judges.
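The 2-out-of-3 rule is simple enough to sketch directly. The `majority_winner` function below is a hypothetical helper, and it assumes every judge always returns exactly "A" or "B" with no ties or abstentions.

```python
from collections import Counter

def majority_winner(votes):
    """Return "A" or "B" given three judge votes, each "A" or "B"."""
    if len(votes) != 3 or any(v not in ("A", "B") for v in votes):
        raise ValueError("expected exactly three votes, each 'A' or 'B'")
    # With three binary votes, one side always has at least two.
    winner, _count = Counter(votes).most_common(1)[0]
    return winner

print(majority_winner(["A", "B", "A"]))  # A
```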

An important scope note before the results: this is a study of the written scripts, not rendered audio. Voice, prosody, music, and actual listening retention are the subject of a separate upcoming human-panel study.

SUN Course leads the four-pipeline tournament

Blind AI-Judge script preference rankings — Bradley-Terry scores

Bradley-Terry log-odds advantage over the weakest pipeline, with 95% bootstrap CIs. Higher is better.

SUN Course finished first by both Elo and Bradley-Terry score, with a 191-109 majority record across its 300 rounds. Of its three pairwise matchups, two reached statistical significance after Bonferroni correction across all six pairwise tests. One of those two is the internal matchup against SUN Podcast, which the totals imply SUN Course won 68 to 32 (191 minus the 66 and 57 wins listed here). The two external matchups:

SUN Course vs Save-to-Spotify: 66 to 34 (Bonferroni p = 0.011). Significant. This is the only SUN-vs-external matchup that survives correction.
SUN Course vs NotebookLM: 57 to 43. Directionally in favor of SUN Course, but the margin does not survive Bonferroni correction at n = 100 (raw p = 0.19). The three judges also split on this pair: Gemini and GPT-5.5 both preferred SUN Course, while Claude Opus 4.7 narrowly preferred NotebookLM, 52-48.
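For readers who want to check the p-values, here is a small sketch of an exact two-sided binomial sign test with Bonferroni correction, using only the standard library. The function names are ours for illustration, and the correction factor of six comes from the six pairwise comparisons.

```python
from math import comb

def sign_test_p(wins, n):
    """Exact two-sided binomial sign test against a 50/50 null."""
    k = max(wins, n - wins)
    # Probability of a result at least this lopsided in one direction.
    upper_tail = sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
    # The null is symmetric, so the two-sided p-value doubles the tail.
    return min(1.0, 2 * upper_tail)

def bonferroni(p, m=6):
    """Multiply by the number of tests, capped at 1."""
    return min(1.0, p * m)

raw_spotify = sign_test_p(66, 100)
raw_notebooklm = sign_test_p(57, 100)
print(f"66-34: Bonferroni p = {bonferroni(raw_spotify):.3f}")
print(f"57-43: raw p = {raw_notebooklm:.2f}")
```

Run on the two records above, this should agree with the quoted values to rounding.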

What each judge thought, separately

Judge preference distribution across 600 evaluations

Each judge's overall winning votes (out of 600 calls per judge). Two of the three judges put SUN Course first overall. Opus 4.7 puts NotebookLM narrowly first, with SUN Course as its second pick.

We deliberately wanted at least one judge from a different lab than any pipeline we benchmarked. Save-to-Spotify uses Claude Opus 4.7 under the hood, and Claude Opus 4.7 is also one of our three judges. If anything, that overlap creates a conservative bias against SUN Course in the headline matchup: Opus is the only judge that does not put SUN Course on top, while Gemini 3.1 Pro and GPT-5.5 both do.

The aggregate, reported carefully

Across all four SUN-vs-external pairings, SUN variants won 234 of 400 majority-vote rounds, a 58.5% win rate (95% Wilson CI 53.6%–63.2%; exact two-sided p = 7.87e-4). This aggregate is a real product-level signal, but it pools one decisive matchup (SUN Course vs Save-to-Spotify) with three closer ones. We are sharing it because we think the product-level signal is meaningful; we are not using it to back up any per-matchup claim.
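The Wilson interval for the aggregate can be reproduced in a few lines. This is a sketch assuming the standard Wilson score formula with z = 1.96; `wilson_interval` is a name chosen here for illustration.

```python
from math import sqrt

def wilson_interval(wins, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    p = wins / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half_width, center + half_width

low, high = wilson_interval(234, 400)
print(f"{low:.1%} to {high:.1%}")  # 53.6% to 63.2%
```

Unlike the simpler normal-approximation interval, the Wilson interval is not centered exactly on the observed 58.5%, which is why the bounds are slightly asymmetric.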

Cost and time

For the first 10 prompts in the benchmark (C-001 through C-010, with target lengths varying across 5, 15, 30, and 45 minutes), we have full cost and generation-time data for both SUN pipelines, NotebookLM, and Save-to-Spotify. Save-to-Spotify generation has completed only these first 10 prompts so far; we'll publish the remainder when it is ready. We do not expect the cost and time numbers to change much: NotebookLM has timing data for the whole set of 100 prompts, and on that dataset we confirmed statistically that the numbers vary little.

Aggregate cost per finished minute of generated audio, summed across C-001..C-010.

Generation seconds per finished minute of audio, summed across C-001..C-010.

SUN Course (V1) costs about $0.036 per finished minute of audio; SUN Podcast (V2) costs about $0.038 per finished minute.
Save-to-Spotify (Claude Opus 4.7 + OpenAI TTS) costs about $0.137 per finished minute, roughly 3.6–3.8× more than either SUN pipeline.
NotebookLM's actual cost is unknown.
On generation time per finished minute, SUN Podcast was the fastest on this sample (~24 s/min), with NotebookLM and SUN Course close behind (~24 s/min and ~26 s/min). Save-to-Spotify took about 38 s per finished minute, roughly 47% slower than SUN Course and ~61% slower than SUN Podcast.
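The cost multiple follows directly from the rounded per-minute figures. A quick check, using only the numbers quoted above (the dictionary layout is just for illustration):

```python
# Rounded cost per finished minute of audio, in USD, as quoted above.
cost_per_min = {
    "SUN Course": 0.036,
    "SUN Podcast": 0.038,
    "Save-to-Spotify": 0.137,
}

baseline = cost_per_min["Save-to-Spotify"]
cost_ratio = {
    name: baseline / cost
    for name, cost in cost_per_min.items()
    if name != "Save-to-Spotify"
}

for name, ratio in cost_ratio.items():
    print(f"Save-to-Spotify costs {ratio:.1f}x as much as {name}")
```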

How the test was run

100 topics across nine categories and four target durations.
Full 4-way round-robin: 600 majority-vote rounds.
Side assignment (Podcast A vs Podcast B) was randomized with a SHA-256 hash of the sample and pair, so no pipeline was locked to one side.
Each round judged by three language models from three labs; the 2-of-3 majority decided the Arena outcome.
Significance: exact two-sided binomial sign tests over the 100 matched topics per pair, Bonferroni-corrected across six pairwise comparisons.
Bradley-Terry log-strengths fit by MM iteration, 95% CIs from 1,000 pair-level bootstraps.
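For readers unfamiliar with the Bradley-Terry fit, the sketch below shows one common form of the MM (minorization-maximization) update, followed by the log-odds advantage over the weakest item. It is a minimal illustration, not our production code: the `bradley_terry_mm` name, the fixed iteration count, and the toy win counts are all assumptions, and the bootstrap step is omitted.

```python
from math import log

def bradley_terry_mm(wins, iters=200):
    """Fit Bradley-Terry strengths by MM iteration.

    wins[i][j] is the number of rounds item i beat item j.
    Returns strengths normalized to sum to 1.
    """
    k = len(wins)
    p = [1.0 / k] * k
    for _ in range(iters):
        new_p = []
        for i in range(k):
            total_wins = sum(wins[i][j] for j in range(k) if j != i)
            # Each opponent contributes (games played) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(k) if j != i
            )
            new_p.append(total_wins / denom)
        total = sum(new_p)
        p = [x / total for x in new_p]
    return p

# Toy, made-up win counts for three hypothetical items (not benchmark data).
toy_wins = [
    [0, 60, 70],
    [40, 0, 55],
    [30, 45, 0],
]
strengths = bradley_terry_mm(toy_wins)
weakest = min(strengths)
# Log-odds advantage over the weakest item; the weakest item scores 0.
log_advantage = [log(s / weakest) for s in strengths]
print([round(x, 3) for x in log_advantage])
```

Confidence intervals would come from refitting on resampled data, for example by resampling pairs with replacement and taking percentiles of the refit scores.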

What's next

A human listening panel on rendered audio.
