Genesis Prompt-Format Arena: Which Format Makes the Best Audio Course?

The 15-minute, two-speaker single-episode podcast wins the arena: it is the only format with a 100% gate pass rate, and it has the highest Elo rating at 1096.49.
Why we built this benchmark
When we set out to build SUN, we made a bet: audio quality is the product. A learner spending 30 minutes with an AI-generated course should not have to fight the voice — it should feel like listening to a trusted expert, not a robot reading slides.
But evaluating audio quality is notoriously hard. Most TTS benchmarks use short sentences, lab-controlled conditions, and metrics like word error rate (WER) that reward robotic over-enunciation. None of them test what we care about: long-form engagement, prosodic rhythm across a 15-minute lesson, and how naturalness degrades at scale.
So we built our own. Over six weeks, we ran a blind evaluation study on 18,400 audio samples across six production TTS providers. Here is what we found.
Results
The table below shows overall results across all providers. Rankings are based on the primary MOS naturalness score from the blind human evaluation study. All scores are means across the full 920-passage corpus.
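As a rough illustration of how a mean opinion score (MOS) ranking can be computed, here is a minimal sketch. The provider names and ratings are invented placeholders, not data from the study, and the function name `rank_by_mos` is hypothetical. The post does not say how ratings were aggregated per passage, so this sketch simply pools all ratings for a provider into a single mean.

```python
from statistics import mean


def rank_by_mos(ratings):
    """Rank providers by mean opinion score, highest first.

    `ratings` maps a provider name to a list of 1-5 listener ratings.
    Providers with no ratings are skipped.
    """
    scores = {name: mean(values) for name, values in ratings.items() if values}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


# Placeholder ratings for illustration only (not from the benchmark).
sample_ratings = {
    "provider_a": [4, 5, 4, 4],
    "provider_b": [3, 4, 3, 4],
}

for name, mos in rank_by_mos(sample_ratings):
    print(f"{name}: {mos:.2f}")
```

A production analysis would likely also report a confidence interval per provider, since differences between means are only meaningful relative to rating variance.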
Key findings
Beyond the headline MOS scores, three findings stood out as particularly significant for audio learning applications:
SUN's long-form quality doesn't degrade
Every competitor showed a measurable drop in naturalness scores between their 2-minute and 20-minute ratings. ElevenLabs v3, the strongest competitor, dropped from 4.02 to 3.61, a 10.2% degradation over the course of a full lesson. SUN's long-form score held at 4.28, within the margin of error of its short-form baseline.
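The degradation figure is a relative drop: the difference between the short-form and long-form scores, divided by the short-form score. A minimal sketch of that arithmetic, using the two ElevenLabs v3 scores quoted above (the helper name `relative_drop` is ours, chosen for illustration):

```python
def relative_drop(short_form: float, long_form: float) -> float:
    """Return the decrease from short_form to long_form as a percentage."""
    if short_form <= 0:
        raise ValueError("short_form score must be positive")
    return (short_form - long_form) / short_form * 100.0


# Scores quoted in the text for the 2-minute and 20-minute ratings.
drop = relative_drop(4.02, 3.61)
print(f"{drop:.1f}%")  # prints "10.2%"
```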
This matters enormously for courses. A learner isn't evaluating a 30-second clip — they're spending an hour with a voice. Naturalness drift is one of the main reasons learners abandon AI-generated content.
Conclusion
This benchmark was built for one reason: to make sure we are making decisions about audio quality with honest data, not assumptions. The results gave us confidence — and they gave us specific areas to keep improving.
We're publishing the full methodology, scoring rubrics, and a subset of anonymized sample pairs so other researchers and developers can build on it. The full dataset is available on request for academic use.