Research

Genesis Prompt-Format Arena: Which Format Makes the Best Audio Course?

We ran a head-to-head arena across 60 transcripts and 3 prompt formats: two-speaker podcast, single speaker, and mixed course. Only one format passed every gate and won the Elo rankings outright.

Sun Team


May 19, 2026

14 min read


TL;DR

The 15-minute, two-speaker podcast single episode wins the arena. It is the only format with a 100% gate pass rate, and it has the highest Elo rating, at 1096.49.
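For readers unfamiliar with Elo, the sketch below shows how ratings can be derived from head-to-head comparisons using the standard Elo update rule. It is a minimal illustration, not the arena's actual scoring code: the starting rating of 1000, the K-factor of 32, the function names, and the example match outcomes are all assumptions.

```python
from typing import Dict, Iterable, Tuple


def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A beats B under the standard Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, score_a: float,
               k: float = 32.0) -> Tuple[float, float]:
    """Return updated ratings after one match.

    score_a is 1.0 if A wins, 0.0 if B wins, and 0.5 for a tie.
    k (the K-factor) is an assumed value, not taken from the study.
    """
    exp_a = expected_score(rating_a, rating_b)
    delta = k * (score_a - exp_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


def run_arena(matches: Iterable[Tuple[str, str, float]],
              start: float = 1000.0, k: float = 32.0) -> Dict[str, float]:
    """Replay (format_a, format_b, score_a) matches in order and return ratings."""
    ratings: Dict[str, float] = {}
    for a, b, score_a in matches:
        ra = ratings.setdefault(a, start)
        rb = ratings.setdefault(b, start)
        ratings[a], ratings[b] = update_elo(ra, rb, score_a, k)
    return ratings


# Hypothetical outcomes for illustration only; these are not the study's data.
example_matches = [
    ("two_speaker_podcast", "single_speaker", 1.0),
    ("two_speaker_podcast", "mixed_course", 1.0),
    ("single_speaker", "mixed_course", 0.5),
]
ratings = run_arena(example_matches)
for name, rating in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {rating:.2f}")
```

Because each update is zero-sum, the ratings always total the number of formats times the starting rating. A rating above the starting value therefore means a format won more often than its opponents' ratings predicted.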

Why we built this benchmark

When we set out to build SUN, we made a bet: audio quality is the product. A learner spending 30 minutes with an AI-generated course should not have to fight the voice. The course should feel like listening to a trusted expert, not a robot reading slides.

But evaluating audio quality is notoriously hard. Most TTS benchmarks use short sentences, lab-controlled conditions, and metrics like WER that optimize for robotic over-enunciation. None of them test what we care about: long-form engagement, prosodic rhythm across a 15-minute lesson, and how naturalness degrades at scale.

So we built our own. Over six weeks, we ran a blind evaluation study on 18,400 audio samples across six production TTS providers. Here is what we found.

18.4k audio samples evaluated

6 TTS providers tested

340 study participants

100% winner gate pass rate

Results

The table below shows overall results across all providers. Rankings are based on the primary MOS naturalness score from the blind human evaluation study. All scores are means across the full 920-passage corpus.

| Platform | Comprehension | Retention |
|---|---|---|
| SUN | 9.1 | 8.7 |
| ElevenLabs | 7.8 | 6.5 |
| OpenAI TTS HD | 7.4 | 6.9 |
| PlayHT 3.0 | 8.2 | 6.1 |
| Murf AI | 6.9 | 5.8 |

Key findings

Beyond the headline MOS scores, three findings stood out as particularly significant for audio learning applications:

Long-form quality doesn't degrade

Every competitor showed a measurable drop in naturalness scores between its 2-minute and 20-minute ratings. ElevenLabs v3, the strongest competitor, dropped from 4.02 to 3.61, a 10.2% relative degradation over the course of a full lesson. SUN's long-form score held at 4.28, within the margin of error of its short-form baseline.
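The 10.2% figure is a relative drop, measured against the short-form score. A minimal sketch of that arithmetic follows; the function name is hypothetical, and the only inputs are the two ElevenLabs v3 scores quoted above.

```python
def relative_drop(short_form: float, long_form: float) -> float:
    """Fractional drop from the short-form score to the long-form score."""
    if short_form <= 0:
        raise ValueError("short_form score must be positive")
    return (short_form - long_form) / short_form


# ElevenLabs v3 scores quoted in the text: 4.02 (2-minute) and 3.61 (20-minute).
drop = relative_drop(4.02, 3.61)
print(f"{drop:.1%}")  # prints 10.2%
```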

This matters enormously for courses. A learner isn't evaluating a 30-second clip — they're spending an hour with a voice. Naturalness drift is one of the main reasons learners abandon AI-generated content.

"Audio quality is not just a feature. For learning, it is the entire experience — and the first thing that breaks retention when it degrades."

Conclusion

This benchmark was built for one reason: to make sure we are making decisions about audio quality with honest data, not assumptions. The results gave us confidence — and they gave us specific areas to keep improving.

We're publishing the full methodology, scoring rubrics, and a subset of anonymized sample pairs so other researchers and developers can build on it. The full dataset is available on request for academic use.
