An AI-generated podcast: how we built it and what we've learned
A few days ago I published the first episodes of El podcast de Sergio and El informativo on Apple Podcasts. Neither was recorded by a human.
The entire process (finding the topic, writing the script, synthesising the voice, assembling the audio and publishing it) is handled by an AI agent. Here’s how it works and what we’ve learned along the way.
Why
Not because it’s the easiest way to make a podcast. I did it because I wanted to explore how far synthetic voice quality can go in Spanish, and because building an automated production pipeline seemed like an interesting technical problem. You choose the topic and review the script; the agent handles the rest.
The script
Claude Code searches for the day’s news or researches the episode topic, summarises it, expands it with context and background, and generates the text structured by segments. The result isn’t perfect and needs reviewing and adjusting, but it’s a solid starting point that takes seconds instead of hours.
Generating the audio
The script text is split into segments and each one is sent separately to the ElevenLabs API. The reason is technical: voice synthesis models lose consistency on very long texts, so it’s better to work paragraph by paragraph and then join the resulting audio files. In debate episodes, each character’s turn is an independent segment with its own voice. At the end, ffmpeg concatenates them into a single MP3.
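The split-and-join step can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the function names, the 800-character limit and the segment file names are assumptions, and the ElevenLabs request for each segment is left out. The join relies on ffmpeg's concat demuxer, which reads a list of input files.

```python
from pathlib import Path


def split_into_segments(script: str, max_chars: int = 800) -> list[str]:
    """Split a script into paragraph-sized segments.

    Paragraphs are separated by blank lines. A paragraph longer than
    max_chars (an arbitrary, assumed limit) is split at sentence ends.
    """
    segments: list[str] = []
    for paragraph in script.split("\n\n"):
        paragraph = " ".join(paragraph.split())  # collapse whitespace
        if not paragraph:
            continue
        current = ""
        for sentence in paragraph.replace(". ", ".\n").split("\n"):
            candidate = f"{current} {sentence}".strip()
            if current and len(candidate) > max_chars:
                segments.append(current)
                current = sentence
            else:
                current = candidate
        if current:
            segments.append(current)
    return segments


def concat_list(audio_files: list[Path]) -> str:
    """Build the input list read by ffmpeg's concat demuxer."""
    return "".join(f"file '{path.as_posix()}'\n" for path in audio_files)


def ffmpeg_concat_command(list_file: Path, output: Path) -> list[str]:
    # -c copy joins the files without re-encoding; this assumes every
    # segment was generated with the same codec and sample rate.
    return ["ffmpeg", "-f", "concat", "-safe", "0",
            "-i", str(list_file), "-c", "copy", str(output)]


if __name__ == "__main__":
    script = "First paragraph. Two sentences.\n\nSecond paragraph."
    segments = split_into_segments(script)
    # Hypothetical file names: one MP3 per synthesised segment.
    files = [Path(f"segment_{i:03d}.mp3") for i in range(len(segments))]
    print(segments)
    print(concat_list(files), end="")
    print(" ".join(ffmpeg_concat_command(Path("list.txt"), Path("episode.mp3"))))
```

In a debate episode, the same idea would apply per speaking turn rather than per paragraph, with each turn carrying the voice to use.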
SSML, the markup language for voice synthesis, was more useful with Google TTS than with ElevenLabs. With Google, tags like <break> controlled pauses and <p> separated paragraphs. With ElevenLabs the model interprets text more naturally and barely needs additional instructions: full stops and paragraph breaks already give it enough context.
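For the Google TTS case, wrapping a script in SSML might look roughly like this. It is only a sketch: the function name and the 600 ms pause are assumptions, not values from our pipeline. It uses the standard <speak>, <p> and <break> elements and escapes the text so characters like & don't break the markup.

```python
from xml.sax.saxutils import escape


def to_ssml(paragraphs: list[str], pause_ms: int = 600) -> str:
    """Wrap paragraphs in <p> tags with a <break> between them."""
    # pause_ms is an assumed default; tune it by ear.
    pause = f'<break time="{pause_ms}ms"/>'
    body = pause.join(f"<p>{escape(text)}</p>" for text in paragraphs)
    return f"<speak>{body}</speak>"


if __name__ == "__main__":
    print(to_ssml(["Buenos días.", "Hoy hablamos de IA & audio."]))
```

With ElevenLabs the equivalent step is simply sending the plain paragraph text.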
The voices: a journey
This is where most of the work has gone. We tried, in order:
- macOS say: ruled out in two minutes. Robotic, no nuance.
- OpenAI TTS: good quality, but pronounces Spanish with an English accent on proper nouns and technical terms. Unacceptable.
- Google Cloud TTS Neural2: a big leap. Decent Spanish accent, within the free tier. It was our voice for the first episodes.
- Google Cloud TTS Chirp 3 HD: even better. Voices Orus (host) and Achernar (guest). We learned that the SSML <emphasis> tag makes the voice sound like a different person, so avoid it, and that English brand names at the start of a sentence get an English accent.
- ElevenLabs Guillermo + Jaiska: the definitive leap. Natural Peninsular Spanish accent, intonation that doesn’t sound synthetic. From episode 5 onwards, this is what we use.
Publishing
Each episode updates the RSS feed, the website and the player. Apple Podcasts reads the feed automatically and publishes without manual intervention. The only work left is choosing the topic and reviewing the script.
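The feed update can be illustrated with the standard library alone. This is a hedged sketch rather than our actual publishing script: the function name, the example URL and the choice of the audio URL as guid are assumptions. It adds an RSS 2.0 <item> with the <enclosure> that podcast apps use to locate the audio.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone
from email.utils import format_datetime

# Keep the itunes: prefix when a feed that uses it is re-serialised.
ET.register_namespace("itunes", "http://www.itunes.com/dtds/podcast-1.0.dtd")


def add_episode(feed_xml: str, title: str, audio_url: str,
                size_bytes: int, published: datetime) -> str:
    """Return the feed with a new <item> placed before existing items."""
    root = ET.fromstring(feed_xml)
    channel = root.find("channel")
    if channel is None:
        raise ValueError("not an RSS 2.0 feed: missing <channel>")

    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, "enclosure", url=audio_url,
                  length=str(size_bytes), type="audio/mpeg")
    # Assumption: the audio URL doubles as the permanent identifier.
    ET.SubElement(item, "guid").text = audio_url
    ET.SubElement(item, "pubDate").text = format_datetime(published)

    # Newest first: insert before the first existing <item>, or append.
    children = list(channel)
    position = next((i for i, child in enumerate(children)
                     if child.tag == "item"), len(children))
    channel.insert(position, item)
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    feed = ('<rss version="2.0"><channel><title>Demo</title>'
            '<item><title>Old</title></item></channel></rss>')
    print(add_episode(feed, "New episode", "https://example.com/ep2.mp3",
                      1234567, datetime(2024, 1, 1, 8, 0, tzinfo=timezone.utc)))
```

Once the updated feed is uploaded, podcast directories pick up the new item the next time they poll it.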
What doesn’t work yet
The voice doesn’t improvise. There are no natural pauses between ideas. The conversational rhythm in debate episodes is believable but not perfect: an attentive listener will notice it’s AI. For news content it works very well; for debate there’s room for improvement.
What does work
Consistency. I can publish a daily news bulletin without any effort. The script takes longer to review than to generate. And ElevenLabs quality in Spanish is, today, well above what I expected.
What’s coming
We’ll keep exploring and improving with each episode. Better scripts, better rhythm, more natural pauses between speaking turns. And we have our sights set on the next step: the video podcast. Avatars, faces, movement. If the voice already sounds good, the next challenge is making it look just as good.