Emotional Text-to-Speech
Humanising open-source TTS. Adding emotion to machines.
Most open-source TTS models produce monotone, flat audio. I fine-tuned Parler-TTS to accept emotion as an input parameter and built a voice consistency system on top. The fine-tuned model has 1000+ downloads on HuggingFace. Built at DeepSearch Labs.
The Problem
Open-source TTS is good at clarity. It's bad at emotion. Parler-TTS was trained on 45,000 hours of audiobooks, which optimise for clear narration, not for sounding happy, angry, or excited. If you want expressive speech, you're stuck with expensive proprietary APIs.
There's a second problem: Parler-TTS generates a random voice every time. Say you're generating three sentences of an ad. Each sentence sounds like a different person. There's no built-in way to keep the voice consistent across generations.
What I Built
Emotional Fine-Tuning
Collected 4 open-source emotional audio datasets. Used DataSpeech to extract audio features and convert them into natural language prompts with emotional context (e.g., "young man, excited, fast pace"). Fine-tuned Parler-TTS on this combined set. The model now accepts emotion as an input parameter. User testing across 20 volunteers: ~4/5 on perceived naturalness.
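The DataSpeech step boils down to mapping extracted audio features plus an emotion label into a natural-language description prompt. A minimal sketch of that mapping, with illustrative feature names and bucket thresholds (the real pipeline's exact features and cutoffs are assumptions here):

```python
# Sketch of assembling an emotion-conditioned description prompt from
# extracted audio features, DataSpeech-style. Feature names, buckets,
# and thresholds are illustrative assumptions, not the project's exact ones.

def build_description(gender: str, age: str, emotion: str,
                      speaking_rate: float, pitch: str) -> str:
    """Map audio features + an emotion label to a natural-language prompt."""
    # Bucket the continuous speaking rate into a coarse descriptor.
    if speaking_rate > 1.15:
        pace = "fast pace"
    elif speaking_rate < 0.85:
        pace = "slow pace"
    else:
        pace = "moderate pace"
    return f"{age} {gender}, {emotion}, {pace}, {pitch} pitch"

print(build_description("man", "young", "excited", 1.3, "high"))
# -> young man, excited, fast pace, high pitch
```

The resulting string is what the fine-tuned model consumes as its description input, with the emotion word doing the conditioning.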
Voice Consistency
Trained a SpeechBrain voice classifier that labels each generated clip with the closest matching speaker ID, so that speaker can be pinned in subsequent prompts. This means you can generate multiple sentences and they'll all sound like the same person, which solves the core problem that made Parler-TTS unusable for multi-sentence output.
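The matching step itself is a nearest-neighbour lookup over speaker embeddings. A minimal sketch with cosine similarity over random stand-in vectors (the real system derives embeddings from a SpeechBrain speaker classifier; the bank structure and dimensionality here are assumptions):

```python
import numpy as np

# Minimal sketch of the speaker-matching step: embed each generated clip
# (the real system uses a SpeechBrain speaker classifier for this) and
# snap it to the closest known speaker by cosine similarity. These
# embeddings are random stand-ins for real speaker embeddings.

def closest_speaker(embedding: np.ndarray,
                    speaker_bank: dict) -> str:
    """Return the speaker ID whose reference embedding is most similar."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(speaker_bank, key=lambda sid: cos(embedding, speaker_bank[sid]))

rng = np.random.default_rng(0)
bank = {f"speaker_{i}": rng.normal(size=192) for i in range(4)}
# A clip whose embedding sits near speaker_2 should map back to it.
query = bank["speaker_2"] + 0.05 * rng.normal(size=192)
print(closest_speaker(query, bank))  # -> speaker_2
```

Once every generation is tagged this way, the pipeline can keep only the clips that map to the chosen speaker, or feed that speaker's description back into the prompt.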
Agent Orchestration
Three CrewAI agents coordinate the ad creation pipeline: Script Creator generates copy, Art Director creates voice/tone direction, Script Refiner modifies specific sentences without changing the rest. A multi-layer validation system ensures edits don't bleed across sentences.
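One of those validation layers can be sketched as a simple containment check: after the Script Refiner returns a revised script, accept the edit only if every sentence other than the target is byte-identical. The function name and list-of-sentences representation are assumptions for illustration; the real pipeline's validators are more involved.

```python
# Sketch of one validation layer for the Script Refiner: reject any
# revision that touches sentences outside the requested one. The
# representation (script as a list of sentences) is an assumption.

def validate_edit(original: list, revised: list, target_idx: int) -> bool:
    """Accept the edit only if every non-target sentence is untouched."""
    if len(original) != len(revised):
        return False  # sentences were added or dropped
    return all(o == r
               for i, (o, r) in enumerate(zip(original, revised))
               if i != target_idx)

script = ["Try FreshBrew today.", "It tastes amazing.", "Order now."]
edited = ["Try FreshBrew today.", "It tastes incredible.", "Order now."]
print(validate_edit(script, edited, target_idx=1))  # -> True
```

If the check fails, the pipeline can re-prompt the Refiner rather than let an unrequested change reach the TTS stage.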
Evaluation
Dual-metric evaluation: cross-entropy loss for technical accuracy, plus Multi-Scale STFT loss for perceptual quality, since magnitude spectra compared at multiple resolutions track human hearing more closely than token-level accuracy. The model performs best on "happy" and "excited" emotions, weaker on "angry" and "frustrated" due to dataset imbalance.
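The perceptual metric can be sketched in a few lines: compute magnitude spectrograms at several FFT sizes and sum a log-magnitude L1 term across scales. The window/hop choices are assumptions, and published MS-STFT losses typically add a spectral-convergence term on top of this.

```python
import numpy as np

# Illustrative Multi-Scale STFT loss in NumPy: compare magnitude spectra
# at several FFT resolutions and sum a log-magnitude L1 term. Scales and
# the exact loss terms are assumptions for this sketch.

def stft_mag(x: np.ndarray, n_fft: int, hop: int) -> np.ndarray:
    """Magnitude spectrogram via Hann-windowed framed FFT."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def ms_stft_loss(pred: np.ndarray, ref: np.ndarray,
                 scales=((512, 128), (1024, 256), (2048, 512))) -> float:
    loss = 0.0
    for n_fft, hop in scales:
        p, r = stft_mag(pred, n_fft, hop), stft_mag(ref, n_fft, hop)
        loss += float(np.mean(np.abs(np.log(p + 1e-7) - np.log(r + 1e-7))))
    return loss

t = np.linspace(0, 1, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 220 * t)
print(ms_stft_loss(clean, clean))  # -> 0.0 (identical signals)
```

Because the loss averages over several resolutions, it penalises both broad spectral-envelope errors (coarse scales) and fine harmonic errors (fine scales), which is why it correlates better with perceived quality than a single-resolution comparison.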
Listening tests with 20 volunteers scored ~4/5 on naturalness. Users noted the fine-tuned model sounded noticeably more human than base Parler-TTS, especially on emotional content. One generation even improvised a word that wasn't in the script, and it still sounded natural.
Stack
Next.js
FastAPI, Llama 3B
Fine-tuned Parler-TTS, SpeechBrain classifier, DataSpeech
CrewAI (3 agents: Script Creator, Art Director, Refiner)
Links