Emotional Text-to-Speech
Humanising open-source TTS. Adding emotion to machines.
Most open-source TTS models produce monotone, flat audio. I fine-tuned Parler-TTS to accept emotion as an input parameter and built a voice consistency system on top. The fine-tuned model has 1000+ downloads on HuggingFace. Built at DeepSearch Labs.
The Problem
Open-source TTS is good at clarity. It's bad at emotion. Parler-TTS was trained on 45,000 hours of audiobooks, which optimise for clear narration, not for sounding happy, angry, or excited. If you want expressive speech, you're stuck with expensive proprietary APIs.
There's a second problem: Parler-TTS generates a random voice every time. Say you're generating three sentences of an ad. Each sentence sounds like a different person. There's no built-in way to keep the voice consistent across generations.
What I Built
Emotional Fine-Tuning
Collected 4 open-source emotional audio datasets. Used DataSpeech to extract audio features and convert them into natural language prompts with emotional context (e.g., "young man, excited, fast pace"). Fine-tuned Parler-TTS on this combined set. The model now accepts emotion as an input parameter. User testing across 20 volunteers: ~4/5 on perceived naturalness.
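The DataSpeech step boils down to mapping extracted audio features plus an emotion label into a natural-language description prompt. A minimal sketch of that mapping, with illustrative feature names and bucket thresholds (the real pipeline's exact features and cutoffs are assumptions here):

```python
# Sketch of assembling an emotion-conditioned description prompt from
# extracted audio features, DataSpeech-style. Feature names, buckets,
# and thresholds are illustrative assumptions, not the project's exact ones.

def build_description(gender: str, age: str, emotion: str,
                      speaking_rate: float, pitch: str) -> str:
    """Map audio features + an emotion label to a natural-language prompt."""
    # Bucket the continuous speaking rate into a coarse descriptor.
    if speaking_rate > 1.15:
        pace = "fast pace"
    elif speaking_rate < 0.85:
        pace = "slow pace"
    else:
        pace = "moderate pace"
    return f"{age} {gender}, {emotion}, {pace}, {pitch} pitch"

print(build_description("man", "young", "excited", 1.3, "high"))
# -> young man, excited, fast pace, high pitch
```

The resulting string is what the fine-tuned model consumes as its description input, with the emotion word doing the conditioning.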
Voice Consistency
Trained a SpeechBrain voice classifier that labels each generated clip with the closest matching speaker ID, so that speaker can be pinned in subsequent prompts. This means you can generate multiple sentences and they'll all sound like the same person, which solves the core problem that made Parler-TTS unusable for multi-sentence output.
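The matching step itself is a nearest-neighbour lookup over speaker embeddings. A minimal sketch with cosine similarity over random stand-in vectors (the real system derives embeddings from a SpeechBrain speaker classifier; the bank structure and dimensionality here are assumptions):

```python
import numpy as np

# Minimal sketch of the speaker-matching step: embed each generated clip
# (the real system uses a SpeechBrain speaker classifier for this) and
# snap it to the closest known speaker by cosine similarity. These
# embeddings are random stand-ins for real speaker embeddings.

def closest_speaker(embedding: np.ndarray,
                    speaker_bank: dict) -> str:
    """Return the speaker ID whose reference embedding is most similar."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
    return max(speaker_bank, key=lambda sid: cos(embedding, speaker_bank[sid]))

rng = np.random.default_rng(0)
bank = {f"speaker_{i}": rng.normal(size=192) for i in range(4)}
# A clip whose embedding sits near speaker_2 should map back to it.
query = bank["speaker_2"] + 0.05 * rng.normal(size=192)
print(closest_speaker(query, bank))  # -> speaker_2
```

Once every generation is tagged this way, the pipeline can keep only the clips that map to the chosen speaker, or feed that speaker's description back into the prompt.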
Agent Orchestration
Three CrewAI agents coordinate the ad creation pipeline: Script Creator generates copy, Art Director creates voice/tone direction, Script Refiner modifies specific sentences without changing the rest. A multi-layer validation system ensures edits don't bleed across sentences.
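One of those validation layers can be sketched as a simple containment check: after the Script Refiner returns a revised script, accept the edit only if every sentence other than the target is byte-identical. The function name and list-of-sentences representation are assumptions for illustration; the real pipeline's validators are more involved.

```python
# Sketch of one validation layer for the Script Refiner: reject any
# revision that touches sentences outside the requested one. The
# representation (script as a list of sentences) is an assumption.

def validate_edit(original: list, revised: list, target_idx: int) -> bool:
    """Accept the edit only if every non-target sentence is untouched."""
    if len(original) != len(revised):
        return False  # sentences were added or dropped
    return all(o == r
               for i, (o, r) in enumerate(zip(original, revised))
               if i != target_idx)

script = ["Try FreshBrew today.", "It tastes amazing.", "Order now."]
edited = ["Try FreshBrew today.", "It tastes incredible.", "Order now."]
print(validate_edit(script, edited, target_idx=1))  # -> True
```

If the check fails, the pipeline can re-prompt the Refiner rather than let an unrequested change reach the TTS stage.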
Evaluation
Dual-metric evaluation: cross-entropy loss for technical accuracy, plus Multi-Scale STFT loss for perceptual quality, since magnitude spectra compared at multiple resolutions track human hearing more closely than token-level accuracy. The model performs best on "happy" and "excited" emotions, weaker on "angry" and "frustrated" due to dataset imbalance.
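The perceptual metric can be sketched in a few lines: compute magnitude spectrograms at several FFT sizes and sum a log-magnitude L1 term across scales. The window/hop choices are assumptions, and published MS-STFT losses typically add a spectral-convergence term on top of this.

```python
import numpy as np

# Illustrative Multi-Scale STFT loss in NumPy: compare magnitude spectra
# at several FFT resolutions and sum a log-magnitude L1 term. Scales and
# the exact loss terms are assumptions for this sketch.

def stft_mag(x: np.ndarray, n_fft: int, hop: int) -> np.ndarray:
    """Magnitude spectrogram via Hann-windowed framed FFT."""
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def ms_stft_loss(pred: np.ndarray, ref: np.ndarray,
                 scales=((512, 128), (1024, 256), (2048, 512))) -> float:
    loss = 0.0
    for n_fft, hop in scales:
        p, r = stft_mag(pred, n_fft, hop), stft_mag(ref, n_fft, hop)
        loss += float(np.mean(np.abs(np.log(p + 1e-7) - np.log(r + 1e-7))))
    return loss

t = np.linspace(0, 1, 16000, endpoint=False)
clean = np.sin(2 * np.pi * 220 * t)
print(ms_stft_loss(clean, clean))  # -> 0.0 (identical signals)
```

Because the loss averages over several resolutions, it penalises both broad spectral-envelope errors (coarse scales) and fine harmonic errors (fine scales), which is why it correlates better with perceived quality than a single-resolution comparison.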
Listening tests with 20 volunteers scored ~4/5 on naturalness. Users noted the fine-tuned model sounded noticeably more human than base Parler-TTS, especially on emotional content. One generation even improvised a word that wasn't in the script, and it still sounded natural.
Stack
Next.js
FastAPI, Llama 3B
Fine-tuned Parler-TTS, SpeechBrain classifier, DataSpeech
CrewAI (3 agents: Script Creator, Art Director, Refiner)
Links