Empathy Engine

Giving AI a Human Voice

Prepared for the Darwix AI Evaluation assignment

K V Jaya Harsha | Personal Website | GitHub | LinkedIn

Approach 1 — Implemented (This Demo)

Uses the Inworld TTS API with dynamic vocal parameter modulation driven by a HuggingFace emotion classifier:

Emotion Detection — j-hartmann/emotion-english-distilroberta-base — 7 classes: joy, surprise, anger, disgust, fear, sadness, neutral
Intensity Scaling — The classifier's confidence score (0–1) linearly scales the vocal parameters, so "I'm okay" sounds different from "THIS IS AMAZING"
Vocal Modulation — speakingRate (speed) + temperature (expressiveness) — both dynamically computed per emotion
Emotion Mapping — Each emotion has a defined base rate + temperature, scaled by intensity score
TTS Engine — Inworld TTS inworld-tts-1.5-max, voice: Clive, output: .mp3
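
The detection and intensity steps above can be sketched as follows. This is a minimal illustration, not the demo's actual code: the helper names (load_classifier, top_emotion, detect) are hypothetical, and it assumes the intensity is simply the confidence of the top-scoring label. The classifier is loaded lazily so the scoring helper can run without downloading the model.

```python
from typing import Callable

MODEL_ID = "j-hartmann/emotion-english-distilroberta-base"
EMOTIONS = {"joy", "surprise", "anger", "disgust", "fear", "sadness", "neutral"}


def load_classifier() -> Callable:
    """Build the HuggingFace pipeline (downloads the model on first use)."""
    # Imported here so the helpers below work without transformers installed.
    from transformers import pipeline

    # top_k=None asks the pipeline for scores on all seven labels.
    return pipeline("text-classification", model=MODEL_ID, top_k=None)


def top_emotion(scores: list) -> tuple:
    """Return (label, intensity) for the highest-scoring emotion.

    Assumption: intensity is the confidence of the winning label, clamped to 0-1.
    """
    best = max(scores, key=lambda s: s["score"])
    label = best["label"].lower()
    if label not in EMOTIONS:
        raise ValueError(f"unexpected label: {label}")
    intensity = min(max(float(best["score"]), 0.0), 1.0)
    return label, intensity


def detect(text: str, classifier: Callable) -> tuple:
    """Classify one sentence and reduce the result to (label, intensity)."""
    result = classifier(text)
    # Depending on the library version, a single input may come back as a
    # flat list of {label, score} dicts or as that list nested one level deep.
    scores = result[0] if isinstance(result[0], list) else result
    return top_emotion(scores)
```

With the real model, usage would look like detect("THIS IS AMAZING", load_classifier()); any callable that returns the same list of label/score dicts can stand in for it during testing.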

Approach 2 — Theoretical (Research-Grade)

A more expressive architecture, validated in published research but not implemented here due to time and compute constraints:

Model — F5-TTS, trained with conditional flow matching to learn the distribution of human speech prosody
Emotion Transfer — Zero-shot emotion cloning from reference audio — the model hears the target emotion and replicates it
Why it's better — Flow matching captures micro-variations in pitch, rhythm, and timbre that API parameters cannot replicate
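Papers — F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching (Chen et al., 2024) · Voicebox: Text-Guided Multilingual Universal Speech Generation at Scale (Le et al., Meta AI, 2023)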
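
To make "conditional flow matching" concrete, here is a toy sketch of the training target and the sampling loop, using the straight-line path from noise to data that is common in flow matching. It is not F5-TTS itself: the real model is a neural network conditioned on text and reference audio and operates on speech features, while the three-number vectors and all function names here are illustrative assumptions.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight path from noise x0 (t=0) to data x1 (t=1)."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight path; this is what the model is trained to predict."""
    return [b - a for a, b in zip(x0, x1)]


def cfm_loss(predicted, x0, x1):
    """Mean squared error between a predicted velocity and the path velocity."""
    target = target_velocity(x0, x1)
    return sum((p - v) ** 2 for p, v in zip(predicted, target)) / len(target)


def euler_sample(velocity_fn, x0, steps=8):
    """Generate a sample by integrating the velocity field from t=0 to t=1."""
    x, dt = list(x0), 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


rng = random.Random(0)
x1 = [0.2, -0.5, 0.9]                    # stand-in for one frame of speech features
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample


def oracle(x, t):
    """A perfect velocity predictor for this single (x0, x1) pair."""
    return target_velocity(x0, x1)


sample = euler_sample(oracle, x0)        # follows the path from noise to x1
```

With a perfect velocity predictor the integration lands on the data point; a trained network only approximates that field, and the quality of the approximation is what determines how natural the generated prosody sounds.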

This approach remains theoretical in this submission but is well-validated in published research, and would be the production-grade choice given sufficient time and GPU resources.

Emotion Mapping (Approach 1): speakingRate and temperature ranges
Emotion	speakingRate	temperature
Joyful	1.25–1.50	1.3–1.7
Surprised	1.20–1.40	1.4–1.75
Fearful	1.10–1.25	1.2–1.45
Neutral	1.00	1.00
Angry	0.75–0.85	0.35–0.50
Disgusted	0.70–0.80	0.45–0.55
Sad	0.60–0.70	0.35–0.45
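
The ranges in the table can be turned into concrete parameters with a linear interpolation on the intensity score. The sketch below is an assumption about how that scaling might work rather than the demo's actual code: it assumes intensity 0 maps to the end of each range nearest neutral (1.0) and intensity 1 to the far end, and the names RANGES and vocal_params are hypothetical. Keys use the classifier's labels (joy, sadness, and so on) rather than the adjective forms in the table.

```python
# (speakingRate range, temperature range) per classifier label, copied from the table.
RANGES = {
    "joy":      ((1.25, 1.50), (1.30, 1.70)),
    "surprise": ((1.20, 1.40), (1.40, 1.75)),
    "fear":     ((1.10, 1.25), (1.20, 1.45)),
    "neutral":  ((1.00, 1.00), (1.00, 1.00)),
    "anger":    ((0.75, 0.85), (0.35, 0.50)),
    "disgust":  ((0.70, 0.80), (0.45, 0.55)),
    "sadness":  ((0.60, 0.70), (0.35, 0.45)),
}


def _scale(low, high, intensity):
    """Linear interpolation across a range.

    Assumption: intensity 0 sits at the end nearest neutral (1.0),
    intensity 1 at the end farthest from it.
    """
    near, far = (low, high) if abs(low - 1.0) <= abs(high - 1.0) else (high, low)
    return near + intensity * (far - near)


def vocal_params(emotion, intensity):
    """Map an emotion label and a 0-1 intensity to speakingRate and temperature."""
    intensity = min(max(intensity, 0.0), 1.0)
    # Unknown labels fall back to neutral so the TTS request stays valid.
    rate_range, temp_range = RANGES.get(emotion, RANGES["neutral"])
    return {
        "speakingRate": round(_scale(*rate_range, intensity), 3),
        "temperature": round(_scale(*temp_range, intensity), 3),
    }
```

For example, vocal_params("joy", 0.5) gives a speakingRate of 1.375 and a temperature of 1.5, halfway through both ranges.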

Results

Stack: Python · Gradio · HuggingFace Transformers · Inworld TTS API · httpx

Darwix AI Evaluation · 2025 · K V Jaya Harsha