Empathy Engine

Giving AI a Human Voice

Prepared for the Darwix AI Evaluation assignment


K V Jaya Harsha


Approach 1 — Implemented (This Demo)

Uses the Inworld TTS API with dynamic vocal parameter modulation driven by a HuggingFace emotion classifier:

  • Emotion Detection — j-hartmann/emotion-english-distilroberta-base (7 classes: joy, surprise, anger, disgust, fear, sadness, neutral)
  • Intensity Scaling — Model confidence (0–1) linearly scales the vocal parameters, so "I'm okay" sounds different from "THIS IS AMAZING"
  • Vocal Modulation — speakingRate (speed) + temperature (expressiveness), both dynamically computed per emotion
  • Emotion Mapping — Each emotion has a defined base rate + temperature, scaled by intensity score
  • TTS Engine — Inworld TTS inworld-tts-1.5-max, voice: Clive, output: .mp3
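
The detection step above can be sketched in Python as follows. This is a minimal sketch, not the demo's actual code: the helper names top_emotion and detect_emotion are hypothetical, and it assumes the classifier is loaded through the HuggingFace Transformers text-classification pipeline, which returns a list of label/score dictionaries (sometimes nested one level deep, depending on how top_k is supplied).

```python
from typing import Any

MODEL_ID = "j-hartmann/emotion-english-distilroberta-base"


def top_emotion(scores: Any) -> tuple[str, float]:
    """Return (label, confidence) for the highest-scoring class.

    Accepts a flat list of {"label": ..., "score": ...} dicts, or the same
    list nested one level deep, because the pipeline's return shape varies.
    """
    if scores and isinstance(scores[0], list):
        scores = scores[0]  # unwrap the nested form
    best = max(scores, key=lambda s: s["score"])
    return best["label"], float(best["score"])


def detect_emotion(text: str) -> tuple[str, float]:
    """Hypothetical wrapper: classify text and return (label, confidence)."""
    # Imported lazily so top_emotion stays usable without transformers.
    from transformers import pipeline

    # Assumption: a real app would build the pipeline once and reuse it.
    classifier = pipeline("text-classification", model=MODEL_ID, top_k=None)
    return top_emotion(classifier(text))
```

The confidence returned here is the value that would act as the intensity score when the vocal parameters are computed.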

Approach 2 — Theoretical (Research-Grade)

Architecturally superior and validated in research, but not implemented here due to time and compute constraints:

  • Model — F5-TTS with conditional flow matching, which learns the full distribution of human speech prosody
  • Emotion Transfer — Zero-shot emotion cloning from reference audio: the model hears the target emotion and replicates it
  • Why it's better — Flow matching captures micro-variations in pitch, rhythm, and timbre that API parameters cannot replicate
  • Papers — F5-TTS: A Fairytaler that Fakes Fluent and Faithful Speech with Flow Matching (Chen et al., 2024) · Voicebox (Meta AI, 2023)

This approach remains theoretical in this submission but is well-validated in published research, and would be the production-grade choice given sufficient time and GPU resources.


Quick Samples

Emotion to Voice Mapping

Emotion      Rate         Temp
Joyful       1.25–1.50    1.3–1.7
Surprised    1.20–1.40    1.4–1.75
Fearful      1.10–1.25    1.2–1.45
Neutral      1.00         1.00
Angry        0.75–0.85    0.35–0.50
Disgusted    0.70–0.80    0.45–0.55
Sad          0.60–0.70    0.35–0.45

Rate = speakingRate (speed) · Temp = temperature (expressiveness/energy)
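
One way the ranges in the table above could be turned into per-utterance parameters is linear interpolation on the classifier confidence. The sketch below is illustrative only and rests on several assumptions not stated in the source: the "base" value of each range is taken to be the end closer to neutral (1.0), full confidence reaches the far end, confidence is clamped to 0–1, and unknown labels fall back to neutral. The names RANGES and vocal_params are hypothetical, and the dictionary is keyed by the classifier's labels rather than the table's adjectives.

```python
# Ranges copied from the table, keyed by classifier label.
# Each entry: (rate_low, rate_high, temp_low, temp_high).
RANGES = {
    "joy":      (1.25, 1.50, 1.30, 1.70),
    "surprise": (1.20, 1.40, 1.40, 1.75),
    "fear":     (1.10, 1.25, 1.20, 1.45),
    "neutral":  (1.00, 1.00, 1.00, 1.00),
    "anger":    (0.75, 0.85, 0.35, 0.50),
    "disgust":  (0.70, 0.80, 0.45, 0.55),
    "sadness":  (0.60, 0.70, 0.35, 0.45),
}

NEUTRAL = 1.0


def _scale(low: float, high: float, intensity: float) -> float:
    """Interpolate from the end nearest neutral toward the far end."""
    # Assumption: low intensity stays close to neutral, high intensity moves away.
    if abs(low - NEUTRAL) <= abs(high - NEUTRAL):
        base, extreme = low, high
    else:
        base, extreme = high, low
    return base + (extreme - base) * intensity


def vocal_params(emotion: str, confidence: float) -> dict[str, float]:
    """Map (emotion, confidence) to speakingRate and temperature."""
    intensity = min(max(confidence, 0.0), 1.0)  # clamp to [0, 1]
    rate_lo, rate_hi, temp_lo, temp_hi = RANGES.get(emotion, RANGES["neutral"])
    return {
        "speakingRate": round(_scale(rate_lo, rate_hi, intensity), 3),
        "temperature": round(_scale(temp_lo, temp_hi, intensity), 3),
    }
```

Under these assumptions, a fully confident "joy" prediction maps to the fast, expressive end of its range, while a fully confident "sadness" prediction maps to the slow, flat end.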

Results


Stack: Python · Gradio · HuggingFace Transformers · Inworld TTS API · httpx

Darwix AI Evaluation · 2025 · K V Jaya Harsha