Disentangling Emotion from Voice: A Cross-Product Sampling Approach for Expressive Voice Data

Blog post from Hume

Post Details
Company: Hume
Author: Hoon Shin
Word Count: 1,783
Language: English
Summary

Humans can separate the emotion being expressed from the way it is vocally delivered, but voice models still struggle to do so. The cause is the entanglement problem: natural training data presents emotion and delivery as fixed pairings, and models learn them that way. Stereotypical pairings such as anger with shouting or boredom with a monotone dominate training datasets, so models sound narrow or unrealistic when asked for subtler combinations such as an angry whisper or energetic boredom. Recent architectural advances, such as factorized codecs and self-distillation, aim to address the issue but remain limited by the distribution of the data they are trained on.

As a more robust, data-side solution, the post proposes cross-product sampling: training examples are sampled from a grid of emotion and voice categories, which encourages the model to learn emotion and delivery as independent dimensions. To improve coverage of rare combinations such as confident disappointment or articulate pain, attributes are collapsed into parent categories and z-normalization is applied to balance attribute prominence. Cross-product sampling has shown promising results in maximizing expressivity and reducing the mutual information between emotion and delivery. It offers a structured way to expand the expressive range of voice models and highlights the importance of diverse training data for independent control over emotion and delivery.
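The ideas in the summary can be illustrated with a small Python sketch. It is not Hume's implementation: the toy corpus, the `PARENT` taxonomy, the function names, and the choice to label each clip by its highest z-scored attribute are all assumptions made for illustration. The sketch shows z-normalization putting attributes with different raw scales on an equal footing, attributes collapsing into parent categories, equal sampling from every cell of the emotion-by-delivery grid, and the mutual information between the two labels before and after sampling.

```python
import math
import random
from collections import Counter, defaultdict

# Hypothetical taxonomy: fine-grained attributes mapped to parent categories.
PARENT = {"furious": "anger", "irritated": "anger",
          "listless": "boredom", "weary": "boredom"}

def z_normalize(values):
    """Rescale one attribute's scores to zero mean and unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = math.sqrt(var) or 1.0  # guard against a constant attribute
    return [(v - mean) / std for v in values]

def dominant_parent(score_rows, parent_of):
    """Assumed labeling rule: z-normalize each attribute across the corpus,
    then label each clip with the parent of its highest-scoring attribute."""
    names = sorted(score_rows[0])
    columns = {n: z_normalize([row[n] for row in score_rows]) for n in names}
    return [parent_of[max(names, key=lambda n: columns[n][i])]
            for i in range(len(score_rows))]

def mutual_information(pairs):
    """Empirical mutual information, in bits, between the two labels."""
    n = len(pairs)
    joint = Counter(pairs)
    left = Counter(e for e, _ in pairs)
    right = Counter(d for _, d in pairs)
    return sum((c / n) * math.log2(c * n / (left[e] * right[d]))
               for (e, d), c in joint.items())

def cross_product_sample(clips, per_cell, rng):
    """Draw per_cell clips, with replacement, from every non-empty grid cell."""
    cells = defaultdict(list)
    for clip in clips:
        cells[(clip["emotion"], clip["delivery"])].append(clip)
    sampled = []
    for key in sorted(cells):
        sampled.extend(rng.choices(cells[key], k=per_cell))
    return sampled

# Toy corpus skewed toward stereotypical pairings (counts are made up).
counts = {("anger", "loud"): 90, ("anger", "quiet"): 10,
          ("boredom", "loud"): 10, ("boredom", "quiet"): 90}
corpus = [{"id": f"{e}-{d}-{i}", "emotion": e, "delivery": d}
          for (e, d), n in counts.items() for i in range(n)]

balanced = cross_product_sample(corpus, per_cell=50, rng=random.Random(0))
mi_before = mutual_information([(c["emotion"], c["delivery"]) for c in corpus])
mi_after = mutual_information([(c["emotion"], c["delivery"]) for c in balanced])
print(f"MI before: {mi_before:.3f} bits, after: {mi_after:.3f} bits")
```

In this toy setup the skewed corpus carries roughly half a bit of mutual information between emotion and delivery, and the balanced sample carries none, because every cell contributes the same number of clips. Two caveats apply to the sketch: rare cells are filled by repeating the few clips available, and a cell with no clips at all cannot be filled by sampling, which is consistent with the summary's point that diverse training data still matters.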