Review - Text-Free Prosody-Aware Generative Spoken Language Modeling

What's this blog post about?

The paper "Text-Free Prosody-Aware Generative Spoken Language Modeling" introduces an approach to generative spoken language modeling that treats prosody as an explicit input feature. Text has traditionally served as the intermediate representation between speech input and NLP analysis, but the paper argues that text is a lossy medium for capturing speech and is therefore suboptimal. By modeling directly in the spoken language domain rather than cascading through text, the authors aim for a representation that preserves more of the speech signal. The model takes three input streams: self-supervised acoustic units representing phonetic content, unit durations, and quantized, speaker-mean-normalized log F0 bins. A transformer language model models these streams jointly. The paper's findings show that prosodic input features improve both content and prosody modeling. The research direction is promising but still exploratory, and it points toward more end-to-end approaches to spoken language modeling in the future.
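
As a rough illustration of the input streams described above, the sketch below turns frame-level unit IDs and F0 values into per-unit (unit, duration, F0 bin) triples. It is not the paper's implementation: the function names, the bin count of 32, the clipping range of -1.0 to 1.0 in log space, the use of `None` for unvoiced segments, and the convention that an F0 of 0 marks an unvoiced frame are all assumptions made for the example.

```python
import math
from itertools import groupby


def run_length_encode(units):
    """Collapse consecutive repeated unit IDs into (unit, duration_in_frames) pairs."""
    return [(unit, len(list(group))) for unit, group in groupby(units)]


def mean_log_f0(f0_frames):
    """Mean log F0 over voiced frames (assumption: F0 <= 0 means unvoiced)."""
    voiced = [math.log(f) for f in f0_frames if f > 0]
    return sum(voiced) / len(voiced) if voiced else 0.0


def quantize(value, lo, hi, n_bins):
    """Map a value to one of n_bins equal-width bins, clipping to [lo, hi]."""
    if value <= lo:
        return 0
    if value >= hi:
        return n_bins - 1
    return int((value - lo) / (hi - lo) * n_bins)


def build_streams(units, f0_frames, n_bins=32, lo=-1.0, hi=1.0):
    """Return one (unit, duration, f0_bin) triple per run of identical units.

    The speaker mean is estimated here from the utterance itself; a real
    system would more plausibly estimate it from all of a speaker's data.
    """
    speaker_mean = mean_log_f0(f0_frames)
    streams = []
    pos = 0
    for unit, duration in run_length_encode(units):
        segment = [math.log(f) for f in f0_frames[pos:pos + duration] if f > 0]
        if segment:
            normalized = sum(segment) / len(segment) - speaker_mean
            f0_bin = quantize(normalized, lo, hi, n_bins)
        else:
            f0_bin = None  # hypothetical marker for a fully unvoiced segment
        streams.append((unit, duration, f0_bin))
        pos += duration
    return streams


if __name__ == "__main__":
    frame_units = [5, 5, 5, 9, 9, 2]          # toy unit IDs, one per frame
    frame_f0 = [100.0, 100.0, 100.0, 0.0, 0.0, 200.0]  # toy F0 in Hz, 0 = unvoiced
    print(build_streams(frame_units, frame_f0))
```

Each triple would then be fed to the language model as three aligned streams. How the streams are embedded and combined inside the transformer is not covered by this summary, so the sketch stops at feature extraction.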

Company
AssemblyAI

Date published
Sept. 24, 2021

Author(s)
Steven Hillis

Word count
333

Hacker News points
None found.

Language
English