speechd-discuss

Post-synth speedup of speech


From: Bill Cox
Subject: Post-synth speedup of speech
Date: Mon, 2 Sep 2024 14:28:27 -0700

TL;DR: Can I send speechd maintainers a patch to integrate my libsonic speech-speedup algorithm as an optional post-process to synthesized samples?  This would let us listen at very high speeds to TTS voices that don't support higher speeds.

If devs on this list are interested, I'll generate some voice samples from Voxin voices (the Tom, Evan, and Nathan American English voices).  I'll show them sped up with native speedup, compared to speeding up with libsonic.  I'll also provide samples at higher speeds that the voices don't support natively.

BTW, congrats on switching to a far more stable system where TTS synthesizers return audio samples, rather than play them directly!  This will have a strong positive impact, IMO, on speech-dispatcher stability long-term as the underlying audio systems continue to change rapidly.  Last I checked (2020?) everything was PulseAudio, and now it's PipeWire?  Thank goodness we only have to change the audio interaction in one place.

A further benefit of the new architecture is that it gives us a single location for speech post-processing if needed.  I personally listen at 3-5X speed at work, using Voxin's IBM TTS.  However, Voxin appears to no longer provide the venerable Eloquence voice, as it is known in JAWS.  The new voices provided by Voxin are nice, and I have them.  However, they don't speed up much and don't sound great at their max speeds.  This is a substantial hit to my personal productivity, and I find these voices annoying to listen to at their max speed: I seem to have to fill in butchered phonemes from context, though that could be a lack of training on the new voices.
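
For readers unfamiliar with post-synthesis speedup, here is a minimal sketch of the general idea in Python.  To be clear, this is not libsonic's algorithm and not speech-dispatcher code: it is a naive overlap-add time compression, and the function name `speed_up_ola` and its parameters are made up for illustration.  It assumes mono samples as a list of floats.  Because it ignores pitch periods, it will produce audible artifacts on real speech; it only shows where such a step would sit, taking synthesized samples in and returning shorter audio.

```python
import math


def speed_up_ola(samples, speed, frame_len=400):
    """Naive overlap-add time compression (illustrative only).

    samples:   list of floats, mono PCM
    speed:     speedup factor, must be >= 1.0
    frame_len: analysis frame length in samples (even number)

    Frames are read every `hop_in` samples from the input and written
    every `hop_out` samples to the output, so the result is roughly
    len(samples) / speed samples long.
    """
    if speed < 1.0:
        raise ValueError("this sketch only handles speed >= 1.0")

    hop_out = frame_len // 2               # 50% overlap in the output
    hop_in = int(round(hop_out * speed))   # frames are further apart in the input

    # Periodic Hann window: at 50% overlap the shifted windows sum to 1,
    # so a constant signal keeps its level away from the edges.
    window = [0.5 - 0.5 * math.cos(2.0 * math.pi * i / frame_len)
              for i in range(frame_len)]

    n_frames = max(0, (len(samples) - frame_len) // hop_in + 1)
    if n_frames == 0:
        # Input shorter than one frame: return it unchanged.
        return list(samples)

    out = [0.0] * ((n_frames - 1) * hop_out + frame_len)
    for k in range(n_frames):
        src = k * hop_in
        dst = k * hop_out
        for i in range(frame_len):
            out[dst + i] += samples[src + i] * window[i]
    return out
```

In a post-processing stage, something like `speed_up_ola(samples, 2.0)` would be applied to the samples returned by the synthesizer before they are handed to the audio backend.  A pitch-aware method would replace the fixed frame positions used here.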

While it sounds better to do good speech speedup in the TTS vocoder, when commercial vendors focus only on the sighted market (aka "the market"), high-speed speech is often butchered or not available, which I believe is the case with the current Voxin voices.

Is there enough interest in the concept to at least justify some sound tests?  If so, I'll send samples.

Thanks,
Bill
