Re: making Linux a hospitable place for TTS engines like Voxin


From: Samuel Thibault
Subject: Re: making Linux a hospitable place for TTS engines like Voxin
Date: Sat, 5 Dec 2020 18:32:18 +0100
User-agent: NeoMutt/20170609 (1.8.3)

Hello,

Ok, so let's discuss!

Bill Cox, on Wed, 02 Dec 2020 12:54:17 -0800, wrote:
> Specifically, modules are required to render speech through the sound system,
> rather than generating speech samples.

Yes. I do not know the historical rationale for this. Possibly it was
meant to avoid yet another data hop between processes (we already pass
data from the SSIP client to the SSIP server, then to the output module,
then to the audio server), each of which adds some latency. Possibly this
does not matter that much any more on today's faster machines.

Possibly it was because some synthesizers were thought not to support
producing samples at all, only playing them directly.

But the TODO file actually lists moving audio to the server :)

> No speech synth for Linux I know of is incapable of returning speech samples
> rather than playing them.

This is not completely theoretical: at least Kali didn't know how to fill
a buffer with samples; you had to make it write a file and then reread
that. Also, Luke mentioned in the TODO file that he knows of one synth
whose licensing model doesn't allow direct audio retrieval.

That being said, we can keep supporting modules that produce their
own audio output (e.g. the generic modules, for which it's usually hard
to avoid).

> This greatly complicates them,

I would have agreed a few years ago, but this is not true any more,
thanks to the factorization I made recently. In the espeak module, the
only pieces of audio management left are

#include "spd_audio.h"
MOD_OPTION_1_INT(EspeakAudioChunkSize)
MOD_OPTION_1_INT(EspeakAudioQueueMaxSize)
MOD_OPTION_1_INT_REG(EspeakAudioChunkSize, 2000);
MOD_OPTION_1_INT_REG(EspeakAudioQueueMaxSize, 20 * 22050);
[...]

ret = module_speak_queue_init(EspeakAudioQueueMaxSize, status_info);
[...]

if (!module_speak_queue_before_synth())
[...]

module_speak_queue_stop();
[...]

module_speak_queue_pause();
[...]

module_speak_queue_terminate();
module_speak_queue_free();
[...]

AudioTrack track = {
        .bits = 16,
        .num_channels = 1,
        .sample_rate = espeak_sample_rate,
        .num_samples = numsamples,
        .samples = wav + (*sent),
};
gboolean result = module_speak_queue_add_audio(&track, SPD_AUDIO_LE);
[...]

if (module_speak_queue_stop_requested()) {
[...]

if (module_speak_queue_before_play())
[...]

module_speak_queue_add_mark(events->id.name);
[...]

module_speak_queue_add_sound_icon(events->
[...]

module_speak_queue_add_end();
[...]

> I'll send you an off-list email showing how I've simplified the
> modules.  If folks want to look, here's [1]my Espeak module vs the
> one in [2]Speech Dispatcher.

They are not comparable: your module does not support

- sound icons
- character/key spelling
- setting the synth volume
- setting the pitch range
- selecting an appropriate voice according to the requested locale
- index marking

Notably, index marking is quite a beast to support, but it is really
important for a lot of users. Getting it right is tricky, and it is the
reason for the seemingly long list of module_speak_queue_* calls. In
the end, if you implement these features, you will end up with basically
the same complexity.
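
To give an idea of what that involves, here is a simplified sketch (not
the actual module code; the includes and all error/pause handling are
assumed away) of how the espeak event callback forwards index marks and
end-of-message events to the factorized queue:

#include <espeak-ng/speak_lib.h>
#include "module_utils_speak_queue.h"   /* assumed: declares module_speak_queue_* */

/* Simplified sketch: walk espeak's event list and report index marks so
 * the server can notify clients and resume correctly after a stop. */
static int synth_callback(short *wav, int numsamples, espeak_EVENT *events)
{
        (void) wav;
        (void) numsamples;

        for (; events->type != espeakEVENT_LIST_TERMINATED; events++) {
                if (events->type == espeakEVENT_MARK)
                        /* Synthesis reached the SSML <mark> with this name. */
                        module_speak_queue_add_mark(events->id.name);
                else if (events->type == espeakEVENT_MSG_TERMINATED)
                        module_speak_queue_add_end();
        }

        /* Returning non-zero makes espeak abort the current synthesis. */
        return module_speak_queue_stop_requested() ? 1 : 0;
}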

Note that speech-dispatcher *also* supports much simpler cases: for a
given speech synthesizer, an initial, completely synchronous speechd
module is possible by just synthesizing and playing in the module_speak()
function and being done with it. It would not support indexing etc., but
that can be added progressively; yes, that also makes the module
progressively more complex, but that's inherent to supporting indexing.
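
For example, such a minimal synchronous module_speak() could look roughly
like the sketch below. This is only an illustration: my_synth_synthesize()
is a made-up stand-in for whatever the engine provides, and I am assuming
the module_utils.h helper module_tts_output(), which plays a track through
the module's configured audio backend before returning.

#include "module_utils.h"   /* assumed: module API, AudioTrack, module_tts_output() */

/* Sketch of a fully synchronous module: synthesize the whole message,
 * play it, return.  No index marking, no pause support yet. */
int module_speak(gchar *data, size_t bytes, SPDMessageType msgtype)
{
        AudioTrack track;

        /* Hypothetical helper: fill track with 16-bit mono samples. */
        if (my_synth_synthesize(data, bytes, msgtype, &track) < 0)
                return 0;

        /* Play through the configured audio output; returns once the
         * whole utterance has been spoken. */
        module_tts_output(track, SPD_AUDIO_LE);

        g_free(track.samples);
        return bytes;
}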

> and makes these binaries specific to not only the distro, but the
> distro version.

That, however, is a very convincing argument. Letting vendors just ship
a binary to a known place, whatever the distro and version, can simplify
things a lot for them.

We do, however, have to be very careful with the protocol for the data
exchange with modules. Notably, if you want to support indexing in
speechswitch, you'd have to break compatibility somehow, or introduce
backward/forward compatibility management complexity (which might not
even be possible if nothing in the initial design lets the server and
the module announce what they actually support).


So, now, where do we start? We need to specify the extension of the
module protocol to transfer audio from the module to the server. AIUI,
what we would want is to add a new "SERVER" case to the AUDIO
configuration command, which speech-dispatcher would try first by
default, falling back to alsa/pulse/etc. if that is rejected. When
accepted, the module can emit its audio snippets and index marks as SSIP
events.
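
Purely to illustrate the shape of it (the reply and event lines below are
invented placeholders, not existing protocol syntax; I am only assuming
that the existing audio_output_method parameter of the AUDIO settings
block is where the new value would go):

server -> module:   AUDIO
server -> module:   audio_output_method=server            (proposed new value)
server -> module:   .
module -> server:   <OK reply>                             (module accepts; it will stream samples)
[...]
module -> server:   <index mark event> mark-1              (placeholder syntax)
module -> server:   <audio chunk event> rate/bits/samples  (placeholder syntax)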

Also, I have been thinking about simplifying modules so they do not use
a separate speak thread. Ideally, modules should only care about
synchronously calling the synthesizing function from module_speak(),
possibly piece by piece or with a periodic callback, and synchronously
calling some functions to determine whether stopping is wanted, etc. The
current way (main()'s while(1) loop managing all communications) makes
it difficult for modules to juggle events. We can probably rework
this. Also, I am thinking that this should be rewritten under a BSD
licence, so people can use it as a skeleton for their proprietary module
implementations.
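
Roughly, the shape I have in mind is the sketch below. None of these
helpers exists yet; every function name here is invented to illustrate
the idea of a module that drives synthesis synchronously and polls for
stop requests instead of running its own speak thread.

#include <glib.h>
#include "spd_audio.h"   /* for AudioTrack */

/* Hypothetical sketch of the envisioned simpler module shape. */
int module_speak(gchar *data, size_t bytes, SPDMessageType msgtype)
{
        AudioTrack chunk;

        my_synth_start(data, bytes, msgtype);   /* hypothetical: start synthesis */

        while (my_synth_next_chunk(&chunk)) {   /* hypothetical: get next piece of audio */
                if (module_stop_requested())    /* hypothetical: did the server ask to stop? */
                        break;
                module_send_audio(&chunk);      /* hypothetical: samples go to the server */
        }

        module_send_end();                      /* hypothetical: report end of utterance */
        return bytes;
}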

Anything I might have forgotten?

Samuel


