
Speech Dispatcher roadmap discussion.


From: Bohdan R . Rau
Subject: Speech Dispatcher roadmap discussion.
Date: Wed, 15 Oct 2014 12:33:39 +0200

On 2014-10-15 03:40, Trevor Saunders wrote:
> On Mon, Oct 13, 2014 at 10:45:05AM +0200, Bohdan R. Rau wrote:
>>
>> COMPAT_MODE On|Off
>
> I don't really like on and off since it assumes we'll only change the
> protocol once.


It was only a suggestion - for example, there could be a command like:

PROTOCOL <number>

But I think there will be only one protocol change; later changes and 
protocol features would be discovered with the CAPABILITY command.
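
Something like this hypothetical session (the reply codes and the 
capability names below are placeholders I made up only to show the 
idea, nothing of this is specified yet):

PROTOCOL 2
2xx OK PROTOCOL SET
CAPABILITY
2xx-SSML
2xx-SYNC
2xx-AUTOSYNC
2xx OK CAPABILITY LIST SENT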

>
> we can add functions spd_char_msgid etc which seems simpler to
> explain.

If we assume the new protocol will be used in new applications, I see 
no reason to add a new function when the application (and the library) 
always knows which protocol version is in use.

>
> btw why is spd_wchar a thing at all :( it seems like spd_char should
> handle UTF-8 fine.

Of course - spd_char works fine (with some exceptions). But spd_wchar 
has nothing to do with UTF-8; it's used for direct Unicode code points, 
not for encoded strings. In my opinion, the spd_char function should be 
implemented as a wrapper around spd_wchar, something like:

int spd_char(SPDConnection *conn, char *str)
{
     /* decode the first UTF-8 character of str into a Unicode code point */
     int chr = get_unicode_character(str);
     if (chr < 0) return -1;   /* empty string or invalid UTF-8 sequence */
     return spd_wchar(conn, chr);
}

Why? Because some modules may be inconsistent with the documentation. 
In theory we could pass a string of any length to spd_char and only the 
first character would be spoken. In fact, the espeak module says "null" 
if the string is longer than one UTF-8 character.

But as spd_wchar seems to be completely broken today, that's a topic 
for a future discussion.
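
By the way, the get_unicode_character() helper above does not exist 
yet; a minimal sketch of how it could decode the first UTF-8 code point 
(error handling simplified, overlong sequences not rejected) could be:

static int get_unicode_character(const char *str)
{
     const unsigned char *s = (const unsigned char *) str;
     int chr, len, i;

     if (!s[0]) return -1;                      /* empty string */
     if (s[0] < 0x80) return s[0];              /* plain ASCII */
     if ((s[0] & 0xE0) == 0xC0) { chr = s[0] & 0x1F; len = 1; }
     else if ((s[0] & 0xF0) == 0xE0) { chr = s[0] & 0x0F; len = 2; }
     else if ((s[0] & 0xF8) == 0xF0) { chr = s[0] & 0x07; len = 3; }
     else return -1;                            /* invalid lead byte */

     for (i = 1; i <= len; i++) {
          if ((s[i] & 0xC0) != 0x80) return -1; /* broken continuation byte */
          chr = (chr << 6) | (s[i] & 0x3F);
     }
     return chr;
}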


>> Also, there must be functions like:
>>
>> SPD_Callback *spd_register_callback(SPDConnection *conn, int event,
>> SPD_Callback *callback, void *user_data);
>> SPD_Callback *spd_unregister_callback(SPDConnection *conn, int event);
>>
>> Of course this function is valid only in no-compatibility mode!
>
> Well, you can only call it if you assume newer libspeechd than we have
> today so I'm not sure what the point of caring about a compatibility on
> vs off is.

Have you ever seen an application without bugs? :)

>
>> 3. Module output capabilities
>>
>> SPEAK - module can speak
>> FETCH - module can return synthesized wave to server
>> FILE - module can save synthesized wave to file
>
> the second two are basically indistinguishable, so why have both?

Please be patient and wait for the second part - I'll explain in detail 
why.


>
>> 4. Module input capabilities
>>
>> SSML - module can fully play with SSML and index marks;
>> FLAT - module translates internally SSML into plain text. Index marks
>> are lost, pause/resume are not implemented.
>> PLAIN - module understands plain text (no SSML). Extra features (like
>> AUTOPAUSE and AUTOSYNC) are possible only in this mode.
>
> I'm not sure what the point in distinguishing between flat and plain
> is, any module can rip out all the ssml bits.

Because in FLAT mode the string sent to the module may differ from the 
string sent to speech-dispatcher by the application, so the offsets 
returned by AUTOPAUSE and AUTOSYNC would be completely unusable.
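
A hypothetical illustration: the application sends
<speak>Hello <emphasis>world</emphasis></speak>, but after the SSML is 
stripped the module works on "Hello world". If the module then reports 
offset 6, in its flattened text that is the 'w' of "world", while in 
the application's original string byte 6 is still inside the <speak> 
tag - the application has no way to map such an offset back.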

> Though maybe it makes sense to tell clients if a module can deal with
> ssml or not I'm not really sure.

Yes. But if a module has extra features usable only in PLAIN mode, the 
application should have this information.



>> Server should never internally encode plain text into SSML if module
>> reports PLAIN and any of extra features (AUTOPAUSE, AUTOSYNC etc.) is
>> enabled. Also, server should never accept SSML data from application
>> if extra features are enabled (it's application bug).
>
> why?

Because requesting features which are known to be impossible is a bug - 
or we have different ideas of what a bug is :)

>
>> 5. Module extended capabilities:
>>
>> SYNC - valid only in SSML mode. 706 SYNCHRONIZED events will be fired
>> only if SYNC mode is enabled.
>>
>> AUTOSYNC - valid only in PLAIN mode. 707 SYNCHRONIZED event will be
>> fired only if AUTOSYNC mode is enabled. Requires simple NLP in module.
>
> these events are different how?

Both are intended for applications which need to know which part of the 
text is currently being spoken. SYNC works in SSML mode and uses index 
marks predefined by the application. AUTOSYNC works in PLAIN mode and 
returns offsets, which may be used for example to highlight the spoken 
text.

Example of an application: a multi-language epub reader. The 
application has only a vague idea where a sentence ends, and if the 
module (specialized for a particular language) knows better - why not 
use its knowledge?
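
In code it could look something like this purely hypothetical sketch, 
using the spd_register_callback() proposed above (the SPD_EVENT_AUTOSYNC 
constant, the callback signature and the application-side helpers are 
only my assumptions, nothing of this exists yet):

/* Assumed callback: the module reports the byte range (in the original
 * plain text) of the fragment it has just started speaking. */
static void on_autosync(SPDConnection *conn, int begin, int end,
                        void *user_data)
{
     struct epub_reader *reader = user_data;    /* application state */
     epub_reader_highlight(reader, begin, end); /* highlight spoken text */
}

/* somewhere during initialization: */
spd_register_callback(conn, SPD_EVENT_AUTOSYNC, on_autosync, reader);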

>> Simple NLP (Natural Language Processor) must be able to automatically
>> split given text into sentences (or - if synthesizer can speak also
>> parts of sentences - phrases).

> I'm unconvinced, it seems like that's a problem synthesizer should
> already be solving, so why should we duplicate that?

Because synthesizers are for synthesis, not for dealing with 
grammatical problems.
Example: Mbrola is a synthesizer (i.e. Mbrola realizes the DSP phase of 
TTS).

Of course - most synthesizers have some internal NLP, but it's used 
only for the synthesizer's own purposes. My Milena is an exception; it 
uses something like:

while (*input_string) {
        /* cut the next sentence off the input and advance the pointer */
        char *sentence = get_sentence(&input_string);
        say(sentence);       /* synthesize and play this sentence */
        free(sentence);
}

So it's possible to get the position of the currently spoken sentence 
from Milena, and we can use it to highlight the spoken text or to 
determine the byte offset where speech was paused.
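
Just to illustrate the interface (real NLP has to deal with 
abbreviations, numbers, quotations and so on - this is only a naive, 
hypothetical get_sentence() splitting on sentence-final punctuation):

#include <ctype.h>
#include <string.h>

/* Return a newly allocated copy of the first sentence of *input and
 * advance *input past it; the caller frees the result. */
static char *get_sentence(char **input)
{
        char *start = *input;
        char *p = start;

        while (*p && *p != '.' && *p != '!' && *p != '?')
                p++;                    /* scan to the end of the sentence */
        while (*p == '.' || *p == '!' || *p == '?')
                p++;                    /* keep the punctuation itself */
        while (isspace((unsigned char) *p))
                p++;                    /* swallow trailing whitespace */

        char *sentence = strndup(start, p - start);
        *input = p;
        return sentence;
}

The difference between start and the beginning of the original buffer 
is exactly the kind of byte offset AUTOSYNC could report.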

But Milena is not a synthesizer - in fact it's a text-to-speech system 
with sophisticated NLP specialized for only one language, and the 
backend synthesizers can differ (currently Mbrola and Ivona are 
implemented).

I know my suggestions may seem a little strange, but you have to take 
into account that I want to change the way we think about 
speech-dispatcher.

Currently:
speech-dispatcher is used by visually impaired users, and as a speech 
backend for screen readers.

My dream:

Visually impaired users are very important to speech-dispatcher 
developers, but speech-dispatcher should also be used as a 
general-purpose speech synthesis backend for different applications 
(like SAPI on Windows). A screen reader is an example of a very 
important application, but it's not the only application using 
speech-dispatcher.

Example:
Imagine a well-sighted eighteen-wheeler driver, carrying several cases 
of beer from Dallas to New Orleans, reading the long email sent to him 
by his fiancee :)

> Trev

ethanak



