RFC: TTSAPI: First subjects

From: Hynek Hanke
Subject: RFC: TTSAPI: First subjects
Date: Fri, 28 Jan 2011 17:45:37 +0100

On 22.1.2011 22:25, Michael Pozhidaev wrote:
> Let see some subjects to discuss in TTS API document.

Hello Michael,

first of all, thanks again for your work on this. I think
it is very useful.

> These questions are actual as base for TTS processing daemon
> interface only.

I'm not very sure what you mean, but in principle there is a chain
from the TTS system through the emulation layers towards
the integration layer (which manages multiple TTS) and out
from that layer towards higher level (such as Speech Dispatcher,
priority handling etc). E.g. in the current implementation, these
interfaces are between the TTS and the module and between the module
and the server.

I think the exact same API (TTS API) can be used in all
these interfaces. There is no reason to use different APIs.

> Let see the structure for voice discovery:
> typedef struct {
>      /* Voice discovery */
>      /* Prosody parameters */
>      bool_t can_set_rate_relative;
>      bool_t can_set_rate_absolute;
>      bool_t can_get_rate_default;
> Are there any ideas, why tTS daemon can be unable to provide such
> functions?

Some TTS can set only relative rates (Festival) while some can
set only/also absolute rates (eSpeak). The upper layers must know this
to be able to provide emulations if the client asks for the other kind.

This is consistent with the definitions in SSML. Both relative
and absolute rate settings are justified from client side point
of view. The same goes for pitch.

> Requested TTS may be unable to change rate at all, but
> relative or absolute  selection always may be implemented in daemon.

Do you mean in the intermediate layers, not in the TTS
itself? The idea of TTS API is that every time an intermediate
layer is able to emulate some functionality that is not actually
provided by the lower components, it will report the functionality
as available to upper components.

I'm not sure all these emulations will always be possible.
Going from relative to absolute is often not possible and
whether it can be emulated approximately depends on the
base levels and speed-up algorithms. Consider that these
might be different for different voices.
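
To illustrate the reporting rule, here is a minimal sketch of how an
intermediate layer might compute the capabilities it reports upward.
It assumes that an absolute-only TTS with a known default rate can
emulate relative changes; the helper names (report_rate_caps,
relative_to_absolute_rate) are hypothetical and not part of TTS API,
and plain C99 bool stands in for bool_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Subset of the voice discovery fields quoted above (bool instead of bool_t). */
typedef struct {
    bool can_set_rate_relative;
    bool can_set_rate_absolute;
    bool can_get_rate_default;
} rate_caps_t;

/* Hypothetical helper: capabilities an intermediate layer reports upward,
 * given what the component below it reports. */
rate_caps_t report_rate_caps(const rate_caps_t *lower)
{
    assert(lower != NULL);
    rate_caps_t upper = *lower;

    /* Assumption: relative changes can be emulated on top of absolute
     * rates when the default rate is known. */
    if (lower->can_set_rate_absolute && lower->can_get_rate_default)
        upper.can_set_rate_relative = true;

    /* Relative -> absolute is deliberately not emulated in this sketch,
     * since it depends on base levels that may differ per voice. */
    return upper;
}

/* Hypothetical helper: turn a relative change in percent into an absolute
 * rate, in whatever unit the TTS uses for its default rate. */
int relative_to_absolute_rate(int default_rate, int percent_change)
{
    return default_rate + (default_rate * percent_change) / 100;
}
```

With a default rate of 160, a request for +25% would be passed down as
an absolute rate of 200. A TTS that reports only relative rates keeps
can_set_rate_absolute false in this sketch.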

High level applications will use a high level API (e.g. the
one of Speech Dispatcher), not TTS API directly, and they should
not *need to* examine the detailed Voice Discovery options
at all.

> And the same for pitch and volume. Next, I suggest replace
> all punctuation fields by one: can requested TTS mark punctuation
> (exclaim, question, etc) by voice intonation. All other punctuation
> processing can be easely done on client side, it is not TTS concern.

All components above the TTS itself (e.g. the
client application) need to avoid modifying the contents
of the messages (the text) unless absolutely necessary for
emulations, because:

  1) Text modifications are language specific and not at all
     easy to do.

  2) Any modifications to the text before the TTS system
     destroy the original structure and thus break syntax
     and context analysis in the TTS. E.g. if you replace
     the '.' character with the text '(dot)', the TTS system
     no longer has a chance to get the intonation right.

  3) The text might not be plain, but it might be SSML.
     If higher levels are to modify it, that means a dependency
     on an SSML parser and non-trivial processing.

Client side applications should be able to influence the
way punctuation marks are announced (e.g. 'stop' instead
of 'dot' etc.) by modifying the table used for the given
language (possibly on the fly using TTS API) and the TTS
itself should use this requested substitution at the right
place in the text processing chain.

I don't know right now whether this is already defined in TTS API
(it should be).

> For example, punctuation in Russian often is processing by another
> language. Russian users often prefer to listen punctuation by Russian
> even in English text.

This is an interesting concern, because it involves multi-lingual
sentences. E.g. if the punctuation mark is a comma (,), the sentence
must first be analyzed for intonation, then it must be synthesized,
and only then is the comma replaced with the Russian word.

SSML also has support for this case, but I don't know if we
currently have enough support in TTS API. It would basically
mean that the punctuation substitution table must also support
a language attribute. This can be a NICE TO HAVE functionality.
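
As a rough sketch of what such a table could look like, with a
language attribute per entry and an override that can be applied on
the fly: all names below (punct_entry_t, punct_lookup, punct_override)
are hypothetical and do not come from the TTS API document. The TTS
would consult the table at the right place in its own processing
chain; the client only changes the entries.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical table entry; not taken from the TTS API document. */
typedef struct {
    char mark;           /* punctuation character, e.g. ',' */
    const char *spoken;  /* word to announce for this mark */
    const char *lang;    /* language of the spoken word, e.g. "ru" */
} punct_entry_t;

typedef struct {
    punct_entry_t *entries;
    size_t count;
} punct_table_t;

/* Find the entry for a punctuation mark, or NULL if there is none. */
const punct_entry_t *punct_lookup(const punct_table_t *table, char mark)
{
    assert(table != NULL);
    for (size_t i = 0; i < table->count; i++)
        if (table->entries[i].mark == mark)
            return &table->entries[i];
    return NULL;
}

/* Replace the announcement of an existing mark.
 * Returns 0 on success, -1 if the mark is not in the table. */
int punct_override(punct_table_t *table, char mark,
                   const char *spoken, const char *lang)
{
    assert(table != NULL && spoken != NULL && lang != NULL);
    for (size_t i = 0; i < table->count; i++) {
        if (table->entries[i].mark == mark) {
            table->entries[i].spoken = spoken;
            table->entries[i].lang = lang;
            return 0;
        }
    }
    return -1;
}
```

A client preferring 'stop' over 'dot' would call
punct_override(&table, '.', "stop", "en"); a Russian user could set a
Russian word with lang "ru" for the same mark while the surrounding
text stays English.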

> Suggest to add:
>      bool_t can_speak_punctuation_by_intonation;

Do you mean something like over-emphasized intonation? That could
be interesting. I recommend adding it as a NICE TO HAVE.

>      bool_t can_set_number_grouping;
> Numbers processing is the same as for punctuation. All required numbers
> processing can be easely done on client side too, so TTS daemon may not take
> care about numbers.

As explained above, not really. Translation of numbers to text is a
complicated matter and deeply language specific. Then we have cardinal
numbers, ordinal numbers, currencies etc. This is a really complex
thing to handle separately in each client for each language.

Again, this is here for consistency with the SSML specification too.
I don't think number grouping is something a lot of users care
about, but I might be wrong (thinking about spreadsheet processors,
calculators and such). There must be a reason why SSML includes it.

>      bool_t can_say_text_from_position;
> I think, not needed.

I might be wrong, but I think this is necessary to be able
to rewind in the text being synthesized (e.g. go to next paragraph,
repeat last sentence), which was a requirement from some of the parties
who participated in the TTS API work.

You cannot rely on doing that in audio, because audio
might not be available yet and it would also mean that all
the intermediate (completely unnecessary) text must be

It is also necessary to be able to support the "long text" priority
as required by KTTSD/Jovie. They basically want a priority which they
can use to send in a long text, such as a chapter of a book. If a
message of priority IMPORTANT arrives, it must be spoken immediately
(it may announce something such as 'battery critical').
Currently, priority IMPORTANT message cancels everything and gets
spoken. The expected behavior for "long text" priority messages would
be that they get PAUSED by priority important messages and then
will continue.

To be able to PAUSE and RESUME text from position, if you do not have
a multi-threaded synthesizer which supports this automatically, you
need to continue synthesizing the text from the position where audio
was interrupted.

This may sound easy to do on the client side, but it is not. You
cannot cut the original text in the middle and send only a part of it
for the new synthesis after RESUME if you do not want to break
punctuation and SSML validity. Currently, Speech Dispatcher does
exactly that to be able to do PAUSE/RESUME: it cuts the message where
it was paused and tries to restore SSML somehow, but I consider
this broken. (Not easy to fix, waiting for TTS API say_text_from_position :)
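
A minimal sketch of the alternative, assuming the synthesizer offers
some call that takes the complete text plus a position: the layer
above keeps the whole unmodified message and only remembers where
audio was interrupted. The callback type and its signature are
assumptions for illustration (the real say_text_from_position in the
TTS API document may look different), and the stub at the end merely
stands in for a synthesizer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical callback type; the real say_text_from_position signature
 * in the TTS API document may differ. */
typedef int (*say_from_position_fn)(const char *text, size_t position);

typedef struct {
    const char *text;   /* complete, unmodified message (plain text or SSML) */
    size_t resume_pos;  /* position in text where audio was interrupted */
    bool paused;
} message_t;

/* Remember where playback stopped; the text itself is never cut. */
void message_pause(message_t *msg, size_t interrupted_at)
{
    assert(msg != NULL);
    msg->resume_pos = interrupted_at;
    msg->paused = true;
}

/* Hand the whole text back to the synthesizer together with the position.
 * Returns -1 if the message was not paused, else the callback's result. */
int message_resume(message_t *msg, say_from_position_fn say)
{
    assert(msg != NULL && say != NULL);
    if (!msg->paused)
        return -1;
    msg->paused = false;
    return say(msg->text, msg->resume_pos);
}

/* Stub standing in for a real synthesizer: it only records its arguments. */
const char *stub_last_text = NULL;
size_t stub_last_position = 0;

int stub_say_from_position(const char *text, size_t position)
{
    stub_last_text = text;
    stub_last_position = position;
    return 0;
}
```

The point of the design is that the TTS always sees the full text, so
its syntax and SSML analysis stay intact, and only the starting point
of the audio output changes.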

Sorry, this is a complicated matter, I hope I could explain it well.

>      bool_t can_retrieve_audio;
>      bool_t can_play_audio;
> Since TTS daemon always functioning in data retrieve mode, these two
> fields, I suppose, are needless.

Client applications want to be able to both play and retrieve audio.
Again, this was a requirement on the API not just from our side
and we repeatedly receive requests on Speech Dispatcher to be able
to retrieve the synthesized audio and save it in a wav file.

A user typically wants to synthesize his book into an .mp3
file, then listen to it on the bus or something like that.

From the TTS system point of view, the requirement is to be
able to retrieve audio; we do not consider the ability to play
audio on its own important. I think this is described
in Appendix D, Requirements on the synthesizers.

But then you also have to deal with subsystems that can only
play and not retrieve. In current Speech Dispatcher for example,
this is the case with some Generic modules.

>      bool_t can_report_events_by_sentences;
>      bool_t can_report_events_by_words;
>      bool_t can_report_custom_index_marks;
> OK, very needed things, but in terms of daemon we are not talking about
> events reporting. Audio data can be divided onto several chunks at
> requested positions. So, I suggest to replace word "report" by word "mark".

Well, yes. If audio is retrieved, events are marked (in TTS API we 
suggest not inline in the audio but separate with timecodes, but this
doesn't matter much). If audio is played, events are reported.
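
For the retrieval case, marks kept separately from the audio could be
sketched roughly like this. The structure and the helper last_mark_at
are hypothetical, not taken from the TTS API document; the sketch
assumes the marks are sorted by timecode and that each one also
carries a position in the original text.

```c
#include <assert.h>
#include <stddef.h>

typedef enum { MARK_SENTENCE, MARK_WORD, MARK_CUSTOM } mark_type_t;

/* Hypothetical event mark delivered alongside (not inside) the audio. */
typedef struct {
    mark_type_t type;
    unsigned long time_ms;  /* timecode: offset into the retrieved audio */
    size_t text_pos;        /* position in the original text */
} event_mark_t;

/* Return the last mark whose timecode is not after played_ms, or NULL if
 * playback has not reached any mark yet. Marks must be sorted by time_ms. */
const event_mark_t *last_mark_at(const event_mark_t *marks, size_t count,
                                 unsigned long played_ms)
{
    assert(marks != NULL || count == 0);
    const event_mark_t *found = NULL;
    for (size_t i = 0; i < count && marks[i].time_ms <= played_ms; i++)
        found = &marks[i];
    return found;
}
```

A component that plays the retrieved audio itself could use such a
lookup to turn "marked" events into "reported" ones, or to find the
text position at which playback was interrupted.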

Now that I consider it, 'mark' can be used in both
contexts while 'report' really implies only one, so I think
'mark', as you propose, is better.

>      bool_t can_defer_message;
> In terms of daemon not needed.

Defer is the name used here for PAUSE. We considered calling it 'pause',
but decided against it because 'pause' implies too much relation with
audio. But to be able to do PAUSE on text being synthesized and
played, you cannot just do that on the audio. The text might not
be synthesized yet (and often will not be). So what you want to do
is at the same time stop further synthesis and pause audio.

Now there are two ways to proceed. Either you throw away the
synthesis in the TTS (and then use say_text_from_position) to
restart the synthesis later on RESUME, or, if supported, you tell
the synthesizer itself to defer the synthesis, but keep the
possibility to continue later.

The advantage of deferring over completely throwing the synthesis
away is of course performance. E.g. if you synthesized a chapter
of a book in a higher quality voice, the difference might
be very significant. So if the synthesizer can do it efficiently,
the API should allow making use of it.
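
The choice between the two ways could be sketched as follows, using
the two voice discovery fields discussed in this thread. The enum and
the function name are hypothetical; preferring defer whenever it is
available is an assumption based on the performance argument above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* The two discovery fields relevant for PAUSE (bool instead of bool_t). */
typedef struct {
    bool can_defer_message;
    bool can_say_text_from_position;
} pause_caps_t;

typedef enum {
    PAUSE_DEFER,                  /* synthesizer keeps its state, continues later */
    PAUSE_RESTART_FROM_POSITION,  /* throw synthesis away, restart on RESUME */
    PAUSE_UNSUPPORTED             /* neither way works; emulate or refuse */
} pause_strategy_t;

/* Hypothetical helper: pick how to implement PAUSE for a given synthesizer. */
pause_strategy_t choose_pause_strategy(const pause_caps_t *caps)
{
    assert(caps != NULL);
    if (caps->can_defer_message)
        return PAUSE_DEFER;  /* cheapest: nothing is synthesized twice */
    if (caps->can_say_text_from_position)
        return PAUSE_RESTART_FROM_POSITION;
    return PAUSE_UNSUPPORTED;
}
```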

>      bool_t can_parse_ssml
> OK, but SSML is going to be replaced by SABLE, it seems to me. May be
> can_parse_sable is better?

Is that so? It has been quite some time since we worked on TTS API,
and things might have changed. Could you please send a pointer to
resources claiming so?

The advantages of SSML are that:
   1) It is currently supported by the two most important FOSS
      TTS systems (eSpeak and Festival).
   2) We found a reasonable (recommended by the authors) way to extend it.

The disadvantage is that some important things are missing,
so we need to use custom extensions.

But of course we should consider this carefully. If SSML were going
to be replaced, we would of course want to use SABLE. So this is actually
quite important.

>      bool_t supports_multilingual_utterances;
> OK, but I would like to restrict daemon to process only one language
> by one connection.

If a daemon can only do one language per connection, then
first of all, it can't support full SSML.

I think this is only NICE TO HAVE, but in some cases quite important.
There are sentences which contain more than one language, e.g. in
a textbook explaining something. Much more often, you will get a bigger
piece of text, e.g. a paragraph or a list of words in a vocabulary,
which contains more than one language.

Of course it is possible to break this text before handing it
into the TTS and then have it processed in different languages
in different connections, but this leads to all the problems
with modifying the text before the TTS, as described above.
I'm pretty sure we will have to implement this, because most
TTS currently do not support it. But this is an emulation and
should not be forced by the design, especially if the TTS
below actually /has/ support for multi-lingual utterances.

I hope you are not discouraged by my answers :) I think
it is very useful that you analyze the API and that we
get a chance to discuss and clear such points. I don't
think all the reasons for all the decisions are sufficiently
described in the document and also here we get a chance
to examine whether those decisions were not wrong (which
they of course might be -- we were more sure about some
than we were about others).

To be concrete: I'm very sure that TTS systems should, whenever
possible, be able to analyze the original unmodified text.
I'm less sure about the decisions about
the PAUSE/RESUME/DEFER logic and how TEXT/AUDIO/EVENTS interact.
Your input is very welcome and please continue with your

Best regards,
Hynek Hanke
