speechd-discuss

[Speechd] KTTS and SpeechD integration


From: Gary Cramblitt
Subject: [Speechd] KTTS and SpeechD integration
Date: Mon Sep 4 09:59:48 2006

Hynek and I have been discussing integration of the KDE Text-to-Speech System 
(KTTS) and Speech Dispatcher.  If this could be done, it would offer several 
advantages:

  1.  KDE users would have TTS capability from boot-up to shutdown.  
(Currently, KDE must be running before KTTS can produce any speech.)

  2.  KDE users would have TTS capability in terminal (character cell) apps.

  3.  We could unify our efforts.  If a new voice or synth became available, 
we need only enhance one package.

  4.  SpeechD's performance and latency are better than those of KTTS.

The eventual goal, if it can be achieved, is to eliminate the KTTS backend 
(called kttsd), and replace it with SpeechD.  KTTS would then provide a GUI 
frontend for configuration of SpeechD, as well as additional capabilities, 
such as:

  1.  Ability for users to interactively pause, rewind, advance, or stop 
speech output.

  2.  Integration with the KDE notification system ("New mail has arrived.").

  3.  Text substitution and filtering.  (The IRC message "<PhantomsDad> Hello" 
becomes "PhantomsDad says Hello".)

  4.  Document conversion, such as HTML to SSML, PDF to SSML, etc.

Towards this goal, I sat down to write a SpeechD plugin for KTTS, but 
immediately ran into some roadblocks.  I'd like to explain these roadblocks 
so the SpeechD team can consider possible changes to SpeechD.

KTTS uses a plugin architecture for synthesizers.  The ideal plugin can 
asynchronously synthesize a message, notify KTTS when it is completed, and 
return a wav file.  If a synth cannot return a wav file, the next-best 
plugin asynchronously speaks a message, sending it directly to the audio 
device, and notifies KTTS when the speech output is finished.  Third best is 
a plugin that synchronously synthesizes a message and returns a wav file.  
The least desirable plugin synchronously synthesizes a message and sends it 
directly to the audio device.  In order not to block KTTS or KDE apps, 
synchronous plugins are run in a separate thread.
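
The preference order above can be sketched in a few lines of Python.  This is 
only an illustration: the class, the function names, and the plugin names are 
hypothetical and are not part of the actual KTTS plugin API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PluginCaps:
    """Hypothetical description of what a synth plugin can do."""
    name: str
    asynchronous: bool  # notifies the caller on completion instead of blocking
    returns_wav: bool   # returns a wav file rather than playing audio itself


def preference_rank(caps: PluginCaps) -> int:
    """Lower is better, following the order described in the text."""
    order = {
        (True, True): 0,    # async, returns wav (ideal)
        (True, False): 1,   # async, speaks directly to the audio device
        (False, True): 2,   # sync, returns wav
        (False, False): 3,  # sync, speaks directly to the audio device
    }
    return order[(caps.asynchronous, caps.returns_wav)]


def needs_worker_thread(caps: PluginCaps) -> bool:
    """Synchronous plugins run in a separate thread so they do not block."""
    return not caps.asynchronous


# Made-up plugins, one per combination of capabilities.
plugins = [
    PluginCaps("sync-audio", asynchronous=False, returns_wav=False),
    PluginCaps("async-wav", asynchronous=True, returns_wav=True),
    PluginCaps("sync-wav", asynchronous=False, returns_wav=True),
    PluginCaps("async-audio", asynchronous=True, returns_wav=False),
]
ranked = sorted(plugins, key=preference_rank)
```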

SpeechD doesn't fall into any of these models.  It does not return a wav file.  
More seriously, it always runs asynchronously but does not notify when speech 
of a message has completed.

KTTS parses text into individual sentences and sends them one at a time to the 
synth plugin.  This is key in order to provide:

  1.  Ability to advance or rewind.
  2.  Ability to intermix with higher-priority messages.
  3.  Ability to change voice or synth in the middle of a long job.
  4.  Notification to apps of the start/end of each sentence as well as of 
the text job as a whole.
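
To show why sentence-at-a-time delivery makes rewind and advance cheap, here 
is a small Python sketch.  It is an assumption-laden illustration: the regex 
splitter is far cruder than a real sentence parser, and `TextJob` and its 
methods are hypothetical names, not KTTS code.

```python
import re


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


class TextJob:
    """Hypothetical text job that hands out one sentence at a time."""

    def __init__(self, text):
        self.sentences = split_sentences(text)
        self.position = 0  # index of the next sentence to send to the synth

    def next_sentence(self):
        """Return the next sentence, or None when the job is finished."""
        if self.position >= len(self.sentences):
            return None
        sentence = self.sentences[self.position]
        self.position += 1
        return sentence

    def rewind(self, count=1):
        """Move back by `count` sentences, stopping at the start."""
        self.position = max(0, self.position - count)

    def advance(self, count=1):
        """Skip forward by `count` sentences, stopping at the end."""
        self.position = min(len(self.sentences), self.position + count)
```

Because the job only tracks an index into the sentence list, a higher-priority 
message or a voice change can be slotted in between any two calls to 
`next_sentence()`.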

Now SpeechD has its own priority and queueing system, so my next approach was 
to forgo these capabilities and immediately send all messages to SpeechD.  
In addition to losing the capabilities listed above, this would also mean 
that KTTS users could not combine SpeechD with other KTTS plugins, as speech 
from the other plugins would either block while SpeechD is speaking, or talk 
simultaneously, depending upon their PC's audio capabilities.

KTTS provides 4 types/priorities of messages.  In order of priority (highest 
to lowest) they are:

  Screen Reader.  Interrupts all other messages, including other Screen Reader 
outputs.  Not a queue; there is only one Screen Reader output at a time.

  Warning.  Interrupts all lower-priority messages.  Is a queue, so does not 
interrupt other Warnings.

  Message.  Interrupts messages of type Text.  Also a queue.

  Text.  Interrupted by all other message types.  A queue.

Notice that none of these message types discard other messages except for 
Screen Reader, which only discards other Screen Reader messages.
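
The four KTTS message types can be modelled roughly as below.  This Python 
sketch is a simplification for illustration only; the class and method names 
are hypothetical, and it models which message should be speaking rather than 
any actual audio handling.

```python
from collections import deque
from enum import IntEnum


class KttsType(IntEnum):
    TEXT = 0
    MESSAGE = 1
    WARNING = 2
    SCREEN_READER = 3


class KttsScheduler:
    """Hypothetical model of the KTTS priorities described above."""

    QUEUED = (KttsType.WARNING, KttsType.MESSAGE, KttsType.TEXT)

    def __init__(self):
        self.screen_reader = None  # a single slot, not a queue
        self.queues = {kind: deque() for kind in self.QUEUED}

    def submit(self, kind, text):
        if kind is KttsType.SCREEN_READER:
            # Replaces (discards) any earlier Screen Reader output.
            self.screen_reader = text
        else:
            self.queues[kind].append(text)

    def current(self):
        """Return (type, text) of what should be speaking now, or None."""
        if self.screen_reader is not None:
            return (KttsType.SCREEN_READER, self.screen_reader)
        for kind in self.QUEUED:  # highest priority first
            if self.queues[kind]:
                return (kind, self.queues[kind][0])
        return None

    def finished(self):
        """Mark the current message as fully spoken and remove it."""
        cur = self.current()
        if cur is None:
            return
        kind, _ = cur
        if kind is KttsType.SCREEN_READER:
            self.screen_reader = None
        else:
            self.queues[kind].popleft()
```

An interrupted message stays at the head of its queue until `finished()` is 
called for it, which is how the sketch expresses "interrupted but never 
discarded".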

So I began looking at the message types/priorities in SpeechD API to see how 
the KTTS message types would map onto them.  Since only Screen Reader 
discards other messages (and then only other Screen Reader messages), I 
immediately eliminated all the SpeechD message types that can be discarded, 
namely 'Text', 'Notification', and 'Progress'.  This left only 'Message' and 
'Important'.  So it appears I could map KTTS 'Text' messages to SpeechD 
'Message' messages, and KTTS 'Message' and 'Warning' messages to SpeechD 
'Important' messages.  (We have considered eliminating 'Warning' type 
messages from KTTS anyway, so mapping both 'Warning' and 'Message' onto 
'Important' would not be a hardship.)  The following table summarizes:

KTTS Type        SpeechD Type
-------------    ------------
Text             Message
Message          Important
Warning          Important
Screen Reader    ?
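
Expressed as a lookup, the tentative mapping looks like this.  The dictionary 
and function names are hypothetical; the priority names are the SpeechD ones 
discussed in this message, and Screen Reader is deliberately left unmapped.

```python
# Tentative KTTS -> SpeechD priority mapping from the table above.
# None marks a type with no suitable SpeechD priority yet.
KTTS_TO_SPEECHD = {
    "Text": "message",
    "Message": "important",
    "Warning": "important",
    "Screen Reader": None,
}


def speechd_priority(ktts_type):
    """Return the SpeechD priority for a KTTS type, or fail if unmapped."""
    priority = KTTS_TO_SPEECHD[ktts_type]
    if priority is None:
        raise LookupError(f"no SpeechD priority fits KTTS type {ktts_type!r}")
    return priority
```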

Now what to do about Screen Reader?  The SpeechD message type that behaves 
most like KTTS Screen Reader is 'Text', but 'Text' messages are lower 
priority than 'Message' messages.  Furthermore, 'Text' messages are discarded 
by 'Message' messages, but strangely, not discarded by 'Important' messages.

Now it is possible I'm not reading the SpeechD API correctly.  It may be that 
I am misinterpreting the word "cancel" in the docs.  Under 'Important', it 
says

--
When a new message of level `important' comes during a message of another 
priority is being spoken, this message other message is canceled and the 
message with priority `important' is said instead. Other messages of lower 
priorities are either postponed (priority `message' and `text') until there 
are no messages of priority important waiting or canceled (priority 
`notification' and `progress'.
--

Then under 'Message' type it says

--
If there are messages of priority `notification', `progress' or `text' waiting 
in the queue or being spoken when a message of priority `message' comes,
these are canceled.
--

Here, I interpret "canceled" as meaning discarded.  Even if I have that wrong, 
and "canceled" just means postponed, it doesn't matter because 'Text' 
messages are of lower priority than 'Message' or 'Important' and therefore 
are not suitable for KTTS Screen Reader types.
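
To make my reading of the two quoted passages concrete, here is a small 
Python model of which pending messages survive the arrival of a new one.  It 
assumes "canceled" means discarded, looks only at what survives (not at the 
order of speaking or at the message being spoken when an 'important' message 
arrives), and is in no way SpeechD's actual implementation.  The function 
name is hypothetical.

```python
def on_arrival(pending, new):
    """Return the messages still alive after `new` arrives.

    `pending` is a list of (priority, text) tuples that are waiting or being
    spoken; `new` is the arriving (priority, text) tuple.  Assumes that
    "canceled" in the docs means "discarded".
    """
    priority = new[0]
    if priority == "important":
        # 'message' and 'text' are postponed; the rest are canceled.
        dropped = {"notification", "progress"}
    elif priority == "message":
        dropped = {"notification", "progress", "text"}
    else:
        dropped = set()
    survivors = [m for m in pending if m[0] not in dropped]
    return survivors + [new]


pending = [("text", "page"), ("notification", "beep"), ("message", "hello")]
after_important = on_arrival(pending, ("important", "alert"))
after_message = on_arrival([("text", "page")], ("message", "hi"))
```

Under this reading, a 'text' message survives an incoming 'important' message 
but not an incoming 'message' one, which is the asymmetry noted above.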

So what I need is a message type like 'Important', but which interrupts and 
discards itself.  I thought about trying to use the SSIP CANCEL command to 
simulate such a message type, but since I have no way of knowing what kind of 
message SpeechD is currently speaking, that won't work.

Stopping for a moment and reflecting on these issues, I came to the 
realization that SpeechD has a priority system that is ideal for Screen 
Readers, but not so good for speaking longer texts, such as web pages, PDF 
documents, or ebooks, while still providing interruption by higher-priority 
messages.  The 'Text', 'Notification', and 'Progress' types are ideal for 
screen readers, but strangely are of lower priority than 'Important' or 
'Message'.  What seems to be missing is a "long text" type that is of lower  
priority than 'Text', 'Notification', and 'Progress', but is never discarded 
(unless the application specifically cancels it).

Given these issues, I cannot presently move forward with integrating SpeechD 
with KTTS.  If I had my way, SpeechD would offer callbacks to notify when 
messages have been spoken.  This would allow me to immediately write a plugin 
for KTTS with the least amount of disruption to the existing KTTS 
architecture and API.  It would also allow us to migrate the entire kttsd 
backend towards using SpeechD, although some additional changes to both APIs 
would be needed in order to accomplish that.
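
As a rough idea of what I have in mind, the Python sketch below feeds 
sentences to a speech server one at a time, sending the next only when a 
completion callback fires.  Everything here is hypothetical: 
`FakeSpeechClient`, its `say()` method, and the `on_finished` parameter are 
stand-ins invented for illustration and do not exist in SpeechD.

```python
class FakeSpeechClient:
    """Stand-in for a speech server.  NOT the SpeechD API."""

    def __init__(self):
        self.spoken = []

    def say(self, text, on_finished):
        # A real server would speak asynchronously and invoke the callback
        # later; this fake records the text and calls back at once.
        self.spoken.append(text)
        on_finished(text)


class SentenceFeeder:
    """Hypothetical plugin side: one sentence in flight at a time."""

    def __init__(self, client, sentences):
        self.client = client
        self.pending = list(sentences)
        self.done = False

    def start(self):
        self._send_next()

    def _send_next(self, _finished_text=None):
        # Called once to start, then again each time a sentence completes.
        if not self.pending:
            self.done = True
            return
        self.client.say(self.pending.pop(0), on_finished=self._send_next)
```

With such a notification in place, the plugin always knows where it is in a 
job, so rewind, advance, and start/end notifications to apps can stay on the 
KTTS side.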

Thanks for listening.

-- 
Gary Cramblitt (aka PhantomsDad)
KDE Text-to-Speech Maintainer
http://accessibility.kde.org/developer/kttsd/index.php

