[Speechd] KTTS and SpeechD integration
From: Gary Cramblitt
Subject: [Speechd] KTTS and SpeechD integration
Date: Mon Sep 4 09:59:48 2006
Hynek and I have been discussing integration of the KDE Text-to-Speech System
(KTTS) and Speech Dispatcher. If this could be done, it would offer several
advantages:
1. KDE users would have TTS capability from boot-up to shutdown.
(Currently, KDE must be running before KTTS can produce any speech.)
2. KDE users would have TTS capability in terminal (character cell) apps.
3. We could unify our efforts. If a new voice or synth became available,
we need only enhance one package.
4. SpeechD has better performance and lower latency than KTTS.
The eventual goal, if it can be achieved, is to eliminate the KTTS backend
(called kttsd), and replace it with SpeechD. KTTS would then provide a GUI
frontend for configuration of SpeechD, as well as additional capabilities,
such as:
1. Ability for users to interactively pause, rewind, advance, or stop
speech output.
2. Integration with the KDE notification system ("New mail has arrived.").
3. Text substitution and filtering. (The IRC message "<PhantomsDad> Hello"
becomes "PhantomsDad says Hello".)
4. Document conversion, such as HTML to SSML, PDF to SSML, etc.
Towards this goal, I sat down to write a SpeechD plugin for KTTS, but
immediately ran into some roadblocks. I'd like to explain these roadblocks
so the SpeechD team can consider possible changes to SpeechD.
KTTS uses a plugin architecture for synthesizers. The ideal plugin can
asynchronously synthesize a message, notify KTTS when it is completed, and
return a wav file. If a synth cannot return a wav file, the next ideal
plugin asynchronously speaks a message, sending it directly to the audio device, and
notifies KTTS when the speech output is finished. The next ideal plugin
synchronously synthesizes a message and returns a wav file. The least ideal
plugin synchronously synthesizes a message and sends it directly to the audio
device. In order not to block KTTS or KDE apps, synchronous plugins are run
in a separate thread.
SpeechD doesn't fall into any of these models. It does not return a wav file.
More seriously, it always runs asynchronously but does not notify when speech
of a message has completed.
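
As a rough illustration of the four plugin models and of why SpeechD fits
none of them, here is a minimal Python sketch. It is not KTTS code; the names
`SynthBehavior`, `plugin_rank`, and `needs_worker_thread` are invented for
this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SynthBehavior:
    """Hypothetical description of how a synthesizer behaves."""
    asynchronous: bool         # returns immediately and works in the background
    returns_wav: bool          # hands back a wav file instead of playing audio
    notifies_completion: bool  # tells the caller when an async message is done


def plugin_rank(b: SynthBehavior) -> Optional[int]:
    """Return 1 (most ideal) to 4 (least ideal), or None if no model fits."""
    if b.asynchronous and not b.notifies_completion:
        # Asynchronous without a completion signal: the caller can never
        # tell when a message has finished, so none of the models apply.
        return None
    if b.asynchronous:
        return 1 if b.returns_wav else 2
    # For synchronous plugins, the call returning is the completion signal.
    return 3 if b.returns_wav else 4


def needs_worker_thread(b: SynthBehavior) -> bool:
    """Synchronous plugins run in a separate thread so they do not block."""
    return not b.asynchronous


# SpeechD as described above: asynchronous, no wav file, no notification.
SPEECHD = SynthBehavior(asynchronous=True, returns_wav=False,
                        notifies_completion=False)
```

With this model, `plugin_rank(SPEECHD)` returns `None`, which is the roadblock
described above.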
KTTS parses text into individual sentences and sends them one at a time to the
synth plugin. This is key in order to provide:
1. Ability to advance or rewind.
2. Ability to intermix with higher-priority messages.
3. Ability to change voice or synth in the middle of a long job.
4. Notification to apps of start/end of each sentence as well as text job
as a whole.
Now SpeechD has its own priority and queueing system, so my next approach was
to forgo these capabilities and immediately send all messages to SpeechD.
In addition to losing the capabilities listed above, this would also mean
that KTTS users could not combine SpeechD with other KTTS plugins, as speech
from the other plugins would either block while SpeechD is speaking, or talk
simultaneously, depending upon their PC's audio capabilities.
KTTS provides 4 types/priorities of messages. In order of priority (highest
to lowest) they are:

  Screen Reader.  Interrupts all other messages, including other Screen
                  Reader outputs. Not a queue; there is only one Screen
                  Reader output at a time.
  Warning.        Interrupts all lower-priority messages. Is a queue, so
                  does not interrupt other Warnings.
  Message.        Interrupts messages of type Text. Also a queue.
  Text.           Interrupted by all other message types. A queue.

Notice that none of these message types discard other messages except for
Screen Reader, which only discards other Screen Reader messages.
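
The selection rules above can be modelled roughly as follows. This is a
simplified sketch, not the kttsd implementation: the class and method names
are made up, and it only decides which utterance is spoken next. It does not
model interrupting speech in mid-sentence or resuming an interrupted message.

```python
from collections import deque


class KttsQueues:
    """Toy model of the four KTTS message types (hypothetical names)."""

    QUEUED_TYPES = ("warning", "message", "text")  # highest priority first

    def __init__(self):
        self.screen_reader = None  # a single slot, not a queue
        self.queues = {kind: deque() for kind in self.QUEUED_TYPES}

    def say(self, kind, utterance):
        if kind == "screen_reader":
            # A new Screen Reader output discards any earlier one.
            self.screen_reader = utterance
        else:
            # Warning, Message, and Text are queues; nothing is discarded.
            self.queues[kind].append(utterance)

    def next_utterance(self):
        """Return (kind, utterance) for the highest-priority pending item."""
        if self.screen_reader is not None:
            utterance, self.screen_reader = self.screen_reader, None
            return ("screen_reader", utterance)
        for kind in self.QUEUED_TYPES:
            if self.queues[kind]:
                return (kind, self.queues[kind].popleft())
        return None
```

For example, after queuing two Screen Reader outputs and two Warnings, only
the second Screen Reader output is returned, followed by both Warnings in
order.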
So I began looking at the message types/priorities in SpeechD API to see how
the KTTS message types would map onto them. Since only Screen Reader
discards other messages -- and only discards other Screen Reader messages, I
immediately eliminated all the SpeechD message types that can be discarded --
namely 'Text', 'Notification', and 'Progress'. This left only 'Message' and
'Important'. So it appears I could map KTTS 'Text' messages to SpeechD
'Message' messages, and KTTS 'Message' and 'Warning' messages to SpeechD
'Important' messages. (We have considered eliminating 'Warning' type
messages from KTTS anyway, so mapping both 'Warning' and 'Message' onto
'Important' would not be a hardship.) The following table summarizes:

  KTTS Type         SpeechD Type
  ----------------  ---------------------
  Text              Message
  Message           Important
  Warning           Important
  Screen Reader     ?

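
The elimination step and the resulting table can be written down as a small
Python sketch. The set of discardable types reflects the reading of the
SpeechD documentation given in this message, which may be a misreading; the
function name `speechd_priority` is invented for the example.

```python
# SpeechD priorities, highest first, spelled as in the SpeechD docs.
SPEECHD_PRIORITIES = ["important", "message", "text", "notification", "progress"]

# Types that can be discarded, according to the interpretation in this post.
DISCARDABLE = {"text", "notification", "progress"}

# Only types that are never discarded are candidates for KTTS queues.
NEVER_DISCARDED = [p for p in SPEECHD_PRIORITIES if p not in DISCARDABLE]

# The mapping from the table; None marks the unresolved Screen Reader case.
KTTS_TO_SPEECHD = {
    "Text": "message",
    "Message": "important",
    "Warning": "important",
    "Screen Reader": None,
}


def speechd_priority(ktts_type):
    """Look up the SpeechD priority for a KTTS type (hypothetical helper)."""
    target = KTTS_TO_SPEECHD[ktts_type]
    if target is None:
        raise LookupError("no suitable SpeechD priority for " + ktts_type)
    return target
```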
Now what to do about Screen Reader? The SpeechD message type that behaves
most like KTTS Screen Reader is 'Text', but 'Text' messages are lower
priority than 'Message' messages. Furthermore, 'Text' messages are discarded
by 'Message' messages, but strangely, not discarded by 'Important' messages.
Now it is possible I'm not reading the SpeechD API correctly. It may be that
I am misinterpreting the word "cancel" in the docs. Under 'Important', it
says
--
When a new message of level `important' comes during a message of another
priority is being spoken, this message other message is canceled and the
message with priority `important' is said instead. Other messages of lower
priorities are either postponed (priority `message' and `text') until there
are no messages of priority important waiting or canceled (priority
`notification' and `progress'.
--
Then under 'Message' type it says
--
If there are messages of priority `notification', `progress' or `text' waiting
in the queue or being spoken when a message of priority `message' comes,
these are canceled.
--
Here, I interpret "canceled" as meaning discarded. Even if I have that wrong,
and "canceled" just means postponed, it doesn't matter because 'Text'
messages are of lower priority than 'Message' or 'Important' and therefore
are not suitable for KTTS Screen Reader types.
So what I need is a message type like 'Important', but which interrupts and
discards itself. I thought about trying to use the SSIP CANCEL command to
simulate such a message type, but since I have no way of knowing what kind of
message SpeechD is currently speaking, that won't work.
Stopping for a moment and reflecting on these issues, I came to the
realization that SpeechD has a priority system that is ideal for Screen
Readers, but not so good for speaking longer texts, such as web pages, PDF
documents, or ebooks, while still providing interruption by higher-priority
messages. The 'Text', 'Notification', and 'Progress' types are ideal for
screen readers, but strangely are of lower priority than 'Important' or
'Message'. What seems to be missing is a "long text" type that is of lower
priority than 'Text', 'Notification', and 'Progress', but is never discarded
(unless the application specifically cancels it).
Given these issues, I cannot presently move forward with integrating SpeechD
with KTTS. If I had my way, SpeechD would offer callbacks to notify when
messages have been spoken. This would allow me to immediately write a plugin
for KTTS with the least amount of disruption to the existing KTTS
architecture and API. It would also allow us to migrate the entire kttsd
backend towards using SpeechD, although some additional changes to both APIs
would be needed in order to accomplish that.
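
To show what a completion callback would make possible, here is a minimal
Python sketch of sentence-by-sentence dispatch. Everything in it is
hypothetical: `FakeSynth`, its `speak(sentence, on_done)` method, and
`TextJob` are stand-ins invented for this example and are not part of the
SpeechD or KTTS APIs. The sentence splitter is also deliberately naive.

```python
import re


class FakeSynth:
    """Stand-in synthesizer that reports completion through a callback."""

    def __init__(self):
        self.spoken = []      # record of everything sent to the synth
        self._pending = None  # callback for the sentence in progress

    def speak(self, sentence, on_done):
        self.spoken.append(sentence)
        self._pending = on_done

    def finish_current(self):
        """Simulate the synth finishing the sentence in progress."""
        callback, self._pending = self._pending, None
        if callback is not None:
            callback()


class TextJob:
    """Sends one sentence at a time, advancing on each completion callback."""

    def __init__(self, text, synth):
        parts = re.split(r"(?<=[.!?])\s+", text.strip())
        self.sentences = [p for p in parts if p]
        self.synth = synth
        self.index = 0
        self.finished = False

    def start(self):
        self._speak_current()

    def _speak_current(self):
        if self.index >= len(self.sentences):
            self.finished = True
            return
        self.synth.speak(self.sentences[self.index], self._sentence_done)

    def _sentence_done(self):
        self.index += 1
        self._speak_current()

    def rewind(self, count=1):
        # A real plugin would also have to stop the sentence in progress.
        self.index = max(0, self.index - count)
        self._speak_current()
```

Because the job only moves on when the callback fires, it always knows which
sentence is being spoken, and that is what makes rewind, advance, and
start/end notifications to applications possible.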
Thanks for listening.
--
Gary Cramblitt (aka PhantomsDad)
KDE Text-to-Speech Maintainer
http://accessibility.kde.org/developer/kttsd/index.php