[Speechd] KTTS and SpeechD integration
From: Gary Cramblitt
Subject: [Speechd] KTTS and SpeechD integration
Date: Sun Aug 13 12:52:48 2006

Sorry for the late reply. I've been busy with some other things.
I'm very encouraged by your reply and believe we can reach a mutually
acceptable solution.
On Wednesday 04 May 2005 05:18 pm, Hynek Hanke wrote:
> Hello Gary,
>
> thank you for your detailed description of the goals and remaining issues.
> I'll try to comment on the latter.
>
> 1) Synchronization with client applications
>
> > If a synth cannot return a wav file, the next ideal
> > plugin asynchronously speaks a message, sending directly to audio device,
> > and notifies KTTS when the speech output is finished.
>
> I think this is the way we can go. Speech Dispatcher currently doesn't
> support notification, but many things are already prepared so that this can
> be implemented. Speech dispatcher has all the necessary information
> internally, so the only remaining issue is how to communicate this
> information to the client application. When that is solved/implemented, it
> will notify client applications not only about the beginning and the end of
> speaking a message, but also about reaching index marks, which you can
> insert into SSML messages (when the end synthesizer allows).
>
> I'll work on it now. I'm going to post a separate email about this soon.
I'm willing to forego the requirement to return a wav file if SpeechD can
offer the functionality needed. I have some suggestions for how we might
want to implement callbacks, which I'll send in a separate email to this list.
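To make the callback idea a little more concrete ahead of that email, here is
a minimal sketch in plain C++. Every name in it is hypothetical -- it is not
the current SpeechD or KTTS API -- but it shows the kind of interface I have
in mind: the client registers a handler and receives begin/end events for each
message, plus events for SSML index marks such as <mark name="s1"/>.

// Hypothetical sketch only -- none of this is the current SpeechD or KTTS API.
#include <functional>
#include <iostream>
#include <string>

enum class SpeechEvent { MessageBegin, MessageEnd, IndexMarkReached };

// The client registers one handler per connection; the server (SpeechD)
// invokes it as synthesis/playback progresses.
using SpeechCallback =
    std::function<void(int messageId, SpeechEvent event, const std::string& mark)>;

int main() {
    SpeechCallback onEvent = [](int id, SpeechEvent ev, const std::string& mark) {
        if (ev == SpeechEvent::MessageBegin)
            std::cout << "job " << id << " started\n";
        else if (ev == SpeechEvent::IndexMarkReached)
            std::cout << "job " << id << " reached mark " << mark << "\n";
        else
            std::cout << "job " << id << " finished\n";
    };

    // An SSML message with index marks the synthesizer can report back on.
    const std::string ssml =
        "<speak>First sentence.<mark name=\"s1\"/>"
        "Second sentence.<mark name=\"s2\"/></speak>";
    std::cout << "submitting: " << ssml << "\n";

    // Simulate the notifications SpeechD would emit for message 42.
    onEvent(42, SpeechEvent::MessageBegin, "");
    onEvent(42, SpeechEvent::IndexMarkReached, "s1");
    onEvent(42, SpeechEvent::IndexMarkReached, "s2");
    onEvent(42, SpeechEvent::MessageEnd, "");
}

The important part for KTTS is simply that the end-of-message event arrives
without KTTS having to count sentences itself.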
>
> 2) Sentence boundary detection
>
> > KTTS parses text into individual sentences and sends them one at a time
> > to the synth plugin. This is key in order to provide:
>
> We were also doing this in Speech Dispatcher some time ago, but it
> turned out not to be the best approach for several reasons I'll explain
> below. I believe festival-freebsoft-utils now offers a better solution
> which, in connection with index marking, could address these goals too.
>
> There are a number of issues we found with cutting text into sentences in
> Speech Dispatcher and sending just the sentences to the output modules and
> synthesizers.
>
> 1) The TTS has no opportunity to do more advanced syntactic analysis, since
> it is only allowed to operate on one sentence at a time.
>
> 2) We need to handle language dependent issues in a project (Speech
> Dispatcher, KTTSD) that should be language independent.
>
> 3) How to cut SSML or other markup into sentences?
>
> 4) How to cut data that are not ordinary text (program source code, ...)?
>
> 5) It makes the output module much more complicated if good performance is
> a concern. The next sentence must already have been sent for synthesis
> before the previous one has finished playing in the speakers so that the
> TTS doesn't sit idle, and sentences of different lengths may cause
> unnecessary delays.
KTTS actually addresses all these issues fairly well. I'll explain in a
separate email, but it is academic because KTTS can forego Sentence Boundary
Detection so long as the needed functionality is provided by SpeechD.
>
> It should not be a problem to pass the whole text into the TTS at once as
> long as:
>
> 1) The TTS is able to return partial results as soon as they are available
> (so that we don't have to wait until the whole text is synthesized).
>
> 2) The TTS provides synchronization information (index marks).
>
> 3) The TTS is able to start synthesizing the text from an arbitrary index
> mark in the given (complete) text.
>
> Currently, (1) and (2) are provided by festival-freebsoft-utils. What do
> you, Milan, think about (3)?
>
> Let me explain how these three would address your needs.
>
> > 1. Ability to advance or rewind.
>
> The output module knows the position in the text from the index marks it
> has received (2). It can skip sentences forwards or backwards by sending
> the whole text again together with the identifier of the index mark it
> wants to start from, according to (3).
In the case of advance/rewind of text that is not yet completely spoken,
you might want to consider providing a way to do this without the need to
resend text, since I'm assuming that SpeechD still has the text. Otherwise,
KTTS would need to maintain a queue of its own. My objective is to remove
all queuing from KTTS and let SpeechD handle that.
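To illustrate, here is a rough sketch of what advance/rewind without resending
could look like from the client side. The type and function names are made up
for this example; it simply assumes SpeechD keeps the submitted text and can
resume playback from a named index mark.

// Hypothetical sketch only: assumes SpeechD keeps the submitted text and its
// index marks, so the client never has to resend the job to skip around.
#include <iostream>
#include <string>
#include <vector>

struct SpeechJob {
    int id;
    std::vector<std::string> marks;  // index marks in document order
    int currentMark;                 // last mark reported back by SpeechD
};

// Stand-in for a request the client would send to SpeechD.
void resumeFromMark(const SpeechJob& job, int markIndex) {
    std::cout << "RESUME job " << job.id
              << " FROM MARK " << job.marks[markIndex] << "\n";
}

// Skip forwards (delta > 0) or backwards (delta < 0) by whole sentences.
void skip(SpeechJob& job, int delta) {
    int last = static_cast<int>(job.marks.size()) - 1;
    int target = job.currentMark + delta;
    if (target < 0) target = 0;
    if (target > last) target = last;
    job.currentMark = target;
    resumeFromMark(job, target);
}

int main() {
    SpeechJob book{7, {"s1", "s2", "s3", "s4"}, 1};
    skip(book, +2);  // advance two sentences -> resume from s4
    skip(book, -1);  // rewind one sentence  -> resume from s3
}

The point is that KTTS would then track nothing more than the job id and the
last mark SpeechD reported, not the text itself.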
>
> > 2. Ability to intermix with higher-priority messages.
>
> When a higher-priority message comes, the output module knows the position
> from the last received index mark (2), so it can instruct the TTS to start
> playing from there again (3) once that higher-priority message has been
> spoken.
>
> > 3. Ability to change voice or synth in the middle of a long job.
>
> This is very similar to the previous point.
>
> > 4. Notification to apps of start/end of each sentence as well as text
> > job as a whole.
>
> Yes, this is very desirable and (2) ensures it's possible.
>
> Currently, this is implemented between Speech Dispatcher and Festival in
> the following way. Speech Dispatcher still does some basic sentence
> boundary detection, but only to insert its own private index marks (in
> addition to the index marks inserted by the client application). Ideally,
> this should eventually be handled by Festival or whichever TTS is in use,
> so that Speech Dispatcher doesn't have to do anything with the text
> itself. Then the whole text (in SSML) is sent to Festival for synthesis,
> and the (speechd-next) Festival function is called repeatedly to retrieve
> the newly synthesized speech or the information that an index mark has
> been reached.
ok
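Just to confirm my understanding, here is a rough sketch of that retrieval
loop as I read it. The types are invented for illustration and this is not the
actual output module code or the real (speechd-next) binding: each call
returns either a chunk of synthesized audio to play or a notification that an
index mark was reached.

// Hypothetical types only -- they just illustrate the shape of the loop.
#include <iostream>
#include <queue>
#include <string>

struct SynthResult {
    enum Kind { Audio, IndexMark, Done } kind;
    std::string payload;  // audio chunk id or index mark name, for illustration
};

int main() {
    // Pretend these are the results Festival produces for one SSML message.
    std::queue<SynthResult> pending;
    pending.push({SynthResult::Audio, "chunk-1"});
    pending.push({SynthResult::IndexMark, "s1"});
    pending.push({SynthResult::Audio, "chunk-2"});
    pending.push({SynthResult::Done, ""});

    // The output module keeps asking for the next result until the message
    // is finished, playing audio and reporting index marks as they arrive.
    while (!pending.empty()) {
        SynthResult r = pending.front();
        pending.pop();
        if (r.kind == SynthResult::Done)
            break;
        if (r.kind == SynthResult::Audio)
            std::cout << "play " << r.payload << "\n";         // send to audio device
        else
            std::cout << "notify mark " << r.payload << "\n";  // report position upstream
    }
}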
>
> 3) Priority models
>
> >   KTTS Type          SpeechD Type
> >   ----------------   ---------------------
> >   Text               Message
> >   Message            Important
> >   Warning            Important
>
> This mapping would go strongly against the philosophy of Speech Dispatcher,
> because the priority important is really only meant for short important
> messages, not for ordinary messages; this prevents ``pollution'' of the
> queues with important messages, which can't be discarded or postponed by
> any other priority.
>
> Let me stop for a moment and explain our view of message management in
> accessibility. We don't imagine an accessible computer for the blind as
> just a screen reader; we also think about the concept of what we currently
> call an application reader. I'll explain the difference.
>
> A screen reader is a general purpose tool which you can use to do most of
> your work, and it should provide you with a reasonable level of access to
> all applications that don't have some kind of barrier built into them.
>
> A screen reader typically sees only the surface of what is actually
> contained in the application. It might be customized to some degree for a
> particular application, but it still doesn't have access to some of the
> information the application knows.
>
> So it might make sense to build accessibility into some applications
> directly, so that these accessibility solutions can take advantage of the
> particular details of the application and of the information that is
> available when they operate *inside* the application.
>
> I'll give two examples.
>
> One of them is what we are currently doing with Emacs in speechd-el.
> speechd-el knows the origin of each message so it can decide which
> information is and which isn't important for the user (according to user
> configuration of course) and take advantage of the advanced priority model.
> It knows the language and character encoding of buffers, so it can switch
> languages automatically. It can perform other actions that a general
> purpose screen reader couldn't. Maybe Milan Zamazal can give better
> examples.
>
> Another example is the GNU Typist package. If you are not familiar with it,
> it's a very good tool to help people learn touch typing. While in the
> previous case of Emacs, building accessibility into the application
> directly rather than relying on a general purpose screen reader was a
> convenience, for GNU Typist it will be an absolute need. I can't imagine
> how you would try to learn touch typing just by reading the application
> with a general purpose screen reader -- you need dictation, you need
> notification about errors, and so on, something that a general purpose
> screen reader without any understanding of the application can't provide.
>
> Of course these specialized solutions, which we call application readers,
> should not be hardwired into the main program. Rather, they should be some
> extension or an additional layer built on top of the program.
>
> Let's return to the priority model. Now you understand that once we get
> past the earliest stage in the GUI, we will arrive at the situation we
> already have in text mode. There will typically be more clients connected
> to the high level TTS API in use (KTTSD, Speech Dispatcher), not just the
> screen reader. Some of them might be able to make good use of the whole
> priority system, some not.
>
> Now you can see why we don't strictly separate the screen reader from other
> clients in Speech Dispatcher, and why we don't have any priority that an
> application could use to take full and complete control over everything
> said. Rather, applications are supposed to use priorities such as text and
> message, and when a more important message needs to be spoken (whether it
> originates in these applications or somewhere else), then that message
> *should* pass through and the original messages will be postponed or
> discarded.
>
> I think thinking in these terms should be the basis for our future work.
> Obviously, the priority system in Speech Dispatcher is not optimal and we
> will have to continue working on it.
>
> Milan explained the ideas behind the different priorities in a separate
> email; I have tried to explain the motivation for them. I very much welcome
> all your comments on all of this.
I think the missing element is that this is thought of *only* in the context
of accessibility. Those with good sight will still want to make use of TTS
for reading ebooks, or for speaking web pages while viewing something else.
The problem with the current SpeechD model, as I explained, is its tendency
to discard text for all but high-priority messages. For an ebook reader,
that's just not acceptable.
>
> I really like the ``long text'' idea. Maybe the nicest behavior would be if
> every incoming higher-priority message simply paused the reading of ``long
> text'' and then resumed it again after the higher-priority message has been
> spoken. This way, a user could be reading an ebook while still being able
> to listen to time and new-email notifications, to jump into the mixer to
> adjust the sound volume, etc. What do you think of it?
This would solve the problem. The KTTS to SpeechD mapping might then look
like this:
  KTTS Type          SpeechD Type
  ----------------   ---------------------
  text               long text
  warning/message    message
  screen reader      see below
We will want to enhance the KTTS API to offer all the SpeechD priorities, so
Screen Readers would use whatever is appropriate (important, message, text,
progress, notification). (Since there are no Screen Readers in KDE at the
moment, it is not a problem to remove the existing Screen Reader type from
the KTTS API.) We should also deprecate the KTTS warning/message types and
advise programmers to choose from the full SpeechD priority model depending
upon the application's needs.
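As a sketch of what that migration could look like on the KTTS side -- with
hypothetical enum and function names, and bearing in mind that ``long text''
is still only a proposed priority -- the deprecated KTTS types would simply
collapse onto the SpeechD priority set, while screen readers and new
applications choose a SpeechD priority directly:

// Hypothetical sketch of the mapping above; "LongText" is the proposed new
// priority, the others follow the existing SpeechD priority names.
#include <iostream>

enum class KttsType { Text, Warning, Message };  // old Screen Reader type dropped
enum class SpeechdPriority { Important, Message, Text, Notification, Progress, LongText };

// Deprecated KTTS types collapse onto the SpeechD model; screen readers and
// new applications would pick a SpeechdPriority directly instead.
SpeechdPriority mapLegacyType(KttsType t) {
    switch (t) {
        case KttsType::Text:    return SpeechdPriority::LongText;
        case KttsType::Warning: return SpeechdPriority::Message;
        case KttsType::Message: return SpeechdPriority::Message;
    }
    return SpeechdPriority::Text;  // unreachable; keeps the compiler happy
}

const char* name(SpeechdPriority p) {
    switch (p) {
        case SpeechdPriority::Important:    return "important";
        case SpeechdPriority::Message:      return "message";
        case SpeechdPriority::Text:         return "text";
        case SpeechdPriority::Notification: return "notification";
        case SpeechdPriority::Progress:     return "progress";
        case SpeechdPriority::LongText:     return "long text";
    }
    return "?";
}

int main() {
    std::cout << name(mapLegacyType(KttsType::Text)) << "\n";     // long text
    std::cout << name(mapLegacyType(KttsType::Warning)) << "\n";  // message
}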
>
> 4) Conclusions
>
> > Given these issues, I cannot presently move forward with integrating
> > SpeechD with KTTS. If I had my way, SpeechD would offer callbacks to
> > notify when messages have been spoken.
>
> It will.
>
In summary, I have two essential needs in order to move forward with
integrating KTTS with SpeechD.
1. Callbacks. (Begin/End of sentence/job. By "job", I mean that I need to
know when the last sentence of a request has been spoken; otherwise I need
to do sentence boundary detection in order to count the sentences to know
when a job has finished.)
2. long text priority
As I mentioned, I'll send some ideas for callbacks in a separate email.
I want to mention timing a bit. As you probably know, KDE will be switching
to Qt4 beginning sometime in the next few months. (I believe Qt4 final is
scheduled for June of this year.) Qt4 offers the GUI framework that makes a
KDE Screen Reader possible. We will most likely adapt Gnopernicus to use
Qt4/AT-SPI, but that is by no means decided. It would be a good
thing if the KTTS/SpeechD integration were well on its way to being finished
when the time comes to integrate with Qt4/AT-SPI/Gnopernicus. I can't say
with certainty, but I believe early Fall would be the timeframe when all this
will be occurring. It would be *really* nice if we could announce the
completion (or near completion) of KTTS/SpeechD integration at the annual KDE
conference (aKademy 2005), which begins August 26th. Do you think this
might be possible?
Regards
--
Gary Cramblitt (aka PhantomsDad)
KDE Text-to-Speech Maintainer
http://accessibility.kde.org/developer/kttsd/index.php