eSpeak: API for shared library version
From: Hynek Hanke
Subject: eSpeak: API for shared library version
Date: Mon Sep 4 09:59:52 2006
Jonathan Duddington wrote on Tue, 11. 07. 2006 at 10:33 +0100:
> Here is a draft API for a shared library version of the eSpeak TTS
> synthesizer, which is intended for use by Speech Dispatcher.
Thank you, and I'm sorry for my late reply. I've been working on the TTS API
and also on the latest release of Speech Dispatcher, which now includes
the generic eSpeak output module.
> The following SSML tags are currently implemented:
> <speak> xml:lang xml:base
> <voice> xml:lang, name, gender, age, variant
> <prosody> rate, volume, pitch, range (relative)
> <say-as> glyphs
> <mark>
> <s> xml:lang
> <p> xml:lang
> <sub>
> <audio>
> <tts:style> punctuation, capital_letters
Very good job. You didn't merely implement the SHOULD HAVE
functionality from the TTS API specs, but went further and
implemented a part of the NICE TO HAVEs too.
One more thing I'd like to see done is the interpret-as attributes
tts:char and tts:key (missing SHOULD HAVEs). But it seems obvious
to me from your question below and from your API draft that you are
going to implement them too, so it all looks very good.
> Should <say-as interpret-as="tts:char"> speak the canonical unicode
> name for the character (eg.
> "LATIN SMALL LETTER E WITH CIRCUMFLEX"
> "RIGHT-POINTING DOUBLE ANGLE QUOTATION MARK" ), or is something
> shorter required?
Yes, whenever possible, a shorter form is required. When a blind
user moves the cursor across the screen, he most certainly wants
the information about each character to be as short as possible.
On the other hand, the information must be sufficient for him to
understand what is actually written there. So if a character can be
synthesized directly and the result is not ambiguous, please do so.
(In Czech, both 'i' and 'y' are perfectly synthesizable, but they
have the same sound, so 'y' must be pronounced by its name.) If the
character is not directly synthesizable in the current language, what
you describe above is a good option. As you can see, this is language
dependent. That is the general idea; if you want to or can make it
configurable, that is of course a good thing too.
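Just to illustrate what I have in mind, here is a minimal sketch of how a
client might pass such markup through the shared library, assuming the
names in your speak_lib.h draft (espeak_Initialize, espeak_Synth,
espeak_Synchronize, espeakSSML, espeakCHARS_UTF8) stay roughly as they
are; the details may of course differ:

    #include <string.h>
    #include <speak_lib.h>

    int main(void)
    {
        /* All names here are assumed from the speak_lib.h draft;
           the final API may differ slightly. */
        espeak_Initialize(AUDIO_OUTPUT_PLAYBACK, 0, NULL, 0);

        const char *text =
            "<speak>"
            "The letter <say-as interpret-as=\"tts:char\">y</say-as>."
            "</speak>";

        /* espeakSSML makes the library interpret the markup;
           espeakCHARS_UTF8 declares the input encoding explicitly. */
        espeak_Synth(text, strlen(text) + 1, 0, POS_CHARACTER, 0,
                     espeakSSML | espeakCHARS_UTF8, NULL, NULL);
        espeak_Synchronize();
        return 0;
    }
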
Now about the API at
http://espeak.sourceforge.net/speak_lib.h
* Your handling of the additional <audio> element inside the text
seems like a reasonable approach. Maybe it would even be worth
extending the TTS API in this way in future versions. It seems
unreasonable to me to require the synthesizer to be able to open an
arbitrary URI (with respect to code complexity, privileges etc.).
Just a quick summary for others (please correct me if I'm wrong):
The approach in eSpeak API is as follows: If an <audio> element
is reached during synthesis, a special callback is issued asking
the above layer if the sound is to be played or the alternative
text should be synthesized. If the sound is to be played, an event
is issued with the information from the <audio> tag about the URI
location etc. The above layer is then requested to take care of playing
the sound.
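For illustration only, a rough sketch of how the two callbacks might look
on the side of the above layer, assuming the names from your draft header
(espeak_SetSynthCallback, espeak_SetUriCallback, espeakEVENT_PLAY); the
can_play_uri() helper is of course hypothetical:

    #include <stdio.h>
    #include <speak_lib.h>

    /* Hypothetical decision of the above layer: can it fetch and play
       the sound behind this URI? */
    static int can_play_uri(const char *uri)
    {
        return uri != NULL;                  /* placeholder */
    }

    /* Called when an <audio> element is reached.  Returning 0 means
       "the above layer will play it" (a PLAY event with the URI then
       arrives through the synth callback); non-zero means "speak the
       alternative text inside <audio> instead". */
    static int uri_callback(int type, const char *uri, const char *base)
    {
        (void)type; (void)base;
        return can_play_uri(uri) ? 0 : 1;
    }

    /* Receives synthesized samples plus events, among them the PLAY
       event carrying the <audio> URI in event->id.name. */
    static int synth_callback(short *wav, int numsamples, espeak_EVENT *events)
    {
        (void)wav; (void)numsamples;
        for (espeak_EVENT *e = events; e->type != espeakEVENT_LIST_TERMINATED; e++)
            if (e->type == espeakEVENT_PLAY)
                printf("above layer should play: %s\n", e->id.name);
        return 0;                            /* 0 = continue synthesis */
    }

    void register_callbacks(void)
    {
        espeak_SetSynthCallback(synth_callback);
        espeak_SetUriCallback(uri_callback);
    }
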
I however see a few problems with this approach too. First, it involves
a callback which must return a value, a concept not found anywhere else
in the current TTS API draft and a potential problem for a serialized text
protocol implementation (like we intend, see
http://www.freebsoft.org/tts-api-provider). It also involves having two
different sources of sound data (the synthesizer and the above layer),
which is ugly in case we want to deliver the sound to its destination
over a socket (sockets have only one end on each side).
IMHO a slightly better idea would be to ask the above layer to fetch the
audio data and deliver it back to the synthesizer (or to indicate that
the alternative text should be used). This solves the second problem and
is clearer in its design. It however still involves the problematic
asynchronous callbacks with a return value.
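A purely hypothetical sketch of such an interface, just to make the idea
concrete (none of these names exist in your draft):

    /* Hypothetical alternative: the above layer fetches and decodes the
       sound itself and hands the samples back, so the synthesizer stays
       the single source of audio data.  Returning non-zero would mean
       "use the alternative text instead". */
    typedef struct {
        short *samples;                  /* decoded audio from the above layer */
        int    numsamples;
    } espeak_AUDIO_DATA;

    typedef int (t_espeak_audio_fetch_callback)(const char *uri,
                                                const char *base,
                                                espeak_AUDIO_DATA *out);
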
None of this is a problem for the particular way the eSpeak API is
going to be used. It would however be a problem for synthesizers that
run in a separate process (Festival) and communicate through some IPC.
I'd like to see the opinion of others on this. (This is currently a
nicer than NICE TO HAVE though ;)
* TTS API always uses wchar_t or UTF-8, so this doesn't matter for TTS
API itself; I'd however suggest changing the constant
espeakCHARS_8BIT
to
espeakCHARS_8BIT_ISO
as there are usually many other (incompatible) 8-bit encodings available
for each language besides the ISO ones.
* Shouldn't espeakCHARS_AUTO be described as
"_7_ bit or UTF8 (this is the default)"?
8-bit encodings are not compatible with UTF-8; only lower ASCII (7
bit) is.
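A tiny sketch of what I mean on the application side, assuming your
constant names; is_utf8 stands for whatever knowledge the application has
about its input:

    #include <speak_lib.h>

    /* Choose the encoding flag explicitly rather than relying on
       espeakCHARS_AUTO: an 8-bit ISO encoding and UTF-8 only agree on
       the 7-bit ASCII range, so autodetection cannot tell the various
       8-bit encodings apart. */
    static unsigned int encoding_flag(int is_utf8)
    {
        return is_utf8 ? espeakCHARS_UTF8 : espeakCHARS_8BIT;
    }
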
* "espeak_Voice: the "languages" field consists of a list of (UTF8)
language names for which this voice may be used. [format is described]"
This is quite different from the TTS API specs. It seems your motivation
here was that a voice can be for more than one language/dialect pair.
If this makes sense, I'm all for changing the specs too. I however do to
see how a voice could speak two dialects. Do you have some example?
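For reference, this is how I read the described format; the sketch below
assumes each entry is one priority byte followed by a NUL-terminated
language name, with the whole list terminated by a zero byte. Please
correct me if your draft says otherwise:

    #include <stdio.h>
    #include <string.h>
    #include <speak_lib.h>

    /* Walk the "languages" field of an espeak_VOICE, assuming the
       priority-byte + NUL-terminated-name layout described above. */
    static void list_languages(const espeak_VOICE *voice)
    {
        const char *p = voice->languages;
        while (p != NULL && *p != 0) {
            int priority = (unsigned char)*p++;   /* lower = more preferred */
            printf("%s (priority %d)\n", p, priority);
            p += strlen(p) + 1;                   /* skip the name and its NUL */
        }
    }
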
Apart from these minor details, I do not see any problems and I think it
is very good support for the TTS API. Please excuse that we are a
little bit behind with the drivers now, so it will take us a little bit
more time until we can produce an implementation and really test it.
With regards,
Hynek Hanke
_______________________________________________
Speechd mailing list
address@hidden
http://lists.freebsoft.org/mailman/listinfo/speechd