[Speechd] KTTS and Sentence Boundary Detection
From: Gary Cramblitt
Subject: [Speechd] KTTS and Sentence Boundary Detection
Date: Sun Aug 13 12:52:48 2006
As I mentioned in my other email, I'd like to comment on the following and
point out how KTTS addressed these issues. The discussion is largely
academic, since the intention is to remove Sentence Boundary Detection (SBD)
from KTTS, as long as Speech Dispatcher provides the needed functionality.
It may still be of interest to you, however.
On Wednesday 04 May 2005 05:18 pm, Hynek Hanke wrote:
> There is a number of issues we found with cutting text into sentences in
> Speech Dispatcher and sending just sentences to the output modules and
> synthesizers.
>
> 1) The TTS has no possibility to do more advanced syntactic analysis. It is
> only allowed to operate on one sentence.
In practice, none of the current synths alter the speaking attributes of one
sentence based on surrounding sentences. It is true that syntactic analysis
helps Festival decide where sentence boundaries are. If we assume that KTTS
puts sentence boundaries in the same places as Festival (and for the most
part it does), then the end result is the same.
>
> 2) We need to handle language dependent issues in a project (Speech
> Dispatcher, KTTSD) that should be language independent.
KTTS addresses differences in SBD for different languages by using a modular
plugin architecture. In theory, languages that use different punctuation and
so forth for sentences can be implemented in a separate SBD filter. In
practice, so far, only the Polish language has needed a separate SBD filter,
mostly because Polish Festival incorrectly "speaks" punctuation characters.
("This is a sentence." is spoken as "This is a sentence period" -- in Polish
of course.) Since SBD is implemented using regular expressions, the Polish
SBD filter was a simple matter of changing the regular expression to remove
the sentence punctuation while simultaneously breaking the input into
sentences.
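
To illustrate the regular-expression approach, here is a minimal Python
sketch. It is not the KTTS code: the patterns and the names DEFAULT_SBD,
STRIP_SBD, FILTERS and sbd are all made up for this example. It shows a
default splitter, plus a variant that removes the sentence punctuation while
splitting, selected per language through a small filter table.

```python
import re

# Assumed default pattern: a boundary is whitespace that follows ., ! or ?
DEFAULT_SBD = re.compile(r"(?<=[.!?])\s+")

# Assumed variant for a synth that reads punctuation aloud: the pattern
# consumes the terminator itself, so it disappears while splitting.
STRIP_SBD = re.compile(r"[.!?]+(?:\s+|$)")


def split_sentences(text):
    """Split into sentences, keeping the terminating punctuation."""
    return [s for s in DEFAULT_SBD.split(text.strip()) if s]


def split_and_strip(text):
    """Split into sentences and drop the terminating punctuation."""
    return [s for s in STRIP_SBD.split(text.strip()) if s]


# Hypothetical per-language filter table; anything not listed
# falls back to the default splitter.
FILTERS = {"pl": split_and_strip}


def sbd(text, lang):
    """Run the SBD filter registered for lang, or the default one."""
    return FILTERS.get(lang, split_sentences)(text)
```

A real filter would also need to cope with abbreviations, decimal numbers
and the like, which this sketch ignores.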
>
> 3) How to cut SSML or other markup into sentences?
In KTTS, we addressed this with an eye towards the ability to advance and
rewind by sentence. To achieve this, the SSML input is parsed using an XML
parser. <p> and <s> tags are obviously interpreted as sentence boundaries.
Within text and CDATA Section nodes, the same regular expression as is used
for plain text is used to decide where sentence boundaries are. Once the
position of sentence boundaries is determined, each sentence is output with a
complete set of SSML tags. In this way, each sentence gets a complete SSML
context, so that when rewinding and advancing, no information is lost. For
example, the following input SSML
<speak lang="en">
This is a sentence. So is this. This <prosody rate="fast">is spoken
fast</prosody>. <p>This is the fourth sentence.</p>
</speak>
becomes
<speak lang="en"><voice gender="neutral" age="40"><prosody pitch="medium"
range="medium" rate="medium"
volume="medium"> This is a sentence.</prosody></voice></speak>

<speak lang="en"><voice gender="neutral" age="40"><prosody pitch="medium"
range="medium" rate="medium"
volume="medium">So is this.</prosody></voice></speak>

<speak lang="en"><voice gender="neutral" age="40"><prosody pitch="medium"
range="medium" rate="medium" volume="medium">This </prosody></voice><voice
gender="neutral" age="40"><prosody pitch="medium" range="medium" rate="fast"
volume="medium">is spoken fast</prosody></voice><voice gender="neutral"
age="40"><prosody pitch="medium" range="medium" rate="medium"
volume="medium">.</prosody></voice></speak>

<speak lang="en"><voice gender="neutral" age="40"><prosody pitch="medium"
range="medium" rate="medium"
volume="medium">This is the fourth sentence.</prosody></voice></speak>
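
The splitting step described above can be sketched in Python roughly as
follows. This is a simplified illustration under stated assumptions, not the
KTTS implementation: the function name split_ssml and the boundary pattern
are invented here, namespaces are ignored, and each sentence is wrapped only
in the tags that actually enclosed it rather than in the fully normalized
voice/prosody set shown in the example output.

```python
import re
import xml.etree.ElementTree as ET

BOUNDARY_TAGS = {"p", "s"}                 # tags that always end a sentence
SPLIT = re.compile(r"(?<=[.!?])\s+")       # assumed plain-text boundary rule


def split_ssml(ssml):
    """Return one standalone SSML document (as a string) per sentence."""
    root = ET.fromstring(ssml)
    sentences = []   # finished sentences: lists of (context, text) fragments
    current = []     # fragments of the sentence being assembled

    def flush():
        if any(text.strip() for _, text in current):
            sentences.append(list(current))
        current.clear()

    def add_text(text, context):
        if not text:
            return
        pieces = SPLIT.split(text)
        for i, piece in enumerate(pieces):
            if piece:
                current.append((tuple(context), piece))
            if i < len(pieces) - 1:        # a boundary followed this piece
                flush()

    def walk(elem, context):
        boundary = elem.tag in BOUNDARY_TAGS
        if boundary:
            flush()
        if elem is root or boundary:
            inner = context                # root and <p>/<s> add no context
        else:
            inner = context + [(elem.tag, dict(elem.attrib))]
        add_text(elem.text, inner)
        for child in elem:
            walk(child, inner)
            add_text(child.tail, inner)    # tail text belongs to the parent
        if boundary:
            flush()

    walk(root, [])
    flush()

    documents = []
    for fragments in sentences:
        speak = ET.Element(root.tag, dict(root.attrib))
        for context, text in fragments:
            if context:
                node = speak
                for tag, attrib in context:    # rebuild the enclosing tags
                    node = ET.SubElement(node, tag, attrib)
                node.text = text
            elif len(speak):
                speak[-1].tail = (speak[-1].tail or "") + text
            else:
                speak.text = (speak.text or "") + text
        documents.append(ET.tostring(speak, encoding="unicode"))
    return documents
```

Because every returned string is a complete document with its own root
element, a player can rewind or advance to any sentence without needing the
markup that preceded it.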
In the case of Festival, SSML is then converted into SABLE tags using an XSLT
conversion.
This works out pretty well. Some of the current limitations are 1) we don't
handle SSML "relative" attributes (<prosody rate="+10">), 2) Festival seems
to have trouble with voice attributes, so we strip them out when converting
to SABLE, and 3) we don't handle the <say-as> tag, which isn't fully defined
in the SSML spec anyway.
We handle HTML by first making sure it is valid XHTML and then using XSLT to
convert it to SSML.
>
> 4) How to cut data that are not an ordinary text (program source code, ...)
Once again, the plugin filter architecture of KTTS can address this. We
currently handle C/C++ text (we assume each EOL is a sentence boundary). In
practice, speaking code has lots of other problems because of punctuation and
"words" that are not in the lexicon, which tend to confuse Festival quite a
bit, so there is a lot of work that still needs to be done on this.
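
The end-of-line rule is simple enough to show in a few lines of Python. This
is only a sketch; the function name is invented, and skipping blank lines is
my assumption rather than something the KTTS filter is known to do.

```python
def split_code_lines(source):
    """Treat every end-of-line as a sentence boundary.

    Blank lines are skipped (an assumption of this sketch) so the synth is
    not handed empty sentences.
    """
    return [line.strip() for line in source.splitlines() if line.strip()]
```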
>
> 5) It makes the output module much more complicated if good performance is
> of concern. It's necessary to already have sent for synthesis the next
> sentence before the previous one is spoken in the speakers so that the TTS
> doesn't sit idle. Sentences of different length may cause unnecessary
> delays.
Since KTTS is designed for each synth to return a wav file, the synths can be
kept busy working 3 or 4 sentences ahead while KTTS simultaneously outputs a
sentence to the audio device. Hence, sentences of different lengths are not
a problem. The first sentence begins speaking quickly because only a single
sentence is sent to the synth for parsing and synthesis. In practice, the
synths sit idle most of the time because KTTS builds up a queue of 3 or 4
sentences that have already been synthesized while the first sentence is
still being heard on the audio device. As you know, synthesizing a sentence
takes much less time than playing it back.
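
The look-ahead scheme can be pictured as a bounded producer/consumer queue.
The Python sketch below is an illustration only: synthesize and play are
stand-ins for a synth plugin and the audio device, and LOOKAHEAD, speak_all
and the thread layout are assumptions, not how KTTS is actually structured.

```python
import queue
import threading

LOOKAHEAD = 4  # assumed queue depth; the text mentions 3 or 4 sentences


def synthesize(sentence):
    """Stand-in for a synth plugin that returns wav data for one sentence."""
    return ("wav:" + sentence).encode("utf-8")


def play(audio, played):
    """Stand-in for the audio device; just records what was played."""
    played.append(audio.decode("utf-8"))


def speak_all(sentences):
    """Synthesize ahead of playback, at most LOOKAHEAD clips in advance."""
    ready = queue.Queue(maxsize=LOOKAHEAD)
    played = []

    def producer():
        for sentence in sentences:
            # put() blocks once LOOKAHEAD clips are waiting, so the synth
            # idles instead of racing arbitrarily far ahead of the audio.
            ready.put(synthesize(sentence))
        ready.put(None)  # sentinel: nothing more to play

    worker = threading.Thread(target=producer)
    worker.start()
    while True:
        audio = ready.get()
        if audio is None:
            break
        play(audio, played)
    worker.join()
    return played
```

Since only the first sentence has to be synthesized before playback can
begin, the start-up delay does not depend on how many sentences follow.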
That said, SBD does add a small delay before the first sentence is spoken, and
the larger the input, the longer the delay. SSML input adds additional
delay. The ideal synth would not perform SBD on the entire input before it
begins speaking the first sentence, but I don't believe Festival is so
optimized. Does Speech Dispatcher do something to solve this problem? If
the synth were so optimized, I imagine it would be problematic to provide
advance/rewind capability within the synth.
One of the things I'm looking forward to when we integrate KTTS with Speech
Dispatcher is improved performance, since I know you've worked hard to
address that.
Thanks for listening.
--
Gary Cramblitt (aka PhantomsDad)
KDE Text-to-Speech Maintainer
http://accessibility.kde.org/developer/kttsd/index.php