Re: [gnuspeech-contact] voice control in articulatory vs. HMM-based synthesis
Tue, 22 Nov 2005 20:03:56 -0700 (MST)
Remember that a good part of "naturalness" in synthetic speech by rules
depends on: (a) correct segmental-level pronunciation; (b) appropriate
rhythm (this actually operates at the segmental as well as the
suprasegmental level along with the choice of vowel quality and quantity
-- e.g. refuse [choose to reject], refuse [garbage], refuse [activate the
bomb again]); and (c) appropriate intonation (which also ties in with the
Regardless of the basis for synthesis, good sources of information for all
of these are essential to achieve "naturalness". Some kind of dictionary
relating the synthesis units to the target text -- syllable, word, phrase
... -- will take care of much of the segmental level difficulty (but see
what follows). You also need excellent models on which to base the
construction of rhythm and intonation. But *all three* really require not
just grammatical and semantic analysis, but also understanding and even
situational analysis. In effect, the overall problem of natural speech
synthesis of arbitrary utterances involves solving the AI-hard problem of
understanding and relating to the real world. Thomas & Carroll in various
papers have distinguished the kind of performance involved in natural
English dialogue (which, fundamentally, is what is involved here) as
"Design-Interpret" as opposed to the more simple-minded "Encode-Decode"
approach. Although they were specifically talking about interactive
dialogue and the design of language-based interaction, rather than
synthetic speech, the same ideas apply. You have to understand what you
are saying and even what you are trying to achieve in order to speak
effectively -- i.e. naturally.
Of course, someone can put in markers to help a speech synthesiser, and
(depending on how elaborate you are prepared to make the annotations) the
synthesiser has a much simpler job and a better chance of sounding natural.
It is necessary to be fairly specific about what aspects of "naturalness"
you are trying to emulate with your synthesiser, and what sources of
helpful information are available -- including models of various aspects
of speech such as rhythm.
Not trying to be difficult, or clever, just trying to focus on issues.
On what basis are HMM, concatenative, articulatory, or whatever synthesis
to be compared, and what criteria are to be used to evaluate them? Very
often, subjective testing with appropriately trained listeners using
well-designed experimental protocols is the only reasonably sound basis
for such comparisons.
Just some thoughts.
David Hill, Prof. Emeritus, Computer Science | Imagination is more |
U. Calgary, Calgary, AB, Canada T2N 1N4 | important than knowledge |
address@hidden OR address@hidden | (Albert Einstein) |
http://www.cpsc.ucalgary.ca/~hill | Kill your television |
On Tue, 22 Nov 2005, Eric Zoerner wrote:
I have a theoretical question for the list regarding comparison of speech
synthesis techniques and their capabilities for voice control/modification at
runtime while maintaining naturalness.
It is pretty clear to me that articulatory speech synthesis potentially has a
great deal of flexibility when it comes to dynamically altering the voice,
e.g. for natural intonation, emotional speech, singing, changing dialect or
language, or changing the identity/gender/age of the speaker, etc.
I am interested in comparing these capabilities to those in HMM-based
synthesis. Can anyone comment on or point me to information regarding the
extent to which HMM-based synthesis (e.g. using the HTS toolkit) has capabilities
in this regard?
Would it be fair to say that while there may be more control over the voice
during the training phase in HMM-based synthesis as compared to
unit-concatenative approaches, the feasibility of controlling the voice at
runtime in HMM-based synthesis is about as limited as that with
unit-concatenation (i.e. without losing its perceived "naturalness")?