Re: [gnuspeech-contact] voice control in articulatory vs. HMM-based synt


From:	D.R. Hill
Subject:	Re: [gnuspeech-contact] voice control in articulatory vs. HMM-based synthesis
Date:	Tue, 22 Nov 2005 20:03:56 -0700 (MST)

Remember that a good part of "naturalness" in synthetic speech by rules depends on: (a) correct segmental-level pronunciation; (b) appropriate rhythm (this actually operates at the segmental as well as the suprasegmental level along with the choice of vowel quality and quantity -- e.g. refuse [choose to reject], refuse [garbage], refuse [activate the bomb again]); and (c) appropriate intonation (which also ties in with the rhythm).

Regardless of the basis for synthesis, good sources of information for all of these are essential to achieve "naturalness". Some kind of dictionary relating the synthesis units to the target text -- syllable, word, phrase ... -- will take care of much of the segmental-level difficulty (but see what follows). You also need excellent models on which to base the construction of rhythm and intonation. But *all three* really require not just grammatical and semantic analysis, but also understanding and even situational analysis. In effect, the overall problem of natural speech synthesis of arbitrary utterances involves solving the AI-hard problem of understanding and relating to the real world. Thomas & Carroll in various papers have distinguished the kind of performance involved in natural English dialogue (which, fundamentally, is what is involved here) as "Design-Interpret" as opposed to the more simple-minded "Encode-Decode" approach. Although they were specifically talking about interactive dialogue and the design of language-based interaction, rather than synthetic speech, the same ideas apply. You have to understand what you are saying and even what you are trying to achieve in order to speak effectively -- i.e. naturally.

Of course, someone can put in markers to help a speech synthesiser, and (depending on how elaborate you are prepared to make the annotations) the synthesiser has a much simpler job and a better chance of sounding natural.

It is necessary to be fairly specific about what aspects of "naturalness" you are trying to emulate with your synthesiser, and what sources of helpful information are available -- including models of various aspects of speech such as rhythm.


Not trying to be difficult, or clever, just trying to focus on issues.

On what basis are HMM, concatenative, articulatory, or whatever synthesis to be compared, and what criteria are to be used to evaluate them? Very often, subjective testing with appropriately trained listeners using well-designed experimental protocols is the only reasonably sound basis for such comparisons.


Just some thoughts.

david
---
David Hill, Prof. Emeritus, Computer Science  |  Imagination is more       |
U. Calgary, Calgary, AB, Canada T2N 1N4       |  important than knowledge  |
address@hidden OR address@hidden   |         (Albert Einstein)  |
http://www.cpsc.ucalgary.ca/~hill             |  Kill your television      |

On Tue, 22 Nov 2005, Eric Zoerner wrote:

I have a theoretical question for the list regarding comparison of speech synthesis techniques and their capabilities for voice control/modification at runtime while maintaining naturalness.
It is pretty clear to me that articulatory speech synthesis potentially has a great deal of flexibility when it comes to dynamically altering the voice, e.g. for natural intonation, emotional speech, singing, changing dialect or language, or changing the identity/gender/age of the speaker, etc.
I am interested in comparing these capabilities to those in HMM-based synthesis. Can anyone comment on or point me to information regarding the extent that HMM-based synthesis (e.g. using the HTS toolkit) has capabilities in this regard?
Would it be fair to say that while there may be more control over the voice during the training phase in HMM-based synthesis as compared to unit-concatenative approaches, the feasibility of controlling the voice at runtime in HMM-based synthesis is about as limited as that with unit-concatenation (i.e. without losing its perceived "naturalness")?
