[gnuspeech-contact] Status of GnuSpeech
From: D.R. Hill
Subject: [gnuspeech-contact] Status of GnuSpeech
Date: Tue, 28 Sep 2004 17:13:38 -0600 (MDT)
Hi Lee,
Thanks for your query.
Articulatory synthesis is a method of producing synthetic speech from a
waveguide (or "tube") model of the human vocal & nasal tracts.
Conventional synthesis either takes small segments of real speech (usually
represented in Linear Predictive Coded form, with the pitch effect
removed) and concatenates them, before re-imposing some pitch (intonation)
contour; or sends parameters to a set of bandpass filters, to vary their
frequencies, and feeds a voicing waveform and/or suitable noise through
them (either in parallel or in series), with further filtering to account
for things like the radiation impedance of the lips. DECTalk, as used by
Stephen Hawking, is an example of the latter; the method dates back to
the 1950s and is called "formant synthesis". The concatenation method is
called "concatenative synthesis". Both methods have their problems and
advantages.
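To make the "series" formant arrangement above concrete, here is a minimal sketch of formant synthesis: an impulse-train voicing source passed through a cascade of two-pole resonators, one per formant. This is only an illustration, not DECTalk or the gnuspeech code; the function names are mine, and the formant frequencies and bandwidths are rough textbook-style values for an /a/-like vowel.

```python
import math

def resonator(x, f, bw, fs):
    """Two-pole resonant (formant) filter:
    y[n] = A*x[n] + B*y[n-1] + C*y[n-2], with the pole radius set
    by the bandwidth and the pole angle by the centre frequency."""
    r = math.exp(-math.pi * bw / fs)
    theta = 2 * math.pi * f / fs
    B = 2 * r * math.cos(theta)
    C = -r * r
    A = 1 - B - C          # normalise for unity gain at DC
    y1 = y2 = 0.0
    out = []
    for s in x:
        y = A * s + B * y1 + C * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(formants, f0=110.0, fs=16000, dur=0.3):
    """Impulse-train glottal source fed through formant resonators
    in series, then peak-normalised."""
    n = int(fs * dur)
    period = int(fs / f0)
    source = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    signal = source
    for f, bw in formants:
        signal = resonator(signal, f, bw, fs)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]

# Illustrative /a/-like vowel: (frequency Hz, bandwidth Hz) per formant
samples = synthesize_vowel([(700, 130), (1220, 70), (2600, 160)])
```

Varying the pitch contour of the source, rather than the filters, is what lets formant synthesis impose intonation independently of the segmental content.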
You will find papers relevant to understanding the details and advantages
of "articulatory synthesis" based on the tube model on my university web
site amongst the papers that are available there:
http://www.cpsc.ucalgary.ca/~hill
You can also find a wealth of excellent information by going to Julius
Smith's website at Stanford University.
http://ccrma.stanford.edu/~jos
and link elsewhere from there, if necessary.
We call the "tube model" an "articulatory synthesiser" because it is
controlled using what is called the "Distinctive Region Model" due to Rene
Carre at the ENST in Paris, who built it on the basis of work in 1973 at
the Speech Technology Lab, KTH, Stockholm by Gunnar Fant and his
colleagues.
The essence of the control method is to vary the diameter of each of eight
"distinctive" regions of the tube, as happens in the real vocal tract. The
regions are defined by the "formant sensitivity analysis" carried out by
Fant, which showed that a constriction in each region has a
specific, independent effect on the values of the three "formants" or
resonant peaks in the speech spectrum that determine the identity of the
speech sounds. Carre showed that these regions also correspond fairly
closely to the distribution of the articulators in the real human vocal
tract, so that our articulators are appropriately positioned to effect
just the constrictions required for the DRM model. There is provision in
our scheme for mapping the DRM regions onto specific articulatory gestures
but, so far, we have simply used the DRM regions directly, together with
information on rhythm and intonation derived from research at the U of C
and other places. Those who have heard the speech comment that it is the
best they have heard, though in fact I think it still needs a lot of
improvement -- we only have a first cut at the "posture" (phone
articulation) data so far. The same listeners tell us it is much less
tiring to listen to than conventional synthetic speech (by which they mean
formant synthesis -- concatenative synthesis tends to be confined to
rather short utterances for things like telephone intercepts, in
practice).
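For readers who want to see how a waveguide tube can be simulated, here is a minimal Kelly-Lochbaum-style sketch: the tract is approximated as a chain of cylindrical sections, and each change of cross-sectional area between adjacent sections gives a scattering junction with reflection coefficient k = (A_i - A_{i+1}) / (A_i + A_{i+1}). This is emphatically not the gnuspeech tube model itself; the area values, boundary reflection coefficients, and function names below are illustrative assumptions.

```python
def tube_filter(areas, source, r_glottis=0.75, r_lips=-0.85):
    """Propagate travelling pressure waves through a piecewise-
    cylindrical tube (one sample of delay per section per direction)
    and return the signal radiated at the lips."""
    n = len(areas)
    # Reflection coefficients at the internal junctions
    ks = [(a1 - a2) / (a1 + a2) for a1, a2 in zip(areas, areas[1:])]
    f = [0.0] * n  # right-going wave at the right end of each section
    b = [0.0] * n  # left-going wave at the left end of each section
    out = []
    for s in source:
        jf = [0.0] * n
        jb = [0.0] * n
        # Scattering at each junction: partial transmission, partial
        # reflection, determined by the area ratio
        for i, k in enumerate(ks):
            jf[i + 1] = (1 + k) * f[i] - k * b[i + 1]
            jb[i] = k * f[i] + (1 - k) * b[i + 1]
        # Glottis end: inject the source plus a partial reflection
        jf[0] = s + r_glottis * b[0]
        # Lip end: part of the wave radiates, part reflects back
        jb[n - 1] = r_lips * f[n - 1]
        out.append((1 + r_lips) * f[n - 1])
        f, b = jf, jb  # waves advance one section per sample
    return out

# Eight sections, echoing the eight DRM regions (areas are made up):
areas = [2.6, 8.0, 10.5, 10.5, 8.0, 5.0, 0.65, 2.6]
impulse = [1.0] + [0.0] * 199
response = tube_filter(areas, impulse)
```

Changing the `areas` list over time is the analogue of moving the articulators: the junction coefficients, and hence the formants, follow the area function directly.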
The "articulatory" synthesis allows a wide variety of different voice
types to be used, simply by varying things like the tube length, pitch,
breathiness (especially for female speech) and so on, and the rhythm and
intonation models are based on a generalised abstraction of real speech.
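The tube-length effect is easy to see from the resonances of a uniform tube closed at the glottis and open at the lips, F_k = (2k - 1) * c / (4 * L): shortening L scales every formant upward, which is much of what distinguishes the child voice from the adult ones. A quick check (c = 35000 cm/s is an assumed speed of sound, and the lengths are illustrative):

```python
def tube_formants(length_cm, c=35000.0, n=3):
    """First n resonances of a uniform tube, closed at one end and
    open at the other: F_k = (2k - 1) * c / (4 * L)."""
    return [(2 * k - 1) * c / (4 * length_cm) for k in range(1, n + 1)]

# A 17.5 cm (adult-male-length) tract gives the familiar
# neutral-vowel formants:
print(tube_formants(17.5))   # [500.0, 1500.0, 2500.0] Hz
# A shorter, child-length tract shifts every formant upward:
print(tube_formants(12.5))   # [700.0, 2100.0, 3500.0] Hz
```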
I hope this meets your initial needs for information. I attach a .snd file
that provides a comparison of the word "hello" spoken by male, female and
child voices emanating from the tube. If your system doesn't like .snd
files, just change the extension to .au -- they are the same.
I should add that we have a complete database and system capable of
producing continuous speech running under NeXTSTEP 3.x, and this is what
is available under the GPL (check the CVS repository) and what is being
ported to other systems, particularly GnuStep and OS X.
Unfortunately, GnuStep is not widely available and somewhat immature
still, so progress on that is slow -- the various system components
include some that are GUI-intensive. I am considering building from
scratch for Linux, without using GnuStep, but that is a *major*
programming effort.
Thank you for your interest.
All good wishes.
david
-----
David R. Hill, Computer Science, U. Calgary | Imagination is more
Calgary, AB, Canada T2N 1N4 Ph: 604-947-9362 | important than knowledge.
address@hidden OR address@hidden| (Albert Einstein)
http://www.cpsc.ucalgary.ca/~hill | Kill your television!
----
Lee Butterman wrote:
From address@hidden Tue Sep 28 07:59:15 2004
Date: Tue, 28 Sep 2004 09:54:22 -0400
From: Lee Butterman <address@hidden>
To: address@hidden
Subject: status of gnuspeech
Hi, I was wondering about two things. First, I've never heard articulatory
synthesis before, so I was wondering if you had any pre-synthesized examples
just to demonstrate how it sounds. Secondly, what's the status of gnuspeech?
Will there ever be, say, an MBROLA-like interface, where you've got some
detachable module that takes phonemes (along with tables/models of their
positions in the mouth?) and then synthesizes speech?
Thanks so much,
Lee
helloComparison.snd
Description: Basic audio