[gnuspeech-contact] Status of GnuSpeech
From: D.R. Hill
Subject: [gnuspeech-contact] Status of GnuSpeech
Date: Tue, 28 Sep 2004 17:13:38 -0600 (MDT)
Hi Lee,
Thanks for your query.
Articulatory synthesis is a method of producing synthetic speech from a
waveguide (or "tube") model of the human vocal & nasal tracts.
Conventional synthesis either takes small segments of real speech (usually
represented in Linear Predictive Coded form, with the pitch effect
removed) and concatenates them, before re-imposing some pitch (intonation)
contour; or sends parameters to a set of bandpass filters, to vary their
frequencies, and feeds a voicing waveform and/or suitable noise through
them (either in parallel or in series), with further filtering to account
for things like the radiation impedance of the lips. DECTalk, as used by
Stephen Hawking, is an example of the latter; the method dates back to
the 1950s and is called "formant synthesis". The concatenation method is
called "concatenative synthesis". Both methods have their problems and
advantages.
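To make the "series" formant arrangement above concrete, here is a minimal sketch of formant synthesis: an impulse-train voicing source passed through a cascade of two-pole resonators, one per formant. This is only an illustration, not DECTalk or the gnuspeech code; the function names are mine, and the formant frequencies and bandwidths are rough textbook-style values for an /a/-like vowel.

```python
import math

def resonator(x, f, bw, fs):
    """Two-pole resonant (formant) filter:
    y[n] = A*x[n] + B*y[n-1] + C*y[n-2], with the pole radius set
    by the bandwidth and the pole angle by the centre frequency."""
    r = math.exp(-math.pi * bw / fs)
    theta = 2 * math.pi * f / fs
    B = 2 * r * math.cos(theta)
    C = -r * r
    A = 1 - B - C          # normalise for unity gain at DC
    y1 = y2 = 0.0
    out = []
    for s in x:
        y = A * s + B * y1 + C * y2
        out.append(y)
        y2, y1 = y1, y
    return out

def synthesize_vowel(formants, f0=110.0, fs=16000, dur=0.3):
    """Impulse-train glottal source fed through formant resonators
    in series, then peak-normalised."""
    n = int(fs * dur)
    period = int(fs / f0)
    source = [1.0 if i % period == 0 else 0.0 for i in range(n)]
    signal = source
    for f, bw in formants:
        signal = resonator(signal, f, bw, fs)
    peak = max(abs(s) for s in signal)
    return [s / peak for s in signal]

# Illustrative /a/-like vowel: (frequency Hz, bandwidth Hz) per formant
samples = synthesize_vowel([(700, 130), (1220, 70), (2600, 160)])
```

Varying the pitch contour of the source, rather than the filters, is what lets formant synthesis impose intonation independently of the segmental content.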
You will find papers relevant to understanding the details and advantages
of "articulatory synthesis" based on the tube model on my university web
site amongst the papers that are available there:
http://www.cpsc.ucalgary.ca/~hill
You can also find a wealth of excellent information by going to Julius
Smith's website at Stanford University.
http://ccrma.stanford.edu/~jos
and link elsewhere from there, if necessary.
We call the "tube model" an "articulatory synthesiser" because it is
controlled using what is called the "Distinctive Region Model" due to Rene
Carre at the ENST in Paris, who built it on the basis of work in 1973 at
the Speech Technology Lab, KTH, Stockholm by Gunnar Fant and his
colleagues.
The essence of the control method is to vary the diameter of each of eight
"distinctive" regions of the tube, as happens in the real vocal tract. The
regions are defined by the "formant sensitivity analysis" carried out by
Fant, which showed that a constriction in each region has a
specific, independent effect on the values of the three "formants" or
resonant peaks in the speech spectrum that determine the identity of the
speech sounds. Carre showed that these regions also correspond fairly
closely to the distribution of the articulators in the real human vocal
tract, so that our articulators are appropriately positioned to effect
just the constrictions required for the DRM model. There is provision in
our scheme for mapping the DRM regions onto specific articulatory gestures
but, so far, we have simply used the DRM regions directly, together with
information on rhythm and intonation derived from research at the U of C
and other places. Those who have heard the speech comment that it is the
best they have heard, though in fact I think it still needs a lot of
improvement -- we only have a first cut at the "posture" (phone
articulation) data so far. The same listeners tell us it is much less
tiring to listen to than conventional synthetic speech (by which they mean
formant synthesis -- concatenative synthesis tends to be confined to
rather short utterances for things like telephone intercepts, in
practice).
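For readers who want to see how a waveguide tube can be simulated, here is a minimal Kelly-Lochbaum-style sketch: the tract is approximated as a chain of cylindrical sections, and each change of cross-sectional area between adjacent sections gives a scattering junction with reflection coefficient k = (A_i - A_{i+1}) / (A_i + A_{i+1}). This is emphatically not the gnuspeech tube model itself; the area values, boundary reflection coefficients, and function names below are illustrative assumptions.

```python
def tube_filter(areas, source, r_glottis=0.75, r_lips=-0.85):
    """Propagate travelling pressure waves through a piecewise-
    cylindrical tube (one sample of delay per section per direction)
    and return the signal radiated at the lips."""
    n = len(areas)
    # Reflection coefficients at the internal junctions
    ks = [(a1 - a2) / (a1 + a2) for a1, a2 in zip(areas, areas[1:])]
    f = [0.0] * n  # right-going wave at the right end of each section
    b = [0.0] * n  # left-going wave at the left end of each section
    out = []
    for s in source:
        jf = [0.0] * n
        jb = [0.0] * n
        # Scattering at each junction: partial transmission, partial
        # reflection, determined by the area ratio
        for i, k in enumerate(ks):
            jf[i + 1] = (1 + k) * f[i] - k * b[i + 1]
            jb[i] = k * f[i] + (1 - k) * b[i + 1]
        # Glottis end: inject the source plus a partial reflection
        jf[0] = s + r_glottis * b[0]
        # Lip end: part of the wave radiates, part reflects back
        jb[n - 1] = r_lips * f[n - 1]
        out.append((1 + r_lips) * f[n - 1])
        f, b = jf, jb  # waves advance one section per sample
    return out

# Eight sections, echoing the eight DRM regions (areas are made up):
areas = [2.6, 8.0, 10.5, 10.5, 8.0, 5.0, 0.65, 2.6]
impulse = [1.0] + [0.0] * 199
response = tube_filter(areas, impulse)
```

Changing the `areas` list over time is the analogue of moving the articulators: the junction coefficients, and hence the formants, follow the area function directly.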
The "articulatory" synthesis allows a wide variety of different voice
types to be used, simply by varying things like the tube length, pitch,
breathiness (especially for female speech) and so on, and the rhythm and
intonation models are based on a generalised abstraction of real speech.
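The tube-length effect is easy to see from the resonances of a uniform tube closed at the glottis and open at the lips, F_k = (2k - 1) * c / (4 * L): shortening L scales every formant upward, which is much of what distinguishes the child voice from the adult ones. A quick check (c = 35000 cm/s is an assumed speed of sound, and the lengths are illustrative):

```python
def tube_formants(length_cm, c=35000.0, n=3):
    """First n resonances of a uniform tube, closed at one end and
    open at the other: F_k = (2k - 1) * c / (4 * L)."""
    return [(2 * k - 1) * c / (4 * length_cm) for k in range(1, n + 1)]

# A 17.5 cm (adult-male-length) tract gives the familiar
# neutral-vowel formants:
print(tube_formants(17.5))   # [500.0, 1500.0, 2500.0] Hz
# A shorter, child-length tract shifts every formant upward:
print(tube_formants(12.5))   # [700.0, 2100.0, 3500.0] Hz
```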
I hope this meets your initial needs for information. I attach a .snd file
that provides a comparison of the word "hello" spoken by male, female and
child voices emanating from the tube. If your system doesn't like .snd
files, just change the extension to .au -- they are the same.
I should add that we have a complete database and system capable of
producing continuous speech running under NeXTSTEP 3.x, and this is what
is available under the GPL (check the CVS repository) and what is being
ported to other systems, particularly GnuStep and OS X.
Unfortunately, GnuStep is not widely available and somewhat immature
still, so progress on that is slow -- the various system components
include some that are GUI-intensive. I am considering building from
scratch for Linux, without using GnuStep, but that is a *major*
programming effort.
Thank you for your interest.
All good wishes.
david
-----
David R. Hill, Computer Science, U. Calgary | Imagination is more
Calgary, AB, Canada T2N 1N4 Ph: 604-947-9362 | important than knowledge.
address@hidden OR address@hidden| (Albert Einstein)
http://www.cpsc.ucalgary.ca/~hill | Kill your television!
----
Lee Butterman wrote:
From address@hidden Tue Sep 28 07:59:15 2004
Date: Tue, 28 Sep 2004 09:54:22 -0400
From: Lee Butterman <address@hidden>
To: address@hidden
Subject: status of gnuspeech
Hi, I was wondering about two things. First, I've never heard articulatory
synthesis before, so I was wondering if you had any pre-synthesized examples
just to demonstrate how it sounds. Secondly, what's the status of gnuspeech?
Will there ever be, say, an MBROLA-like interface, where you've got some
detachable module that takes phonemes (along with tables/models of their
positions in the mouth?) and then synthesizes speech?
Thanks so much,
Lee
helloComparison.snd
Description: Basic audio