|Subject:||Re: [gnuspeech-contact] new to gnuspeech|
|Date:||Sun, 2 Jan 2011 13:06:01 -0800|
Good to know you succeeded in getting everything up and working under Ubuntu. Well done! You have asked a couple of excellent questions.
(1) Many of the facilities provided by SSML are provided within the gnuspeech system (dealing with dates, dollar amounts, fractions, etc., and relying on explicit phonetic information on how to pronounce words: there are three levels of pronouncing dictionary -- user and application, which are customisable, and main -- that are consulted before falling back to letter-to-sound rules). It might be a good idea to modify these components to use SSML as an intermediate notation, but I'd have to think about the implications. At present, such data is created as a new language is developed, and the rhythm and intonation are added according to models based on accepted linguistic knowledge, with facilities for a user to modify things like information points to allow special emphasis, modification of meaning, etc.
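To illustrate the lookup order described above, here is a minimal sketch in Python. Everything in it -- the function names, the dictionary contents, and the toy letter-to-sound rule table -- is hypothetical, not the actual gnuspeech implementation; only the priority order (user, then application, then main, then letter-to-sound rules) comes from the description.

```python
# Hypothetical sketch of the three-level dictionary lookup; not the
# actual gnuspeech code or data.

def letter_to_sound(word):
    """Crude stand-in for the real letter-to-sound rules: map each
    letter to a placeholder phone."""
    rules = {"a": "ah", "e": "eh", "i": "ih", "o": "oh", "u": "uh"}
    return [rules.get(ch, ch) for ch in word.lower()]

def pronounce(word, user_dict, app_dict, main_dict):
    """Consult the three dictionary levels in priority order, falling
    back to letter-to-sound rules when the word is in none of them."""
    for dictionary in (user_dict, app_dict, main_dict):
        if word in dictionary:
            return dictionary[word]
    return letter_to_sound(word)

user = {}                                      # customisable by the user
app = {"gnuspeech": ["g", "n", "uh", "s", "p", "ee", "ch"]}  # per-application
main = {"hello": ["h", "uh", "l", "oh", "uh"]}               # main dictionary

print(pronounce("hello", user, app, main))   # found in the main dictionary
print(pronounce("zyx", user, app, main))     # unknown word: rules apply
```

The point of the layering is that a user or application entry overrides the main dictionary without modifying it, and the rules guarantee some pronunciation for any input.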
The whole idea is to have a model which minimises the effort required by the originator of the speech script and keeps within the norms for a speaker of the language (and, later, dialect), whilst still allowing considerable customisation. We don't have a grammatical analysis component yet, but one is obviously needed to disambiguate variants of "read" and such like, as well as for other purposes. Ultimately one needs to be able to *understand* what is to be said, in a deep sense, in order to speak it properly.
(2) There is no proper documentation on how we developed the databases for spoken English, though the various papers give insight into the models used for rhythm and intonation, and the derivation of the dictionaries is straightforward English phonetics. I really ought to write a paper explaining what we did. In brief, we spent a long time with a Kay Sonograf, Monet, and the tube model controlled by the "Synthesizer" app, with a decent sound system. Using these we created a database representing -- in terms of the articulatory parameters -- the notional postures underlying speech (even stop sounds have such postures), and a transition database to cover the context-dependent dynamics of speech production based on successive postures. This included figuring out the special events needed for plosives and the like. Various short utterances were synthesised and the results analysed on the Sonograf and compared with real speech. The necessary modifications to produce better matches were made, and the process iterated until the synthetic speech was a reasonably good emulation of the real speech. It was somewhat complicated by the fact that we were controlling the articulatory parameters but observing the effects in spectrographic space, and this is where the "Synthesizer" app was very useful in allowing us to coordinate the two representations.
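The posture-and-transition idea above can be sketched as follows. This is purely illustrative: the parameter names, the two postures, and the linear blend are all made up for the example (the real transitions in Monet are shaped profiles with special events, not straight lines); only the general scheme -- steady-state posture vectors joined by context-dependent transitions -- reflects the description.

```python
# Hypothetical illustration of postures and transitions; not Monet's
# actual data format or transition shapes.

def interpolate(p_from, p_to, steps):
    """Blend two articulatory posture vectors over `steps` intervals.
    A straight-line blend stands in for the real shaped transitions."""
    frames = []
    for i in range(steps + 1):
        t = i / steps
        frames.append([a + t * (b - a) for a, b in zip(p_from, p_to)])
    return frames

# Two made-up postures, each a vector of three articulatory parameters.
posture_ah = [0.8, 0.2, 0.5]
posture_n  = [0.1, 0.9, 0.3]

track = interpolate(posture_ah, posture_n, 4)
print(len(track))     # number of control frames produced
print(track[0])       # begins at the first posture
print(track[-1])      # ends (to within rounding) at the second
```

The tuning loop described in the email would then consist of synthesising such parameter tracks, inspecting the spectrographic result, and adjusting the posture and transition data until the output matched real speech.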
I say we spent some time doing this, but it was, in retrospect, amazingly quick -- two of us (Leonard Manzara and myself) working for about three months, not even full-time. During that time Monet was also being developed, so we were able to feed back information to Craig Schock, who was building Monet, to add needed facilities or remove bugs we found.
Perhaps some of my colleagues can add to, or correct what I have said.
Hope this brief response helps. Please keep the questions coming.
All good wishes.
On Jan 2, 2011, at 12:32 PM, Paul Tyson wrote: