Re: [Accessibility] Can you help write a free version of HTK?

From: Eric S. Johansson
Subject: Re: [Accessibility] Can you help write a free version of HTK?
Date: Mon, 12 Jul 2010 13:03:59 -0400
User-agent: Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US; rv: Gecko/20100608 Thunderbird/3.1

 On 7/12/2010 4:24 AM, Bill Cox wrote:
Hi, Eric.  You make some good points below.

And I'm glad we are both in agreement on the chunks of this problem. I'll pick away at responding to these questions because they all deserve careful answers.

On Fri, Jul 9, 2010 at 2:06 PM, Eric S. Johansson<address@hidden>  wrote:
The consensus of this August body was that all of the speech recognition
toolkits out there (Julius HEK, Sphinx) were all designed to keep graduate
students busy but not designed for use in the real world. I did take a look
at Simon, it looks like it's the closest of the bunch but I estimate it
somewhere between 5 to 8 years away from being useful (i.e. on parity with
I agree that these tools are grad-student research oriented.  I also
agree that a major rewrite may be required to compete with Naturally
Speaking.  However, I disagree that we have to be competitive with
Naturally Speaking to be productive programming by voice.  I used
Dragon Dictate in 1996, and later Naturally Speaking.  I found that
Dragon Dictate was just about the same as Naturally Speaking in terms
of productivity for writing code.  The main problem was that Naturally
speaking would make me pause between commands, like Dragon Dictate,
and only did continuous recognition for dictating text.

I apologize, but I'm probably going to go on at length about issues in programming by voice. I will harp on not damaging a person's throat, as well as on vocabulary and name structure. For that, I apologize.

My rationale for requiring the equivalent of NaturallySpeaking is twofold: first, programmatic control over the environment, and second, vocal load. NaturallySpeaking has far better control over its environment, through a toolkit like dragonfly, than DragonDictate ever did. I believe that's because of the environments (Windows versus DOS). Having used DragonDictate, I found it incredibly wearing on my throat. I had to go to a throat specialist for strain caused by speech recognition and then had to lay off the product for a few weeks while the muscles recovered. This is not an uncommon story. Almost every single person on the voice coder mailing list has had this kind of event happen. Discrete utterance speech recognition really does nasty things to your throat, as does non-natural sentence construction in commands with NaturallySpeaking. You find yourself wanting to say more, but you have to tighten up your throat and stop because the grammar doesn't match the way your brain wants to speak.

I know this is random but remember it. One of the things Susan Cragin did in OSSRI was she managed to get the dougout recognizer relicensed. I think it's under an MIT/X11 license. I'll CC her on this message (hey Susan, sign up and say hi to everybody)

In any case, there's one thing to remember about programming by voice: you need the same vocabulary size for coding as you do for writing. There are two reasons for that, which I will let you think about and address later.
Simon's approach can reduce the active vocabulary at any time to
perhaps a couple of hundred words or less, apparently enabling high
accuracy.  If we could have continuous command recognition, we could
easily beat my old productivity.  I read that there's a newer tool
called vocola, which enables continuous command recognition with
Naturally Speaking.
Vocola, Unimacro, and dragonfly all do continuous recognition macros. With raw natlink, you can even do it in Python.

> It's a hard problem, I agree.  Perhaps I'm somewhat wishful in my
thinking, but no matter how I do the calculation, I estimate we have
many times more potential volunteers than such a project will require.
  I think the main trick is finding advisors who do have the extensive
knowledge about how to make good recognition engines, and effectively
organising volunteers.  I think you probably would agree that
Naturally Speaking is not the only good recognition engine ever sold.
There should be experts around from failed or abandoned efforts who
could help as advisors.  Give me one or two of those guys, a dozen
motivated volunteer voice coders, and three years, and I think we
could get there.

I agree that there are many good recognition engines out there. They all fail in different ways but they can work. One of the huge challenges we will face is navigating the patent land mines of other people's technology. I think this is one of the reasons why Nuance has been on an acquisition binge. I believe they're trying to buy up as many patents as they can to protect themselves against any market intrusion. One way to defend ourselves would be through our own license acquisition. For example, look at what Cornell did with video codecs. The patent license terms said that if it was used in an open-source project, then there was no charge and no risk of being sued by Cornell for infringement. If we could build a similar patent licensing portfolio from other players, that might help us take advantage of pre-existing work versus reinventing the wheel to get around the patent.
I'm using Google's cloud-computing gmail service to write this e-mail.
  I typically review them with a closed-source binary TTS called voxin.
  I've been contacted by Skype twice today, and I've watched a couple
flash videos.  I think we are in violent agreement on this point.
People with disabilities need solutions, not a philosophy.

Wow. I have Google accounts but I rarely use them and certainly not for anything important. :-) But yes, people with disabilities do need solutions first. We need to, as Christian philosophy says, teach them how to fish; at the same time we, as Buddhist philosophy teaches us, need a right livelihood.
Let's look at where we are.  In the early 1990's a tiny company wrote
Dragon Dictate, using the signal processing hardware in the sound card
to make speech recognition on PC's useful for the first time.  Their
market was exclusively people with physical impairments.  I discovered
them in 1996, when I needed them to remain a programmer.  There may
have been some new code written by the community to get around the
crap we get from Nuance, but it seems that the tools they ship haven't
improved programming by voice significantly in well over a decade.
Instead, they focus on helping us write emails faster.  How nice.
Look at where the real innovation in this area is coming from.  Is it
from Nuance, or the user community?  For future innovation, where
should we look?

I remember when I was first injured and a friend set me up with a 486 with 16 MB of RAM in a lunchbox case that I would carry from customer site to customer site using a two-wheeled luggage cart. I thought it was so wonderful when I got my first laptop that weighed 10 pounds.

Yes, I don't think NaturallySpeaking has really improved since version 6. It's a little more accurate, more stable, and doesn't make Windows puke quite as often, but I think all they have done since version 6, maybe 7, is fix bugs. However, I will suggest that writing e-mails faster is not a bad thing. I write fiction for a hobby and if I can improve accuracy, I can write more, because editing sucks.

When you look at a piece of rough text and try to change it, you really see the lack of inventive or creative effort necessary to make editing easier. Because I don't use speech recognition enabled editors, I can't say something like "select a sentence containing "brilliance of her smile" and have that sentence placed into a dictation box for editing. And yes, I deliberately used an odd number of quote marks because, why do you need to" was on the end of the line in a command mode. Also, its insistence on using the Windows selection mechanism (drag with mouse) makes it difficult to select a small number of words if your hands are like mine. You really want something I can Emacs Mark in point so that you can use a tablet or even a mouse and say "leave market" and "end region". Yes, I left the previous sentence uncorrected just because it was too much work to drive the mouse.

I believe innovation comes from people like us. Back in the bad old days of Dragon Systems, disabled users would be brought in occasionally to experiment with different interfaces or talk about their experience with the product. I would make some radical changes if I had sufficient hands to write the UI. For example, I would make a dictation box with filters on both the input and output, so you could modify code to look like English text, thereby enabling familiar editing patterns in the dictation box. On output, I would translate the text back into code. I also want plug-ins on the dictation box to make it possible to edit other things.
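
To make the input/output filter idea concrete, here is a minimal sketch in Python. It is an illustration only, not how any existing dictation box works: the function names (`to_speakable`, `to_code`) are hypothetical, and the sketch assumes that only snake_case and camelCase identifiers need translating. The input filter turns identifiers into plain words and remembers the mapping; the output filter uses that mapping to turn the edited text back into code.

```python
import re

# Hypothetical filter pair for a dictation box. Only identifiers are handled.
IDENT = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")
WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def split_identifier(name):
    """Split a snake_case or camelCase identifier into lowercase words."""
    return [w.lower() for w in WORD.findall(name)]


def to_speakable(code):
    """Input filter: replace multi-word identifiers with spoken phrases.

    Returns the speakable text and the table needed to reverse it.
    """
    table = {}

    def repl(match):
        name = match.group(0)
        words = split_identifier(name)
        if len(words) < 2:
            return name  # single words are already easy to say
        phrase = " ".join(words)
        table[phrase] = name
        return phrase

    return IDENT.sub(repl, code), table


def to_code(text, table):
    """Output filter: turn known phrases back into their identifiers."""
    # Longest phrases first so a long phrase wins over a shorter one inside it.
    for phrase in sorted(table, key=len, reverse=True):
        text = re.sub(r"\b" + re.escape(phrase) + r"\b", table[phrase], text)
    return text


text, table = to_speakable("total_count = getUserName(user_id)")
print(text)
print(to_code(text + " + 1", table))
```

The table is what lets the round trip restore the original spelling and capitalization, so the person editing never has to say or spell the jumbled form.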

A great example of where this editor can help is in HTML e-mail. I need to generate and receive HTML e-mails when dealing with customers. Yeah, it sucks but it's reality. Thunderbird's editor is a stinking pile of bird poop when editing by hand and even worse by voice. Using a dictation box model as I described above, one could translate HTML or HTML fragments into something one could edit by voice. We could do this without needing to touch the application.

> I was there.  I also bought every
microphone that seemed promising at improving recognition rates.  By
the way, what do people feel is the best microphone now days?
There is no one best microphone. We do not have sufficient information to determine which microphone works best with a voice and a computer system with a sound card. You buy microphones until you find one that works best and then you stick with it religiously. I think I said elsewhere, VXI is the only one that works with my voice. As soon as circumstances permit, I'm going to try and get the current Bluetooth headset. The previous one was the most wonderful headset available but, unfortunately, the battery charging system, Bluetooth pairing, and I did not get along real well. Something was funky and I had to re-pair every time I charged, which was twice a day. Serious nuisance.
I do a ton of volunteer work for Vinux, which is Linux based on
Ubuntu, customised for the needs of the visually impaired.  People
often post emails saying, "Today I'm switching my main machine to
Vinux!"  I generally suggest that dual-booting, or having Vinux on a
virtual machine is the way to go.  Vinux is not as productive an
environment as either Windows with JAWs or Mac for the blind, at least
not yet.  However, we aim to be better than either.  To get there as
rapidly as possible, I would like volunteers to continue using what
works best for them.  Except Sina.  He should switch to 100% Vinux

That's really cool. Unfortunately, I'm not in a position to do a whole lot of volunteering. I need to take care of fundamentals first.

> I agree. you get flamed badly if you suggest people could be more productive
with proprietary tools.  Frankly, it's a bit scary discussing this on
a list.
Heh. The way I would manage that particular problem would be to develop self-contained components that can be GPL'ed to death, and others with more generous intentions could work on the bridge.
However, FOSS seems to be the only way that we can organise many
volunteers from around the globe to work together to write and improve
accessibility tools.  This isn't about ideology or politics or
freedom.  It's about people like us who are fed up with being second
class citizens, and tired of begging for access to new technology.
This is about programmers like us taking control over the future of
accessibility, because we're not going to get what we need otherwise.

In my snarky frame of mind, any collection of thoughts unified by a single purpose is an ideology. It's okay, because I think you hit the crip ideology on the head: handicap accessibility is too important to be owned. We should not put up with being second class citizens and we should own the means of production. Unfortunately, there is a difference between accessibility tools (speech recognition, text-to-speech, etc.) and the ability to use that accessibility tool with an application or system. I haven't quite figured out a shorthand yet, but something like accessibility tools versus accessibility availability is close. We need tools and we need access to other platforms that employers and governments use.

Why not do both in parallel?  There are so many of us, yet each of us
has unique gifts and skills.  Most of us should do as you suggest, and
work at the application level to improve accessibility.  I think some
of us should become SR and TTS experts and work on the next
generation.  Actually, if I didn't have to work so hard with glue and
tape to make Vinux work, SR and TTS is the sort of thing I'd probably
do well at.

You are far more optimistic than I am. My experience trying to get Emacs updated and dtach modified for crip use has not been successful at attracting help, even though they are far more useful on day one than a new speech recognizer.

As for a pool of experts, we can try mining the OSSRI board of directors for possible candidates. That's something we'll have to talk to Susan about.

When I do simple estimates, I just can't see how we don't have enough
potential volunteers to do this.  I just can't believe that 99.9% of
us with RSI injuries or visual impairments are the sort of people to
sit on our butts and do nothing.  From what I've seen, a fair
percentage of us happen to be decent programmers, and are the sort
that refuse to believe we have limitations.

I can, unfortunately. Because programming by voice has been so difficult, and because of the hostility of employers to anyone using something like speech recognition in an open office plan, many programmers, including myself, have left the field. Some migrated to completely different fields such as bicycle design, and others, like myself, have become self-employed as it's the only way to insulate oneself from corporate stupidity and the egregious workloads that injured us in the first place.

Perhaps I have a strong voice, but I spoke non-stop to my computer for
10 hours a day for over three years, and found that all I had to do
was sip water constantly.  I programmed by voice using macros,
eventually writing over 1,600 of them, mostly to control emacs.  I
think it was the best way to continue my career, without giving into
my typing limitations.

You are a very different person than I am. I was able to program in Python using Emacs with fewer than 50 macros. I could not remember 1600 of them. Something about RSI and its treatment messes with your memory. Most developers I've known would not be able to remember 1600 macros as well as the entire body of code they are working with. When I have written code, I have changed how I write classes as a way of accommodating my memory deficits. I also tried to write a small number of macros that were easy on the voice. As I said before, many developers also suffer vocal strain at a far lower level of effort than you have put yourself through. Memory shortcomings are something else we will need to accommodate. I think this is the driving force behind the methods I've developed for exploring a speech interface. I can't remember what I'm supposed to say next, so the system should prompt me and give me the ability to navigate within that prompt. A great example is change directory. It's a delightful intellectual exercise as well as a demonstration of the flexibility of a discoverable speech interface.
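
As a rough illustration of the prompt-and-navigate idea for change directory, here is a small Python sketch. It is an assumption about how such an interface could look, not a description of an existing tool: the directory tree is a made-up nested dict rather than a real file system, the helper names (`children`, `prompt`, `step`) are hypothetical, and only the first nine entries of a directory get a spoken number. The point is that the speaker only ever needs a tiny fixed vocabulary ("one" to "nine", "up", "done") and never has to remember or spell a directory name.

```python
NUMBER_WORDS = ["one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine"]


def children(tree, path):
    """Return the sorted subdirectory names under path (a list of names)."""
    node = tree
    for name in path:
        node = node[name]
    return sorted(node)


def prompt(tree, path):
    """Build the text the system would show or speak at this step."""
    lines = ["/" + "/".join(path)]
    for word, name in zip(NUMBER_WORDS, children(tree, path)):
        lines.append(f"  {word}: {name}")
    lines.append("  up, done")
    return "\n".join(lines)


def step(tree, path, utterance):
    """Apply one utterance and return the new path.

    Unknown words (and "done", which the caller handles) leave the path as is.
    """
    if utterance == "up":
        return path[:-1]
    options = dict(zip(NUMBER_WORDS, children(tree, path)))
    if utterance in options:
        return path + [options[utterance]]
    return path


# Made-up directory tree used only for this example.
tree = {"etc": {}, "home": {"eric": {"mail": {}, "src": {}}}}

path = []
for utterance in ["two", "one"]:
    path = step(tree, path, utterance)
print(prompt(tree, path))
```

Because every step shows what can be said next, the memory load stays with the system instead of the speaker.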

I am very interested in ideas like you suggest for enabling
applications without modifications, and doing anything that reduces
vocal and cognitive load.  We need new ideas, and I agree with your
point about not needing another useless type-by-voice project.  Part
of the problem is that many of these projects are funded by well
meaning institutions, but implemented by people interested in research
and their own careers.  I think the code we write would be far better
focused on our own needs.

Okay, this is a conversation for when I have far more time, possibly with one message per topic. It should pop up in the next week or so.

Sorry, but I have to ask: if you can dictate e-mail, why can't you
write code?

That's a real good question. I think the best answer is:

       If it's too difficult to do, it's not worth doing until it's simple.

This is the classic programmer hubris, laziness, and arrogance all rolled into one. It was actually my design philosophy even before I was injured. If it's hard to do, you're doing something the wrong way. You don't understand the problem. You don't even know you're an idiot. When you sit down and answer all of the questions that the back of your mind creates and manifests as "I'm not comfortable with this", only then should you start thinking about implementation.

Now, I did write Python byte code. I created a Web framework with a markup language that accommodates disabled users. It will work with speech recognition but it will also, theoretically, be accessible to blind, text-to-speech users. It's simple; the current implementation is a bit of a pig, but I just wanted to prove the concept of the usability of a markup language focused on disabled users.

It's on launchpad under the name "akasha"

Python is the only language I've seen so far that isn't completely hostile to unenhanced speech recognition. I can't manipulate C, Java, or any other language with the same ease. I consider the whole C language family so ungodly hostile to speech recognition that it will take a huge interface layer to cross between the two.

I bet you're asking why. First, an overabundance of special characters with special spacing. I shouldn't have to do that. The environment should know enough about what I'm saying to put things in the right place. Second, jumble-cap, misspelled words used for symbols. Again, why should I have to spell that? I should really say the nearest English equivalent and have the tool translate. These two features alone will significantly drop the vocal load of programming by voice. They will reduce the cognitive load of trying to remember how to generate that symbol. Done right, you will be able to edit a misrecognition in the middle of a misspelled word, possibly even before you inject it into your code. By using the default code and simple style, code generation will be easier on so many levels.
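
One way to picture "say the nearest English equivalent and the tool translates" is a lookup from a spoken phrase to the closest symbol already present in the code. The Python sketch below is only an assumption about how that could be done: it uses the standard library's `difflib.get_close_matches` for the fuzzy match, the helper names (`spoken_form`, `build_index`, `resolve`) are hypothetical, and the symbol list is invented for the example.

```python
import difflib
import re

WORD = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def spoken_form(symbol):
    """Lowercase words of a snake_case or camelCase symbol, space separated."""
    return " ".join(w.lower() for w in WORD.findall(symbol))


def build_index(symbols):
    """Map the spoken form of each known symbol back to the symbol itself."""
    return {spoken_form(s): s for s in symbols}


def resolve(phrase, index, cutoff=0.6):
    """Return the symbol whose spoken form is closest to phrase, or None."""
    matches = difflib.get_close_matches(
        phrase.lower(), list(index), n=1, cutoff=cutoff
    )
    return index[matches[0]] if matches else None


# Invented symbols with the kind of abbreviations that are hard to dictate.
index = build_index(["getUsrNm", "set_usr_nm", "total_cnt"])

print(resolve("get user name", index))
print(resolve("total count", index))
print(resolve("open file", index))
```

A real tool would need something better than character similarity when several symbols sound alike, for example showing the top few candidates and letting the speaker pick one, but even this simple form removes the need to spell an abbreviated name aloud.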

I could say more but I will spare you. :-)
Anyway, you don't have to type code to contribute.  I
would like to hear more about your models.  I want to put together
an e-mail list to discuss programming by voice, and the direction we
should take in implementing and improving the tools we need.  Your
input is welcome!  Would it be better to host that e-mail list in
vinux land, or in land?  Regardless, I would like to work in
Vinux to enable programming by voice at some basic level, and then I'd
like to get lots of voice coders on board to make it better.

Models later when I have more time. Probably this weekend coming up. Like I said, there is already a list but, I think I would choose the vinux world as being more culturally/philosophically on board with what we are trying to do regarding accessibility approaches.

I'm out of time for today. I'll try to get back to the rest of this later.
