Re: [Accessibility] Can you help write a free version of HTK?
Eric S. Johansson
Mon, 12 Jul 2010 13:03:59 -0400
On 7/12/2010 4:24 AM, Bill Cox wrote:
Hi, Eric. You make some good points below.
And I'm glad we are both in agreement on the chunks of this problem. I'll pick
away at responding to these questions because they all deserve careful answers.
On Fri, Jul 9, 2010 at 2:06 PM, Eric S. Johansson<address@hidden> wrote:
The consensus of this august body was that all of the speech recognition
toolkits out there (Julius, HTK, Sphinx) were designed to keep graduate
students busy but not designed for use in the real world. I did take a look
at Simon; it looks like it's the closest of the bunch, but I estimate it is
somewhere between 5 and 8 years away from being useful (i.e. on parity with
I agree that these tools are grad-student research oriented. I also
agree that a major rewrite may be required to compete with Naturally
Speaking. However, I disagree that we have to be competitive with
Naturally Speaking to be productive programming by voice. I used
Dragon Dictate in 1996, and later Naturally Speaking. I found that
Dragon Dictate was just about the same as Naturally Speaking in terms
of productivity for writing code. The main problem was that Naturally
Speaking would make me pause between commands, like Dragon Dictate,
and only did continuous recognition for dictating text.
I apologize, but I'm probably going to go on at length about issues in
programming by voice. I will harp on not damaging a person's throat, and after
that on vocabulary and name structure.
My rationale for requiring the equivalent to NaturallySpeaking is twofold. First
is programmatic control over the environment and second is vocal load.
NaturallySpeaking has far better control over its environment through some
toolkit like dragonfly than DragonDictate ever did. I believe that's because of
the environments (Windows versus DOS). Having used DragonDictate, I found it
incredibly wearing on my throat. I had to go to a throat specialist for strain
caused by speech recognition and then had to lay off the product for a few weeks
while the muscles recovered. This is not an uncommon story. Almost every
single person on the voice coder mailing list has had this kind of event happen.
Discrete utterance speech recognition really does nasty things to your throat, as
does non-natural sentence construction in commands with NaturallySpeaking. You
find yourself wanting to say more, but you have to tighten up your throat and stop
because the grammar doesn't match the way your brain wants to speak.
I know this is random but remember it. One of the things Susan Cragin did in
OSSRI was she managed to get the dougout recognizer relicensed. I think it's
under an MIT/X11 license. I'll CC her on this message (hey Susan, sign up and
say hi to everybody)
In any case, there's one thing to remember about programming by voice. You need
the same vocabulary size for coding as you do for writing. There are two reasons
for that, which I will let you think about and will address later.
Vocola, Unimacro, and dragonfly all do continuous recognition macros. With raw
natlink, you can even do it in Python.
It's a hard problem, I agree.
Simon's approach can reduce the active vocabulary at any time to
perhaps a couple of hundred words or less, apparently enabling high
accuracy. If we could have continuous command recognition, we could
easily beat my old productivity. I read that there's a newer tool
called vocola, which enables continuous command recognition with
> Perhaps I'm somewhat wishful in my
thinking, but no matter how I do the calculation, I estimate we have
many times more potential volunteers than such a project will require.
I think the main trick is finding advisors who do have the extensive
knowledge about how to make good recognition engines, and effectively
organising volunteers. I think you probably would agree that
Naturally Speaking is not the only good recognition engine ever sold.
There should be experts around from failed or abandoned efforts who
could help as advisors. Give me one or two of those guys, a dozen
motivated volunteer voice coders, and three years, and I think we
could get there.
I agree that there are many good recognition engines out there. They all fail in
different ways but they can work. One of the huge challenges we will face is
navigating the patent land mines of other people's technology. I think this is
one of the reasons why Nuance has been on an acquisition binge. I believe they're
trying to buy up as many patents as they can to protect themselves against any
market intrusion. One way to defend ourselves would be through our own license
acquisition. For example, look at what Cornell did with video codecs. The
patent license terms said that if it was used in an open-source project, then
there was no charge and no risk of being sued by Cornell for infringement. If we
could build a similar patent licensing portfolio from other players, that might
help us take advantage of pre-existing work versus reinventing the wheel to get
around the patent.
I'm using Google's cloud-computing gmail service to write this e-mail.
I typically review them with a closed-source binary TTS called voxin.
I've been contacted by Skype twice today, and I've watched a couple
flash videos. I think we are in violent agreement on this point.
People with disabilities need solutions, not a philosophy.
Wow. I have Google accounts but I rarely use them and certainly not for anything
important. :-) But yes, people with disabilities do need solutions first. We
need to, as Christian philosophy says, teach them how to fish. At the same time,
we, as Buddhist philosophy teaches us, need a right livelihood.
Let's look at where we are. In the early 1990's a tiny company wrote
Dragon Dictate, using the signal processing hardware in the sound card
to make speech recognition on PCs useful for the first time. Their
market was exclusively people with physical impairments. I discovered
them in 1996, when I needed them to remain a programmer. There may
have been some new code written by the community to get around the
crap we get from Nuance, but it seems that the tools they ship haven't
improved programming by voice significantly in well over a decade.
Instead, they focus on helping us write emails faster. How nice.
Look at where the real innovation in this area is coming from. Is it
from Nuance, or the user community? For future innovation, where
should we look?
I remember when I was first injured and a friend set me up with a 486 with 16 MB of
RAM in a lunchbox case that I would carry from customer site to customer site using
a two-wheeled luggage cart. I thought it was so wonderful when I got my first
laptop that weighed 10 pounds.
Yes, I don't think NaturallySpeaking has really improved since version 6. It's a
little more accurate, more stable, and doesn't make Windows puke quite as often, but
I think all they have done since version 6, maybe 7, is fix bugs. However, I
will suggest that writing e-mails faster is not a bad thing. I write fiction for
a hobby and if I can improve accuracy, I can write more because editing sucks.
When you look at a piece of rough text and try to change it, you really see the
lack of the inventive or creative effort necessary to make editing easier. Because I
don't use speech recognition enabled editors, I can't say something like "select
a sentence containing "brilliance of her smile" and have a sentence placed into
a dictation box for editing. And yes, I deliberately used an odd number of quote
marks because, why do you need to" was on the end of the line in a command mode.
Also, its insistence on using the Windows selection mechanism (drag with mouse)
makes it difficult to select a small number of words if your hands are like
mine. You really want something I can Emacs Mark in point so that you can use a
tablet or even a mouse and say "leave market" and "end region". Yes, I left the
previous sentence uncorrected just because was too much work to drive the mouse.
I believe innovation comes from people like us. Back in the bad old days of
Dragon Systems, disabled users would be brought in occasionally to experiment
with different interfaces or talk about their experience with the product. I
would make some radical changes if I had sufficient hands to write the UI. For
example, I would make a dictation box with filters on both the input and output so
you could modify code to look like English text, thereby enabling familiar
editing patterns in a dictation box. And on output, I would retranslate the
text back into code. But I also want plug-ins on the dictation box to make it
possible to edit other things.
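A minimal sketch of the input and output filters described above, assuming the simplest possible translation: identifiers are spelled out as plain words on the way in, and the same words are mapped back to the original identifiers on the way out. The function names `code_to_english` and `english_to_code` are hypothetical, and nothing here talks to a real dictation box or to NaturallySpeaking.

```python
import re


def code_to_english(line):
    """Input filter: rewrite identifiers as plain lowercase words.

    Returns the speakable text plus a table remembering which identifier
    each phrase came from, so the output filter can undo the change.
    """
    symbols = {}

    def speak(match):
        name = match.group(0)
        # Put a space at camelCase boundaries, then treat "_" as a space.
        words = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", name).replace("_", " ")
        phrase = " ".join(words.lower().split())
        # Assumption: no two identifiers collapse to the same phrase.
        # If they do, the last one seen wins.
        symbols[phrase] = name
        return phrase

    english = re.sub(r"[A-Za-z_][A-Za-z0-9_]*", speak, line)
    return english, symbols


def english_to_code(text, symbols):
    """Output filter: turn known phrases back into their identifiers."""
    if not symbols:
        return text
    # Try longer phrases first so "max retry count" beats "count".
    phrases = sorted(symbols, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(p) for p in phrases) + r")\b"
    return re.sub(pattern, lambda m: symbols[m.group(0)], text)


english, symbols = code_to_english("total = maxRetryCount + retry_delay")
print(english)
# The user edits the English text by voice (here "+" becomes "*").
print(english_to_code("total = max retry count * retry delay", symbols))
```

Keeping the phrase table beside the text is what makes the round trip possible: the edited text only has to contain phrases the input filter produced, and everything else passes through untouched.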
A great example of where this editor can help is in HTML e-mail. I need to
generate and receive HTML e-mails when dealing with customers. Yeah, it
sucks but it's reality. Thunderbird's editor is a stinking pile of bird poop when
editing by hand and even worse by voice. Using a dictation box model as I
described above, one could translate HTML or HTML fragments into something one
could edit by voice. We could do this without needing to touch the application.
> I also bought every
> microphone that seemed promising at improving recognition rates. By
> the way, what do people feel is the best microphone nowadays?
There is no one best microphone. We do not have sufficient information to
determine which microphone works best with a voice and a computer system with a
sound card. You buy microphones until you find one that works best and then you
stick with it religiously. I think I said elsewhere, VXI is the only one that
works with my voice. As soon as circumstances permit, I'm going to try and get
the current Bluetooth headset. The previous one was the most wonderful headset
available but unfortunately, the battery charging system, Bluetooth pairing, and
I did not get along real well. Something was funky and I had to re-pair every
time I charged, which was twice a day. Serious nuisance.
I do a ton of volunteer work for Vinux, which is Linux based on
Ubuntu, customised for the needs of the visually impaired. People
often post emails saying, "Today I'm switching my main machine to
Vinux!" I generally suggest that dual-booting, or having Vinux on a
virtual machine is the way to go. Vinux is not as productive an
environment as either Windows with JAWS or Mac for the blind, at least
not yet. However, we aim to be better than either. To get there as
rapidly as possible, I would like volunteers to continue using what
works best for them. Except Sina. He should switch to 100% Vinux
That's really cool. Unfortunately, I'm not in a position to do a whole lot of
volunteering. I need to take care of fundamentals first.
> I agree. You get flamed badly if you suggest people could be more productive
> with proprietary tools. Frankly, it's a bit scary discussing this on
> a gnu.org list.
Heh. The way I would manage that particular problem would be to develop
self-contained components that can be GPL'ed to death, and others with more generous
intentions could work on the bridge.
However, FOSS seems to be the only way that we can organise many
volunteers from around the globe to work together to write and improve
accessibility tools. This isn't about ideology or politics or
freedom. It's about people like us who are fed up with being second
class citizens, and tired of begging for access to new technology.
This is about programmers like us taking control over the future of
accessibility, because we're not going to get what we need otherwise.
In my snarky frame of mind, any collection of thoughts unified by a single
purpose is an ideology. It's okay, because I think you hit the crip ideology on
the head. Handicap accessibility is too important to be owned. We should not
put up with being second-class citizens and we should own the means of
production. Unfortunately, there is a difference between accessibility tools
(speech recognition, text-to-speech, etc.) and the ability to use that
accessibility tool with an application or system. I haven't quite figured out a
shorthand yet, but something like accessibility tools versus accessibility
availability is close. We need tools and we need access to other platforms that
employers and governments use.
Why not do both in parallel? There are so many of us, yet each of us
has unique gifts and skills. Most of us should do as you suggest, and
work at the application level to improve accessibility. I think some
of us should become SR and TTS experts and work on the next
generation. Actually, if I didn't have to work so hard with glue and
tape to make Vinux work, SR and TTS is the sort of thing I'd probably
do well at.
You are far more optimistic than I am. My experience trying to get Emacs updated
and dtach modified for crip use has not been successful at attracting help, even
though they are far more useful on day one than a new speech recognizer.
As for a pool of experts, we can try mining the OSSRI board of directors for
possible candidates. That's something we'll have to talk to Susan about.
When I do simple estimates, I just can't see how we don't have enough
potential volunteers to do this. I just can't believe that 99.9% of
us with RSI injuries or visual impairments are the sort of people to
sit on our butts and do nothing. From what I've seen, a fair
percentage of us happen to be decent programmers, and are the sort
that refuse to believe we have limitations.
I can, unfortunately. Because programming by voice has been so difficult, and because of the
hostility of employers to anyone using something like speech recognition in an open
office plan, many programmers, including myself, have left the field. Some
migrated to completely different fields such as bicycle design, and others, like
myself, have become self-employed as it's the only way to insulate oneself from
corporate stupidity and the egregious workloads that injured us in the first place.
Perhaps I have a strong voice, but I spoke non-stop to my computer for
10 hours a day for over three years, and found that all I had to do
was sip water constantly. I programmed by voice using macros,
eventually writing over 1,600 of them, mostly to control emacs. I
think it was the best way to continue my career, without giving into
my typing limitations.
You are a very different person than I am. I was able to program in Python using
Emacs with fewer than 50 macros. I could not remember 1600 of them. Something
about RSI and its treatment messes with your memory. Most developers I've known
would not be able to remember 1600 macros as well as the entire body of code
they are working with. When I have written code, I have changed how I write
classes as a way of accommodating my memory deficits. I also tried to write a
small number of macros that were easy on the voice. As I said before, many
developers suffer vocal strain at a far lower level of effort than you have put
yourself through. Memory shortcomings are something else we will need to
accommodate. I think this is the driving force behind the methods I've developed
for exploring a speech interface. I can't remember what I'm supposed to say next
so the system should prompt me and give me the ability to navigate within that
prompt. The great example is change directory. It's a delightful intellectual
exercise as well as a demonstration of the flexibility of a discoverable speech
I am very interested in ideas like you suggest for enabling
applications without modifications, and doing anything that reduces
vocal and cognitive load. We need new ideas, and I agree with your
point about not needing another useless type-by-voice project. Part
of the problem is that many of these projects are funded by well
meaning institutions, but implemented by people interested in research
and their own careers. I think the code we write would be far better
focused on our own needs.
Okay, this is a conversation for when I have far more time, possibly with one
message per topic. It should pop up in the next week or so.
Sorry, but I have to ask: if you can dictate e-mail, why can't you
That's a real good question. I think the best answer is:
If it's too difficult to do, it's not worth doing until it's simple.
This is the classic programmer hubris, laziness, and arrogance all rolled into one.
It has actually been my design philosophy since even before I was injured. If it's hard
to do, you're doing something the wrong way. You don't understand the problem.
You don't even know you're an idiot. When you sit down and answer all of the
questions the back of your mind creates, which manifest as "I'm not comfortable
with this", only then should you start thinking about implementation.
Now, I did write Python byte code. I created a Web framework with a markup
language that accommodates disabled users. It will work with speech recognition
but it will also, theoretically, be accessible to blind, text-to-speech users.
It's simple; the current implementation is a bit of a pig, but I just wanted to
prove the concept of the usability of a markup language focused on disabled users.
It's on Launchpad under the name "akasha".
Python is the only language I've seen so far that isn't completely hostile to
unenhanced speech recognition. I can't manipulate C, Java, or any other
language with the same ease. I consider the whole C language family so
ungodly hostile to speech recognition that it would take a huge interface layer to
cross between the two.
I bet you're asking why. An overabundance of special characters with special
spacing. I shouldn't have to do that. The environment should know enough about
what I'm saying to put things in the right place. Then there are the jumble-cap,
misspelled words used for symbols. Again, why should I have to spell that? I should
really say the nearest English equivalent and the tool translates. These two features
alone will significantly drop the vocal load of programming by voice. They will
reduce the cognitive load of trying to remember how to generate that symbol.
Done right you will be able to edit a misrecognition in the middle of a
misspelled word, possibly even before you inject it into your code. By using the
default code and simple style, code generation will be easier on so many levels.
I could say more but I will spare you. :-)
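To make the "say the nearest English equivalent and the tool translates" idea concrete, here is a minimal sketch. It assumes the tool already has a list of symbols from the code being edited, and it treats each piece of a symbol as an abbreviation of one spoken word. The names `split_symbol`, `is_abbreviation`, and `resolve` are hypothetical, and the matching rule is only one of many that could be tried.

```python
import re


def split_symbol(symbol):
    """Break getUsrNm or get_usr_nm into ['get', 'usr', 'nm']."""
    spaced = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", "_", symbol)
    return [part.lower() for part in spaced.split("_") if part]


def is_abbreviation(short, word):
    """True if `short` starts like `word` and its letters appear in order."""
    if not short or short[0] != word[0]:
        return False
    letters = iter(word)
    # `ch in letters` consumes the iterator, so order is enforced.
    return all(ch in letters for ch in short)


def resolve(spoken, known_symbols):
    """Return every known symbol that the spoken phrase could stand for."""
    words = spoken.lower().split()
    matches = []
    for symbol in known_symbols:
        parts = split_symbol(symbol)
        if len(parts) == len(words) and all(
            is_abbreviation(part, word) for part, word in zip(parts, words)
        ):
            matches.append(symbol)
    return matches


# Hypothetical symbol table gathered from the file being edited.
known = ["getUsrNm", "set_usr_nm", "get_user", "bufLen"]
print(resolve("get user name", known))
print(resolve("buffer length", known))
```

When more than one symbol matches, the list could be shown as a prompt so the speaker picks by voice instead of spelling anything out. Runs of capitals such as acronyms are not split by this sketch and would need extra handling.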
Anyway, you don't have to type code to contribute. I
would like to hear more about your models. I want to put together
an e-mail list to discuss programming by voice, and the direction we
should take in implementing and improving the tools we need. Your
input is welcome! Would it be better to host that e-mail list in
vinux land, or in gnu.org land? Regardless, I would like to work in
Vinux to enable programming by voice at some basic level, and then I'd
like to get lots of voice coders on board to make it better.
Models later, when I have more time. Probably this coming weekend. Like I
said, there is already a list, but I think I would choose the Vinux world as
being more culturally and philosophically on board with what we are trying to do
regarding accessibility approaches.
I'm out of time for today. I'll try to get back to the rest of this later.