I thought about starting this article about speech-recognition technology
with a lament about my long-suffering wrists, which, thanks to my excessively
speedy -- and prodigious -- typing, have lately been in a state of
anguish. I debated sentences like "The latest voice software offers a
feasible solution for those suffering from carpal tunnel syndrome" or "I wrote
this entire article with Dragon NaturallySpeaking, and not once did my hand
touch the keyboard."
But I didn't write this entire article with speech-recognition
software -- though I did compose bits and pieces of it using my new Dragon
NaturallySpeaking program (a bonus prize to readers who can figure out which
sentences were voice-activated). That, in itself, is indicative of just how
far speech-recognition technology still has to go before we're all using it
as our everyday computer interface.
To be blunt, the four days I spent testing voice-recognition software programs added up to perhaps one of the most frustrating experiences I have ever had with a computer. I spent hours repeating myself over and over, talking in the peculiarly stilted manner that voice-recognition software forces upon its users, waiting for words to spill across the page and catch up with my own garbled speech. It's not a task for those low on patience -- I discovered quite quickly that Dragon NaturallySpeaking knows how to spell "God damn it"
but has a hard time with "fucking computer" -- and it takes great will not to
give up in exasperation and type instead. If I -- someone who has a vested interest in using this technology -- can't put up with the software, how good can it be?
Yet speech-recognition technology has dramatically improved, and
continues to do so. The programs available now -- Dragon
NaturallySpeaking Preferred, IBM ViaVoice and Lernout & Hauspie's Voice XPress Professional -- are vastly better than the software that was available just a few years ago. Further progress is inevitable; researchers like Professor Theodore Berger at the University of Southern California are already coming up with breakthroughs that promise astonishingly accurate results in the near future. And speech interfaces are, slowly but steadily, creeping into our lives.
The market for speech technology has more than doubled in the last
two years -- speech-recognition software sales in 1997, according to
research firm PC Data, hovered around the 200,000-unit mark; in 1999 so far, more than 522,000 units have been sold. So even if I didn't have the patience today to use speech technology to input this entire story, I have no doubt that someday I will. Experts predict that within the next five years (if not sooner), we will simply dictate phone numbers and addresses into our personal digital assistants instead of typing them in. We'll ask our house to turn on the light for us (no clapping necessary!) and casually tell our computer to download the front page of the New York Times and print it. We'll ask our cell phone to call the doctor; our car to tune the radio to NPR; our VCR to record the next "X-Files" episode. We won't need a keyboard, stylus, mouse or even our fingers -- just a mouth.
"Any device that human beings interact with has a potential for speech technologies -- speech in itself is a natural mechanism for people to
interact with, far more than keyboards and mice," explains Tom Morse, senior
director for telecom engineering at Lernout & Hauspie. "We see it not as a
way to replace other interfaces, but to augment them."
Speech-recognition technology has been in development since the 1970s, but
only in the last two years has the software become truly viable for everyday
consumers. Chris Carrigg, a speech-recognition expert and director of
business development for the speech training company Say I Can, explains: "Up
until two years ago you ... had ... to ... talk ... like ... this." Speech-recognition software, says Carrigg, used "discrete speech models" which could only
parse one word at a time. "Dragon NaturallySpeaking was the first to come
out with a natural speech program. Before that, it was so tedious to use
that the only people interested in it were disabled users who had to use it."
With the advent of continuous speech recognition -- which began appearing in
commercial products about two years ago -- software has now learned to
recognize natural talking patterns, allowing users to dictate in their normal
voice. Lernout & Hauspie's Voice XPress software, for example, uses a
statistical mapping model with language matching and word pairing to gauge
whether words fit together -- essentially playing a guessing game with
unidentifiable words to determine whether they fit into the sentence you're dictating.
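That "word pairing" guessing game is, at heart, a statistical model of which words tend to follow which. A minimal sketch of the idea -- my own illustration in Python, not Lernout & Hauspie's actual code -- counts word pairs in training text, then uses those counts to break ties between acoustically similar candidates:

```python
from collections import defaultdict

def train_bigrams(corpus):
    """Count how often each word follows another in the training text."""
    counts = defaultdict(lambda: defaultdict(int))
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, cur in zip(words, words[1:]):
            counts[prev][cur] += 1
    return counts

def best_candidate(counts, prev_word, candidates):
    """Pick the candidate word that most often follows prev_word."""
    follows = counts[prev_word.lower()]
    return max(candidates, key=lambda w: follows[w.lower()])

corpus = [
    "recognize speech with this software",
    "the software can recognize speech",
    "a nice beach near the bay",
]
bigrams = train_bigrams(corpus)

# The recognizer can't tell "speech" from "beach" by sound alone;
# the word-pair statistics break the tie using the previous word.
print(best_candidate(bigrams, "recognize", ["speech", "beach"]))  # speech
print(best_candidate(bigrams, "nice", ["speech", "beach"]))       # beach
```

Commercial recognizers are of course far more elaborate, but the underlying bet is the same: an unidentifiable sound is resolved by asking which word would plausibly appear in that spot.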
The early adopters of speech-recognition software were, not surprisingly,
those suffering from hand injuries or otherwise incapable of typing -- journalists with repetitive stress injuries, for example. Doctors, lawyers and others in dictation-intensive professions picked it up next: Radiologists who needed to dictate notes into a recorder while peering through a microscope would instead talk into a speech-recognition device that plugged into the computer, and lawyers used the software to transcribe their endless
legal documents. The software companies have been catering to these niche
markets with products that boast legal or medical vocabularies.
Speech-recognition software, however, isn't yet making a major splash with
everyday computer users; instead, it's still a niche product that is being
used by those who have a pressing need. It isn't that the products are
expensive; most start at $59 for a basic version. In all probability, many potential customers are intimidated by the awkwardness of a new interface and the time commitment involved in making it work. And as I said earlier, it's
still far from perfect software: I spent time practicing with two speech-recognition products, Dragon and Voice XPress, and was both impressed and frustrated by the experience.
Using speech-recognition software is a two-way street: Not only must you
learn how to use the software; the software has to learn how to use you.
Explains David Nahamoo, director of research for human language technologies
at IBM, "First, you need to become familiar with the conversational
interface -- being able to actually talk to a system and understand what it
takes to interact with a machine through speech. Secondly, the machine has to
become used to and customize itself to the way that you ... are using it."
The actual process of training these two products (and almost all speech-recognition software products) is quite similar -- you'll spend roughly a
half-hour setting up your computer system and headset and measuring
microphone and voice levels before moving into a training period. To train
the software, you read documents aloud (in my case, snippets from "Alice in
Wonderland") for anywhere from five minutes to half an hour, while the
software learns to recognize your voice -- a process called "enrollment."
(With some products, you can also upload documents that contain your typical
vocabulary, so that the software gets a sense of your writing style.) Then
you can start dictating documents.
Nahamoo idealistically estimates that a good software program will optimize
itself -- or as he puts it, "hit a plateau" -- within two to three hours of usage. The idea is that the more you use the software, the more it will understand your voice patterns, and the better it will perform. Sure enough, after using the software for several days, I saw a definite improvement -- although that was after four days, not two to three hours.
All of these products boast accuracy of 90 percent on up; but getting to that optimal recognition is a tricky, painful process -- in fact, there are entire books dedicated to explaining how to use the software correctly. Yes, these products can quite accurately transcribe your words, but only after you've mastered the ins and outs of proper dictation, specific commands and the oddities of voice-activated computer controls.
This can be a major time commitment, as I learned; and even when the software is operating at its optimum performance levels, it will still get one out of every 10 words (or so) wrong. I used the Dragon software for four days, and it was an error-ridden process even after endless hours of corrections and careful dictations. For every sentence that I breezily dictated, I had to spend another minute or so attempting to delete the one mistake.
To correct an
error midway through a sentence, for example, you use a string of commands:
"Select error," "Scratch that," "Delete previous character," "Move to end
of sentence." With each of these commands, there's also a chance that
the software will mis-hear you and accidentally transcribe the command --
"motorcycle penance" instead of "move to end of sentence" -- into your
sentence, necessitating yet another string of corrections.
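Those "90 percent" accuracy claims are usually word-level measurements against a reference transcript. A minimal sketch -- my own illustration, not any vendor's published method -- of the standard edit-distance word-error-rate calculation shows why one wrong word in 10 adds up:

```python
def word_error_rate(reference, hypothesis):
    """Word error rate: edit distance (substitutions, insertions,
    deletions) between transcripts, divided by reference length."""
    ref, hyp = reference.lower().split(), hypothesis.lower().split()
    # Classic dynamic-programming edit distance, computed over words.
    dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dist[i][0] = i
    for j in range(len(hyp) + 1):
        dist[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # deletion
                             dist[i][j - 1] + 1,         # insertion
                             dist[i - 1][j - 1] + cost)  # substitution
    return dist[len(ref)][len(hyp)] / len(ref)

ref = "move to end of sentence and delete the previous word"
hyp_good = "move to end of sentence and delete the previous word"
# One botched word out of ten is the "90 percent accurate" experience:
hyp_bad = "motorcycle to end of sentence and delete the previous word"
print(word_error_rate(ref, hyp_good))  # 0.0
print(word_error_rate(ref, hyp_bad))   # 0.1
```

A 10 percent error rate sounds small until you realize it means stopping to issue correction commands every sentence or two.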
In addition, even on my zippy new Pentium machine, there was a lag of a few seconds while the software tried to interpret my words -- and for me, at least, it's much faster to just type.
(Of course, I'm an unusually fast typist; those who are less speedy might find that speech software is much quicker than the old "hunt and peck" method.)
There are countless other small frustrations. The Voice
XPress software, for example, seemed to be very sensitive about my
microphone and sound card drivers; although I got the software working on
one PC, I had problems with installing it on two other PCs. Another niggling
annoyance: You can't eat and dictate at the same time. Sure, you won't get
grease on the keyboard, but the crunching from your Fritos is picked up by
the microphone and appears in your text as some rather mysterious words.
The software is supposed to automatically adjust its microphone levels to
your environment, and screen out meaningless white noise. But my Dragon software did pick up the background noise of my office: The loud banter in the next cube showed up in my documents as gibberish. (When I accidentally left the microphone on while I went out to lunch, I came back to discover a stream of garbled text waiting on the screen.) I also
learned early on that all of my office-mates can hear every word I say -- and
it's difficult to be a linguistic maestro (or to compose personal e-mails)
when you know everyone around you is listening. For that matter, I'm sure my
constant patter -- and swearing -- has been driving them nuts too.
Most important, as a journalist I found it's not easy to compose an article
orally -- it's a bizarre feeling to verbalize sentences rather than let
words fall from your fingertips. Writing becomes a tedious, yet thoughtful,
act; you must think the whole sentence out before you say it, and be precise
in your speech -- and proper enunciation is rare in this age of mumbling. If
you aren't careful, it'll be an awfully slow process: Just the last two
sentences alone cost me three minutes of "scratch that" and "select
that" and "move to end of sentence."
In fact, using speech-recognition software can stunt the creative writing
process -- you end up feeling like a computer program, thinking in short
phrases with your voice as the command line. The natural cadence of my
sentences instead came out stiff and dry; my complex thoughts were
interrupted by a constant need to correct the mistakes the program had made.
I felt like an automaton, not an author.
This is a problem the software creators have witnessed, too. "In the case
of creative writing, we are noticing some of the challenges -- that
challenge is really designing an interface for composing where it's as
natural as possible," says Nahamoo. As it is, he says, users must more
carefully think out what they want to compose before they verbalize it --
which isn't necessarily a natural way of speaking in our rushed age.
But regardless of my complaints, the software does have a big upside. It's a
blessing to not have to use your hands; and you can lean back in your chair
with your eyes closed while you compose (as long as you open your eyes every
few sentences to make sure that your dictations weren't botched). Overall,
it's far less stressful on your body; and it doesn't hurt your enunciation, either.
Best of all, you don't have to worry about the proper way to spell
"accommodate"; the software automatically spells it correctly for you. Despite my
impatience with the program, I eventually came up with a solution that
seemed to satisfy even my need for speed: using my voice to dictate, and my mouse to navigate and make corrections. It's not as zippy as typing, but it still saves my wrists.
Despite the current hurdles, speech technology will undoubtedly
improve. This is a technology that demands perfection -- and as long as
speech recognition achieves even 98 percent accuracy rather than 100, researchers will
continue to devise new and innovative solutions. "People stop using systems that could be helpful ... when they don't work just 10 percent of the time," says Theodore Berger, director of USC's Center for Neural Engineering, who is currently researching speech-recognition solutions. "If they are 90 percent correct, you've
still got to correct 10 percent of the characters. That's enough to make you stop."
Berger recently released data about a new neural-network-based system he
co-developed, which appears to boast better-than-human word recognition. By
using a system that mimics the brain's electrical processes -- varying times
between signals to allow temporal sensitivity to patterns -- Berger says
that he's enabled the software to more easily recognize the commonalities
between words as they are spoken by different people or in different
acoustic environments. Berger says his studies show that on a basis of just two to three reference points, the system can discern words in white-noise levels that are up to 560 times the strength of the actual word signal.
Berger trained his system to recognize only a paltry 12 words, but in tests his
software's recognition "beat the pants off Dragon," he says. "Dragon isn't
particularly bad; it's just that with these speech-recognition dictation
systems, as soon as the background gets slightly noisy they fail."
With such a limited vocabulary, it's not likely you'll be seeing Berger's work in a Dragon-like PC product any time soon, but many groups -- including car manufacturers, electronics companies and the Defense Department's Advanced Research Projects Agency, which helped fund his research -- are already eyeing the technology for commercial products.
IBM's Nahamoo expects that "by the year 2002, more than 30 percent of the
workforce will be using a speech technology during their daily work." And dictating texts is only one of the possible uses for this technology -- you can already spy basic speech-recognition technology on the other end of your telephone, for example. Speech recognition is used by the WildFire voicemail system, which assists users in the navigation of convoluted menus by voice commands, and local telephone company directory services, which speed up their processes by taking search requests via automated answering services.
The possibilities are endless, says Morse of Lernout & Hauspie. "The
automobile -- that's where there's a lot of excitement for speech right
now," he explains. "The amount of time we spend in cars now is so much
greater than even 10 years ago ... so car manufacturers are trying to
provide you with more things to do in the car, entertainment-based or stuff
to allow you to work while you're driving home from work."
Morse envisions a car that can download e-mail, read it to you and let you dictate and send
replies, all while navigating rush-hour traffic. In fact, some luxury
automobiles are already starting to incorporate this technology to allow
you to control your stereo or heater by voice -- in the form of Microsoft's AutoPC program.
Nahamoo is equally excited about the potential for personal digital assistants:
"When the computer elements we deal with are much more mobile,
conversational interfaces will play a very critical and important role --
especially if we have little devices where there's no real estate for
keyboard and keypads." Forget mastering that clunky Palm Graffiti handwriting-recognition software -- just tell your PDA what to do. IBM is currently spearheading a group called VoiceTIMES that hopes to create standard specifications for handheld mobile devices, so that speech-recognition systems will work across various portable platforms.
Meanwhile, there are the typical technical hurdles: We need hardware advances
like stronger, cheaper chips with more processing power and bigger memory
caches, to make the technology not only fast but affordable. We need more svelte
software, with better accuracy and more intuitive understanding of human
language. And, as mothers and teachers around the world will probably
rejoice to hear, we humans also need to learn to speak more clearly and
accurately, so that our machines can understand us.
A good pair of earplugs might be a useful investment, too -- if
everyone starts speaking to the machines that surround us, instead of
pushing buttons or flipping switches, imagine the cacophony of commands that
await us. We may save our hands, our wrists and our fingers, but what about our ears?