Voice UIs: Science Fiction, No Longer

While the rest of the world, from science fiction movies to Bill Gates, has been bullish about voice as a computing interface, I’ve been actively hostile to the idea for many years.

Apple, Microsoft, and many others have tried to make speaking – the communication channel we learn the earliest and use the most – the means by which we interact with computers.

And yet, it has never seemed to come to pass.

Even Siri, OK Google, Cortana and other assistants, which are optimised for voice, painful to use by typing, and available on devices people routinely speak into, have seen only modest, albeit growing, use of voice.

But these days I’m much more interested in the intersection of voice and computing.

I’ve realised that it wasn’t voice per se that I was strongly negative about, but voice as a replacement for the typing, clicking and tapping we do in GUIs (and in the more direct input UIs we find on touchscreens these days).

My classic example of why such approaches to Human Computer Interaction are – frankly – so stupid, is this scene from Blade Runner.

It takes an impossibly young-looking Harrison Ford (was this really that long ago? I saw this at the cinema on initial release!) over two minutes to pan around an image and zoom in on one part, a process that with a mouse or phone or tablet touchscreen today would take a couple of seconds. It also requires him to translate his intuitions into computing terminology (coordinates in a Cartesian plane for the benefit of the computer).

Time and again, when I see voice interfaces to vision-based HCIs, I think of this scene. Which is science fiction!

But, increasingly, I think the intersection of voice and computing (I’m choosing this somewhat ambiguous term quite deliberately) is emerging as important.

Not voice as a deliberate mechanism for controlling computers, except in some limited cases such as home devices like the Echo or in-car activation of entertainment systems, but rather computers listening to our conversations and to our lives (either all the time, if we wish, or at more specific times). Based on a computed sense of our sentiment, our levels of stress, the content of our words and the rest of our current context (who are we with? where are we right now? where are we going? where have we been?), they can make our lives better.

It is, of course, an advertiser’s dream come true, and in countless science fiction futures like that of Minority Report we’ve encountered the idea of individually tailored advertisements visible only to the intended recipient.

We also see these ads shouting over one another as they compete for the protagonist’s attention.

But there is an opportunity here that isn’t science fiction.

Our devices can listen and use APIs to translate speech to text, to recognise who’s speaking, to extract sentiment and meaning from our words, to translate into other languages, and to whisper back into our ears in increasingly human-like tones.

What will come of it?

I’m completely in the dark. Just as Sinclair and Wozniak and Jobs and others were at the dawn of what became the personal computer.

But I have a feeling the opportunities are extraordinary.

Seize them.
