Interaction design for virtual assistants

Corbet Fawcett
Published in UX Collective · Jan 7, 2019


This is the year smart speakers boomed.

What’s a smart speaker? Wikipedia describes them as “a type of wireless speaker and voice command device with an integrated virtual assistant.” Think Amazon Alexa, Google Home (and Assistant), or Apple HomePod (and Siri).

According to TechCrunch, smart speaker ownership almost doubled in 2017, with 41% of US households now owning one of the big three devices.

I saw solid proof of this myself over the recent Christmas holidays; smart speakers were hands-down the most popular gift in my family. No fewer than three of us found them under the tree. Two took home Google Home Minis, and I took home an Alexa. The weeks since have been an eye-opening introduction to what’s rapidly becoming mainstream technology.

The changing face of the interface

Speakers have existed for ages. What’s notable about smart speakers is their virtual assistants and how we interact with them.

Most of us use a range of personal devices, and our relationship with them has always been one of human to tool. Our phones, tablets and computers act as direct interfaces between us and the world, making them an extension of “I.” I shop online. I search Wikipedia. I text friends. But when it comes to smart speakers that changes. We interact with them by speaking to their virtual assistants (Alexa or Google or Siri) and having those assistants do things for us.

These assistants change the nature of design in fascinating ways. They have gender. They have names. They refer to themselves as “I”. They even have a touch of personality. You converse with them to achieve a goal. Alexa orders things online for you. Google looks up Wikipedia articles for you. Siri sends that message. And what does that mean? It means that there’s suddenly a human to (almost) human relationship in play, and all of a sudden some social and conversational rules apply to interface design. How cool is that?

There are also some specific usability considerations that come with a conversational interface. These aren’t unique to voice interactions, but they’re definitely exacerbated in a vocal exchange.

Let’s look at what I mean. Based on a couple of weeks of playing with Alexa and Google, here are a few lessons and resulting best practices I’ve seen for virtual assistant interaction design.

Talk like a human

Imagine this: You’re interacting with a stranger. They say “thanks!” What do you say? Probably “you’re welcome” or “no problem” or something like that. That’s just polite, right?

Conversely, imagine you’re working on a project with a partner or coworker or parent — someone you’re comfortable with. You say “hand me the stapler.” With a stranger that would probably seem curt or rude, but the same rules don’t apply in the easier interactions of friendship, family, or other familiars. “Hand me that” works, and in all likelihood you don’t expect much response (other than the stapler), because you’re operating on the skeleton of a conversation. In fact it’s not really a conversation; it’s more of a transaction. Minimal conversation is okay between familiar people; even silence is comfortable.

Human conversations vary. The same basic communication can take different forms depending on the relationship of the people talking. They can be wordier and more formal between strangers, or more concise between friends. Some people add polite extras in all their interactions; others use them only selectively.

Where things get awkward is when these get mixed. Imagine this interaction between strangers on a street corner:

“Excuse me, could you tell me where to find the nearest Starbucks please?”

“Over there.”

Sounds almost rude, right? But try the same basic interaction with a different starting prompt:

“Hey, is there a Starbucks here?”

“Over there.”

Now there’s a better match between question and answer. It might not be a pleasant conversation but it’s a reasonably well-matched interaction and so it doesn’t seem nearly as rude.

How is this relevant to voice interactions with a virtual assistant like Alexa? Simply put, it’s good practice to recognize that people will interact with a virtual assistant using a range of conversational approaches. While there’s generally a command involved, there’s also a conversational wrapper. It may be minimal or it may be more wordy. Some people will simply bark “Alexa, weather” while others will go with something more like “Alexa, what’s the weather like?” or “Alexa, what’s it like out?”

Obviously when designing a new skill (a set of abilities) for a virtual assistant you need to explore all the different ways people are likely to ask for help. But good interaction design needs to consider the responses too.

Going back to our Starbucks example we land on this idea: People are most comfortable when they get a response that matches the tone and formality of their request. As designers this means it’s good practice to try to match the ‘chattiness’ and conversational approach of the user. If they’re terse and transactional, provide responses that get right to the point. If they’re a bit wordier and use pleasantries like “please” and “thank you,” give them responses that mirror this tone. This will create a more natural, human-feeling interaction.

Alexa recognizes “thank you” and responds occasionally with “you’re welcome.” I’m pretty sure I’ve heard Google respond with “no problem.” These responses are there when they make sense, absent when they don’t. This is mirroring, and it creates a more comfortable and natural interaction between human and assistant.

In essence, when writing interaction scripts mirror the user’s speech style. Support utterances (user prompts) that cover a range of possibilities, from terse and command-focused to wordy and polite. For a bit of pleasantness add responses for pleasantries like “please” and “thank you.” In short: Talk like a human.
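
To make that concrete, here’s a minimal sketch of tone mirroring in plain Python. It isn’t tied to any real assistant SDK; the marker list, function names, and response strings are all illustrative assumptions:

```python
# A sketch only: detect pleasantries in the user's utterance and pick
# a response in a matching register. Markers and wording are assumptions.
POLITE_MARKERS = ("please", "thank you", "thanks", "could you", "would you")

def is_polite(utterance: str) -> bool:
    """Detect pleasantries that suggest a wordier, more polite register."""
    text = utterance.lower()
    return any(marker in text for marker in POLITE_MARKERS)

def weather_response(utterance: str, forecast: str) -> str:
    """Mirror the user's register: terse request, terse answer."""
    if is_polite(utterance):
        return f"Sure! Here's the forecast: {forecast}"
    return forecast  # "Alexa, weather" gets straight to the point
```

With this shape, “weather” gets back just the forecast, while a request wrapped in a “could you… please?” gets the friendlier framing.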

Skip the welcome

During my first week of learning to use Alexa I added new skills daily. Skills are essentially apps that you can add to Alexa to expand the repertoire of things she can do. They’re little packages of add-on abilities. Some I kept, some I didn’t. The one consistent thing in those that didn’t make the cut? They insisted on welcoming me.

For example, I was super stoked to connect Alexa with my Roomba so that I could dispatch the little beast to tidy up with a quick voice command. That’s living in the future! But this is what I heard every time I triggered Alexa’s new skill:

Welcome to iRobot Home for wifi-connected Roomba robot vacuums. I can help you start, pause, and end a cleaning job. Just say start cleaning or end cleaning. What would you like to do?

Every. Single. Time.

Why is this a problem?

That welcome runs four sentences. The first is pure fluff. The next two are relevant the first time you use Alexa to connect to your Roomba, but after that you can pretty much dispense with everything except the last. That’s a lot of extraneous dialog.

Extraneous dialog is the audio equivalent of an app loading screen. It’s non-functional and there’s no way to skip it, which translates as a delay, an interruption before you can get on with your intended action. That’s a problem, but it’s only problem #1.

Problem #2 is that this sort of welcome breaks the relationship between the user (me) and their device (Alexa). Unlike our phones and tablets and laptops, assistants like Alexa and Google strive to seem almost human. They listen and talk to us like people. They have some personality. This creates a relationship very unlike those we have with other gadgets. Encountering a sudden welcome message breaks that carefully crafted illusion of humanity. Your little assistant/companion/friend suddenly speaks with someone else’s voice. It’s jarring.

If this served a purpose it might be forgivable, but see problem #1. That “welcome to iRobot Home…” message does nothing but reinforce another brand, and this is one place where third party branding isn’t really appropriate.

When designing a new skill for a virtual assistant it’s important to preserve the assistant’s identity and perceived humanity. It’s also important to keep dialog lean enough to keep interactions relatively quick.
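
One way to honour both rules is to gate the onboarding copy behind a first-use check. A sketch, again in plain Python, with an assumed attributes dict standing in for whatever persisted state your skill platform actually provides:

```python
def handle_launch(user_attributes: dict) -> str:
    """Full onboarding on first use; a short prompt ever after."""
    if not user_attributes.get("has_used_skill"):
        user_attributes["has_used_skill"] = True  # assumed to persist between sessions
        return ("I can help you start, pause, and end a cleaning job. "
                "Just say start cleaning or end cleaning. "
                "What would you like to do?")
    # Returning users get straight to business, in the assistant's own voice.
    return "What would you like to do?"
```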

Establish trust for the unseen

Voice interactions with technology are a new experience for many. With newness comes uncertainty. The first time I wanted Alexa to set a timer for 60 minutes, I ran smack into that uncertainty because I couldn’t see the results of my action. Would it work? Could I trust her to alert me in an hour?

The same was true the first time I wanted to set an alarm, add a reminder, etc, etc, etc… Basically, uncertainty (and a little nervousness) was endemic to each new interaction because I couldn’t see the results of my actions.

The unseen is hard to trust.

The equivalent of this in an app or on the web might be a button that doesn’t do anything when you click it. The user clicks, nothing happens, and they wonder “did that work?”

As designers we’ve landed on a common practice to forestall uncertainty: user feedback. We cue the user that their action was successful with any of a range of options. Maybe we change the copy on the button when the user clicks it. Maybe we change its colour. Maybe we show a success toast or a cute animation. There’s a huge range of ways to signal that an action succeeded, and in doing so remove uncertainty for the user.

Going back to voice, we don’t have the same visual options to signal success, but user feedback is still both critical and possible — we just need to use voice to provide it.

What got me to trust Alexa’s timers? I’d say “Alexa, set a timer for five minutes” and she’d respond with “Five minutes, starting now.” Four words that confirmed she’d heard me and repeated what she heard so I could be sure she’d got it right.

Alexa, remind me to start dinner at 6pm.

Okay, I’ll remind you tomorrow at 6pm.

But wait, there’s more. Adding user feedback to voice interactions doesn’t just remove uncertainty and help build trust in the assistant; it also helps with error correction. As good as they are, voice assistants aren’t always 100% accurate in interpreting what we say.

Alexa, play Ambient Chill.

Playing Chill Out Music from Spotify.

Oops, wrong playlist.

Designing interactions to reiterate the original request lets the user know if there’s been a mix-up, so they can correct it immediately. This makes inevitable goof-ups less annoying and less critical, which in turn helps build the human-assistant relationship.
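
The pattern is simple enough to sketch: once the request has been parsed, the response template echoes the interpreted values back instead of a bare “Okay.” In illustrative Python (the function names and slot values are mine, not any platform’s API):

```python
def confirm_timer(minutes: int) -> str:
    """Echo the interpreted duration so the user can verify it."""
    return f"{minutes} minutes, starting now."

def confirm_playback(matched_playlist: str, source: str) -> str:
    """Name what was actually matched, not just 'Okay', so a wrong
    match ('Ambient Chill' heard as 'Chill Out Music') is audible
    and immediately correctable."""
    return f"Playing {matched_playlist} from {source}."
```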

Be memorable

Ever had someone tell you their phone number and struggled to remember it long enough to write it down?

That’s your working memory in play, your short term information cache. Jakob Nielsen writes “short-term memory famously holds only about 7 chunks of information, and these fade from your brain in about 20 seconds.” It’s limited, and it’s a major consideration for voice interaction design.

Why?

In an app or on a website, information exists and persists on the screen until the user moves on. This means that if there’s a lot of information being shared the user doesn’t have to remember it all — it’s right there on the screen. They can focus on acting on the information rather than holding it in memory.

Voice interactions don’t have that. When a virtual assistant has a lot of information to impart, the user has to remember everything said long enough to act on it. For example, imagine you want to see a movie but need to know what’s playing and when. Looking this up on your phone you might get a fairly long list:

  • Aquaman
  • Bumblebee
  • Spider-Man: Into the Spider-Verse
  • Holmes & Watson
  • Escape Room
  • Vice
  • Bohemian Rhapsody
  • Fantastic Beasts: The Crimes of Grindelwald
  • Creed II
  • If Beale Street Could Talk
  • Vox Lux

If you ask Alexa or Google Assistant what’s playing they need to give you that same information, but you might find it challenging to remember all eleven options long enough to choose between them. What was the second option again? What came after “Vice”? Was it “Escape Tomb” or “Escape Room?” If you were skimming listings on your phone you could just scan backwards to recall the information; with a virtual assistant you can’t skim the last few words. You have to remember. And human memory is limited.

Working memory can be a pretty severe limitation in voice interactions.

Fortunately we have ways of designing for memory limitations, and the same methods we use in app or web design translate well to voice interaction design. The big one is information chunking.

Which of these numbers is easier to remember?

804023111479

8040 2311 1479

For most people the second, chunked number is easier to remember than the first because it’s been split into a few smaller numbers that don’t push the limits of working memory.

You can see this approach used in some of the built-in Alexa and Google Assistant skills. Ask them for movie times and they’ll list a few, then ask if you want to hear more. This lets you make choices based on smaller sets. Google takes it a step further — Google Assistant asks what you’re in the mood for and tries to narrow down the available options based on what interests you. Answer “science fiction” and you immediately narrow down the list from eleven to two (Spider-Man and Bumblebee).

Obviously not all voice interactions involve long lists or reams of information, but when they do we need to help users (and assistants) handle that influx of information.
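
For list-heavy responses, the chunking pattern is straightforward to sketch in plain Python. The chunk size of three and the prompt wording are assumptions, and a real dialog manager would pause for a yes/no between groups:

```python
from typing import Iterator, List

def chunk(items: List[str], size: int = 3) -> Iterator[List[str]]:
    """Split a long list into working-memory-sized groups."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def speak_listings(movies: List[str]) -> Iterator[str]:
    """Yield one spoken prompt per group instead of reading all
    eleven titles in a single breath."""
    groups = list(chunk(movies))
    for index, group in enumerate(groups):
        options = ", ".join(group)
        if index < len(groups) - 1:
            yield f"{options}. Would you like to hear more?"
        else:
            yield f"{options}. That's everything I found."
```

Run against the eleven listings above, this reads four short prompts of at most three titles each, and each “hear more?” pause gives the user a chance to choose before their working memory fills up.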

In a nutshell…

The best practices for designing voice interactions aren’t really different from best practices for other UX design, but there’s definitely more demand on memory, more difficulty in error correction, and a lot of uncertainty for new users. More attention needs to be given to these aspects than in your average web or app planning.

What’s more, virtual assistants create the illusion of humanity and in doing so create a very different relationship with users than other devices. It’s personal. They offer a different (and very human) type of interface — conversation. We have existing norms and expectations for conversation, and some of these now apply to virtual assistant voice interactions.

There’s been a huge surge in smart speaker and virtual assistant use in the past year. As the underlying technologies improve we’ll no doubt see those numbers climb even higher. As UX designers it’s worth exploring which principles we can borrow from other areas of UX design, and what’s new and unique to voice. It’s a brave new world.

Alexa, say goodnight.
