Voice is the most natural way for humans to communicate, but it is difficult for computers to understand. So we built the mouse and keyboard for computer input, and the display and printer for output.
Smart speakers promised a future in which we could interact with computers using only our voices. The Amazon Echo, released in 2014, was the first smart speaker; its virtual assistant, Alexa, could respond to voice commands.
Amazon Echo, first generation (📷: Amazon)
Amazon’s SVP for Devices and Services, David Limp, said the company’s long-term goal was to recreate the “Star Trek Computer,” an advanced program from the sci-fi franchise that could receive voice commands and perform complex actions.
However, unlike smartphones and personal computers before them, smart speakers failed to catch on. The voice assistants struggled with context and natural language understanding. In 2023, Microsoft CEO Satya Nadella said the current voice assistants, including the company’s Cortana, were “all dumb as a rock.”
Artificial intelligence has evolved since Alexa launched in 2014. Today’s large language models are much better at handling unstructured input and analyzing context. Speech-based human-computer interaction is also progressing, with leading voice models like Sesame AI, Whisper, and Azure AI Speech.
Amazon’s next-generation virtual assistant, Alexa+ (📷: Amazon)
Smart speakers are starting to leverage generative AI models. Google is testing Gemini on Nest speakers, and Amazon is rolling out Alexa+ to early access users.
Microsoft is taking a different approach: it wants to turn your personal computer into a virtual assistant.
Hey Copilot (📷: Microsoft)
Microsoft Copilot on Windows now allows you to start a voice conversation with the Copilot chatbot using a simple wake word: "Hey Copilot." Announced in a blog update last week, the feature will let you "stay in your flow when you need answers to a question or just need someone to bounce an idea off of."
When “Hey Copilot” is enabled, the app uses an “on-device wake word spotter” to detect the phrase and begin a Copilot Voice conversation. The feature is currently only available to Windows Insiders with their display language set to English.
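Conceptually, a wake-word spotter is a gate: nothing is passed to the assistant until the phrase is detected locally. The real spotter is a small model running on audio frames; the toy sketch below substitutes a hypothetical stream of transcribed text chunks just to illustrate the gating logic, including the case where the phrase spans chunk boundaries.

```python
# Toy illustration of wake-word gating: the assistant is only
# invoked after the phrase is spotted in the local stream.
# (Real spotters operate on raw audio, not text.)

WAKE_PHRASE = "hey copilot"

def spot_wake_word(chunks):
    """Scan a stream of transcribed chunks; return the index of the
    chunk in which the wake phrase completes, or -1 if never heard."""
    buffer = ""
    for i, chunk in enumerate(chunks):
        # Keep a rolling buffer so the phrase can span chunk boundaries.
        buffer = (buffer + " " + chunk.lower()).strip()
        buffer = buffer[-3 * len(WAKE_PHRASE):]  # bound memory use
        if WAKE_PHRASE in buffer:
            return i
    return -1

stream = ["turn down the", "music hey", "copilot what's the weather"]
print(spot_wake_word(stream))  # → 2 (phrase completes in the third chunk)
```

Running detection on-device like this, before any cloud round trip, is also what keeps the always-listening microphone privacy-friendly: audio that never contains the phrase never leaves the machine.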
Hackster member Jen Fox, principal program manager at Microsoft CoreAI, says the wake word is a “critical piece of conversational voice [AI] because it allows for hands-free invocation of voice mode, which means you can talk to your computer without having to stand at it.”
Jen Fox, principal PM at Microsoft and founder of FoxBot Industries
Fox is big on voice interaction and envisions “being freed from the desktop to engage with the physical world, the people and other creatures in it.” The wake word, according to her, will offer a useful hands-free experience for people who need to work on the go or retrieve information without interrupting their workflow.
It will also benefit people “who cannot use, or struggle with, existing input/output devices because conversational voice and voice invocation will make it much easier to use computers using only voice.”
“Hey Copilot” might remind you of the wake word for Windows’ previous digital assistant, Cortana. Copilot replaced Cortana in 2023, promising to be more context-aware and more useful overall.
Today’s AI assistants are no longer “dumb as a rock,” but they still fall short of actual intelligence. They can hold conversations and retrieve information quickly, but they perform much better with text than with actions. They are also prone to unpredictable outputs, or “hallucinations,” caused by gaps in their training data.
Overall, voice control is unlikely to replace the keyboard and mouse setup any time soon, and Fox agrees:
“We speak differently than we type, so if we’re writing a paper, we may start with a voice-based draft and use an AI assistant to do some editing, but it’s likely we’ll need to go in with a keyboard to really get our ideas fleshed out and polished.”
She adds, “It may take a generation once we have voice and gesture-based controls,” and it will be more intuitive for those who grow up with it. Keyboards and mice will still be needed, “even when voice can trigger more complex actions and handle end-to-end workflows.”
Just as the USS Enterprise had physical control panels and visual displays, there will always be a place for the traditional desktop setup.
Microsoft’s Azure AI Foundry offers more than 1,900 models from Azure OpenAI, DeepSeek, Microsoft, NVIDIA, and Meta, including text-to-speech and speech-to-text models, for testing and building generative AI applications.