At the age of 45, Casey Harrell lost his voice to amyotrophic lateral sclerosis (ALS). Also called Lou Gehrig’s disease, the disorder eats away at muscle-controlling nerves in the brain and spinal cord. Symptoms begin with weakening muscles, uncontrollable twitching, and difficulty swallowing. Eventually patients lose control of muscles in the tongue, throat, and lips, robbing them of their ability to speak.
Unlike paralyzed patients, Harrell could still produce sounds that seasoned caretakers could understand, but his speech wasn’t intelligible enough for everyday conversation. Now, thanks to an AI-guided brain implant, he can once again “speak” using a computer-generated voice that sounds like his.
The system, developed by researchers at the University of California, Davis, has almost no detectable delay when translating his brain activity into coherent speech. Rather than producing a monotone synthesized voice, the system can detect intonation—for example, a question versus a statement—and emphasize individual words. It also translates brain activity for interjections that aren’t in any dictionary, such as “hmm” or “eww,” making the generated voice sound more natural.
“With instantaneous voice synthesis, neuroprosthesis users will be able to be more included in a conversation. For example, they can interrupt, and people are less likely to interrupt them accidentally,” said study author Sergey Stavisky in a press release.
The study comes hot on the heels of another AI method that decodes a paralyzed woman’s thoughts into speech within a second. Previous systems took nearly half a minute—more than long enough to disrupt normal conversation. Together, the two studies showcase the power of AI to decipher the brain’s electrical chatter and convert it into speech in real time.
In Harrell’s case, the training was completed in the comfort of his home. Although the system required some monitoring and tinkering, it paves the way for a commercially available product for those who have lost the ability to speak.
“This is the holy grail in speech BCIs [brain-computer interfaces],” Christian Herff at Maastricht University, who was not involved in the study, told Nature.
Listening In
Scientists have long sought to restore the ability to speak for those who have lost it, whether due to injury or disease.
One strategy is to tap into the brain’s electrical activity. When we prepare to say something, the brain directs muscles in the throat, tongue, and lips to form sounds and words. By listening in on its electrical chatter, it’s possible to decode intended speech. Algorithms stitch together neural data and generate words and sentences as either text or synthesized speech.
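In code, that pipeline boils down to two steps: summarize each electrode’s activity in short time bins, then hand the resulting feature matrix to a trained decoder. The minimal Python sketch below is purely illustrative; the bin width, electrode count, and the stub decoder are assumptions for the example, not details from any particular system.

```python
import numpy as np

def bin_spike_counts(spike_times_s, duration_s, bin_s=0.02):
    """Count spikes per electrode in fixed time bins -> (n_bins, n_electrodes) features."""
    n_bins = int(round(duration_s / bin_s))
    edges = np.linspace(0.0, duration_s, n_bins + 1)
    counts = [np.histogram(times, bins=edges)[0] for times in spike_times_s]
    return np.stack(counts, axis=1).astype(np.float32)

def decode_text(features):
    """Stand-in for a trained sequence model that maps neural features to words."""
    return "<decoded sentence>"

# Example: 256 electrodes and 2 seconds of fake spike times.
rng = np.random.default_rng(0)
spikes = [np.sort(rng.uniform(0, 2.0, size=rng.integers(10, 80))) for _ in range(256)]
features = bin_spike_counts(spikes, duration_s=2.0)
text = decode_text(features)     # in a real system, a synthesizer would then
print(features.shape, text)      # read the decoded text aloud
```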
The process may sound straightforward. But it took scientists years to identify the most reliable brain regions from which to collect speech-related activity. Even then, the lag time from thought to output—whether text or synthesized speech—has been long enough to make conversation awkward.
Then there are the nuances. Speech isn’t just about producing audible sentences. How you say something also matters. Intonation tells us if the speaker is asking a question, stating their needs, joking, or being sarcastic. Emphasis on individual words highlights the speaker’s mindset and intent. These aspects are especially important for tonal languages—such as Chinese—where a change in tone or pitch for the same “word” can have wildly different meanings. (“Ma,” for example, can mean mom, numb, horse, or cursing, depending on the intonation.)
Talk to Me
Harrell is part of the BrainGate2 clinical trial, a long-standing project seeking to restore lost abilities using brain implants. He enrolled in the trial as his ALS symptoms progressed. Although he could still vocalize, his speech was hard to understand and required expert listeners from his care team to translate. This was his primary mode of communication. He also had to learn to speak slower to make his residual speech more intelligible.
Five years ago, Harrell had four microelectrode arrays, each with 64 electrodes, implanted into the left precentral gyrus of his brain—a region that controls multiple functions, including coordinating speech.
“We are recording from the part of the brain that’s trying to send these commands to the muscles. And we are basically listening into that, and we’re translating those patterns of brain activity into a phoneme—like a syllable or the unit of speech—and then the words they’re trying to say,” said Stavisky at the time.
In just two training sessions, Harrell gained access to a 125,000-word vocabulary—large enough for everyday use. The system translated his neural activity into text, which was read aloud by a voice synthesizer that mimicked his own voice. After more training, the implant reached 97.5 percent accuracy as he went about his daily life.
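The phoneme-then-word strategy Stavisky describes can be pictured as a simple lookup. In the toy Python sketch below, everything is a placeholder: the two-entry lexicon stands in for the real 125,000-word vocabulary, and the hard-coded “decoder” stands in for a trained model.

```python
# Tiny phoneme "lexicon"; the real vocabulary holds 125,000 words.
LEXICON = {
    ("HH", "AH", "L", "OW"): "hello",
    ("W", "ER", "L", "D"): "world",
}

def decode_phonemes(neural_features):
    """Stand-in for a trained model that emits one phoneme sequence per attempted word."""
    return [("HH", "AH", "L", "OW"), ("W", "ER", "L", "D")]

def phonemes_to_words(phoneme_groups):
    """Look each decoded phoneme sequence up in the vocabulary."""
    return " ".join(LEXICON.get(group, "<unknown>") for group in phoneme_groups)

fake_features = [[0.0] * 256] * 100      # placeholder for binned neural activity
print(phonemes_to_words(decode_phonemes(fake_features)))   # -> "hello world"
```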
“The first time we tried the system, he cried with joy as the words he was trying to say correctly appeared on-screen. We all did,” said Stavisky.
In the new study, the team sought to make generated speech even more natural with less delay and more personality. One of the hardest parts of real-time voice synthesis is not knowing when and how the person is trying to speak—or their intended intonation. “I am fine” has vastly different meanings depending on tone.
The team captured Harrell’s brain activity as he attempted to speak a sentence shown on a screen. The electrical signals were filtered to remove noise, split into one-millisecond segments, and fed into a decoder. Like the Rosetta Stone, the algorithm mapped specific neural features to words and pitch, which were played back to Harrell through a voice synthesizer with just a 25-millisecond lag—roughly the time it takes a person to hear their own voice, the team wrote.
Rather than decoding phonemes or words, the AI captured Harrell’s intent to make sounds every 10 milliseconds, allowing him to eventually say words not in a dictionary, like “hmm” or “eww.” He could spell out words and respond to open-ended questions, telling the researchers that the synthetic voice made him “happy” and that it felt like “his real voice.”
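Put together, those details suggest a streaming loop: gather 1-millisecond bins of neural activity, decode a small chunk of sound every 10 milliseconds, and play it back almost immediately. The Python sketch below follows that outline, but the noise filter, the sine-wave “synthesizer,” and every number inside the stand-in decoder are invented for illustration and are not the study’s methods.

```python
import numpy as np

SAMPLE_RATE = 22_050   # audio output rate (an assumption, not from the study)
FRAME_MS = 10          # one decode step per 10 ms of neural data, per the article

def denoise(bins_1ms):
    """Toy noise reduction on a block of 1 ms feature bins (stand-in for real filtering)."""
    return bins_1ms - np.median(bins_1ms, axis=0, keepdims=True)

def decode_frame(frame):
    """Stand-in decoder: turn 10 ms of neural features into audio samples and a pitch.
    The real system runs a trained neural network; this fakes a tone whose pitch
    and loudness track overall activity."""
    energy = float(np.clip(frame.mean(), 0.0, 1.0))
    pitch_hz = 100.0 + 50.0 * energy
    t = np.arange(int(SAMPLE_RATE * FRAME_MS / 1000)) / SAMPLE_RATE
    return energy * np.sin(2 * np.pi * pitch_hz * t), pitch_hz

def stream_decode(neural_bins_1ms):
    """Consume (n_bins, n_electrodes) 1 ms bins and emit audio frame by frame.
    In a live system each frame is played back within tens of milliseconds."""
    audio = []
    for start in range(0, len(neural_bins_1ms) - FRAME_MS + 1, FRAME_MS):
        frame = denoise(neural_bins_1ms[start:start + FRAME_MS])
        samples, _pitch = decode_frame(frame)
        audio.append(samples)
    return np.concatenate(audio)

# One second of fake 1 ms bins from 256 electrodes -> roughly one second of audio.
fake_bins = np.abs(np.random.default_rng(1).normal(size=(1000, 256)))
print(stream_decode(fake_bins).shape)
```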
The team also recorded brain activity as Harrell attempted to speak the same set of sentences as either statements or questions, the latter having an increased pitch. All four electrode arrays recorded a neural fingerprint of activity patterns when the sentence was spoken as a question.
The system, once trained, could also detect emphasis. Harrell was asked to stress each word individually in the sentence, “I never said she stole my money,” which can have multiple meanings. His brain activity ramped up before saying the emphasized word, which the algorithm captured and used to guide the synthesized voice. In another test, the system picked up multiple pitches as he tried to sing different melodies.
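Once pitch is decoded frame by frame, spotting a question or a stressed word becomes a relatively simple signal-processing step. The sketch below runs on made-up pitch contours rather than neural recordings, and its thresholds are arbitrary; it only illustrates the kinds of cues the system looks for: a rise at the end of a sentence, or a peak on a single word.

```python
import numpy as np

def is_question(pitch_hz):
    """Flag a question by a pitch rise near the end of the utterance."""
    quarter = len(pitch_hz) // 4
    return pitch_hz[-quarter:].mean() > 1.1 * pitch_hz[:quarter].mean()

def most_emphasized(word_contours):
    """Pick the word whose decoded pitch runs highest, a crude proxy for stress."""
    return max(word_contours, key=lambda w: word_contours[w].mean())

# Fake decoded pitch contours (in Hz) for the sentence from the study.
words = ["I", "never", "said", "she", "stole", "my", "money"]
rng = np.random.default_rng(2)
contours = {w: 110 + rng.normal(0, 3, size=20) for w in words}
contours["never"] = contours["never"] + 25      # pretend "never" was stressed

utterance = np.concatenate([contours[w] for w in words])
print("question?", bool(is_question(utterance)))    # False: no final pitch rise
print("stressed word:", most_emphasized(contours))  # "never"
```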
Raise Your Voice
The AI isn’t perfect. Volunteers could understand the output roughly 60 percent of the time—a far cry from the near-perfect brain-to-text system Harrell currently uses. But the new AI brings individual personality to synthesized speech, which is usually monotone. Deciphering speech in real time also lets the person interrupt or object during a conversation, making the experience feel more natural.
“We don’t always use words to communicate what we want. We have interjections. We have other expressive vocalizations that are not in the vocabulary,” study author Maitreyee Wairagkar told Nature.
Because the AI is trained on sounds, not English vocabulary, it could be adapted to other languages, especially tonal ones like Chinese. The team is also looking to increase the system’s accuracy by placing more electrodes in people who have lost their speech due to stroke or neurodegenerative diseases.
“The results of this research provide hope for people who want to talk but can’t…This kind of technology could be transformative for people living with paralysis,” said study author David Brandman.